1
Sanchez-Pulido L, Ponting CP. Extending the Horizon of Homology Detection with Coevolution-based Structure Prediction. J Mol Biol 2021; 433:167106. [PMID: 34139218 PMCID: PMC8527833 DOI: 10.1016/j.jmb.2021.167106] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/28/2021] [Revised: 06/09/2021] [Accepted: 06/09/2021] [Indexed: 12/12/2022]
Abstract
Traditional sequence analysis algorithms fail to identify distant homologies when they lie beyond a detection horizon. In this review, we discuss how co-evolution-based contact and distance prediction methods are pushing back this homology detection horizon, thereby yielding new functional insights and experimentally testable hypotheses. Based on correlated substitutions, these methods divine three-dimensional constraints among amino acids in protein sequences that were previously devoid of all annotated domains and repeats. The new algorithms discern hidden structure in an otherwise featureless sequence landscape. Their revelatory impact promises to be as profound as the use, by archaeologists, of ground-penetrating radar to discern long-hidden, subterranean structures. As examples of this, we describe how triplicated structures reflecting longin domains in MON1A-like proteins, or UVR-like repeats in DISC1, emerge from their predicted contact and distance maps. These methods also help to resolve structures that do not conform to a "beads-on-a-string" model of protein domains. In one such example, we describe CFAP298 whose ubiquitin-like domain was previously challenging to perceive owing to a large sequence insertion within it. More generally, the new algorithms permit an easier appreciation of domain families and folds whose evolution involved structural insertion or rearrangement. As we exemplify with α1-antitrypsin, coevolution-based predicted contacts may also yield insights into protein dynamics and conformational change. This new combination of structure prediction (using innovative co-evolution based methods) and homology inference (using more traditional sequence analysis approaches) shows great promise for bringing into view a sea of evolutionary relationships that had hitherto lain far beyond the horizon of homology detection.
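The idea of "correlated substitutions" mentioned in this abstract can be illustrated with a minimal sketch that scores covariation between two columns of a multiple sequence alignment using mutual information. This is a simple classical measure chosen here for illustration only; it is not the contact or distance prediction method discussed in the review, and the function name `column_mutual_information` and the toy alignment are assumptions.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi


# Toy alignment (hypothetical): columns 0 and 1 always change together,
# column 2 is invariant.
toy_msa = ["AKL", "AKL", "DEL", "DEL"]
print(column_mutual_information(toy_msa, 0, 1))  # covarying pair
print(column_mutual_information(toy_msa, 0, 2))  # invariant column
```

In this toy case the covarying pair scores 1 bit and the invariant column scores 0. Real alignments would need corrections for phylogenetic bias and small sample size, which this sketch omits.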
Affiliation(s)
- Luis Sanchez-Pulido
- Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK.
- Chris P Ponting
- Medical Research Council Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK.
2
Alvarez-Carreño C, Coello G, Arciniega M. FiRES: A computational method for the de novo identification of internal structure similarity in proteins. Proteins 2020; 88:1169-1179. [PMID: 32112578 DOI: 10.1002/prot.25886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2019] [Revised: 11/12/2019] [Accepted: 02/24/2020] [Indexed: 11/08/2022]
Abstract
Internal structure similarity in proteins can be observed at the domain and subdomain levels. From an evolutionary perspective, structurally similar elements may arise divergently by gene duplication and fusion events but may also be the product of convergent evolution under physicochemical constraints. The characterization of proteins that contain repeated structural elements has implications for many fields of protein science including protein domain evolution, structure classification, structure prediction, and protein engineering. FiRES (Find Repeated Elements in Structure) is an algorithm that relies on a topology-independent structure alignment method to identify repeating elements in protein structure. FiRES was tested against two hand curated databases of protein repeats: MALIDUP, for very divergent duplicated domains; and RepeatsDB for short tandem repeats. The performance of FiRES was compared to that of lalign, RADAR, HHrepID, CE-symm, ReUPred, and Swelfe. FiRES was the method that most accurately detected proteins either with duplicated domains (accuracy = 0.86) or with multiple repeated units (accuracy = 0.92). FiRES is a new methodology for the discovery of proteins containing structurally similar elements. The FiRES web server is publicly available at http://fires.ifc.unam.mx. The scripts, results, and benchmarks from this study can be downloaded from https://github.com/Claualvarez/fires.
Affiliation(s)
- Claudia Alvarez-Carreño
- Department of Bioquímica y Biología Estructural, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, Mexico; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia, USA
- Gerardo Coello
- Unidad de Cómputo, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Marcelino Arciniega
- Department of Bioquímica y Biología Estructural, Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, Mexico
3
Shafee TMA, Robinson AJ, van der Weerden N, Anderson MA. Structural homology guided alignment of cysteine rich proteins. SpringerPlus 2016; 5:27. [PMID: 26788439 PMCID: PMC4709342 DOI: 10.1186/s40064-015-1609-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 06/02/2015] [Accepted: 12/13/2015] [Indexed: 11/14/2022]
Abstract
Background: Cysteine rich protein families are notoriously difficult to align due to low sequence identity and frequent insertions and deletions. Results: Here we present an alignment method that ensures homologous cysteines align by assigning a unique 10 amino acid barcode to those identified as structurally homologous by the DALI webserver. The free inter-cysteine regions of the barcoded sequences can then be aligned using any standard algorithm. Finally, the barcodes are replaced with the original columns to yield an alignment which requires the minimum of manual refinement. Conclusions: Using structural homology information to constrain sequence alignments allows the alignment of highly divergent, repetitive sequences that are poorly dealt with by existing algorithms. Tools are provided to perform this method online using the CysBar web-tool (http://CysBar.science.latrobe.edu.au) and offline (python script available from http://github.com/ts404/CysBar). Electronic supplementary material: The online version of this article (doi:10.1186/s40064-015-1609-z) contains supplementary material, which is available to authorized users.
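The barcoding step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CysBar script itself: the function names, the barcode strings, and the way the aligner is simulated are all hypothetical, and the sketch assumes the aligner leaves each barcode intact (no gaps inside it).

```python
def insert_barcodes(seq, cys_positions, barcodes):
    """Replace the cysteine at each 0-based position with its barcode."""
    lookup = dict(zip(cys_positions, barcodes))
    out = []
    for pos, residue in enumerate(seq):
        if pos in lookup:
            if residue != "C":
                raise ValueError(f"position {pos} holds {residue!r}, not C")
            out.append(lookup[pos])
        else:
            out.append(residue)
    return "".join(out)


def restore_cysteines(aligned_seq, barcodes):
    """Collapse each intact barcode in an aligned sequence back to 'C'."""
    for barcode in barcodes:
        aligned_seq = aligned_seq.replace(barcode, "C")
    return aligned_seq


# Hypothetical 10-residue barcodes, one per set of homologous cysteines.
BARCODES = ["WWWWWWWWWW", "HHHHHHHHHH"]
tagged = insert_barcodes("ACDCE", [1, 3], BARCODES)
# Stand-in for an external aligner that adds one gap between the barcodes.
aligned = tagged.replace("D", "-D")
print(restore_cysteines(aligned, BARCODES))
```

Because every sequence carries the same barcode at structurally homologous cysteines, a standard aligner is strongly pushed to stack those barcodes in the same columns, which is the constraint the abstract describes.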
Affiliation(s)
- Thomas M A Shafee
- Department of Biochemistry, La Trobe Institute of Molecular Sciences, La Trobe University, Melbourne, 3086 Australia
- Andrew J Robinson
- College of Science, Health and Engineering, La Trobe University, Melbourne, 3086 Australia; Life Sciences Computation Centre, Victorian Life Sciences Computation Initiative, Melbourne, 3053 Australia
- Nicole van der Weerden
- Department of Biochemistry, La Trobe Institute of Molecular Sciences, La Trobe University, Melbourne, 3086 Australia
- Marilyn A Anderson
- Department of Biochemistry, La Trobe Institute of Molecular Sciences, La Trobe University, Melbourne, 3086 Australia
4
Sasidharan R, Nepusz T, Swarbreck D, Huala E, Paccanaro A. GFam: a platform for automatic annotation of gene families. Nucleic Acids Res 2012; 40:e152. [PMID: 22790981 PMCID: PMC3479161 DOI: 10.1093/nar/gks631] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022] Open
Abstract
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and propagate functional annotation seamlessly across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources, followed by a sequence-based connected component analysis of un-annotated sequence regions, to derive a consensus domain architecture for each sequence and subsequently generate families based on common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
Affiliation(s)
- Rajkumar Sasidharan
- Department of Molecular, Cell and Developmental Biology, University of California at Los Angeles, Los Angeles, CA 90095, USA.
5
Abstract
Circular permutation (CP) in a protein can be considered as if its sequence were circularized followed by a creation of termini at a new location. Since the first observation of CP in 1979, a substantial number of studies have concluded that circular permutants (CPs) usually retain native structures and functions, sometimes with increased stability or functional diversity. Although this interesting property has made CP useful in much protein engineering and folding research, large-scale collections of CP-related information were not available until this study. Here we describe CPDB, the first CP DataBase. The organizational principle of CPDB is a hierarchical categorization in which pairs of circular permutants are grouped into CP clusters, which are further grouped into folds and in turn classes. Additions to CPDB include a useful set of tools and resources for the identification, characterization, comparison and visualization of CP. In addition, several viable CP site prediction methods are implemented and assessed in CPDB. This database can be useful in protein folding and evolution studies, in the discovery of novel protein structural and functional relationships, and in facilitating the production of new CPs of unique biotechnological or industrial interest. The CPDB database can be accessed at http://sarst.life.nthu.edu.tw/cpdb
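The definition of circular permutation given in this abstract (circularize the sequence, then open new termini elsewhere) can be illustrated with a short sketch. It works on exact strings only and is a toy under that assumption; it is not how CPDB or any structure-based tool detects permutants, and both function names are hypothetical.

```python
def circular_permutant(seq, new_start):
    """Join the termini of seq and reopen the chain at index new_start."""
    return seq[new_start:] + seq[:new_start]


def find_permutation_site(a, b):
    """Return the index in a where b begins if b is an exact circular
    permutation of a, otherwise None."""
    if len(a) != len(b):
        return None
    # Every rotation of a appears as a substring of a concatenated with itself.
    idx = (a + a).find(b)
    return idx if idx != -1 else None


original = "ABCDEF"
permuted = circular_permutant(original, 2)
print(permuted, find_permutation_site(original, permuted))
```

Doubling the sequence is a common trick for this kind of search; natural circular permutants have diverged in sequence, so practical detection needs alignment or structural comparison rather than exact matching.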
Affiliation(s)
- Wei-Cheng Lo
- Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu 30013, Taiwan
6
Self containment, a property of modular RNA structures, distinguishes microRNAs. PLoS Comput Biol 2008; 4:e1000150. [PMID: 18725951 PMCID: PMC2517099 DOI: 10.1371/journal.pcbi.1000150] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 12/21/2007] [Accepted: 07/08/2008] [Indexed: 11/19/2022] Open
Abstract
RNA molecules will tend to adopt a folded conformation through the pairing of bases on a single strand; the resulting so-called secondary structure is critical to the function of many types of RNA. The secondary structure of a particular substring of functional RNA may depend on its surrounding sequence. Yet, some RNAs such as microRNAs retain their specific structures during biogenesis, which involves extraction of the substructure from a larger structural context, while other functional RNAs may be composed of a fusion of independent substructures. Such observations raise the question of whether particular functional RNA substructures may be selected for invariance of secondary structure to their surrounding nucleotide context. We define the property of self containment to be the tendency for an RNA sequence to robustly adopt the same optimal secondary structure regardless of whether it exists in isolation or is a substring of a longer sequence of arbitrary nucleotide content. We measured degree of self containment using a scoring method we call the self-containment index and found that miRNA stem loops exhibit high self containment, consistent with the requirement for structural invariance imposed by the miRNA biogenesis pathway, while most other structured RNAs do not. Further analysis revealed a trend toward higher self containment among clustered and conserved miRNAs, suggesting that high self containment may be a characteristic of novel miRNAs acquiring new genomic contexts. We found that miRNAs display significantly enhanced self containment compared to other functional RNAs, but we also found a trend toward natural selection for self containment in most functional RNA classes. We suggest that self containment arises out of selection for robustness against perturbations, invariance during biogenesis, and modular composition of structural function. Analysis of self containment will be important for both annotation and design of functional RNAs. 
A Python implementation and Web interface to calculate the self-containment index are available at http://kim.bio.upenn.edu/software/. An RNA molecule is made up of a linear sequence of nucleotides, which form pairwise interactions that define its folded three-dimensional structure; the particular structure largely depends on the specific sequence. These base-pairing interactions are stabilizing, and the RNA will tend to fold in a particular way to maximize stability. Consider some nucleotide sequence that optimally folds into some structure in isolation; if this sequence is now embedded inside a larger sequence, then either the original structure will be a robust subcomponent of the larger folded structure, or it will be disrupted due to new interactions between the original sequence and the surrounding sequence. We explore this property of context robustness of structure and in particular define the property of “self containment” to describe intrinsic context robustness—i.e., the tendency for certain sequences to be structurally robust in many different sequence contexts. Self containment turns out to be a strong characteristic of a class of RNAs called microRNAs, whose biogenesis process depends on the maintenance of structural robustness. This finding will be useful in future efforts to characterize novel miRNAs, as well as in understanding the regulation and evolution of noncoding functional RNAs as modular units.
7
Lo WC, Lyu PC. CPSARST: an efficient circular permutation search tool applied to the detection of novel protein structural relationships. Genome Biol 2008; 9:R11. [PMID: 18201387 PMCID: PMC2395249 DOI: 10.1186/gb-2008-9-1-r11] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Received: 09/11/2007] [Revised: 11/19/2007] [Accepted: 01/18/2008] [Indexed: 12/04/2022] Open
Abstract
Circular permutation of a protein can be visualized as if the original amino- and carboxyl termini were linked and new ones created elsewhere. It has been well documented that circular permutants usually retain native structures and biological functions. Here we report CPSARST (Circular Permutation Search Aided by Ramachandran Sequential Transformation), an efficient database search tool. In this post-genomics era, when the amount of protein structural data is increasing exponentially, it provides a new way to rapidly detect novel relationships among proteins.
Affiliation(s)
- Wei-Cheng Lo
- Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Hsinchu 30013, Taiwan
8
Russell RB. Classification of protein folds. Mol Biotechnol 2007; 36:238-47. [PMID: 17873410 DOI: 10.1007/s12033-007-0032-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/30/1999] [Revised: 11/30/1999] [Accepted: 11/30/1999] [Indexed: 11/26/2022]
Abstract
The diversity and complexity of bioinformatics tools currently available for protein sequence analysis can make it difficult to know where to begin when presented with a new sequence. In this article, we present a protocol outlining one approach to sequence analysis that should give as comprehensive a picture as possible as to the likely structure and function of a protein given the limits of available tools. We also provide worked examples showing how these tools can have an impact on the understanding of protein function prior to experimental studies.
Affiliation(s)
- Robert B Russell
- Structural Bioinformatics, EMBL, Meyerhofstrasse 1, Heidelberg, Germany.
9
Binkowski TA, DasGupta B, Liang J. Order independent structural alignment of circularly permuted proteins. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2004:2781-4. [PMID: 17270854 DOI: 10.1109/iembs.2004.1403795] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
Circular permutation connects the N and C termini of a protein and concurrently cleaves elsewhere in the chain, providing an important mechanism for generating novel protein folds and functions. However, the prevalence of circular permutations in genomes is unknown because current detection methods can miss many occurrences and can mistake random repeats for circular permutations. Here we develop a method for detecting circularly permuted proteins from structural comparison. Sequence order independent alignment of protein structures can be regarded as a special case of the maximum-weight independent set problem, which is known to be computationally hard. We develop an efficient approximation algorithm by repeatedly solving relaxations of an appropriate intermediate integer programming formulation, and we show that the approximation ratio is much better than the theoretical worst case ratio of r=1/4. Circularly permuted proteins reported in the literature can be identified rapidly with our method, while they escape detection by publicly available servers for structural alignment.
10
Haydel SE, Clark-Curtiss JE. The Mycobacterium tuberculosis TrcR response regulator represses transcription of the intracellularly expressed Rv1057 gene, encoding a seven-bladed beta-propeller. J Bacteriol 2006; 188:150-9. [PMID: 16352831 PMCID: PMC1317589 DOI: 10.1128/jb.188.1.150-159.2006] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 11/20/2022] Open
Abstract
The Mycobacterium tuberculosis TrcR response regulator binds and regulates its own promoter via an AT-rich sequence. Sequences within this AT-rich region determined to be important for TrcR binding were used to search the M. tuberculosis H37Rv genome to identify additional related TrcR binding sites. A similar AT-rich sequence was identified within the intergenic region located upstream of the Rv1057 gene. In the present work, we demonstrate that TrcR binds to a 69-bp AT-rich sequence within the Rv1057 intergenic region and generates specific contacts on the same side of the DNA helix. An M. tuberculosis trcRS deletion mutant, designated STS10, was constructed and used to determine that TrcR functions as a repressor of Rv1057 expression. Additionally, identification of the Rv1057 transcriptional start site suggests that a SigE-regulated promoter also mediates control of Rv1057 expression. Using selective capture of transcribed sequences (SCOTS) analysis as an evaluation of intracellular expression, Rv1057 was shown to be expressed during early M. tuberculosis growth in human macrophages, and the Rv1057 expression profile correlated with a gene that would be repressed by TrcR. Based on structural predictions, motif analyses, and molecular modeling, Rv1057 consists of a series of antiparallel beta-strands which adopt a beta-propeller fold, and it was determined to be the only seven-bladed beta-propeller encoded in the M. tuberculosis genome. These results provide evidence of TrcR response regulator repression of the Rv1057 beta-propeller gene that is expressed during growth of M. tuberculosis within human macrophages.
Affiliation(s)
- Shelley E Haydel
- Center for Infectious Diseases and Vaccinology, The Biodesign Institute, School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA.
11
Phlippen N, Hoffmann K, Fischer R, Wolf K, Zimmermann M. The glutathione synthetase of Schizosaccharomyces pombe is synthesized as a homodimer but retains full activity when present as a heterotetramer. J Biol Chem 2003; 278:40152-61. [PMID: 12734194 DOI: 10.1074/jbc.m303102200] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 01/23/2023] Open
Abstract
Glutathione synthetase was overexpressed as a histidine-tagged protein in Schizosaccharomyces pombe and purified by two-step affinity chromatography. The recovered enzyme occurred in two different forms: a homodimeric protein consisting of two identical 56-kDa subunits and a heterotetrameric protein composed of two 32-kDa and two 24-kDa subfragments. Both forms are encoded by the GSH2 gene. The 56-kDa protein corresponds to the complete GSH2 open reading frame, while the subfragments are produced following the cleavage of this larger protein by a metalloprotease. A stable homodimer was obtained by site-directed mutagenesis to remove the protease cleavage site, and this showed normal activity. A structural model of the fission yeast glutathione synthetase was produced, based on the x-ray coordinates of the human enzyme. According to this model the interacting domains of the proteolytic subfragments are strongly entangled. The subfragments were therefore coexpressed as independent proteins. These subfragments assembled correctly to yield functional heterotetramers with equivalent activity to the wild type enzyme. Furthermore, a permuted version of the protein was created. This also showed normal levels of glutathione synthetase activity. These data provide novel insight into the mechanisms of protein folding and the structure and evolution of the glutathione synthetase family.
Affiliation(s)
- Nadine Phlippen
- Institute of Biology IV (Microbiology and Genetics), Aachen University, Worringer Weg, D-52056 Aachen, Germany
12
Copley RR, Ponting CP, Schultz J, Bork P. Sequence analysis of multidomain proteins: past perspectives and future directions. ADVANCES IN PROTEIN CHEMISTRY 2003; 61:75-98. [PMID: 12461821 DOI: 10.1016/s0065-3233(02)61002-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
13
Gorbalenya AE, Pringle FM, Zeddam JL, Luke BT, Cameron CE, Kalmakoff J, Hanzlik TN, Gordon KHJ, Ward VK. The palm subdomain-based active site is internally permuted in viral RNA-dependent RNA polymerases of an ancient lineage. J Mol Biol 2002; 324:47-62. [PMID: 12421558 PMCID: PMC7127740 DOI: 10.1016/s0022-2836(02)01033-1] [Citation(s) in RCA: 188] [Impact Index Per Article: 8.5] [Indexed: 12/19/2022]
Abstract
Template-dependent polynucleotide synthesis is catalyzed by enzymes whose core component includes a ubiquitous αβ palm subdomain comprising A, B and C sequence motifs crucial for catalysis. Due to its unique, universal conservation in all RNA viruses, the palm subdomain of RNA-dependent RNA polymerases (RdRps) is widely used for evolutionary and taxonomic inferences. We report here the results of elaborated computer-assisted analysis of newly sequenced replicases from Thosea asigna virus (TaV) and the closely related Euprosterna elaeasa virus (EeV), insect-specific ssRNA+ viruses, which revise a capsid-based classification of these viruses with tetraviruses, an Alphavirus-like family. The replicases of TaV and EeV do not have characteristic methyltransferase and helicase domains, and include a putative RdRp with a unique C-A-B motif arrangement in the palm subdomain that is also found in two dsRNA birnaviruses. This circular motif rearrangement is a result of migration of approximately 22 amino acid (aa) residues encompassing motif C between two internal positions, separated by approximately 110 aa, in a conserved region of approximately 550 aa. Protein modeling shows that the canonical palm subdomain architecture of poliovirus (ssRNA+) RdRp could accommodate the identified sequence permutation through changes in backbone connectivity of the major structural elements in three loop regions underlying the active site. This permutation transforms the ferredoxin-like β1αAβ2β3αBβ4 fold of the palm subdomain into the β2β3β1αAαBβ4 structure and brings β-strands carrying two principal catalytic Asp residues into sequential proximity such that unique structural properties and, ultimately, unique functionality of the permuted RdRps may result.
The permuted enzymes show unprecedented interclass sequence conservation between RdRps of true ssRNA+ and dsRNA viruses and form a minor, deeply separated cluster in the RdRp tree, implying that other, as yet unidentified, viruses may employ this type of RdRp. The structural diversification of the palm subdomain might be a major event in the evolution of template-dependent polynucleotide polymerases in the RNA-protein world.
Key Words
- rna viruses
- rna polymerases
- evolution
- protein permutation
- ancient palm subdomain
- aa, amino acid
- cd, conserved domain
- eev, euprosterna elaeasa virus
- ibdv, infectious bursal disease virus
- ipnvj, infectious pancreatic necrosis virus strain jasper
- pv, poliovirus
- tav, thosea asigna virus
- dsrna, double-stranded rna
- ssrna+, positive-stranded rna
- rdrp, rna-dependent rna polymerase
- hmm, hidden markov model
- orf, open reading frames
- nt, nucleotide
- tdpp, template-dependent polynucleotide polymerase
Affiliation(s)
- Alexander E Gorbalenya
- Advanced Biomedical Computing Center, Science Applications International Corporation/National Cancer Institute, P.O. Box B, Frederick, MD 21702-1201, USA.
14
Bonneau R, Strauss CEM, Rohl CA, Chivian D, Bradley P, Malmström L, Robertson T, Baker D. De novo prediction of three-dimensional structures for major protein families. J Mol Biol 2002; 322:65-78. [PMID: 12215415 DOI: 10.1016/s0022-2836(02)00698-8] [Citation(s) in RCA: 178] [Impact Index Per Article: 8.1] [Indexed: 11/20/2022]
Abstract
We use the Rosetta de novo structure prediction method to produce three-dimensional structure models for all Pfam-A sequence families with average length under 150 residues and no link to any protein of known structure. To estimate the reliability of the predictions, the method was calibrated on 131 proteins of known structure. For approximately 60% of the proteins one of the top five models was correctly predicted for 50 or more residues, and for approximately 35%, the correct SCOP superfamily was identified in a structure-based search of the Protein Data Bank using one of the models. This performance is consistent with results from the fourth critical assessment of structure prediction (CASP4). Correct and incorrect predictions could be partially distinguished using a confidence function based on a combination of simulation convergence, protein length and the similarity of a given structure prediction to known protein structures. While the limited accuracy and reliability of the method preclude definitive conclusions, the Pfam models provide the only tertiary structure information available for the 12% of publicly available sequences represented by these large protein families.
Affiliation(s)
- Richard Bonneau
- Department of Biochemistry, University of Washington, Seattle, WA 98195-7350, USA
15
George RA, Heringa J. Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins 2002; 48:672-81. [PMID: 12211035 DOI: 10.1002/prot.10175] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.1] [Indexed: 11/05/2022]
Abstract
Protein sequences containing more than one structural domain are problematic when used in homology searches where they can either stop an iterative database search prematurely or cause an explosion of a search to common domains. We describe a method, DOMAINATION, that infers domains and their boundaries in a query sequence from local gapped alignments generated using PSI-BLAST. Through a new technique to recognize domain insertions and permutations, DOMAINATION submits delineated domains as successive database queries in further iterative steps. Assessed over a set of 452 multidomain proteins, the method predicts structural domain boundaries with an overall accuracy of 50% and improves finding distant homologies by 14% compared with PSI-BLAST. DOMAINATION is available as a web based tool at http://mathbio.nimr.mrc.ac.uk, and the source code is available from the authors upon request.
Affiliation(s)
- Richard A George
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, United Kingdom
16
Abstract
An important question in protein evolution is to what extent proteins may have undergone swaps (switches of domain or fragment order) during evolution. Such events might have occurred in several forms: swaps of short fragments, swaps of structural and functional motifs, or recombination of domains in multidomain proteins. This question is important for the theoretical understanding of the evolution of proteins, and has practical implications for using swaps as a design tool in protein engineering. In order to analyze the question systematically, we conducted a large scale survey of possible swaps and permutations among all pairs of proteins from the Swissprot database. A swap is defined as a specific kind of sequence mutation between two proteins in which two fragments that appear in both sequences have different relative order in the two sequences. For example, aXbYc and dYeXf are defined as a swap, where X and Y represent sequence fragments that switched their order. Identifying such swaps is difficult using standard sequence comparison packages. One of the main problems in the analysis stems from the fact that many sequences contain repeats, which may be identified as false-positive swaps. We have used two different approaches to detect pairs of proteins with swaps. The first approach is based on the predefined list of domains in Pfam. We identified all the proteins that share at least two domains and analyzed their relative order, looking for pairs in which the order of these domains was switched. We designed an algorithm to distinguish between real swaps and duplications. In the second approach, we used Blast to detect pairs of proteins that share several fragments. Then, we used an automatic procedure to select pairs that are likely to contain swaps. Those pairs were analyzed visually, using a graphical tool, to eliminate duplications. 
Combining these approaches, about 140 different cases of swaps in the Swissprot database were found (after eliminating multiple pairs within the same family). Some of the cases have been described in the literature, but many are novel examples. Although each new example identified may be interesting to analyze, our main conclusion is that cases of swaps are rare in protein evolution. This observation is at odds with the common view that proteins are very modular to the point that modules (e.g., domains) can be shuffled between proteins with minimal constraints. Our study suggests that sequential constraints, i.e., the relative order between domains, are highly conserved.
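The swap definition in this abstract (aXbYc versus dYeXf) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function name `find_swaps` and the rule of ignoring any domain that occurs more than once in either protein are assumptions made here, as a crude stand-in for the abstract's separation of real swaps from duplications.

```python
from itertools import combinations


def find_swaps(domains_a, domains_b):
    """Return domain pairs (X, Y) ordered X..Y in protein A but Y..X in protein B.

    Each argument is a list of domain names in N-to-C-terminal order.
    """
    # Assumption: keep only domains that occur exactly once in each protein,
    # so that repeated domains cannot be mistaken for swaps.
    unique = [d for d in domains_a
              if domains_a.count(d) == 1 and domains_b.count(d) == 1]
    position_in_b = {d: domains_b.index(d) for d in unique}

    swaps = []
    # combinations() preserves list order, so x always precedes y in protein A.
    for x, y in combinations(unique, 2):
        if position_in_b[x] > position_in_b[y]:
            swaps.append((x, y))
    return swaps


# The example from the abstract: aXbYc versus dYeXf.
print(find_swaps(["a", "X", "b", "Y", "c"], ["d", "Y", "e", "X", "f"]))
```

Discarding every repeated domain is deliberately conservative: it avoids false positives from duplications at the cost of missing swaps that involve a duplicated domain.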
Affiliation(s)
- Amit Fliess
- Faculty of Life Science, Bar-Ilan University, Ramat-Gan, Israel
|
17
|
Ponting CP, Russell RR. The natural history of protein domains. Annu Rev Biophys Biomol Struct 2002; 31:45-71. [PMID: 11988462 DOI: 10.1146/annurev.biophys.31.082901.134314] [Citation(s) in RCA: 193] [Impact Index Per Article: 8.8] [Indexed: 11/09/2022]
Abstract
Genome sequencing and structural genomics projects are providing new insights into the evolutionary history of protein domains. As methods for sequence and structure comparison improve, more distantly related domains are shown to be homologous. Thus there is a need for domain families to be classified within a hierarchy similar to Linnaeus' Systema Naturae, the classification of species. With such a hierarchy in mind, we discuss the evolution of domains, their combination into proteins, and evidence as to the likely origin of protein domains. We also discuss when and how analysis of domains can be used to understand details of protein function. Unconventional features of domain evolution such as intragenomic competition, domain insertion, horizontal gene transfer, and convergent evolution are seen as analogs of organismal evolutionary events. These parallels illustrate how the concept of domains can be applied to provide insights into evolutionary biology.
Affiliation(s)
- Chris P Ponting
- Department of Human Anatomy and Genetics, University of Oxford, MRC Functional Genetics Unit, South Parks Road, Oxford OX1 3QX, UK.
|
18
|
Aloy P, Oliva B, Querol E, Aviles FX, Russell RB. Structural similarity to link sequence space: new potential superfamilies and implications for structural genomics. Protein Sci 2002; 11:1101-16. [PMID: 11967367 PMCID: PMC2373547 DOI: 10.1110/ps.3950102] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 12/28/2022]
Abstract
The current pace of structural biology now means that protein three-dimensional structure can be known before protein function, making methods for assigning homology via structure comparison of growing importance. Previous research has suggested that sequence similarity after structure-based alignment is one of the best discriminators of homology and often functional similarity. Here, we exploit this observation, together with a merger of protein structure and sequence databases, to predict distant homologous relationships. We use the Structural Classification of Proteins (SCOP) database to link sequence alignments from the SMART and Pfam databases. We thus provide new alignments that could not be constructed easily in the absence of known three-dimensional structures. We then extend the method of Murzin (1993b) to assign statistical significance to sequence identities found after structural alignment and thus suggest the best link between diverse sequence families. We find that several distantly related protein sequence families can be linked with confidence, showing the approach to be a means for inferring homologous relationships and thus possible functions when proteins are of known structure but of unknown function. The analysis also finds several new potential superfamilies, where inspection of the associated alignments and superimpositions reveals conservation of unusual structural features or co-location of conserved amino acids and bound substrates. We discuss implications for Structural Genomics initiatives and for improvements to sequence comparison methods.
Affiliation(s)
- Patrick Aloy
- EMBL, Biocomputing, Meyerhofstrasse 1, D-69117 Heidelberg, Germany
|
19
|
Rigden DJ. Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments. Protein Eng Des Sel 2002; 15:65-77. [PMID: 11917143 DOI: 10.1093/protein/15.2.65] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022] Open
Abstract
Current methods for identification of domains within protein sequences require either structural information or the identification of homologous domain sequences in different sequence contexts. Knowledge of structural domain boundaries is important for fold recognition experiments and structural determination by X-ray crystallography or nuclear magnetic resonance spectroscopy using the divide-and-conquer approach. Here, a new and conceptually simple method for the identification of structural domain boundaries in multiple protein sequence alignments is presented. Analysis of covariance at positions within the alignment is first used to predict 3D contacts. By the nature of the domain as an independent folding unit, inter-domain predicted contacts are fewer than intra-domain predicted contacts. By analysing all possible domain boundaries and constructing a smoothed profile of predicted contact density (PCD), true structural domain boundaries are predicted as local profile minima associated with low PCD. A training data set is constructed from 52 non-homologous two-domain protein sequences of known 3D structure and used to determine optimal parameters for the profile analysis. The alignments in the training data set contained 48 +/- 17 (mean +/- SD) sequences and lengths of 257 +/- 121 residues. Of the 47 alignments yielding predictions, 35% of true domain boundaries are predicted to within 15 amino acids by the local profile minimum with the lowest profile value. Including predictions from the second- and third-lowest local minima increases the correct domain boundary coverage to 60%, whereas the lowest five local minima cover 79% of correct domain boundaries. Through further profile analysis, criteria are presented which reliably identify subsets of more accurate predictions. Retrospective analysis of CASP3 targets shows predictions of sufficient accuracy to enable dramatically improved fold recognition results. 
Finally, a prediction is made for geminivirus AL1 protein which is in full agreement with biochemical data, yielding a plausible, novel threading result.
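The central idea of this abstract, that a true domain boundary is crossed by few predicted contacts, can be sketched in Python as follows. This is an illustration under stated assumptions, not Rigden's published procedure: the names `contact_density_profile` and `predicted_boundary`, the normalisation by the number of possible cross-boundary residue pairs, and the omission of profile smoothing and of the trained parameters are all choices made here for brevity.

```python
def contact_density_profile(n, contacts, margin=1):
    """Density of predicted contacts crossing each candidate boundary.

    n        -- number of alignment positions (residues 0 .. n-1)
    contacts -- iterable of (i, j) predicted contact pairs
    margin   -- minimum number of residues on each side of a boundary (>= 1)

    A boundary b separates residues 0..b-1 from residues b..n-1.
    """
    profile = {}
    for b in range(margin, n - margin + 1):
        # A contact crosses b when its two residues lie on opposite sides.
        crossing = sum(1 for i, j in contacts if min(i, j) < b <= max(i, j))
        # Assumed normalisation: divide by the number of possible
        # residue pairs that span the boundary.
        profile[b] = crossing / (b * (n - b))
    return profile


def predicted_boundary(n, contacts, margin=1):
    """Return the candidate boundary with the lowest contact density."""
    profile = contact_density_profile(n, contacts, margin)
    return min(profile, key=profile.get)


# Toy two-domain example: residues 0-4 and 5-9, with one inter-domain contact.
toy_contacts = [(0, 4), (1, 3), (0, 2), (2, 4),
                (5, 9), (6, 8), (5, 7), (7, 9),
                (4, 5)]
print(predicted_boundary(10, toy_contacts))
```

In practice the profile would be smoothed and several of the lowest local minima examined, as the abstract describes; the sketch reports only the single global minimum.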
Affiliation(s)
- Daniel J Rigden
- Embrapa Genetic Resources and Biotechnology, Cenargen/Embrapa, S.A.I.N. Parque Rural, Final W5, Asa Norte, 70770-900, Brasília, Brazil.
|
20
|
Bonneau R, Baker D. Ab initio protein structure prediction: progress and prospects. Annu Rev Biophys Biomol Struct 2001; 30:173-89. [PMID: 11340057 DOI: 10.1146/annurev.biophys.30.1.173] [Citation(s) in RCA: 226] [Impact Index Per Article: 9.8] [Indexed: 11/09/2022]
Abstract
Considerable recent progress has been made in the field of ab initio protein structure prediction, as witnessed by the third Critical Assessment of Structure Prediction (CASP3). In spite of this progress, much work remains, for the field has yet to produce consistently reliable ab initio structure prediction protocols. In this work, we review the features of current ab initio protocols in an attempt to highlight the foundations of recent progress in the field and suggest promising directions for future work.
Affiliation(s)
- R Bonneau
- Department of Biochemistry, University of Washington, Seattle, Washington, Box 357350, 98195, USA.
|
21
|
Lupas AN, Ponting CP, Russell RB. On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 2001; 134:191-203. [PMID: 11551179 DOI: 10.1006/jsbi.2001.4393] [Citation(s) in RCA: 206] [Impact Index Per Article: 9.0] [Indexed: 11/22/2022]
Abstract
This paper presents and discusses evidence suggesting how the diversity of domain folds in existence today might have evolved from peptide ancestors. We apply a structure similarity detection method to detect instances where localized regions of different protein folds contain highly similar sequences and structures. Results of performing an all-on-all comparison of known structures are described and compared with other recently published findings. The numerous instances of local sequence and structure similarities within different protein folds, together with evidence from proteins containing sequence and structure repeats, argues in favor of the evolution of modern single polypeptide domains from ancient short peptide ancestors (antecedent domain segments (ADSs)). In this model, ancient protein structures were formed by self-assembling aggregates of short polypeptides. Subsequently, and perhaps concomitantly with the evolution of higher fidelity DNA replication and repair systems, single polypeptide domains arose from the fusion of ADSs genes. Thus modern protein domains may have a polyphyletic origin.
Affiliation(s)
- A N Lupas
- Bioinformatics, GlaxoSmithKline, UP1345, 1250 South Collegeville Road, Collegeville, Pennsylvania 19426-0989, USA
|
22
|
Abstract
Typically, protein spatial structures are more conserved in evolution than amino acid sequences. However, the recent explosion of sequence and structure information accompanied by the development of powerful computational methods led to the accumulation of examples of homologous proteins with globally distinct structures. Significant sequence conservation, local structural resemblance, and functional similarity strongly indicate evolutionary relationships between these proteins despite pronounced structural differences at the fold level. Several mechanisms such as insertions/deletions/substitutions, circular permutations, and rearrangements in beta-sheet topologies account for the majority of detected structural irregularities. The existence of evolutionarily related proteins that possess different folds brings new challenges to the homology modeling techniques and the structure classification strategies and offers new opportunities for protein design in experimental studies.
Affiliation(s)
- N V Grishin
- Howard Hughes Medical Institute, Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, Texas 75390-9050, USA
|
23
|
Bonneau R, Tsai J, Ruczinski I, Baker D. Functional inferences from blind ab initio protein structure predictions. J Struct Biol 2001; 134:186-90. [PMID: 11551178 DOI: 10.1006/jsbi.2000.4370] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.1] [Indexed: 11/22/2022]
Abstract
Ab initio protein structure prediction methods have improved dramatically in the past several years. Because these methods require only the sequence of the protein of interest, they are potentially applicable to the open reading frames in the many organisms whose sequences have been and will be determined. Ab initio methods cannot currently produce models of high enough resolution for use in rational drug design, but there is an exciting potential for using the methods for functional annotation of protein sequences on a genomic scale. Here we illustrate how functional insights can be obtained from low-resolution predicted structures using examples from blind ab initio structure predictions from the third and fourth critical assessment of structure prediction (CASP3, CASP4) experiments.
Affiliation(s)
- R Bonneau
- Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA
|
24
|
Todd AE, Orengo CA, Thornton JM. Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol 2001; 307:1113-43. [PMID: 11286560 DOI: 10.1006/jmbi.2001.4513] [Citation(s) in RCA: 459] [Impact Index Per Article: 20.0] [Indexed: 11/22/2022]
Abstract
The recent growth in protein databases has revealed the functional diversity of many protein superfamilies. We have assessed the functional variation of homologous enzyme superfamilies containing two or more enzymes, as defined by the CATH protein structure classification, by way of the Enzyme Commission (EC) scheme. Combining sequence and structure information to identify relatives, the majority of superfamilies display variation in enzyme function, with 25% of superfamilies in the PDB having members of different enzyme types. We determined the extent of functional similarity at different levels of sequence identity for 486,000 homologous pairs (enzyme/enzyme and enzyme/non-enzyme), with structural and sequence relatives included. For single and multi-domain proteins, variation in EC number is rare above 40% sequence identity, and above 30%, the first three digits may be predicted with an accuracy of at least 90%. For more distantly related proteins sharing less than 30% sequence identity, functional variation is significant, and below this threshold, structural data are essential for understanding the molecular basis of observed functional differences. To explore the mechanisms for generating functional diversity during evolution, we have studied in detail 31 diverse structural enzyme superfamilies for which structural data are available. A large number of variations and peculiarities are observed, at the atomic level through to gross structural rearrangements. Almost all superfamilies exhibit functional diversity generated by local sequence variation and domain shuffling. Commonly, substrate specificity is diverse across a superfamily, whilst the reaction chemistry is maintained. In many superfamilies, the position of catalytic residues may vary despite playing equivalent functional roles in related proteins. The implications of functional diversity within superfamilies for the structural genomics projects are discussed.
More detailed information on these superfamilies is available at http://www.biochem.ucl.ac.uk/bsm/FAM-EC/.
Affiliation(s)
- A E Todd
- Biochemistry and Molecular Biology Department, University College London, Gower Street, London, WC1E 6BT, UK
|
25
|
Ostermeier M, Benkovic SJ. Evolution of protein function by domain swapping. Adv Protein Chem 2001; 55:29-77. [PMID: 11050932 DOI: 10.1016/s0065-3233(01)55002-0] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Indexed: 12/24/2022]
Affiliation(s)
- M Ostermeier
- Department of Chemical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
|
26
|
Schirra HJ, Scanlon MJ, Lee MC, Anderson MA, Craik DJ. The solution structure of C1-T1, a two-domain proteinase inhibitor derived from a circular precursor protein from Nicotiana alata. J Mol Biol 2001; 306:69-79. [PMID: 11178894 DOI: 10.1006/jmbi.2000.4318] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022]
Abstract
A two-domain portion of the proteinase inhibitor precursor from Nicotiana alata (NaProPI) has been expressed and its structure determined by NMR spectroscopy. NaProPI contains six almost identical 53 amino acid repeats that fold into six highly similar domains; however, the sequence repeats do not coincide with the structural domains. Five of the structural domains comprise the C-terminal portion of one repeat and the N-terminal portion of the next. The sixth domain contains the C-terminal portion of the sixth repeat and the N-terminal portion of the first repeat. Disulphide bonds link these C and N-terminal fragments to generate the clasped-bracelet fold of NaProPI. The three-dimensional structure of NaProPI is not known, but it is conceivable that adjacent domains in NaProPI interact to generate the circular "bracelet" with the N and C termini in close enough proximity to facilitate formation of the disulphide bonds that form the "clasp". The expressed protein, examined in the current study, comprises residues 25-135 of NaProPI and encompasses the first two contiguous structural domains, namely the chymotrypsin inhibitor C1 and the trypsin inhibitor T1, joined by a five-residue linker, and is referred to as C1-T1. The tertiary structure of each domain in C1-T1 is identical to that found in the isolated inhibitors. However, no nuclear Overhauser effect contacts are observed between the two domains and the five-residue linker adopts an extended conformation. The absence of interactions between the domains indicates that adjacent domains do not specifically interact to drive the circularisation of NaProPI. These results are in agreement with recent data which describe similar PI precursors from other members of the Solanaceae having two, three, or four repeats. The lack of strong interdomain association is likely to be important for the function of individual inhibitors by ensuring that there is no masking of reactive sites upon release from the precursor.
Affiliation(s)
- H J Schirra
- Institute for Molecular Bioscience, University of Queensland, St. Lucia, Queensland, 4072, Australia
|
27
|
Collinet B, Herve M, Pecorari F, Minard P, Eder O, Desmadril M. Functionally accepted insertions of proteins within protein domains. J Biol Chem 2000; 275:17428-33. [PMID: 10747943 DOI: 10.1074/jbc.m000666200] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.6] [Indexed: 11/06/2022] Open
Abstract
Experiments were designed to explore the tolerance of protein structure and folding to very large insertions of folded protein within a structural domain. Dihydrofolate reductase and beta-lactamase have been inserted in four different positions of phosphoglycerate kinase. The resultant chimeric proteins are all overexpressed, and the host as well as the inserted partners are functional. Although not explicitly designed, functional coupling between the two fused partners was observed in some of the chimeras. These results show that the tolerance of protein structures to very large structured insertions is more general than previously expected and supports the idea that the natural sequence continuity of a structural domain is not required for the folding process. These results directly suggest a new experimental approach to screen, for example, for folded protein in randomized polypeptide sequences.
Affiliation(s)
- B Collinet
- Laboratoire de Modélisation et d'Ingénierie des Protéines, EP1088 Université de Paris-Sud, F-91405 Orsay Cedex, France
|
28
|
Andrade MA, Ponting CP, Gibson TJ, Bork P. Homology-based method for identification of protein repeats using statistical significance estimates. J Mol Biol 2000; 298:521-37. [PMID: 10772867 DOI: 10.1006/jmbi.2000.3684] [Citation(s) in RCA: 156] [Impact Index Per Article: 6.5] [Indexed: 11/22/2022]
Abstract
Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the 11 families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families.
Affiliation(s)
- M A Andrade
- European Molecular Biology Laboratory, Meyerhofstr. 1, Heidelberg, 69012, Germany
|
29
|
|