1
|
Pei J, Andreeva A, Chuguransky S, Lázaro Pinto B, Paysan-Lafosse T, Dustin Schaeffer R, Bateman A, Cong Q, Grishin NV. Bridging the Gap between Sequence and Structure Classifications of Proteins with AlphaFold Models. J Mol Biol 2024; 436:168764. [PMID: 39197652 DOI: 10.1016/j.jmb.2024.168764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2024] [Revised: 08/13/2024] [Accepted: 08/20/2024] [Indexed: 09/01/2024]
Abstract
Classification of protein domains based on homology and structural similarity serves as a fundamental tool to gain biological insights into protein function. Recent advancements in protein structure prediction, exemplified by AlphaFold, have revolutionized the availability of protein structural data. We focus on classifying about 9000 Pfam families into ECOD (Evolutionary Classification of Domains) by using predicted AlphaFold models and the DPAM (Domain Parser for AlphaFold Models) tool. Our results offer insights into their homologous relationships and domain boundaries. More than half of these Pfam families contain DPAM domains that can be confidently assigned to the ECOD hierarchy. Most assigned domains belong to highly populated folds such as Immunoglobulin-like (IgL), Armadillo (ARM), helix-turn-helix (HTH), and Src homology 3 (SH3). A large fraction of DPAM domains, however, cannot be confidently assigned to ECOD homologous groups. These unassigned domains exhibit statistically different characteristics, including shorter average length, fewer secondary structure elements, and more abundant transmembrane segments. They could potentially define novel families remotely related to domains with known structures or novel superfamilies and folds. Manual scrutiny of a subset of these domains revealed an abundance of internal duplications and recurring structural motifs. Exploring sequence and structural features such as disulfide bond patterns, metal-binding sites, and enzyme active sites helped uncover novel structural folds as well as remote evolutionary relationships. By bridging the gap between sequence-based Pfam and structure-based ECOD domain classifications, our study contributes to a more comprehensive understanding of the protein universe by providing structural and functional insights into previously uncharacterized proteins.
Collapse
Affiliation(s)
- Jimin Pei
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Antonina Andreeva
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Sara Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Beatriz Lázaro Pinto
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Typhaine Paysan-Lafosse
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.
| |
Collapse
|
2
|
Carter CW. Base Pairing Promoted the Self-Organization of Genetic Coding, Catalysis, and Free-Energy Transduction. Life (Basel) 2024; 14:199. [PMID: 38398709 PMCID: PMC10890426 DOI: 10.3390/life14020199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2024] [Revised: 01/21/2024] [Accepted: 01/25/2024] [Indexed: 02/25/2024] Open
Abstract
How Nature discovered genetic coding is a largely ignored question, yet the answer is key to explaining the transition from biochemical building blocks to life. Other, related puzzles also fall inside the aegis enclosing the codes themselves. The peptide bond is unstable with respect to hydrolysis. So, it requires some form of chemical free energy to drive it. Amino acid activation and acyl transfer are also slow and must be catalyzed. All living things must thus also convert free energy and synchronize cellular chemistry. Most importantly, functional proteins occupy only small, isolated regions of sequence space. Nature evolved heritable symbolic data processing to seek out and use those sequences. That system has three parts: a memory of how amino acids behave in solution and inside proteins, a set of code keys to access that memory, and a scoring function. The code keys themselves are the genes for cognate pairs of tRNA and aminoacyl-tRNA synthetases, AARSs. The scoring function is the enzymatic specificity constant, kcat/kM, which measures both catalysis and specificity. The work described here deepens the evidence for and understanding of an unexpected consequence of ancestral bidirectional coding. Secondary structures occur in approximately the same places within antiparallel alignments of their gene products. However, the polar amino acids that define the molecular surface of one are reflected into core-defining non-polar side chains on the other. Proteins translated from base-paired coding strands fold up inside out. Bidirectional genes thus project an inverted structural duality into the proteome. I review how experimental data root the scoring functions responsible for the origins of coding and catalyzed activation of unfavorable chemical reactions in that duality.
Collapse
Affiliation(s)
- Charles W Carter
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7260, USA
| |
Collapse
|
3
|
Chandonia JM, Fox NK, Brenner SE. SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res 2020; 47:D475-D481. [PMID: 30500919 PMCID: PMC6323910 DOI: 10.1093/nar/gky1134] [Citation(s) in RCA: 81] [Impact Index Per Article: 20.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2018] [Accepted: 11/27/2018] [Indexed: 11/12/2022] Open
Abstract
The SCOPe (Structural Classification of Proteins—extended, https://scop.berkeley.edu) database hierarchically classifies domains from the majority of proteins of known structure according to their structural and evolutionary relationships. SCOPe also incorporates and updates the ASTRAL compendium, which provides multiple databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. Protein structures are classified using a combination of manual curation and highly precise automated methods. In the current release of SCOPe, 2.07, we have focused our manual curation efforts on larger protein structures, including the spliceosome, proteasome and RNA polymerase I, as well as many other Pfam families that had not previously been classified. Domains from these large protein complexes are distinctive in several ways: novel non-globular folds are more common, and domains from previously observed protein families often have N- or C-terminal extensions that were disordered or not present in previous structures. The current monthly release update, SCOPe 2.07–2018-10–18, classifies 90 992 PDB entries (about two thirds of PDB entries).
Collapse
Affiliation(s)
- John-Marc Chandonia
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.,Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Naomi K Fox
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.,Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Steven E Brenner
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.,Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
| |
Collapse
|
4
|
Exploring the dark foldable proteome by considering hydrophobic amino acids topology. Sci Rep 2017; 7:41425. [PMID: 28134276 PMCID: PMC5278394 DOI: 10.1038/srep41425] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2016] [Accepted: 12/19/2016] [Indexed: 12/18/2022] Open
Abstract
The protein universe corresponds to the set of all proteins found in all organisms. A way to explore it is by taking into account the domain content of the proteins. However, some part of sequences and many entire sequences remain un-annotated despite a converging number of domain families. The un-annotated part of the protein universe is referred to as the dark proteome and remains poorly characterized. In this study, we quantify the amount of foldable domains within the dark proteome by using the hydrophobic cluster analysis methodology. These un-annotated foldable domains were grouped using a combination of remote homology searches and domain annotations, leading to define different levels of darkness. The dark foldable domains were analyzed to understand what make them different from domains stored in databases and thus difficult to annotate. The un-annotated domains of the dark proteome universe display specific features relative to database domains: shorter length, non-canonical content and particular topology in hydrophobic residues, higher propensity for disorder, and a higher energy. These features make them hard to relate to known families. Based on these observations, we emphasize that domain annotation methodologies can still be improved to fully apprehend and decipher the molecular evolution of the protein universe.
Collapse
|
5
|
Lu Y, Lu Y, Deng J, Peng H, Lu H, Lu LJ. A novel essential domain perspective for exploring gene essentiality. Bioinformatics 2015; 31:2921-9. [PMID: 26002906 DOI: 10.1093/bioinformatics/btv312] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2015] [Accepted: 05/13/2015] [Indexed: 02/05/2023] Open
Abstract
MOTIVATION Genes with indispensable functions are identified as essential; however, the traditional gene-level studies of essentiality have several limitations. In this study, we characterized gene essentiality from a new perspective of protein domains, the independent structural or functional units of a polypeptide chain. RESULTS To identify such essential domains, we have developed an Expectation-Maximization (EM) algorithm-based Essential Domain Prediction (EDP) Model. With simulated datasets, the model provided convergent results given different initial values and offered accurate predictions even with noise. We then applied the EDP model to six microbial species and predicted 1879 domains to be essential in at least one species, ranging 10-23% in each species. The predicted essential domains were more conserved than either non-essential domains or essential genes. Comparing essential domains in prokaryotes and eukaryotes revealed an evolutionary distance consistent with that inferred from ribosomal RNA. When utilizing these essential domains to reproduce the annotation of essential genes, we received accurate results that suggest protein domains are more basic units for the essentiality of genes. Furthermore, we presented several examples to illustrate how the combination of essential and non-essential domains can lead to genes with divergent essentiality. In summary, we have described the first systematic analysis on gene essentiality on the level of domains. CONTACT huilu.bioinfo@gmail.com or Long.Lu@cchmc.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yao Lu
- Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, 24/1400 Beijing (W) Road, Shanghai 200040, People's Republic of China
| | - Yulan Lu
- State Key Laboratory of Genetic Engineering Institute of Biostatistics, School of Life Science, Fudan University, Shanghai 200433, People's Republic of China
| | - Jingyuan Deng
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
| | - Hai Peng
- Institute for Systems Biology, Jianghan University, Wuhan, Hubei, People's Republic of China
| | - Hui Lu
- Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, 24/1400 Beijing (W) Road, Shanghai 200040, People's Republic of China, Department of Bioengineering (MC 063), University of Illinois at Chicago, Chicago, IL 60607-7052, USA and Collaborative Innovation Center for Biotherapy, West China Hospital, Sichuan University, Chengdu, China
| | - Long Jason Lu
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA, Institute for Systems Biology, Jianghan University, Wuhan, Hubei, People's Republic of China
| |
Collapse
|
6
|
Global patterns of protein domain gain and loss in superkingdoms. PLoS Comput Biol 2014; 10:e1003452. [PMID: 24499935 PMCID: PMC3907288 DOI: 10.1371/journal.pcbi.1003452] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2013] [Accepted: 12/03/2013] [Indexed: 12/21/2022] Open
Abstract
Domains are modules within proteins that can fold and function independently and are evolutionarily conserved. Here we compared the usage and distribution of protein domain families in the free-living proteomes of Archaea, Bacteria and Eukarya and reconstructed species phylogenies while tracing the history of domain emergence and loss in proteomes. We show that both gains and losses of domains occurred frequently during proteome evolution. The rate of domain discovery increased approximately linearly in evolutionary time. Remarkably, gains generally outnumbered losses and the gain-to-loss ratios were much higher in akaryotes compared to eukaryotes. Functional annotations of domain families revealed that both Archaea and Bacteria gained and lost metabolic capabilities during the course of evolution while Eukarya acquired a number of diverse molecular functions including those involved in extracellular processes, immunological mechanisms, and cell regulation. Results also highlighted significant contemporary sharing of informational enzymes between Archaea and Eukarya and metabolic enzymes between Bacteria and Eukarya. Finally, the analysis provided useful insights into the evolution of species. The archaeal superkingdom appeared first in evolution by gradual loss of ancestral domains, bacterial lineages were the first to gain superkingdom-specific domains, and eukaryotes (likely) originated when an expanding proto-eukaryotic stem lineage gained organelles through endosymbiosis of already diversified bacterial lineages. The evolutionary dynamics of domain families in proteomes and the increasing number of domain gains is predicted to redefine the persistence strategies of organisms in superkingdoms, influence the make up of molecular functions, and enhance organismal complexity by the generation of new domain architectures. This dynamics highlights ongoing secondary evolutionary adaptations in akaryotic microbes, especially Archaea.
Collapse
|
7
|
Bornberg-Bauer E, Albà MM. Dynamics and adaptive benefits of modular protein evolution. Curr Opin Struct Biol 2013; 23:459-66. [PMID: 23562500 DOI: 10.1016/j.sbi.2013.02.012] [Citation(s) in RCA: 80] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2013] [Revised: 02/15/2013] [Accepted: 02/15/2013] [Indexed: 11/29/2022]
Abstract
During protein evolution, novel domain arrangements are continuously formed. Rearrangements are important for the creation of molecular biodiversity and for functional molecular changes which underlie developmental shifts in the bauplan of organisms. Here we review the mechanisms by which new arrangements arise and the potential benefits of rearrangements. We concentrate on how new domains emerge and why they rapidly spread across genomes, gaining higher copy numbers than older, more established domains. This spread is most likely a consequence of their high adaptive potential but is unlikely to make up on its own for the drastic loss of domains, which is observed across different taxa. We show that a significant portion of the recently emerged domains, especially those in multidomain families, are highly disordered and speculate about the significance of these findings for the evolvability of novel genetic material.
Collapse
Affiliation(s)
- Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, School of Biological Sciences, University of Münster, Hüfferstrasse 1, D48149 Münster, Germany.
| | | |
Collapse
|
8
|
Xu Q, Dunbrack RL. Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB. Bioinformatics 2012; 28:2763-72. [PMID: 22942020 DOI: 10.1093/bioinformatics/bts533] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. RESULTS We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM-HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. AVAILABILITY The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
Collapse
Affiliation(s)
- Qifang Xu
- Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA
| | | |
Collapse
|
9
|
Charoensawan V, Wilson D, Teichmann SA. Genomic repertoires of DNA-binding transcription factors across the tree of life. Nucleic Acids Res 2010; 38:7364-77. [PMID: 20675356 PMCID: PMC2995046 DOI: 10.1093/nar/gkq617] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Sequence-specific transcription factors (TFs) are important to genetic regulation in all organisms because they recognize and directly bind to regulatory regions on DNA. Here, we survey and summarize the TF resources available. We outline the organisms for which TF annotation is provided, and discuss the criteria and methods used to annotate TFs by different databases. By using genomic TF repertoires from ∼700 genomes across the tree of life, covering Bacteria, Archaea and Eukaryota, we review TF abundance with respect to the number of genes, as well as their structural complexity in diverse lineages. While typical eukaryotic TFs are longer than the average eukaryotic proteins, the inverse is true for prokaryotes. Only in eukaryotes does the same family of DNA-binding domain (DBD) occur multiple times within one polypeptide chain. This potentially increases the length and diversity of DNA-recognition sequence by reusing DBDs from the same family. We examined the increase in TF abundance with the number of genes in genomes, using the largest set of prokaryotic and eukaryotic genomes to date. As pointed out before, prokaryotic TFs increase faster than linearly. We further observe a similar relationship in eukaryotic genomes with a slower increase in TFs.
Collapse
|
10
|
Abstract
The protein universe is the set of all proteins of all organisms. Here, all currently known sequences are analyzed in terms of families that have single-domain or multidomain architectures and whether they have a known three-dimensional structure. Growth of new single-domain families is very slow: Almost all growth comes from new multidomain architectures that are combinations of domains characterized by approximately 15,000 sequence profiles. Single-domain families are mostly shared by the major groups of organisms, whereas multidomain architectures are specific and account for species diversity. There are known structures for a quarter of the single-domain families, and >70% of all sequences can be partially modeled thanks to their membership in these families.
Collapse
|
11
|
Wang Q, Dunbrack RL. An Integrated Database for Complex Protein Structure Modeling. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2008; 2008:33-40. [PMID: 21151784 PMCID: PMC3000042 DOI: 10.1109/bibmw.2008.4686206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
In homology modeling of protein structures, it is typical to find templates through a sequence search against a database of proteins with known structures. In more complicated modeling cases, such as modeling a protein structure in contact with a ligand, sequence information itself may not be enough and more biological information is required for a successful modeling process. SCOP and PFAM are two databases providing protein domain information which can be utilized in complex protein structure modeling. However, due to the manually-curated nature of both databases, they fail to provide timely coverage of protein sequences existing in the Protein Data Bank (PDB). In this paper, we introduce a new relational database, IDOPS, which integrates sequence and biological information extracted from remediated PDB files and protein domain information generated with HMM profiles of PFAM families. With a carefully designed protocol, this database is updated regularly and the coverage rate of PDB entries is guaranteed to be high.
Collapse
|
12
|
Ye L, Liu T, Wu Z, Zhou R. Sequence-based protein domain boundary prediction using BP neural network with various property profiles. Proteins 2008; 71:300-7. [PMID: 17932915 DOI: 10.1002/prot.21745] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Given the rapid growth in the number of sequences without known structures, it is becoming increasingly important to not only accurately define protein structural domains but also predict domain boundaries from the amino-acid sequence alone. In this article, we present a Back-Propagation (BP) neural network method using 9 different sequence profiles, based on chemical, physical, and statistical properties, to predict the domain boundary of two-domain proteins from one dimensional sequences. We have achieved an accuracy of 69% with a 10-fold cross validation on a 238 nonredundant two-domain protein dataset that we built based on a common set from both SCOP and CATH classifications. The method has also been applied to a larger third-party dataset with 522 proteins; and an accuracy of 62% has been achieved. Our prediction results on both datasets are found to be significantly better than those from some other methods, such as DomCut and DGS on the same datasets, and also comparable to that from the PPRODO method, upon which the larger dataset was based. Our cross validation results are also noticeably better than previous ones from other BP neural network methods, probably because we have used more property descriptors with significantly more training nodes in our neural network. The integration with PPRODO method also indicates that the information obtained from our current approach is complementary to that available through multiple sequence alignments. Moreover, the relative importance of each property profile has been analyzed in detail.
Collapse
Affiliation(s)
- Lei Ye
- Department of Computer Science, Zhejiang University, Hangzhou, China
| | | | | | | |
Collapse
|
13
|
Gewehr JE, Hintermair V, Zimmer R. AutoSCOP: automated prediction of SCOP classifications using unique pattern-class mappings. Bioinformatics 2007; 23:1203-10. [PMID: 17379694 DOI: 10.1093/bioinformatics/btm089] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The sequence patterns contained in the available motif and hidden Markov model (HMM) databases are a valuable source of information for protein sequence annotation. For structure prediction and fold recognition purposes, we computed mappings from such pattern databases to the protein domain hierarchy given by the ASTRAL compendium and applied them to the prediction of SCOP classifications. Our aim is to make highly confident predictions also for non-trivial cases if possible and abstain from a prediction otherwise, and thus to provide a method that can be used as a first step in a pipeline of prediction methods. We describe two successful examples for such pipelines. With the AutoSCOP approach, it is possible to make predictions in a large-scale manner for many domains of the available sequences in the well-known protein sequence databases. RESULTS AutoSCOP computes unique sequence patterns and pattern combinations for SCOP classifications. For instance, we assign a SCOP superfamily to a pattern found in its members whenever the pattern does not occur in any other SCOP superfamily. Especially on the fold and superfamily level, our method achieves both high sensitivity (above 93%) and high specificity (above 98%) on the difference set between two ASTRAL versions, due to being able to abstain from unreliable predictions. Further, on a harder test set filtered at low sequence identity, the combination with profile-profile alignments improves accuracy and performs comparably even to structure alignment methods. Integrating our method with structure alignment, we are able to achieve an accuracy of 99% on SCOP fold classifications on this set. In an analysis of false assignments of domains from new folds/superfamilies/families to existing SCOP classifications, AutoSCOP correctly abstains for more than 70% of the domains belonging to new folds and superfamilies, and more than 80% of the domains belonging to new families. These findings show that our approach is a useful additional filter for SCOP classification prediction of protein domains in combination with well-known methods such as profile-profile alignment. AVAILABILITY A web server where users can input their domain sequences is available at http://www.bio.ifi.lmu.de/autoscop.
Collapse
Affiliation(s)
- Jan E Gewehr
- Practical Informatics and Bioinformatics Group, Department of Informatics, Ludwig-Maximilians-University Munich, Amalienstr. 17, D-80333 Munich, Germany.
| | | | | |
Collapse
|
14
|
Ding C, Zhang Y, Li T, Holbrook S. Biclustering Protein Complex Interactions with a Biclique Finding Algorithm. ACTA ACUST UNITED AC 2006. [DOI: 10.1109/icdm.2006.27] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
15
|
Tracing the origin of functional and conserved domains in the human proteome: implications for protein evolution at the modular level. BMC Evol Biol 2006; 6:91. [PMID: 17090320 PMCID: PMC1654190 DOI: 10.1186/1471-2148-6-91] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2006] [Accepted: 11/07/2006] [Indexed: 11/29/2022] Open
Abstract
Background The functional repertoire of the human proteome is an incremental collection of functions accomplished by protein domains evolved along the Homo sapiens lineage. Therefore, knowledge on the origin of these functionalities provides a better understanding of the domain and protein evolution in human. The lack of proper comprehension about such origin has impelled us to study the evolutionary origin of human proteome in a unique way as detailed in this study. Results This study reports a unique approach for understanding the evolution of human proteome by tracing the origin of its constituting domains hierarchically, along the Homo sapiens lineage. The uniqueness of this method lies in subtractive searching of functional and conserved domains in the human proteome resulting in higher efficiency of detecting their origins. From these analyses the nature of protein evolution and trends in domain evolution can be observed in the context of the entire human proteome data. The method adopted here also helps delineate the degree of divergence of functional families occurred during the course of evolution. Conclusion This approach to trace the evolutionary origin of functional domains in the human proteome facilitates better understanding of their functional versatility as well as provides insights into the functionality of hypothetical proteins present in the human proteome. This work elucidates the origin of functional and conserved domains in human proteins, their distribution along the Homo sapiens lineage, occurrence frequency of different domain combinations and proteome-wide patterns of their distribution, providing insights into the evolutionary solution to the increased complexity of the human proteome.
Collapse
|
16
|
Han A, Kang HJ, Cho Y, Lee S, Kim YJ, Gong S. SNP@Domain: a web resource of single nucleotide polymorphisms (SNPs) within protein domain structures and sequences. Nucleic Acids Res 2006; 34:W642-4. [PMID: 16845090 PMCID: PMC1538855 DOI: 10.1093/nar/gkl323] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
The single nucleotide polymorphisms (SNPs) in conserved protein regions have been thought to be strong candidates that alter protein functions. Thus, we have developed SNP@Domain, a web resource, to identify SNPs within human protein domains. We annotated SNPs from dbSNP with protein structure-based as well as sequence-based domains: (i) structure-based using SCOP and (ii) sequence-based using Pfam to avoid conflicts from two domain assignment methodologies. Users can investigate SNPs within protein domains with 2D and 3D maps. We expect this visual annotation of SNPs within protein domains will help scientists select and interpret SNPs associated with diseases. A web interface for the SNP@Domain is freely available at and from .
Collapse
Affiliation(s)
| | | | - Yoobok Cho
- Object Interaction Technologies, Inc.Daejeon, Korea
| | | | | | - Sungsam Gong
- To whom correspondence should be addressed. Tel: +82 42 879 8549; Fax: +82 42 879 8519;
| |
Collapse
|
17
|
Weiner J, Beaussart F, Bornberg-Bauer E. Domain deletions and substitutions in the modular protein evolution. FEBS J 2006; 273:2037-47. [PMID: 16640566 DOI: 10.1111/j.1742-4658.2006.05220.x] [Citation(s) in RCA: 96] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
The main mechanisms shaping the modular evolution of proteins are gene duplication, fusion and fission, recombination and loss of fragments. While a large body of research has focused on duplications and fusions, we concentrated, in this study, on how domains are lost. We investigated motif databases and introduced a measure of protein similarity that is based on domain arrangements. Proteins are represented as strings of domains and comparison was based on the classic dynamic alignment scheme. We found that domain losses and duplications were more frequent at the ends of proteins. We showed that losses can be explained by the introduction of start and stop codons which render the terminal domains nonfunctional, such that further shortening, until the whole domain is lost, is not evolutionarily selected against. We demonstrated that domains which also occur as single-domain proteins are less likely to be lost at the N terminus and in the middle, than at the C terminus. We conclude that fission/fusion events with single-domain proteins occur mostly at the C terminus. We found that domain substitutions are rare, in particular in the middle of proteins. We also showed that many cases of substitutions or losses result from erroneous annotations, but we were also able to find courses of evolutionary events where domains vanish over time. This is explained by a case study on the bacterial formate dehydrogenases.
Collapse
Affiliation(s)
- January Weiner
- Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Germany
| | | | | |
Collapse
|