1
Yang L, Jiao X. Distinguishing Enzymes and Non-enzymes Based on Structural Information with an Alignment Free Approach. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200324134037] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022]
Abstract
Background:
Knowledge of protein functions is very crucial for the understanding of biological processes. Experimental methods for protein function prediction are powerless to treat the growing amount of protein sequence and structure data.
Objective:
To develop some computational techniques for the protein function prediction.
Method:
Based on the residue interaction network features and the motion mode information, an SVM model was constructed and used as the predictor. The role of these features was analyzed and some interesting results were obtained.
Results:
An alignment-free method for the classification of enzyme and non-enzyme is developed in this work. There is not any single feature that occupies a dominant position in the prediction process. The topological and the information-theoretic residue interaction network features have a better performance. The combination of the fast mode and the slow mode can get a better explanation for the classification result.
Conclusion:
The method proposed in this paper can act as a classifier for the enzymes and nonenzymes.
Affiliation(s)
- Lifeng Yang
- College of Information and Computer, Taiyuan University of Technology, Taiyuan, 030600, China
- Xiong Jiao
- College of Biomedical Engineering, Taiyuan University of Technology, Taiyuan, 030600, China
2
Gerke M, Bornberg-Bauer E, Jiang X, Fuellen G. Finding Common Protein Interaction Patterns Across Organisms. Evol Bioinform Online 2017. [DOI: 10.1177/117693430600200011] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022] Open
Abstract
Protein interactions are an important resource to obtain an understanding of cell function. Recently, researchers have compared networks of interactions in order to understand network evolution. While current methods first infer homologs and then compare topologies, we here present a method which first searches for interesting topologies and then looks for homologs. PINA (protein interaction network analysis) takes the protein interaction networks of two organisms, scans both networks for subnetworks deemed interesting, and then tries to find orthologs among the interesting subnetworks. The application is very fast because orthology investigations are restricted to subnetworks like hubs and clusters that fulfill certain criteria regarding neighborhood and connectivity. Finally, the hubs or clusters found to be related can be visualized and analyzed according to protein annotation.
Affiliation(s)
- Mirco Gerke
- Division of Bioinformatics, Biology Department, Schlossplatz 4, D-48149 Münster, Germany
- Institut für Informatik, Fachbereich Mathematik und Informatik, Einsteinstr. 62, D-48149 Münster, Germany
- Erich Bornberg-Bauer
- Division of Bioinformatics, Biology Department, Schlossplatz 4, D-48149 Münster, Germany
- Xiaoyi Jiang
- Institut für Informatik, Fachbereich Mathematik und Informatik, Einsteinstr. 62, D-48149 Münster, Germany
- Georg Fuellen
- Division of Bioinformatics, Biology Department, Schlossplatz 4, D-48149 Münster, Germany
- Department of Medicine, AG Bioinformatics, Domagkstr. 3, D-48149 Münster, Germany
3
The Adaptive Evolution Database (TAED): A New Release of a Database of Phylogenetically Indexed Gene Families from Chordates. J Mol Evol 2017; 85:46-56. [PMID: 28795237 DOI: 10.1007/s00239-017-9806-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/19/2017] [Accepted: 08/03/2017] [Indexed: 12/11/2022]
Abstract
With the large collections of gene and genome sequences, there is a need to generate curated comparative genomic databases that enable interpretation of results in an evolutionary context. Such resources can facilitate an understanding of the co-evolution of genes in the context of a genome mapped onto a phylogeny, of a protein structure, and of interactions within a pathway. A phylogenetically indexed gene family database, the adaptive evolution database (TAED), is presented that organizes gene families and their evolutionary histories in a species tree context. Gene families include alignments, phylogenetic trees, lineage-specific dN/dS ratios, reconciliation with the species tree to enable both the mapping and the identification of duplication events, mapping of gene families onto pathways, and mapping of amino acid substitutions onto protein structures. In addition to organization of the data, new phylogenetic visualization tools have been developed to aid in interpreting the data that are also available, including TreeThrasher and TAED Tree Viewer. A new resource of gene families organized by species and taxonomic lineage promises to be a valuable comparative genomics database for molecular biologists, evolutionary biologists, and ecologists. The new visualization tools and database framework will be of interest to both evolutionary biologists and bioinformaticians.
4
Benner S. Uniting Natural History with the Molecular Sciences. The Ultimate Multidisciplinarity. Acc Chem Res 2017; 50:498-502. [PMID: 28945399 DOI: 10.1021/acs.accounts.6b00496] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022]
Abstract
Life and the Earth have coevolved over the past four billion years to deliver a rich diversity of biological structure, from biomolecules to macrophysiology. One grand challenge seeks to interconnect these structures, in ways acceptable to both natural historians and physical scientists, to give an interconnecting web of models and experiments to create a planetary understanding of the phenomenon that we call "life". The molecular scientist wants experiments; the natural historian wants reference to Darwinian fitness. Paleogenetics offers both.
Affiliation(s)
- Steven Benner
- Foundation for Applied Molecular Evolution, Firebird Biomolecular Sciences LLC and The Westheimer Institute for Science and Technology, Alachua, Florida 32615, United States
5
Tiwari AK, Srivastava R. A survey of computational intelligence techniques in protein function prediction. International Journal of Proteomics 2014; 2014:845479. [PMID: 25574395 PMCID: PMC4276698 DOI: 10.1155/2014/845479] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 09/10/2014] [Revised: 10/31/2014] [Accepted: 11/07/2014] [Indexed: 02/08/2023]
Abstract
During the past, there was a massive growth of knowledge of unknown proteins with the advancement of high throughput microarray technologies. Protein function prediction is the most challenging problem in bioinformatics. In the past, the homology based approaches were used to predict the protein function, but they failed when a new protein was different from the previous one. Therefore, to alleviate the problems associated with homology based traditional approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a state-of-the-art comprehensive review of various computational intelligence techniques for protein function predictions using sequence, structure, protein-protein interaction network, and gene expression data used in wide areas of applications such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the result obtained by many researchers to solve these problems by using computational intelligence techniques with appropriate datasets to improve the prediction performance. The summary shows that ensemble classifiers and integration of multiple heterogeneous data are useful for protein function prediction.
Affiliation(s)
- Arvind Kumar Tiwari
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
- Rajeev Srivastava
- Department of Computer Science & Engineering, Indian Institute of Technology (BHU), Varanasi 221005, India
6
Wang S, Wang R, Xu T. Purifying selection on leptin genes in teleosts may be due to poikilothermy. J Genet 2014; 93:551-6. [PMID: 25189258 DOI: 10.1007/s12041-014-0410-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/29/2022]
Affiliation(s)
- Shanchen Wang
- Laboratory of Fish Biogenetics and Immune Evolution, College of Marine Science, Zhejiang Ocean University, Zhoushan 316022, People's Republic of China.
7
Natural selection and adaptive evolution of leptin. Chinese Science Bulletin-Chinese 2013. [DOI: 10.1007/s11434-012-5635-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
8
Evidence for positive selection on the leptin gene in Cetacea and Pinnipedia. PLoS One 2011; 6:e26579. [PMID: 22046310 PMCID: PMC3203152 DOI: 10.1371/journal.pone.0026579] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 05/29/2011] [Accepted: 09/29/2011] [Indexed: 01/21/2023] Open
Abstract
The leptin gene has received intensive attention and scientific investigation for its importance in energy homeostasis and reproductive regulation in mammals. Furthermore, study of the leptin gene is of crucial importance for public health, particularly for its role in obesity, as well as for other numerous physiological roles that it plays in mammals. In the present work, we report the identification of novel leptin genes in 4 species of Cetacea, and a comparison with 55 publicly available leptin sequences from mammalian genome assemblies and previous studies. Our study provides evidence for positive selection in the suborder Odontoceti (toothed whales) of the Cetacea and the family Phocidae (earless seals) of the Pinnipedia. We also detected positive selection in several leptin gene residues in these two lineages. To test whether leptin and its receptor evolved in a coordinated manner, we analyzed 24 leptin receptor gene (LPR) sequences from available mammalian genome assemblies and other published data. Unlike the case of leptin, our analyses did not find evidence of positive selection for LPR across the Cetacea and Pinnipedia lineages. In line with this, positively selected sites identified in the leptin genes of these two lineages were located outside of leptin receptor binding sites, which at least partially explains why co-evolution of leptin and its receptor was not observed in the present study. Our study provides interesting insights into current understanding of the evolution of mammalian leptin genes in response to selective pressures from life in an aquatic environment, and leads to a hypothesis that new tissue specificity or novel physiologic functions of leptin genes may have arisen in both odontocetes and phocids. 
Additional data from other species encompassing varying life histories and functional tests of the adaptive role of the amino acid changes identified in this study will help determine the factors that promote the adaptive evolution of the leptin genes in marine mammals.
9
The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases. Biotechnol Adv 2011; 29:94-110. [DOI: 10.1016/j.biotechadv.2010.09.003] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Received: 03/27/2010] [Revised: 08/27/2010] [Accepted: 09/06/2010] [Indexed: 11/18/2022]
10
Chaurasiya M, Chandulah GB, Misra K, Chaurasiya VK. Nearest-neighbor classifier as a tool for classification of protein families. Bioinformation 2010; 4:396-8. [PMID: 20975888 PMCID: PMC2951634 DOI: 10.6026/97320630004396] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/07/2009] [Revised: 01/30/2010] [Accepted: 11/13/2010] [Indexed: 11/23/2022] Open
Abstract
Knowledge about protein function is essential in understanding the biological processes. A specific class or family of protein shares common structural and chemical properties amongst its member sequences. The set of properties that display its unique characteristics for clearly classifying a protein sequence into its corresponding protein family needs to be studied. Our study of these important properties conducted on four major classes of proteins namely Globins, Homeoboxes, Heat Shock proteins (HSP) and Kinase have shown that frequency of twenty naturally occurring amino acids, hydrophobic content of protein, molecular weight of protein, isoelectric point of protein, secondary structure composition of amino acid residues as helices, coils and sheets and the composition of helices, coils and sheets in the secondary structure topology plays a significant role in correctly classifying the protein into its corresponding class or family as indicated by the overall efficiency of Nearest Neighbor Classifier as 84.92%.
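The nearest-neighbor idea in this abstract can be illustrated with a minimal sketch. It uses only one of the listed properties, the frequency of the twenty amino acids; the other features (hydrophobic content, molecular weight, isoelectric point, secondary structure composition) are omitted. The function names, the Euclidean distance, and the toy sequences are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_neighbor(query, training):
    """Assign the query the label of the closest training sequence.

    `training` is a list of (sequence, label) pairs; closeness is the
    Euclidean distance between composition vectors (an assumed choice).
    """
    q = composition(query)
    label, _ = min(
        ((lab, math.dist(q, composition(seq))) for seq, lab in training),
        key=lambda pair: pair[1],
    )
    return label


# Toy, made-up sequences purely for illustration (not real family members).
training = [
    ("KKKKRRKKEE", "family_A"),
    ("LLLVVIILLA", "family_B"),
]
print(nearest_neighbor("KRKKRKEEKK", training))
```

A real classifier of this kind would concatenate the remaining features onto the composition vector and scale them before computing distances.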
Affiliation(s)
- Mona Chaurasiya
- Indian Institute of Information Technology, Allahabad, India
11
Tang ZQ, Lin HH, Zhang HL, Han LY, Chen X, Chen YZ. Prediction of functional class of proteins and peptides irrespective of sequence homology by support vector machines. Bioinform Biol Insights 2009; 1:19-47. [PMID: 20066123 PMCID: PMC2789692 DOI: 10.4137/bbi.s315] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022] Open
Abstract
Various computational methods have been used for the prediction of protein and peptide function based on their sequences. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, support vector machines (SVM), have been explored for predicting functional class of proteins and peptides from amino acid sequence derived properties independent of sequence similarity, which have shown promising potential for a wide spectrum of protein and peptide classes including some of the low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progresses, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented.
Affiliation(s)
- Zhi Qun Tang
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hong Huang Lin
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hai Lei Zhang
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Lian Yi Han
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Xin Chen
- Department of Biotechnology, Zhejiang University, Hang Zhou, Zhejiang Province, P. R. China, 310029
- Yu Zong Chen
- Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Shanghai Center for Bioinformatics Technology, Shanghai, P. R. China, 201203
12
Identification of protein functions using a machine-learning approach based on sequence-derived properties. Proteome Sci 2009; 7:27. [PMID: 19664241 PMCID: PMC2731080 DOI: 10.1186/1477-5956-7-27] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 03/13/2009] [Accepted: 08/09/2009] [Indexed: 02/07/2023] Open
Abstract
Background:
Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.
Results:
A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.
Conclusion:
We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
13
Qiu JD, Luo SH, Huang JH, Liang RP. Using support vector machines to distinguish enzymes: approached by incorporating wavelet transform. J Theor Biol 2008; 256:625-31. [PMID: 19049810 DOI: 10.1016/j.jtbi.2008.10.026] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/09/2008] [Revised: 09/26/2008] [Accepted: 10/20/2008] [Indexed: 10/21/2022]
Abstract
The enzymatic attributes of newly found protein sequences are usually determined either by biochemical analysis of eukaryotic and prokaryotic genomes or by microarray chips. These experimental methods are both time-consuming and costly. With the explosion of protein sequences registered in the databanks, it is highly desirable to develop an automated method to identify whether a given new sequence belongs to enzyme or non-enzyme. The discrete wavelet transform (DWT) and support vector machine (SVM) have been used in this study for distinguishing enzyme structures from non-enzymes. The networks have been trained and tested on two datasets of proteins with different wavelet basis functions, decomposition scales and hydrophobicity data types. Maximum accuracy has been obtained using SVM with a wavelet function of Bior2.4, a decomposition scale j=5, and Kyte-Doolittle hydrophobicity scales. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, which indicates that the proposed method can be employed as a useful assistant technique for distinguishing enzymes from non-enzymes.
Affiliation(s)
- Jian-Ding Qiu
- Department of Chemistry, Nanchang University, Nanchang 330031, PR China.
14
Holbrook JD, Sanseau P. Drug discovery and computational evolutionary analysis. Drug Discov Today 2007; 12:826-32. [PMID: 17933683 DOI: 10.1016/j.drudis.2007.08.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 05/24/2007] [Revised: 08/31/2007] [Accepted: 08/31/2007] [Indexed: 01/26/2023]
Abstract
Drug discovery remains a difficult business with a very high level of attrition. Many steps in this long process use data generated from various species. One key challenge is to successfully translate the pre-clinical findings of target validation and safety studies in animal models to diverse human beings in the clinic. Advanced computational evolutionary analysis techniques combined with the increasing availability of sequence information enable the application of systematic evolutionary approaches to targets and pathways of interest to drug discovery. These analyses have the potential to increase our understanding of experimental differences observed between species.
Affiliation(s)
- Joanna D Holbrook
- GlaxoSmithKline, Molecular Discovery Research, Bioinformatics Analysis, Stevenage SG1 2NY, United Kingdom
15
Ardawatia H, Liberles DA. A systematic analysis of lineage-specific evolution in metabolic pathways. Gene 2007; 387:67-74. [PMID: 17034962 DOI: 10.1016/j.gene.2006.08.013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 04/13/2006] [Revised: 07/30/2006] [Accepted: 08/10/2006] [Indexed: 12/29/2022]
Abstract
In a search for the lineage-specific evolution of pathways between human, chimpanzee, mouse, and rat, orthologous gene families were generated from genome sequences. For each family, a model-based ratio of nonsynonymous to synonymous nucleotide substitution rates was calculated. Where the free-ratio model of individual ratios on each branch was supported, these families were mapped to two databases of metabolic pathways (KEGG and BioCyc) and the lineage-specific evolution of pathways was evaluated. The most similar pathway evolution was seen between mouse and rat, while the evolutionary pattern between human and chimpanzee was less correlated. Individual pathways in the human lineage were observed to evolve in a faster, lineage-specific manner, including the pathway involving arachidonic acid metabolism (identified through the KEGG analysis) and pyrimidine metabolism (identified through both analyses).
Affiliation(s)
- Himanshu Ardawatia
- Computational Biology Unit, BCCS, University of Bergen, 5020 Bergen, Norway
16
Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics 2006; 6:4023-37. [PMID: 16791826 DOI: 10.1002/pmic.200500938] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Indexed: 01/22/2023]
Abstract
Protein sequence contains clues to its function. Functional prediction from sequence presents a challenge particularly for proteins that have low or no sequence similarity to proteins of known function. Recently, machine learning methods have been explored for predicting functional class of proteins from sequence-derived properties independent of sequence similarity, which showed promising potential for low- and non-homologous proteins. These methods can thus be explored as potential tools to complement alignment- and clustering-based methods for predicting protein function. This article reviews the strategies, current progresses, and underlying difficulties in using machine learning methods for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented, which need to be interpreted with caution as they are dependent on such factors as datasets used and choice of parameters.
Affiliation(s)
- Lianyi Han
- Department of Computational Science, National University of Singapore, Singapore, Singapore
17
Yu X, Cao J, Cai Y, Shi T, Li Y. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol 2006; 240:175-84. [PMID: 16274699 DOI: 10.1016/j.jtbi.2005.09.018] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Received: 02/09/2005] [Revised: 09/09/2005] [Accepted: 09/09/2005] [Indexed: 11/18/2022]
Abstract
In the post-genome era, the prediction of protein function is one of the most demanding tasks in the study of bioinformatics. Machine learning methods, such as the support vector machines (SVMs), greatly help to improve the classification of protein function. In this work, we integrated SVMs, protein sequence amino acid composition, and associated physicochemical properties into the study of nucleic-acid-binding proteins prediction. We developed the binary classifications for rRNA-, RNA-, DNA-binding proteins that play an important role in the control of many cell processes. Each SVM predicts whether a protein belongs to rRNA-, RNA-, or DNA-binding protein class. Self-consistency and jackknife tests were performed on the protein data sets in which the sequences identity was < 25%. Test results show that the accuracies of rRNA-, RNA-, DNA-binding SVMs predictions are approximately 84%, approximately 78%, approximately 72%, respectively. The predictions were also performed on the ambiguous and negative data set. The results demonstrate that the predicted scores of proteins in the ambiguous data set by RNA- and DNA-binding SVM models were distributed around zero, while most proteins in the negative data set were predicted as negative scores by all three SVMs. The score distributions agree well with the prior knowledge of those proteins and show the effectiveness of sequence associated physicochemical properties in the protein function prediction. The software is available from the author upon request.
Affiliation(s)
- Xiaojing Yu
- Bioinformatics Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Graduate School of the Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, PR China
18
Li T, Chamberlin SG, Caraco MD, Liberles DA, Gaucher EA, Benner SA. Analysis of transitions at two-fold redundant sites in mammalian genomes. Transition redundant approach-to-equilibrium (TREx) distance metrics. BMC Evol Biol 2006; 6:25. [PMID: 16545144 PMCID: PMC1435776 DOI: 10.1186/1471-2148-6-25] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/03/2005] [Accepted: 03/20/2006] [Indexed: 11/10/2022] Open
Abstract
Background:
The exchange of nucleotides at synonymous sites in a gene encoding a protein is believed to have little impact on the fitness of a host organism. This should be especially true for synonymous transitions, where a pyrimidine nucleotide is replaced by another pyrimidine, or a purine is replaced by another purine. This suggests that transition redundant exchange (TREx) processes at the third position of conserved two-fold codon systems might offer the best approximation for a neutral molecular clock, serving to examine, within coding regions, theories that require neutrality, determine whether transition rate constants differ within genes in a single lineage, and correlate dates of events recorded in genomes with dates in the geological and paleontological records. To date, TREx analysis of the yeast genome has recognized correlated duplications that established new metabolic strategies in fungi, and supported analyses of functional change in aromatases in pigs. TREx dating has limitations, however. Multiple transitions at synonymous sites may cause equilibration and loss of information. Further, to be useful to correlate events in the genomic record, different genes within a genome must suffer transitions at similar rates.
Results:
A formalism to analyze divergence at two-fold redundant codon systems is presented. This formalism exploits two-state approach-to-equilibrium kinetics from chemistry. This formalism captures, in a single equation, the possibility of multiple substitutions at individual sites, avoiding any need to "correct" for these. The formalism also connects specific rate constants for transitions to specific approximations in an underlying evolutionary model, including assumptions that transition rate constants are invariant at different sites, in different genes, in different lineages, and at different times. Therefore, the formalism supports analyses that evaluate these approximations. Transitions at synonymous sites within two-fold redundant coding systems were examined in the mouse, rat, and human genomes. The key metric (f2), the fraction of those sites that holds the same nucleotide, was measured for putative ortholog pairs. A transition redundant exchange (TREx) distance was calculated from f2 for these pairs. Pyrimidine-pyrimidine transitions at these sites occur approximately 14% faster than purine-purine transitions in various lineages. Transition rate constants were similar in different genes within the same lineages; within a set of orthologs, the f2 distribution is only modestly overdispersed. No correlation between disparity and overdispersion is observed. In rodents, evidence was found for greater conservation of TREx sites in genes on the X chromosome, accounting for a small part of the overdispersion, however.
Conclusion:
The TREx metric is useful to analyze the history of transition rate constants within these mammals over the past 100 million years. The TREx metric estimates the extent to which silent nucleotide substitutions accumulate in different genes, on different chromosomes, with different compositions, in different lineages, and at different times.
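The f2 metric described in this abstract, the fraction of two-fold redundant sites that hold the same nucleotide in an ortholog pair, can be sketched as follows. This is an illustrative reading of the abstract only: the set of codon families, the requirement that both codons fall in the same family, and the function names are assumptions, and the TREx distance itself is not computed because the abstract does not give its equation.

```python
# Two-fold redundant codon families in the standard genetic code, written as
# the first two bases plus the class of the third base
# (Y = pyrimidine T/C, R = purine A/G). Assumed site definition.
TWO_FOLD_FAMILIES = {
    "TTY",  # Phe
    "TAY",  # Tyr
    "CAY",  # His
    "CAR",  # Gln
    "AAY",  # Asn
    "AAR",  # Lys
    "GAY",  # Asp
    "GAR",  # Glu
    "TGY",  # Cys
}


def codon_family(codon):
    """Return the two-fold family of a codon, or None if it has none."""
    third = codon[2]
    if third in "TC":
        cls = "Y"
    elif third in "AG":
        cls = "R"
    else:
        return None
    family = codon[:2] + cls
    return family if family in TWO_FOLD_FAMILIES else None


def f2(seq_a, seq_b):
    """Fraction of conserved two-fold sites with an identical third base.

    Both inputs are in-frame, aligned coding sequences (DNA alphabet).
    Returns None when the pair shares no conserved two-fold codon.
    """
    same = total = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        ca, cb = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        fam = codon_family(ca)
        if fam is None or fam != codon_family(cb):
            continue  # not a conserved two-fold redundant site
        total += 1
        same += ca[2] == cb[2]
    return same / total if total else None


# Made-up sequences for illustration: four conserved two-fold codons,
# one of which keeps the same third base.
print(f2("TTTGAAAAACAT", "TTCGAAAAGCAC"))  # 0.25
```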
Affiliation(s)
- Tang Li
- Foundation for Applied Molecular Evolution, Gainesville FL 32604, USA
- M Daniel Caraco
- Foundation for Applied Molecular Evolution, Gainesville FL 32604, USA
- David A Liberles
- Department of Molecular Biology, University of Wyoming, Laramie, WY 82071, USA
- Eric A Gaucher
- Foundation for Applied Molecular Evolution, Gainesville FL 32604, USA
- Steven A Benner
- Foundation for Applied Molecular Evolution, Gainesville FL 32604, USA
|
19
|
Gaucher EA, De Kee DW, Benner SA. Application of DETECTER, an evolutionary genomic tool to analyze genetic variation, to the cystic fibrosis gene family. BMC Genomics 2006; 7:44. [PMID: 16522197 PMCID: PMC1420294 DOI: 10.1186/1471-2164-7-44] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 12/08/2005] [Accepted: 03/07/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The medical community requires computational tools that distinguish missense genetic differences having phenotypic impact within the vast number of sense mutations that do not. Tools that do this will become increasingly important for those seeking to use human genome sequence data to predict disease, make prognoses, and customize therapy to individual patients. RESULTS An approach, termed DETECTER, is proposed to identify sites in a protein sequence where amino acid replacements are likely to have a significant effect on phenotype, including causing genetic disease. This approach uses a model-dependent tool to estimate the normalized replacement rate at individual sites in a protein sequence, based on a history of those sites extracted from an evolutionary analysis of the corresponding protein family. This tool identifies sites that have higher-than-average, average, or lower-than-average rates of change in the lineage leading to the sequence in the population of interest. The rates are then combined with sequence data to determine the likelihoods that particular amino acids were present at individual sites in the evolutionary history of the gene family. These likelihoods are used to predict whether any specific amino acid replacements, if introduced at the site in a modern human population, would have a significant impact on fitness. The DETECTER tool is used to analyze the cystic fibrosis transmembrane conductance regulator (CFTR) gene family. CONCLUSION In this system, DETECTER retrodicts amino acid replacements associated with the cystic fibrosis disease with greater accuracy than alternative approaches. While this result validates this approach for this particular family of proteins only, the approach may be applicable to the analysis of polymorphisms generally, including SNPs in a human population.
Affiliation(s)
- Eric A Gaucher
- Foundation for Applied Molecular Evolution, Gainesville, FL USA
- Danny W De Kee
- Foundation for Applied Molecular Evolution, Gainesville, FL USA
- Steven A Benner
- Department of Chemistry, University of Florida, Gainesville, FL USA
|
20
|
Bradley ME, Benner SA. Integrating protein structures and precomputed genealogies in the Magnum database: examples with cellular retinoid binding proteins. BMC Bioinformatics 2006; 7:89. [PMID: 16504077 PMCID: PMC1475641 DOI: 10.1186/1471-2105-7-89] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/19/2005] [Accepted: 02/23/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND When accurate models for the divergent evolution of protein sequences are integrated with complementary biological information, such as folded protein structures, analyses of the combined data often lead to new hypotheses about molecular physiology. This represents an excellent example of how bioinformatics can be used to guide experimental research. However, progress in this direction has been slowed by the lack of a publicly available resource suitable for general use. RESULTS The precomputed Magnum database offers a solution to this problem for ca. 1,800 full-length protein families with at least one crystal structure. The Magnum deliverables include 1) multiple sequence alignments, 2) mapping of alignment sites to crystal structure sites, 3) phylogenetic trees, 4) inferred ancestral sequences at internal tree nodes, and 5) amino acid replacements along tree branches. Comprehensive evaluations revealed that the automated procedures used to construct Magnum produced accurate models of how proteins divergently evolve, or genealogies, and correctly integrated these with the structural data. To demonstrate Magnum's capabilities, we asked for amino acid replacements requiring three nucleotide substitutions, located at internal protein structure sites, and occurring on short phylogenetic tree branches. In the cellular retinoid binding protein family a site that potentially modulates ligand binding affinity was discovered. Recruitment of cellular retinol binding protein to function as a lens crystallin in the diurnal gecko afforded another opportunity to showcase the predictive value of a browsable database containing branch replacement patterns integrated with protein structures. CONCLUSION We integrated two areas of protein science, evolution and structure, on a large scale and created a precomputed database, known as Magnum, which is the first freely available resource of its kind. 
Magnum provides evolutionary and structural bioinformatics resources that are useful for identifying experimentally testable hypotheses about the molecular basis of protein behaviors and functions, as illustrated with the examples from the cellular retinoid binding proteins.
Affiliation(s)
- Michael E Bradley
- Department of Chemistry, University of Florida, P.O. Box 117200, Gainesville, FL, 32611, USA
- Division of Biological Sciences, Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, IL, 60615, USA
- Steven A Benner
- Foundation for Applied Molecular Evolution, 1115 NW 14th Avenue, Gainesville, FL, 32601, USA
|
21
|
Cui J, Han LY, Cai CZ, Zheng CJ, Ji ZL, Chen YZ. Prediction of functional class of novel bacterial proteins without the use of sequence similarity by a statistical learning method. J Mol Microbiol Biotechnol 2006; 9:86-100. [PMID: 16319498 DOI: 10.1159/000088839] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022] Open
Abstract
A substantial percentage of the putative protein-encoding open reading frames (ORFs) in bacterial genomes have no homolog of known function, and their function cannot be confidently assigned on the basis of sequence similarity. Methods not based on sequence similarity are needed and are being developed. One method, SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi), predicts protein functional family irrespective of sequence similarity (Nucleic Acids Res. 2003;31:3692-3697). While it has been tested on a large number of proteins, its capability for non-homologous proteins has so far been evaluated for a relatively small number of proteins, and additional tests are needed to more fully assess SVMProt. In this work, 90 novel bacterial proteins (non-homologous to known proteins) are used to evaluate the capability of SVMProt. These proteins are such that none of their homologs are in the Swiss-Prot database, their functions are not clearly described in the literature, and they themselves and their homologs are not included in the training sets of SVMProt. They represent proteins whose function cannot be confidently predicted by sequence similarity methods at present. The predicted functional class of 76.7% of these proteins shows various levels of consistency with the literature-described function, compared to the overall accuracy of 87% for the SVMProt functional class assignment of 34,582 proteins that have at least one homolog of known function. Our study suggests that SVMProt is capable of assigning functional class for novel bacterial proteins at a level not too much lower than that of sequence alignment methods for homologous proteins.
Affiliation(s)
- J Cui
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Singapore
|
22
|
Fuellen G, Spitzer M, Cullen P, Lorkowski S. Correspondence of function and phylogeny of ABC proteins based on an automated analysis of 20 model protein data sets. Proteins 2005; 61:888-99. [PMID: 16254912 DOI: 10.1002/prot.20616] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/11/2022]
Abstract
Using our BLAST-based procedure RiPE (Retrieval-induced Phylogeny Environment), which automates the evolutionary analysis of a protein family, we assembled a set of 1138 ABC protein components [adenosine triphosphate (ATP)-binding cassette and transmembrane domain] from the protein data sets of 20 model organisms and subjected them to phylogenetic and functional analysis. For maximum speed, we based the alignment directly on a homology search with a profile of all known human ABC proteins and used neighbor-joining tree estimation. All but 11 sequences from Homo sapiens, Arabidopsis thaliana, Drosophila melanogaster, and Saccharomyces cerevisiae were placed into the correct subtree/subfamily, reproducing published classifications of the individual organisms. By following a simple "function transfer rule", our comparative phylogenetic analysis successfully predicted the known function of human ABC proteins in 19 of 22 cases. Three functional predictions did not correspond, and 10 were novel. Predictions based on BLAST alone were inferior in five cases and superior in two. Bacterial sequences were placed close to the root of most subtrees. This placement coincides with domain architecture, suggesting an early diversification of the ABC family before the kingdoms split apart. Our approach can, in principle, be used to annotate any protein family of any organism included in the study.
Affiliation(s)
- Georg Fuellen
- Department of Medicine, AG Bioinformatics, University of Münster, Münster, Germany.
|
23
|
Berthold CL, Moussatche P, Richards NGJ, Lindqvist Y. Structural basis for activation of the thiamin diphosphate-dependent enzyme oxalyl-CoA decarboxylase by adenosine diphosphate. J Biol Chem 2005; 280:41645-54. [PMID: 16216870 DOI: 10.1074/jbc.m509921200] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.2] [Indexed: 11/06/2022] Open
Abstract
Oxalyl-coenzyme A decarboxylase is a thiamin diphosphate-dependent enzyme that plays an important role in the catabolism of the highly toxic compound oxalate. We have determined the crystal structure of the enzyme from Oxalobacter formigenes from a hemihedrally twinned crystal to 1.73 Å resolution and characterized the steady-state kinetic behavior of the decarboxylase. The monomer of the tetrameric enzyme consists of three alpha/beta-type domains, commonly seen in this class of enzymes, and the thiamin diphosphate-binding site is located at the expected subunit-subunit interface between two of the domains with the cofactor bound in the conserved V-conformation. Although oxalyl-CoA decarboxylase is structurally homologous to acetohydroxyacid synthase, a molecule of ADP is bound in a region that is cognate to the FAD-binding site observed in acetohydroxyacid synthase and presumably fulfils a similar role in stabilizing the protein structure. This difference between the two enzymes may have physiological importance since oxalyl-CoA decarboxylation is an essential step in ATP generation in O. formigenes, and the decarboxylase activity is stimulated by exogenous ADP. Despite the significant degree of structural conservation between the two homologous enzymes and the similarity in catalytic mechanism to other thiamin diphosphate-dependent enzymes, the active site residues of oxalyl-CoA decarboxylase are unique. A suggestion for the reaction mechanism of the enzyme is presented.
Affiliation(s)
- Catrine L Berthold
- Molecular Structural Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, S-17177 Stockholm, Sweden
|
24
|
Han LY, Zheng CJ, Lin HH, Cui J, Li H, Zhang HL, Tang ZQ, Chen YZ. Prediction of functional class of novel plant proteins by a statistical learning method. The New Phytologist 2005; 168:109-21. [PMID: 16159326 DOI: 10.1111/j.1469-8137.2005.01482.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/04/2023]
Abstract
In plant genomes, the function of a substantial percentage of the putative protein-coding open reading frames (ORFs) is unknown. These ORFs have no significant sequence similarity to known proteins, which complicates the task of functional study of these proteins. Efforts are being made to explore methods that are complementary to, or may be used in combination with, sequence alignment and clustering methods. A web-based protein functional class prediction software, SVMProt, has shown some capability for predicting functional class of distantly related proteins. Here the usefulness of SVMProt for functional study of novel plant proteins is evaluated. To test SVMProt, 49 plant proteins (without a sequence homolog in the Swiss-Prot protein database, not in the SVMProt training set, and with functional indications provided in the literature) were selected from a comprehensive search of MEDLINE abstracts and Swiss-Prot databases in 1999-2004. These represent unique proteins the function of which, at present, cannot be confidently predicted by sequence alignment and clustering methods. The predicted functional class of 31 proteins was consistent, and that of four other proteins was weakly consistent, with published functions. Overall, the functional class of 71.4% of these proteins was consistent, or weakly consistent, with functional indications described in the literature. SVMProt shows a certain level of ability to provide useful hints about the functions of novel plant proteins with no similarity to known proteins.
Affiliation(s)
- L Y Han
- Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543
|
25
|
Minshull J, Ness JE, Gustafsson C, Govindarajan S. Predicting enzyme function from protein sequence. Curr Opin Chem Biol 2005; 9:202-9. [PMID: 15811806 DOI: 10.1016/j.cbpa.2005.02.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 11/30/2022]
Abstract
There are two main reasons to try to predict an enzyme's function from its sequence. The first is to identify the components, and thus the functional capabilities, of an organism; the second is to create enzymes with specific properties. Genomics, expression analysis, proteomics and metabonomics are largely directed towards understanding how information flows from DNA sequence to protein functions within an organism. This review focuses on information flow in the opposite direction: the applicability of what is being learned from natural enzymes to improve methods for catalyst design.
|
26
|
Roth C, Betts MJ, Steffansson P, Saelensminde G, Liberles DA. The Adaptive Evolution Database (TAED): a phylogeny based tool for comparative genomics. Nucleic Acids Res 2005; 33:D495-7. [PMID: 15608245 PMCID: PMC540044 DOI: 10.1093/nar/gki090] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.8] [Indexed: 11/18/2022] Open
Abstract
From 138 662 embryophyte (higher plant) and 348 142 chordate genes, 4216 embryophyte and 15 452 chordate gene families were generated. For each of these gene families, multiple sequence alignments, phylogenetic trees, ratios of non-synonymous to synonymous nucleotide substitution rates (Ka/Ks), mappings from gene trees to the NCBI taxonomy and structural links to solved three-dimensional protein structures in the Protein Data Bank (PDB) with Grantham-weighted mutational factors were all calculated. Of the ‘gene family trees’, 173 embryophyte and 505 chordate branches show Ka/Ks ≫ 1 and are candidates for functional adaptation. The calculated information is available both as a gene family database and as a phylogenetically indexed resource, called ‘The Adaptive Evolution Database’ (TAED), available at http://www.bioinfo.no/tools/TAED.
Affiliation(s)
- Christian Roth
- Computational Biology Unit, BCCS, University of Bergen, 5020 Bergen, Norway
|
27
|
Han L, Cai C, Ji Z, Chen Y. Prediction of functional class of novel viral proteins by a statistical learning method irrespective of sequence similarity. Virology 2005; 331:136-43. [PMID: 15582660 PMCID: PMC7111859 DOI: 10.1016/j.virol.2004.10.020] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Received: 07/17/2004] [Revised: 09/15/2004] [Accepted: 10/09/2004] [Indexed: 11/19/2022]
Abstract
The function of a substantial percentage of the putative protein-coding open reading frames (ORFs) in viral genomes is unknown. As their sequence is not similar to that of proteins of known function, the function of these ORFs cannot be assigned on the basis of sequence similarity. Methods that complement, or can be used in combination with, sequence similarity-based approaches are being explored. The web-based software SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) to some extent assigns protein functional family irrespective of sequence similarity and has been found to be useful for studying distantly related proteins [Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13): 3692–3697]. Here 25 novel viral proteins are selected to test the capability of SVMProt for functional family assignment of viral proteins whose function cannot be confidently predicted by sequence similarity methods at present. These proteins are without a sequence homolog in the Swissprot database, have their precise function provided in the literature, and are not included in the training sets of SVMProt. The predicted functional classes of 72% of these proteins match the literature-described function, compared to the overall accuracy of 87% for SVMProt functional class assignment of 34 582 proteins. This suggests that SVMProt to some extent is capable of functional class assignment irrespective of sequence similarity and that it is potentially useful for facilitating functional study of novel viral proteins.
Affiliation(s)
- L.Y. Han
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Block SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- C.Z. Cai
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Block SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- Department of Applied Physics, Chongqing University, Chongqing 400044, PR China
- Z.L. Ji
- Department of Biology, School of Life Sciences, Xiamen University, Xiamen 361000, FuJian Province, PR China
- Y.Z. Chen
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Block SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- Corresponding author. Fax: +65 6774 6756.
|
28
|
Chang MSS, Benner SA. Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J Mol Biol 2004; 341:617-31. [PMID: 15276848 DOI: 10.1016/j.jmb.2004.05.045] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.7] [Received: 11/10/2003] [Revised: 05/17/2004] [Accepted: 05/24/2004] [Indexed: 10/26/2022]
Abstract
To understand how protein segments are inserted and deleted during divergent evolution, a set of pairwise alignments containing exactly one gap, and therefore arising from the first insertion-deletion (indel) event in the time separating the homologs, was examined. The alignments showed that "structure breaking" amino acids (PGDNS) were preferred within and flanking gapped regions, as were two residues with hydrophilic side-chains (QE) that frequently occur at the surface of protein folds. Conversely, hydrophobic residues (FMILYVW) occur infrequently within and flanking the gapped region. These preferences are modestly different in protein pairs separated by an episode of adaptive evolution than in pairs diverging under strong functional constraints. Surprisingly, regions near an indel have not evolved more rapidly than the sequence pair overall, showing no evidence that an indel event must be compensated by local amino acid replacement. The gap-lengths are best approximated by a Zipfian distribution, with the probability of a gap of length L decreasing as a function of L^(-1.8). These features are largely independent of the length of the gap and the extent of divergence (measured by both silent and non-silent sequence changes) separating the two proteins. Surprisingly, amino acid repeats were discovered in more than a third of the polypeptide segments in and around the gap. These correspond to repeats in the DNA sequence. This suggests that a signature of the mechanism by which indels occur in the DNA sequence remains in the encoded protein sequences. These data suggest specific tools to score gap placement in an alignment. They also suggest tools that distinguish true indels from gaps created by mistaken gene finding, including under-predicted and over-predicted introns. 
By providing mechanisms to identify errors, the tools will enhance the value of genome sequence databases in support of integrated paleogenomics strategies used to extract functional information in a post-genomic environment.
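The Zipfian gap-length result in the abstract above can be turned into a length-dependent gap score. The sketch below normalizes L^(-1.8) over a finite range and returns a log-probability. The exponent 1.8 comes from the abstract; the truncation at `max_len`, the function names, and the use of a natural-log score are assumptions made for illustration and are not the authors' scoring scheme.

```python
import math
from typing import List


def zipf_gap_pmf(max_len: int, exponent: float = 1.8) -> List[float]:
    """Probability of each gap length 1..max_len under P(L) ~ L**(-exponent).

    The distribution is truncated at max_len (an assumption) so that it
    can be normalized with a finite sum.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    weights = [length ** -exponent for length in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gap_log_score(length: int, max_len: int = 100, exponent: float = 1.8) -> float:
    """Natural-log probability of a gap of the given length (hypothetical score)."""
    if not 1 <= length <= max_len:
        raise ValueError("length must lie in [1, max_len]")
    return math.log(zipf_gap_pmf(max_len, exponent)[length - 1])


if __name__ == "__main__":
    for length in (1, 2, 5, 10, 50):
        print(f"L = {length:2d}  log P = {gap_log_score(length):.3f}")
```

Under a power law the penalty grows with the logarithm of gap length rather than linearly, which is the practical difference from a conventional affine gap cost.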
Affiliation(s)
- Mike S S Chang
- Foundation for Applied Molecular Evolution, Gainesville, FL 32601, USA
|
29
|
Abstract
Background Joining a model for the molecular evolution of a protein family to the paleontological and geological records (geobiology), and then to the chemical structures of substrates, products, and protein folds, is emerging as a broad strategy for generating hypotheses concerning function in a post-genomic world. This strategy expands systems biology to a planetary context, necessary for a notion of fitness to underlie (as it must) any discussion of function within a biomolecular system. Results Here, we report an example of such an expansion, where tools from planetary biology were used to analyze three genes from the pig Sus scrofa that encode cytochrome P450 aromatases–enzymes that convert androgens into estrogens. The evolutionary history of the vertebrate aromatase gene family was reconstructed. Transition redundant exchange silent substitution metrics were used to interpolate dates for the divergence of family members, the paleontological record was consulted to identify changes in physiology that correlated in time with the change in molecular behavior, and new aromatase sequences from peccary were obtained. Metrics that detect changing function in proteins were then applied, including KA/KS values and those that exploit structural biology. These identified specific amino acid replacements that were associated with changing substrate and product specificity during the time of presumed adaptive change. The combined analysis suggests that aromatase paralogs arose in pigs as a result of selection for Suoidea with larger litters than their ancestors, and permitted the Suoidea to survive the global climatic trauma that began in the Eocene. Conclusions This combination of bioinformatics analysis, molecular evolution, paleontology, cladistics, global climatology, structural biology, and organic chemistry serves as a paradigm in planetary biology. 
As the geological, paleontological, and genomic records improve, this approach should become widely useful to make systems biology statements about high-level function for biomolecular systems.
|
30
|
Abstract
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, the support vector machine (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. The SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100% respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classification of distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
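The abstract above reports per-family performance as accuracy and as a Matthews correlation coefficient (MCC). For reference, a minimal sketch of the standard MCC computation from a binary confusion matrix follows. The function name `mcc` and the example counts are illustrative and are not taken from the study; the abstract quotes MCC as a percentage, which corresponds to the value below multiplied by 100.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives. Returns a value in [-1, 1];
    by convention 0.0 is returned when any marginal total is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for one family-vs-rest classifier.
    print(f"MCC = {mcc(tp=90, tn=950, fp=50, fn=10):.3f}")
```

MCC is informative here because each family is tested against a much larger pool of non-members, a setting in which accuracy alone can look high even for a weak classifier.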
Affiliation(s)
- C Z Cai
- Department of Applied Physics, Chongqing University, Chongqing, Peoples Republic of China
|
31
|
Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 2003; 31:3692-7. [PMID: 12824396 PMCID: PMC169006 DOI: 10.1093/nar/gkg600] [Citation(s) in RCA: 355] [Impact Index Per Article: 16.9] [Indexed: 11/13/2022] Open
Abstract
Prediction of protein function is of significance in studying biological processes. One approach for function prediction is to classify a protein into a functional family. The support vector machine (SVM) is a useful method for such classification, which may involve proteins with diverse sequence distribution. We have developed web-based software, SVMProt, for SVM classification of a protein into a functional family from its primary sequence. The SVMProt classification system is trained from representative proteins of a number of functional families and seed proteins of Pfam curated protein families. It currently covers 54 functional families and additional families will be added in the near future. The computed accuracy for protein family classification is found to be in the range of 69.1-99.6%. SVMProt shows a certain degree of capability for the classification of distantly related proteins and homologous proteins of different function and thus may be used as a protein function prediction tool that complements sequence alignment methods. SVMProt can be accessed at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
Affiliation(s)
- C Z Cai
- Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
|
32
|
Affiliation(s)
- David B Searls
- Bioinformatics Division, Genetics Research, GlaxoSmithKline Pharmaceuticals, 709 Swedeland Road, P.O. Box 1539, King of Prussia, Pennsylvania 19406, USA.
|
33
|
|
34
|
Fukami-Kobayashi K, Schreiber DR, Benner SA. Detecting compensatory covariation signals in protein evolution using reconstructed ancestral sequences. J Mol Biol 2002; 319:729-43. [PMID: 12054866 DOI: 10.1016/s0022-2836(02)00239-5] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Indexed: 11/23/2022]
Abstract
When protein sequences divergently evolve under functional constraints, some individual amino acid replacements that reverse the charge (e.g. Lys to Asp) may be compensated by a replacement at a second position that reverses the charge in the opposite direction (e.g. Glu to Arg). When these side-chains are near in space (proximal), such double replacements might be driven by natural selection, if either is selectively disadvantageous, but both together restore fully the ability of the protein to contribute to fitness (are together "neutral"). Accordingly, many have sought to identify pairs of positions in a protein sequence that suffer compensatory replacements, often as a way to identify positions near in space in the folded structure. A "charge compensatory signal" might manifest itself in two ways. First, proximal charge compensatory replacements may occur more frequently than predicted from the product of the probabilities of individual positions suffering charge reversing replacements independently. Conversely, charge compensatory pairs of changes may be observed to occur more frequently in proximal pairs of sites than in the average pair. Normally, charge compensatory covariation is detected by comparing the sequences of extant proteins at the "leaves" of phylogenetic trees. We show here that the charge compensatory signal is more evident when it is sought by examining individual branches in the tree between reconstructed ancestral sequences at nodes in the tree. Here, we find that the signal is especially strong when the position pairs are in a single secondary structural unit (e.g. alpha helix or beta strand) that brings the side-chains suffering charge compensatory covariation near in space, and may be useful in secondary structure prediction. 
Also, "node-node" and "node-leaf" compensatory covariation may be useful to identify the better of two equally parsimonious trees, in a way that is independent of the mathematical formalism used to construct the tree itself. Further, compensatory covariation may provide a signal that indicates whether an episode of sequence evolution contains more or less divergence in functional behavior. Compensatory covariation analysis on reconstructed evolutionary trees may become a valuable tool to analyze genome sequences, and use these analyses to extract biomedically useful information from proteome databases.
Affiliation(s)
- K Fukami-Kobayashi
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima 411-8540, Japan
|
35
|
Benner SA, Caraco MD, Thomson JM, Gaucher EA. Planetary biology--paleontological, geological, and molecular histories of life. Science 2002; 296:864-8. [PMID: 11988562 DOI: 10.1126/science.1069863] [Citation(s) in RCA: 62] [Impact Index Per Article: 2.8] [Indexed: 11/02/2022]
Abstract
The history of life on Earth is chronicled in the geological strata, the fossil record, and the genomes of contemporary organisms. When examined together, these records help identify metabolic and regulatory pathways, annotate protein sequences, and identify animal models to develop new drugs, among other features of scientific and biomedical interest. Together, planetary analysis of genome and proteome databases is providing an enhanced understanding of how life interacts with the biosphere and adapts to global change.
Affiliation(s)
- Steven A Benner
- Department of Chemistry, University of Florida, Gainesville FL, 32611-7200, USA.
|
36
|
Liberles DA. Evaluation of methods for determination of a reconstructed history of gene sequence evolution. Mol Biol Evol 2001; 18:2040-7. [PMID: 11606700 DOI: 10.1093/oxfordjournals.molbev.a003745] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022] Open
Abstract
With whole-genome sequences being completed at an increasing rate, it is important to develop and assess tools to analyze them. Following annotation of the protein content of a genome, one can compare sequences with previously characterized homologous genes to detect novel functions within specific proteins in the evolution of the newly sequenced genome. One common statistical method to detect such changes is to compare the ratios of nonsynonymous (K(a)) to synonymous (K(s)) nucleotide substitution rates. Here, the effects of several parameters that can influence this calculation (sequence reconstruction method, phylogenetic tree branch length weighting, GC content, and codon bias) are examined. Also, two new alternative measures of adaptive evolution, the point accepted mutations (PAM)/neutral evolutionary distance (NED) ratio and the sequence space assessment (SSA) statistic are presented. All of these methods are compared using two sequence families: the recent divergence of leptin orthologs in primates, and the more ancient divergence of the deoxyribonucleoside kinase family. The examination of these and other measures to detect changes of gene function along branches of a phylogenetic tree will become increasingly important in the postgenomic era.
Affiliation(s)
- D A Liberles
- Department of Biochemistry and Biophysics and Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden.
|
37
|
Peltier JB, Ytterberg J, Liberles DA, Roepstorff P, van Wijk KJ. Identification of a 350-kDa ClpP protease complex with 10 different Clp isoforms in chloroplasts of Arabidopsis thaliana. J Biol Chem 2001; 276:16318-27. [PMID: 11278690 DOI: 10.1074/jbc.m010503200] [Citation(s) in RCA: 96] [Impact Index Per Article: 4.2] [Indexed: 11/06/2022] Open
Abstract
A 350-kDa ClpP protease complex with 10 different subunits was identified in chloroplasts of Arabidopsis thaliana, using Blue-Native gel electrophoresis, followed by matrix-assisted laser desorption ionization time-of-flight and nano-electrospray tandem mass spectrometry. The complex was copurified with the thylakoid membranes, and all identified Clp subunits show chloroplast targeting signals, supporting the conclusion that this complex is indeed localized in the chloroplast. The complex contains chloroplast-encoded pClpP and six nuclear-encoded proteins nClpP1-6, as well as two unassigned Clp homologues (nClpP7, nClpP8). An additional Clp protein was identified in this complex; it does not belong to any of the known Clp gene families and is here assigned ClpS1. Expression and accumulation of several of these Clp proteins had not been shown previously. Sequence and phylogenetic tree analysis suggests that nClpP5, nClpP2, and nClpP8 are not catalytically active and form a new group of Clp higher plant proteins, orthologous to the cyanobacterial ClpR protein, and are renamed ClpR1, -2, and -3, respectively. We speculate that ClpR1, -2, and -3 are part of the heptameric rings, whereas ClpS1 is a regulatory subunit positioned at the axial opening of the ClpP/R core. Several truncations and errors in intron and exon prediction of the annotated Clp genes were corrected using mass spectrometry data and by matching genomic sequences with cDNA sequences. This strategy will be widely applicable for the much needed verification of protein prediction from genomic sequence. The extreme complexity of the chloroplast Clp complex is discussed.
Affiliation(s)
- J B Peltier
- Department of Biochemistry, Arrhenius Laboratories, Stockholm University, S-10691 Stockholm, Sweden
|
38
|
Gaucher EA, Miyamoto MM, Benner SA. Function-structure analysis of proteins using covarion-based evolutionary approaches: Elongation factors. Proc Natl Acad Sci U S A 2001; 98:548-52. [PMID: 11209054 PMCID: PMC14624 DOI: 10.1073/pnas.98.2.548] [Citation(s) in RCA: 83] [Impact Index Per Article: 3.6] [Indexed: 11/18/2022] Open
Abstract
The divergent evolution of protein sequences from genomic databases can be analyzed by the use of different mathematical models. The most common treat all sites in a protein sequence as equally variable. More sophisticated models acknowledge the fact that purifying selection generally tolerates variable amounts of amino acid replacement at different positions in a protein sequence. In their "stationary" versions, such models assume that the replacement rate at individual positions remains constant throughout evolutionary history. "Nonstationary" covarion versions, however, allow the replacement rate at a position to vary in different branches of the evolutionary tree. Recently, statistical methods have been developed that highlight this type of variation in replacement rates. Here, we show how positions that have variable rates of divergence in different regions of a tree ("covarion behavior"), coupled with analyses of experimental three-dimensional structures, can provide experimentally testable hypotheses that relate individual amino acid residues to specific functional differences in those branches. We illustrate this in the elongation factor family of proteins as a paradigm for applications of this type of analysis in functional genomics generally.
Affiliation(s)
- E A Gaucher
- Department of Chemistry and Molecular Cell Biology Program, College of Medicine, University of Florida, Gainesville, FL 32611-7200, USA.
|
39
|
Liberles DA, Schreiber DR, Govindarajan S, Chamberlin SG, Benner SA. The adaptive evolution database (TAED). Genome Biol 2001; 2:RESEARCH0028. [PMID: 11532212 PMCID: PMC55325 DOI: 10.1186/gb-2001-2-8-research0028] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.4] [Received: 03/07/2001] [Revised: 05/21/2001] [Accepted: 06/06/2001] [Indexed: 11/24/2022] Open
Abstract
Background:
The Master Catalog is a collection of evolutionary families, including multiple sequence alignments, phylogenetic trees and reconstructed ancestral sequences, for all protein-sequence modules encoded by genes in GenBank. It can therefore support large-scale genomic surveys, one of which, The Adaptive Evolution Database (TAED), we present here. In TAED, potential examples of positive adaptation are identified by high values for the normalized ratio of nonsynonymous to synonymous nucleotide substitution rates (KA/KS values) on branches of an evolutionary tree between nodes representing reconstructed ancestral sequences.
Results:
Evolutionary trees and reconstructed ancestral sequences were extracted from the Master Catalog for every subtree containing proteins from the Chordata only or the Embryophyta only. Branches with high KA/KS values were identified. These represent candidate episodes in the history of the protein family when the protein may have undergone positive selection, where the mutant form conferred more fitness than the ancestral form. Such episodes are frequently associated with change in function. An unexpectedly large number of families (between 10% and 20% of those families examined) were found to have at least one branch with high KA/KS values above arbitrarily chosen cut-offs (1 and 0.6). Most of these survived a robustness test and were collected into TAED.
Conclusions:
TAED is a raw resource for bioinformaticists interested in data mining and for experimental evolutionists seeking candidate examples of adaptive evolution for further experimental study. It can be expanded to include other evolutionary information (for example changes in gene regulation or splicing) placed in a phylogenetic perspective.
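The cut-off step mentioned in this abstract can be sketched as follows. This is a minimal illustration and not the TAED pipeline: the function name, the input tuple layout, and the example values are hypothetical, and a plain KA/KS quotient is used because the abstract does not describe how the ratio is normalized or how the robustness test works.

```python
def flag_branches(branches, cutoff=1.0):
    """Return (branch_name, ratio) for branches whose KA/KS exceeds the cutoff.

    `branches` is an iterable of (name, ka, ks) tuples. Branches with
    ks <= 0 are skipped because the ratio is undefined there.
    """
    flagged = []
    for name, ka, ks in branches:
        if ks <= 0:
            continue  # no synonymous change recorded; ratio cannot be computed
        ratio = ka / ks
        if ratio > cutoff:
            flagged.append((name, ratio))
    return flagged


# Hypothetical branches between reconstructed ancestral nodes.
example = [
    ("node1-node2", 0.30, 0.20),
    ("node2-node3", 0.07, 0.10),
    ("node2-leafA", 0.01, 0.10),
    ("node3-leafB", 0.05, 0.00),
]

for cutoff in (1.0, 0.6):  # the two cut-offs named in the abstract
    names = [name for name, _ in flag_branches(example, cutoff)]
    print(cutoff, names)
```

Lowering the cut-off from 1 to 0.6 widens the candidate set, which is why any flagged branch would still need a separate robustness check before being treated as an episode of adaptive evolution.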
Affiliation(s)
- D A Liberles
- Departments of Chemistry, University of Florida, Gainesville, FL 32611, USA.
|
40
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447185 DOI: 10.1002/cfg.55] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
|
41
|
|