1
|
Residue mutations and their impact on protein structure and function: detecting beneficial and pathogenic changes. Biochem J 2013; 449:581-94. [DOI: 10.1042/bj20121221] [Citation(s) in RCA: 131] [Impact Index Per Article: 11.9] [Indexed: 12/29/2022]
Abstract
The present review focuses on the evolution of proteins and the impact of amino acid mutations on function from a structural perspective. Proteins evolve under the law of natural selection and undergo alternating periods of conservative evolution and of relatively rapid change. The likelihood of mutations being fixed in the genome depends on various factors, such as the fitness of the phenotype or the position of the residues in the three-dimensional structure. For example, co-evolution of residues located close together in three-dimensional space can occur to preserve global stability. Whereas point mutations can fine-tune the protein function, residue insertions and deletions (‘decorations’ at the structural level) can sometimes modify functional sites and protein interactions more dramatically. We discuss recent developments and tools to identify such episodic mutations, and examine their applications in medical research. Such tools have been tested on simulated data and applied to real data such as viruses or animal sequences. Traditionally, there has been little if any cross-talk between the fields of protein biophysics, protein structure–function and molecular evolution. However, the last several years have seen some exciting developments in combining these approaches to obtain an in-depth understanding of how proteins evolve. For example, a better understanding of how structural constraints affect protein evolution will greatly help us to optimize our models of sequence evolution. The present review explores this new synthesis of perspectives.
|
2
|
Protein meta-functional signatures from combining sequence, structure, evolution, and amino acid property information. PLoS Comput Biol 2008; 4:e1000181. [PMID: 18818722 PMCID: PMC2526173 DOI: 10.1371/journal.pcbi.1000181] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Received: 04/15/2008] [Accepted: 08/07/2008] [Indexed: 11/19/2022]
Abstract
Protein function is mediated by different amino acid residues, both their positions and types, in a protein sequence. Some amino acids are responsible for the stability or overall shape of the protein, playing an indirect role in protein function. Others play a functionally important role as part of active or binding sites of the protein. For a given protein sequence, the residues and their degree of functional importance can be thought of as a signature representing the function of the protein. We have developed a combination of knowledge- and biophysics-based function prediction approaches to elucidate the relationships between the structural and the functional roles of individual residues and positions. Such a meta-functional signature (MFS), which is a collection of continuous values representing the functional significance of each residue in a protein, may be used to study proteins of known function in greater detail and to aid in experimental characterization of proteins of unknown function. We demonstrate the superior performance of MFS in predicting protein functional sites and also present four real-world examples to apply MFS in a wide range of settings to elucidate protein sequence-structure-function relationships. Our results indicate that the MFS approach, which can combine multiple sources of information and also give biological interpretation to each component, greatly facilitates the understanding and characterization of protein function.
|
3
|
Xie BB, Chen XL, Zhang XY, He HL, Zhang YZ, Zhou BC. Predicting protein interaction interfaces from protein sequences: case studies of subtilisin and phycocyanin. Proteins 2008; 71:1461-74. [PMID: 18076046 DOI: 10.1002/prot.21836] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/06/2022]
Abstract
Identification of protein interaction interfaces is very important for understanding the molecular mechanisms underlying biological phenomena. Here, we present a novel method for predicting protein interaction interfaces from sequences by using a PAM matrix (PIFPAM). Sequence alignments for interacting proteins were constructed and parsed into segments using sliding windows. By calculating a distance matrix for each segment, the correlation coefficients between segments were estimated. The interaction interfaces were predicted by extracting highly correlated segment pairs from the correlation map. The predictions achieved an accuracy of 0.41-0.71 for eight intraprotein interaction examples, and 0.07-0.60 for four interprotein interaction examples. Compared with three previously published methods, PIFPAM predicted more contacting site pairs for 11 out of the 12 example proteins, and predicted at least 34% more contacting site pairs for eight of them. The factors affecting the predictions were also analyzed. Since PIFPAM uses only the alignments of the two interacting proteins as input, it is especially useful when no three-dimensional protein structure data are available.
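The pipeline outlined in this abstract (sliding windows, one distance matrix per segment, correlation between segments) can be illustrated with a rough Python sketch. This is not the PIFPAM implementation. As a loudly labeled simplification, it swaps the PAM-based distance for a plain p-distance (fraction of mismatched positions), and it assumes both alignments list the same organisms in the same row order. The function names `windows`, `distance_vector`, `pearson`, and `correlation_map` are hypothetical.

```python
import math
from itertools import combinations


def windows(alignment, width):
    """Cut an alignment (list of equal-length strings) into sliding-window segments."""
    length = len(alignment[0])
    return [[seq[s:s + width] for seq in alignment]
            for s in range(length - width + 1)]


def distance_vector(segment):
    """Pairwise distances between the sequences of one segment.

    ASSUMPTION: a simple p-distance stands in for the PAM-based
    distance named in the abstract.
    """
    return [sum(a != b for a, b in zip(x, y)) / len(x)
            for x, y in combinations(segment, 2)]


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0
    return cov / (su * sv)


def correlation_map(aln_a, aln_b, width):
    """Correlation between every segment of aln_a and every segment of aln_b.

    Rows of both alignments must refer to the same organisms, in the same order.
    """
    da = [distance_vector(w) for w in windows(aln_a, width)]
    db = [distance_vector(w) for w in windows(aln_b, width)]
    return [[pearson(x, y) for y in db] for x in da]
```

Under these assumptions, segment pairs whose entries in the returned map exceed a chosen threshold would be the candidate interface pairs. The threshold and window width are free parameters that the abstract does not specify.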
Affiliation(s)
- Bin-Bin Xie
- State Key Lab of Microbial Technology, Shandong University, Jinan 250100, People's Republic of China
|
4
|
Whelan S. Spatial and Temporal Heterogeneity in Nucleotide Sequence Evolution. Mol Biol Evol 2008; 25:1683-94. [DOI: 10.1093/molbev/msn119] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022]
|
5
|
Liu XS, Guo WL. Robustness of the residue conservation score reflecting both frequencies and physicochemistries. Amino Acids 2008; 34:643-52. [PMID: 18175048 DOI: 10.1007/s00726-007-0017-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 10/22/2007] [Accepted: 12/07/2007] [Indexed: 10/22/2022]
Abstract
Measuring residue conservation at aligned positions has many applications in biology. Recently, a new conservation score has been defined. Unlike the previous methods, the new approach considers both residue frequencies and physicochemistries. Specifically, it measures physicochemistries based on BLOSUM matrices disregarding the meaning of the entries in such matrices, which may involve the problem of log-log probability. In this paper we present a conservation measure that also reflects both frequencies and physicochemistries while considering the fact that the entries of BLOSUM matrices are already interpreted as log probability. When the proposed score is applied to 14 protein examples, the results show that these two conservation scores are equivalent aside from the different score ranges. The method is also used to score the functional sites of three protein families. Compared with the widely used entropy-based methods, the resulting scores are more robust and consistent in the sense that the functional sites are much more conserved because of functional constraints.
Affiliation(s)
- X-S Liu
- Institute of Nanoscience, Academy of Frontier Science, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China.
|
6
|
Negrisolo E, Bargelloni L, Patarnello T, Ozouf-Costaz C, Pisano E, di Prisco G, Verde C. Comparative and evolutionary genomics of globin genes in fish. Methods Enzymol 2008; 436:511-38. [PMID: 18237652 DOI: 10.1016/s0076-6879(08)36029-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 02/07/2023]
Abstract
Sequencing genomes of model organisms is a great challenge for biological sciences. In the past decade, scientists have developed a large number of methods to align and compare sequenced genomes. The analysis of a given sequence provides much information on the genome structure but to a lesser extent on the function. Comparative genomics is a useful tool for functional and evolutionary annotation of genomes. In principle, comparison of genomic sequences may allow for identification of the evolutionary selection (negative or positive) that the functional sequences have been subjected to over time. Positively selected genome regions are the most important ones for evolution, because most changes are adaptive and often induce biological differences in organisms. The draft genomes of five fish species have recently become available. We herewith review and discuss some new insights into comparative genomics in fish globin genes. Special attention will be given to a complementary methodological approach to comparative genomics, fluorescence in situ hybridization (FISH). Internet resources for analyzing sequence alignments and annotations and new bioinformatic tools to address critical problems are thoroughly discussed.
Affiliation(s)
- Enrico Negrisolo
- Department of Public Health, Comparative Pathology, and Veterinary Hygiene, University of Padova, Legnaro, Italy
|
7
|
Zhang SW, Zhang YL, Yang HF, Zhao CH, Pan Q. Using the concept of Chou's pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies. Amino Acids 2007; 34:565-72. [PMID: 18074191 DOI: 10.1007/s00726-007-0010-9] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.8] [Received: 10/23/2007] [Accepted: 11/15/2007] [Indexed: 11/24/2022]
Abstract
The rapidly increasing number of sequences entering the genome databank has called for automated methods to analyze them. Information on the subcellular localization of newly found protein sequences is important for helping to reveal their functions in a timely manner and for studying systems biology at the cellular level. Based on the concept of Chou's pseudo-amino acid composition, a series of useful information sources and techniques, such as residue conservation scores, von Neumann entropies, multi-scale energy, and weighted auto-correlation function, were utilized to generate the pseudo-amino acid components for representing the protein samples. Based on such an infrastructure, a hybridization predictor was developed for identifying uncharacterized proteins among the following 12 subcellular localizations: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Compared with the results reported by the previous investigators, higher success rates were obtained, suggesting that the current approach is quite promising, and may become a useful high-throughput tool in the relevant areas.
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, No. 127 Youyi West Road, Xi'an 710072, China.
|
8
|
Livesay DR, Kidd PD, Eskandari S, Roshan U. Assessing the ability of sequence-based methods to provide functional insight within membrane integral proteins: a case study analyzing the neurotransmitter/Na+ symporter family. BMC Bioinformatics 2007; 8:397. [PMID: 17941992 PMCID: PMC2194793 DOI: 10.1186/1471-2105-8-397] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 04/27/2007] [Accepted: 10/17/2007] [Indexed: 01/09/2023]
Abstract
Background Efforts to predict functional sites from globular proteins are increasingly common; however, the most successful of these methods generally require structural insight. Unfortunately, despite several recent technological advances, structural coverage of membrane integral proteins continues to be sparse. Consequently, sequence-based methods represent an important alternative to illuminate functional roles. In this report, we critically examine the ability of several computational methods to provide functional insight within two specific areas. First, can phylogenomic methods accurately describe the functional diversity across a membrane integral protein family? And second, can sequence-based strategies accurately predict key functional sites? Due to the presence of a recently solved structure and a vast amount of experimental mutagenesis data, the neurotransmitter/Na+ symporter (NSS) family is an ideal model system to assess the quality of our predictions. Results The raw NSS sequence dataset contains 181 sequences, which have been aligned by various methods. The resultant phylogenetic trees always contain six major subfamilies that are consistent with the functional diversity across the family. Moreover, in well-represented subfamilies, phylogenetic clustering recapitulates several nuanced functional distinctions. Functional sites are predicted using six different methods (phylogenetic motifs, two methods that identify subfamily-specific positions, and three different conservation scores). A canonical set of 34 functional sites identified by Yamashita et al. within the recently solved LeuTAa structure is used to assess the quality of the predictions, most of which are predicted by the bioinformatic methods. Remarkably, the importance of these sites is largely confirmed by experimental mutagenesis. Furthermore, the collective set of functional site predictions qualitatively clusters along the proposed transport pathway, further demonstrating their utility. Interestingly, the various prediction schemes provide results that are predominantly orthogonal to each other. However, when the methods do provide overlapping results, specificity is shown to increase dramatically (e.g., sites predicted by any three methods have both accuracy and coverage greater than 50%). Conclusion The results presented herein clearly establish the viability of sequence-based bioinformatic strategies to provide functional insight within the NSS family. As such, we expect similar bioinformatic investigations will streamline functional investigations within membrane integral families in the absence of structure.
Affiliation(s)
- Dennis R Livesay
- Department of Computer Science and Bioinformatics Research Center, University of North Carolina at Charlotte, Charlotte, NC 28262, USA.
|
9
|
Anisimova M, Liberles DA. The quest for natural selection in the age of comparative genomics. Heredity (Edinb) 2007; 99:567-79. [PMID: 17848974 DOI: 10.1038/sj.hdy.6801052] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.0] [Indexed: 11/08/2022]
Abstract
Continued genome sequencing has fueled progress in statistical methods for understanding the action of natural selection at the molecular level. This article reviews various statistical techniques (and their applicability) for detecting adaptation events and the functional divergence of proteins. As large-scale automated studies become more frequent, they provide a useful resource for generating biological null hypotheses for further experimental and statistical testing. Furthermore, they shed light on typical patterns of lineage-specific evolution of organisms, on the functional and structural evolution of protein families and on the interplay between the two. More complex models are being developed to better reflect the underlying biological and chemical processes and to complement simpler statistical models. Linking molecular processes to their statistical signatures in genomes can be demanding, and the proper application of statistical models is discussed.
Affiliation(s)
- M Anisimova
- Department of Biology, University College London, London, UK
|
10
|
Chakrabarti S, Bryant SH, Panchenko AR. Functional specificity lies within the properties and evolutionary changes of amino acids. J Mol Biol 2007; 373:801-10. [PMID: 17868687 PMCID: PMC2605514 DOI: 10.1016/j.jmb.2007.08.036] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Received: 05/01/2007] [Revised: 07/03/2007] [Accepted: 08/16/2007] [Indexed: 10/22/2022]
Abstract
The rapid increase in the amount of protein sequence data has created a need for automated identification of sites that determine functional specificity among related subfamilies of proteins. A significant fraction of subfamily specific sites are only marginally conserved, which makes it extremely challenging to detect those amino acid changes that lead to functional diversification. To address this critical problem we developed a method named SPEER (specificity prediction using amino acids' properties, entropy and evolution rate) to distinguish specificity determining sites from others. SPEER encodes the conservation patterns of amino acid types using their physico-chemical properties and the heterogeneity of evolutionary changes between and within the subfamilies. To test the method, we compiled a test set containing 13 protein families with known specificity determining sites. Extensive benchmarking by comparing the performance of SPEER with other specificity site prediction algorithms has shown that it performs better in predicting several categories of subfamily specific sites.
Affiliation(s)
- Saikat Chakrabarti
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|
11
|
How accurate and statistically robust are catalytic site predictions based on closeness centrality? BMC Bioinformatics 2007; 8:153. [PMID: 17498304 PMCID: PMC1876251 DOI: 10.1186/1471-2105-8-153] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Received: 12/11/2006] [Accepted: 05/11/2007] [Indexed: 11/25/2022]
Abstract
Background We examine the accuracy of enzyme catalytic residue predictions from a network representation of protein structure. In this model, amino acid α-carbons specify vertices within a graph and edges connect vertices that are proximal in structure. Closeness centrality, which has shown promise in previous investigations, is used to identify important positions within the network. Closeness centrality, a global measure of network centrality, is calculated as the reciprocal of the average distance between vertex i and all other vertices. Results We benchmark the approach against 283 structurally unique proteins within the Catalytic Site Atlas. Our results, which are in line with previous investigations of smaller datasets, indicate closeness centrality predictions are statistically significant. However, unlike previous approaches, we specifically focus on residues with the very best scores. Over the top five closeness centrality scores, we observe an average true to false positive rate ratio of 6.8 to 1. As demonstrated previously, adding a solvent accessibility filter significantly improves predictive power; the average ratio is increased to 15.3 to 1. We also demonstrate (for the first time) that filtering the predictions by residue identity improves the results even more than accessibility filtering. Here, we simply eliminate residues with physicochemical properties unlikely to be compatible with catalytic requirements from consideration. Residue identity filtering improves the average true to false positive rate ratio to 26.3 to 1. Combining the two filters together has little effect on the results. Calculated p-values for the three prediction schemes range from 2.7E-9 to less than 8.8E-134. Finally, the sensitivity of the predictions to structure choice and slight perturbations is examined. Conclusion Our results resolutely confirm that closeness centrality is a viable prediction scheme whose predictions are statistically significant. Simple filtering schemes substantially improve the method's predictive power. Moreover, no clear effect on performance is observed when comparing ligated and unligated structures. Similarly, the CC prediction results are robust to slight structural perturbations from molecular dynamics simulation.
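The definition given in this abstract (closeness centrality as the reciprocal of the average distance between vertex i and all other vertices, on a graph whose vertices are α-carbons) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The 8 Å contact cutoff, the use of unweighted shortest-path (hop) distances, and the function names `contact_graph` and `closeness_centrality` are all assumptions, since the abstract says only that edges connect vertices that are "proximal in structure".

```python
import math
from collections import deque


def contact_graph(coords, cutoff=8.0):
    """Adjacency lists linking C-alpha atoms closer than `cutoff`.

    ASSUMPTION: the 8.0 angstrom default is illustrative; the abstract
    does not state the cutoff used.
    """
    n = len(coords)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def closeness_centrality(adj):
    """Reciprocal of the mean shortest-path length from each vertex to all others.

    Distances are hop counts found by breadth-first search. A vertex that
    cannot reach every other vertex is given a score of 0.0.
    """
    n = len(adj)
    scores = []
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        if len(dist) < n or total == 0:
            scores.append(0.0)
        else:
            scores.append((n - 1) / total)
    return scores
```

Ranking residues by the returned scores and keeping the top few corresponds to the "very best scores" focus described above. The solvent accessibility and residue identity filters are not modeled here.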
|
12
|
Abstract
Background The rate of evolution varies spatially along genomes and over time. The presence of evolutionary rate variation is an informative signal that often marks functional regions of genomes and historical selection events. There exist many tests for temporal rate variation, or heterotachy, that start by partitioning sampled sequences into two or more groups and testing rate homogeneity among the groups. I develop a Bayesian method to infer phylogenetic trees with a divergence point, or dramatic temporal shifts in selection pressure that affect many nucleotide sites simultaneously, located at an unknown position in the tree. Results Simulation demonstrates that the method is most able to detect divergence points when rate variation and the number of affected sites is high, but not beyond biologically relevant values. The method is applied to two viral data sets. A divergence point is identified separating the B and C subtypes, two genetically distinct variants of HIV that have spread into different human populations with the AIDS epidemic. In contrast, no strong signal of temporal rate variation is found in a sample of F and H genotypes, two genetic variants of HBV that have likely evolved with humans during their immigration and expansion into the Americas. Conclusion Temporal shifts in evolutionary rate of sufficient magnitude are detectable in the history of sampled sequences. The ability to detect such divergence points without the need to specify a prior hypothesis about the location or timing of the divergence point should help scientists identify historically important selection events and decipher mechanisms of evolution.
Affiliation(s)
- Karin S Dorman
- Department of Statistics, and the Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA, USA.
|
13
|
Kalinina OV, Russell RB, Rakhmaninova AB, Gelfand MS. Computational method for predicting protein functional sites with the use of specificity determinants. Mol Biol 2007. [DOI: 10.1134/s0026893307010189] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
|
14
|
Liu X, Li J, Guo W, Wang W. A new method for quantifying residue conservation and its applications to the protein folding nucleus. Biochem Biophys Res Commun 2006; 351:1031-6. [PMID: 17097065 DOI: 10.1016/j.bbrc.2006.10.157] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/26/2006] [Accepted: 10/27/2006] [Indexed: 11/23/2022]
Abstract
The conservation of residues in columns of a multiple sequence alignment (MSA) reflects the importance of these residues for maintaining the structure and function of a protein. To date, many scores have been suggested for quantifying residue conservation, but none has achieved the full rigor both in biology and statistics. In this paper, we present a new approach for measuring the evolutionary conservation at aligned positions. Our conservation measure is related to the logarithmic probabilities for aligned positions, and combines the physicochemical properties and the frequencies of amino acids. Such a measure is both biologically and statistically meaningful. For testing the relationship between an amino acid's evolutionary conservation and its role in the Phi-value defined protein folding kinetics, our results indicate that the folding nucleus residues may not be significantly more conserved than other residues by using the biological-relevance weighted statistical scoring method suggested in this paper as an alternative to entropy-based procedures.
Affiliation(s)
- Xinsheng Liu
- Institute of Nanoscience, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China.
|
15
|
Klosterman PS, Uzilov AV, Bendaña YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I. XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics 2006; 7:428. [PMID: 17018148 PMCID: PMC1622757 DOI: 10.1186/1471-2105-7-428] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Received: 02/24/2006] [Accepted: 10/03/2006] [Indexed: 12/15/2022]
Abstract
Background Recent years have seen the emergence of genome annotation methods based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. Previously, phylo-grammars have required considerable effort to implement, limiting their adoption by computational biologists. Results We have developed an open source software tool, xrate, for working with reversible, irreversible or parametric substitution models combined with stochastic context-free grammars. xrate efficiently estimates maximum-likelihood parameters and phylogenetic trees using a novel "phylo-EM" algorithm that we describe. The grammar is specified in an external configuration file, allowing users to design new grammars, estimate rate parameters from training data and annotate multiple sequence alignments without the need to recompile code from source. We have used xrate to measure codon substitution rates and predict protein and RNA secondary structures. Conclusion Our results demonstrate that xrate estimates biologically meaningful rates and makes predictions whose accuracy is comparable to that of more specialized tools.
Affiliation(s)
- Peter S Klosterman
- Department of Bioengineering, University of California, Berkeley CA, USA
- Andrew V Uzilov
- Department of Bioengineering, University of California, Berkeley CA, USA
- Yuri R Bendaña
- Department of Bioengineering, University of California, Berkeley CA, USA
- Robert K Bradley
- Department of Bioengineering, University of California, Berkeley CA, USA
- Sharon Chao
- Department of Bioengineering, University of California, Berkeley CA, USA
- Carolin Kosiol
- European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca NY, USA
- Nick Goldman
- European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
- Ian Holmes
- Department of Bioengineering, University of California, Berkeley CA, USA
|
16
|
Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics 2006. [PMID: 16916457 DOI: 10.1186/1471-2105-7-385] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND Several entropy-based methods have been developed for scoring sequence conservation in protein multiple sequence alignments. High scoring amino acid positions may correlate with structurally or functionally important residues. However, amino acid background frequencies are usually not taken into account in these entropy-based scoring schemes. RESULTS We demonstrate that using a relative entropy measure that incorporates amino acid background frequency results in improved performance in identifying functional sites from protein multiple sequence alignments. CONCLUSION Our results suggest that the application of appropriate background frequency information may lead to more biologically relevant results in many areas of bioinformatics.
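To make the contrast in this abstract concrete, the following Python sketch compares a plain Shannon entropy column score with a relative entropy (Kullback-Leibler divergence) score that takes a background distribution into account. It is an illustrative sketch under stated assumptions, not the authors' implementation: the column is assumed to contain only the 20 standard residue letters (no gaps), no pseudocounts or sequence weighting are applied, and the background frequencies are supplied by the caller. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def shannon_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.

    Lower entropy means a more conserved column; background is ignored.
    """
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def relative_entropy(column, background):
    """Relative entropy (bits) of the column distribution against `background`.

    `background` maps each residue letter to its expected frequency.
    Higher values mean the column departs more strongly from background.
    ASSUMPTION: no gaps, pseudocounts, or sequence weights are handled.
    """
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2((c / n) / background[aa])
               for aa, c in counts.items())
```

With a background in which tryptophan is rare and leucine is common, a column of all W and a column of all L both have Shannon entropy 0, yet the all-W column receives the higher relative entropy. That difference is the kind of signal the abstract attributes to using background frequencies.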
|
17
|
Wang K, Samudrala R. Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics 2006; 7:385. [PMID: 16916457 PMCID: PMC1562451 DOI: 10.1186/1471-2105-7-385] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.9] [Received: 06/14/2006] [Accepted: 08/17/2006] [Indexed: 11/10/2022]
Affiliation(s)
- Kai Wang
- Computational Genomics Group, Department of Microbiology, University of Washington, USA
- Ram Samudrala
- Computational Genomics Group, Department of Microbiology, University of Washington, USA
|
18
|
Mihalek I, Res I, Lichtarge O. Evolutionary and structural feedback on selection of sequences for comparative analysis of proteins. Proteins 2006; 63:87-99. [PMID: 16397893 DOI: 10.1002/prot.20866] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/11/2022]
Abstract
It has been noted that slowly evolving protein residues have two properties: (a) they tend to cluster in the native fold, and (b) they delineate functional surfaces, that is, parts of the surface through which the protein interacts with other proteins or small ligands. Herein, we demonstrate that the two are coupled sufficiently strongly that one effect, when observed, statistically implies the other. Detection of both can be accomplished in multiple sequence alignment related methods by the careful selection of relevant sequences. For the demonstration, we use two sets of protein families: a small set of diverse proteins with diverse functional surfaces, and a large set of homodimerizing enzymes. A practical outcome of our considerations is a simple prescriptive rule for the selection of homologous sequences for the comparative analysis of proteins: in order to optimize the detection of (potentially unknown) functional surfaces, it is sufficient to select sequences in such a way that the residues observed at any level of evolutionary divergence, as implied by the alignment, cluster on the folded protein.
Affiliation(s)
- I Mihalek
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA.
|
19
|
Abhiman S, Daub CO, Sonnhammer ELL. Prediction of function divergence in protein families using the substitution rate variation parameter alpha. Mol Biol Evol 2006; 23:1406-13. [PMID: 16672285 DOI: 10.1093/molbev/msl002] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
Protein families typically embody a range of related functions and may thus be decomposed into subfamilies with, for example, distinct substrate specificities. Detection of functionally divergent subfamilies is possible by methods for recognizing branches of adaptive evolution in a gene tree. As the number of genome sequences is growing rapidly, it is highly desirable to automatically detect subfamily function divergence. To this end, we here introduce a method for large-scale prediction of function divergence within protein families. It is called the alpha shift measure (ASM) as it is based on detecting a shift in the shape parameter (alpha, α) of the substitution rate gamma distribution. Four different methods for estimating alpha were investigated. We benchmarked the accuracy of ASM using function annotation from Enzyme Commission numbers within Pfam protein families divided into subfamilies by the automatic tree-based method BETE. In a test using 563 subfamily pairs in 162 families, ASM outperformed functional site-based methods using rate or conservation shifting (rate shift measure [RSM] and conservation shift measure [CSM]). The best results were obtained using the "GZ-Gamma" method for estimating alpha. By combining ASM with RSM and CSM using linear discriminant analysis, the prediction accuracy was further improved.
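The idea of comparing the gamma shape parameter between two subfamilies can be illustrated with a minimal sketch. It assumes per-site substitution rates have already been estimated for each subfamily, and it uses a method-of-moments estimate (shape = mean² / variance) as a stand-in; this is not the "GZ-Gamma" estimator or any of the four estimators named in the abstract, and the absolute difference used here is only an illustrative shift statistic, not the published ASM formula. All function names are hypothetical.

```python
from statistics import mean, pvariance


def gamma_shape_moments(rates):
    """Method-of-moments estimate of the gamma shape parameter:
    for a gamma distribution, shape = mean**2 / variance."""
    m = mean(rates)
    v = pvariance(rates)
    if v == 0:
        raise ValueError("rates have zero variance; shape is undefined")
    return m * m / v


def alpha_shift(rates_a, rates_b):
    """Illustrative shift statistic: absolute difference between the
    shape estimates of two subfamilies (an assumption, not the ASM formula)."""
    return abs(gamma_shape_moments(rates_a) - gamma_shape_moments(rates_b))


# Hypothetical per-site rates for two subfamilies.
subfamily_a = [0.5, 1.5, 0.5, 1.5]   # rates fairly homogeneous across sites
subfamily_b = [0.1, 1.9, 0.1, 1.9]   # rates strongly heterogeneous across sites
shift = alpha_shift(subfamily_a, subfamily_b)
```

A small alpha corresponds to strong rate heterogeneity among sites, so a large shift suggests that the pattern of constraint differs between the two subfamilies.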
Affiliation(s)
- Saraswathi Abhiman
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden.
|
20
|
Mihalek I, Res I, Lichtarge O. A structure and evolution-guided Monte Carlo sequence selection strategy for multiple alignment-based analysis of proteins. Bioinformatics 2005; 22:149-56. [PMID: 16303797 DOI: 10.1093/bioinformatics/bti791] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Various multiple sequence alignment-based methods have been proposed to detect functional surfaces in proteins, such as active sites or protein interfaces. The effect that the choice of sequences has on the conclusions of such analysis has seldom been discussed. In particular, no method has been discussed in terms of its ability to optimize the sequence selection for the reliable detection of functional surfaces. RESULTS: Here we propose, for the case of proteins with known structure, a heuristic Metropolis Monte Carlo strategy to select sequences from a large set of homologues, in order to improve detection of functional surfaces. The quantity guiding the optimization is the clustering of residues which are under increased evolutionary pressure, according to the sample of sequences under consideration. We show that we can either improve the overlap of our prediction with known functional surfaces in comparison with the sequence similarity criteria of selection or match the quality of prediction obtained through more elaborate non-structure-based methods of sequence selection. For the purpose of demonstration we use a set of 50 homodimerizing enzymes which were co-crystallized with their substrates and cofactors.
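The general shape of a Metropolis Monte Carlo search over sequence subsets can be sketched as follows. This is a generic illustration, not the authors' procedure: the move set (toggling one sequence in or out), the temperature, the step count and the `score` callback are all assumptions. In the setting of the abstract, `score` would measure how well the evolutionarily constrained residues cluster on the structure; here a toy score stands in for it.

```python
import math
import random


def metropolis_select(n_sequences, score, steps=2000, temperature=0.1,
                      min_size=2, seed=0):
    """Search for a subset of sequence indices that maximizes score(subset)."""
    rng = random.Random(seed)
    current = set(range(n_sequences))      # start from the full set of homologues
    current_score = score(current)
    best, best_score = set(current), current_score
    for _ in range(steps):
        # Propose toggling one randomly chosen sequence in or out of the subset.
        proposal = current ^ {rng.randrange(n_sequences)}
        if len(proposal) < min_size:
            continue
        delta = score(proposal) - current_score
        # Metropolis criterion: always accept improvements, and accept
        # worsening moves with probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, current_score + delta
            if current_score > best_score:
                best, best_score = set(current), current_score
    return best, best_score


# Toy score (hypothetical): sequences 0-3 help, all others hurt.
informative = {0, 1, 2, 3}


def toy_score(subset):
    return len(subset & informative) - len(subset - informative)


selected, selected_score = metropolis_select(10, toy_score)
```

Accepting occasional worsening moves lets the search escape local optima, which matters when the real clustering score is rugged; the toy score here is smooth, so the search behaves almost greedily.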
Affiliation(s)
- I Mihalek
- Department of Molecular and Human Genetics, Baylor College of Medicine One Baylor Plaza, Houston, TX 77030, USA.
|
21
|
Pei J, Cai W, Kinch LN, Grishin NV. Prediction of functional specificity determinants from protein sequences using log-likelihood ratios. Bioinformatics 2005; 22:164-71. [PMID: 16278237 DOI: 10.1093/bioinformatics/bti766] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: A number of methods have been developed to predict functional specificity determinants in protein families based on sequence information. Most of these methods rely on pre-defined functional subgroups. Manual subgroup definition is difficult because of the limited number of experimentally characterized subfamilies with differing specificity, while automatic subgroup partitioning using computational tools is a non-trivial task and does not always yield ideal results. RESULTS: We propose a new approach, SPEL (specificity positions by evolutionary likelihood), to detect positions that are likely to be functional specificity determinants. SPEL, which does not require subgroup definition, takes a multiple sequence alignment of a protein family as the only input, and assigns a P-value to every position in the alignment. Positions with low P-values are likely to be important for functional specificity. An evolutionary tree is reconstructed during the calculation, and P-value estimation is based on a random model that involves evolutionary simulations. Evolutionary log-likelihood is chosen as a measure of amino acid distribution at a position. To illustrate the performance of the method, we carried out a detailed analysis of two protein families (LacI/PurR and G protein alpha subunit), and compared our method with two existing methods (evolutionary trace and mutual information-based). All three methods were also compared on a set of protein families with known ligand-bound structures. AVAILABILITY: SPEL is freely available for non-commercial use. Its pre-compiled versions for several platforms and alignments used in this work are available at ftp://iole.swmed.edu/pub/SPEL/
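Assigning a P-value from simulations generally reduces to comparing an observed statistic with statistics computed on simulated data. The sketch below shows only that final step and is not SPEL's code: the function name is hypothetical, the simulated scores are placeholders for values an evolutionary simulation would produce, and which tail counts as extreme is left as a parameter because the abstract does not specify it.

```python
def empirical_p_value(observed, null_scores, lower_is_extreme=True):
    """Empirical P-value of an observed statistic against simulated scores.

    Adding 1 to numerator and denominator keeps the estimate above zero,
    which is the usual convention for Monte Carlo P-values.
    """
    if lower_is_extreme:
        hits = sum(1 for s in null_scores if s <= observed)
    else:
        hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


# Placeholder null distribution standing in for simulated log-likelihoods.
simulated = [-12.0, -11.5, -11.0, -10.5, -10.0, -9.5, -9.0, -8.5, -8.0]
p_unusual = empirical_p_value(-13.0, simulated)   # below every simulated value
p_typical = empirical_p_value(-10.0, simulated)   # in the middle of the range
```

The smallest attainable P-value is 1 / (N + 1) for N simulations, so the number of simulated replicates sets the resolution of the per-position ranking.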
Affiliation(s)
- Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA
|