51
Jones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 2014; 31:999-1006. [PMID: 25431331 PMCID: PMC4382908 DOI: 10.1093/bioinformatics/btu791] [Citation(s) in RCA: 232] [Impact Index Per Article: 23.2] [Received: 09/25/2014] [Accepted: 11/22/2014] [Indexed: 12/13/2022]
Abstract
Motivation: Recent developments of statistical techniques to infer direct evolutionary couplings between residue pairs have rendered covariation-based contact prediction a viable means for accurate 3D modelling of proteins, with no information other than the sequence required. To extend the usefulness of contact prediction, we have designed a new meta-predictor (MetaPSICOV) which combines three distinct approaches for inferring covariation signals from multiple sequence alignments, considers a broad range of other sequence-derived features and, uniquely, a range of metrics which describe both the local and global quality of the input multiple sequence alignment. Finally, we use a two-stage predictor, where the second stage filters the output of the first stage. This two-stage predictor is additionally evaluated on its ability to accurately predict the long range network of hydrogen bonds, including correctly assigning the donor and acceptor residues. Results: Using the original PSICOV benchmark set of 150 protein families, MetaPSICOV achieves a mean precision of 0.54 for top-L predicted long range contacts, around 60% higher than PSICOV and around 40% better than CCMpred. In de novo protein structure prediction using FRAGFOLD, MetaPSICOV is able to improve the TM-scores of models by a median of 0.05 compared with PSICOV. Lastly, for predicting long range hydrogen bonding, MetaPSICOV-HB achieves a precision of 0.69 for the top-L/10 hydrogen bonds compared with just 0.26 for the baseline MetaPSICOV. Availability and implementation: MetaPSICOV is freely available as a web server at http://bioinf.cs.ucl.ac.uk/MetaPSICOV. Raw data (predicted contact lists and 3D models) and source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/MetaPSICOV. Contact: d.t.jones@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
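The "top-L" precision figures quoted in this abstract can be illustrated with a minimal sketch. It is not the MetaPSICOV evaluation code: the function name, the input layout, and the long range cutoff of 24 residues of sequence separation are assumptions made here for illustration, and the benchmark may define long range differently.

```python
from typing import Iterable, Set, Tuple


def top_l_long_range_precision(
    scored_pairs: Iterable[Tuple[int, int, float]],
    true_contacts: Set[Tuple[int, int]],
    length: int,
    min_separation: int = 24,  # assumed long range cutoff, not taken from the abstract
    fraction: float = 1.0,     # 1.0 -> top-L, 0.1 -> top-L/10
) -> float:
    """Precision of the highest-scoring long range predictions.

    scored_pairs holds (i, j, score) tuples; true_contacts holds (i, j) with i < j.
    """
    # Keep only pairs that are far enough apart along the sequence.
    long_range = [p for p in scored_pairs if abs(p[0] - p[1]) >= min_separation]
    # Rank by predicted score, best first.
    long_range.sort(key=lambda p: p[2], reverse=True)
    n_top = max(1, int(length * fraction))
    top = long_range[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    # Divide by the number of predictions actually evaluated.
    return hits / len(top)
```

Passing `fraction=0.1` gives the top-L/10 variant mentioned for hydrogen bonds, under the same assumptions.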
Affiliation(s)
- David T Jones
- Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Tanya Singh
- Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Tomasz Kosciolek
- Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Stuart Tetchner
- Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
52
Feinauer C, Skwark MJ, Pagnani A, Aurell E. Improving contact prediction along three dimensions. PLoS Comput Biol 2014; 10:e1003847. [PMID: 25299132 PMCID: PMC4191875 DOI: 10.1371/journal.pcbi.1003847] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Received: 02/28/2014] [Accepted: 08/07/2014] [Indexed: 11/18/2022]
Abstract
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we in this paper refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pair-wise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather untypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date.
Proteins are large molecules that living cells make by stringing together building blocks called amino acids or peptides, following their blue-prints in the DNA. Freshly made proteins are typically long, structure-less chains of peptides, but shortly afterwards most of them fold into characteristic structures. Proteins execute many functions in the cell, for which they need to have the right structure, which is therefore very important in determining what the proteins can do. The structure of a protein can be determined by X-ray diffraction and other experimental approaches which are all, to this day, somewhat labor-intensive and difficult. On the other hand, the order of the peptides in a protein can be read off from the DNA blue-print, and such protein sequences are today routinely produced in large numbers. In this paper we show that many similar protein sequences can be used to find information about the structure. The basic approach is to construct a probabilistic model for sequence variability, and then to use the parameters of that model to predict structure in three-dimensional space. The main technical novelty compared to previous contributions in the same general direction is that we use models more directly matched to the data.
Affiliation(s)
- Christoph Feinauer
- DISAT and Center for Computational Sciences, Politecnico Torino, Torino, Italy
- Marcin J. Skwark
- Department of Information and Computer Science, Aalto University, Aalto, Finland
- Aalto Science Institute (AScI), Aalto University, Aalto, Finland
- Andrea Pagnani
- DISAT and Center for Computational Sciences, Politecnico Torino, Torino, Italy
- Human Genetics Foundation-Torino, Molecular Biotechnology Center, Torino, Italy
- Erik Aurell
- Department of Information and Computer Science, Aalto University, Aalto, Finland
- Aalto Science Institute (AScI), Aalto University, Aalto, Finland
- Department of Computational Biology, Royal Institute of Technology, AlbaNova University Centre, Stockholm, Sweden
53
Obermayer B, Levine E. Exploring the miRNA regulatory network using evolutionary correlations. PLoS Comput Biol 2014; 10:e1003860. [PMID: 25299225 PMCID: PMC4191876 DOI: 10.1371/journal.pcbi.1003860] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 05/13/2014] [Accepted: 08/18/2014] [Indexed: 01/01/2023]
Abstract
Post-transcriptional regulation by miRNAs is a widespread and highly conserved phenomenon in metazoans, with several hundreds to thousands of conserved binding sites for each miRNA, and up to two thirds of all genes under miRNA regulation. At the same time, the effect of miRNA regulation on mRNA and protein levels is usually quite modest and associated phenotypes are often weak or subtle. This has given rise to the notion that the highly interconnected miRNA regulatory network exerts its function less through any individual link and more via collective effects that lead to a functional interdependence of network links. We present a Bayesian framework to quantify conservation of miRNA target sites using vertebrate whole-genome alignments. The increased statistical power of our phylogenetic model allows detection of evolutionary correlation in the conservation patterns of site pairs. Such correlations could result from collective functions in the regulatory network. For instance, co-conservation of target site pairs supports a selective benefit of combinatorial regulation by multiple miRNAs. We find that some miRNA families are under pronounced co-targeting constraints, indicating a high connectivity in the regulatory network, while others appear to function in a more isolated way. By analyzing coordinated targeting of different curated gene sets, we observe distinct evolutionary signatures for protein complexes and signaling pathways that could reflect differences in control strategies. Our method is easily scalable to analyze upcoming larger data sets, and readily adaptable to detect high-level selective constraints between other genomic loci. We thus provide a proof-of-principle method to understand regulatory networks from an evolutionary perspective.
Affiliation(s)
- Benedikt Obermayer
- Systems Biology of Gene Regulatory Elements, Max-Delbrück Center for Molecular Medicine, Berlin, Germany
- Department of Physics and Center for Systems Biology, Harvard University, Cambridge, Massachusetts, United States of America
- Erel Levine
- Systems Biology of Gene Regulatory Elements, Max-Delbrück Center for Molecular Medicine, Berlin, Germany
- Department of Physics and Center for Systems Biology, Harvard University, Cambridge, Massachusetts, United States of America
54
Michel M, Hayat S, Skwark MJ, Sander C, Marks DS, Elofsson A. PconsFold: improved contact predictions improve protein models. Bioinformatics 2014; 30:i482-8. [PMID: 25161237 PMCID: PMC4147911 DOI: 10.1093/bioinformatics/btu458] [Citation(s) in RCA: 85] [Impact Index Per Article: 8.5] [Indexed: 01/25/2023]
Abstract
MOTIVATION Recently it has been shown that the quality of protein contact prediction from evolutionary information can be improved significantly if direct and indirect information is separated. Given sufficiently large protein families, the contact predictions contain sufficient information to predict the structure of many protein families. However, since the first studies, contact prediction methods have improved. Here, we ask how much the final models are improved if improved contact predictions are used. RESULTS In a small benchmark of 15 proteins, we show that the TM-scores of top-ranked models are improved by 33% on average using PconsFold compared with the original version of EVfold. In a larger benchmark, we find that the quality is improved by 15-30% when using PconsC in comparison with earlier contact prediction methods. Further, using Rosetta instead of CNS does not significantly improve global model accuracy, but the chemistry of models generated with Rosetta is improved. AVAILABILITY PconsFold is a fully automated pipeline for ab initio protein structure prediction based on evolutionary information. PconsFold is based on PconsC contact prediction and uses the Rosetta folding protocol. Due to its modularity, the contact prediction tool can be easily exchanged. The source code of PconsFold is available on GitHub at https://www.github.com/ElofssonLab/pcons-fold under the MIT license. PconsC is available from http://c.pcons.net/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mirco Michel
- Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Sikander Hayat
- Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Marcin J Skwark
- Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Chris Sander
- Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Debora S Marks
- Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Arne Elofsson
- Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
55
Ivankov DN, Finkelstein AV, Kondrashov FA. A structural perspective of compensatory evolution. Curr Opin Struct Biol 2014; 26:104-12. [PMID: 24981969 PMCID: PMC4141909 DOI: 10.1016/j.sbi.2014.05.004] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 02/03/2014] [Revised: 04/11/2014] [Accepted: 05/16/2014] [Indexed: 11/25/2022]
Abstract
The study of molecular evolution is important because it reveals how protein functions emerge and evolve. Recently, several types of studies indicated that substitutions in molecular evolution occur in a compensatory manner, whereby the occurrence of a substitution depends on the amino acid residues at other sites. However, a molecular or structural basis behind the compensation often remains obscure. Here, we review studies on the interface of structural biology and molecular evolution that revealed novel aspects of compensatory evolution. In many cases structural studies benefit from evolutionary data while structural data often add a functional dimension to the study of molecular evolution.
Affiliation(s)
- Dmitry N Ivankov
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 88 Dr. Aiguader, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; Laboratory of Protein Physics, Institute of Protein Research of the Russian Academy of Sciences, 4 Institutskaya str., Pushchino, Moscow Region, 142290, Russia
- Alexei V Finkelstein
- Laboratory of Protein Physics, Institute of Protein Research of the Russian Academy of Sciences, 4 Institutskaya str., Pushchino, Moscow Region, 142290, Russia
- Fyodor A Kondrashov
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 88 Dr. Aiguader, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), 23 Pg. Lluís Companys, 08010 Barcelona, Spain.
56
Janda JO, Popal A, Bauer J, Busch M, Klocke M, Spitzer W, Keller J, Merkl R. H2rs: deducing evolutionary and functionally important residue positions by means of an entropy and similarity based analysis of multiple sequence alignments. BMC Bioinformatics 2014; 15:118. [PMID: 24766829 PMCID: PMC4021312 DOI: 10.1186/1471-2105-15-118] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 01/13/2014] [Accepted: 04/17/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND The identification of functionally important residue positions is an important task of computational biology. Methods of correlation analysis allow for the identification of pairs of residue positions, whose occupancy is mutually dependent due to constraints imposed by protein structure or function. A common measure assessing these dependencies is the mutual information, which is based on Shannon's information theory that utilizes probabilities only. Consequently, such approaches do not consider the similarity of residue pairs, which may degrade the algorithm's performance. One typical algorithm is H2r, which characterizes each individual residue position k by the conn(k)-value, which is the number of significantly correlated pairs it belongs to. RESULTS To improve specificity of H2r, we developed a revised algorithm, named H2rs, which is based on the von Neumann entropy (vNE). To compute the corresponding mutual information, a matrix A is required, which assesses the similarity of residue pairs. We determined A by deducing substitution frequencies from contacting residue pairs observed in the homologs of 35 809 proteins, whose structure is known. In analogy to H2r, the enhanced algorithm computes a normalized conn(k)-value. Within the framework of H2rs, only statistically significant vNE values were considered. To decide on significance, the algorithm calculates a p-value by performing a randomization test for each individual pair of residue positions. The analysis of a large in silico testbed demonstrated that specificity and precision were higher for H2rs than for H2r and two other methods of correlation analysis. The gain in prediction quality is further confirmed by a detailed assessment of five well-studied enzymes. The outcome of H2rs and of a method that predicts contacting residue positions (PSICOV) overlapped only marginally. H2rs can be downloaded from http://www-bioinf.uni-regensburg.de. 
CONCLUSIONS Considering substitution frequencies for residue pairs by means of the von Neumann entropy and a p-value improved the success rate in identifying important residue positions. The integration of proven statistical concepts and normalization allows for an easier comparison of results obtained with different proteins. Comparing the outcome of the local method H2rs and of the global method PSICOV indicates that such methods supplement each other and have different scopes of application.
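As a point of reference for the classical measure this abstract contrasts itself with, the Shannon mutual information between two alignment columns can be sketched as follows. This is only the probability-based baseline, not the von Neumann entropy variant used by H2rs; the function name and the plain-string column representation are assumptions made here for illustration.

```python
import math
from collections import Counter


def mutual_information(col_a: str, col_b: str) -> float:
    """Shannon mutual information (in bits) between two alignment columns.

    Each column is given as a string with one residue per sequence.
    Only observed frequencies are used, so residue similarity is ignored,
    which is the limitation the abstract points out.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must come from the same alignment")
    n = len(col_a)
    if n == 0:
        return 0.0
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        p_a = freq_a[a] / n
        p_b = freq_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Two perfectly co-varying columns such as "AACC" and "DDEE" carry one bit of mutual information under this definition, while statistically independent columns carry none.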
Affiliation(s)
- Rainer Merkl
- Institute of Biophysics and Physical Biochemistry, University of Regensburg, D-93040 Regensburg, Germany.
57
Gültas M, Düzgün G, Herzog S, Jäger SJ, Meckbach C, Wingender E, Waack S. Quantum coupled mutation finder: predicting functionally or structurally important sites in proteins using quantum Jensen-Shannon divergence and CUDA programming. BMC Bioinformatics 2014; 15:96. [PMID: 24694117 PMCID: PMC4098773 DOI: 10.1186/1471-2105-15-96] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 06/10/2013] [Accepted: 03/26/2014] [Indexed: 11/29/2022]
Abstract
Background The identification of functionally or structurally important non-conserved residue sites in protein MSAs is an important challenge for understanding the structural basis and molecular mechanism of protein functions. Despite the rich literature on compensatory mutations as well as sequence conservation analysis for the detection of those important residues, previous methods often rely on classical information-theoretic measures. However, these measures usually do not take into account dis/similarities of amino acids which are likely to be crucial for those residues. In this study, we present a new method, the Quantum Coupled Mutation Finder (QCMF) that incorporates significant dis/similar amino acid pair signals in the prediction of functionally or structurally important sites. Results The result of this study is twofold. First, using the essential sites of two human proteins, namely epidermal growth factor receptor (EGFR) and glucokinase (GCK), we tested the QCMF-method. The QCMF includes two metrics based on quantum Jensen-Shannon divergence to measure both sequence conservation and compensatory mutations. We found that the QCMF reaches an improved performance in identifying essential sites from MSAs of both proteins with a significantly higher Matthews correlation coefficient (MCC) value in comparison to previous methods. Second, using a data set of 153 proteins, we made a pairwise comparison between QCMF and three conventional methods. This comparison study strongly suggests that QCMF complements the conventional methods for the identification of correlated mutations in MSAs. Conclusions QCMF utilizes the notion of entanglement, which is a major resource of quantum information, to model significant dissimilar and similar amino acid pair signals in the detection of functionally or structurally important sites. 
Our results suggest that on the one hand QCMF significantly outperforms the previous method, which mainly focuses on dissimilar amino acid signals, to detect essential sites in proteins. On the other hand, it is complementary to the existing methods for the identification of correlated mutations. The method of QCMF is computationally intensive. To ensure a feasible computation time of the QCMF’s algorithm, we leveraged Compute Unified Device Architecture (CUDA). The QCMF server is freely accessible at http://qcmf.informatik.uni-goettingen.de/.
Affiliation(s)
- Mehmet Gültas
- Institute of Computer Science, University of Göttingen, Goldschmidtstr. 7, 37077 Göttingen, Germany.
58
Baldassi C, Zamparo M, Feinauer C, Procaccini A, Zecchina R, Weigt M, Pagnani A. Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners. PLoS One 2014; 9:e92721. [PMID: 24663061 PMCID: PMC3963956 DOI: 10.1371/journal.pone.0092721] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Received: 12/05/2013] [Accepted: 02/24/2014] [Indexed: 11/18/2022]
Abstract
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.
Affiliation(s)
- Carlo Baldassi
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
- Marco Zamparo
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
- Christoph Feinauer
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Riccardo Zecchina
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
- Martin Weigt
- Sorbonne Universités, Université Pierre et Marie Curie Paris 06, UMR 7238, Computational and Quantitative Biology, Paris, France
- Centre National de la Recherche Scientifique, UMR 7238, Computational and Quantitative Biology, Paris, France
- Andrea Pagnani
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
59
Kosciolek T, Jones DT. De novo structure prediction of globular proteins aided by sequence variation-derived contacts. PLoS One 2014; 9:e92197. [PMID: 24637808 PMCID: PMC3956894 DOI: 10.1371/journal.pone.0092197] [Citation(s) in RCA: 93] [Impact Index Per Article: 9.3] [Received: 11/06/2013] [Accepted: 02/19/2014] [Indexed: 12/21/2022]
Abstract
The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm, FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within the analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts on determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step.
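The "fraction of satisfied predicted long-range contacts" that this abstract relates to model quality can be sketched as below. This is an illustrative reading, not the authors' scoring function: the function name, the use of one representative atom per residue, the 8 Å distance cutoff, and the sequence separation of 24 used for "long-range" are all assumptions made here.

```python
import math
from typing import Iterable, Sequence, Tuple


def fraction_satisfied(
    predicted_contacts: Iterable[Tuple[int, int]],
    coords: Sequence[Tuple[float, float, float]],
    cutoff: float = 8.0,       # assumed contact distance in angstroms
    min_separation: int = 24,  # assumed definition of long-range
) -> float:
    """Fraction of predicted long-range contacts realised in a 3D model.

    coords[i] is the position of one representative atom of residue i.
    """
    # Restrict to contacts between residues far apart in sequence.
    long_range = [(i, j) for i, j in predicted_contacts if abs(i - j) >= min_separation]
    if not long_range:
        return 0.0
    # A contact counts as satisfied when the two residues lie within the cutoff.
    satisfied = sum(
        1 for i, j in long_range if math.dist(coords[i], coords[j]) <= cutoff
    )
    return satisfied / len(long_range)
```

A score of this kind could be computed per decoy and compared across an ensemble, which is the sort of use the abstract describes for model selection.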
Affiliation(s)
- Tomasz Kosciolek
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- David T. Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
60
Lee YCG, Langley CH, Begun DJ. Differential strengths of positive selection revealed by hitchhiking effects at small physical scales in Drosophila melanogaster. Mol Biol Evol 2013; 31:804-16. [PMID: 24361994 DOI: 10.1093/molbev/mst270] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023]
Abstract
The long time scale of adaptive evolution makes it difficult to directly observe the spread of most beneficial mutations through natural populations. Therefore, inferring attributes of beneficial mutations by studying the genomic signals left by directional selection is an important component of population genetics research. One kind of signal is a trough in nearby neutral genetic variation due to selective fixation of initially rare alleles, a phenomenon known as "genetic hitchhiking." Accumulated evidence suggests that a considerable fraction of substitutions in the Drosophila genome results from positive selection, most of which are expected to have small selection coefficients and influence the population genetics of sites in the immediate vicinity. Using Drosophila melanogaster population genomic data, we found that the heterogeneity in synonymous polymorphism surrounding different categories of coding fixations is readily observable even within 25 bp of focal substitutions, which we interpret as the result of small-scale hitchhiking effects. The strength of natural selection on different sites appears to be quite heterogeneous. Particularly, neighboring fixations that changed amino acid polarities in a way that maintained the overall polarities of a protein were under stronger selection than other categories of fixations. Interestingly, we found that substitutions in slow-evolving genes are associated with stronger hitchhiking effects. This is consistent with the idea that adaptive evolution may involve few substitutions with large effects or many substitutions with small effects. Because our approach only weakly depends on the numbers of recent nonsynonymous substitutions, it can provide a complementary view to the adaptive evolution inferred by other divergence-based evolutionary genetic methods.
Collapse
Affiliation(s)
- Yuh Chwen G Lee
- Department of Evolution and Ecology and Center for Population Biology, University of California, Davis
|
61
|
Gleichmann T, Diensthuber RP, Möglich A. Charting the signal trajectory in a light-oxygen-voltage photoreceptor by random mutagenesis and covariance analysis. J Biol Chem 2013; 288:29345-55. [PMID: 24003219 DOI: 10.1074/jbc.m113.506139] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 11/06/2022]
Abstract
Modular signal receptors empower organisms to process environmental stimuli into adequate physiological responses. At the molecular level, a sensor module receives signals and processes the inherent information into changes of biological activity of an effector module. To better understand the molecular bases underpinning these processes, we analyzed signal reception and processing in the dimeric light-oxygen-voltage (LOV) blue light receptor YF1 that serves as a paradigm for the widespread Per-ARNT-Sim (PAS) signal receptors. Random mutagenesis identifies numerous YF1 variants in which biological activity is retained but where light regulation is abolished or inverted. One group of variants carries mutations within the LOV photosensor that disrupt proper coupling of the flavin-nucleotide chromophore to the protein scaffold. Another larger group bears mutations that cluster at the dyad interface and disrupt signal transmission to two coaxial coiled-coils that connect to the effector. Sequence covariation implies wide conservation of structural and mechanistic motifs, as also borne out by comparison to several PAS domains in which mutations leading to disruption of signal transduction consistently map to confined regions broadly equivalent to those identified in YF1. Not only do these data provide insight into general mechanisms of signal transduction, but also they establish concrete means for customized reprogramming of signal receptors.
Affiliation(s)
- Tobias Gleichmann
- From the Humboldt-Universität zu Berlin, Institut für Biologie, Biophysikalische Chemie, Invalidenstraße 42, 10115 Berlin, Germany
|
62
|
Feizi S, Marbach D, Médard M, Kellis M. Network deconvolution as a general method to distinguish direct dependencies in networks. Nat Biotechnol 2013; 31:726-33. [PMID: 23851448 PMCID: PMC3773370 DOI: 10.1038/nbt.2635] [Citation(s) in RCA: 136] [Impact Index Per Article: 12.4] [Received: 09/12/2012] [Accepted: 06/11/2013] [Indexed: 01/08/2023]
Abstract
Recognizing direct relationships between variables connected in a network is a pervasive problem in biological, social and information sciences as correlation-based networks contain numerous indirect relationships. Here we present a general method for inferring direct effects from an observed correlation matrix containing both direct and indirect effects. We formulate the problem as the inverse of network convolution, and introduce an algorithm that removes the combined effect of all indirect paths of arbitrary length in a closed-form solution by exploiting eigen-decomposition and infinite-series sums. We demonstrate the effectiveness of our approach in several network applications: distinguishing direct targets in gene expression regulatory networks; recognizing directly-interacting amino-acid residues for protein structure prediction from sequence alignments; and distinguishing strong collaborations in co-authorship social networks using connectivity information alone.
Affiliation(s)
- Soheil Feizi
- Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA
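The closed-form idea in the Feizi et al. abstract can be illustrated with a small sketch. It assumes the observed matrix is the sum of all powers of the direct matrix, G_obs = G_dir + G_dir^2 + ..., which converges when the largest absolute eigenvalue of G_dir is below 1 and then gives G_dir = G_obs (I + G_obs)^-1. The abstract describes an eigen-decomposition; this sketch swaps in a plain Gauss-Jordan inverse to stay within the standard library, and it omits any scaling or thresholding step. All function names are hypothetical.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_inv(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(map(float, row)) + unit for row, unit in zip(m, identity(n))]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def convolve(g_dir, max_power=200):
    """Forward model (assumption): G_obs = G_dir + G_dir^2 + ... (truncated)."""
    total = [row[:] for row in g_dir]
    term = [row[:] for row in g_dir]
    for _ in range(max_power - 1):
        term = mat_mul(term, g_dir)
        total = mat_add(total, term)
    return total


def deconvolve(g_obs):
    """Recover direct effects: G_dir = G_obs (I + G_obs)^-1."""
    return mat_mul(g_obs, mat_inv(mat_add(identity(len(g_obs)), g_obs)))


# Toy chain 0 - 1 - 2: nodes 0 and 2 are linked only indirectly.
g_dir = [[0.0, 0.4, 0.0],
         [0.4, 0.0, 0.3],
         [0.0, 0.3, 0.0]]
g_obs = convolve(g_dir)
recovered = deconvolve(g_obs)
```

In this toy chain the observed matrix carries a spurious 0-2 entry created by the path through node 1, and deconvolution removes it.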
|
63
|
Taylor WR, Hamilton RS, Sadowski MI. Prediction of contacts from correlated sequence substitutions. Curr Opin Struct Biol 2013; 23:473-9. [PMID: 23680395 DOI: 10.1016/j.sbi.2013.04.001] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 01/08/2013] [Revised: 03/12/2013] [Accepted: 04/02/2013] [Indexed: 11/26/2022]
Abstract
Recent work has led to a substantial improvement in the accuracy of predictions of contacts between amino acids using evolutionary information derived from multiple sequence alignments. Where large numbers of diverse sequence relatives are available and can be aligned to the sequence of a protein of unknown structure it is now possible to generate high-resolution models without recourse to the structure of a template. In this review we describe these exciting new techniques and critically assess the state-of-the-art in contact prediction in the light of these. While concentrating on methods, we also discuss applications to protein and RNA structure prediction as well as potential future developments.
Affiliation(s)
- William R Taylor
- Division of Mathematical Biology, MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK.
|
64
|
Protein structure prediction from sequence variation. Nat Biotechnol 2013; 30:1072-80. [PMID: 23138306 DOI: 10.1038/nbt.2419] [Citation(s) in RCA: 431] [Impact Index Per Article: 39.2] [Received: 08/28/2012] [Accepted: 10/15/2012] [Indexed: 02/07/2023]
Abstract
Genomic sequences contain rich evolutionary information about functional constraints on macromolecules such as proteins. This information can be efficiently mined to detect evolutionary couplings between residues in proteins and address the long-standing challenge to compute protein three-dimensional structures from amino acid sequences. Substantial progress has recently been made on this problem owing to the explosive growth in available sequences and the application of global statistical methods. In addition to three-dimensional structure, the improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes. We expect computation of covariation patterns to complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics.
|
65
|
Abstract
Co-evolution is a fundamental component of the theory of evolution and is essential for understanding the relationships between species in complex ecological networks. A wide range of co-evolution-inspired computational methods has been designed to predict molecular interactions, but it is only recently that important advances have been made. Breakthroughs in the handling of phylogenetic information and in disentangling indirect relationships have resulted in an improved capacity to predict interactions between proteins and contacts between different protein residues. Here, we review the main co-evolution-based computational approaches, their theoretical basis, potential applications and foreseeable developments.
Affiliation(s)
- David de Juan
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
|
66
|
Abstract
Recent work has led to a substantial improvement in the accuracy of predictions of contacts between amino acids using evolutionary information derived from multiple sequence alignments. Where large numbers of diverse sequence relatives are available and can be aligned to the sequence of a protein of unknown structure, it is now possible to generate high-resolution models without recourse to the structure of a template. In this review, we describe these exciting new techniques and critically assess the state of the art in contact prediction in light of them. We discuss areas for immediate research and development as well as potential future developments.
|
67
|
Jeong CS, Kim D. Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 2012; 25:705-13. [DOI: 10.1093/protein/gzs081] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]
|
68
|
The emergence of protein complexes: quaternary structure, dynamics and allostery. Colworth Medal Lecture. Biochem Soc Trans 2012; 40:475-91. [PMID: 22616857 DOI: 10.1042/bst20120056] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Indexed: 01/27/2023]
Abstract
All proteins require physical interactions with other proteins in order to perform their functions. Most of them oligomerize into homomers, and a vast majority of these homomers interact with other proteins, at least part of the time, forming transient or obligate heteromers. In the present paper, we review the structural, biophysical and evolutionary aspects of these protein interactions. We discuss how protein function and stability benefit from oligomerization, as well as evolutionary pathways by which oligomers emerge, mostly from the perspective of homomers. Finally, we emphasize the specificities of heteromeric complexes and their structure and evolution. We also discuss two analytical approaches increasingly being used to study protein structures as well as their interactions. First, we review the use of the biological networks and graph theory for analysis of protein interactions and structure. Secondly, we discuss recent advances in techniques for detecting correlated mutations, with the emphasis on their role in identifying pathways of allosteric communication.
|
69
|
Gültas M, Haubrock M, Tüysüz N, Waack S. Coupled mutation finder: a new entropy-based method quantifying phylogenetic noise for the detection of compensatory mutations. BMC Bioinformatics 2012; 13:225. [PMID: 22963049 PMCID: PMC3577461 DOI: 10.1186/1471-2105-13-225] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 05/18/2012] [Accepted: 08/23/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND The detection of significant compensatory mutation signals in multiple sequence alignments (MSAs) is often complicated by noise. A challenging problem in bioinformatics remains the separation of significant signals between two or more non-conserved residue sites from the phylogenetic noise and unrelated pair signals. Determination of these non-conserved residue sites is as important as the recognition of strictly conserved positions for understanding of the structural basis of protein functions and identification of functionally important residue regions. In this study, we developed a new method, the Coupled Mutation Finder (CMF), quantifying the phylogenetic noise for the detection of compensatory mutations. RESULTS To demonstrate the effectiveness of this method, we analyzed essential sites of two human proteins: epidermal growth factor receptor (EGFR) and glucokinase (GCK). Our results suggest that the CMF is able to separate significant compensatory mutation signals from the phylogenetic noise and unrelated pair signals. The vast majority of compensatory mutation sites found by the CMF are related to essential sites of both proteins and they are likely to affect protein stability or functionality. CONCLUSIONS The CMF is a new method, which includes an MSA-specific statistical model based on multiple testing procedures that quantify the error made in terms of the false discovery rate and a novel entropy-based metric to upscale BLOSUM62 dissimilar compensatory mutations. Therefore, it is a helpful tool to predict and investigate compensatory mutation sites of structural or functional importance in proteins. We suggest that the CMF could be used as a novel automated function prediction tool that is required for a better understanding of the structural basis of proteins. The CMF server is freely accessible at http://cmf.bioinf.med.uni-goettingen.de.
Affiliation(s)
- Mehmet Gültas
- Institute of Computer Science, University of Göttingen, Goldschmidtstr. 7, Göttingen, 37077, Germany.
|
70
|
Akashi H, Osada N, Ohta T. Weak selection and protein evolution. Genetics 2012; 192:15-31. [PMID: 22964835 PMCID: PMC3430532 DOI: 10.1534/genetics.112.140178] [Citation(s) in RCA: 91] [Impact Index Per Article: 7.6] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 01/23/2023]
Abstract
The "nearly neutral" theory of molecular evolution proposes that many features of genomes arise from the interaction of three weak evolutionary forces: mutation, genetic drift, and natural selection acting at its limit of efficacy. Such forces generally have little impact on allele frequencies within populations from generation to generation but can have substantial effects on long-term evolution. The evolutionary dynamics of weakly selected mutations are highly sensitive to population size, and near neutrality was initially proposed as an adjustment to the neutral theory to account for general patterns in available protein and DNA variation data. Here, we review the motivation for the nearly neutral theory, discuss the structure of the model and its predictions, and evaluate current empirical support for interactions among weak evolutionary forces in protein evolution. Near neutrality may be a prevalent mode of evolution across a range of functional categories of mutations and taxa. However, multiple evolutionary mechanisms (including adaptive evolution, linked selection, changes in fitness-effect distributions, and weak selection) can often explain the same patterns of genome variation. Strong parameter sensitivity remains a limitation of the nearly neutral model, and we discuss concave fitness functions as a plausible underlying basis for weak selection.
Affiliation(s)
- Hiroshi Akashi
- Division of Evolutionary Genetics, Department of Population Genetics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan.
|
71
|
Kensche PR, Duarte I, Huynen MA. A three-dimensional topology of complex I inferred from evolutionary correlations. BMC Structural Biology 2012; 12:19. [PMID: 22857522 PMCID: PMC3436739 DOI: 10.1186/1472-6807-12-19] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 02/13/2012] [Accepted: 06/28/2012] [Indexed: 11/22/2022]
Abstract
Background The quaternary structure of eukaryotic NADH:ubiquinone oxidoreductase (complex I), the largest complex of the oxidative phosphorylation, is still mostly unresolved. Furthermore, it is unknown where transiently bound assembly factors interact with complex I. We therefore asked whether the evolution of complex I contains information about its 3D topology and the binding positions of its assembly factors. We approached these questions by correlating the evolutionary rates of eukaryotic complex I subunits using the mirror-tree method and mapping the results into a 3D representation by multidimensional scaling. Results More than 60% of the evolutionary correlation among the conserved seven subunits of the complex I matrix arm can be explained by the physical distance between the subunits. The three-dimensional evolutionary model of the eukaryotic conserved matrix arm has a striking similarity to the matrix arm quaternary structure in the bacterium Thermus thermophilus (rmsd = 19 Å) and supports the previous finding that in eukaryotes the N-module is turned relative to the Q-module when compared to bacteria. By contrast, the evolutionary rates contained little information about the structure of the membrane arm. A large evolutionary model of 45 subunits and assembly factors allows prediction of subunit positions and interactions (rmsd = 52.6 Å). The model supports an interaction of NDUFAF3, C8orf38 and C2orf56 during the assembly of the proximal matrix arm and the membrane arm. The model further suggests a tight relationship between the assembly factor NUBPL and NDUFA2, which both have been linked to iron-sulfur cluster assembly, as well as between NDUFA12 and its paralog, the assembly factor NDUFAF2. Conclusions The physical distance between subunits of complex I is a major correlate of the rate of protein evolution in the complex I matrix arm and is sufficient to infer parts of the complex’s structure with high accuracy. The resulting evolutionary model predicts the positions of a number of subunits and assembly factors.
Affiliation(s)
- Philip R Kensche
- Center for Molecular and Biomolecular Informatics/Nijmegen Center for Molecular Life Sciences, Radboud University Medical Center, PO Box 9101, Nijmegen, HB, 6500, The Netherlands.
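The correlation step named in the Kensche et al. abstract can be sketched as follows, under the assumption that a mirror-tree style score is the Pearson correlation between the corresponding off-diagonal entries of two subunits' species-by-species evolutionary distance matrices. The function names and the toy matrices are hypothetical, and the multidimensional scaling step is not shown.

```python
from math import sqrt


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / sqrt(var_x * var_y)


def mirror_tree_score(dist_a, dist_b):
    """Correlate two distance matrices built over the same ordered species set."""
    if len(dist_a) != len(dist_b):
        raise ValueError("matrices must cover the same species")
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy data: subunit B evolves twice as fast as A but with the same pattern,
# while subunit C has a different pattern of distances across the same species.
dist_a = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
dist_b = [[0, 2, 8], [2, 0, 6], [8, 6, 0]]
dist_c = [[0, 4, 1], [4, 0, 3], [1, 3, 0]]
score_ab = mirror_tree_score(dist_a, dist_b)
score_ac = mirror_tree_score(dist_a, dist_c)
```

A score near 1 indicates that the two subunits accumulate change in parallel across species, which is the signal the abstract relates to physical proximity.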
|
72
|
Dietrich S, Borst N, Schlee S, Schneider D, Janda JO, Sterner R, Merkl R. Experimental assessment of the importance of amino acid positions identified by an entropy-based correlation analysis of multiple-sequence alignments. Biochemistry 2012; 51:5633-41. [PMID: 22737967 DOI: 10.1021/bi300747r] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
The analysis of a multiple-sequence alignment (MSA) with correlation methods identifies pairs of residue positions whose occupation with amino acids changes in a concerted manner. It is plausible to assume that positions that are part of many such correlation pairs are important for protein function or stability. We have used the algorithm H2r to identify positions k in the MSAs of the enzymes anthranilate phosphoribosyl transferase (AnPRT) and indole-3-glycerol phosphate synthase (IGPS) that show a high conn(k) value, i.e., a large number of significant correlations in which k is involved. The importance of the identified residues was experimentally validated by performing mutagenesis studies with sAnPRT and sIGPS from the archaeon Sulfolobus solfataricus. For sAnPRT, five H2r mutant proteins were generated by replacing nonconserved residues with alanine or the prevalent residue of the MSA. As a control, five residues with conn(k) values of zero were chosen randomly and replaced with alanine. The catalytic activities and conformational stabilities of the H2r and control mutant proteins were analyzed by steady-state enzyme kinetics and thermal unfolding studies. Compared to wild-type sAnPRT, the catalytic efficiencies (k(cat)/K(M)) were largely unaltered. In contrast, the apparent thermal unfolding temperature (T(M)(app)) was lowered in most proteins. Remarkably, the strongest observed destabilization (ΔT(M)(app) = 14 °C) was caused by the V284A exchange, which pertains to the position with the highest correlation signal [conn(k) = 11]. For sIGPS, six H2r mutant and four control proteins with alanine exchanges were generated and characterized. The k(cat)/K(M) values of four H2r mutant proteins were reduced between 13- and 120-fold, and their T(M)(app) values were decreased by up to 5 °C. For the sIGPS control proteins, the observed activity and stability decreases were much less severe. Our findings demonstrate that positions with high conn(k) values have an increased probability of being important for enzyme function or stability.
Affiliation(s)
- Susanne Dietrich
- Institute of Biophysics and Physical Biochemistry, University of Regensburg, Universitätsstrasse 31, D-93053 Regensburg, Germany
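The conn(k) quantity defined in the Dietrich et al. abstract, the number of significant correlations in which position k is involved, reduces to a counting step once the significant pairs are known. The sketch below assumes those pairs are already available as a list of position tuples; how H2r decides significance is not reproduced, and the function name and positions are hypothetical.

```python
from collections import Counter


def conn_values(significant_pairs):
    """Count, for each position k, the significant pairs that include k."""
    counts = Counter()
    for i, j in significant_pairs:
        counts[i] += 1
        counts[j] += 1
    return counts


# Hypothetical significant pairs of alignment positions.
pairs = [(10, 25), (10, 40), (25, 40), (10, 77)]
conn = conn_values(pairs)
ranked = [k for k, _ in conn.most_common()]  # positions ordered by conn(k)
```

Positions absent from every pair get conn(k) = 0 from the `Counter`, which matches the control residues the abstract describes.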
|
73
|
Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc Natl Acad Sci U S A 2012; 109:E1733-42. [PMID: 22670053 DOI: 10.1073/pnas.1201301109] [Citation(s) in RCA: 114] [Impact Index Per Article: 9.5] [Indexed: 12/27/2022]
Abstract
Signal transduction proteins such as bacterial sensor histidine kinases, designed to transition between multiple conformations, are often ruled by unstable transient interactions making structural characterization of all functional states difficult. This study explored the inactive and signal-activated conformational states of the two catalytic domains of sensor histidine kinases, HisKA and HATPase. Direct coupling analysis, a global statistical inference approach, was applied to >13,000 such domains from protein databases to identify residue contacts between the two domains. These contacts guided structural assembly of the domains using MAGMA, an advanced molecular dynamics docking method. The active conformation structure generated by MAGMA simultaneously accommodated the sequence-derived residue contacts and the ATP-catalytic histidine contact. The validity of this structure was confirmed biologically by mutation of contact positions in the Bacillus subtilis sensor histidine kinase KinA and by restoration of activity in an inactive KinA(HisKA):KinD(HATPase) hybrid protein. These data indicate that signals binding to sensor domains activate sensor histidine kinases by causing localized strain and unwinding at the end of the C-terminal helix of the HisKA domain. This destabilizes the contact positions of the inactive conformation of the two domains, identified by previous crystal structure analyses and by the sequence analysis described here, inducing the formation of the active conformation. This study reveals that structures of unstable transient complexes of interacting proteins and of protein domains are accessible by applying this combination of cross-validating technologies.
|
74
|
Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis. Proc Natl Acad Sci U S A 2012; 109:E1540-7. [PMID: 22645369 DOI: 10.1073/pnas.1120036109] [Citation(s) in RCA: 164] [Impact Index Per Article: 13.7] [Indexed: 11/18/2022]
Abstract
A new de novo protein structure prediction method for transmembrane proteins (FILM3) is described that is able to accurately predict the structures of large membrane protein domains using an ensemble of two secondary structure prediction methods to guide fragment selection in combination with a scoring function based solely on correlated mutations detected in multiple sequence alignments. This approach has been validated by generating models for 28 membrane proteins with a diverse range of complex topologies and an average length of over 300 residues, with results showing that TM-scores > 0.5 can be achieved in almost every case following refinement using MODELLER. In one of the most impressive results, a model of mitochondrial cytochrome c oxidase polypeptide I was obtained with a TM-score > 0.75 and an rmsd of only 5.7 Å over all 514 residues. These results suggest that FILM3 could be applicable to a wide range of transmembrane proteins of as-yet-unknown 3D structure given sufficient homologous sequences.
|
75
|
Gulyás-Kovács A. Integrated analysis of residue coevolution and protein structure in ABC transporters. PLoS One 2012; 7:e36546. [PMID: 22590562 PMCID: PMC3348156 DOI: 10.1371/journal.pone.0036546] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 02/07/2012] [Accepted: 04/06/2012] [Indexed: 12/22/2022]
Abstract
Intraprotein side chain contacts can couple the evolutionary process of amino acid substitution at one position to that at another. This coupling, known as residue coevolution, may vary in strength. Conserved contacts thus not only define 3-dimensional protein structure, but also indicate which residue-residue interactions are crucial to a protein's function. Therefore, prediction of strongly coevolving residue-pairs helps clarify molecular mechanisms underlying function. Previously, various coevolution detectors have been employed separately to predict these pairs purely from multiple sequence alignments, while disregarding available structural information. This study introduces an integrative framework that improves the accuracy of such predictions, relative to previous approaches, by combining multiple coevolution detectors and incorporating structural contact information. This framework is applied to the ABC-B and ABC-C transporter families, which include the drug exporter P-glycoprotein involved in multidrug resistance of cancer cells, as well as the CFTR chloride channel linked to cystic fibrosis disease. The predicted coevolving pairs are further analyzed based on conformational changes inferred from outward- and inward-facing transporter structures. The analysis suggests that some pairs coevolved to directly regulate conformational changes of the alternating-access transport mechanism, while others to stabilize rigid-body-like components of the protein structure. Moreover, some identified pairs correspond to residues previously implicated in cystic fibrosis.
Affiliation(s)
- Attila Gulyás-Kovács
- Laboratory of Cardiac/Membrane Physiology, Rockefeller University, New York, New York, United States of America.
|
76
|
Patsalo V, Raleigh DP, Green DF. Rational and computational design of stabilized variants of cyanovirin-N that retain affinity and specificity for glycan ligands. Biochemistry 2011; 50:10698-712. [PMID: 22032696 PMCID: PMC3234137 DOI: 10.1021/bi201411c] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 01/18/2023]
Abstract
Cyanovirin-N (CVN) is an 11 kDa pseudosymmetric cyanobacterial lectin that has been shown to inhibit infection by the human immunodeficiency virus by binding to high-mannose oligosaccharides on the surface of the viral envelope glycoprotein gp120. In this work, we describe rationally designed CVN variants that stabilize the protein fold while maintaining high affinity and selectivity for their glycan targets. Poisson-Boltzmann calculations and protein repacking algorithms were used to select stabilizing mutations in the protein core. By substituting the buried polar side chains of Ser11, Ser20, and Thr61 with aliphatic groups, we stabilized CVN by nearly 12 °C against thermal denaturation, and by 1 M GuaHCl against chemical denaturation, relative to a previously characterized stabilized mutant. Glycan microarray binding experiments confirmed that the specificity profile of carbohydrate binding is unperturbed by the mutations and is identical for all variants. In particular, the variants selectively bound glycans containing the Manα(1→2)Man linkage, which is the known minimal binding unit of CVN. We also report the slow denaturation kinetics of CVN and show that they can complicate thermodynamic analysis; in particular, the unfolding of CVN cannot be described as a fixed two-state transition. Accurate thermodynamic parameters are needed to describe the complicated free energy landscape of CVN, and we provide updated values for CVN unfolding.
Affiliation(s)
- Vadim Patsalo
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York 11794, USA
- Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, New York 11794, USA
- Daniel P. Raleigh
- Department of Chemistry, Stony Brook University, Stony Brook, New York 11794, USA
- Graduate Program in Biochemistry and Structural Biology, Stony Brook University, Stony Brook, New York 11794, USA
- David F. Green
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York 11794, USA
- Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, New York 11794, USA
- Department of Chemistry, Stony Brook University, Stony Brook, New York 11794, USA
- Graduate Program in Biochemistry and Structural Biology, Stony Brook University, Stony Brook, New York 11794, USA
|
77
|
Henriksen SB, Mortensen RJ, Geertz-Hansen HM, Neves-Petersen MT, Arnason O, Söring J, Petersen SB. Hyperdimensional analysis of amino acid pair distributions in proteins. PLoS One 2011; 6:e25638. [PMID: 22174733 PMCID: PMC3235099 DOI: 10.1371/journal.pone.0025638] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/26/2011] [Accepted: 09/08/2011] [Indexed: 01/06/2023]
Abstract
Our manuscript presents a novel approach to protein structure analyses. We have organized an 8-dimensional data cube with protein 3D-structural information from 8706 high-resolution non-redundant protein chains with the aim of identifying packing rules at the amino acid pair level. The cube contains information about amino acid type, solvent accessibility, spatial and sequence distance, secondary structure and sequence length. We are able to pose structural queries to the data cube using the program ProPack. The response is a 1D, 2D or 3D graph. Whereas the response is of a statistical nature, the user can obtain an instant list of all PDB structures where such a pair is found. The user may select a particular structure, which is displayed highlighting the pair in question. The user may pose millions of different queries and will receive the answer to each one in a few seconds. In order to demonstrate the capabilities of the data cube as well as the programs, we have selected well-known structural features, disulphide bridges and salt bridges, where we illustrate how the queries are posed and how answers are given. Motifs involving cysteines such as disulphide bridges, zinc-fingers and iron-sulfur clusters are clearly identified and differentiated. ProPack also reveals that whereas pairs of Lys residues virtually never appear in close spatial proximity, pairs of Arg are abundant and appear at close spatial distance, contrasting the belief that electrostatic repulsion would prevent this juxtaposition and that Arg-Lys is perceived as a conservative mutation. The presented programs can find and visualize novel packing preferences in protein structures, allowing the user to unravel correlations between pairs of amino acids. The new tools allow the user to view statistical information and visualize instantly the structures that underpin the statistical information, which is far from trivial with most other software tools for protein structure analysis.
Affiliation(s)
- Svend B. Henriksen
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Rasmus J. Mortensen
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Henrik M. Geertz-Hansen
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Maria Teresa Neves-Petersen
- International Iberian Nanotechnol Lab (INL), Braga, Portugal
- Nanobiotechnology Group, Department of Biotechnology, Chemistry and Environmental Sciences, University of Aalborg, Aalborg, Denmark
- Omar Arnason
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Jón Söring
- NanoBiotechnology Group, Department of Physics and Nanotechnology, Aalborg University, Aalborg, Denmark
- Steffen B. Petersen
- Nanobiotechnology Group, Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- The Institute for Lasers, Photonics and Biophotonics, University at Buffalo, The State University of New York, Buffalo, New York, United States of America
|
78
|
Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C. Protein 3D structure computed from evolutionary sequence variation. PLoS One 2011; 6:e28766. [PMID: 22163331 PMCID: PMC3233603 DOI: 10.1371/journal.pone.0028766] [Citation(s) in RCA: 748] [Impact Index Per Article: 57.5] [Received: 11/10/2011] [Accepted: 11/14/2011] [Indexed: 11/19/2022]
Abstract
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.
Affiliation(s)
- Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America.
|
79
|
Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 2011; 108:E1293-301. [PMID: 22106262 DOI: 10.1073/pnas.1111471108] [Citation(s) in RCA: 894] [Impact Index Per Article: 68.8] [Indexed: 11/18/2022]
Abstract
The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
|
80
|
Jones DT, Buchan DWA, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 2011; 28:184-90. [PMID: 22101153 DOI: 10.1093/bioinformatics/btr638] [Citation(s) in RCA: 529] [Impact Index Per Article: 40.7] [Indexed: 11/14/2022]
Abstract
Motivation: The accurate prediction of residue-residue contacts, critical for maintaining the native fold of a protein, remains an open problem in the field of structural bioinformatics. Interest in this long-standing problem has increased recently with algorithmic improvements and the rapid growth in the sizes of sequence families. Progress could have major impacts in both structure and function prediction to name but two benefits. Sequence-based contact predictions are usually made by identifying correlated mutations within multiple sequence alignments (MSAs), most commonly through the information-theoretic approach of calculating mutual information between pairs of sites in proteins. These predictions are often inaccurate because the true covariation signal in the MSA is often masked by biases from many ancillary indirect-coupling or phylogenetic effects. Here we present a novel method, PSICOV, which introduces the use of sparse inverse covariance estimation to the problem of protein contact prediction. Our method builds on work which had previously demonstrated corrections for phylogenetic and entropic correlation noise and allows accurate discrimination of direct from indirectly coupled mutation correlations in the MSA. Results: PSICOV displays a mean precision substantially better than the best performing normalized mutual information approach and Bayesian networks. For 118 out of 150 targets, the L/5 (i.e. top-L/5 predictions for a protein of length L) precision for long-range contacts (sequence separation >23) was ≥ 0.5, which represents an improvement sufficient to be of significant benefit in protein structure prediction or model quality assessment. Availability: The PSICOV source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/PSICOV.
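The top-L/5 long-range precision quoted in this abstract can be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the function name `top_l_precision`, the dict and set input formats, and the 0-based residue indices are assumptions made for illustration. Only the cut-offs (top L/5 predictions, sequence separation >23) come from the abstract.

```python
def top_l_precision(scores, contacts, length, k=5, min_sep=24):
    """Precision of the top-L/k predicted pairs among long-range pairs.

    scores:   dict mapping (i, j) residue index pairs (i < j) to a predicted score
              (hypothetical input format)
    contacts: set of (i, j) pairs that are true contacts in the native structure
    length:   protein length L
    min_sep:  minimum sequence separation; separation > 23 is the same as >= 24
    """
    # Keep only long-range pairs.
    long_range = [(pair, s) for pair, s in scores.items()
                  if pair[1] - pair[0] >= min_sep]
    # Rank by decreasing score and keep the top L/k predictions.
    long_range.sort(key=lambda item: item[1], reverse=True)
    n_top = max(1, length // k)
    top = [pair for pair, _ in long_range[:n_top]]
    if not top:
        return 0.0
    # Precision = fraction of the selected predictions that are true contacts.
    hits = sum(1 for pair in top if pair in contacts)
    return hits / len(top)
```

Short-range pairs are discarded before ranking, so a high-scoring pair such as residues 2 and 5 never counts towards the top-L/5 list.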
Affiliation(s)
- David T Jones
- Department of Computer Science, Bioinformatics Group, Centre for Computational Statistics and Machine Learning, University College London, Malet Place, London WC1E 6BT, UK.
|
81
|
Dutheil JY. Detecting coevolving positions in a molecule: why and how to account for phylogeny. Brief Bioinform 2011; 13:228-43. [PMID: 21949241 DOI: 10.1093/bib/bbr048] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022]
Abstract
Positions in a molecule that share a common constraint do not evolve independently, and therefore leave a signature in the patterns of homologous sequences. Exhibiting such positions with a coevolution pattern from a sequence alignment has great potential for predicting functional and structural properties of molecules through comparative analysis. This task is complicated by the existence of additional correlation sources, leading to false predictions. The nature of the data is a major source of noise correlation: sequences are taken from individuals with different degrees of relatedness, and who therefore are intrinsically correlated. This has led to several method developments in different fields that are potentially confusing for non-expert users interested in these methodologies. It also explains why coevolution detection methods are largely unemployed despite the importance of the biological questions they address. In this article, I focus on the role of shared ancestry for understanding molecular coevolution patterns. I review and classify existing coevolution detection methods according to their ability to handle shared ancestry. Using a ribosomal RNA benchmark data set, for which detailed knowledge of the structure and coevolution patterns is available, I demonstrate and explain why taking the underlying evolutionary history of sequences into account is the only way to extract the full coevolution signal in the data. I also evaluate, using rigorous statistical procedures, the best approaches to do so, and discuss several important biological aspects to consider when performing coevolution analyses.
Affiliation(s)
- Julien Y Dutheil
- Institut des Sciences de l'Evolution - Montpellier (I.S.E.-M.) Unité Mixte de Recherche UMII - CNRS (UMR 5554) Université de Montpellier II - CC 065 34095 Montpellier Cedex 05.
|
82
|
Sadowski MI, Maksimiak K, Taylor WR. Direct correlation analysis improves fold recognition. Comput Biol Chem 2011; 35:323-32. [PMID: 22000804 PMCID: PMC3267019 DOI: 10.1016/j.compbiolchem.2011.08.002] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 07/15/2011] [Revised: 08/11/2011] [Accepted: 08/11/2011] [Indexed: 11/23/2022]
Abstract
The extraction of correlated mutations through the method of direct information (DI) provides predicted contact residue pairs that can be used to constrain the three dimensional structures of proteins. We apply this method to a large set of decoy protein folds consisting of many thousand well-constructed models, only tens of which have the correct fold. We find that DI is able to greatly improve the ranking of the true (native) fold but others still remain high scoring that would be difficult to discard due to small shifts in the core beta sheets.
Affiliation(s)
- William R. Taylor
- Corresponding author. Tel.: +44 208 816 2298; fax: +44 208 816 2460.
|
83
|
Jeon J, Nam HJ, Choi YS, Yang JS, Hwang J, Kim S. Molecular evolution of protein conformational changes revealed by a network of evolutionarily coupled residues. Mol Biol Evol 2011; 28:2675-85. [PMID: 21470969 DOI: 10.1093/molbev/msr094] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Indexed: 12/25/2022]
Abstract
An improved understanding of protein conformational changes has broad implications for elucidating the mechanisms of various biological processes and for the design of protein engineering experiments. Understanding rearrangements of residue interactions is a key component in the challenge of describing structural transitions. Evolutionary properties of protein sequences and structures are extensively studied; however, evolution of protein motions, especially with respect to interaction rearrangements, has yet to be explored. Here, we investigated the relationship between sequence evolution and protein conformational changes and discovered that structural transitions are encoded in amino acid sequences as coevolving residue pairs. Furthermore, we found that highly coevolving residues are clustered in the flexible regions of proteins and facilitate structural transitions by forming and disrupting their interactions cooperatively. Our results provide insight into the evolution of protein conformational changes and help to identify residues important for structural transitions.
Affiliation(s)
- Jouhyun Jeon
- Division of Molecular and Life Science, Pohang University of Science and Technology, Pohang, Korea
|
84
|
Callahan B, Neher RA, Bachtrog D, Andolfatto P, Shraiman BI. Correlated evolution of nearby residues in Drosophilid proteins. PLoS Genet 2011; 7:e1001315. [PMID: 21383965 PMCID: PMC3044683 DOI: 10.1371/journal.pgen.1001315] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Received: 04/30/2010] [Accepted: 01/19/2011] [Indexed: 11/19/2022]
Abstract
Here we investigate the correlations between coding sequence substitutions as a function of their separation along the protein sequence. We consider both substitutions between the reference genomes of several Drosophilids as well as polymorphisms in a population sample of Zimbabwean Drosophila melanogaster. We find that amino acid substitutions are “clustered” along the protein sequence, that is, the frequency of additional substitutions is strongly enhanced within ≈10 residues of a first such substitution. No such clustering is observed for synonymous substitutions, supporting a “correlation length” associated with selection on proteins as the causative mechanism. Clustering is stronger between substitutions that arose in the same lineage than it is between substitutions that arose in different lineages. We consider several possible origins of clustering, concluding that epistasis (interactions between amino acids within a protein that affect function) and positional heterogeneity in the strength of purifying selection are primarily responsible. The role of epistasis is directly supported by the tendency of nearby substitutions that arose on the same lineage to preserve the total charge of the residues within the correlation length and by the preferential cosegregation of neighboring derived alleles in our population sample. We interpret the observed length scale of clustering as a statistical reflection of the functional locality (or modularity) of proteins: amino acids that are near each other on the protein backbone are more likely to contribute to, and collaborate toward, a common subfunction.
Genes are templates for proteins, yet evolutionary studies of genes and proteins often bear little resemblance. Analyses of gene evolution typically treat each codon independently, quantifying gene evolution by summing over the constituent codons. In contrast, studies of protein evolution generally incorporate protein structure and interactions between amino acids explicitly. We investigate correlations in the evolution of codons as a function of their distance from each other along the protein coding sequence. This approach is motivated by the expectation that codons near each other in sequence often encode amino acids belonging to the same functional unit. Consequently, these amino acids are more likely to interact and/or experience similar selective regimes, introducing correlation between the evolution of the underlying codons. We find codon evolution in Drosophilids to be correlated over a characteristic length scale of ≈10 codons. Specifically, the presence of a non-synonymous substitution substantially increases the probability of further such substitutions nearby, particularly within that lineage. Further analysis suggests both functional interactions between amino acids and correlation in the strength of selection contribute to this effect. These findings are relevant for understanding the relative importance of different modes of selection, and particularly the role of epistasis, in gene and protein evolution.
Affiliation(s)
- Benjamin Callahan
- Department of Applied Physics, Stanford University, Stanford, California, United States of America.
|
85
|
Within-host co-evolution of Gag P453L and protease D30N/N88D demonstrates virological advantage in a highly protease inhibitor-exposed HIV-1 case. Antiviral Res 2011; 90:33-41. [PMID: 21338625 DOI: 10.1016/j.antiviral.2011.02.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 09/07/2010] [Revised: 12/28/2010] [Accepted: 02/11/2011] [Indexed: 11/22/2022]
Abstract
To better understand the mechanism of HIV group-specific antigen (Gag) and protease (PR) co-evolution in drug-resistance acquisition, we analyzed a drug-resistance case by both bioinformatics and virological methods. We especially considered the quality of sequence data and analytical accuracy by introducing single-genome sequencing (SGS) and Spidermonkey/Bayesian graphical models (BGM) analysis, respectively. We analyzed 129 HIV-1 Gag-PR linkage sequences obtained from 8 time points, and the resulting sequences were applied to the Spidermonkey co-evolution analysis program, which identified ten mutation pairs as significantly co-evolving. Among these, we focused on associations between Gag-P453L, the P5' position of the p1/p6 cleavage-site mutation, and PR-D30N/N88D nelfinavir-resistant mutations, and attempted to clarify their virological significance in vitro by constructing recombinant clones. The results showed that P453L(Gag) has the potential to improve replication capacity and the Gag processing efficiency of viruses with D30N(PR)/N88D(PR) but has little effect on nelfinavir susceptibility. Homology modeling analysis suggested that hydrogen bonds between the 30th PR residue and the R452Gag are disturbed by the D30N(PR) mutation, but the impaired interaction is compensated by P453L(Gag) generating new hydrophobic interactions. Furthermore, database analysis indicated that the P453L(Gag)/D30N(PR)/N88D(PR) association was not specific only to our clinical case, but was common among AIDS patients.
|
86
|
Du QS, Wang CH, Liao SM, Huang RB. Correlation analysis for protein evolutionary family based on amino acid position mutations and application in PDZ domain. PLoS One 2010; 5:e13207. [PMID: 20949088 PMCID: PMC2950854 DOI: 10.1371/journal.pone.0013207] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 07/06/2010] [Accepted: 09/10/2010] [Indexed: 11/18/2022]
Abstract
Background: It has been widely recognized that mutations in specific directions are caused by functional constraints in a protein family, and that directional mutations at certain positions control the evolutionary direction of the protein family. Mutations at different positions, even distantly separated ones, are mutually coupled and form an evolutionary network. Finding the controlling mutative positions and the mutative network among residues is of primary importance for rational protein design and enzyme engineering. Methodology: A computational approach, namely amino acid position conservation-mutation correlation analysis (CMCA), is developed to predict mutually mutative positions and find the evolutionary network in a protein family. The amino acid position mutative function, which is the foundational equation of CMCA measuring the mutation of a residue at a position, is derived from the MSA (multiple structure alignment) database of the protein evolutionary family. The position conservation correlation matrix and position mutation correlation matrix are then constructed from the amino acid position mutative equation. Unlike the traditional SCA (statistical coupling analysis) approach, which is based on the statistical analysis of position conservations, CMCA focuses on the correlation analysis of position mutations. Conclusions: As an example, the CMCA approach is used to study the PDZ domain protein family; the results illustrate the long-range allosteric mechanism in the PDZ protein family and identify the functional mutative network among residues. We expect that the CMCA approach may find applications in protein engineering studies and suggest new strategies to improve the bioactivities and physicochemical properties of enzymes.
Affiliation(s)
- Qi-Shi Du
- State Key Laboratory of Bioenergy Enzyme Technology, National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi, China.
|
87
|
Bagowski CP, Bruins W, te Velthuis AJ. The nature of protein domain evolution: shaping the interaction network. Curr Genomics 2010; 11:368-76. [PMID: 21286315 PMCID: PMC2945003 DOI: 10.2174/138920210791616725] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Received: 05/08/2010] [Revised: 06/04/2010] [Accepted: 06/13/2010] [Indexed: 11/30/2022]
Abstract
The proteomes that make up the collection of proteins in contemporary organisms evolved through recombination and duplication of a limited set of domains. These protein domains are essentially the main components of globular proteins and are the most principal level at which protein function and protein interactions can be understood. An important aspect of domain evolution is their atomic structure and biochemical function, which are both specified by the information in the amino acid sequence. Changes in this information may bring about new folds, functions and protein architectures. With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. Such investigations not only help predict the function of newly discovered proteins, but also assist in mapping unforeseen pathways of evolution and reveal crucial, co-evolving inter- and intra-molecular interactions. In turn this will help us describe how protein domains shaped cellular interaction networks and the dynamics with which they are regulated in the cell. Additionally, these studies can be used for the design of new and optimized protein domains for therapy. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks.
Affiliation(s)
- Christoph P Bagowski
- German University Cairo, Faculty of Pharmacy and Biotechnology, New Cairo City, Egypt
- Wouter Bruins
- Institute of Biology, Leiden University, 2333 AL Leiden, The Netherlands
- Aartjan J.W te Velthuis
- Department of Medical Microbiology, Molecular Virology Laboratory, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands
- Department of Bionanoscience, Delft University of Technology, Lorentzweg 1, 2628 CJ, Delft, The Netherlands
|
88
|
Jeong CS, Kim D. Linear predictive coding representation of correlated mutation for protein sequence alignment. BMC Bioinformatics 2010; 11 Suppl 2:S2. [PMID: 20406500 PMCID: PMC3165164 DOI: 10.1186/1471-2105-11-s2-s2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022]
Abstract
Background: Although both conservation and correlated mutation (CM) carry important information reflecting different sorts of context in a multiple sequence alignment, most alignment methods use sequence profiles that only represent conservation. There is as yet no general way to represent correlated mutation and incorporate it into sequence alignment. Methods: We develop a novel method, CM profile, to represent correlated mutation as a spectral feature derived using linear predictive coding, in which correlated mutations among different positions are represented by a fixed number of values. We combine the CM profile with a conventional sequence profile to improve alignment quality. Results: For distantly related protein pairs, using the CM profile improves profile-profile alignment with or without predicted secondary structure. In particular, at the superfamily level, combining the CM profile with the sequence profile improves profile-profile alignment by 9.5%, while predicted secondary structure does so by 6.0%. More significantly, using both of them improves profile-profile alignment by 13.9%. We also exemplify the effectiveness of the CM profile by demonstrating that the resulting alignment preserves shared coevolution and contacts. Conclusions: In this work, we introduce a novel method, CM profile, which represents correlated mutation information in a parallel form, and apply it to the protein sequence alignment problem. When combined with a conventional sequence profile, the CM profile improves alignment quality significantly more than predicted secondary structure information does, which should be beneficial for target-template alignment in protein structure prediction. Because of its generality, the CM profile can be used in other bioinformatics applications in the same way as a sequence profile.
Affiliation(s)
- Chan-seok Jeong
- Department of Bio and Brain Engineering, KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon, 305-701, Korea
|
89
|
Xu Y, Tillier ERM. Regional covariation and its application for predicting protein contact patches. Proteins 2010; 78:548-58. [PMID: 19768681 DOI: 10.1002/prot.22576] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Abstract
Correlated mutation analysis (CMA) is an effective approach for predicting functional and structural residue interactions from multiple sequence alignments (MSAs) of proteins. As nearby residues may also play a role in a given functional interaction, we were interested in seeing whether covarying sites were clustered, and whether this could be used to enhance the predictive power of CMA. A large-scale search for coevolving regions within protein domains revealed that if two sites in an MSA covary, then neighboring sites in the alignment also typically covary, resulting in clusters of covarying residues. The program PatchD (http://www.uhnres.utoronto.ca/labs/tillier/) was developed to measure the covariation between disconnected sequence clusters to reveal patch covariation. Patches that exhibit strong covariation identify multiple residues that are generally nearby in the protein structure, suggesting that the detection of covarying patches can be used in conjunction with traditional CMA approaches to reveal functional interaction partners.
Affiliation(s)
- Yongbai Xu
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
|
90
|
Ashkenazy H, Kliger Y. Reducing phylogenetic bias in correlated mutation analysis. Protein Eng Des Sel 2010; 23:321-6. [PMID: 20067922 DOI: 10.1093/protein/gzp078] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Indexed: 11/14/2022]
Abstract
Correlated mutation analysis (CMA) is a sequence-based approach for ab initio protein contact map prediction. The basis of this approach is the observed correlation between mutations in interacting amino acid residues. These correlations are often estimated by either calculating the Pearson's correlation coefficient (PCC) or the mutual information (MI) between columns in a multiple sequence alignment (MSA) of the protein of interest and its homologs. A major challenge of CMA is to filter out the background noise originating from phylogenetic relatedness between sequences included in the MSA. Recently, a procedure to reduce this background noise was demonstrated to improve an MI-based predictor. Herein, we tested whether a similar approach can also improve the performance of the classical PCC-based method. Indeed, performance improvements were achieved for all four major SCOP classes. Furthermore, the results reveal that the improved PCC-based method is superior to MI-based methods for proteins having MSAs of up to 100 sequences.
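The mutual information (MI) between two alignment columns, one of the two correlation measures named in this abstract, can be sketched as follows. This is an illustrative, uncorrected MI only: it does not include the phylogenetic noise reduction the abstract refers to, and the function name `column_mutual_information`, the list-of-strings MSA format and the treatment of gaps as an ordinary symbol are assumptions.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an MSA.

    msa: list of equal-length aligned sequences (strings).
    Gaps are counted like any other symbol, which is a simplification.
    """
    n = len(msa)
    # Joint and marginal symbol counts for the two columns.
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = col_i[a] / n
        p_b = col_j[b] / n
        # Sum p(a,b) * log2( p(a,b) / (p(a) p(b)) ) over observed pairs.
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Two columns that always change together give a high MI, while columns whose symbols are statistically independent give an MI of zero.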
|
91
|
Noivirt-Brik O, Horovitz A, Unger R. Trade-off between positive and negative design of protein stability: from lattice models to real proteins. PLoS Comput Biol 2009; 5:e1000592. [PMID: 20011105 PMCID: PMC2781108 DOI: 10.1371/journal.pcbi.1000592] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Received: 05/20/2009] [Accepted: 11/03/2009] [Indexed: 11/18/2022]
Abstract
Two different strategies for stabilizing proteins are (i) positive design in which the native state is stabilized and (ii) negative design in which competing non-native conformations are destabilized. Here, the circumstances under which one strategy might be favored over the other are explored in the case of lattice models of proteins and then generalized and discussed with regard to real proteins. The balance between positive and negative design of proteins is found to be determined by their average "contact-frequency", a property that corresponds to the fraction of states in the conformational ensemble of the sequence in which a pair of residues is in contact. Lattice model proteins with a high average contact-frequency are found to use negative design more than model proteins with a low average contact-frequency. A mathematical derivation of this result indicates that it is general and likely to hold also for real proteins. Comparison of the results of correlated mutation analysis for real proteins with typical contact-frequencies to those of proteins likely to have high contact-frequencies (such as disordered proteins and proteins that are dependent on chaperonins for their folding) indicates that the latter tend to have stronger interactions between residues that are not in contact in their native conformation. Hence, our work indicates that negative design is employed when insufficient stabilization is achieved via positive design owing to high contact-frequencies.
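The "contact-frequency" defined in this abstract, the fraction of states in the conformational ensemble in which a pair of residues is in contact, can be illustrated with a short sketch. The representation of each conformation as a set of contacting residue pairs, the equal weighting of all states, the function names, and the choice of which pairs to average over are assumptions for illustration; the abstract does not specify them.

```python
def contact_frequencies(ensemble):
    """Fraction of conformations in which each residue pair is in contact.

    ensemble: list of conformations, each given as a set of (i, j) residue
              pairs that are in contact (hypothetical input format).
    Every conformation is weighted equally, which is an assumption.
    """
    counts = {}
    for contacts in ensemble:
        for pair in contacts:
            counts[pair] = counts.get(pair, 0) + 1
    n_states = len(ensemble)
    return {pair: c / n_states for pair, c in counts.items()}


def average_contact_frequency(ensemble, pairs):
    """Mean contact-frequency over a chosen list of residue pairs."""
    freqs = contact_frequencies(ensemble)
    # Pairs never observed in contact have frequency 0.
    return sum(freqs.get(pair, 0.0) for pair in pairs) / len(pairs)
```

For example, a pair in contact in three of four enumerated conformations has a contact-frequency of 0.75 under this equal-weight assumption.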
Affiliation(s)
- Orly Noivirt-Brik
- Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel
- Amnon Horovitz
- Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel
- Ron Unger
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
|
92
|
|
93
|
Protein sectors: evolutionary units of three-dimensional structure. Cell 2009; 138:774-86. [PMID: 19703402 DOI: 10.1016/j.cell.2009.07.038] [Citation(s) in RCA: 511] [Impact Index Per Article: 34.1] [Received: 03/09/2009] [Revised: 07/03/2009] [Accepted: 07/30/2009] [Indexed: 11/23/2022]
Abstract
Proteins display a hierarchy of structural features at primary, secondary, tertiary, and higher-order levels, an organization that guides our current understanding of their biological properties and evolutionary origins. Here, we reveal a structural organization distinct from this traditional hierarchy by statistical analysis of correlated evolution between amino acids. Applied to the S1A serine proteases, the analysis indicates a decomposition of the protein into three quasi-independent groups of correlated amino acids that we term "protein sectors." Each sector is physically connected in the tertiary structure, has a distinct functional role, and constitutes an independent mode of sequence divergence in the protein family. Functionally relevant sectors are evident in other protein families as well, suggesting that they may be general features of proteins. We propose that sectors represent a structural organization of proteins that reflects their evolutionary histories.
|
94
|
Lee BC, Kim D. A new method for revealing correlated mutations under the structural and functional constraints in proteins. Bioinformatics 2009; 25:2506-13. [PMID: 19628501 DOI: 10.1093/bioinformatics/btp455] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]
Abstract
Motivation: Diverse studies have shown that correlated mutation (CM) is an important molecular evolutionary process alongside conservation. However, attempts to find the residue pairs that co-evolve under structural and/or functional constraints are complicated by the fact that a large portion of the covariance signals found in multiple sequence alignments arises from correlations due to common ancestry and stochastic noise. Results: Assuming that the background noise can be estimated from the coevolutionary relationships among residues, we propose a new measure for background noise called the normalized coevolutionary pattern similarity (NCPS) score. By subtracting NCPS scores from raw CM scores and combining the results with an entropy factor, we show that these new scores effectively reduce the background noise. To test the effectiveness of this method in detecting residue pairs coevolving under structural constraints, we evaluated it on two independent test sets, on which it performs better than the most accurate method currently available. In addition, we applied our method to double mutant cycle experiments and protein-protein interactions. Although more rigorous tests are required, we obtained promising results: our method tended to explain those data better than other methods. These results suggest that the new noise-reduced CM scores developed in this study can be a valuable tool for the study of correlated mutations under structural and/or functional constraints in proteins. Availability: http://pbil.kaist.ac.kr
Affiliation(s)
- Byung-Chul Lee: Department of Bio and Brain Engineering, KAIST, Daejeon 305-701, Korea
|
95
|
Wang N, Smith WF, Miller BR, Aivazian D, Lugovskoy AA, Reff ME, Glaser SM, Croner LJ, Demarest SJ. Conserved amino acid networks involved in antibody variable domain interactions. Proteins 2009; 76:99-114. [PMID: 19089973 DOI: 10.1002/prot.22319] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 12/16/2022]
Abstract
Engineered antibodies are a large and growing class of protein therapeutics comprising both marketed products and many molecules in clinical trials in various disease indications. We investigated naturally conserved networks of amino acids that support antibody V(H) and V(L) function, with the goal of generating information to assist in the engineering of robust antibody or antibody-like therapeutics. We generated a large and diverse sequence alignment of V-class Ig-folds, of which V(H) and V(L) domains are family members. To identify conserved amino acid networks, covariations between residues at all possible position pairs were quantified as correlation coefficients (phi-values). We provide rosters of the key conserved amino acid pairs in antibody V(H) and V(L) domains, for reference and use by the antibody research community. The majority of the most strongly conserved amino acid pairs in V(H) and V(L) are at or adjacent to the V(H)-V(L) interface suggesting that the ability to heterodimerize is a constraining feature of antibody evolution. For the V(H) domain, but not the V(L) domain, residue pairs at the variable-constant domain interface (V(H)-C(H)1 interface) are also strongly conserved. The same network of conserved V(H) positions involved in interactions with both the V(L) and C(H)1 domains is found in camelid V(HH) domains, which have evolved to lack interactions with V(L) and C(H)1 domains in their mature structures; however, the amino acids at these positions are different, reflecting their different function. Overall, the data describe naturally occurring amino acid networks in antibody Fv regions that can be referenced when designing antibodies or antibody-like fragments with the goal of improving their biophysical properties.
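The covariation measure mentioned here, a correlation coefficient (phi-value) between residues at a pair of positions, can be sketched as follows. This is a minimal illustration, assuming each position is reduced to a binary indicator ("is this residue present at this position?") per aligned sequence. The function names and the toy alignment are hypothetical, and the authors' exact scoring procedure may differ.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length binary (0/1) sequences."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # An invariant column has no variance, so the correlation is reported as 0.
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def residue_pair_phi(alignment, pos_a, res_a, pos_b, res_b):
    """Phi-value for residue res_a at pos_a co-occurring with res_b at pos_b."""
    x = [int(seq[pos_a] == res_a) for seq in alignment]
    y = [int(seq[pos_b] == res_b) for seq in alignment]
    return phi_coefficient(x, y)


# Hypothetical toy alignment (six sequences, four columns).
toy_alignment = ["AKLE", "AKLE", "GRLD", "GRLD", "AKLD", "GRLE"]

print(residue_pair_phi(toy_alignment, 0, "A", 1, "K"))
print(residue_pair_phi(toy_alignment, 0, "A", 3, "E"))
```

In the toy alignment, A at column 0 always co-occurs with K at column 1, so that pair scores 1.0, while A at column 0 and E at column 3 are only weakly associated.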
Affiliation(s)
- Norman Wang: Biogen Idec, San Diego, California 92122, USA
|
96
|
Frenkel-Morgenstern M, Tworowski D, Klipcan L, Safro M. Intra-protein compensatory mutations analysis highlights the tRNA recognition regions in aminoacyl-tRNA synthetases. J Biomol Struct Dyn 2009; 27:115-26. [PMID: 19583438 DOI: 10.1080/07391102.2009.10507302] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]
Abstract
The aminoacyl-tRNA synthetases (aaRSs) covalently attach amino acids to their corresponding nucleic acid adapter molecules, tRNAs. The interactions in the tRNA-aaRS complexes are mostly non-specific and largely electrostatic. Tracing the mutual adaptation of aaRSs and tRNAs throughout evolution offers a clearer view of how aaRS-tRNA systems preserve patterns of tRNA recognition and binding. In this study, we used compensatory mutation analysis to explore the adaptation of aaRSs in response to random mutations that can occur in the tRNA-recognition area. We showed that the frequency of compensatory mutations among residues that belong to the recognition region is 1.75-fold higher than that of the exposed residues. The highest frequencies of compensatory mutations are observed for pairs of charged residues, wherein one residue is located within the tRNA-recognition area, while the second is placed outside of the area and contributes to the formation of the aaRS electrostatic landscape. Given charged residues are compensated by buried charged residues in more than 60% of the analyzed mutations. The cytoplasmic and mitochondrial aaRSs preserve similar patterns of compensatory mutations in the tRNA recognition areas. Moreover, we found that mitochondrial aaRSs demonstrate a significant increase in the frequency of compensatory mutations in the area. Our findings shed light on the physical nature of compensatory mutations in aaRSs, which keep tRNA-recognition patterns unchanged.
|
97
|
Liu Z, Chen J, Thirumalai D. On the accuracy of inferring energetic coupling between distant sites in protein families from evolutionary imprints: Illustrations using lattice model. Proteins 2009; 77:823-31. [DOI: 10.1002/prot.22498] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Indexed: 12/12/2022]
|
98
|
Abstract
Covariation between sites can arise due to a common evolutionary history. At the same time, the structure and function of proteins play a significant role in the evolvability of different sites that are not directly connected with the common ancestry. The nature of the forces that cause residues to coevolve is still not thoroughly understood; in particular, it is not clear how coevolutionary processes are related to functional diversification within protein families. We analyzed both functional and structural factors that might cause covariation of specificity determinants and showed that they more often participate in coevolutionary relationships with each other and with other sites compared with functional sites and sites that are not under strong functional constraints. We also found that protein sites with a higher number of coevolutionary connections with other sites tend to evolve more slowly. Our results indicate that in some cases coevolutionary connections exist between specificity sites that are located far apart in space but are under similar functional constraints. Such correlated changes and compensations can be realized through stepwise coevolutionary processes, which in turn can shed light on the mechanisms of functional diversification.
Affiliation(s)
- Saikat Chakrabarti: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
|
99
|
Qiu P, Sanfiorenzo V, Curry S, Guo Z, Liu S, Skelton A, Xia E, Cullen C, Ralston R, Greene J, Tong X. Identification of HCV protease inhibitor resistance mutations by selection pressure-based method. Nucleic Acids Res 2009; 37:e74. [PMID: 19395595 PMCID: PMC2691846 DOI: 10.1093/nar/gkp251] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Indexed: 12/11/2022] Open
Abstract
A major challenge to successful antiviral therapy is the emergence of drug-resistant viruses. Recent studies have developed several automated analyses of HIV sequence polymorphism based on calculations of selection pressure (Ka/Ks) to predict drug resistance mutations. Similar resistance analysis programs for HCV inhibitors are not currently available. Taking advantage of the recently available sequence data of patient HCV samples from a Phase II clinical study of the protease inhibitor boceprevir, we calculated the selection pressure for all codons in the HCV protease region (amino acids 1–181) to identify potential resistance mutations. The correlation between mutations was also calculated to evaluate linkage between any two mutations. Using this approach, we identified previously known major resistance mutations, including a recently reported mutation, V55A. In addition, a novel mutation, V158I, was identified, and we further confirmed its resistance to boceprevir in protease enzyme and replicon assays. We also extended the approach to analyze potential interactions between individual mutations and identified three pairs of correlated changes. Our data suggest that selection pressure-based analysis and correlation mapping could provide useful tools to analyze large amounts of sequencing data from clinical samples and to identify new drug resistance mutations as well as their linkage and correlations.
Affiliation(s)
- Ping Qiu: Molecular Design and Informatics, Schering-Plough Research Institute, 2015 Galloping Hill Road, Kenilworth, NJ 07033, USA.
|
100
|
Fatakia SN, Costanzi S, Chow CC. Computing highly correlated positions using mutual information and graph theory for G protein-coupled receptors. PLoS One 2009; 4:e4681. [PMID: 19262747 PMCID: PMC2650788 DOI: 10.1371/journal.pone.0004681] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Received: 10/08/2008] [Accepted: 01/07/2009] [Indexed: 01/06/2023] Open
Abstract
G protein-coupled receptors (GPCRs) are a superfamily of seven transmembrane-spanning proteins involved in a wide array of physiological functions and are the most common targets of pharmaceuticals. This study aims to identify a cohort or clique of positions that share high mutual information (MI). Using a multiple sequence alignment of the transmembrane (TM) domains, we calculated the mutual information between all inter-TM pairs of aligned positions and ranked the pairs by mutual information. A mutual information graph was constructed with vertices that corresponded to TM positions, and an edge was drawn between two vertices if their mutual information exceeded a threshold of statistical significance. Positions with high degree (i.e. those with significant mutual information with a large number of other positions) were found to line a well-defined inter-TM ligand binding cavity for class A as well as class C GPCRs. Although the natural ligands of class C receptors bind to their extracellular N-terminal domains, the possibility of modulating their activity through ligands that bind to their helical bundle has been reported. Such positions were not found for class B GPCRs, in agreement with the observation that there are no known ligands that bind within their TM helical bundle. All identified key positions formed a clique within the MI graph of interest. For a subset of class A receptors we also considered the alignment of a portion of the second extracellular loop, and found that the two positions adjacent to the conserved Cys that bridges the loop with TM3 qualified as key positions. Our algorithm may be useful for localizing topologically conserved regions in other protein families.
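The mutual information graph described in this abstract can be sketched roughly as follows. This is a minimal illustration only: it considers all column pairs rather than only inter-TM pairs, and it uses a fixed MI cutoff in place of the statistical significance threshold used in the study. The function names, the toy alignment, and the cutoff value are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )


def mi_graph_degrees(alignment, threshold):
    """Degree of each column in a graph whose edges join column pairs with MI > threshold."""
    columns = list(zip(*alignment))
    degree = Counter()
    for i, j in combinations(range(len(columns)), 2):
        if mutual_information(columns[i], columns[j]) > threshold:
            degree[i] += 1
            degree[j] += 1
    return degree


# Hypothetical toy alignment (six sequences, four columns).
toy_alignment = ["AKLE", "AKLE", "GRLD", "GRLD", "AKLD", "GRLE"]

# Assumed fixed cutoff of 0.5 bits; the study used a significance-based threshold.
print(dict(mi_graph_degrees(toy_alignment, 0.5)))
```

Columns with a high degree in such a graph correspond to the "key positions" that the abstract reports as lining the ligand binding cavity; in the toy alignment only columns 0 and 1 are linked.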
Affiliation(s)
- Sarosh N. Fatakia: Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, United States of America
- Stefano Costanzi: Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, United States of America
- Carson C. Chow: Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland, United States of America
|