1
Abstract
Compensatory substitutions occur when a mutation is positively selected because it restores the fitness lost through a previous deleterious mutation. How frequently such mutations occur in evolution, and what structural and functional context permits their emergence, remain open questions. We built an atlas of intra-protein compensatory substitutions using a phylogenetic approach and a dataset of 1,630 bacterial protein families for which high-quality sequence alignments and experimentally derived protein structures were available. We identified more than 51,000 positions coevolving by means of predicted compensatory mutations. Using the evolutionary and structural properties of the analyzed positions, we demonstrate that compensatory mutations are scarce (typically only a few in the protein history) but widespread (the majority of proteins experienced at least one). Typical coevolving residues evolve slowly, are located in the protein core outside secondary structure motifs, and are more often in contact than expected by chance, even after accounting for their evolutionary rate and solvent exposure. An exception to this general scheme is residues coevolving for charge compensation, which evolve faster than noncoevolving sites, in contradiction with predictions from simple coevolutionary models, but similar to stem pairs in RNA. While sites with a significant pattern of coevolution by compensatory mutations are rare, the comparative analysis of hundreds of structures ultimately permits a better understanding of the link between the three-dimensional structure of a protein and its fitness landscape.
Affiliation(s)
- Shilpi Chaurasia
- RG Molecular Systems Evolution, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Straße 2, 24306 Plön, Germany.
- Excelra Knowledge Solutions Pvt Ltd, Hyderabad, India
- Julien Y Dutheil
- RG Molecular Systems Evolution, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Straße 2, 24306 Plön, Germany.
- Institute of Evolution Sciences of Montpellier (ISEM), CNRS, University of Montpellier, IRD, EPHE, 34095 Montpellier, France
2
Aadland K, Kolaczkowski B. Alignment-Integrated Reconstruction of Ancestral Sequences Improves Accuracy. Genome Biol Evol 2021; 12:1549-1565. [PMID: 32785673 PMCID: PMC7523730 DOI: 10.1093/gbe/evaa164] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Accepted: 08/03/2020] [Indexed: 12/31/2022] Open
Abstract
Ancestral sequence reconstruction (ASR) uses an alignment of extant protein sequences, a phylogeny describing the history of the protein family and a model of the molecular-evolutionary process to infer the sequences of ancient proteins, allowing researchers to directly investigate the impact of sequence evolution on protein structure and function. Like all statistical inferences, ASR can be sensitive to violations of its underlying assumptions. Previous studies have shown that, whereas phylogenetic uncertainty has only a very weak impact on ASR accuracy, uncertainty in the protein sequence alignment can more strongly affect inferred ancestral sequences. Here, we show that errors in sequence alignment can produce errors in ASR across a range of realistic and simplified evolutionary scenarios. Importantly, sequence reconstruction errors can lead to errors in estimates of structural and functional properties of ancestral proteins, potentially undermining the reliability of analyses relying on ASR. We introduce an alignment-integrated ASR approach that combines information from many different sequence alignments. We show that integrating alignment uncertainty improves ASR accuracy and the accuracy of downstream structural and functional inferences, often performing as well as highly accurate structure-guided alignment. Given the growing evidence that sequence alignment errors can impact the reliability of ASR studies, we recommend that future studies incorporate approaches to mitigate the impact of alignment uncertainty. Probabilistic modeling of insertion and deletion events has the potential to radically improve ASR accuracy when the model reflects the true underlying evolutionary history, but further studies are required to thoroughly evaluate the reliability of these approaches under realistic conditions.
Affiliation(s)
- Kelsey Aadland
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida
- Bryan Kolaczkowski
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida
3
Turjanski P, Ferreiro DU. On the Natural Structure of Amino Acid Patterns in Families of Protein Sequences. J Phys Chem B 2018; 122:11295-11301. [PMID: 30239207 DOI: 10.1021/acs.jpcb.8b07206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/10/2023]
Abstract
All known terrestrial proteins are coded as continuous strings of ≈20 amino acids. The patterns formed by the repetitions of elements in groups of finite sequences describe the natural architectures of protein families. We present a method to search for patterns and groupings of patterns in protein sequences using a mathematically precise definition for "repetition", an efficient algorithmic implementation and a robust scoring system with no adjustable parameters. We show that the sequence patterns can be well-separated into disjoint classes according to their recurrence in nested structures. The statistics of the occurrences of patterns indicate that short repetitions are sufficient to account for the differences between natural families and randomized groups of sequences by more than 10 standard deviations, while contiguous sequence patterns shorter than 5 residues are effectively random in their occurrences. A small subset of patterns is sufficient to account for a robust "familiarity" definition between arbitrary sets of sequences.
Affiliation(s)
- Pablo Turjanski
- KAPOW, Departamento de Computación, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-ICC, Buenos Aires, Argentina
- Diego U Ferreiro
- Protein Physiology Lab, Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, UBA-CONICET-IQUIBICEN, Buenos Aires, Argentina
4
Gil N, Fiser A. Identifying functionally informative evolutionary sequence profiles. Bioinformatics 2018; 34:1278-1286. [PMID: 29211823 PMCID: PMC5905606 DOI: 10.1093/bioinformatics/btx779] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/28/2017] [Accepted: 11/29/2017] [Indexed: 01/06/2023] Open
Abstract
Motivation: Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. Results: We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
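As a rough illustration of the quantity this abstract refers to, the sketch below computes plain Shannon mutual information between two alignment columns and averages it over all column pairs of a small MSA. It is not the SAMMI scoring scheme, which the abstract does not specify; the function names `column_mutual_information` and `mean_pairwise_mi` are hypothetical, and gaps are simply treated as one more symbol.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Shannon mutual information (in bits) between two alignment columns.

    Each column is a sequence of symbols, one per aligned sequence.
    Assumption: gap characters are counted like any other symbol.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same number of sequences")
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        # p(a,b) / (p(a) * p(b)) simplifies to count * n / (freq_a * freq_b)
        mi += (count / n) * log2(count * n / (freq_a[a] * freq_b[b]))
    return mi


def mean_pairwise_mi(msa):
    """Average mutual information over all column pairs of an MSA.

    `msa` is a list of equal-length aligned strings (hypothetical input format).
    """
    columns = list(zip(*msa))
    pairs = [(i, j) for i in range(len(columns)) for j in range(i + 1, len(columns))]
    if not pairs:
        return 0.0
    total = sum(column_mutual_information(columns[i], columns[j]) for i, j in pairs)
    return total / len(pairs)
```

Two perfectly covarying columns with two equally frequent states each carry one bit of mutual information, while statistically independent columns carry none.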
Affiliation(s)
- Nelson Gil
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Andras Fiser
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
5
Gao H, Yu X, Dou Y, Wang J. New Measurement for Correlation of Co-evolution Relationship of Subsequences in Protein. Interdiscip Sci 2015; 7:364-72. [PMID: 26396121 DOI: 10.1007/s12539-015-0024-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2014] [Revised: 04/08/2014] [Accepted: 04/16/2014] [Indexed: 11/26/2022]
Abstract
Many computational tools have been developed to measure co-evolution between protein residues. Most of them focus only on co-evolution between pairs of residues in a protein sequence. However, more than two residues may participate in co-evolution, and some co-evolved residues cluster in several distinct regions of the primary structure. Therefore, the co-evolution among adjacent residues and the correlation between the distinct regions offer insights into the function and evolution of the protein and its residues. A subsequence is used to represent the adjacent residues in one distinct region. In the paper, the co-evolution relationship within each subsequence is represented by a mutual information matrix (MIM). Then, a Pearson's correlation coefficient, the R value, is developed to measure the similarity of two MIMs. MSAs from the Catalytic Data Base (Catalytic Site Atlas, CSA) are used for testing. The R value characterizes a specific class of residues. In contrast to individual pairwise co-evolved residues, adjacent residues without high individual MI values are found, since the co-evolution relationship among them is similar to that among another set of adjacent residues. These subsequences possess some flexibility in the composition of side chains, such as the catalyzed environment.
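The comparison of two mutual information matrices by a Pearson coefficient can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which matrix entries enter the correlation, so the sketch flattens all entries of two equally sized matrices, and the function name `pearson_r_between_matrices` is hypothetical.

```python
from math import sqrt


def pearson_r_between_matrices(matrix_x, matrix_y):
    """Pearson correlation coefficient between two equally shaped matrices.

    Both matrices are nested lists of numbers (for example, mutual information
    values between the positions of two subsequences).
    Assumption: every entry is used, including the diagonal.
    """
    xs = [value for row in matrix_x for value in row]
    ys = [value for row in matrix_y for value in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("matrices must be non-empty and have the same shape")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return covariance / sqrt(variance_x * variance_y)
```

With this definition, two subsequences whose matrices differ only by a positive scale factor score R = 1 even if none of their individual MI values is high, which matches the abstract's point about adjacent residues without high individual MI values.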
Affiliation(s)
- Hongyun Gao
- School of Mathematical Sciences, Dalian University of Technology, Dalian, 116024, China
- Information and Engineering College, Dalian University, Dalian, 116622, China
- Xiaoqing Yu
- College of Sciences, Shanghai Institute of Technology, Shanghai, 201418, China
- Yongchao Dou
- Center for Plant Science and Innovation, School of Biological Sciences, University of Nebraska, Lincoln, NE, 68588, USA
- Jun Wang
- Department of Mathematics, Shanghai Normal University, Shanghai, 200234, China.
6
Ó Conchúir S, Barlow KA, Pache RA, Ollikainen N, Kundert K, O'Meara MJ, Smith CA, Kortemme T. A Web Resource for Standardized Benchmark Datasets, Metrics, and Rosetta Protocols for Macromolecular Modeling and Design. PLoS One 2015; 10:e0130433. [PMID: 26335248 PMCID: PMC4559433 DOI: 10.1371/journal.pone.0130433] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 02/20/2015] [Accepted: 05/20/2015] [Indexed: 11/18/2022] Open
Abstract
The development and validation of computational macromolecular modeling and design methods depend on suitable benchmark datasets and informative metrics for comparing protocols. In addition, if a method is intended to be adopted broadly in diverse biological applications, there needs to be information on appropriate parameters for each protocol, as well as metrics describing the expected accuracy compared to experimental data. In certain disciplines, there exist established benchmarks and public resources where experts in a particular methodology are encouraged to supply their most efficient implementation of each particular benchmark. We aim to provide such a resource for protocols in macromolecular modeling and design. We present a freely accessible web resource (https://kortemmelab.ucsf.edu/benchmarks) to guide the development of protocols for protein modeling and design. The site provides benchmark datasets and metrics to compare the performance of a variety of modeling protocols using different computational sampling methods and energy functions, providing a "best practice" set of parameters for each method. Each benchmark has an associated downloadable benchmark capture archive containing the input files, analysis scripts, and tutorials for running the benchmark. The captures may be run with any suitable modeling method; we supply command lines for running the benchmarks using the Rosetta software suite. We have compiled initial benchmarks for the resource spanning three key areas: prediction of energetic effects of mutations, protein design, and protein structure prediction, each with associated state-of-the-art modeling protocols. With the help of the wider macromolecular modeling community, we hope to expand the variety of benchmarks included on the website and continue to evaluate new iterations of current methods as they become available.
Affiliation(s)
- Shane Ó Conchúir
- California Institute for Quantitative Biosciences (QB3), University of California San Francisco, San Francisco, California, United States of America
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
- Kyle A. Barlow
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, California, United States of America
- Roland A. Pache
- California Institute for Quantitative Biosciences (QB3), University of California San Francisco, San Francisco, California, United States of America
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
- Noah Ollikainen
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, California, United States of America
- Kale Kundert
- Graduate Program in Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Matthew J. O'Meara
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, United States of America
- Colin A. Smith
- California Institute for Quantitative Biosciences (QB3), University of California San Francisco, San Francisco, California, United States of America
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, California, United States of America
- Tanja Kortemme
- California Institute for Quantitative Biosciences (QB3), University of California San Francisco, San Francisco, California, United States of America
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, California, United States of America
- Graduate Program in Biophysics, University of California San Francisco, San Francisco, California, United States of America
7
Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 2015; 16:108. [PMID: 25888064 PMCID: PMC4395974 DOI: 10.1186/s12859-015-0516-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/24/2014] [Accepted: 02/24/2015] [Indexed: 11/30/2022] Open
Abstract
Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.
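The idea that a DAG over alignment columns encodes more alignments than were sampled can be illustrated with a toy sketch. It is not the WeaveAlign implementation: the column encoding (number of residues consumed per sequence, plus a residue/gap flag), the `START`/`END` sentinel nodes, and all function names are assumptions made for this example, and column probabilities are ignored.

```python
from collections import defaultdict
from functools import lru_cache

START, END = "START", "END"


def alignment_columns(alignment):
    """Yield one hashable key per column of a gapped alignment.

    Assumed encoding: for each sequence, the number of residues consumed up to
    and including this column, plus a flag saying whether the column holds a
    residue or a gap. Identical columns in different alignments get equal keys.
    """
    consumed = [0] * len(alignment)
    for column in zip(*alignment):
        key = []
        for row, char in enumerate(column):
            is_residue = char != "-"
            if is_residue:
                consumed[row] += 1
            key.append((consumed[row], is_residue))
        yield tuple(key)


def build_alignment_dag(sampled_alignments):
    """Map each column (node) to the set of columns that follow it in any sample."""
    successors = defaultdict(set)
    for alignment in sampled_alignments:
        previous = START
        for column in alignment_columns(alignment):
            successors[previous].add(column)
            previous = column
        successors[previous].add(END)
    return successors


def count_paths(successors):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == END:
            return 1
        return sum(paths_from(following) for following in successors[node])

    return paths_from(START)
```

For example, the two sampled alignments `["ABCDE", "A-CD-"]` and `["ABCDE", "-AC-D"]` share only the middle column, so the DAG built from them contains four paths: each choice before the shared column can be combined with each choice after it.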
Affiliation(s)
- Joseph L Herman
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, London, NW7 1AA, UK.
- Ádám Novák
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Rune Lyngsø
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Adrienn Szabó
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- István Miklós
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- Department of Stochastics, Rényi Institute, Reáltanoda u. 13-15, Budapest, 1053, Hungary.
- Jotun Hein
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
8
Selwa E, Davi M, Chenal A, Sotomayor-Pérez AC, Ladant D, Malliavin TE. Allosteric activation of Bordetella pertussis adenylyl cyclase by calmodulin: molecular dynamics and mutagenesis studies. J Biol Chem 2015; 289:21131-41. [PMID: 24907274 DOI: 10.1074/jbc.m113.530410] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 01/02/2023] Open
Abstract
Adenylyl cyclase (AC) toxin is an essential toxin that allows Bordetella pertussis to invade eukaryotic cells, where it is activated after binding to calmodulin (CaM). Based on the crystal structure of the AC catalytic domain in complex with the C-terminal half of CaM (C-CaM), our previous molecular dynamics simulations (Selwa, E., Laine, E., and Malliavin, T. (2012) Differential role of calmodulin and calcium ions in the stabilization of the catalytic domain of adenyl cyclase CyaA from Bordetella pertussis. Proteins 80, 1028–1040) suggested that three residues (i.e. Arg(338), Asn(347), and Asp(360)) might be important for stabilizing the AC/CaM interaction. These residues belong to a loop-helix-loop motif at the C-terminal end of AC, which is located at the interface between CaM and the AC catalytic loop. In the present study, we conducted the in silico and in vitro characterization of three AC variants, where one (Asn(347); ACm1A), two (Arg(338) and Asp(360); ACm2A), or three residues (Arg(338), Asn(347), and Asp(360); ACm3A) were substituted with Ala. Biochemical studies showed that the affinities of ACm1A and ACm2A for CaM were not affected significantly, whereas that of ACm3A was reduced dramatically. To understand the effects of these modifications, molecular dynamics simulations were performed based on the modified proteins. The molecular dynamics trajectories recorded for the ACm3AC-CaM complex showed that the calcium-binding loops of C-CaM exhibited large fluctuations, which could be related to the weakened interaction between ACm3A and its activator. Overall, our results suggest that the loop-helix-loop motif at the C-terminal end of AC is crucial during CaM binding for stabilizing the AC catalytic loop in an active configuration.
9
Gao H, Yu X, Dou Y, Wang J. New measurement for correlation of co-evolution relationship of subsequences in protein. Interdiscip Sci 2015. [PMID: 25663109 DOI: 10.1007/s12539-014-0221-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2014] [Revised: 04/08/2014] [Accepted: 04/16/2014] [Indexed: 11/24/2022]
Abstract
Many computational tools have been developed to measure co-evolution between protein residues. Most of them focus only on co-evolution between pairs of residues in a protein sequence. However, more than two residues may participate in co-evolution, and some co-evolved residues cluster in several distinct regions of the primary structure. Therefore, the co-evolution among adjacent residues and the correlation between the distinct regions offer insights into the function and evolution of the protein and its residues. A subsequence is used to represent the adjacent residues in one distinct region. In the paper, the co-evolution relationship within each subsequence is represented by a mutual information matrix (MIM). Then, a Pearson's correlation coefficient, the R value, is developed to measure the similarity of two MIMs. MSAs from the Catalytic Data Base (Catalytic Site Atlas, CSA) are used for testing. The R value characterizes a specific class of residues. In contrast to individual pairwise co-evolved residues, adjacent residues without high individual MI values are found, since the co-evolution relationship among them is similar to that among another set of adjacent residues. These subsequences possess some flexibility in the composition of side chains, such as the catalyzed environment.
Affiliation(s)
- Hongyun Gao
- School of Mathematical Sciences, Dalian University of Technology, Dalian, 116024, China
10
Control of catalytic efficiency by a coevolving network of catalytic and noncatalytic residues. Proc Natl Acad Sci U S A 2014; 111:E2376-83. [PMID: 24912189 DOI: 10.1073/pnas.1322352111] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 12/14/2022] Open
Abstract
The active sites of enzymes consist of residues necessary for catalysis and structurally important noncatalytic residues that together maintain the architecture and function of the active site. Examples of evolutionary interactions between catalytic and noncatalytic residues have been difficult to define and experimentally validate due to a general intolerance of these residues to substitution. Here, using computational methods to predict coevolving residues, we identify a network of positions consisting of two catalytic metal-binding residues and two adjacent noncatalytic residues in LAGLIDADG homing endonucleases (LHEs). Distinct combinations of the four residues in the network map to distinct LHE subfamilies, with a striking distribution of the metal-binding Asp (D) and Glu (E) residues. Mutation of these four positions in three LHEs--I-LtrI, I-OnuI, and I-HjeMI--indicate that the combinations of residues tolerated are specific to each enzyme. Kinetic analyses under single-turnover conditions revealed that I-LtrI activity could be modulated over an ∼100-fold range by mutation of residues in the coevolving network. I-LtrI catalytic site variants with low activity could be rescued by compensatory mutations at adjacent noncatalytic sites that restore an optimal coevolving network and vice versa. Our results demonstrate that LHE activity is constrained by an evolutionary barrier of residues with strong context-dependent effects. Creation of optimal coevolving active-site networks is therefore an important consideration in engineering of LHEs and other enzymes.
11
Clark GW, Ackerman SH, Tillier ER, Gatti DL. Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments. BMC Bioinformatics 2014; 15:157. [PMID: 24886131 PMCID: PMC4046016 DOI: 10.1186/1471-2105-15-157] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 12/02/2013] [Accepted: 05/06/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Several methods are available for the detection of covarying positions from a multiple sequence alignment (MSA). If the MSA contains a large number of sequences, information about the proximities between residues derived from covariation maps can be sufficient to predict a protein fold. However, in many cases the structure is already known, and information on the covarying positions can be valuable to understand the protein mechanism and dynamic properties. Results: In this study we have sought to determine whether a multivariate (multidimensional) extension of traditional mutual information (MI) can be an additional tool to study covariation. The performance of two multidimensional MI (mdMI) methods, designed to remove the effect of ternary/quaternary interdependencies, was tested with a set of 9 MSAs each containing <400 sequences, and was shown to be comparable to that of the newest methods based on maximum entropy/pseudolikelihood statistical models of protein sequences. However, while all the methods tested detected a similar number of covarying pairs among the residues separated by < 8 Å in the reference X-ray structures, there was on average less than 65% overlap between the top scoring pairs detected by methods that are based on different principles. Conclusions: Given the large variety of structure and evolutionary history of different proteins, it is possible that a single best method to detect covariation in all proteins does not exist, and that for each protein family the best information can be derived by merging/comparing results obtained with different methods. This approach may be particularly valuable in those cases in which the size of the MSA is small or the quality of the alignment is low, leading to significant differences in the pairs detected by different methods.
Affiliation(s)
- Elisabeth R Tillier
- Department of Medical Biophysics, University of Toronto, Campbell Family Institute for Cancer Research, Ontario Cancer Institute, University Health Network, Toronto, Ontario, Canada.
12
Abstract
Positions in a protein are thought to coevolve to maintain important structural and functional interactions over evolutionary time. The detection of putative coevolving positions can provide important new insights into a protein family in the same way that knowledge is gained by recognizing evolutionarily conserved characters and characteristics. Putatively coevolving positions can be detected with statistical methods that identify covarying positions. However, positions in protein alignments can covary for many other reasons than coevolution; thus, it is crucial to create high-quality multiple sequence alignments for coevolution inference. Furthermore, it is important to understand common signs and sources of error. When confounding factors are accounted for, coevolution is a rich resource for protein engineering information.
13
Ollikainen N, Kortemme T. Computational protein design quantifies structural constraints on amino acid covariation. PLoS Comput Biol 2013; 9:e1003313. [PMID: 24244128 PMCID: PMC3828131 DOI: 10.1371/journal.pcbi.1003313] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 06/20/2013] [Accepted: 09/20/2013] [Indexed: 02/02/2023] Open
Abstract
Amino acid covariation, where the identities of amino acids at different sequence positions are correlated, is a hallmark of naturally occurring proteins. This covariation can arise from multiple factors, including selective pressures for maintaining protein structure, requirements imposed by a specific function, or from phylogenetic sampling bias. Here we employed flexible backbone computational protein design to quantify the extent to which protein structure has constrained amino acid covariation for 40 diverse protein domains. We find significant similarities between the amino acid covariation in alignments of natural protein sequences and sequences optimized for their structures by computational protein design methods. These results indicate that the structural constraints imposed by protein architecture play a dominant role in shaping amino acid covariation and that computational protein design methods can capture these effects. We also find that the similarity between natural and designed covariation is sensitive to the magnitude and mechanism of backbone flexibility used in computational protein design. Our results thus highlight the necessity of including backbone flexibility to correctly model precise details of correlated amino acid changes and give insights into the pressures underlying these correlations. Proteins generally fold into specific three-dimensional structures to perform their cellular functions, and the presence of misfolded proteins is often deleterious for cellular and organismal fitness. For these reasons, maintenance of protein structure is thought to be one of the major fitness pressures acting on proteins. Consequently, the sequences of today's naturally occurring proteins contain signatures reflecting the constraints imposed by protein structure. Here we test the ability of computational protein design methods to recapitulate and explain these signatures. 
We focus on the physical basis of evolutionary pressures that act on interactions between amino acids in folded proteins, which are critical in determining protein structure and function. Such pressures can be observed from the appearance of amino acid covariation, where the amino acids at certain positions in protein sequences are correlated with each other. We find similar patterns of amino acid covariation in natural sequences and sequences optimized for their structures using computational protein design, demonstrating the importance of structural constraints in protein molecular evolution and providing insights into the structural mechanisms leading to covariation. In addition, these results characterize the ability of computational methods to model the precise details of correlated amino acid changes, which is critical for engineering new proteins with useful functions beyond those seen in nature.
Affiliation(s)
- Noah Ollikainen
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, California, United States of America
- Tanja Kortemme
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biosciences (QB3), University of California San Francisco, San Francisco, California, United States of America
- Department of Bioengineering and Therapeutic Science, University of California San Francisco, San Francisco, California, United States of America
14
Ollikainen N, Smith CA, Fraser JS, Kortemme T. Flexible backbone sampling methods to model and design protein alternative conformations. Methods Enzymol 2013; 523:61-85. [PMID: 23422426 DOI: 10.1016/b978-0-12-394292-0.00004-7] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 12/13/2022]
Abstract
Sampling alternative conformations is key to understanding how proteins work and engineering them for new functions. However, accurately characterizing and modeling protein conformational ensembles remain experimentally and computationally challenging. These challenges must be met before protein conformational heterogeneity can be exploited in protein engineering and design. Here, as a stepping stone, we describe methods to detect alternative conformations in proteins and strategies to model these near-native conformational changes based on backrub-type Monte Carlo moves in Rosetta. We illustrate how Rosetta simulations that apply backrub moves improve modeling of point mutant side-chain conformations, native side-chain conformational heterogeneity, functional conformational changes, tolerated sequence space, protein interaction specificity, and amino acid covariation across protein-protein interfaces. We include relevant Rosetta command lines and RosettaScripts to encourage the application of these types of simulations to other systems. Our work highlights that critical scoring and sampling improvements will be necessary to approximate conformational landscapes. Challenges for the future development of these methods include modeling conformational changes that propagate away from designed mutation sites and modulating backbone flexibility to predictively design functionally important conformational heterogeneity.
Affiliation(s)
- Noah Ollikainen
- Graduate Program in Bioinformatics, University of California San Francisco, San Francisco, California, USA
15
Ashenberg O, Laub MT. Using analyses of amino acid coevolution to understand protein structure and function. Methods Enzymol 2013; 523:191-212. [PMID: 23422431 DOI: 10.1016/b978-0-12-394292-0.00009-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 12/24/2022]
Abstract
Determining which residues of a protein contribute to a specific function is a difficult problem. Analyses of amino acid covariation within a protein family can serve as a useful guide by identifying residues that are functionally coupled. Covariation analyses have been successfully used on several different protein families to identify residues that work together to promote folding, enable protein-protein interactions, or contribute to an enzymatic activity. Covariation is a statistical signal that can be measured in a multiple sequence alignment of homologous proteins. As sequence databases have expanded dramatically, covariation analyses have become easier and more powerful. In this chapter, we describe how functional covariation arises during the evolution of proteins and how this signal can be distinguished from various background signals. We discuss the basic methodology for performing amino acid covariation analysis, using bacterial two-component signal transduction proteins as an example. We provide practical suggestions for each step of the process including assembly of protein sequences, construction of a multiple sequence alignment, measurement of covariation, and analysis of results.
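The "measurement of covariation" step mentioned in this abstract can be illustrated with a minimal sketch. It computes plain mutual information between two alignment columns; the function names and the toy alignment are hypothetical, and the sketch assumes an ungapped alignment and applies no correction for phylogenetic background signal, which the chapter itself flags as something a real analysis has to address.

```python
import math
from collections import Counter


def column(alignment, index):
    """Return the residues found at one alignment position."""
    return [sequence[index] for sequence in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Frequencies are estimated directly from the counts; no pseudocounts,
    gap handling, or phylogenetic correction are applied (assumption).
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: positions 1 and 3 switch together (K/E vs R/D),
# whereas positions 0 and 1 vary independently of each other.
toy_alignment = ["AKLE", "AKLE", "ARLD", "ARLD", "GKME", "GRMD"]

mi_coupled = mutual_information(column(toy_alignment, 1), column(toy_alignment, 3))
mi_independent = mutual_information(column(toy_alignment, 0), column(toy_alignment, 1))
print(mi_coupled, mi_independent)
```

In this toy case the coupled pair scores higher than the independent pair. On real protein families, raw mutual information is also inflated by shared ancestry and uneven sequence sampling, so it is usually only a starting point.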
Affiliation(s)
- Orr Ashenberg
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
16
Chakraborty S, Rao BJ, Baker N, Asgeirsson B. Structural phylogeny by profile extraction and multiple superimposition using electrostatic congruence as a discriminator. Intrinsically Disordered Proteins 2013; 1. [PMID: 25364645 PMCID: PMC4212511 DOI: 10.4161/idp.25463] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022]
Abstract
Phylogenetic analysis of proteins using multiple sequence alignment (MSA) assumes an underlying evolutionary relationship among these proteins, which occasionally remains undetected due to considerable sequence divergence. Structural alignment programs have been developed to unravel such fuzzy relationships. However, none of these structure-based methods has used electrostatic properties to discriminate between spatially equivalent residues. We present a methodology for MSA of a set of related proteins with known structures using electrostatic properties as an additional discriminator (STEEP). STEEP first extracts a profile, then generates a multiple structural superimposition providing a consolidated spatial framework for comparing residues, and finally emits the MSA. Residues that are aligned differently by including or excluding electrostatic properties can be targeted by directed evolution experiments to transform the enzymatic properties of one protein into another. We have compared STEEP results to those obtained from an MSA program (ClustalW) and a structural alignment method (MUSTANG) for chymotrypsin serine proteases. Subsequently, we used PhyML to generate phylogenetic trees for the serine and metallo-β-lactamase superfamilies from the STEEP-generated MSA, and corroborated the accepted relationships in these superfamilies. We have observed that STEEP acts as a functional classifier when electrostatic congruence is used as a discriminator, and thus identifies potential targets for directed evolution experiments. In summary, STEEP is unique among phylogenetic methods for its ability to use electrostatic congruence to specify mutations that might be the source of the functional divergence in a protein family. Based on our results, we also hypothesize that the active site and its close vicinity contain enough information to infer the correct phylogeny for related proteins.
Affiliation(s)
- Sandeep Chakraborty
- Department of Biological Sciences, Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400 005, India
- Basuthkar J Rao
- Department of Biological Sciences, Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400 005, India
- Nathan Baker
- Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, Washington 99354, United States
- Bjarni Asgeirsson
- Science Institute, Department of Biochemistry, University of Iceland, Dunhaga 3, IS-107 Reykjavik, Iceland
17
Dickson RJ, Gloor GB. Protein sequence alignment analysis by local covariation: coevolution statistics detect benchmark alignment errors. PLoS One 2012; 7:e37645. [PMID: 22715369 PMCID: PMC3371027 DOI: 10.1371/journal.pone.0037645] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 12/30/2011] [Accepted: 04/26/2012] [Indexed: 11/19/2022]
Abstract
The use of sequence alignments to understand protein families is ubiquitous in molecular biology. High-quality alignments are difficult to build and protein alignment remains one of the largest open problems in computational biology. Misalignments can lead to inferential errors about protein structure, folding, function, phylogeny, and residue importance. Identifying alignment errors is difficult because alignments are built and validated on the same primary criterion: sequence conservation. Local covariation identifies systematic misalignments and is independent of conservation. We demonstrate an alignment curation tool, LoCo, that integrates local covariation scores with the Jalview alignment editor. Using LoCo, we illustrate how local covariation is capable of identifying alignment errors due to the reduction of positional independence in the region of misalignment. We highlight three alignments from the benchmark database, BAliBASE 3, that contain regions of high local covariation, and investigate the causes to illustrate these types of scenarios. Two alignments contain sequential and structural shifts that cause elevated local covariation. Realignment of these misaligned segments reduces local covariation; these alternative alignments are supported with structural evidence. We also show that local covariation identifies active site residues in a validated alignment of paralogous structures. LoCo is available at https://sourceforge.net/projects/locoprotein/files/
Affiliation(s)
- Gregory B. Gloor
- Department of Biochemistry, The University of Western Ontario, London, Canada
18
Lee Y, Mick J, Furdui C, Beamer LJ. A coevolutionary residue network at the site of a functionally important conformational change in a phosphohexomutase enzyme family. PLoS One 2012; 7:e38114. [PMID: 22685552 PMCID: PMC3369874 DOI: 10.1371/journal.pone.0038114] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 02/07/2012] [Accepted: 05/01/2012] [Indexed: 11/26/2022]
Abstract
Coevolution analyses identify residues that co-vary with each other during evolution, revealing sequence relationships unobservable from traditional multiple sequence alignments. Here we describe a coevolutionary analysis of phosphomannomutase/phosphoglucomutase (PMM/PGM), a widespread and diverse enzyme family involved in carbohydrate biosynthesis. Mutual information and graph theory were utilized to identify a network of highly connected residues with high significance. An examination of the most tightly connected regions of the coevolutionary network reveals that most of the involved residues are localized near an interdomain interface of this enzyme, known to be the site of a functionally important conformational change. The roles of four interface residues found in this network were examined via site-directed mutagenesis and kinetic characterization. For three of these residues, mutation to alanine reduces enzyme specificity to ∼10% or less of wild-type, while the other has ∼45% activity of wild-type enzyme. An additional mutant of an interface residue that is not densely connected in the coevolutionary network was also characterized, and shows no change in activity relative to wild-type enzyme. The results of these studies are interpreted in the context of structural and functional data on PMM/PGM. Together, they demonstrate that a network of coevolving residues links the highly conserved active site with the interdomain conformational change necessary for the multi-step catalytic reaction. This work adds to our understanding of the functional roles of coevolving residue networks, and has implications for the definition of catalytically important residues.
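The idea of combining pairwise covariation scores with graph theory to find highly connected residues can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the study: the pair scores, residue numbers, threshold, and function names are all hypothetical, and connectivity is measured simply as node degree.

```python
from collections import defaultdict


def build_network(pair_scores, threshold):
    """Link two positions when their covariation score reaches the threshold."""
    network = defaultdict(set)
    for (i, j), score in pair_scores.items():
        if score >= threshold:
            network[i].add(j)
            network[j].add(i)
    return network


def rank_by_degree(network):
    """Order positions from most to least connected (ties broken by position)."""
    return sorted(network, key=lambda pos: (-len(network[pos]), pos))


# Hypothetical covariation scores keyed by (position, position).
toy_scores = {
    (10, 42): 0.9,
    (10, 77): 0.8,
    (42, 77): 0.7,
    (10, 120): 0.6,
    (5, 200): 0.1,  # below the threshold, so it is left out of the network
}

network = build_network(toy_scores, threshold=0.5)
ranking = rank_by_degree(network)
print(ranking)
```

Positions at the top of such a ranking would be the natural candidates for follow-up, for example by site-directed mutagenesis as described in the abstract.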
Affiliation(s)
- Yingying Lee
- Department of Chemistry, University of Missouri, Columbia, Missouri, United States of America
- Jacob Mick
- Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
- Cristina Furdui
- Department of Internal Medicine, Wake Forest University Health Sciences, Winston-Salem, North Carolina, United States of America
- Lesa J. Beamer
- Department of Chemistry, University of Missouri, Columbia, Missouri, United States of America
- Department of Biochemistry, University of Missouri, Columbia, Missouri, United States of America
19
Livesay DR, Kreth KE, Fodor AA. A critical evaluation of correlated mutation algorithms and coevolution within allosteric mechanisms. Methods Mol Biol 2012; 796:385-398. [PMID: 22052502 DOI: 10.1007/978-1-61779-334-9_21] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Indexed: 05/31/2023]
Abstract
The notion of using the evolutionary history encoded within multiple sequence alignments to predict allosteric mechanisms is appealing. In this approach, correlated mutations are expected to reflect coordinated changes that maintain intramolecular coupling between residue pairs. Despite much early fanfare, the general suitability of correlated mutations to predict allosteric couplings has not yet been established. Progress along these lines has been hindered by several algorithmic limitations, including phylogenetic artifacts within alignments masking true covariance and the computational intractability of considering more than two correlated residues at a time. Recent progress in algorithm development, however, has been substantial, with a new generation of correlated mutation algorithms that have made fundamental progress toward solving these difficult problems. Despite these encouraging results, there remains little evidence to suggest that the evolutionary constraints acting on allosteric couplings are sufficient to be recovered from multiple sequence alignments. In this review, we argue that due to the exquisite sensitivity of protein dynamics, and hence that of allosteric mechanisms, the latter vary widely within protein families. If it turns out to be generally true that even very similar homologs display a wide divergence of allosteric mechanisms, then even a perfect correlated mutation algorithm could not be reliably used as a general mechanism for discovery of allosteric pathways.
Affiliation(s)
- Dennis R Livesay
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA
20
Blackburne BP, Whelan S. Measuring the distance between multiple sequence alignments. Bioinformatics 2011; 28:495-502. [PMID: 22199391 DOI: 10.1093/bioinformatics/btr701] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.9] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Multiple sequence alignment (MSA) is a core method in bioinformatics. The accuracy of such alignments may influence the success of downstream analyses such as phylogenetic inference, protein structure prediction, and functional prediction. The importance of MSA has led to the proliferation of MSA methods, with different objective functions and heuristics to search for the optimal MSA. Different methods of inferring MSAs produce different results in all but the most trivial cases. By measuring the differences between inferred alignments, we may be able to develop an understanding of how these differences (i) relate to the objective functions and heuristics used in MSA methods, and (ii) affect downstream analyses. RESULTS: We introduce four metrics to compare MSAs, which include the position in a sequence where a gap occurs or the location on a phylogenetic tree where an insertion or deletion (indel) event occurs. We use both real and synthetic data to explore the information given by these metrics and demonstrate how the different metrics in combination can yield more information about MSA methods and the differences between them. AVAILABILITY: MetAl is a free software implementation of these metrics in Haskell. Source and binaries for Windows, Linux and Mac OS X are available from http://kumiho.smith.man.ac.uk/whelan/software/metal/.
Affiliation(s)
- Benjamin P Blackburne
- Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, UK
21
Use of mutual information arrays to predict coevolving sites in the full length HIV gp120 protein for subtypes B and C. Virol Sin 2011; 26:95-104. [PMID: 21468932 DOI: 10.1007/s12250-011-3188-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/31/2011] [Accepted: 02/22/2011] [Indexed: 10/18/2022]
Abstract
It is well established that different sites within a protein evolve at different rates according to their role within the protein; identification of correlated mutations between sites can aid in tasks such as ab initio protein structure prediction, structure-function analysis or sequence alignment. Mutual information is a standard measure for coevolution between two sites, but its application is limited by the signal-to-noise ratio. In this work we report a preliminary study to investigate whether larger sequence sets could circumvent this problem by calculating mutual information arrays for two sets of drug-naïve sequences from the HIV gp120 protein for the B and C subtypes. Our results suggest that while the larger sequence sets can improve the signal-to-noise ratio, the gain is offset by the high mutation rate of HIV, which makes it more difficult to achieve consistent alignments. Nevertheless, we were able to predict a number of coevolving sites that were supported by previous experimental studies, as well as a region close to the C terminus of the protein that was highly variable in the C subtype but highly conserved in the B subtype.
22
Markova-Raina P, Petrov D. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Res 2011; 21:863-74. [PMID: 21393387 DOI: 10.1101/gr.115949.110] [Citation(s) in RCA: 110] [Impact Index Per Article: 8.5] [Indexed: 12/25/2022]
Abstract
We investigate the effect of aligner choice on inferences of positive selection using site-specific models of molecular evolution. We find that independently of the choice of aligner, the rate of false positives is unacceptably high. Our study is a whole-genome analysis of all protein-coding genes in 12 Drosophila genomes annotated in either all 12 species (~6690 genes) or in the six melanogaster group species. We compare six popular aligners: PRANK, T-Coffee, ClustalW, ProbCons, AMAP, and MUSCLE, and find that the aligner choice strongly influences the estimates of positive selection. Differences persist when we use (1) different stringency cutoffs, (2) different selection inference models, (3) alignments with or without gaps, and/or additional masking, (4) per-site versus per-gene statistics, (5) closely related melanogaster group species versus more distant 12 Drosophila genomes. Furthermore, we find that these differences are consequential for downstream analyses such as determination of over/under-represented GO terms associated with positive selection. Visual analysis indicates that most sites inferred as positively selected are, in fact, misaligned at the codon level, resulting in false-positive rates of 48%-82%. PRANK, which has been reported to outperform other aligners in simulations, performed best in our empirical study as well. Unfortunately, PRANK still had a high false-positive rate of 50%-55%, which is unacceptable for most applications. We identify misannotations and indels, many of which appear to be located in disordered protein regions, as primary culprits for the high misalignment-related error levels and discuss possible workaround approaches to this apparently pervasive problem in genome-wide evolutionary analyses.
Affiliation(s)
- Penka Markova-Raina
- Department of Biology, Stanford University, Stanford, California 94305, USA.