1. Kende J, Bonomi M, Temmam S, Regnault B, Pérot P, Eloit M, Bigot T. Virus Pop-Expanding Viral Databases by Protein Sequence Simulation. Viruses 2023; 15:1227. [PMID: 37376527 DOI: 10.3390/v15061227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 05/15/2023] [Accepted: 05/16/2023] [Indexed: 06/29/2023]
Abstract
The improvement of our knowledge of the virosphere, which includes unknown viruses, is a key area in virology. Metagenomics tools, which perform taxonomic assignation from high throughput sequencing datasets, are generally evaluated with datasets derived from biological samples or in silico spiked samples containing known viral sequences present in public databases, resulting in the inability to evaluate the capacity of these tools to detect novel or distant viruses. Simulating realistic evolutionary directions is therefore key to benchmark and improve these tools. Additionally, expanding current databases with realistic simulated sequences can improve the capacity of alignment-based searching strategies for finding distant viruses, which could lead to a better characterization of the "dark matter" of metagenomics data. Here, we present Virus Pop, a novel pipeline for simulating realistic protein sequences and adding new branches to a protein phylogenetic tree. The tool generates simulated sequences with substitution rate variations that are dependent on protein domains and inferred from the input dataset, allowing for a realistic representation of protein evolution. The pipeline also infers ancestral sequences corresponding to multiple internal nodes of the input data phylogenetic tree, enabling new sequences to be inserted at various points of interest in the group studied. We demonstrated that Virus Pop produces simulated sequences that closely match the structural and functional characteristics of real protein sequences, taking as an example the spike protein of sarbecoviruses. Virus Pop also succeeded at creating sequences that resemble real sequences not included in the databases, which facilitated the identification of a novel pathogenic human circovirus not included in the input database. In conclusion, Virus Pop is helpful for challenging taxonomic assignation tools and could help improve databases to better detect distant viruses.
Affiliation(s)
- Julia Kende: Bioinformatics and Biostatistics Hub, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
- Massimiliano Bonomi: Department of Structural Biology and Chemistry, Institut Pasteur, Université Paris Cité, CNRS UMR 3528, F-75015 Paris, France
- Sarah Temmam: Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
- Béatrice Regnault: Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
- Philippe Pérot: Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
- Marc Eloit: Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France; Ecole Nationale Vétérinaire d'Alfort, F-94700 Maisons-Alfort, France
- Thomas Bigot: Bioinformatics and Biostatistics Hub, Institut Pasteur, Université Paris Cité, F-75015 Paris, France; Pathogen Discovery Laboratory, Institut Pasteur, Université Paris Cité, F-75015 Paris, France
2. Jump-Chain Simulation of Markov Substitution Processes Over Phylogenies. J Mol Evol 2022; 90:239-243. [PMID: 35652926 PMCID: PMC9233627 DOI: 10.1007/s00239-022-10058-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Accepted: 05/11/2022] [Indexed: 10/28/2022]
Abstract
We draw attention to an under-appreciated simulation method for generating artificial data in a phylogenetic context. The approach, which we refer to as jump-chain simulation, can invoke rich models of molecular evolution having intractable likelihood functions. As an example, we simulate data under a context-dependent model allowing for CpG hypermutability and show how such a feature can mislead common codon models used for detecting positive selection. We discuss more generally how this method can serve to elucidate the ways by which currently used models for inference are susceptible to violations of their underlying assumptions. Finally, we show how the method could serve as an inference engine in the Approximate Bayesian Computation framework.
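As a rough illustration of the idea in this abstract, the sketch below simulates a nucleotide sequence along a single branch by drawing exponential waiting times from the total substitution rate of the whole sequence and then choosing which substitution occurs in proportion to its rate. It assumes the standard waiting-time plus embedded-jump-chain construction of a continuous-time Markov chain; the function names, the uniform base rate, and the `cpg_boost` multiplier are hypothetical choices for illustration and are not taken from the paper.

```python
import random

NUCLEOTIDES = "ACGT"


def site_rates(seq, i, cpg_boost=5.0, base_rate=1.0 / 3.0):
    """Return {target_nucleotide: rate} for site i given its neighbours.

    Hypothetical context rule: C->T followed by G, and G->A preceded by C,
    are sped up by cpg_boost to mimic CpG hypermutability.
    """
    rates = {}
    for target in NUCLEOTIDES:
        if target == seq[i]:
            continue
        rate = base_rate
        if seq[i] == "C" and target == "T" and i + 1 < len(seq) and seq[i + 1] == "G":
            rate *= cpg_boost
        if seq[i] == "G" and target == "A" and i > 0 and seq[i - 1] == "C":
            rate *= cpg_boost
        rates[target] = rate
    return rates


def jump_chain_branch(seq, branch_length, rng, cpg_boost=5.0):
    """Evolve seq along one branch; return (new_sequence, number_of_events)."""
    seq = list(seq)
    elapsed = 0.0
    n_events = 0
    while True:
        # Enumerate every possible single substitution and its current rate.
        events = []
        for i in range(len(seq)):
            for target, rate in site_rates(seq, i, cpg_boost).items():
                events.append((i, target, rate))
        total_rate = sum(rate for _, _, rate in events)
        # Waiting time to the next substitution anywhere in the sequence.
        elapsed += rng.expovariate(total_rate)
        if elapsed > branch_length:
            break
        # Jump step: pick one event with probability proportional to its rate.
        site, target, _ = rng.choices(events, weights=[e[2] for e in events])[0]
        seq[site] = target
        n_events += 1
    return "".join(seq), n_events


if __name__ == "__main__":
    rng = random.Random(0)
    print(jump_chain_branch("ACGTACGCGA", 0.5, rng))
```

Because the rates are recomputed from the current sequence after every event, neighbour-dependent rules like the CpG one need no site-independent likelihood, which is the property the abstract emphasizes. Applying the same function branch by branch from root to tips would extend the sketch to a full tree.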
3. Selberg AGA, Gaucher EA, Liberles DA. Ancestral Sequence Reconstruction: From Chemical Paleogenetics to Maximum Likelihood Algorithms and Beyond. J Mol Evol 2021; 89:157-164. [PMID: 33486547 PMCID: PMC7828096 DOI: 10.1007/s00239-021-09993-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 10/04/2020] [Accepted: 01/04/2021] [Indexed: 12/13/2022]
Abstract
As both a computational and an experimental endeavor, ancestral sequence reconstruction remains a timely and important technique. Modern approaches to conduct ancestral sequence reconstruction for proteins are built upon a conceptual framework from journal founder Emile Zuckerkandl. On top of this, work on maximum likelihood phylogenetics published in Journal of Molecular Evolution in 1996 was one of the first approaches for generating maximum likelihood ancestral sequences of proteins. From its computational history, future model development needs as well as potential applications in areas as diverse as computational systems biology, molecular community ecology, infectious disease therapeutics and other biomedical applications, and biotechnology are discussed. From its past in this journal, there is a bright future for ancestral sequence reconstruction in the field of evolutionary biology.
Affiliation(s)
- Avery G A Selberg: Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
- Eric A Gaucher: Department of Biology, Georgia State University, Atlanta, GA, 30303, USA
- David A Liberles: Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
4. Perron U, Kozlov AM, Stamatakis A, Goldman N, Moal IH. Modeling Structural Constraints on Protein Evolution via Side-Chain Conformational States. Mol Biol Evol 2020; 36:2086-2103. [PMID: 31114882 PMCID: PMC6736381 DOI: 10.1093/molbev/msz122] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/16/2022]
Abstract
Few models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and increasing availability. We present a structurally aware empirical substitution model for amino acid sequence evolution in which proteins are expressed using an expanded alphabet that relays both amino acid identity and structural information. Each character specifies an amino acid as well as information about the rotamer configuration of its side-chain: the discrete geometric pattern of permitted side-chain atomic positions, as defined by the dihedral angles between covalently linked atoms. By assigning rotamer states in 251,194 protein structures and identifying 4,508,390 substitutions between closely related sequences, we generate a 55-state “Dayhoff-like” model that shows that the evolutionary properties of amino acids depend strongly upon side-chain geometry. The model performs as well as or better than traditional 20-state models for divergence time estimation, tree inference, and ancestral state reconstruction. We conclude that not only is rotamer configuration a valuable source of information for phylogenetic studies, but that modeling the concomitant evolution of sequence and structure may have important implications for understanding protein folding and function.
Affiliation(s)
- Umberto Perron: European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
- Alexey M Kozlov: Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Alexandros Stamatakis: Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Nick Goldman: European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
- Iain H Moal: European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom; Computational and Modelling Sciences, GlaxoSmithKline Research and Development, Stevenage, United Kingdom
5. Quintero I, Landis MJ. Interdependent Phenotypic and Biogeographic Evolution Driven by Biotic Interactions. Syst Biol 2019; 69:739-755. [DOI: 10.1093/sysbio/syz082] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 02/25/2019] [Revised: 12/06/2019] [Accepted: 12/10/2019] [Indexed: 11/13/2022]
Abstract
Biotic interactions are hypothesized to be one of the main processes shaping trait and biogeographic evolution during lineage diversification. Theoretical and empirical evidence suggests that species with similar ecological requirements either spatially exclude each other, by preventing the colonization of competitors or by driving coexisting populations to extinction, or show niche divergence when in sympatry. However, the extent and generality of the effect of interspecific competition in trait and biogeographic evolution has been limited by a dearth of appropriate process-generating models to directly test the effect of biotic interactions. Here, we formulate a phylogenetic parametric model that allows interdependence between trait and biogeographic evolution, thus enabling a direct test of central hypotheses on how biotic interactions shape these evolutionary processes. We adopt a Bayesian data augmentation approach to estimate the joint posterior distribution of trait histories, range histories, and coevolutionary process parameters under this analytically intractable model. Through simulations, we show that our model is capable of distinguishing alternative scenarios of biotic interactions. We apply our model to the radiation of Darwin’s finches—a classic example of adaptive divergence—and find limited support for in situ trait divergence in beak size, but stronger evidence for convergence in traits such as beak shape and tarsus length and for competitive exclusion throughout their evolutionary history. These findings are more consistent with presympatric, rather than postsympatric, niche divergence. Our modeling framework opens new possibilities for testing more complex hypotheses about the processes underlying lineage diversification. More generally, it provides a robust probabilistic methodology to model correlated evolution of continuous and discrete characters. 
[Bayesian; biotic interactions; competition; data augmentation; historical biogeography; trait evolution.]
Affiliation(s)
- Ignacio Quintero: Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT 06511, USA; Département de Biologie, Institut de Biologie de l’ENS (IBENS), École Normale Supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
- Michael J Landis: Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT 06511, USA; Department of Biology, Washington University in St. Louis, St. Louis, MO 63130, USA
6. Laurin-Lemay S, Rodrigue N, Lartillot N, Philippe H. Conditional Approximate Bayesian Computation: A New Approach for Across-Site Dependency in High-Dimensional Mutation-Selection Models. Mol Biol Evol 2019; 35:2819-2834. [PMID: 30203003 DOI: 10.1093/molbev/msy173] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 12/11/2022]
Abstract
A key question in molecular evolutionary biology concerns the relative roles of mutation and selection in shaping genomic data. Moreover, features of mutation and selection are heterogeneous along the genome and over time. Mechanistic codon substitution models based on the mutation-selection framework are promising approaches to separating these effects. In practice, however, several complications arise, since accounting for such heterogeneities often implies handling models of high dimensionality (e.g., amino acid preferences), or leads to across-site dependence (e.g., CpG hypermutability), making the likelihood function intractable. Approximate Bayesian Computation (ABC) could address this latter issue. Here, we propose a new approach, named Conditional ABC (CABC), which combines the sampling efficiency of MCMC and the flexibility of ABC. To illustrate the potential of the CABC approach, we apply it to the study of mammalian CpG hypermutability based on a new mutation-level parameter implying dependence across adjacent sites, combined with site-specific purifying selection on amino-acids captured by a Dirichlet process. Our proof-of-concept of the CABC methodology opens new modeling perspectives. Our application of the method reveals a high level of heterogeneity of CpG hypermutability across loci and mild heterogeneity across taxonomic groups; and finally, we show that CpG hypermutability is an important evolutionary factor in rendering relative synonymous codon usage. All source code is available as a GitHub repository (https://github.com/Simonll/LikelihoodFreePhylogenetics.git).
Affiliation(s)
- Simon Laurin-Lemay: Robert-Cedergren Center for Bioinformatics and Genomics, Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, QC, Canada
- Nicolas Rodrigue: Department of Biology, Institute of Biochemistry, and School of Mathematics and Statistics, Carleton University, Ottawa, ON, Canada
- Nicolas Lartillot: Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Lyon 1, Lyon, France
- Hervé Philippe: Robert-Cedergren Center for Bioinformatics and Genomics, Department of Biochemistry and Molecular Medicine, Faculty of Medicine, Université de Montréal, Montréal, QC, Canada; Centre de Théorisation et de Modélisation de la Biodiversité, Station d'Écologie Théorique et Expérimentale, UMR CNRS 5321, Moulis, France
7. Herman JL. Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information. Methods Mol Biol 2019; 1851:183-214. [PMID: 30298398 DOI: 10.1007/978-1-4939-8736-8_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
For highly divergent sequences, there is often insufficient information to reliably construct alignments and phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in sequence, structural information can be used to help identify homology in such cases. While there exist well-studied models of sequence evolution, structurally informed alignment methods have typically made use of geometric measures of deviation that do not take into account the underlying mutational processes. In order to integrate structural information into sequence-based evolutionary models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and implemented this as the StructAlign plugin for the StatAlign statistical alignment package. In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating how the inclusion of structural information can be used to inform joint estimation of alignments and trees. StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly conserved within clades, the rate of structural divergence as a function of sequence variation is larger between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree results in an improved fit to the empirically observed pairwise RMSD values.
Affiliation(s)
- Joseph L Herman: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
8. Brown JM, Thomson RC. Evaluating Model Performance in Evolutionary Biology. Annual Review of Ecology, Evolution, and Systematics 2018. [DOI: 10.1146/annurev-ecolsys-110617-062249] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Indexed: 11/09/2022]
Abstract
Many fields of evolutionary biology now depend on stochastic mathematical models. These models are valuable for their ability to formalize predictions in the face of uncertainty and provide a quantitative framework for testing hypotheses. However, no mathematical model will fully capture biological complexity. Instead, these models attempt to capture the important features of biological systems using relatively simple mathematical principles. These simplifications can allow us to focus on differences that are meaningful, while ignoring those that are not. However, simplification also requires assumptions, and to the extent that these are wrong, so is our ability to predict or compare. Here, we discuss approaches for evaluating the performance of evolutionary models in light of their assumptions by comparing them against reality. We highlight general approaches, how they are applied, and remaining opportunities. Absolute tests of fit, even when not explicitly framed as such, are fundamental to progress in understanding evolution.
Affiliation(s)
- Jeremy M. Brown: Department of Biological Sciences and Museum of Natural Science, Louisiana State University, Baton Rouge, Louisiana 70803, USA
- Robert C. Thomson: Department of Biology, University of Hawai'i, Honolulu, Hawai'i 96822, USA
9. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc Natl Acad Sci U S A 2018; 115:1854-1859. [PMID: 29432193 DOI: 10.1073/pnas.1712673115] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Indexed: 01/17/2023]
Abstract
The Bayesian method is noted to produce spuriously high posterior probabilities for phylogenetic trees in analysis of large datasets, but the precise reasons for this overconfidence are unknown. In general, the performance of Bayesian selection of misspecified models is poorly understood, even though this is of great scientific interest since models are never true in real data analysis. Here we characterize the asymptotic behavior of Bayesian model selection and show that when the competing models are equally wrong, Bayesian model selection exhibits surprising and polarized behaviors in large datasets, supporting one model with full force while rejecting the others. If one model is slightly less wrong than the other, the less wrong model will eventually win when the amount of data increases, but the method may become overconfident before it becomes reliable. We suggest that this extreme behavior may be a major factor for the spuriously high posterior probabilities for evolutionary trees. The philosophical implications of our results to the application of Bayesian model selection to evaluate opposing scientific hypotheses are yet to be explored, as are the behaviors of non-Bayesian methods in similar situations.
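The polarized behaviour described in this abstract can be illustrated with a toy simulation, which is an assumption of this note and not an analysis from the paper. Data are drawn from N(0, 1), while the two competing models, N(-delta, 1) and N(+delta, 1), are equally wrong. With equal prior weights, the log Bayes factor reduces to -2 * delta * sum(x), a random walk whose spread grows with the sample size, so the posterior probability of either model drifts toward 0 or 1 as data accumulate. All names and default values below are hypothetical.

```python
import math
import random


def posterior_of_model_1(n, delta, rng):
    """Posterior probability of model 1 given n draws from the true N(0, 1).

    Model 1 is N(-delta, 1) and model 2 is N(+delta, 1), so both are equally
    far from the truth. Prior model probabilities are 1/2 each.
    """
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    # log L1 - log L2 simplifies to -2 * delta * sum(x_i) for these two models.
    log_bf = -2.0 * delta * total
    # Numerically stable logistic transform of the log Bayes factor.
    if log_bf >= 0:
        return 1.0 / (1.0 + math.exp(-log_bf))
    e = math.exp(log_bf)
    return e / (1.0 + e)


def polarized_fraction(n, delta=0.5, reps=200, seed=1):
    """Fraction of replicate data sets with posterior below 0.01 or above 0.99."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(reps):
        p = posterior_of_model_1(n, delta, rng)
        if p < 0.01 or p > 0.99:
            extreme += 1
    return extreme / reps


if __name__ == "__main__":
    for n in (10, 100, 2000):
        print(n, polarized_fraction(n))
```

In this toy setting the fraction of replicates with an extreme posterior should rise with n, even though neither model is better than the other.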
10. Ronquist F, Lartillot N, Phillips MJ. Closing the gap between rocks and clocks using total-evidence dating. Philos Trans R Soc Lond B Biol Sci 2017; 371:rstb.2015.0136. [PMID: 27325833 PMCID: PMC4920337 DOI: 10.1098/rstb.2015.0136] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Accepted: 05/10/2016] [Indexed: 11/12/2022]
Abstract
Total-evidence dating (TED) allows evolutionary biologists to incorporate a wide range of dating information into a unified statistical analysis. One might expect this to improve the agreement between rocks and clocks but this is not necessarily the case. We explore the reasons for such discordance using a mammalian dataset with rich molecular, morphological and fossil information. There is strong conflict in this dataset between morphology and molecules under standard stochastic models. This causes TED to push divergence events back in time when using inadequate models or vague priors, a phenomenon we term 'deep root attraction' (DRA). We identify several causes of DRA. Failure to account for diversified sampling results in dramatic DRA, but this can be addressed using existing techniques. Inadequate morphological models also appear to be a major contributor to DRA. The major reason seems to be that current models do not account for dependencies among morphological characters, causing distorted topology and branch length estimates. This is particularly problematic for huge morphological datasets, which may contain large numbers of correlated characters. Finally, diversification and fossil sampling priors that do not incorporate all the available background information can contribute to DRA, but these priors can also be used to compensate for DRA. Specifically, we show that DRA in the mammalian dataset can be addressed by introducing a modest extra penalty for ghost lineages that are unobserved in the fossil record, for instance by assuming rapid diversification, rare extinction or high fossil sampling rate; any of these assumptions produces highly congruent divergence time estimates with a minimal gap between rocks and clocks. 
Under these conditions, fossils have a stabilizing influence on divergence time estimates and significantly increase the precision of those estimates, which are generally close to the dates suggested by palaeontologists. This article is part of the themed issue 'Dating species divergences using rocks and clocks'.
Affiliation(s)
- Fredrik Ronquist: Department of Bioinformatics and Genetics, Swedish Museum of Natural History, PO Box 50007, 104 05 Stockholm, Sweden
- Nicolas Lartillot: Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard Lyon 1, F-69622 Villeurbanne Cedex, France
- Matthew J Phillips: School of Earth, Environmental and Biological Sciences, Queensland University of Technology, 2 George Street, Brisbane, Queensland 4000, Australia
11. Baele G, Lemey P, Suchard MA. Genealogical Working Distributions for Bayesian Model Testing with Phylogenetic Uncertainty. Syst Biol 2015; 65:250-64. [PMID: 26526428 DOI: 10.1093/sysbio/syv083] [Citation(s) in RCA: 79] [Impact Index Per Article: 8.8] [Received: 06/01/2015] [Accepted: 10/28/2015] [Indexed: 11/12/2022]
Abstract
Marginal likelihood estimates to compare models using Bayes factors frequently accompany Bayesian phylogenetic inference. Approaches to estimate marginal likelihoods have garnered increased attention over the past decade. In particular, the introduction of path sampling (PS) and stepping-stone sampling (SS) into Bayesian phylogenetics has tremendously improved the accuracy of model selection. These sampling techniques are now used to evaluate complex evolutionary and population genetic models on empirical data sets, but considerable computational demands hamper their widespread adoption. Further, when very diffuse, but proper priors are specified for model parameters, numerical issues complicate the exploration of the priors, a necessary step in marginal likelihood estimation using PS or SS. To avoid such instabilities, generalized SS (GSS) has recently been proposed, introducing the concept of "working distributions" to facilitate--or shorten--the integration process that underlies marginal likelihood estimation. However, the need to fix the tree topology currently limits GSS in a coalescent-based framework. Here, we extend GSS by relaxing the fixed underlying tree topology assumption. To this purpose, we introduce a "working" distribution on the space of genealogies, which enables estimating marginal likelihoods while accommodating phylogenetic uncertainty. We propose two different "working" distributions that help GSS to outperform PS and SS in terms of accuracy when comparing demographic and evolutionary models applied to synthetic data and real-world examples. Further, we show that the use of very diffuse priors can lead to a considerable overestimation in marginal likelihood when using PS and SS, while still retrieving the correct marginal likelihood using both GSS approaches. The methods used in this article are available in BEAST, a powerful user-friendly software package to perform Bayesian evolutionary analyses.
Affiliation(s)
- Guy Baele: Department of Microbiology and Immunology, Rega Institute, KU Leuven-University of Leuven, Leuven, Belgium
- Philippe Lemey: Department of Microbiology and Immunology, Rega Institute, KU Leuven-University of Leuven, Leuven, Belgium
- Marc A Suchard: Department of Biomathematics and Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, USA; Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA 90095, USA
12. Lee HJ, Rodrigue N, Thorne JL. Relaxing the Molecular Clock to Different Degrees for Different Substitution Types. Mol Biol Evol 2015; 32:1948-61. [PMID: 25931515 PMCID: PMC4833082 DOI: 10.1093/molbev/msv099] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/06/2023]
Abstract
Rates of molecular evolution can vary over time. Diverse statistical techniques for divergence time estimation have been developed to accommodate this variation. These typically require that all sequence (or codon) positions at a locus change independently of one another. They also generally assume that the rates of different types of nucleotide substitutions vary across a phylogeny in the same way. This permits divergence time estimation procedures to employ an instantaneous rate matrix with relative rates that do not differ among branches. However, previous studies have suggested that some substitution types (e.g., CpG to TpG changes in mammals) are more clock-like than others. As has been previously noted, this is biologically plausible given the mutational mechanism of CpG to TpG changes. Through stochastic mapping of sequence histories from context-independent substitution models, our approach allows for context-dependent nucleotide substitutions to change their relative rates over time. We apply our approach to the analysis of a 0.15 Mb intergenic region from eight primates. In accord with previous findings, we find comparatively little rate variation over time for CpG to TpG substitutions but we find more for other substitution types. We conclude by discussing the limitations and prospects of our approach.
Affiliation(s)
- Hui-Jie Lee: Department of Statistics, North Carolina State University
- Jeffrey L Thorne: Department of Statistics, North Carolina State University; Department of Biological Sciences, North Carolina State University
13. Contingency and entrenchment in protein evolution under purifying selection. Proc Natl Acad Sci U S A 2015; 112:E3226-35. [PMID: 26056312 DOI: 10.1073/pnas.1412933112] [Citation(s) in RCA: 140] [Impact Index Per Article: 15.6] [Indexed: 01/06/2023]
Abstract
The phenotypic effect of an allele at one genetic site may depend on alleles at other sites, a phenomenon known as epistasis. Epistasis can profoundly influence the process of evolution in populations and shape the patterns of protein divergence across species. Whereas epistasis between adaptive substitutions has been studied extensively, relatively little is known about epistasis under purifying selection. Here we use computational models of thermodynamic stability in a ligand-binding protein to explore the structure of epistasis in simulations of protein sequence evolution. Even though the predicted effects on stability of random mutations are almost completely additive, the mutations that fix under purifying selection are enriched for epistasis. In particular, the mutations that fix are contingent on previous substitutions: Although nearly neutral at their time of fixation, these mutations would be deleterious in the absence of preceding substitutions. Conversely, substitutions under purifying selection are subsequently entrenched by epistasis with later substitutions: They become increasingly deleterious to revert over time. Our results imply that, even under purifying selection, protein sequence evolution is often contingent on history and so it cannot be predicted by the phenotypic effects of mutations assayed in the ancestral background.
14. Wang K, Yu S, Ji X, Lakner C, Griffing A, Thorne JL. Roles of solvent accessibility and gene expression in modeling protein sequence evolution. Evol Bioinform Online 2015; 11:85-96. [PMID: 25987828 PMCID: PMC4415675 DOI: 10.4137/ebo.s22911] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 12/14/2014] [Revised: 02/04/2015] [Accepted: 02/09/2015] [Indexed: 11/05/2022]
Abstract
Models of protein evolution tend to ignore functional constraints, although structural constraints are sometimes incorporated. Here we propose a probabilistic framework for codon substitution that evaluates joint effects of relative solvent accessibility (RSA), a structural constraint; and gene expression, a functional constraint. First, we explore the relationship between RSA and codon usage at the genomic scale as well as at the individual gene scale. Motivated by these results, we construct our framework by determining how probable is an amino acid, given RSA and gene expression, and then evaluating the relative probability of observing a codon compared to other synonymous codons. We come to the biologically plausible conclusion that both RSA and gene expression are related to amino acid frequencies, but, among synonymous codons, the relative probability of a particular codon is more closely related to gene expression than RSA. To illustrate the potential applications of our framework, we propose a new codon substitution model. Using this model, we obtain estimates of 2N s, the product of effective population size N, and relative fitness difference of allele s. For a training data set consisting of human proteins with known structures and expression data, 2N s is estimated separately for synonymous and nonsynonymous substitutions in each protein. We then contrast the patterns of synonymous and nonsynonymous 2N s estimates across proteins while also taking gene expression levels of the proteins into account. We conclude that our 2N s estimates are too concentrated around 0, and we discuss potential explanations for this lack of variability.
Affiliation(s)
- Kuangyu Wang
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Shuhui Yu
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA; College of Life Science, Chongqing University, Chongqing, China
- Xiang Ji
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Clemens Lakner
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Alexander Griffing
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Jeffrey L Thorne
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
|
15
|
Fu M, Huang Z, Mao Y, Tao S. Neighbor preferences of amino acids and context-dependent effects of amino acid substitutions in human, mouse, and dog. Int J Mol Sci 2014; 15:15963-80. [PMID: 25210846 PMCID: PMC4200849 DOI: 10.3390/ijms150915963] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/28/2014] [Revised: 08/27/2014] [Accepted: 09/02/2014] [Indexed: 12/23/2022] Open
Abstract
Amino acids show apparent propensities toward their neighbors. In addition to preferences of amino acids for their neighborhood context, amino acid substitutions are also considered to be context-dependent. However, context-dependence patterns of amino acid substitutions remain poorly understood. Using relative entropy, we investigated the neighbor preferences of the 20 amino acids and the context-dependent effects of amino acid substitutions with protein sequences in human, mouse, and dog. For the 20 amino acids, the highest relative entropy was mostly observed at the nearest adjacent site on either the N- or C-terminal side, except for C and G. C showed the highest relative entropy at the third flanking site, and a periodic pattern was detected at G flanking sites. Furthermore, neighbor preference patterns of amino acids varied greatly in different secondary structures. We then comprehensively investigated the context-dependent effects of amino acid substitutions. Our results showed that nearly half of the 380 substitution types were evidently context dependent, and the context-dependent patterns relied on protein secondary structures. Among the 20 amino acids, P elicited the greatest effect on amino acid substitutions. The underlying mechanisms of context-dependent effects of amino acid substitutions were possibly mutation bias at the DNA level and natural selection. Our findings may improve secondary structure prediction algorithms and protein design; moreover, this study provided useful information to develop empirical models of protein evolution that consider dependence between residues.
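The relative entropy mentioned in this abstract can be illustrated with a minimal sketch. It assumes the measure is the Kullback-Leibler divergence between the amino acid distribution observed at a flanking site and a background distribution; the function name, the toy three-letter alphabet, the frequencies, and the choice of log base 2 are all illustrative assumptions, not details taken from the paper.

```python
import math

def relative_entropy(observed, background):
    """Kullback-Leibler divergence D(observed || background), in bits.

    Both arguments map residue symbols to probabilities that sum to 1.
    Terms with zero observed probability contribute nothing.
    """
    total = 0.0
    for residue, p in observed.items():
        if p > 0.0:
            total += p * math.log2(p / background[residue])
    return total

# Hypothetical frequencies over a toy alphabet (not real data).
background = {"A": 0.5, "G": 0.25, "P": 0.25}
next_to_p = {"A": 0.25, "G": 0.25, "P": 0.5}

print(relative_entropy(next_to_p, background))   # prints 0.25
print(relative_entropy(background, background))  # prints 0.0
```

A larger value means the neighbor distribution at that site departs more strongly from the background, which is the sense in which a flanking site shows a "preference".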
Affiliation(s)
- Mingchuan Fu
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Zhuoran Huang
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Yuanhui Mao
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
- Shiheng Tao
- College of Life Sciences and State Key Laboratory of Crop Stress Biology in Arid Areas, Northwest A&F University, Yangling 712100, China.
|
16
|
Eme L, Sharpe SC, Brown MW, Roger AJ. On the age of eukaryotes: evaluating evidence from fossils and molecular clocks. Cold Spring Harb Perspect Biol 2014; 6:a016139. [PMID: 25085908 DOI: 10.1101/cshperspect.a016139] [Citation(s) in RCA: 116] [Impact Index Per Article: 11.6] [Indexed: 11/24/2022]
Abstract
Our understanding of the phylogenetic relationships among eukaryotic lineages has improved dramatically over the past few decades thanks to the development of sophisticated phylogenetic methods and models of evolution, in combination with the increasing availability of sequence data for a variety of eukaryotic lineages. Concurrently, efforts have been made to infer the age of major evolutionary events along the tree of eukaryotes using fossil-calibrated molecular clock-based methods. Here, we review the progress and pitfalls in estimating the age of the last eukaryotic common ancestor (LECA) and major lineages. After reviewing previous attempts to date deep eukaryote divergences, we present the results of a Bayesian relaxed-molecular clock analysis of a large dataset (159 proteins, 85 taxa) using 19 fossil calibrations. We show that for major eukaryote groups estimated dates of divergence, as well as their credible intervals, are heavily influenced by the relaxed molecular clock models and methods used, and by the nature and treatment of fossil calibrations. Whereas the estimated age of LECA varied widely, ranging from 1007 (943-1102) Ma to 1898 (1655-2094) Ma, all analyses suggested that the eukaryotic supergroups subsequently diverged rapidly (i.e., within 300 Ma of LECA). The extreme variability of these and previously published analyses precludes definitive conclusions regarding the age of major eukaryote clades at this time. As more reliable fossil data on eukaryotes from the Proterozoic become available and improvements are made in relaxed molecular clock modeling, we may be able to date the age of extant eukaryotes more precisely.
Affiliation(s)
- Laura Eme
- Centre for Comparative Genomics and Evolutionary Bioinformatics, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax B3H 4R2, Canada
- Susan C Sharpe
- Centre for Comparative Genomics and Evolutionary Bioinformatics, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax B3H 4R2, Canada
- Matthew W Brown
- Centre for Comparative Genomics and Evolutionary Bioinformatics, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax B3H 4R2, Canada
- Andrew J Roger
- Centre for Comparative Genomics and Evolutionary Bioinformatics, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax B3H 4R2, Canada
|
17
|
Phylogenetic Gaussian process model for the inference of functionally important regions in protein tertiary structures. PLoS Comput Biol 2014; 10:e1003429. [PMID: 24453956 PMCID: PMC3894161 DOI: 10.1371/journal.pcbi.1003429] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 07/31/2013] [Accepted: 11/22/2013] [Indexed: 11/30/2022] Open
Abstract
A critical question in biology is the identification of functionally important amino acid sites in proteins. Because functionally important sites are under stronger purifying selection, site-specific substitution rates tend to be lower than usual at these sites. A large number of phylogenetic models have been developed to estimate site-specific substitution rates in proteins and the extraordinarily low substitution rates have been used as evidence of function. Most of the existing tools, e.g. Rate4Site, assume that site-specific substitution rates are independent across sites. However, site-specific substitution rates may be strongly correlated in the protein tertiary structure, since functionally important sites tend to be clustered together to form functional patches. We have developed a new model, GP4Rate, which incorporates the Gaussian process model with the standard phylogenetic model to identify slowly evolved regions in protein tertiary structures. GP4Rate uses the Gaussian process to define a nonparametric prior distribution of site-specific substitution rates, which naturally captures the spatial correlation of substitution rates. Simulations suggest that GP4Rate can potentially estimate site-specific substitution rates with a much higher accuracy than Rate4Site and tends to report slowly evolved regions rather than individual sites. In addition, GP4Rate can estimate the strength of the spatial correlation of substitution rates from the data. By applying GP4Rate to a set of mammalian B7-1 genes, we found a highly conserved region which coincides with experimental evidence. GP4Rate may be a useful tool for the in silico prediction of functionally important regions in the proteins with known structures.
To understand how a protein functions, a critical step is to know which regions in its protein tertiary structure may be functionally important. Functionally important protein regions are typically more conserved than other regions because mutations in these regions are more likely to be deleterious. A number of phylogenetic models have been developed to identify conserved sites or regions in proteins by comparing protein sequences from multiple species. However, most of these methods treat amino acid sites independently and do not consider the spatial clustering of conserved sites in the protein tertiary structure. Therefore, their power of identifying functional protein regions is limited. We develop a new statistical model, GP4Rate, which combines the information from the protein sequences and the protein tertiary structure to infer conserved regions. We demonstrate that GP4Rate outperforms Rate4Site, the most widely used phylogenetic software for inferring functional amino acid sites, via simulations with a case study of B7-1 genes. GP4Rate is a potentially useful tool for guiding mutagenesis experiments or providing insights on the relationship between protein structures and functions.
|
18
|
Mutational effects on stability are largely conserved during protein evolution. Proc Natl Acad Sci U S A 2013; 110:21071-6. [PMID: 24324165 DOI: 10.1073/pnas.1314781111] [Citation(s) in RCA: 105] [Impact Index Per Article: 9.5] [Indexed: 11/18/2022] Open
Abstract
Protein stability and folding are the result of cooperative interactions among many residues, yet phylogenetic approaches assume that sites are independent. This discrepancy has engendered concerns about large evolutionary shifts in mutational effects that might confound phylogenetic approaches. Here we experimentally investigate this issue by introducing the same mutations into a set of diverged homologs of the influenza nucleoprotein and measuring the effects on stability. We find that mutational effects on stability are largely conserved across the homologs. We reach qualitatively similar conclusions when we simulate protein evolution with molecular-mechanics force fields. Our results do not mean that proteins evolve without epistasis, which can still arise even when mutational stability effects are conserved. However, our findings indicate that large evolutionary shifts in mutational effects on stability are rare, at least among homologs with similar structures and functions. We suggest that properly describing the clearly observable and highly conserved amino acid preferences at individual sites is likely to be far more important for phylogenetic analyses than accounting for rare shifts in amino acid propensities due to site covariation.
|
19
|
Baele G, Lemey P, Vansteelandt S. Make the most of your samples: Bayes factor estimators for high-dimensional models of sequence evolution. BMC Bioinformatics 2013; 14:85. [PMID: 23497171 PMCID: PMC3651733 DOI: 10.1186/1471-2105-14-85] [Citation(s) in RCA: 78] [Impact Index Per Article: 7.1] [Received: 07/02/2012] [Accepted: 01/22/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. In the Bayesian framework, model selection is typically performed through the evaluation of a Bayes factor, the ratio of two marginal likelihoods (one for each model). Recently introduced techniques to estimate (log) marginal likelihoods, such as path sampling and stepping-stone sampling, offer increased accuracy over the traditional harmonic mean estimator at an increased computational cost. Most often, each model's marginal likelihood will be estimated individually, which leads the resulting Bayes factor to suffer from errors associated with each of these independent estimation processes.
RESULTS: We here assess the original 'model-switch' path sampling approach for direct Bayes factor estimation in phylogenetics, as well as an extension that uses more samples, to construct a direct path between two competing models, thereby eliminating the need to calculate each model's marginal likelihood independently. Further, we provide a competing Bayes factor estimator using an adaptation of the recently introduced stepping-stone sampling algorithm and set out to determine appropriate settings for accurately calculating such Bayes factors, with context-dependent evolutionary models as an example. While we show that modest efforts are required to roughly identify the increase in model fit, only drastically increased computation times ensure the accuracy needed to detect more subtle details of the evolutionary process.
CONCLUSIONS: We show that our adaptation of stepping-stone sampling for direct Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Our proposed approach for Bayes factor estimation also has preferable statistical properties over the use of individual marginal likelihood estimates for both models under comparison. Assuming a sigmoid function to determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational effort to approximate the true value of the (log) Bayes factor than the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation.
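The two quantities this abstract contrasts can be written down in a few lines. The sketch below is an illustration only, with hypothetical function names: it shows the log Bayes factor as a difference of two log marginal likelihoods, and the traditional harmonic mean estimator computed from log-likelihood values sampled from a posterior. It does not implement the path sampling or stepping-stone estimators assessed in the paper, and the abstract itself reports that the harmonic mean estimator is unreliable.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def log_harmonic_mean_estimate(log_likelihoods):
    """Harmonic mean estimate of the log marginal likelihood.

    Uses log(n) - logsumexp(-logL_i), i.e. the log of the harmonic mean
    of the likelihoods evaluated at posterior samples.
    """
    n = len(log_likelihoods)
    return math.log(n) - logsumexp([-ll for ll in log_likelihoods])

def log_bayes_factor(log_ml_model_1, log_ml_model_0):
    """Log Bayes factor of model 1 over model 0 (ratio on the log scale)."""
    return log_ml_model_1 - log_ml_model_0

# Hypothetical posterior log-likelihood samples for two models (not real data).
samples_model_0 = [-1012.0, -1010.5, -1011.2, -1009.8]
samples_model_1 = [-1003.1, -1001.7, -1002.4, -1000.9]

log_bf = log_bayes_factor(
    log_harmonic_mean_estimate(samples_model_1),
    log_harmonic_mean_estimate(samples_model_0),
)
print(log_bf)
```

Estimating each marginal likelihood separately and subtracting, as done here, is exactly the practice the abstract argues accumulates two independent estimation errors; the direct estimators it proposes avoid that step.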
Affiliation(s)
- Guy Baele
- Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium.
|
20
|
Meyer AG, Wilke CO. Integrating sequence variation and protein structure to identify sites under selection. Mol Biol Evol 2012; 30:36-44. [PMID: 22977116 DOI: 10.1093/molbev/mss217] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Indexed: 12/21/2022] Open
Abstract
We present a novel method to identify sites under selection in protein-coding genes. Our method combines the traditional Goldman-Yang model of coding-sequence evolution with the information obtained from the 3D structure of the evolving protein, specifically the relative solvent accessibility (RSA) of individual residues. We develop a random-effects likelihood sites model in which rate classes are RSA dependent. The RSA dependence is modeled with linear functions. We demonstrate that our RSA-dependent model provides a significantly better fit to molecular sequence data than does a traditional, RSA-independent model. We further show that our model provides a natural, RSA-dependent neutral baseline for the evolutionary rate ratio ω = dN/dS. Sites that deviate from this neutral baseline likely experience selection pressure for function. We apply our method to the influenza proteins hemagglutinin and neuraminidase. For hemagglutinin, our method recovers positively selected sites near the sialic acid-binding site and negatively selected sites that may be important for trimerization. For neuraminidase, our method recovers the oseltamivir resistance site and otherwise suggests that few sites deviate from the neutral baseline. Our method is broadly applicable to any protein sequences for which structural data are available or can be obtained via homology modeling or threading.
Affiliation(s)
- Austin G Meyer
- Section of Integrative Biology, Institute for Cellular and Molecular Biology, Center for Computational Biology and Bioinformatics, University of Texas at Austin, Austin, TX, USA
|
21
|
Scherrer MP, Meyer AG, Wilke CO. Modeling coding-sequence evolution within the context of residue solvent accessibility. BMC Evol Biol 2012; 12:179. [PMID: 22967129 PMCID: PMC3527230 DOI: 10.1186/1471-2148-12-179] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Received: 04/11/2012] [Accepted: 09/03/2012] [Indexed: 11/30/2022] Open
Abstract
Background: Protein structure mediates site-specific patterns of sequence divergence. In particular, residues in the core of a protein (solvent-inaccessible residues) tend to be more evolutionarily conserved than residues on the surface (solvent-accessible residues).
Results: Here, we present a model of sequence evolution that explicitly accounts for the relative solvent accessibility of each residue in a protein. Our model is a variant of the Goldman-Yang 1994 (GY94) model in which all model parameters can be functions of the relative solvent accessibility (RSA) of a residue. We apply this model to a data set comprised of nearly 600 yeast genes, and find that an evolutionary-rate ratio ω that varies linearly with RSA provides a better model fit than an RSA-independent ω or an ω that is estimated separately in individual RSA bins. We further show that the branch length t and the transition-transversion ratio κ also vary with RSA. The RSA-dependent GY94 model performs better than an RSA-dependent Muse-Gaut 1994 (MG94) model in which the synonymous and non-synonymous rates individually are linear functions of RSA. Finally, protein core size affects the slope of the linear relationship between ω and RSA, and gene expression level affects both the intercept and the slope.
Conclusions: Structure-aware models of sequence evolution provide a significantly better fit than traditional models that neglect structure. The linear relationship between ω and RSA implies that genes are better characterized by their ω slope and intercept than by just their mean ω.
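The idea of summarizing a gene by an ω intercept and slope, rather than by a mean ω, can be sketched as follows. This is an illustration under stated assumptions: the paper estimates the linear dependence inside a likelihood-based codon model, whereas the sketch substitutes an ordinary least-squares line through per-site (RSA, ω) pairs. The function name and the toy values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs.

    Returns (intercept, slope). Requires at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data: buried sites (low RSA) evolve more slowly than exposed ones.
rsa = [0.0, 0.25, 0.5, 0.75, 1.0]
omega = [0.05 + 0.2 * r for r in rsa]  # hypothetical, noise-free values

intercept, slope = fit_line(rsa, omega)
mean_omega = sum(omega) / len(omega)
print(intercept, slope, mean_omega)
```

Two genes can share the same mean ω while differing in intercept and slope, which is why the pair of numbers carries more information than the mean alone.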
Affiliation(s)
- Michael P Scherrer
- Center for Computational Biology and Bioinformatics, Institute for Cellular and Molecular Biology, and Section of Integrative Biology, The University of Texas at Austin, Austin, TX 78712, USA
|
22
|
Baele G, Lemey P, Bedford T, Rambaut A, Suchard MA, Alekseyenko AV. Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Mol Biol Evol 2012; 29:2157-67. [PMID: 22403239 PMCID: PMC3424409 DOI: 10.1093/molbev/mss084] [Citation(s) in RCA: 789] [Impact Index Per Article: 65.8] [Indexed: 01/02/2023] Open
Abstract
Recent developments in marginal likelihood estimation for model selection in the field of Bayesian phylogenetics and molecular evolution have emphasized the poor performance of the harmonic mean estimator (HME). Although these studies have shown the merits of new approaches applied to standard normally distributed examples and small real-world data sets, not much is currently known concerning the performance and computational issues of these methods when fitting complex evolutionary and population genetic models to empirical real-world data sets. Further, these approaches have not yet seen widespread application in the field due to the lack of implementations of these computationally demanding techniques in commonly used phylogenetic packages. We here investigate the performance of some of these new marginal likelihood estimators, specifically, path sampling (PS) and stepping-stone (SS) sampling for comparing models of demographic change and relaxed molecular clocks, using synthetic data and real-world examples for which unexpected inferences were made using the HME. Given the drastically increased computational demands of PS and SS sampling, we also investigate a posterior simulation-based analogue of Akaike's information criterion (AIC) through Markov chain Monte Carlo (MCMC), a model comparison approach that shares with the HME the appealing feature of having a low computational overhead over the original MCMC analysis. We confirm that the HME systematically overestimates the marginal likelihood and fails to yield reliable model classification and show that the AICM performs better and may be a useful initial evaluation of model choice but that it is also, to a lesser degree, unreliable. We show that PS and SS sampling substantially outperform these estimators and adjust the conclusions made concerning previous analyses for the three real-world data sets that we reanalyzed. 
The methods used in this article are now available in BEAST, a powerful user-friendly software package to perform Bayesian evolutionary analyses.
Affiliation(s)
- Guy Baele
- Department of Microbiology and Immunology, KU Leuven, Leuven, Belgium.
|
23
|
Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg-Bauer E, Colwell LJ, de Koning APJ, Dokholyan NV, Echave J, Elofsson A, Gerloff DL, Goldstein RA, Grahnen JA, Holder MT, Lakner C, Lartillot N, Lovell SC, Naylor G, Perica T, Pollock DD, Pupko T, Regan L, Roger A, Rubinstein N, Shakhnovich E, Sjölander K, Sunyaev S, Teufel AI, Thorne JL, Thornton JW, Weinreich DM, Whelan S. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 2012; 21:769-85. [PMID: 22528593 PMCID: PMC3403413 DOI: 10.1002/pro.2071] [Citation(s) in RCA: 149] [Impact Index Per Article: 12.4] [Received: 03/02/2012] [Revised: 03/22/2012] [Accepted: 03/23/2012] [Indexed: 12/20/2022]
Abstract
The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundations for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy and the state of the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion and domain rearrangements and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics as a deviation from static considerations of protein structure are also important. Protein expression level is known to be a major determinant of evolutionary rate and several considerations including selection at the mRNA level and the role of interaction specificity are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry are presented, towards an aim of ultimately generating better models for biological inference and prediction.
Affiliation(s)
- David A Liberles
- Department of Molecular Biology, University of Wyoming, Laramie, Wyoming 82071
- Sarah A Teichmann
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, United Kingdom
- Ivet Bahar
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15213
- Ugo Bastolla
- Bioinformatics Unit, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Universidad Autonoma de Madrid, 28049 Cantoblanco Madrid, Spain
- Jesse Bloom
- Division of Basic Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109
- Erich Bornberg-Bauer
- Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Germany
- Lucy J Colwell
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, United Kingdom
- A P Jason de Koning
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Colorado, Aurora, Colorado
- Nikolay V Dokholyan
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, North Carolina 27599
- Julian Echave
- Escuela de Ciencia y Tecnología, Universidad Nacional de San Martín, Martín de Irigoyen 3100, 1650 San Martín, Buenos Aires, Argentina
- Arne Elofsson
- Department of Biochemistry and Biophysics, Center for Biomembrane Research, Stockholm Bioinformatics Center, Science for Life Laboratory, Swedish E-science Research Center, Stockholm University, 106 91 Stockholm, Sweden
- Dietlind L Gerloff
- Biomolecular Engineering Department, University of California, Santa Cruz, California 95064
- Richard A Goldstein
- Division of Mathematical Biology, National Institute for Medical Research (MRC), Mill Hill, London NW7 1AA, United Kingdom
- Johan A Grahnen
- Department of Molecular Biology, University of Wyoming, Laramie, Wyoming 82071
- Mark T Holder
- Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, Kansas 66045
- Clemens Lakner
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695
- Nicholas Lartillot
- Département de Biochimie, Faculté de Médecine, Université de Montréal, Montréal, QC H3T1J4, Canada
- Simon C Lovell
- Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, United Kingdom
- Gavin Naylor
- Department of Biology, College of Charleston, Charleston, South Carolina 29424
- Tina Perica
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, United Kingdom
- David D Pollock
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Colorado, Aurora, Colorado
- Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Lynne Regan
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven 06511
- Andrew Roger
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, Canada
- Nimrod Rubinstein
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Eugene Shakhnovich
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138
- Kimmen Sjölander
- Department of Bioengineering, University of California, Berkeley, Berkeley, California 94720
- Shamil Sunyaev
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, 77 Avenue Louis Pasteur, Boston, Massachusetts 02115
- Ashley I Teufel
- Department of Molecular Biology, University of Wyoming, Laramie, Wyoming 82071
- Jeffrey L Thorne
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695
- Joseph W Thornton
- Howard Hughes Medical Institute and Institute for Ecology and Evolution, University of Oregon, Eugene, Oregon 97403
- Department of Human Genetics, University of Chicago, Chicago, Illinois 60637
- Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637
- Daniel M Weinreich
- Department of Ecology and Evolutionary Biology, and Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912
- Simon Whelan
- Faculty of Life Sciences, University of Manchester, Manchester M13 9PT, United Kingdom
|
24
|
Abstract
The process of amino acid replacement in proteins is context-dependent, with substitution rates influenced by local structure, functional role, and amino acids at other locations. Predicting how these differences affect replacement processes is difficult. To make such inference easier, it is often assumed that the acceptabilities of different amino acids at a position are constant. However, evolutionary interactions among residue positions will tend to invalidate this assumption. Here, we use simulations of purple acid phosphatase evolution to show that amino acid propensities at a position undergo predictable change after an amino acid replacement at that position. After a replacement, the new amino acid and similar amino acids tend to become gradually more acceptable over time at that position. In other words, proteins tend to equilibrate to the presence of an amino acid at a position through replacements at other positions. Such a shift is reminiscent of the spectroscopy effect known as the Stokes shift, where molecules receiving a quantum of energy and moving to a higher electronic state will adjust to the new state and emit a smaller quantum of energy whenever they shift back down to the original ground state. Predictions of changes in stability in real proteins show that mutation reversals become less favorable over time, and thus, broadly support our results. The observation of an evolutionary Stokes shift has profound implications for the study of protein evolution and the modeling of evolutionary processes.
|
25
|
Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models. Genetics 2011; 190:1101-15. [PMID: 22209901 PMCID: PMC3296245 DOI: 10.1534/genetics.111.136432] [Citation(s) in RCA: 97] [Impact Index Per Article: 7.5] [Indexed: 12/12/2022] Open
Abstract
Estimation of the distribution of selection coefficients of mutations is a long-standing issue in molecular evolution. In addition to population-based methods, the distribution can be estimated from DNA sequence data by phylogenetic-based models. Previous models have generally found unimodal distributions where the probability mass is concentrated between mildly deleterious and nearly neutral mutations. Here we use a sitewise mutation–selection phylogenetic model to estimate the distribution of selection coefficients among novel and fixed mutations (substitutions) in a data set of 244 mammalian mitochondrial genomes and a set of 401 PB2 proteins from influenza. We find a bimodal distribution of selection coefficients for novel mutations in both the mitochondrial data set and for the influenza protein evolving in its natural reservoir, birds. Most of the mutations are strongly deleterious with the rest of the probability mass concentrated around mildly deleterious to neutral mutations. The distribution of the coefficients among substitutions is unimodal and symmetrical around nearly neutral substitutions for both data sets at adaptive equilibrium. About 0.5% of the nonsynonymous mutations and 14% of the nonsynonymous substitutions in the mitochondrial proteins are advantageous, with 0.5% and 24% observed for the influenza protein. Following a host shift of influenza from birds to humans, however, we find among novel mutations in PB2 a trimodal distribution with a small mode of advantageous mutations.
|
26
|
Context-Dependent Evolutionary Models for Non-Coding Sequences: An Overview of Several Decades of Research and an Analysis of Laurasiatheria and Primate Evolution. Evol Biol 2011. [DOI: 10.1007/s11692-011-9139-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/17/2022]
|
27
|
Rodrigue N, Aris-Brosou S. Fast Bayesian choice of phylogenetic models: prospecting data augmentation-based thermodynamic integration. Syst Biol 2011; 60:881-7. [PMID: 21804092 DOI: 10.1093/sysbio/syr065] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022]
Affiliation(s)
- Nicolas Rodrigue
- Department of Biology and Center for Advanced Research in Environmental Genomics, University of Ottawa, 30 Marie Curie Pvt., Ottawa, ON, Canada
|
28
|
Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D. Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 2011; 9:e1000602. [PMID: 21423652 PMCID: PMC3057953 DOI: 10.1371/journal.pbio.1000602] [Citation(s) in RCA: 701] [Impact Index Per Article: 53.9] [Indexed: 11/25/2022]
Affiliation(s)
- Hervé Philippe
- Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Montréal, Québec, Canada.
|
29
|
Cartwright RA, Lartillot N, Thorne JL. History can matter: non-Markovian behavior of ancestral lineages. Syst Biol 2011; 60:276-90. [PMID: 21398626 DOI: 10.1093/sysbio/syr012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022]
Abstract
Although most of the important evolutionary events in the history of biology can only be studied via interspecific comparisons, it is challenging to apply the rich body of population genetic theory to the study of interspecific genetic variation. Probabilistic modeling of the substitution process would ideally be derived from first principles of population genetics, allowing a quantitative connection to be made between the parameters describing mutation, selection, drift, and the patterns of interspecific variation. There has been progress in reconciling population genetics and interspecific evolution for the case where mutation rates are sufficiently low, but when mutation rates are higher, reconciliation has been hampered due to complications from how the loss or fixation of new mutations can be influenced by linked nonneutral polymorphisms (i.e., the Hill-Robertson effect). To investigate the generation of interspecific genetic variation when concurrent fitness-affecting polymorphisms are common and the Hill-Robertson effect is thereby potentially strong, we used the Wright-Fisher model of population genetics to simulate very many generations of mutation, natural selection, and genetic drift. This was done so that the chronological history of advantageous, deleterious, and neutral substitutions could be traced over time along the ancestral lineage. Our simulations show that the process by which a nonrecombining sequence changes over time can markedly deviate from the Markov assumption that is ubiquitous in molecular phylogenetics. In particular, we find tendencies for advantageous substitutions to be followed by deleterious ones and for deleterious substitutions to be followed by advantageous ones. Such non-Markovian patterns reflect the fact that the fate of the ancestral lineage depends not only on its current allelic state but also on gene copies not belonging to the ancestral lineage. 
Although our simulations describe nonrecombining sequences, we conclude by discussing how non-Markovian behavior of the ancestral lineage is plausible even when recombination rates are not low. As a result, we believe that increased attention needs to be devoted to the robustness of evolutionary inference procedures that rely upon the Markov assumption.
Affiliation(s)
- Reed A Cartwright
- Department of Genetics, Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695-7566, USA
|
30
|
Lakner C, Holder MT, Goldman N, Naylor GJP. What's in a Likelihood? Simple Models of Protein Evolution and the Contribution of Structurally Viable Reconstructions to the Likelihood. Syst Biol 2011; 60:161-74. [DOI: 10.1093/sysbio/syq088] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Affiliation(s)
- Clemens Lakner
- Department of Biological Science, Section of Ecology and Evolution
- Department of Scientific Computing, Florida State University, Tallahassee, FL 32306-4120, USA
- Mark T. Holder
- Department of Ecology and Evolution, University of Kansas, 6031 Haworth, 1200 Sunnyside Avenue, Lawrence, KS 66045
- Nick Goldman
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Gavin J. P. Naylor
- Department of Scientific Computing, Florida State University, Tallahassee, FL 32306-4120, USA
|
31
|
Baele G, Van de Peer Y, Vansteelandt S. Modelling the ancestral sequence distribution and model frequencies in context-dependent models for primate non-coding sequences. BMC Evol Biol 2010; 10:244. [PMID: 20698960 PMCID: PMC2928787 DOI: 10.1186/1471-2148-10-244] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 03/30/2010] [Accepted: 08/10/2010] [Indexed: 12/04/2022]
Abstract
Background: Recent approaches for context-dependent evolutionary modelling assume that the evolution of a given site depends upon its ancestor and that ancestor's immediate flanking sites. Because such a dependency pattern cannot be imposed on the root sequence, we consider the use of different orders of Markov chains to model dependence at the ancestral root sequence. Root distributions which are coupled to the context-dependent model across the underlying phylogenetic tree are deemed more realistic than decoupled Markov chain models, as the evolutionary process is responsible for shaping the composition of the ancestral root sequence.
Results: We find strong support, in terms of Bayes Factors, for using a second-order Markov chain at the ancestral root sequence along with a context-dependent model throughout the remainder of the phylogenetic tree in an ancestral repeats dataset, and for using a first-order Markov chain at the ancestral root sequence in a pseudogene dataset. Relaxing the assumption of a single context-independent set of independent model frequencies as presented in previous work yields a further drastic increase in model fit. We show that the substitution rates associated with the CpG-methylation-deamination process can be modelled through context-dependent model frequencies and that their accuracy depends on the (order of the) Markov chain imposed at the ancestral root sequence. In addition, we provide evidence that this approach (which assumes that root distribution and evolutionary model are decoupled) outperforms an approach inspired by the work of Arndt et al., where the root distribution is coupled to the evolutionary model. We show that the continuous-time approximation of Hwang and Green has stronger support in terms of Bayes Factors, but the parameter estimates show minimal differences.
Conclusions: We show that the combination of a dependency scheme at the ancestral root sequence and a context-dependent evolutionary model across the remainder of the tree allows for accurate estimation of the model's parameters. The different assumptions tested in this manuscript clearly show that designing accurate context-dependent models is a complex process, with many different assumptions that require validation. Further, these assumptions are shown to change across different datasets, making the search for an adequate model for a given dataset quite challenging.
Affiliation(s)
- Guy Baele
- Department of Plant Systems Biology, VIB, B-9052 Ghent, Belgium
|
32
|
Baele G, Van de Peer Y, Vansteelandt S. Using non-reversible context-dependent evolutionary models to study substitution patterns in primate non-coding sequences. J Mol Evol 2010; 71:34-50. [PMID: 20623275 DOI: 10.1007/s00239-010-9362-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 12/08/2009] [Accepted: 05/26/2010] [Indexed: 11/28/2022]
Abstract
We discuss the importance of non-reversible evolutionary models when analyzing context-dependence. Given the inherent non-reversible nature of the well-known CpG-methylation-deamination process in mammalian evolution, non-reversible context-dependent evolutionary models may be well suited to modeling such a process accurately. In particular, the lack of constraints on non-reversible substitution models might allow for more accurate estimation of context-dependent substitution parameters. To demonstrate this, we have developed different time-homogeneous context-dependent evolutionary models, based on existing independent evolutionary models, to analyze a large genomic dataset of primate ancestral repeats. We have calculated the difference in model fit for each of these models using Bayes Factors obtained via thermodynamic integration. We find that non-reversible context-dependent models can drastically increase model fit when compared to independent models, and that this holds on two primate non-coding datasets. We also show that further improvements are possible by clustering similar parameters across contexts.
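For readers unfamiliar with the technique, the identity underlying thermodynamic integration can be written as below. This is the generic power-posterior form and not necessarily the exact path used by the authors; the symbols $L$, $\pi$, $\beta$, $Z$, and $B_{10}$ are notation chosen here for illustration, and $\pi$ is assumed to be a normalized prior.

```latex
p_\beta(\theta) \propto L(\theta)^{\beta}\,\pi(\theta), \qquad 0 \le \beta \le 1,
\\[4pt]
\ln Z \;=\; \ln \int L(\theta)\,\pi(\theta)\,d\theta
      \;=\; \int_0^1 \mathbb{E}_{p_\beta}\!\left[\ln L(\theta)\right] d\beta,
\\[4pt]
\ln B_{10} \;=\; \ln Z_1 - \ln Z_0 .
```

In practice the integral over $\beta$ is approximated numerically from MCMC samples drawn at a series of $\beta$ values, and the log Bayes Factor between two models is the difference of their log marginal likelihoods.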
Affiliation(s)
- Guy Baele
- Department of Plant Systems Biology, VIB, Ghent University, Technologiepark 927, 9052, Ghent, Belgium.
|
33
|
Kleinman CL, Rodrigue N, Lartillot N, Philippe H. Statistical potentials for improved structurally constrained evolutionary models. Mol Biol Evol 2010; 27:1546-60. [PMID: 20159780 DOI: 10.1093/molbev/msq047] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Indexed: 12/27/2022]
Abstract
Assessing the influence of three-dimensional protein structure on sequence evolution is a difficult task, mainly because of the assumption of independence between sites required by probabilistic phylogenetic methods. Recently, models that include an explicit treatment of protein structure and site interdependencies have been developed: a statistical potential (an energy-like scoring system for sequence-structure compatibility) is used to evaluate the probability of fixation of a given mutation, assuming a coarse-grained protein structure that is constant through evolution. Yet, due to the novelty of these models and the small degree of overlap between the fields of structural and evolutionary biology, only simple representations of protein structure have been used so far. In this work, we present new forms of statistical potentials using a probabilistic framework recently developed for evolutionary studies. Terms related to pairwise distance interactions, torsion angles, solvent accessibility, and flexibility of the residues are included in the potentials, so as to study the effects of the main factors known to influence protein structure. The new potentials, with a more detailed representation of the protein structure, yield a better fit than the previously used scoring functions, with pairwise interactions contributing to more than half of this improvement. In a phylogenetic context, however, the structurally constrained models are still outperformed by some of the available site-independent models in terms of fit, possibly indicating that alternatives to coarse-grained statistical potentials should be explored in order to better model structural constraints.
Affiliation(s)
- Claudia L Kleinman
- Département de Biochimie, Centre Robert Cedergren, Université de Montréal, Montreal, Quebec, Canada.
|
34
|
Ronquist F, Deans AR. Bayesian phylogenetics and its influence on insect systematics. Annu Rev Entomol 2010; 55:189-206. [PMID: 19961329 DOI: 10.1146/annurev.ento.54.110807.090529] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Indexed: 05/28/2023]
Abstract
Bayesian inference and Markov chain Monte Carlo techniques have enjoyed enormous popularity since they were introduced into phylogenetics about a decade ago. We provide an overview of the field, with emphasis on recent developments of importance to empirical systematists. In particular, we describe a number of recent advances in the stochastic modeling of evolution that address major deficiencies in current models in a computationally efficient way. These include models of process heterogeneity across sites and lineages, as well as alignment-free models and model averaging approaches. Many of these methods should find their way into standard analyses in the near future. We also summarize the influence of Bayesian methods on insect systematics, with particular focus on current practices and how they could be improved using existing and emerging techniques.
Affiliation(s)
- Fredrik Ronquist
- Department of Entomology, Swedish Museum of Natural History, Stockholm, Sweden.
|
35
|
de Koning APJ, Gu W, Pollock DD. Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories. Mol Biol Evol 2009; 27:249-65. [PMID: 19783593 DOI: 10.1093/molbev/msp228] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
Likelihood-based approaches can reconstruct evolutionary processes in greater detail and with better precision from larger data sets. The extremely large comparative genomic data sets that are now being generated thus create new opportunities for understanding molecular evolution, but analysis of such large quantities of data poses escalating computational challenges. Recently developed Markov chain Monte Carlo methods that augment substitution histories are a promising approach to alleviate these computational costs. We analyzed the computational costs of several such approaches, considering how they scale with model and data set complexity. This provided a theoretical framework to understand the most important computational bottlenecks, leading us to combine novel variations of our conditional pathway integration approach with recent advances made by others. The resulting technique ("partial sampling" of substitution histories) is considerably faster than all other approaches we considered. It is accurate, simple to implement, and scales exceptionally well with dimensions of model complexity and data set size. In particular, the time complexity of sampling unobserved substitution histories using the new method is much lower than that of previously existing methods, and model parameter and branch length updates are independent of data set size. We compared the performance of methods on a 224-taxon set of mammalian cytochrome-b sequences. For a simple nucleotide substitution model, partial sampling was at least 10 times faster than the PhyloBayes program, which samples substitutions in continuous time, and about 100 times faster than when using fully integrated substitution histories. Under a general reversible model of amino acid substitution, the partial sampling method was 1,600 times faster than when using fully integrated substitution histories, confirming significantly improved scaling with model state-space complexity.
Partial sampling of substitutions thus dramatically improves the utility of likelihood approaches for analyzing complex evolutionary processes on large data sets.
Affiliation(s)
- A P Jason de Koning
- Department of Biochemistry and Molecular Genetics, and Consortium for Comparative Genomics, University of Colorado Denver School of Medicine, USA
|
36
|
Baele G, Van de Peer Y, Vansteelandt S. Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences. BMC Evol Biol 2009; 9:87. [PMID: 19405957 PMCID: PMC2695821 DOI: 10.1186/1471-2148-9-87] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 01/02/2009] [Accepted: 04/30/2009] [Indexed: 11/10/2022]
Abstract
Background: Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided to model context-dependent evolution, a large increase in model dimensionality is only justified when accompanied by careful model-building strategies that guard against overfitting. An increased dimensionality leads to increases in numerical computations of the models, increased convergence times in Bayesian Markov chain Monte Carlo algorithms and even more tedious Bayes Factor calculations.
Results: We have developed two model-search algorithms which reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via model-switch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose to split the calculations over different computers and to appropriately calibrate the individual runs. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of context-dependent evolution may increase model fit considerably and that the combination of a context-dependent model with the assumption of varying rates across sites offers even larger improvements in terms of model fit. Using a smaller nuclear SSU rRNA dataset, we show that context-dependence may only become detectable upon applying model-building strategies.
Conclusion: While context-dependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a context-dependent model and the performance in terms of model fit, we have designed two parameter-reduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a context-dependent evolutionary model.
Affiliation(s)
- Guy Baele
- Department of Applied Mathematics and Computer Science, Ghent University, Ghent, Belgium.
|
37
|
Rodrigue N, Kleinman CL, Philippe H, Lartillot N. Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons. Mol Biol Evol 2009; 26:1663-76. [DOI: 10.1093/molbev/msp078] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Indexed: 11/13/2022]
|
38
|
Baele G, Van de Peer Y, Vansteelandt S. A model-based approach to study nearest-neighbor influences reveals complex substitution patterns in non-coding sequences. Syst Biol 2008; 57:675-92. [PMID: 18853356 DOI: 10.1080/10635150802422324] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 01/22/2023]
Abstract
In this article, we present a likelihood-based framework for modeling site dependencies. Our approach builds upon standard evolutionary models but incorporates site dependencies across the entire tree by letting the evolutionary parameters in these models depend upon the ancestral states at the neighboring sites. It thus avoids the need for introducing new and high-dimensional evolutionary models for site-dependent evolution. We propose a Markov chain Monte Carlo approach with data augmentation to infer the evolutionary parameters under our model. Although our approach allows for wide-ranging site dependencies, we illustrate its use, in two non-coding datasets, in the case of nearest-neighbor dependencies (i.e., evolution directly depending only upon the immediate flanking sites). The results reveal that the general time-reversible model with nearest-neighbor dependencies substantially improves the fit to the data as compared to the corresponding model with site independence. Using the parameter estimates from our model, we elaborate on the importance of the 5-methylcytosine deamination process (i.e., the CpG effect) and show that this process also depends upon the 5' neighboring base identity. We hint at the possibility of a so-called TpA effect and show that the observed substitution behavior is very complex in the light of dinucleotide estimates. We also discuss the presence of CpG effects in a nuclear small subunit dataset and find significant evidence that evolutionary models incorporating context-dependent effects perform substantially better than independent-site models and in some cases even outperform models that incorporate varying rates across sites.
Affiliation(s)
- Guy Baele
- Department of Applied Mathematics and Computer Science, Ghent University, Ghent, Belgium
|
39
|
Abstract
Probabilistic models of sequence evolution are in widespread use in phylogenetics and molecular sequence evolution. These models have become increasingly sophisticated and combined with statistical model comparison techniques have helped to shed light on how genes and proteins evolve. Models of codon evolution have been particularly useful, because, in addition to providing a significant improvement in model realism for protein-coding sequences, codon models can also be designed to test hypotheses about the selective pressures that shape the evolution of the sequences. Such models typically assume a phylogeny and can be used to identify sites or lineages that have evolved adaptively. Recently some of the key assumptions that underlie phylogenetic tests of selection have been questioned, such as the assumption that the rate of synonymous changes is constant across sites or that a single phylogenetic tree can be assumed at all sites for recombining sequences. While some of these issues have been addressed through the development of novel methods, others remain as caveats that need to be considered on a case-by-case basis. Here, we outline the theory of codon models and their application to the detection of positive selection. We review some of the more recent developments that have improved their power and utility, laying a foundation for further advances in the modeling of coding sequence evolution.
Affiliation(s)
- Wayne Delport
- University of Cape Town, Observatory, 7925, Cape Town, South Africa
|
40
|
Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 2008; 26:255-71. [PMID: 18922761 DOI: 10.1093/molbev/msn232] [Citation(s) in RCA: 127] [Impact Index Per Article: 7.9] [Indexed: 11/12/2022]
Abstract
This review is motivated by the true explosion in the number of recent studies both developing and ameliorating probabilistic models of codon evolution. Traditionally parametric, the first codon models focused on estimating the effects of selective pressure on the protein via an explicit parameter in the maximum likelihood framework. Likelihood ratio tests of nested codon models armed the biologists with powerful tools, which provided unambiguous evidence for positive selection in real data. This, in turn, triggered a new wave of methodological developments. The new generation of models views the codon evolution process in a more sophisticated way, relaxing several mathematical assumptions. These models make a greater use of physicochemical amino acid properties, genetic code machinery, and the large amounts of data from the public domain. The overview of the most recent advances on modeling codon evolution is presented here, and a wide range of their applications to real data is discussed. On the downside, availability of a large variety of models, each accounting for various biological factors, increases the margin for misinterpretation; the biological meaning of certain parameters may vary among models, and model selection procedures also deserve greater attention. Solid understanding of the modeling assumptions and their applicability is essential for successful statistical data analysis.
Affiliation(s)
- Maria Anisimova
- Institute of Computational Science, Swiss Federal Institute of Technology, Zurich, Switzerland.
|
41
|
Abstract
In 1994, Muse and Gaut (MG) and Goldman and Yang (GY) proposed evolutionary models that recognize the coding structure of the nucleotide sequences under study, by defining a Markovian substitution process with a state space consisting of the 61 sense codons (assuming the universal genetic code). Several variations and extensions to their models have since been proposed, but no general and flexible framework for contrasting the relative performance of alternative approaches has yet been applied. Here, we compute Bayes factors to evaluate the relative merit of several MG and GY styles of codon substitution models, including recent extensions acknowledging heterogeneous nonsynonymous rates across sites, as well as selective effects inducing uneven amino acid or codon preferences. Our results on three real data sets support a logical model construction following the MG formulation, allowing for a flexible account of global amino acid or codon preferences, while maintaining distinct parameters governing overall nucleotide propensities. Through posterior predictive checks, we highlight the importance of such a parameterization. Altogether, the framework presented here suggests a broad modeling project in the MG style, stressing the importance of combining and contrasting available model formulations and grounding developments in a sound probabilistic paradigm.
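The 61-state codon space mentioned in this abstract can be enumerated directly. The sketch below builds the sense codons of the universal genetic code and lists, for a given codon, the sense codons reachable by a single nucleotide change, which are the only transitions typically given non-zero instantaneous rates in MG- and GY-style models. The names `STOP_CODONS`, `SENSE_CODONS`, and `single_nucleotide_neighbors` are illustrative and do not come from the paper.

```python
from itertools import product

# Stop codons of the universal (standard) genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# All 64 nucleotide triplets minus the three stop codons.
SENSE_CODONS = [
    "".join(triplet)
    for triplet in product("ACGT", repeat=3)
    if "".join(triplet) not in STOP_CODONS
]

def single_nucleotide_neighbors(codon):
    """Return the sense codons differing from `codon` at exactly one position."""
    return [
        other
        for other in SENSE_CODONS
        if sum(a != b for a, b in zip(codon, other)) == 1
    ]

print(len(SENSE_CODONS))                    # size of the state space
print(single_nucleotide_neighbors("AAA"))   # TAA is excluded as a stop codon
```

Restricting the rate matrix to these neighbors keeps it sparse: each codon has at most nine single-nucleotide neighbors, fewer when some of them are stop codons.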
|
42
|
Choi SC, Stone EA, Kishino H, Thorne JL. Estimates of natural selection due to protein tertiary structure inform the ancestry of biallelic loci. Gene 2008; 441:45-52. [PMID: 18725272 DOI: 10.1016/j.gene.2008.07.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 03/01/2008] [Accepted: 07/10/2008] [Indexed: 10/21/2022]
Abstract
We consider the inference of which of two alleles is ancestral when the alleles have a single nonsynonymous difference and when natural selection acts via protein tertiary structure. Whereas the probability that an allele is ancestral under neutrality is equal to its frequency, under selection this probability depends on allele frequency and on the magnitude and direction of selection pressure. Although allele frequencies can be well estimated from intraspecific data, small fitness differences have a large evolutionary impact but can be difficult to estimate with only intraspecific data. Methods for predicting aspects of phenotype from genotype can supplement intraspecific sequence data. Recently developed statistical techniques can assess effects of phenotypes, such as protein tertiary structure, on molecular evolution. While these techniques were initially designed for comparing protein-coding genes from different species, the resulting interspecific inferences can be assigned population genetic interpretations to assess the effect of selection pressure, and we use them here along with intraspecific allele frequency data to estimate the probability that an allele is ancestral. We focus on 140 nonsynonymous single nucleotide polymorphisms of humans that are in proteins with known tertiary structures. We find that our technique for employing protein tertiary structure information yields some biologically plausible results but that it does not substantially improve the inference of ancestral human allele types.
Affiliation(s)
- Sang Chul Choi
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695-7566, USA
|
43
|
Whelan S. Spatial and Temporal Heterogeneity in Nucleotide Sequence Evolution. Mol Biol Evol 2008; 25:1683-94. [DOI: 10.1093/molbev/msn119] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022]
|
44
|
Characterizing positive and negative selection and their phylogenetic effects. Gene 2008; 418:22-6. [PMID: 18486364 DOI: 10.1016/j.gene.2008.03.017] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 10/08/2007] [Revised: 02/27/2008] [Accepted: 03/27/2008] [Indexed: 11/22/2022]
Abstract
Total evidence and the use of large datasets to overcome uncertainty are the state of the art in systematic analysis. This assumes that the only true phylogenetic signal is ancestry and that functional, structural, and other factors will not add an alternative signal. Using gene families, where individual codon positions were sorted into bins based upon average-pairwise dN/dS ratio, we show that standard, common phylogenetic methods that were designed for stochastic, neutral, site-independent processes, generate less robust phylogenetic signal for bins with strong negative or positive selection. This was true for phylogenetic reconstruction with parsimony, distance, and likelihood methods. Further, we present a case for the potential existence of systematic functional or structural signal that competes with ancestral signal. For the example of positive selection, we simulate the evolution of sequences through three dimensional lattice constructs with folding constraint and changing binding functionality and show that total evidence for these lattice genes presents trees with functional signal, but that the neutral synonymous sites in these genes show the true ancestral signal. In this case, sequence convergence is promoted by functional convergence.
|
45
|
Uniformization for sampling realizations of Markov processes: applications to Bayesian implementations of codon substitution models. Bioinformatics 2007; 24:56-62. [DOI: 10.1093/bioinformatics/btm532] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.2] [Indexed: 11/14/2022]
|
46
|
Gouveia-Oliveira R, Pedersen AG. Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation. Algorithms Mol Biol 2007; 2:12. [PMID: 17915013 PMCID: PMC2234412 DOI: 10.1186/1748-7188-2-12] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.2] [Received: 05/25/2007] [Accepted: 10/03/2007] [Indexed: 11/10/2022] Open
Abstract
Background: Some amino acid residues functionally interact with each other. This interaction will result in an evolutionary co-variation between these residues, known as coevolution. Our goal is to find these coevolving residues. Results: We present six new methods for detecting coevolving residues. Among other things, we suggest measures that are variants of Mutual Information, and measures that use a multidimensional representation of each residue in order to capture the physico-chemical similarities between amino acids. We created a benchmarking system, in silico, able to evaluate these methods through a wide range of realistic conditions. Finally, we use the combination of different methods as a way of improving performance. Conclusion: Our best method (Row and Column Weighed Mutual Information) has an estimated accuracy increase of 63% over Mutual Information. Furthermore, we show that the combination of different methods is efficient, and that the methods are quite sensitive to the different conditions tested.
Affiliation(s)
- Rodrigo Gouveia-Oliveira
- Center for Biological sequence analysis, The Technical University of Denmark, Building 208, 2800 Lyngby, Denmark
- Anders G Pedersen
- Center for Biological sequence analysis, The Technical University of Denmark, Building 208, 2800 Lyngby, Denmark
|
47
|
Rodrigue N, Philippe H, Lartillot N. Exploring Fast Computational Strategies for Probabilistic Phylogenetic Analysis. Syst Biol 2007; 56:711-26. [PMID: 17849326 DOI: 10.1080/10635150701611258] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 12/30/2022] Open
Abstract
In recent years, the advent of Markov chain Monte Carlo (MCMC) techniques, coupled with modern computational capabilities, has enabled the study of evolutionary models without a closed-form solution of the likelihood function. However, current Bayesian MCMC applications can incur significant computational costs, as they are based on a full sampling from the posterior probability distribution of the parameters of interest. Here, we draw attention to how MCMC techniques can be embedded within normal approximation strategies for more economical statistical computation. The overall procedure is based on an estimate of the first and second moments of the likelihood function, as well as a maximum likelihood estimate. Through examples, we review several MCMC-based methods used in the statistical literature for such estimation, applying the approaches to constructing posterior distributions under non-analytical evolutionary models relaxing the assumptions of rate homogeneity, and of independence between sites. Finally, we use the procedures for conducting Bayesian model selection, based on Laplace approximations of Bayes factors, which we find to be accurate and computationally advantageous. Altogether, the methods we expound here, as well as other related approaches from the statistical literature, should prove useful when investigating increasingly complex descriptions of molecular evolution, alleviating some of the difficulties associated with non-analytical models.
Affiliation(s)
- Nicolas Rodrigue
- Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Québec, Canada.
|
48
|
Anisimova M, Liberles DA. The quest for natural selection in the age of comparative genomics. Heredity (Edinb) 2007; 99:567-79. [PMID: 17848974 DOI: 10.1038/sj.hdy.6801052] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.0] [Indexed: 11/08/2022] Open
Abstract
Continued genome sequencing has fueled progress in statistical methods for understanding the action of natural selection at the molecular level. This article reviews various statistical techniques (and their applicability) for detecting adaptation events and the functional divergence of proteins. As large-scale automated studies become more frequent, they provide a useful resource for generating biological null hypotheses for further experimental and statistical testing. Furthermore, they shed light on typical patterns of lineage-specific evolution of organisms, on the functional and structural evolution of protein families and on the interplay between the two. More complex models are being developed to better reflect the underlying biological and chemical processes and to complement simpler statistical models. Linking molecular processes to their statistical signatures in genomes can be demanding, and the proper application of statistical models is discussed.
Affiliation(s)
- M Anisimova
- Department of Biology, University College London, London, UK
|
49
|
Thorne JL. Protein evolution constraints and model-based techniques to study them. Curr Opin Struct Biol 2007; 17:337-41. [PMID: 17572082 DOI: 10.1016/j.sbi.2007.05.006] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Received: 02/16/2007] [Revised: 04/11/2007] [Accepted: 05/29/2007] [Indexed: 11/17/2022]
Abstract
There have been substantial improvements in statistical tools for assessing the evolutionary roles of mutation and natural selection from interspecific sequence data. The importance of having the rate at which a point mutation occurs depend on the DNA sequence at sites surrounding the mutation is now better appreciated and can be accommodated in probabilistic models of protein evolution. To quantify the evolutionary impact of some aspect of phenotype, one promising strategy is to develop a system for predicting phenotype from the DNA sequence and to then infer how the evolutionary rates of sequence change are affected by the predicted phenotypic consequences of the changes. Although statistical tools for characterizing protein evolution are improving, the list of candidate phenomena that can affect rates of protein evolution is long and the relative contributions of these phenomena are only beginning to be disentangled.
Affiliation(s)
- Jeffrey L Thorne
- Wissenschaftskolleg zu Berlin, Wallotstrasse 19, 14193 Berlin, Germany.
|