1
Stark TL, Liberles DA. Characterizing Amino Acid Substitution with Complete Linkage of Sites on a Lineage. Genome Biol Evol 2021; 13:6377338. [PMID: 34581792 PMCID: PMC8557849 DOI: 10.1093/gbe/evab225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 09/17/2021] [Indexed: 11/16/2022]
Abstract
Amino acid substitution models are commonly used for phylogenetic inference, for ancestral sequence reconstruction, and for the inference of positive selection. All commonly used models explicitly assume that each site evolves independently, an assumption that is violated by both linkage and protein structural and functional constraints. We introduce two new models for amino acid substitution which incorporate linkage between sites, each based on the (population-genetic) Moran model. The first model is a generalized population process tracking arbitrarily many sites which undergo mutation, with individuals replaced according to their fitnesses. This model provides a reasonably complete framework for simulations but is numerically and analytically intractable. We also introduce a second model which includes several simplifying assumptions but for which some theoretical results can be derived. We analyze the simplified model to determine conditions where linkage is likely to have meaningful effects on sitewise substitution probabilities, as well as conditions under which the effects are likely to be negligible. These findings are an important step in the generation of tractable phylogenetic models that parameterize selective coefficients for amino acid substitution while accounting for linkage of sites leading to both hitchhiking and background selection.
Affiliation(s)
- Tristan L Stark
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, USA
- David A Liberles
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, USA
2
Stark TL, Kaufman RS, Maltepes MA, Chi PB, Liberles DA. Detecting Selection on Segregating Gene Duplicates in a Population. J Mol Evol 2021; 89:554-564. [PMID: 34341836 DOI: 10.1007/s00239-021-10024-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2021] [Accepted: 07/20/2021] [Indexed: 11/26/2022]
Abstract
Gene duplication is a fundamental process that has the potential to drive phenotypic differences between populations and species. While evolutionarily neutral changes have the potential to affect phenotypes, detecting selection acting on gene duplicates can uncover cases of adaptive diversification. Existing methods to detect selection on duplicates work mostly inter-specifically and are based upon selection on coding sequence changes. Here we present a method to detect selection directly on a copy number variant segregating in a population. The method relies upon expected relationships between allele (new duplication) age and frequency in the population, dependent upon the effective population size. Using both a haploid and a diploid population with a Moran model under several population sizes, the neutral baseline for copy number variants is established. The ability of the method to reject neutrality for duplicates with known age (measured in pairwise dS value) and frequency in the population is established through mathematical analysis and through simulations. Power is particularly good in the diploid case and with larger effective population sizes, as expected. With extension of this method to larger population sizes, this is a tool to analyze selection on copy number variants in any natural or experimentally evolving population. We have made an R package available at https://github.com/peterbchi/CNVSelectR/ which implements the method introduced here.
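The neutral baseline described in this abstract rests on how the copy number of a new variant drifts under a Moran model. The following is a minimal sketch of a haploid neutral Moran process, not the CNVSelectR implementation; the function name, the parameters, and the use of birth-death events as the unit of age are assumptions made for illustration.

```python
import random


def simulate_neutral_moran(pop_size, max_steps, seed=0):
    """Track the copy number of a new neutral variant in a haploid Moran model.

    Returns the list of copy counts, starting from a single copy, until the
    variant is lost, fixed, or max_steps birth-death events have occurred.
    """
    rng = random.Random(seed)
    count = 1  # a new duplicate starts as a single copy (assumption)
    trajectory = [count]
    for _ in range(max_steps):
        if count == 0 or count == pop_size:
            break  # absorbed: variant lost or fixed
        freq = count / pop_size
        parent_is_variant = rng.random() < freq  # uniform choice of parent
        dying_is_variant = rng.random() < freq   # uniform choice of death
        count += int(parent_is_variant) - int(dying_is_variant)
        trajectory.append(count)
    return trajectory


if __name__ == "__main__":
    traj = simulate_neutral_moran(pop_size=100, max_steps=10_000, seed=1)
    age_in_events = len(traj) - 1
    print(f"age (birth-death events): {age_in_events}, final copies: {traj[-1]}")
```

Repeating such runs and recording the age at which a variant reaches a given frequency gives an empirical neutral age-frequency relationship; a variant that is much younger than neutral trajectories at the same frequency would be a candidate for selection.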
Affiliation(s)
- Tristan L Stark
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
  - Discipline of Mathematics, University of Tasmania, Hobart, Tasmania, 7001, Australia
- Rebecca S Kaufman
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
- Maria A Maltepes
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
- Peter B Chi
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
  - Department of Mathematics and Statistics, Villanova University, Villanova, PA, 19085, USA
- David A Liberles
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
3
Dornhaus A, Smith B, Hristova K, Buckley LB. How can we fully realize the potential of mathematical and biological models to reintegrate biology? Integr Comp Biol 2021; 61:2244-2254. [PMID: 34160617 DOI: 10.1093/icb/icab142] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Both mathematical models and biological model systems stand as tractable representations of complex biological systems or behaviors. They facilitate research and provide insights, and they can describe general rules. Models that represent biological processes or formalize general hypotheses are essential to any broad understanding. Mathematical or biological models necessarily omit details of the natural systems and thus may ultimately be "incorrect" representations. A key challenge is that tractability requires relatively simple models but simplification can result in models that are incorrect in their qualitative, broad implications if the abstracted details matter. Our paper discusses this tension, and how we can improve our inferences from models. We advocate for further efforts dedicated to model development, improvement, and acceptance by the scientific community, all of which may necessitate a more explicit discussion of the purpose and power of models. We argue that models should play a central role in reintegrating biology as a way to test our integrated understanding of how molecules, cells, organs, organisms, populations, and ecosystems function.
Affiliation(s)
- Anna Dornhaus
  - Department of Ecology & Evolutionary Biology, University of Arizona, Tucson, AZ 85721
- Brian Smith
  - School of Life Sciences, Arizona State University, Tempe, AZ 85287
- Kalina Hristova
  - Department of Materials Science and Engineering, and Program in Molecular Biology, Johns Hopkins University, Baltimore, MD 21218
- Lauren B Buckley
  - Department of Biology, University of Washington, Seattle, WA 98115
4
Spielman SJ. Relative Model Fit Does Not Predict Topological Accuracy in Single-Gene Protein Phylogenetics. Mol Biol Evol 2021; 37:2110-2123. [PMID: 32191313 PMCID: PMC7306691 DOI: 10.1093/molbev/msaa075] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 12/12/2022]
Abstract
It is regarded as best practice in phylogenetic reconstruction to perform relative model selection to determine an appropriate evolutionary model for the data. This procedure ranks a set of candidate models according to their goodness of fit to the data, commonly using an information theoretic criterion. Users then specify the best-ranking model for inference. Although it is often assumed that better-fitting models translate to increased accuracy, recent studies have shown that the specific model employed may not substantially affect inferences. We examine whether there is a systematic relationship between relative model fit and topological inference accuracy in protein phylogenetics, using simulations and real sequences. Simulations employed site-heterogeneous mechanistic codon models that are distinct from protein-level phylogenetic inference models, allowing us to investigate how protein models perform when they are misspecified to the data, as will be the case for any real sequence analysis. We broadly find that phylogenies inferred across models with vastly different fits to the data produce highly consistent topologies. We additionally find that all models infer similar proportions of false-positive splits, raising the possibility that all available models of protein evolution are similarly misspecified. Moreover, we find that the parameter-rich GTR (general time reversible) model, whose amino acid exchangeabilities are free parameters, performs similarly to models with fixed exchangeabilities, although the inference precision associated with GTR models was not examined. We conclude that, although relative model selection may not hinder phylogenetic analysis on protein data, it may not offer specific predictable improvements and is not a reliable proxy for accuracy.
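The ranking step mentioned in this abstract can be illustrated with the Akaike information criterion, one common information theoretic criterion. This is a minimal sketch rather than the study's pipeline: the model names, log-likelihoods, and parameter counts below are hypothetical placeholders, and the abstract does not state which criterion was used.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: lower values indicate better relative fit."""
    return 2 * num_params - 2 * log_likelihood


def rank_models(fits):
    """Sort (name, log_likelihood, num_params) tuples from best to worst AIC."""
    scored = [(name, aic(lnl, k)) for name, lnl, k in fits]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical fits: (model name, maximized log-likelihood, free parameters).
    candidate_fits = [
        ("model_A", -10250.0, 41),
        ("model_B", -10190.0, 60),
        ("model_C", -10110.0, 230),
    ]
    for name, score in rank_models(candidate_fits):
        print(f"{name}: AIC = {score:.1f}")
```

The ranking is purely relative: it orders the candidates against one another and says nothing about whether the best-ranked model is adequate, which is the gap between fit and accuracy that the abstract examines.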
5
Jones CT, Youssef N, Susko E, Bielawski JP. A Phenotype-Genotype Codon Model for Detecting Adaptive Evolution. Syst Biol 2021; 69:722-738. [PMID: 31730199 DOI: 10.1093/sysbio/syz075] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/07/2019] [Revised: 11/09/2019] [Accepted: 11/11/2019] [Indexed: 01/03/2023]
Abstract
A central objective in biology is to link adaptive evolution in a gene to structural and/or functional phenotypic novelties. Yet most analytic methods make inferences mainly from either phenotypic data or genetic data alone. A small number of models have been developed to infer correlations between the rate of molecular evolution and changes in a discrete or continuous life history trait. But such correlations are not necessarily evidence of adaptation. Here, we present a novel approach called the phenotype-genotype branch-site model (PG-BSM) designed to detect evidence of adaptive codon evolution associated with discrete-state phenotype evolution. An episode of adaptation is inferred under standard codon substitution models when there is evidence of positive selection in the form of an elevation in the nonsynonymous-to-synonymous rate ratio $\omega$ to a value $\omega > 1$. As it is becoming increasingly clear that $\omega > 1$ can occur without adaptation, the PG-BSM was formulated to infer an instance of adaptive evolution without appealing to evidence of positive selection. The null model makes use of a covarion-like component to account for general heterotachy (i.e., random changes in the evolutionary rate at a site over time). The alternative model employs samples of the phenotypic evolutionary history to test for phenomenological patterns of heterotachy consistent with specific mechanisms of molecular adaptation. These include 1) a persistent increase/decrease in $\omega$ at a site following a change in phenotype (the pattern) consistent with an increase/decrease in the functional importance of the site (the mechanism); and 2) a transient increase in $\omega$ at a site along a branch over which the phenotype changed (the pattern) consistent with a change in the site's optimal amino acid (the mechanism). 
Rejection of the null is followed by post hoc analyses to identify sites with strongest evidence for adaptation in association with changes in the phenotype as well as the most likely evolutionary history of the phenotype. Simulation studies based on a novel method for generating mechanistically realistic signatures of molecular adaptation show that the PG-BSM has good statistical properties. Analyses of real alignments show that site patterns identified post hoc are consistent with the specific mechanisms of adaptation included in the alternate model. Further simulation studies show that the covarion-like component of the PG-BSM plays a crucial role in mitigating recently discovered statistical pathologies associated with confounding by accounting for heterotachy-by-any-cause. [Adaptive evolution; branch-site model; confounding; mutation-selection; phenotype-genotype.].
Affiliation(s)
- Christopher T Jones
  - Department of Mathematics and Statistics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
- Noor Youssef
  - Department of Biology, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
- Joseph P Bielawski
  - Department of Mathematics and Statistics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
  - Department of Biology, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 1233 LeMarchant Street, B3H 4R2, Halifax, Nova Scotia, Canada
6
Affiliation(s)
- David A Liberles
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, 19122, USA
7
Abstract
Proteins are commonly used as molecular targets against pathogens such as viruses and bacteria. However, pathogens can evolve rapidly, permitting their populations to increase in protein diversity over time and thus escape the activity of a molecular therapy. Consequently, in order to design more durable and robust therapies, as well as to understand viral evolution in a host and subsequent transmission, it is central to understand the evolution of pathogen proteins. This understanding can enable the detection of protein regions that are potential targets for therapies and the prediction of the emergence of molecular resistance against therapies. In this direction, two articles published recently in the Journal of Molecular Evolution investigated the evolution of proteomes of diverse flaviviruses, including Zika virus, Dengue virus and West Nile virus. Here I discuss the importance of considering the evolution of viral proteins, using models and methods that mimic protein evolution as realistically as possible, to improve the design of antiviral therapies.
8
Griswold CK. Properties of Samples With Segregating Polymerase Chain Reaction (PCR) Dropout Mutations Within a Species. Evol Bioinform Online 2019; 15:1176934319883612. [PMID: 31723319 PMCID: PMC6831972 DOI: 10.1177/1176934319883612] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/06/2019] [Accepted: 09/25/2019] [Indexed: 11/17/2022]
Abstract
In polymerase chain reaction (PCR)-based DNA sequencing studies, there is the possibility that mutations at the binding sites of primers result in no primer binding and therefore no amplification. In this article, we call such mutations PCR dropouts and present a coalescent-based theory of the distribution of segregating PCR dropout mutations within a species. We show that dropout mutations typically occur along branch sections that are at or near the base of a coalescent tree, if at all. Given that a dropout mutation occurs along a branch section near the base of a tree, there is a good chance that it causes the alleles of a large fraction of a species to go unamplified, which distorts the tree shape. Expected coalescence times and distributions of pairwise sequence differences in the presence of PCR dropout mutations are derived under the assumptions of both neutrality and background selection. These expectations differ from when PCR dropout mutations are absent and may form the basis of inferential approaches to detect the presence of dropout mutations, as well as the development of unbiased estimators of statistics associated with population-level genetic variation.
9
Jones CT, Youssef N, Susko E, Bielawski JP. Phenomenological Load on Model Parameters Can Lead to False Biological Conclusions. Mol Biol Evol 2019; 35:1473-1488. [PMID: 29596684 DOI: 10.1093/molbev/msy049] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 12/15/2022]
Abstract
When a substitution model is fitted to an alignment using maximum likelihood, its parameters are adjusted to account for as much site-pattern variation as possible. A parameter might therefore absorb a substantial quantity of the total variance in an alignment (or more formally, bring about a substantial reduction in the deviance of the fitted model) even if the process it represents played no role in the generation of the data. When this occurs, we say that the parameter estimate carries phenomenological load (PL). Large PL in a parameter estimate is a concern because it not only invalidates its mechanistic interpretation (if it has one) but also increases the likelihood that it will be found to be statistically significant. The problem of PL was not identified in the past because most off-the-shelf substitution models make simplifying assumptions that preclude the generation of realistic levels of variation. In this study, we use the more realistic mutation-selection framework as the basis of a generating model formulated to produce data that mimic an alignment of mammalian mitochondrial DNA. We show that a parameter estimate can carry PL when 1) the substitution model is underspecified and 2) the parameter represents a process that is confounded with other processes represented in the data-generating model. We then provide a method that can be used to identify signal for the process that a given parameter represents despite the existence of PL.
Affiliation(s)
- Christopher T Jones
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
- Noor Youssef
  - Department of Biology, Dalhousie University, Halifax, NS, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
10
Dunn KA, Kenney T, Gu H, Bielawski JP. Improved inference of site-specific positive selection under a generalized parametric codon model when there are multinucleotide mutations and multiple nonsynonymous rates. BMC Evol Biol 2019; 19:22. [PMID: 30642241 PMCID: PMC6332903 DOI: 10.1186/s12862-018-1326-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/08/2018] [Accepted: 12/11/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND An excess of nonsynonymous substitutions, over neutrality, is considered evidence of positive Darwinian selection. Inference for proteins often relies on estimation of the nonsynonymous to synonymous ratio (ω = dN/dS) within a codon model. However, to ease computational difficulties, ω is typically estimated assuming an idealized substitution process where (i) all nonsynonymous substitutions have the same rate (regardless of impact on organism fitness) and (ii) instantaneous double and triple (DT) nucleotide mutations have zero probability (despite evidence that they can occur). It follows that estimates of ω represent an imperfect summary of the intensity of selection, and that tests based on the ω > 1 threshold could be negatively impacted. RESULTS We developed a general-purpose parametric (GPP) modelling framework for codons. This novel approach allows specification of all possible instantaneous codon substitutions, including multiple nonsynonymous rates (MNRs) and instantaneous DT nucleotide changes. Existing codon models are specified as special cases of the GPP model. We use GPP models to implement likelihood ratio tests for ω > 1 that accommodate MNRs and DT mutations. Through both simulation and real data analysis, we find that failure to model MNRs and DT mutations reduces power in some cases and inflates false positives in others. False positives under traditional M2a and M8 models were very sensitive to DT changes. This was exacerbated by the choice of frequency parameterization (GY vs. MG), with rates sometimes > 90% under MG. By including MNRs and DT mutations, accuracy and power were greatly improved under the GPP framework. However, we also find that over-parameterized models can perform less well, and this can contribute to degraded performance of LRTs. CONCLUSIONS We suggest GPP models should be used alongside traditional codon models. 
Further, all codon models should be deployed within an experimental design that includes (i) assessing robustness to model assumptions, and (ii) investigation of non-standard behaviour of MLEs. As the goal of every analysis is to avoid false conclusions, more work is needed on model selection methods that consider both the increase in fit engendered by a model parameter and the degree to which that parameter is affected by un-modelled evolutionary processes.
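The likelihood ratio tests for ω > 1 referred to in this abstract compare the maximized log-likelihoods of a null model and a nested alternative. The sketch below shows only the arithmetic of such a test; the log-likelihood values are hypothetical, and the chi-square reference distribution with 2 degrees of freedom is an assumption (a common convention for nested site-model comparisons), not something the abstract specifies. As the abstract itself warns, non-standard behaviour of MLEs can make such a reference distribution unreliable.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic 2 * (lnL_alt - lnL_null) for nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0) if x > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods for a null and an alternative model.
    lnl_null, lnl_alt = -4321.7, -4316.2
    stat = lrt_statistic(lnl_null, lnl_alt)
    print(f"LRT statistic = {stat:.2f}, p = {chi2_sf_df2(stat):.4f}")
```

The closed form exp(-x/2) is exact only for 2 degrees of freedom; other comparisons would need a general chi-square survival function.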
Affiliation(s)
- Katherine A. Dunn
  - Department of Biology, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Toby Kenney
  - Department of Mathematics & Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Hong Gu
  - Department of Mathematics & Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
- Joseph P. Bielawski
  - Department of Biology, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
  - Department of Mathematics & Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4J1 Canada
  - Centre Comparative Genomics and Evolutionary Bioinformatics (CGEB) at Dalhousie University, Halifax, Canada
11
Looking for Darwin in Genomic Sequences: Validity and Success Depends on the Relationship Between Model and Data. Methods Mol Biol 2019; 1910:399-426. [PMID: 31278672 DOI: 10.1007/978-1-4939-9074-0_13] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/01/2023]
Abstract
Codon substitution models (CSMs) are commonly used to infer the history of natural selection for a set of protein-coding sequences, often with the explicit goal of detecting the signature of positive Darwinian selection. However, the validity and success of CSMs used in conjunction with the maximum likelihood (ML) framework is sometimes challenged with claims that the approach might too often support false conclusions. In this chapter, we use a case study approach to identify four legitimate statistical difficulties associated with inference of evolutionary events using CSMs. These include: (1) model misspecification, (2) low information content, (3) the confounding of processes, and (4) phenomenological load, or PL. While past criticisms of CSMs can be connected to these issues, the historical critiques were often misdirected, or overstated, because they failed to recognize that the success of any model-based approach depends on the relationship between model and data. Here, we explore this relationship and provide a candid assessment of the limitations of CSMs to extract historical information from extant sequences. To aid in this assessment, we provide a brief overview of: (1) a more realistic way of thinking about the process of codon evolution framed in terms of population genetic parameters, and (2) a novel presentation of the ML statistical framework. We then divide the development of CSMs into two broad phases of scientific activity and show that the latter phase is characterized by increases in model complexity that can sometimes negatively impact inference of evolutionary mechanisms. Such problems are not yet widely appreciated by the users of CSMs. These problems can be avoided by using a model that is appropriate for the data; but, understanding the relationship between the data and a fitted model is a difficult task. 
We argue that the only way to properly understand that relationship is to perform in silico experiments using a generating process that can mimic the data as closely as possible. The mutation-selection modeling framework (MutSel) is presented as the basis of such a generating process. We contend that if complex CSMs continue to be developed for testing explicit mechanistic hypotheses, then additional analyses such as those described here (e.g., penalized LRTs and estimation of PL) will need to be applied alongside the more traditional inferential methods.
12
Yohe LR, Liu L, Dávalos LM, Liberles DA. Protocols for the Molecular Evolutionary Analysis of Membrane Protein Gene Duplicates. Methods Mol Biol 2019; 1851:49-62. [PMID: 30298391 DOI: 10.1007/978-1-4939-8736-8_3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 02/02/2023]
Abstract
Gene duplication is an important process in the evolution of gene content in eukaryotic genomes. Understanding when gene duplicates contribute new molecular functions to genomes through molecular adaptation is one important goal in comparative genomics. In large gene families, however, characterizing adaptation and neofunctionalization across species is challenging, as models have traditionally quantified the timing of duplications without considering underlying gene trees. This protocol combines multiple approaches to detect adaptation in protein duplicates at a phylogenetic scale. We include a description of models for gene tree-species tree reconciliation that enable different types of inference, as well as a practical guide to their use. Although simulation-based approaches successfully detect shifts in the rate of duplication/retention, the conflation between the duplication and retention processes, the distinct trajectories of duplicates under non-, sub-, and neofunctionalization, as well as dosage effects offer hitherto unexplored analytical avenues. We introduce mathematical descriptions of these probabilities and offer a road map to computational implementation whose starting point is parsimony reconciliation. Sequence evolution information based on the ratio of nonsynonymous to synonymous nucleotide substitution rates (dN/dS) can be combined with duplicate survival probabilities to better predict the emergence of new molecular functions in retained duplicates. Together, these methods enable characterization of potentially adaptive candidate duplicates whose neofunctionalization may contribute to phenotypic divergence across species.
Affiliation(s)
- Laurel R Yohe
  - Department of Geology & Geophysics, Yale University, New Haven, CT, USA
- Liang Liu
  - Department of Statistics and Institute of Bioinformatics, University of Georgia, Athens, GA, USA
- Liliana M Dávalos
  - Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY, USA
- David A Liberles
  - Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, PA, USA
13
Using the Mutation-Selection Framework to Characterize Selection on Protein Sequences. Genes (Basel) 2018; 9:genes9080409. [PMID: 30104502 PMCID: PMC6115872 DOI: 10.3390/genes9080409] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 06/19/2018] [Revised: 08/02/2018] [Accepted: 08/09/2018] [Indexed: 12/13/2022]
Abstract
When mutational pressure is weak, the generative process of protein evolution involves explicit probabilities of mutations of different types coupled to their conditional probabilities of fixation dependent on selection. Establishing this mechanistic modeling framework for the detection of selection has been a goal in the field of molecular evolution. Building on a mathematical framework proposed more than a decade ago, numerous methods have been introduced in an attempt to detect and measure selection on protein sequences. In this review, we discuss the structure of the original model, subsequent advances, and the series of assumptions that these models operate under.
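The coupling of mutation probabilities to conditional fixation probabilities described in this abstract is often written as a substitution rate equal to a mutation rate multiplied by a relative fixation rate. The sketch below assumes the widely used weak-mutation form S / (1 - exp(-S)), where S is the difference in scaled fitness between the proposed and current states; the function names and the scaling convention are assumptions for illustration, and the review may discuss other formulations.

```python
import math


def fixation_factor(scaled_s):
    """Relative fixation rate S / (1 - exp(-S)); equals 1 for a neutral change."""
    if abs(scaled_s) < 1e-8:
        return 1.0 + scaled_s / 2.0  # first-order expansion avoids 0/0
    if scaled_s > 0:
        return scaled_s / -math.expm1(-scaled_s)
    # Algebraically identical form that cannot overflow for very negative S.
    return scaled_s * math.exp(scaled_s) / math.expm1(scaled_s)


def substitution_rate(mutation_rate, fitness_from, fitness_to):
    """Mutation rate times the relative fixation rate of the proposed change."""
    return mutation_rate * fixation_factor(fitness_to - fitness_from)


if __name__ == "__main__":
    # Hypothetical scaled fitnesses for two amino acids at one site.
    forward = substitution_rate(1.0, 0.0, 2.0)
    backward = substitution_rate(1.0, 2.0, 0.0)
    print(f"forward = {forward:.3f}, backward = {backward:.3f}")
```

With equal mutation rates in both directions, the ratio of forward to backward rates equals exp(S), so the stationary frequencies of the states reflect their scaled fitnesses.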
14
Dandage R, Pandey R, Jayaraj G, Rai M, Berger D, Chakraborty K. Differential strengths of molecular determinants guide environment specific mutational fates. PLoS Genet 2018; 14:e1007419. [PMID: 29813059 PMCID: PMC5993328 DOI: 10.1371/journal.pgen.1007419] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 01/11/2018] [Revised: 06/08/2018] [Accepted: 05/16/2018] [Indexed: 01/14/2023]
Abstract
Organisms maintain competitive fitness in the face of environmental challenges through molecular evolution. However, it remains largely unknown how different biophysical factors constrain molecular evolution in a given environment. Here, using deep mutational scanning, we quantified empirical fitness of >2000 single site mutants of the Gentamicin-resistant gene (GmR) in Escherichia coli, in a representative set of physical (non-native temperatures) and chemical (small molecule supplements) environments. From this, we could infer how different biophysical parameters of the mutations constrain molecular function in different environments. We find ligand binding and protein stability to be the best predictors of mutants' fitness, but their relative predictive power differs across environments. While protein folding emerges as the strongest predictor at minimal antibiotic concentration, ligand binding becomes a stronger predictor of mutant fitness at higher concentration. Remarkably, strengths of environment-specific selection pressures were largely predictable from the degree of mutational perturbation of protein folding and ligand binding. By identifying structural constraints that act as determinants of fitness, our study thus provides coarse mechanistic insights into the environment specific accessibility of mutational fates.
Collapse
Affiliation(s)
- Rohan Dandage
- CSIR- Institute of Genomics and Integrative Biology, New Delhi, India
- Academy of Scientific and Innovative Research (AcSIR), New Delhi, India
| | - Rajesh Pandey
- CSIR Ayurgenomics Unit—TRISUTRA, CSIR- Institute of Genomics and Integrative Biology, New Delhi, India
| | - Gopal Jayaraj
- CSIR- Institute of Genomics and Integrative Biology, New Delhi, India
- Academy of Scientific and Innovative Research (AcSIR), New Delhi, India
| | - Manish Rai
- CSIR- Institute of Genomics and Integrative Biology, New Delhi, India
- Academy of Scientific and Innovative Research (AcSIR), New Delhi, India
| | - David Berger
- Department of Ecology and Genetics, Animal Ecology, Evolutionary Biology Centre at Uppsala University, Uppsala, Sweden
| | - Kausik Chakraborty
- CSIR- Institute of Genomics and Integrative Biology, New Delhi, India
- Academy of Scientific and Innovative Research (AcSIR), New Delhi, India
|
15
|
Platt A, Weber CC, Liberles DA. Protein evolution depends on multiple distinct population size parameters. BMC Evol Biol 2018; 18:17. [PMID: 29422024 PMCID: PMC5806465 DOI: 10.1186/s12862-017-1085-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/25/2017] [Accepted: 11/20/2017] [Indexed: 01/08/2023] Open
Abstract
That population size affects the fate of new mutations arising in genomes, modulating both how frequently they arise and how efficiently natural selection is able to filter them, is well established. It is therefore clear that these distinct roles for population size that characterize different processes should affect the evolution of proteins and need to be carefully defined. Empirical evidence is consistent with a role for demography in influencing protein evolution, supporting the idea that functional constraints alone do not determine the composition of coding sequences. Given that the relationship between population size, mutant fitness and fixation probability has been well characterized, estimating fitness from observed substitutions is well within reach with well-formulated models. Molecular evolution research has, therefore, increasingly begun to leverage concepts from population genetics to quantify the selective effects associated with different classes of mutation. However, in order for this type of analysis to provide meaningful information about the intra- and inter-specific evolution of coding sequences, a clear definition of concepts of population size, what they influence, and how they are best parameterized is essential. Here, we present an overview of the many distinct concepts that “population size” and “effective population size” may refer to, what they represent for studying proteins, and how this knowledge can be harnessed to produce better specified models of protein evolution.
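As a rough illustration of why these distinct roles of population size matter, the sketch below uses Kimura's diffusion approximation for the fixation probability of a new mutation in a diploid population. The choice of this formula, the function names, and the parameter values are assumptions made for illustration and are not taken from the paper. In the sketch, census size N sets how many new mutations arise per generation (2Nμ), while effective size Ne sets how efficiently selection filters them.

```python
import math


def fixation_probability(s, n_census, n_effective):
    """Kimura's diffusion approximation for a new mutation (assumed model).

    s           -- selection coefficient of the mutant (hypothetical parameter)
    n_census    -- census size N; the mutant starts at frequency 1/(2N)
    n_effective -- effective size Ne; controls the strength of drift
    """
    p0 = 1.0 / (2.0 * n_census)
    x = 4.0 * n_effective * s
    if abs(x) < 1e-9:
        # Effectively neutral: fixation probability equals initial frequency.
        return p0
    if x < -700.0:
        # Strongly deleterious: the exact value underflows to zero.
        return 0.0
    # (1 - exp(-x * p0)) / (1 - exp(-x)), written with expm1 for accuracy.
    return math.expm1(-x * p0) / math.expm1(-x)


def substitution_rate(mu, s, n_census, n_effective):
    """Mutations arising per generation (2 * N * mu) times fixation probability."""
    return 2.0 * n_census * mu * fixation_probability(s, n_census, n_effective)


if __name__ == "__main__":
    # Same census size, different effective sizes (illustrative values only).
    for ne in (100, 1000):
        print(ne, fixation_probability(0.01, 1000, ne))
```

Under these assumptions, lowering Ne while holding N fixed weakens selection without changing the mutational input, which is one way the separate population size concepts can be kept apart in a model.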
Affiliation(s)
- Alexander Platt
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, 19121, USA
| | - Claudia C Weber
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, 19121, USA
| | - David A Liberles
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, 19121, USA.
|
16
|
Chi PB, Kim D, Lai JK, Bykova N, Weber CC, Kubelka J, Liberles DA. A new parameter-rich structure-aware mechanistic model for amino acid substitution during evolution. Proteins 2017; 86:218-228. [PMID: 29178386 DOI: 10.1002/prot.25429] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 08/25/2017] [Revised: 11/14/2017] [Accepted: 11/22/2017] [Indexed: 02/06/2023]
Abstract
Improvements in the description of amino acid substitution are required to develop better pseudo-energy-based protein structure-aware models for use in phylogenetic studies. These models are used to characterize the probabilities of amino acid substitution and enable better simulation of protein sequences over a phylogeny. A better characterization of amino acid substitution probabilities in turn enables numerous downstream applications, like detecting positive selection, ancestral sequence reconstruction, and evolutionarily-motivated protein engineering. Many existing Markov models for amino acid substitution in molecular evolution disregard molecular structure and describe the amino acid substitution process over longer evolutionary periods poorly. Here, we present a new model upgraded with a site-specific parameterization of pseudo-energy terms in a coarse-grained force field, which describes local heterogeneity in physical constraints on amino acid substitution better than a previous pseudo-energy-based model with minimum cost in runtime. The importance of each weight term parameterization in characterizing underlying features of the site, including contact number, solvent accessibility, and secondary structural elements, was evaluated, returning both expected and biologically reasonable relationships between model parameters. This results in the acceptance of proposed amino acid substitutions that more closely resemble the site-specific frequencies observed in gene family alignments. The modular site-specific pseudo-energy function is made available for download through the following website: https://liberles.cst.temple.edu/Software/CASS/index.html.
Affiliation(s)
- Peter B Chi
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, Pennsylvania, 19122.,Department of Mathematics and Computer Science, Ursinus College, Collegeville, Pennsylvania, 19426
| | - Dohyup Kim
- Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, 82071
| | - Jason K Lai
- Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, 82071
| | - Nadia Bykova
- Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, 82071.,Faculty of Bioengineering and Bioinformatics, Moscow State University, Moscow, 119234, Russia
| | - Claudia C Weber
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, Pennsylvania, 19122
| | - Jan Kubelka
- Department of Chemistry, University of Wyoming, Laramie, Wyoming, 82071
| | - David A Liberles
- Department of Biology and Center for Computational Genetics and Genomics, Temple University, Philadelphia, Pennsylvania, 19122.,Department of Molecular Biology, University of Wyoming, Laramie, Wyoming, 82071
|
17
|
Jones CT, Youssef N, Susko E, Bielawski JP. Shifting Balance on a Static Mutation-Selection Landscape: A Novel Scenario of Positive Selection. Mol Biol Evol 2017; 34:391-407. [PMID: 28110273 DOI: 10.1093/molbev/msw237] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 01/26/2023] Open
Abstract
A version of the mechanistic mutation-selection (MutSel) model that accounts for temporal dynamics at a site is presented. This is used to show that the rate ratio dN/dS at a site can be transiently >1 even when fitness coefficients are fixed or the fitness landscape is static. This occurs whenever a site drifts away from its fitness peak and is then forced back by selection, a process reminiscent of shifting balance. Shifting balance is strongest when the substitution process is not dominated by selection or drift, but admits interplay between the two. Under this condition, site-specific changes in dN/dS were inferred in 78-100% of trials, and positive selection (i.e., dN/dS>1) in 10-40% of trials, when sequence alignments generated under MutSel were fitted to two popular phenomenological branch-site models. These results demonstrate that positive selection can occur without a change in fitness regime, and that this is detectable by branch-site models. In addition, MutSel is used to show that a site can be occupied by a sub-optimal amino acid for long periods on a fixed landscape when selection is stringent. This has implications for the interpretation of constant-but-different site patterns typically attributed to changes in fitness. Furthermore, a version of MutSel with episodic changes in fitness coefficients is used to illustrate systematic differences between parameters used to generate data under MutSel and their counterparts estimated by a simple codon model. Motivated by a discrepancy in the literature, interpretation of dN/dS in the context of MutSel is also discussed.
Affiliation(s)
- Christopher T Jones
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia
| | - Noor Youssef
- Department of Biology, Dalhousie University, Halifax, Nova Scotia
| | - Edward Susko
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia.,Center for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia
| | - Joseph P Bielawski
- Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia.,Department of Biology, Dalhousie University, Halifax, Nova Scotia.,Center for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia
|
18
|
Dunning LT, Lundgren MR, Moreno-Villena JJ, Namaganda M, Edwards EJ, Nosil P, Osborne CP, Christin PA. Introgression and repeated co-option facilitated the recurrent emergence of C4 photosynthesis among close relatives. Evolution 2017; 71:1541-1555. [PMID: 28395112 PMCID: PMC5488178 DOI: 10.1111/evo.13250] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Received: 09/06/2016] [Accepted: 04/04/2017] [Indexed: 01/16/2023]
Abstract
The origins of novel traits are often studied using species trees and modeling phenotypes as different states of the same character, an approach that cannot always distinguish multiple origins from fewer origins followed by reversals. We address this issue by studying the origins of C4 photosynthesis, an adaptation to warm and dry conditions, in the grass Alloteropsis. We dissect the C4 trait into its components, and show two independent origins of the C4 phenotype via different anatomical modifications, and the use of distinct sets of genes. Further, inference of enzyme adaptation suggests that one of the two groups encompasses two transitions to a full C4 state from a common ancestor with an intermediate phenotype that had some C4 anatomical and biochemical components. Molecular dating of C4 genes confirms the introgression of two key C4 components between species, while the inheritance of all others matches the species tree. The number of origins consequently varies among C4 components, a scenario that could not have been inferred from analyses of the species tree alone. Our results highlight the power of studying individual components of complex traits to reconstruct trajectories toward novel adaptations.
Affiliation(s)
- Luke T Dunning
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, S10 2TN, United Kingdom
| | - Marjorie R Lundgren
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, S10 2TN, United Kingdom
| | - Jose J Moreno-Villena
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, S10 2TN, United Kingdom
| | | | - Erika J Edwards
- Department of Ecology and Evolutionary Biology, Brown University, Providence, Rhode Island, 02912
| | - Patrik Nosil
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, S10 2TN, United Kingdom
| | - Colin P Osborne
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, S10 2TN, United Kingdom
| | - Pascal-Antoine Christin
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, S10 2TN, United Kingdom
|
19
|
Stark TL, Liberles DA, Holland BR, O'Reilly MM. Analysis of a mechanistic Markov model for gene duplicates evolving under subfunctionalization. BMC Evol Biol 2017; 17:38. [PMID: 28143390 PMCID: PMC5282866 DOI: 10.1186/s12862-016-0848-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 08/02/2016] [Accepted: 12/08/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Gene duplication has been identified as a key process driving functional change in many genomes. Several biological models exist for the evolution of a pair of duplicates after a duplication event, and it is believed that gene duplicates can evolve in different ways, according to one process, or a mix of processes. Subfunctionalization is one such process, under which the two duplicates can be preserved by dividing up the function of the original gene between them. Analysis of genomic data using subfunctionalization and related processes has thus far been relatively coarse-grained, with mathematical treatments usually focusing on the phenomenological features of gene duplicate evolution.
RESULTS: Here, we develop and analyze a mathematical model using the mechanics of subfunctionalization and the assumption of Poisson rates of mutation. By making use of the results from the literature on the Phase-Type distribution, we are able to derive exact analytical results for the model. The main advantage of the mechanistic model is that it leads to testable predictions of the phenomenological behavior (instead of building this behavior into the model a priori), and allows for the estimation of biologically meaningful parameters. We fit the survival function implied by this model to real genome data (Homo sapiens, Mus musculus, Rattus norvegicus and Canis familiaris), and compare the fit against commonly used phenomenological survival functions. We estimate the number of regulatory regions, and rates of mutation (relative to silent site mutation) in the coding and regulatory regions. We find that for the four genomes tested the subfunctionalization model predicts that duplicates most likely have just a few regulatory regions, and the rate of mutation in the coding region is around 5-10 times greater than the rate in the regulatory regions. This is the first model-based estimate of the number of regulatory regions in duplicates.
CONCLUSIONS: Strong agreement between empirical results and the predictions of our model suggests that subfunctionalization provides a consistent explanation for the evolution of many gene duplicates.
Affiliation(s)
- Tristan L Stark
- School of Physical Sciences, University of Tasmania, Churchill Ave, Hobart, 7001, Australia.
| | - David A Liberles
- Center for Computational Genetics and Genomics and Department of Biology, Temple University, Philadelphia, 19122, USA
| | - Barbara R Holland
- School of Physical Sciences, University of Tasmania, Churchill Ave, Hobart, 7001, Australia
| | - Małgorzata M O'Reilly
- School of Physical Sciences, University of Tasmania, Churchill Ave, Hobart, 7001, Australia
|
20
|
Spielman SJ, Wan S, Wilke CO. A Comparison of One-Rate and Two-Rate Inference Frameworks for Site-Specific dN/dS Estimation. Genetics 2016; 204:499-511. [PMID: 27535929 PMCID: PMC5068842 DOI: 10.1534/genetics.115.185264] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 12/01/2015] [Accepted: 08/11/2016] [Indexed: 11/18/2022] Open
Abstract
Two broad paradigms exist for inferring dN/dS, the ratio of nonsynonymous to synonymous substitution rates, from coding sequences: (i) a one-rate approach, where dN/dS is represented with a single parameter, or (ii) a two-rate approach, where dN and dS are estimated separately. The performances of these two approaches have been well studied in the specific context of proper model specification, i.e., when the inference model matches the simulation model. By contrast, the relative performances of one-rate vs. two-rate parameterizations when applied to data generated according to a different mechanism remain unclear. Here, we compare the relative merits of one-rate and two-rate approaches in the specific context of model misspecification by simulating alignments with mutation-selection models rather than with dN/dS-based models. We find that one-rate frameworks generally infer more accurate dN/dS point estimates, even when dS varies among sites. In other words, modeling dS variation may substantially reduce accuracy of dN/dS point estimates. These results appear to depend on the selective constraint operating at a given site. For sites under strong purifying selection ([Formula: see text]), one-rate and two-rate models show comparable performances. However, one-rate models significantly outperform two-rate models for sites under moderate-to-weak purifying selection. We attribute this distinction to the fact that, for these more quickly evolving sites, a given substitution is more likely to be nonsynonymous than synonymous. The data will therefore be relatively enriched for nonsynonymous changes, and modeling dS contributes excessive noise to dN/dS estimates. We additionally find that high levels of divergence among sequences, rather than the number of sequences in the alignment, are more critical for obtaining precise point estimates.
Affiliation(s)
- Stephanie J Spielman
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas, Austin, Texas 78712
| | - Suyang Wan
- School of Physics and Astronomy, The University of Minnesota, Minneapolis, Minnesota 55455
| | - Claus O Wilke
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas, Austin, Texas 78712
|
21
|
Auffray C, Balling R, Barroso I, Bencze L, Benson M, Bergeron J, Bernal-Delgado E, Blomberg N, Bock C, Conesa A, Del Signore S, Delogne C, Devilee P, Di Meglio A, Eijkemans M, Flicek P, Graf N, Grimm V, Guchelaar HJ, Guo YK, Gut IG, Hanbury A, Hanif S, Hilgers RD, Honrado Á, Hose DR, Houwing-Duistermaat J, Hubbard T, Janacek SH, Karanikas H, Kievits T, Kohler M, Kremer A, Lanfear J, Lengauer T, Maes E, Meert T, Müller W, Nickel D, Oledzki P, Pedersen B, Petkovic M, Pliakos K, Rattray M, I Màs JR, Schneider R, Sengstag T, Serra-Picamal X, Spek W, Vaas LAI, van Batenburg O, Vandelaer M, Varnai P, Villoslada P, Vizcaíno JA, Wubbe JPM, Zanetti G. Making sense of big data in health research: Towards an EU action plan. Genome Med 2016; 8:71. [PMID: 27338147 PMCID: PMC4919856 DOI: 10.1186/s13073-016-0323-y] [Citation(s) in RCA: 129] [Impact Index Per Article: 14.3] [Indexed: 01/09/2023] Open
Abstract
Medicine and healthcare are undergoing profound changes. Whole-genome sequencing and high-resolution imaging technologies are key drivers of this rapid and crucial transformation. Technological innovation combined with automation and miniaturization has triggered an explosion in data production that will soon reach exabyte proportions. How are we going to deal with this exponential increase in data production? The potential of "big data" for improving health is enormous but, at the same time, we face a wide range of challenges to overcome urgently. Europe is very proud of its cultural diversity; however, exploitation of the data made available through advances in genomic medicine, imaging, and a wide range of mobile health applications or connected devices is hampered by numerous historical, technical, legal, and political barriers. European health systems and databases are diverse and fragmented. There is a lack of harmonization of data formats, processing, analysis, and data transfer, which leads to incompatibilities and lost opportunities. Legal frameworks for data sharing are evolving. Clinicians, researchers, and citizens need improved methods, tools, and training to generate, analyze, and query data effectively. Addressing these barriers will contribute to creating the European Single Market for health, which will improve health and healthcare for all Europeans.
Affiliation(s)
- Charles Auffray
- European Institute for Systems Biology and Medicine, 1 avenue Claude Vellefaux, 75010, Paris, France.
- CIRI-UMR5308, CNRS-ENS-INSERM-UCBL, Université de Lyon, 50 avenue Tony Garnier, 69007, Lyon, France.
| | - Rudi Balling
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 7 Avenue des Hauts Fourneaux, 4362, Esch-sur-Alzette, Luxembourg.
| | - Inês Barroso
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
| | - László Bencze
- Health Services Management Training Centre, Faculty of Health and Public Services, Semmelweis University, Kútvölgyi út 2, 1125, Budapest, Hungary
| | - Mikael Benson
- Centre for Personalised Medicine, Linköping University, 581 85, Linköping, Sweden
| | - Jay Bergeron
- Translational & Bioinformatics, Pfizer Inc., 300 Technology Square, Cambridge, MA, 02139, USA
| | - Enrique Bernal-Delgado
- Institute for Health Sciences, IACS - IIS Aragon, San Juan Bosco 13, 50009, Zaragoza, Spain
| | - Niklas Blomberg
- ELIXIR, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Christoph Bock
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, AKH BT25.2, 1090, Vienna, Austria
- Department of Laboratory Medicine, Medical University of Vienna, Lazarettgasse 14, AKH BT25.2, 1090, Vienna, Austria
- Max Planck Institute for Informatics, Campus E1 4, 66123, Saarbrücken, Germany
| | - Ana Conesa
- Príncipe Felipe Research Center, C/ Eduardo Primo Yúfera 3, 46012, Valencia, Spain
- University of Florida, Institute of Food and Agricultural Sciences (IFAS), 2033 Mowry Road, Gainesville, FL, 32610, USA
| | | | - Christophe Delogne
- Technology, Data & Analytics, KPMG Luxembourg, Société Coopérative, 39 Avenue John F. Kennedy, 1855, Luxembourg, Luxembourg
| | - Peter Devilee
- Department of Human Genetics, Department of Pathology, Leiden University Medical Centre, Einthovenweg 20, 2333 ZC, Leiden, The Netherlands
| | - Alberto Di Meglio
- Information Technology Department, European Organization for Nuclear Research (CERN), 385 Route de Meyrin, 1211, Geneva 23, Switzerland
| | - Marinus Eijkemans
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Heidelberglaan 100, 3508 GA, Utrecht, The Netherlands
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Norbert Graf
- Department of Pediatric Oncology/Hematology, Saarland University, Campus Homburg, Building 9, 66421, Homburg, Germany
| | - Vera Grimm
- Project Management Jülich, Forschungszentrum Jülich GmbH, Wilhelm-Johnen-Straße, 52428, Jülich, Germany
| | - Henk-Jan Guchelaar
- Department of Clinical Pharmacy & Toxicology, Leiden University Medical Center, Albinusdreef 2, 2333 ZA, Leiden, The Netherlands
| | - Yi-Ke Guo
- Data Science Institute, Imperial College London, South Kensington, London, SW7 2AZ, UK
| | - Ivo Glynne Gut
- CNAG-CRG, Center for Genomic Regulation, Barcelona Institute for Science and Technology (BIST), C/Baldiri Reixac 4, 08029, Barcelona, Spain
| | - Allan Hanbury
- Institute of Software Technology and Interactive Systems, TU Wien, Favoritenstrasse 9-11/188, 1040, Vienna, Austria
| | - Shahid Hanif
- The Association of the British Pharmaceutical Industry, 7th Floor, Southside, 105 Victoria Street, London, SW1E 6QT, UK
| | - Ralf-Dieter Hilgers
- Department of Medical Statistics, RWTH-Aachen University, Universitätsklinikum Aachen, Pauwelsstraße 30, 52074, Aachen, Germany
| | - Ángel Honrado
- SYNAPSE Research Management Partners, Diputació 237, Àtic 3ª, 08007, Barcelona, Spain
| | - D Rod Hose
- Department of Infection, Immunity and Cardiovascular Disease and Insigneo Institute for In-Silico Medicine, Medical School, University of Sheffield, Beech Hill Road, Sheffield, S10 2RX, UK
| | | | - Tim Hubbard
- Department of Medical & Molecular Genetics, King's College London, London, SE1 9RT, UK
- Genomics England, London, EC1M 6BQ, UK
| | - Sophie Helen Janacek
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Haralampos Karanikas
- National and Kapodistrian University of Athens, Medical School, Xristou Lada 6, 10561, Athens, Greece
| | - Tim Kievits
- Vitromics Healthcare Holding B.V., Onderwijsboulevard 225, 5223 DE, 's-Hertogenbosch, The Netherlands
| | - Manfred Kohler
- Fraunhofer Institute for Molecular Biology and Applied Ecology ScreeningPort, Schnackenburgallee 114, 22525, Hamburg, Germany
| | - Andreas Kremer
- ITTM S.A., 9 avenue des Hauts Fourneaux, 4362, Esch-sur-Alzette, Luxembourg
| | - Jerry Lanfear
- Research Business Technology, Pfizer Ltd, GP4 Building, Granta Park, Cambridge, CB21 6GP, UK
| | - Thomas Lengauer
- Max Planck Institute for Informatics, Campus E1 4, 66123, Saarbrücken, Germany
| | - Edith Maes
- Health Economics & Outcomes Research, Deloitte Belgium, Berkenlaan 8A, 1831, Diegem, Belgium
| | - Theo Meert
- Janssen Pharmaceutica N.V., R&D G3O, Turnhoutseweg 30, 2340, Beerse, Belgium
| | - Werner Müller
- Faculty of Life Sciences, University of Manchester, AV Hill Building, Oxford Road, Manchester, M13 9PT, UK
| | - Dörthe Nickel
- UMR3664 IC/CNRS, Institut Curie, Section Recherche, Pavillon Pasteur, 26 rue d'Ulm, 75248, Paris cedex 05, France
| | - Peter Oledzki
- Linguamatics Ltd, 324 Cambridge Science Park Milton Rd, Cambridge, CB4 0WG, UK
| | - Bertrand Pedersen
- PwC Luxembourg, 2 rue Gerhard Mercator, 2182, Luxembourg, Luxembourg
| | - Milan Petkovic
- Philips, HighTechCampus 36, 5656AE, Eindhoven, The Netherlands
| | - Konstantinos Pliakos
- Department of Public Health and Primary Care, KU Leuven Kulak, Etienne Sabbelaan 53, 8500, Kortrijk, Belgium
| | - Magnus Rattray
- Faculty of Life Sciences, University of Manchester, AV Hill Building, Oxford Road, Manchester, M13 9PT, UK
| | - Josep Redón I Màs
- INCLIVA Health Research Institute, University of Valencia, CIBERobn ISCIII, Avenida Menéndez Pelayo 4 accesorio, 46010, Valencia, Spain
| | - Reinhard Schneider
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, 7 Avenue des Hauts Fourneaux, 4362, Esch-sur-Alzette, Luxembourg
| | - Thierry Sengstag
- Swiss Institute of Bioinformatics (SIB) and University of Basel, Klingelbergstrasse 50/70, 4056, Basel, Switzerland
| | - Xavier Serra-Picamal
- Agency for Health Quality and Assessment of Catalonia (AQuAS), Carrer de Roc Boronat 81-95, 08005, Barcelona, Spain
| | - Wouter Spek
- EuroBioForum Foundation, Chrysantstraat 10, 3135 HG, Vlaardingen, The Netherlands
| | - Lea A I Vaas
- Fraunhofer Institute for Molecular Biology and Applied Ecology ScreeningPort, Schnackenburgallee 114, 22525, Hamburg, Germany
| | - Okker van Batenburg
- EuroBioForum Foundation, Chrysantstraat 10, 3135 HG, Vlaardingen, The Netherlands
| | - Marc Vandelaer
- Integrated BioBank of Luxembourg, 6 rue Nicolas-Ernest Barblé, 1210, Luxembourg, Luxembourg
| | - Peter Varnai
- Technopolis Group, 3 Pavilion Buildings, Brighton, BN1 1EE, UK
| | - Pablo Villoslada
- Hospital Clinic of Barcelona, Institute d'Investigacions Biomediques August Pi Sunyer (IDIBAPS), Rosello 149, 08036, Barcelona, Spain
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - John Peter Mary Wubbe
- European Platform for Patients' Organisations, Science and Industry (Epposi), De Meeûs Square 38-40, 1000, Brussels, Belgium
| | - Gianluigi Zanetti
- CRS4, Ed.1 POLARIS, 09129, Pula, Italy
- BBMRI-ERIC, Neue Stiftingtalstrasse 2/B/6, 8010, Graz, Austria
|
22
|
Echave J, Spielman SJ, Wilke CO. Causes of evolutionary rate variation among protein sites. Nat Rev Genet 2016; 17:109-21. [PMID: 26781812 DOI: 10.1038/nrg.2015.18] [Citation(s) in RCA: 180] [Impact Index Per Article: 20.0] [Indexed: 12/13/2022]
Abstract
It has long been recognized that certain sites within a protein, such as sites in the protein core or catalytic residues in enzymes, are evolutionarily more conserved than other sites. However, our understanding of rate variation among sites remains surprisingly limited. Recent progress to address this includes the development of a wide array of reliable methods to estimate site-specific substitution rates from sequence alignments. In addition, several molecular traits have been identified that correlate with site-specific mutation rates, and novel mechanistic biophysical models have been proposed to explain the observed correlations. Nonetheless, current models explain, at best, approximately 60% of the observed variance, highlighting the limitations of current methods and models and the need for new research directions.
Affiliation(s)
- Julian Echave
- Escuela de Ciencia y Tecnología, Universidad Nacional de San Martín, 1650 San Martín, Buenos Aires, Argentina
| | - Stephanie J Spielman
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas 78712, USA
| | - Claus O Wilke
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, Texas 78712, USA
|
23
|
Teufel AI, Masel J, Liberles DA. What Fraction of Duplicates Observed in Recently Sequenced Genomes Is Segregating and Destined to Fail to Fix? Genome Biol Evol 2015. [PMID: 26220936 PMCID: PMC4558857 DOI: 10.1093/gbe/evv139] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/14/2023] Open
Abstract
Most sequenced eukaryotic genomes show a large excess of recent duplicates. As duplicates age, both the population genetic process of failed fixation and the mutation-driven process of nonfunctionalization act to reduce the observed number of duplicates. Understanding the processes generating the age distributions of recent duplicates is important to also understand the role of duplicate genes in the functional divergence of genomes. To date, mechanistic models for duplicate gene retention only account for the mutation-driven nonfunctionalization process. Here, a neutral model for the distribution of synonymous substitutions in duplicated genes which are segregating and expected to never fix in a population is introduced. This model enables differentiation of neutral loss due to failed fixation from loss due to mutation-driven nonfunctionalization. The model has been validated on simulated data and subsequent analysis with the model on genomic data from human and mouse shows that conclusions about the underlying mechanisms for duplicate gene retention can be sensitive to consideration of population genetic processes.
Affiliation(s)
- Ashley I Teufel
- Department of Molecular Biology, University of Wyoming Center for Computational Genetics and Genomics and Department of Biology, Temple University
| | - Joanna Masel
- Department of Ecology and Evolutionary Biology, University of Arizona
| | - David A Liberles
- Department of Molecular Biology, University of Wyoming Center for Computational Genetics and Genomics and Department of Biology, Temple University
|
24
|
Abstract
Numerous computational methods exist to assess the mode and strength of natural selection in protein-coding sequences, yet how distinct methods relate to one another remains largely unknown. Here, we elucidate the relationship between two widely used phylogenetic modeling frameworks: dN/dS models and mutation-selection (MutSel) models. We derive a mathematical relationship between dN/dS and scaled selection coefficients, the focal parameters of MutSel models, and use this relationship to gain deeper insight into the behaviors, limitations, and applicabilities of these two modeling frameworks. We prove that, if all synonymous changes are neutral, standard MutSel models correspond to dN/dS ≤ 1. However, if synonymous codons differ in fitness, dN/dS can take on arbitrarily high values even if all selection is purifying. Thus, the MutSel modeling framework cannot necessarily accommodate positive, diversifying selection, while dN/dS cannot distinguish between purifying selection on synonymous codons and positive selection on amino acids. We further propose a new benchmarking strategy of dN/dS inferences against MutSel simulations and demonstrate that the widely used Goldman-Yang-style dN/dS models yield substantially biased dN/dS estimates on realistic sequence data. In contrast, the less frequently used Muse-Gaut-style models display much less bias. Strikingly, the least-biased and most precise dN/dS estimates are never found in the models with the best fit to the data, measured through both AIC and BIC scores. Thus, selecting models based on goodness-of-fit criteria can yield poor parameter estimates if the models considered do not precisely correspond to the underlying mechanism that generated the data. In conclusion, establishing mathematical links among modeling frameworks represents a novel, powerful strategy to pinpoint previously unrecognized model limitations and strengths.
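The bound dN/dS ≤ 1 can be illustrated with a deliberately simplified sketch. It assumes an amino-acid-level model with symmetric, uniform mutation rates, stationary frequencies proportional to exp(F_i) for scaled fitnesses F_i, and a relative fixation rate S / (1 - exp(-S)) for a scaled selection coefficient S = F_j - F_i. These simplifications and all function names are assumptions made for illustration; the paper itself works with codon-level models, and this is not the authors' derivation.

```python
import math
import random


def relative_fixation_rate(s_ij):
    """Fixation rate relative to a neutral change for scaled coefficient s_ij."""
    if abs(s_ij) < 1e-9:
        return 1.0  # neutral limit
    if s_ij < -700.0:
        return 0.0  # strongly deleterious: value underflows to zero
    return s_ij / -math.expm1(-s_ij)


def site_dn_ds(fitness):
    """Stationary dN/dS at one site under the simplified model (assumed).

    fitness -- list of scaled fitnesses F_i, one per amino acid state
    """
    top = max(fitness)
    weights = [math.exp(f - top) for f in fitness]
    total = sum(weights)
    pi = [w / total for w in weights]  # stationary frequencies
    n = len(fitness)
    flux = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                flux += pi[i] * relative_fixation_rate(fitness[j] - fitness[i])
    # The neutral expectation is (n - 1) equally weighted changes per state.
    return flux / (n - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    random_fitness = [rng.gauss(0.0, 2.0) for _ in range(20)]
    print(site_dn_ds([0.0] * 20))     # all states equally fit
    print(site_dn_ds(random_fitness)) # heterogeneous fitnesses
```

In this toy setting a flat fitness landscape gives a ratio of one, and any heterogeneity in fitness pushes the stationary ratio below one, consistent with the statement in the abstract for neutral synonymous changes.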
Affiliation(s)
- Stephanie J Spielman
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin
- Claus O Wilke
- Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute of Cellular and Molecular Biology, The University of Texas at Austin

25
Dharia AP, Obla A, Gajdosik MD, Simon A, Nelson CE. Tempo and mode of gene duplication in mammalian ribosomal protein evolution. PLoS One 2014; 9:e111721. [PMID: 25369106 PMCID: PMC4219774 DOI: 10.1371/journal.pone.0111721] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 05/20/2014] [Accepted: 10/06/2014] [Indexed: 12/17/2022] Open
Abstract
Gene duplication has been widely recognized as a major driver of evolutionary change and organismal complexity through the generation of multi-gene families. Therefore, understanding the forces that govern the evolution of gene families through the retention or loss of duplicated genes is fundamentally important in our efforts to study genome evolution. Previous work from our lab has shown that ribosomal protein (RP) genes constitute one of the largest classes of conserved duplicated genes in mammals. This result was surprising due to the fact that ribosomal protein genes evolve slowly and transcript levels are very tightly regulated. In our present study, we identified and characterized all RP duplicates in eight mammalian genomes in order to investigate the tempo and mode of ribosomal protein family evolution. We show that a sizable number of duplicates are transcriptionally active and are very highly conserved. Furthermore, we conclude that existing gene duplication models do not readily account for the preservation of a very large number of intact retroduplicated ribosomal protein (RT-RP) genes observed in mammalian genomes. We suggest that selection against dominant-negative mutations may underlie the unexpected retention and conservation of duplicated RP genes, and may shape the fate of newly duplicated genes, regardless of duplication mechanism.
Affiliation(s)
- Asav P. Dharia
- University of Connecticut Department of Molecular and Cell Biology, Storrs, Connecticut, United States of America
- Ajay Obla
- University of Connecticut Department of Molecular and Cell Biology, Storrs, Connecticut, United States of America
- Matthew D. Gajdosik
- University of Connecticut Department of Molecular and Cell Biology, Storrs, Connecticut, United States of America
- Amanda Simon
- University of Connecticut Department of Molecular and Cell Biology, Storrs, Connecticut, United States of America
- Craig E. Nelson
- University of Connecticut Department of Molecular and Cell Biology, Storrs, Connecticut, United States of America

26
Chen HS, Hutter CM, Mechanic LE, Amos CI, Bafna V, Hauser ER, Hernandez RD, Li C, Liberles DA, McAllister K, Moore JH, Paltoo DN, Papanicolaou GJ, Peng B, Ritchie MD, Rosenfeld G, Witte JS, Gillanders EM, Feuer EJ. Genetic simulation tools for post-genome wide association studies of complex diseases. Genet Epidemiol 2014; 39:11-19. [PMID: 25371374 DOI: 10.1002/gepi.21870] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 06/15/2014] [Revised: 09/02/2014] [Accepted: 09/26/2014] [Indexed: 01/12/2023]
Abstract
Genetic simulation programs are used to model data under specified assumptions to facilitate the understanding and study of complex genetic systems. Standardized data sets generated using genetic simulation are essential for the development and application of novel analytical tools in genetic epidemiology studies. With continuing advances in high-throughput genomic technologies and generation and analysis of larger, more complex data sets, there is a need for updating current approaches in genetic simulation modeling. To provide a forum to address current and emerging challenges in this area, the National Cancer Institute (NCI) sponsored a workshop, entitled "Genetic Simulation Tools for Post-Genome Wide Association Studies of Complex Diseases" at the National Institutes of Health (NIH) in Bethesda, Maryland on March 11-12, 2014. The goals of the workshop were to (1) identify opportunities, challenges, and resource needs for the development and application of genetic simulation models; (2) improve the integration of tools for modeling and analysis of simulated data; and (3) foster collaborations to facilitate development and applications of genetic simulation. During the course of the meeting, the group identified challenges and opportunities for the science of simulation, software and methods development, and collaboration. This paper summarizes key discussions at the meeting, and highlights important challenges and opportunities to advance the field of genetic simulation.
Affiliation(s)
- Huann-Sheng Chen
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
- Carolyn M Hutter
- Division of Genomic Medicine, National Human Genome Research Institute, NIH, Bethesda, MD 20892
- Leah E Mechanic
- Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
- Christopher I Amos
- Division of Community, Family Medicine, Dartmouth College, Lebanon, NH 03755
- Vineet Bafna
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093
- Ryan D Hernandez
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94143
- Chun Li
- Department of Biostatistics, Vanderbilt University, Nashville, TN 37235
- David A Liberles
- Department of Molecular Biology, University of Wyoming, Laramie, WY 82071
- Kimberly McAllister
- Susceptibility and Population Health Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709
- Jason H Moore
- Department of Genetics, Dartmouth College, Lebanon, NH 03755
- Dina N Paltoo
- Office of Director, National Institutes of Health, Bethesda, MD 20892
- George J Papanicolaou
- Division of Cardiovascular Sciences, Prevention and Population Sciences Program, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD 20892
- Bo Peng
- Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, TX 77030
- Marylyn D Ritchie
- Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA 16802
- Gabriel Rosenfeld
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
- John S Witte
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA 94107
- Elizabeth M Gillanders
- Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892
- Eric J Feuer
- Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, NIH, Bethesda, MD 20892

27
On Mechanistic Modeling of Gene Content Evolution: Birth-Death Models and Mechanisms of Gene Birth and Gene Retention. Computation 2014. [DOI: 10.3390/computation2030112] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 01/01/2023]

28
Benguigui M, Arenas M. Spatial and temporal simulation of human evolution. Methods, frameworks and applications. Curr Genomics 2014; 15:245-55. [PMID: 25132795 PMCID: PMC4133948 DOI: 10.2174/1389202915666140506223639] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 03/14/2014] [Revised: 04/05/2014] [Accepted: 05/04/2014] [Indexed: 01/29/2023] Open
Abstract
Analyses of human evolution are fundamental to understand the current gradients of human diversity. In this concern, genetic samples collected from current populations together with archaeological data are the most important resources to study human evolution. However, they are often insufficient to properly evaluate a variety of evolutionary scenarios, leading to continuous debates and discussions. A commonly applied strategy consists of the use of computer simulations based on, as realistic as possible, evolutionary models, to evaluate alternative evolutionary scenarios through statistical correlations with the real data. Computer simulations can also be applied to estimate evolutionary parameters or to study the role of each parameter on the evolutionary process. Here we review the mainly used methods and evolutionary frameworks to perform realistic spatially explicit computer simulations of human evolution. Although we focus on human evolution, most of the methods and software we describe can also be used to study other species. We also describe the importance of considering spatially explicit models to better mimic human evolutionary scenarios based on a variety of phenomena such as range expansions, range shifts, range contractions, sex-biased dispersal, long-distance dispersal or admixtures of populations. We finally discuss future implementations to improve current spatially explicit simulations and their derived applications in human evolution.
Affiliation(s)
- Macarena Benguigui
- Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
- Miguel Arenas
- Centre for Molecular Biology "Severo Ochoa", Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain

29
Abstract
Words are built from smaller meaning bearing parts, called morphemes. As one word can contain multiple morphemes, one morpheme can be present in different words. The number of distinct words a morpheme can be found in is its family size. Here we used Birth-Death-Innovation Models (BDIMs) to analyze the distribution of morpheme family sizes in English and German vocabulary over the last 200 years. Rather than just fitting to a probability distribution, these mechanistic models allow for the direct interpretation of identified parameters. Despite the complexity of language change, we indeed found that a specific variant of this pure stochastic model, the second order linear balanced BDIM, significantly fitted the observed distributions. In this model, birth and death rates are increased for smaller morpheme families. This finding indicates an influence of morpheme family sizes on vocabulary changes. This could be an effect of word formation, perception or both. On a more general level, we give an example on how mechanistic models can enable the identification of statistical trends in language change usually hidden by cultural influences.