1
|
Yang Y, Xu T, Conant G, Kishino H, Thorne JL, Ji X. Interlocus Gene Conversion, Natural Selection, and Paralog Homogenization. Mol Biol Evol 2023; 40:msad198. [PMID: 37675606 PMCID: PMC10503786 DOI: 10.1093/molbev/msad198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 08/07/2023] [Accepted: 09/05/2023] [Indexed: 09/08/2023]
Abstract
Following a duplication, the resulting paralogs tend to diverge. While mutation and natural selection can accelerate this process, they can also slow it. Here, we quantify the paralog homogenization that is caused by point mutations and interlocus gene conversion (IGC). Among 164 duplicated teleost genes, the median percentage of postduplication codon substitutions that arise from IGC rather than point mutation is estimated to be between 7% and 8%. By differentiating between the nonsynonymous codon substitutions that homogenize the protein sequences of paralogs and the nonhomogenizing nonsynonymous substitutions, we estimate the homogenizing nonsynonymous rates to be higher for 163 of the 164 teleost data sets as well as for all 14 data sets of duplicated yeast ribosomal protein-coding genes that we consider. For all 14 yeast data sets, the estimated homogenizing nonsynonymous rates exceed the synonymous rates.
Affiliation(s)
- Yixuan Yang
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
- Tanchumin Xu
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
  - Department of Statistics, North Carolina State University, Raleigh, NC, USA
- Gavin Conant
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC, USA
- Hirohisa Kishino
  - AI/Data Science Social Implementation Laboratory, Chuo University, Tokyo, Japan
- Jeffrey L Thorne
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA
  - Department of Statistics, North Carolina State University, Raleigh, NC, USA
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC, USA
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, LA, USA
|
2
|
Abstract
Compensatory substitutions happen when one mutation is advantageously selected because it restores the loss of fitness induced by a previous deleterious mutation. How frequently such mutations occur in evolution, and what structural and functional context permits their emergence, remain open questions. We built an atlas of intra-protein compensatory substitutions using a phylogenetic approach and a dataset of 1,630 bacterial protein families for which high-quality sequence alignments and experimentally derived protein structures were available. We identified more than 51,000 positions coevolving by means of predicted compensatory mutations. Using the evolutionary and structural properties of the analyzed positions, we demonstrate that compensatory mutations are scarce (typically only a few in the protein history) but widespread (the majority of proteins experienced at least one). Typical coevolving residues are evolving slowly, are located in the protein core outside secondary structure motifs, and are more often in contact than expected by chance, even after accounting for their evolutionary rate and solvent exposure. An exception to this general scheme is residues coevolving for charge compensation, which are evolving faster than noncoevolving sites, in contradiction with predictions from simple coevolutionary models, but similar to stem pairs in RNA. While sites with a significant pattern of coevolution by compensatory mutations are rare, the comparative analysis of hundreds of structures ultimately permits a better understanding of the link between the three-dimensional structure of a protein and its fitness landscape.
Affiliation(s)
- Shilpi Chaurasia
  - RG Molecular Systems Evolution, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Straße 2, 24306 Plön, Germany
  - Excelra Knowledge Solutions Pvt Ltd, Hyderabad, India
- Julien Y Dutheil
  - RG Molecular Systems Evolution, Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, August-Thienemann-Straße 2, 24306 Plön, Germany
  - Institute of Evolution Sciences of Montpellier (ISEM), CNRS, University of Montpellier, IRD, EPHE, 34095 Montpellier, France
|
3
|
Guéguen L, Duret L. Unbiased Estimate of Synonymous and Nonsynonymous Substitution Rates with Nonstationary Base Composition. Mol Biol Evol 2017; 35:734-742. [PMID: 29220511 PMCID: PMC5850866 DOI: 10.1093/molbev/msx308] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Indexed: 12/29/2022]
Abstract
The measurement of synonymous and nonsynonymous substitution rates (dS and dN) is useful for assessing selection operating on protein sequences or for investigating mutational processes affecting genomes. In particular, the ratio dN/dS is expected to be a good proxy for ω, the ratio of fixation probabilities of nonsynonymous mutations relative to that of neutral mutations. Standard methods for estimating dN, dS, or ω rely on the assumption that the base composition of sequences is at the equilibrium of the evolutionary process. In many clades, this assumption of stationarity is in fact incorrect, and we show here through simulations and analyses of empirical data that nonstationarity biases the estimate of dN, dS, and ω. We show that the bias in the estimate of ω can be fixed by explicitly taking into consideration nonstationarity in the modeling of codon evolution, in a maximum likelihood framework. Moreover, we propose an exact method for estimating dN and dS on branches, based on stochastic mapping, that can take into account nonstationarity. This method can be directly applied to any kind of codon evolution model, as long as neutrality is clearly parameterized.
Affiliation(s)
- Laurent Guéguen
  - Laboratoire de Biologie et Biométrie Évolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1-Université de Lyon, Villeurbanne, France
- Laurent Duret
  - Laboratoire de Biologie et Biométrie Évolutive, CNRS UMR 5558, Université Claude Bernard Lyon 1-Université de Lyon, Villeurbanne, France
|
4
|
Ji X, Griffing A, Thorne JL. A Phylogenetic Approach Finds Abundant Interlocus Gene Conversion in Yeast. Mol Biol Evol 2016; 33:2469-76. [PMID: 27297467 DOI: 10.1093/molbev/msw114] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
Interlocus gene conversion (IGC) homogenizes repeats. While genomes can be repeat-rich, the evolutionary importance of IGC is poorly understood. Additional statistical tools for characterizing it are needed. We propose a composite likelihood strategy for incorporating IGC into widely used probabilistic models for sequence changes that originate with point mutation. We estimated the percentage of nucleotide substitutions that originate with an IGC event rather than a point mutation in 14 groups of yeast ribosomal protein-coding genes, and found values ranging from 20% to 38%. We designed and applied a procedure to determine whether these percentages are inflated due to artifacts arising from model misspecification. The results of this procedure are consistent with IGC having had an important role in the evolution of each of these 14 gene families. We further investigate the properties of our IGC approach via simulation. In contrast to usual practice, our findings suggest that IGC should and can be considered when multigene family evolution is investigated.
Affiliation(s)
- Xiang Ji
  - Bioinformatics Research Center, North Carolina State University
  - Department of Statistics, North Carolina State University
- Alexander Griffing
  - Bioinformatics Research Center, North Carolina State University
  - Department of Biological Sciences, North Carolina State University
- Jeffrey L Thorne
  - Bioinformatics Research Center, North Carolina State University
  - Department of Statistics, North Carolina State University
  - Department of Biological Sciences, North Carolina State University
|
5
|
Lee HJ, Rodrigue N, Thorne JL. Relaxing the Molecular Clock to Different Degrees for Different Substitution Types. Mol Biol Evol 2015; 32:1948-61. [PMID: 25931515 PMCID: PMC4833082 DOI: 10.1093/molbev/msv099] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 01/06/2023]
Abstract
Rates of molecular evolution can vary over time. Diverse statistical techniques for divergence time estimation have been developed to accommodate this variation. These typically require that all sequence (or codon) positions at a locus change independently of one another. They also generally assume that the rates of different types of nucleotide substitutions vary across a phylogeny in the same way. This permits divergence time estimation procedures to employ an instantaneous rate matrix with relative rates that do not differ among branches. However, previous studies have suggested that some substitution types (e.g., CpG to TpG changes in mammals) are more clock-like than others. As has been previously noted, this is biologically plausible given the mutational mechanism of CpG to TpG changes. Through stochastic mapping of sequence histories from context-independent substitution models, our approach allows for context-dependent nucleotide substitutions to change their relative rates over time. We apply our approach to the analysis of a 0.15 Mb intergenic region from eight primates. In accord with previous findings, we find comparatively little rate variation over time for CpG to TpG substitutions but we find more for other substitution types. We conclude by discussing the limitations and prospects of our approach.
Affiliation(s)
- Hui-Jie Lee
  - Department of Statistics, North Carolina State University
- Jeffrey L Thorne
  - Department of Statistics, North Carolina State University
  - Department of Biological Sciences, North Carolina State University
|
6
|
Evaluation of Ancestral Sequence Reconstruction Methods to Infer Nonstationary Patterns of Nucleotide Substitution. Genetics 2015; 200:873-90. [PMID: 25948563 DOI: 10.1534/genetics.115.177386] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 02/09/2015] [Accepted: 04/28/2015] [Indexed: 01/07/2023]
Abstract
Inference of gene sequences in ancestral species has been widely used to test hypotheses concerning the process of molecular sequence evolution. However, the approach may produce spurious results, mainly because using the single best reconstruction while ignoring the suboptimal ones creates systematic biases. Here we implement methods to correct for such biases and use computer simulation to evaluate their performance when the substitution process is nonstationary. The methods we evaluated include parsimony and likelihood using the single best reconstruction (SBR), averaging over reconstructions weighted by the posterior probabilities (AWP), and a new method called expected Markov counting (EMC) that produces maximum-likelihood estimates of substitution counts for any branch under a nonstationary Markov model. We simulated base composition evolution on a phylogeny for six species, with different selective pressures on G+C content among lineages, and compared the counts of nucleotide substitutions recorded during simulation with the inference by different methods. We found that large systematic biases resulted from (i) the use of parsimony or likelihood with SBR, (ii) the use of a stationary model when the substitution process is nonstationary, and (iii) the use of the Hasegawa-Kishino-Yano (HKY) model, which is too simple to adequately describe the substitution process. The nonstationary general time reversible (GTR) model, used with AWP or EMC, accurately recovered the substitution counts, even in cases of complex parameter fluctuations. We discuss model complexity and the compromise between bias and variance and suggest that the new methods may be useful for studying complex patterns of nucleotide substitution in large genomic data sets.
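The contrast between counting from the single best reconstruction (SBR) and averaging over reconstructions weighted by posterior probability (AWP) can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' implementation: the posterior probabilities are invented numbers, the function names `sbr_count` and `awp_count` are hypothetical, and each branch is assumed to carry at most one change, so the multiple hits that a Markov-model method such as EMC would account for are ignored.

```python
def sbr_count(parent_posterior, child_state):
    """Count a change only if the single best parent state differs from the child."""
    best_parent = max(parent_posterior, key=parent_posterior.get)
    return 0.0 if best_parent == child_state else 1.0


def awp_count(parent_posterior, child_state):
    """Expected number of changes, averaging over all candidate parent states."""
    return sum(p for state, p in parent_posterior.items() if state != child_state)


# Hypothetical posteriors for the parent node at three sites; the child is observed.
sites = [
    ({"G": 0.6, "A": 0.4}, "G"),
    ({"C": 0.7, "T": 0.3}, "C"),
    ({"A": 0.55, "G": 0.45}, "G"),
]

sbr_total = sum(sbr_count(post, child) for post, child in sites)
awp_total = sum(awp_count(post, child) for post, child in sites)
print(sbr_total, awp_total)
```

In this toy, SBR discards the probability mass sitting on the second-best states, so its total differs from the posterior-weighted total even though both use the same posteriors.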
|
7
|
Liu YY, Li S, Li F, Song L, Rehg JM. Efficient Learning of Continuous-Time Hidden Markov Models for Disease Progression. Advances in Neural Information Processing Systems 2015; 28:3599-3607. [PMID: 27019571 PMCID: PMC4804157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
The Continuous-Time Hidden Markov Model (CT-HMM) is an attractive approach to modeling disease progression due to its ability to describe noisy observations arriving irregularly in time. However, the lack of an efficient parameter learning algorithm for CT-HMM restricts its use to very small models or requires unrealistic constraints on the state transitions. In this paper, we present the first complete characterization of efficient EM-based learning methods for CT-HMM models. We demonstrate that the learning problem consists of two challenges: the estimation of posterior state probabilities and the computation of end-state conditioned statistics. We solve the first challenge by reformulating the estimation problem in terms of an equivalent discrete time-inhomogeneous hidden Markov model. The second challenge is addressed by adapting three approaches from the continuous time Markov chain literature to the CT-HMM domain. We demonstrate the use of CT-HMMs with more than 100 states to visualize and predict disease progression using a glaucoma dataset and an Alzheimer's disease dataset.
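The idea of treating a continuous-time hidden Markov model as a discrete, time-inhomogeneous one can be sketched in a few lines of Python: each gap between consecutive observations gets its own transition matrix exp(QΔt). The sketch below is only an illustration under assumptions, not the paper's EM learning method. It uses a two-state chain (for which exp(QΔt) has a closed form), and the rates, visit times, emission probabilities, and function names are all hypothetical.

```python
import math


def two_state_transition(rate_01, rate_10, dt):
    """Closed-form exp(Q * dt) for a two-state continuous-time Markov chain."""
    total = rate_01 + rate_10
    decay = math.exp(-total * dt)
    return [
        [(rate_10 + rate_01 * decay) / total, (rate_01 - rate_01 * decay) / total],
        [(rate_10 - rate_10 * decay) / total, (rate_01 + rate_10 * decay) / total],
    ]


def forward_likelihood(times, observations, initial, emission, rate_01, rate_10):
    """Forward algorithm with a separate transition matrix for each time gap."""
    alpha = [initial[s] * emission[s][observations[0]] for s in range(2)]
    for k in range(1, len(times)):
        trans = two_state_transition(rate_01, rate_10, times[k] - times[k - 1])
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(2)) * emission[s][observations[k]]
            for s in range(2)
        ]
    return sum(alpha)


# Hypothetical visit times (in years) and noisy binary test results.
times = [0.0, 0.5, 2.0, 2.25]
observations = [0, 0, 1, 1]
initial = [0.9, 0.1]
emission = [[0.8, 0.2], [0.3, 0.7]]  # emission[state][symbol]
likelihood = forward_likelihood(times, observations, initial, emission, 0.4, 0.1)
print(likelihood)
```

Because the transition matrix depends only on the elapsed time between visits, irregular sampling needs no special handling beyond recomputing it for every gap.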
Affiliation(s)
- Yu-Ying Liu
  - College of Computing, Georgia Institute of Technology, Atlanta, GA
- Shuang Li
  - College of Computing, Georgia Institute of Technology, Atlanta, GA
- Fuxin Li
  - College of Computing, Georgia Institute of Technology, Atlanta, GA
- Le Song
  - College of Computing, Georgia Institute of Technology, Atlanta, GA
- James M Rehg
  - College of Computing, Georgia Institute of Technology, Atlanta, GA
|
8
|
van Rosmalen J, Toy M, O'Mahony JF. A mathematical approach for evaluating Markov models in continuous time without discrete-event simulation. Med Decis Making 2013; 33:767-79. [PMID: 23715464 DOI: 10.1177/0272989x13487947] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 01/04/2023]
Abstract
Markov models are a simple and powerful tool for analyzing the health and economic effects of health care interventions. These models are usually evaluated in discrete time using cohort analysis. The use of discrete time assumes that changes in health states occur only at the end of a cycle period. Discrete-time Markov models only approximate the process of disease progression, as clinical events typically occur in continuous time. The approximation can yield biased cost-effectiveness estimates for Markov models with long cycle periods and if no half-cycle correction is made. The purpose of this article is to present an overview of methods for evaluating Markov models in continuous time. These methods use mathematical results from stochastic process theory and control theory. The methods are illustrated using an applied example on the cost-effectiveness of antiviral therapy for chronic hepatitis B. The main result is a mathematical solution for the expected time spent in each state in a continuous-time Markov model. It is shown how this solution can account for age-dependent transition rates and discounting of costs and health effects, and how the concept of tunnel states can be used to account for transition rates that depend on the time spent in a state. The applied example shows that the continuous-time model yields more accurate results than the discrete-time model but does not require much computation time and is easily implemented. In conclusion, continuous-time Markov models are a feasible alternative to cohort analysis and can offer several theoretical and practical advantages.
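As a rough illustration of computing the expected time spent in each state of a continuous-time Markov model without simulation, the Python sketch below integrates exp(Qt) over a time horizon using the standard augmented-matrix identity (the upper-right block of the exponential of [[Q, I], [0, 0]] scaled by the horizon). This is a substitute technique chosen for brevity and is not claimed to be the article's own formula. The three-state rate matrix (healthy, sick, dead) and the function names are hypothetical, and discounting and age-dependent rates are left out.

```python
import math


def mat_mul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_exp(a, terms=30):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x / 2.0 ** squarings for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def expected_time_in_states(q, start, horizon):
    """Return start * integral_0^horizon exp(Q t) dt, one entry per state."""
    n = len(q)
    aug = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            aug[i][j] = q[i][j] * horizon
        aug[i][n + i] = horizon
    block = [row[n:] for row in mat_exp(aug)[:n]]  # upper-right block holds the integral
    return [sum(start[i] * block[i][j] for i in range(n)) for j in range(n)]


# Hypothetical yearly transition rates: healthy -> sick, healthy -> dead, sick -> dead.
q = [[-0.15, 0.10, 0.05],
     [0.00, -0.20, 0.20],
     [0.00, 0.00, 0.00]]
times = expected_time_in_states(q, [1.0, 0.0, 0.0], 10.0)
print(times)
```

The entries sum to the horizon, and the first entry can be checked against the closed form (1 - exp(-0.15 * 10)) / 0.15 for the time spent in the starting state.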
Affiliation(s)
- Joost van Rosmalen
  - Department of Public Health, Erasmus MC, University Medical Center, Rotterdam, the Netherlands
  - Department of Biostatistics, Erasmus MC, University Medical Center, Rotterdam, the Netherlands
- Mehlika Toy
  - Department of Public Health, Erasmus MC, University Medical Center, Rotterdam, the Netherlands
  - Department of Global Health and Population, Harvard School of Public Health, Boston, Massachusetts
- James F O'Mahony
  - Department of Public Health, Erasmus MC, University Medical Center, Rotterdam, the Netherlands
  - Department of Health Policy and Management, Trinity College Dublin, Dublin, Ireland
|
9
|
Guéguen L, Gaillard S, Boussau B, Gouy M, Groussin M, Rochette NC, Bigot T, Fournier D, Pouyet F, Cahais V, Bernard A, Scornavacca C, Nabholz B, Haudry A, Dachary L, Galtier N, Belkhir K, Dutheil JY. Bio++: Efficient Extensible Libraries and Tools for Computational Molecular Evolution. Mol Biol Evol 2013; 30:1745-50. [DOI: 10.1093/molbev/mst097] [Citation(s) in RCA: 132] [Impact Index Per Article: 12.0] [Indexed: 01/07/2023]
|
10
|
Miyazawa S. Prediction of contact residue pairs based on co-substitution between sites in protein structures. PLoS One 2013; 8:e54252. [PMID: 23342110 PMCID: PMC3546969 DOI: 10.1371/journal.pone.0054252] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 08/28/2012] [Accepted: 12/10/2012] [Indexed: 11/18/2022]
Abstract
Residue-residue interactions that fold a protein into a unique three-dimensional structure and enable it to perform a specific function impose structural and functional constraints in varying degrees on each residue site. Selective constraints on residue sites are recorded in amino acid orders in homologous sequences and also in the evolutionary trace of amino acid substitutions. A challenge is to extract direct dependences between residue sites by removing phylogenetic correlations and indirect dependences through other residues within a protein or even through other molecules. Rapid growth of protein families with unknown folds requires an accurate de novo prediction method for protein structure. Recent attempts at disentangling direct from indirect dependences of amino acid types between residue positions in multiple sequence alignments have revealed that inferred residue-residue proximities can be sufficient information to predict a protein fold without the use of known three-dimensional structures. Here, we propose an alternative method of inferring coevolving site pairs from concurrent and compensatory substitutions between sites in each branch of a phylogenetic tree. Substitution probability and physico-chemical changes (volume, charge, hydrogen-bonding capability, and others) accompanied by substitutions at each site in each branch of a phylogenetic tree are estimated with the likelihood of each substitution, and their direct correlations between sites are used to detect concurrent and compensatory substitutions. In order to extract direct dependences between sites, partial correlation coefficients of the characteristic changes along branches between sites, in which linear multiple dependences on feature vectors at other sites are removed, are calculated and used to rank coevolving site pairs.
Accuracy of contact prediction based on the present coevolution score is comparable to that achieved by a maximum entropy model of protein sequences for 15 protein families taken from the Pfam release 26.0. Besides, this excellent accuracy indicates that compensatory substitutions are significant in protein evolution.
Affiliation(s)
- Sanzo Miyazawa
  - Graduate School of Engineering, Gunma University, Kiryu, Gunma, Japan
|
11
|
Romiguier J, Figuet E, Galtier N, Douzery EJP, Boussau B, Dutheil JY, Ranwez V. Fast and robust characterization of time-heterogeneous sequence evolutionary processes using substitution mapping. PLoS One 2012; 7:e33852. [PMID: 22479459 PMCID: PMC3313935 DOI: 10.1371/journal.pone.0033852] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Received: 12/16/2011] [Accepted: 02/22/2012] [Indexed: 12/22/2022]
Abstract
Genes and genomes do not evolve similarly in all branches of the tree of life. Detecting and characterizing the heterogeneity in time, and between lineages, of the nucleotide (or amino acid) substitution process is an important goal of current molecular evolutionary research. This task is typically achieved through the use of non-homogeneous models of sequence evolution, which, being highly parametrized and computationally demanding, are not appropriate for large-scale analyses. Here we investigate an alternative methodological option based on probabilistic substitution mapping. The idea is to first reconstruct the substitutional history of each site of an alignment under a homogeneous model of sequence evolution, then to characterize variations in the substitution process across lineages based on substitution counts. Using simulated and published datasets, we demonstrate that probabilistic substitution mapping is robust in that it typically provides accurate reconstruction of sequence ancestry even when the true process is heterogeneous but a homogeneous model is adopted. Consequently, we show that the new approach is essentially as efficient as, and far faster than (up to 25,000 times), existing methods, thus paving the way for a systematic survey of substitution process heterogeneity across genes and lineages.
Affiliation(s)
- Jonathan Romiguier
  - Institut des Sciences de l'Evolution de Montpellier, CNRS-Université Montpellier 2, Montpellier, France
|
12
|
Dutheil JY, Galtier N, Romiguier J, Douzery EJ, Ranwez V, Boussau B. Efficient Selection of Branch-Specific Models of Sequence Evolution. Mol Biol Evol 2012; 29:1861-74. [DOI: 10.1093/molbev/mss059] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Indexed: 11/13/2022]
|