1
Sennett MA, Theobald DL. Extant Sequence Reconstruction: The Accuracy of Ancestral Sequence Reconstructions Evaluated by Extant Sequence Cross-Validation. J Mol Evol 2024; 92:181-206. [PMID: 38502220 PMCID: PMC10978691 DOI: 10.1007/s00239-024-10162-3] [Received: 07/12/2023] [Accepted: 02/20/2024] [Indexed: 03/21/2024]
Abstract
Ancestral sequence reconstruction (ASR) is a phylogenetic method widely used to analyze the properties of ancient biomolecules and to elucidate mechanisms of molecular evolution. Despite its increasingly widespread application, the accuracy of ASR is currently unknown, as it is generally impossible to compare resurrected proteins to the true ancestors. Which evolutionary models are best for ASR? How accurate are the resulting inferences? Here we answer these questions using a cross-validation method to reconstruct each extant sequence in an alignment with ASR methodology, a method we term "extant sequence reconstruction" (ESR). We thus can evaluate the accuracy of ASR methodology by comparing ESR reconstructions to the corresponding known true sequences. We find that a common measure of the quality of a reconstructed sequence, the average probability, is indeed a good estimate of the fraction of correct amino acids when the evolutionary model is accurate or overparameterized. However, the average probability is a poor measure for comparing reconstructions from different models, because, surprisingly, a more accurate phylogenetic model often results in reconstructions with lower probability. While better (more predictive) models may produce reconstructions with lower sequence identity to the true sequences, better models nevertheless produce reconstructions that are more biophysically similar to true ancestors. In addition, we find that a large fraction of sequences sampled from the reconstruction distribution may have fewer errors than the single most probable (SMP) sequence reconstruction, despite the fact that the SMP has the lowest expected error of all possible sequences. Our results emphasize the importance of model selection for ASR and the usefulness of sampling sequence reconstructions for analyzing ancestral protein properties. 
ESR is a powerful method for validating the evolutionary models used for ASR and can be applied in practice to any phylogenetic analysis of real biological sequences. Most significantly, ESR uses ASR methodology to provide a general method by which the biophysical properties of resurrected proteins can be compared to the properties of the true protein.
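The relationship between average probability, the SMP sequence, and sampled sequences described in this abstract can be illustrated with a small sketch. The per-site posterior probabilities, the "true" sequence, and all function names below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random

def smp_sequence(posteriors):
    """Pick the most probable residue at each site (the SMP reconstruction)."""
    return [max(site, key=site.get) for site in posteriors]

def average_probability(posteriors, sequence):
    """Mean posterior probability of the residues in `sequence`.

    If the model is accurate, this is the expected fraction of correct residues.
    """
    return sum(site[res] for site, res in zip(posteriors, sequence)) / len(sequence)

def sample_sequence(posteriors, rng):
    """Draw one residue per site from that site's posterior distribution."""
    return [rng.choices(list(site), weights=list(site.values()))[0]
            for site in posteriors]

def fraction_correct(sequence, truth):
    """Observed fraction of sites where `sequence` matches the true sequence."""
    return sum(a == b for a, b in zip(sequence, truth)) / len(truth)

# Hypothetical posteriors for a 4-site protein (illustrative numbers only).
posteriors = [
    {"A": 0.9, "G": 0.1},
    {"L": 0.6, "I": 0.4},
    {"K": 0.5, "R": 0.3, "Q": 0.2},
    {"D": 1.0},
]
truth = ["A", "I", "K", "D"]  # assumed true sequence, known in ESR

smp = smp_sequence(posteriors)
expected = average_probability(posteriors, smp)   # predicted accuracy
observed = fraction_correct(smp, truth)           # realized accuracy
sampled = sample_sequence(posteriors, random.Random(0))
```

In this toy case the SMP sequence has one error at the second site, while a sampled sequence that happens to draw "I" there would have none, which is the kind of outcome the abstract refers to when it says sampled sequences may have fewer errors than the SMP.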
Affiliation(s)
- Michael A Sennett
- Department of Biochemistry, Brandeis University, Waltham, MA, 02453, USA
- Douglas L Theobald
- Department of Biochemistry, Brandeis University, Waltham, MA, 02453, USA.
2
Del Amparo R, Arenas M. Influence of substitution model selection on protein phylogenetic tree reconstruction. Gene 2023; 865:147336. [PMID: 36871672 DOI: 10.1016/j.gene.2023.147336] [Received: 01/04/2023] [Revised: 02/22/2023] [Accepted: 02/28/2023] [Indexed: 03/06/2023]
Abstract
Probabilistic phylogenetic tree reconstruction is traditionally performed under a best-fitting substitution model of molecular evolution previously selected according to diverse statistical criteria. Interestingly, some recent studies proposed that this procedure is unnecessary for phylogenetic tree reconstruction leading to a debate in the field. In contrast to DNA sequences, phylogenetic tree reconstruction from protein sequences is traditionally based on empirical exchangeability matrices that can differ among taxonomic groups and protein families. Considering this aspect, here we investigated the influence of selecting a substitution model of protein evolution on phylogenetic tree reconstruction by the analyses of real and simulated data. We found that phylogenetic tree reconstructions based on a selected best-fitting substitution model of protein evolution are the most accurate, in terms of topology and branch lengths, compared with those derived from substitution models with amino acid replacement matrices far from the selected best-fitting model, especially when the data has large genetic diversity. Indeed, we found that substitution models with similar amino acid replacement matrices produce similar reconstructed phylogenetic trees, suggesting the use of substitution models as similar as possible to a selected best-fitting model when the latter cannot be used. Therefore, we recommend the use of the traditional protocol of selection among substitution models of evolution for protein phylogenetic tree reconstruction.
Affiliation(s)
- Roberto Del Amparo
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain; Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain.
- Miguel Arenas
- CINBIO, Universidade de Vigo, 36310 Vigo, Spain; Department of Biochemistry, Genetics and Immunology, Universidade de Vigo, 36310 Vigo, Spain; Galicia Sur Health Research Institute (IIS Galicia Sur), 36310 Vigo, Spain.
3
Paradis E, Claramunt S, Brown J, Schliep K. Confidence intervals in molecular dating by maximum likelihood. Mol Phylogenet Evol 2023; 178:107652. [PMID: 36306994 DOI: 10.1016/j.ympev.2022.107652] [Received: 05/01/2022] [Revised: 10/11/2022] [Accepted: 10/19/2022] [Indexed: 11/06/2022]
Abstract
Molecular dating has been widely used to infer the times of past evolutionary events using molecular sequences. This paper describes three bootstrap methods to infer confidence intervals under a penalized likelihood framework. The basic idea is to use data pseudoreplicates to infer uncertainty in the branch lengths of a phylogeny reconstructed with molecular sequences. The three specific bootstrap methods are nonparametric (direct tree bootstrapping), semiparametric (rate smoothing), and parametric (Poisson simulation). Our extensive simulation study showed that the three methods perform generally well under a simple strict clock model of molecular evolution; however, the results were less positive with data simulated using an uncorrelated or a correlated relaxed clock model. Several factors impacted, possibly in interaction, the performance of the confidence intervals. Increasing the number of calibration points had a positive effect, as well as increasing the sequence length or the number of sequences although both latter effects depended on the model of evolution. A case study is presented with a molecular phylogeny of the Felidae (Mammalia: Carnivora). A comparison was made with a Bayesian analysis: the results were very close in terms of confidence intervals and there was no marked tendency for an approach to produce younger or older bounds compared to the other.
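The general idea of inferring branch-length uncertainty from pseudoreplicates can be sketched for the parametric case. This is a minimal illustration, not the authors' procedure: it assumes the number of substitutions on a branch is Poisson-distributed with mean equal to branch length times sequence length, and the function names and default values are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson variate with Knuth's algorithm (fine for modest means)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def poisson_bootstrap_ci(branch_length, n_sites, n_reps=1000, level=0.95, seed=0):
    """Percentile interval for a branch length from simulated substitution counts.

    branch_length is in expected substitutions per site; n_sites is the
    alignment length. Both are inputs the caller must supply.
    """
    rng = random.Random(seed)
    mean_count = branch_length * n_sites
    # Each pseudoreplicate is a simulated count converted back to a per-site length.
    reps = sorted(poisson(mean_count, rng) / n_sites for _ in range(n_reps))
    lo_idx = int((1 - level) / 2 * n_reps)
    hi_idx = int((1 + level) / 2 * n_reps) - 1
    return reps[lo_idx], reps[hi_idx]

# Example: a branch of 0.05 substitutions/site estimated from 1000 sites.
lower, upper = poisson_bootstrap_ci(0.05, 1000)
```

Longer alignments give a larger expected count and therefore a narrower interval relative to the branch length, which is consistent with the abstract's remark that increasing sequence length tended to improve the intervals.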
Affiliation(s)
- Santiago Claramunt
- Department of Natural History, Royal Ontario Museum, Toronto, ON 5S2C6, Canada
- Joseph Brown
- Department of Natural History, Royal Ontario Museum, Toronto, ON 5S2C6, Canada
- Klaus Schliep
- Institute of Computational Biotechnology, Technology University Graz, Austria
4
Costa FP, Schrago CG, Mello B. Assessing the relative performance of fast molecular dating methods for phylogenomic data. BMC Genomics 2022; 23:798. [PMID: 36460948 PMCID: PMC9719170 DOI: 10.1186/s12864-022-09030-5] [Received: 06/28/2022] [Accepted: 11/21/2022] [Indexed: 12/05/2022]
Abstract
Advances in genome sequencing techniques produced a significant growth of phylogenomic datasets. This massive amount of data represents a computational challenge for molecular dating with Bayesian approaches. Rapid molecular dating methods have been proposed over the last few decades to overcome these issues. However, a comparative evaluation of their relative performance on empirical data sets is lacking. We analyzed 23 empirical phylogenomic datasets to investigate the performance of two commonly employed fast dating methodologies: penalized likelihood (PL), implemented in treePL, and the relative rate framework (RRF), implemented in RelTime. They were compared to Bayesian analyses using the closest possible substitution models and calibration settings. We found that RRF was computationally faster and generally provided node age estimates statistically equivalent to Bayesian divergence times. PL time estimates consistently exhibited low levels of uncertainty. Overall, to approximate Bayesian approaches, RelTime is an efficient method with significantly lower computational demand, being more than 100 times faster than treePL. Thus, to alleviate the computational burden of Bayesian divergence time inference in the era of massive genomic data, molecular dating can be facilitated using the RRF, allowing evolutionary hypotheses to be tested more quickly and efficiently.
Affiliation(s)
- Fernanda P. Costa
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617 Brazil
- Carlos G. Schrago
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617 Brazil
- Beatriz Mello
- Department of Genetics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ 21941-617 Brazil
5
Ayuso-Fernández I, Molpeceres G, Camarero S, Ruiz-Dueñas FJ, Martínez AT. Ancestral sequence reconstruction as a tool to study the evolution of wood decaying fungi. Frontiers in Fungal Biology 2022; 3:1003489. [PMID: 37746217 PMCID: PMC10512382 DOI: 10.3389/ffunb.2022.1003489] [Received: 07/26/2022] [Accepted: 09/22/2022] [Indexed: 09/26/2023]
Abstract
The study of evolution is limited by the techniques available to do so. Aside from the use of the fossil record, molecular phylogenetics can provide a detailed characterization of evolutionary histories using genes, genomes and proteins. However, these tools provide scarce biochemical information of the organisms and systems of interest and are therefore very limited when they come to explain protein evolution. In the past decade, this limitation has been overcome by the development of ancestral sequence reconstruction (ASR) methods. ASR allows the subsequent resurrection in the laboratory of inferred proteins from now extinct organisms, becoming an outstanding tool to study enzyme evolution. Here we review the recent advances in ASR methods and their application to study fungal evolution, with special focus on wood-decay fungi as essential organisms in the global carbon cycling.
Affiliation(s)
- Iván Ayuso-Fernández
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences (NMBU), Ås, Norway
- Gonzalo Molpeceres
- Centro de Investigaciones Biológicas “Margarita Salas” (CIB), CSIC, Madrid, Spain
- Susana Camarero
- Centro de Investigaciones Biológicas “Margarita Salas” (CIB), CSIC, Madrid, Spain
- Angel T. Martínez
- Centro de Investigaciones Biológicas “Margarita Salas” (CIB), CSIC, Madrid, Spain
6
Del Amparo R, Arenas M. Consequences of Substitution Model Selection on Protein Ancestral Sequence Reconstruction. Mol Biol Evol 2022; 39:6628884. [PMID: 35789388 PMCID: PMC9254009 DOI: 10.1093/molbev/msac144] [Indexed: 01/04/2023]
Abstract
The selection of the best-fitting substitution model of molecular evolution is a traditional step for phylogenetic inferences, including ancestral sequence reconstruction (ASR). However, a few recent studies suggested that applying this procedure does not affect the accuracy of phylogenetic tree reconstruction. Here, we revisited this debate topic by analyzing the influence of selection among substitution models of protein evolution, with focus on exchangeability matrices, on the accuracy of ASR using simulated and real data. We found that the selected best-fitting substitution model produces the most accurate ancestral sequences, especially if the data present large genetic diversity. Indeed, ancestral sequences reconstructed under substitution models with similar exchangeability matrices were similar, suggesting that if the selected best-fitting model cannot be used for the reconstruction, applying a model similar to the selected one is preferred. We conclude that selecting among substitution models of protein evolution is recommended for reconstructing accurate ancestral sequences.
Affiliation(s)
- Roberto Del Amparo
- CINBIO, Universidade de Vigo, Vigo, Spain; Departamento de Bioquímica, Xenética e Immunoloxía, Universidade de Vigo, Vigo, Spain
- Miguel Arenas
- CINBIO, Universidade de Vigo, Vigo, Spain; Departamento de Bioquímica, Xenética e Immunoloxía, Universidade de Vigo, Vigo, Spain; Galicia Sur Health Research Institute (IIS Galicia Sur), Vigo, Spain
7
Mongiardino Koch N, Thompson JR, Hiley AS, McCowin MF, Armstrong AF, Coppard SE, Aguilera F, Bronstein O, Kroh A, Mooi R, Rouse GW. Phylogenomic analyses of echinoid diversification prompt a re-evaluation of their fossil record. eLife 2022; 11:72460. [PMID: 35315317 PMCID: PMC8940180 DOI: 10.7554/elife.72460] [Received: 07/24/2021] [Accepted: 03/03/2022] [Indexed: 12/25/2022]
Abstract
Echinoids are key components of modern marine ecosystems. Despite a remarkable fossil record, the emergence of their crown group is documented by few specimens of unclear affinities, rendering their early history uncertain. The origin of sand dollars, one of its most distinctive clades, is also unclear due to an unstable phylogenetic context. We employ 18 novel genomes and transcriptomes to build a phylogenomic dataset with a near-complete sampling of major lineages. With it, we revise the phylogeny and divergence times of echinoids, and place their history within the broader context of echinoderm evolution. We also introduce the concept of a chronospace - a multidimensional representation of node ages - and use it to explore methodological decisions involved in time calibrating phylogenies. We find the choice of clock model to have the strongest impact on divergence times, while the use of site-heterogeneous models and alternative node prior distributions show minimal effects. The choice of loci has an intermediate impact, affecting mostly deep Paleozoic nodes, for which clock-like genes recover dates more congruent with fossil evidence. Our results reveal that crown group echinoids originated in the Permian and diversified rapidly in the Triassic, despite the relative lack of fossil evidence for this early diversification. We also clarify the relationships between sand dollars and their close relatives and confidently date their origins to the Cretaceous, implying ghost ranges spanning approximately 50 million years, a remarkable discrepancy with their rich fossil record.
Affiliation(s)
- Nicolás Mongiardino Koch
- Department of Earth & Planetary Sciences, Yale University, New Haven, United States; Scripps Institution of Oceanography, University of California San Diego, La Jolla, United States
- Jeffrey R Thompson
- Department of Earth Sciences, Natural History Museum, London, United Kingdom; University College London Center for Life's Origins and Evolution, London, United Kingdom
- Avery S Hiley
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, United States
- Marina F McCowin
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, United States
- A Frances Armstrong
- Department of Invertebrate Zoology and Geology, California Academy of Sciences, San Francisco, United States
- Simon E Coppard
- Bader International Study Centre, Queen's University, Herstmonceux Castle, East Sussex, United Kingdom
- Felipe Aguilera
- Departamento de Bioquímica y Biología Molecular, Facultad de Ciencias Biológicas, Universidad de Concepción, Concepción, Chile
- Omri Bronstein
- School of Zoology, Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel; Steinhardt Museum of Natural History, Tel-Aviv, Israel
- Andreas Kroh
- Department of Geology and Palaeontology, Natural History Museum Vienna, Vienna, Austria
- Rich Mooi
- Department of Invertebrate Zoology and Geology, California Academy of Sciences, San Francisco, United States
- Greg W Rouse
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, United States
8
Abstract
The reconstruction of genetic material of ancestral organisms constitutes a powerful application of evolutionary biology. A fundamental step in this inference is the ancestral sequence reconstruction (ASR), which can be performed with diverse methodologies implemented in computer frameworks. However, most of these methodologies ignore evolutionary properties frequently observed in microbes, such as genetic recombination and complex selection processes, that can bias the traditional ASR. From a practical perspective, here I review methodologies for the reconstruction of ancestral DNA and protein sequences, with particular focus on microbes, and including biases, recommendations, and software implementations. I conclude that microbial ASR is a complex analysis that should be carefully performed and that there is a need for methods to infer more realistic ancestral microbial sequences.
Affiliation(s)
- Miguel Arenas
- Biomedical Research Center (CINBIO), University of Vigo, Vigo, Spain.
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain.
- Galicia Sur Health Research Institute (IIS Galicia Sur), Vigo, Spain.
9
Tao Q, Barba-Montoya J, Kumar S. Data-driven speciation tree prior for better species divergence times in calibration-poor molecular phylogenies. Bioinformatics 2021; 37:i102-i110. [PMID: 34252953 PMCID: PMC8275332 DOI: 10.1093/bioinformatics/btab307] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Precise time calibrations needed to estimate ages of species divergence are not always available due to fossil records' incompleteness. Consequently, clock calibrations available for Bayesian dating analyses can be few and diffused, i.e. phylogenies are calibration-poor, impeding reliable inference of the timetree of life. We examined the role of speciation birth-death (BD) tree prior on Bayesian node age estimates in calibration-poor phylogenies and tested the usefulness of an informative, data-driven tree prior to enhancing the accuracy and precision of estimated times. RESULTS: We present a simple method to estimate parameters of the BD tree prior from the molecular phylogeny for use in Bayesian dating analyses. The use of a data-driven birth-death (ddBD) tree prior leads to improvement in Bayesian node age estimates for calibration-poor phylogenies. We show that the ddBD tree prior, along with only a few well-constrained calibrations, can produce excellent node ages and credibility intervals, whereas the use of an uninformative, uniform (flat) tree prior may require more calibrations. Relaxed clock dating with ddBD tree prior also produced better results than a flat tree prior when using diffused node calibrations. We also suggest using ddBD tree priors to improve the detection of outliers and influential calibrations in cross-validation analyses. These results have practical applications because the ddBD tree prior reduces the number of well-constrained calibrations necessary to obtain reliable node age estimates. This would help address key impediments in building the grand timetree of life, revealing the process of speciation and elucidating the dynamics of biological diversification. AVAILABILITY AND IMPLEMENTATION: An R module for computing the ddBD tree prior, simulated datasets and empirical datasets are available at https://github.com/cathyqqtao/ddBD-tree-prior.
Affiliation(s)
- Qiqing Tao
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA; Department of Biology, Temple University, Philadelphia, PA 19122, USA
- Jose Barba-Montoya
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA; Department of Biology, Temple University, Philadelphia, PA 19122, USA
- Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA; Department of Biology, Temple University, Philadelphia, PA 19122, USA; Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
10
Barba-Montoya J, Tao Q, Kumar S. Using a GTR+Γ substitution model for dating sequence divergence when stationarity and time-reversibility assumptions are violated. Bioinformatics 2021; 36:i884-i894. [PMID: 33381826 DOI: 10.1093/bioinformatics/btaa820] [Accepted: 09/07/2020] [Indexed: 11/15/2022]
Abstract
MOTIVATION: As the number and diversity of species and genes grow in contemporary datasets, two common assumptions made in all molecular dating methods, namely the time-reversibility and stationarity of the substitution process, become untenable. No software tools for molecular dating allow researchers to relax these two assumptions in their data analyses. Frequently the same General Time Reversible (GTR) model across lineages along with a gamma (+Γ) distributed rates across sites is used in relaxed clock analyses, which assumes time-reversibility and stationarity of the substitution process. Many reports have quantified the impact of violations of these underlying assumptions on molecular phylogeny, but none have systematically analyzed their impact on divergence time estimates. RESULTS: We quantified the bias on time estimates that resulted from using the GTR + Γ model for the analysis of computer-simulated nucleotide sequence alignments that were evolved with non-stationary (NS) and non-reversible (NR) substitution models. We tested Bayesian and RelTime approaches that do not require a molecular clock for estimating divergence times. Divergence times obtained using a GTR + Γ model differed only slightly (∼3% on average) from the expected times for NR datasets, but the difference was larger for NS datasets (∼10% on average). The use of only a few calibrations reduced these biases considerably (∼5%). Confidence and credibility intervals from GTR + Γ analysis usually contained correct times. Therefore, the bias introduced by the use of the GTR + Γ model to analyze datasets, in which the time-reversibility and stationarity assumptions are violated, is likely not large and can be reduced by applying multiple calibrations. AVAILABILITY AND IMPLEMENTATION: All datasets are deposited in Figshare: https://doi.org/10.6084/m9.figshare.12594638.
Affiliation(s)
- Jose Barba-Montoya
- Institute for Genomics and Evolutionary Medicine; Department of Biology, Temple University, Philadelphia, PA 19122, USA
- Qiqing Tao
- Institute for Genomics and Evolutionary Medicine; Department of Biology, Temple University, Philadelphia, PA 19122, USA
- Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine; Department of Biology, Temple University, Philadelphia, PA 19122, USA; Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah 21589, Saudi Arabia