1
Casanellas M, Fernández-Sánchez J, Garrote-López M, Sabaté-Vidales M. Designing Weights for Quartet-Based Methods When Data are Heterogeneous Across Lineages. Bull Math Biol 2023; 85:68. [PMID: 37310552 DOI: 10.1007/s11538-023-01167-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/02/2022] [Accepted: 05/15/2023] [Indexed: 06/14/2023]
Abstract
Homogeneity across lineages is a general assumption in phylogenetics according to which nucleotide substitution rates are common to all lineages. Many phylogenetic methods relax this hypothesis but keep a simple enough model to make the process of sequence evolution more tractable. On the other hand, dealing successfully with the general case (heterogeneity of rates across lineages) is one of the key features of phylogenetic reconstruction methods based on algebraic tools. The goal of this paper is twofold. First, we present a new weighting system for quartets (ASAQ) based on algebraic and semi-algebraic tools, thus especially indicated to deal with data evolving under heterogeneous rates. This method combines the weights of two previous methods by means of a test based on the positivity of the branch lengths estimated with the paralinear distance. ASAQ is statistically consistent when applied to data generated under the general Markov model, considers rate and base composition heterogeneity among lineages and does not assume stationarity nor time-reversibility. Second, we test and compare the performance of several quartet-based methods for phylogenetic tree reconstruction (namely QFM, wQFM, quartet puzzling, weight optimization and Willson's method) in combination with several systems of weights, including ASAQ weights and other weights based on algebraic and semi-algebraic methods or on the paralinear distance. These tests are applied to both simulated and real data and support weight optimization with ASAQ weights as a reliable and successful reconstruction method that improves upon the accuracy of global methods (such as neighbor-joining or maximum likelihood) in the presence of long branches or on mixtures of distributions on trees.
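The abstract above mentions a test based on the positivity of branch lengths estimated with the paralinear distance. The following is a minimal sketch of those two ingredients only, not the ASAQ method itself: a paralinear distance in its commonly cited form, d = -ln(det J / sqrt(det D_x det D_y)), and the four-point formula for the interior branch of a quartet ab|cd. The function names and the handling of gaps are assumptions made for illustration, and some authors rescale the distance by a constant factor.

```python
import math

NUCS = "ACGT"

def det(m):
    """Determinant by cofactor expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, val in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * val * det(minor)
    return total

def paralinear_distance(seq_x, seq_y):
    """Paralinear distance between two aligned sequences.

    Columns containing anything other than A, C, G, T are skipped
    (an assumption of this sketch).
    """
    idx = {n: i for i, n in enumerate(NUCS)}
    joint = [[0.0] * 4 for _ in range(4)]
    n_sites = 0
    for a, b in zip(seq_x, seq_y):
        if a in idx and b in idx:
            joint[idx[a]][idx[b]] += 1
            n_sites += 1
    if n_sites == 0:
        raise ValueError("no comparable sites")
    joint = [[c / n_sites for c in row] for row in joint]
    fx = [sum(row) for row in joint]                               # marginal of x
    fy = [sum(joint[i][j] for i in range(4)) for j in range(4)]    # marginal of y
    ratio_den = math.sqrt(math.prod(fx) * math.prod(fy))
    d = det(joint)
    if d <= 0 or ratio_den == 0:
        raise ValueError("distance undefined for these sequences")
    return -math.log(d / ratio_den)

def interior_branch_length(d_ab, d_cd, d_ac, d_bd):
    """Four-point estimate of the interior branch of the quartet ab|cd."""
    return (d_ac + d_bd - d_ab - d_cd) / 2.0
```

A positive value of `interior_branch_length` is compatible with the topology ab|cd under an additive distance; how ASAQ turns such a check into weights is described in the paper, not here.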
Affiliation(s)
- Marta Casanellas
  - Institut de Matematiques de la UPC-BarcelonaTech (IMTech), Universitat Politècnica de Catalunya and Centre de Recerca Matemàtica, Av. Diagonal 647, 08028, Barcelona, Spain.
- Jesús Fernández-Sánchez
  - Institut de Matematiques de la UPC-BarcelonaTech (IMTech), Universitat Politècnica de Catalunya and Centre de Recerca Matemàtica, Av. Diagonal 647, 08028, Barcelona, Spain.

2
Casanellas M, Fernandez-Sanchez J, Garrote-Lopez M. SAQ: Semi-Algebraic Quartet Reconstruction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2855-2861. [PMID: 34339375 DOI: 10.1109/tcbb.2021.3101278] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
We present the phylogenetic quartet reconstruction method SAQ (Semi-Algebraic Quartet reconstruction). SAQ is consistent with the most general Markov model of nucleotide substitution and, in particular, it allows for rate heterogeneity across lineages. Based on the algebraic and semi-algebraic description of distributions that arise from the general Markov model on a quartet, the method outputs normalized weights for the three trivalent quartets (which can be used as input of quartet-based methods). We show that SAQ is a highly competitive method that outperforms most of the well-known reconstruction methods on data simulated under the general Markov model on 4-taxon trees. Moreover, it also achieves a high performance on data that violate the underlying assumptions.

3
Vera-Ruiz VA, Robinson J, Jermiin LS. A Likelihood-Ratio Test for Lumpability of Phylogenetic Data: Is the Markovian Property of an Evolutionary Process retained in Recoded DNA? Syst Biol 2021; 71:660-675. [PMID: 34498090 DOI: 10.1093/sysbio/syab074] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/05/2021] [Revised: 08/19/2021] [Accepted: 08/27/2021] [Indexed: 11/12/2022]
Abstract
In molecular phylogenetics, it is typically assumed that the evolutionary process for DNA can be approximated by independent and identically distributed Markovian processes at the variable sites and that these processes diverge over the edges of a rooted bifurcating tree. Sometimes the nucleotides are transformed from a 4-state alphabet to a 3- or 2-state alphabet by a procedure that is called recoding, lumping, or grouping of states. Here, we introduce a likelihood-ratio test for lumpability for DNA that has diverged under different Markovian conditions, which assesses the assumption that the Markovian property of the evolutionary process over each edge is retained after recoding of the nucleotides. The test is derived and validated numerically on simulated data. To demonstrate the insights that can be gained by using the test, we assessed two published data sets, one of mitochondrial DNA from a phylogenetic study of the ratites (Syst. Biol. 59:90-107 [2010]) and the other of nuclear DNA from a phylogenetic study of yeast (Mol. Biol. Evol. 21:1455-1458 [2004]). Our analysis of these data sets revealed that recoding of the DNA eliminated some of the compositional heterogeneity detected over the sequences. However, the Markovian property of the original evolutionary process was not retained by the recoding, leading to some significant distortions of edge lengths in reconstructed trees.
Affiliation(s)
- Victor A Vera-Ruiz
  - School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia
  - Department of Mathematics and Statistics, University of Nevada, Reno, NV 89557, USA
- John Robinson
  - School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia
- Lars S Jermiin
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland

4
Peta V, Raths R, Bücking H. Massilia horti sp. nov. and Noviherbaspirillum arenae sp. nov., two novel soil bacteria of the Oxalobacteraceae. Int J Syst Evol Microbiol 2021; 71. [PMID: 33956597 DOI: 10.1099/ijsem.0.004765] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022]
Abstract
We isolated two new soil bacteria: ONC3T (from garden soil in NC, USA; LMG 31738T=NRRL B-65553T) and M1T (from farmed soil in MI, USA; NRRL B-65551T=ATCC TSD-197T=LMG 31739T) and characterized their metabolic phenotype based on Biolog, MALDI-TOF MS and fatty acid analyses, and compared 16S rRNA and whole genome sequences to other members of the Oxalobacteraceae after sequencing on an Illumina Nextera platform. Based on the results of 16S rRNA sequence analysis, ONC3T shows the highest sequence similarity to Massilia solisilvae J18T (97.8 %), Massilia terrae J11T (97.7 %) and Massilia agilis J9T (97.3 %). Strain M1T is most closely related to Noviherbaspirillum denitrificans TSA40T, Noviherbaspirillum agri K-1-15T and Noviherbaspirillum autotrophicum TSA66T (sequence identity of 98.2, 98.0 and 97.8 %, respectively). The whole genome of ONC3T has an assembled size of 5.62 Mbp, a G+C content of 63.8 mol% and contains 5104 protein-coding sequences, 56 tRNA genes and two rRNA operons. The genome of M1T has a length of 4.71 Mbp, a G+C content of 63.81 mol% and includes 4967 protein-coding genes, two rRNA operons and 44 tRNA genes. Whole genome comparisons identified Massilia sp. WG5 with a 79.3 % average nucleotide identity (ANI) and 22.6 % digital DNA-DNA hybridization (dDDH), and Massilia sp. UBA11196 with 78.2 % average amino acid identity (AAI) as the most closely related species to ONC3T. M1T is most closely related to N. autotrophicum TSA66T with an ANI of 80.27 %, or N. denitrificans TSA40T with a dDDH of 22.3 %. The application of community-accepted standards such as <98.7 % in 16S sequence similarity and <95-96 % ANI or <70 % DDH supports the classification of Massilia horti ONC3T and Noviherbaspirillum arenae M1T as novel species within the Oxalobacteraceae.
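The thresholds quoted at the end of the abstract can be read as a simple checklist. The sketch below encodes them as a hypothetical helper, `novel_species_criteria`; the choice of 95 % as the ANI cutoff (the lower end of the quoted 95-96 % range) is an assumption, and real species delineation weighs considerably more evidence than three numbers.

```python
def novel_species_criteria(sim_16s, ani, dddh,
                           max_16s=98.7, max_ani=95.0, max_dddh=70.0):
    """Report which of the quoted thresholds a strain falls below.

    All values are percentages measured against the closest known relative.
    """
    return {
        "16S": sim_16s < max_16s,
        "ANI": ani < max_ani,
        "dDDH": dddh < max_dddh,
    }

# Values reported above for strain ONC3T against its closest relatives.
onc3 = novel_species_criteria(sim_16s=97.8, ani=79.3, dddh=22.6)
```

With the values reported for ONC3T, every entry in `onc3` is `True`, consistent with the abstract's conclusion.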
Affiliation(s)
- Vincent Peta
  - South Dakota State University, Biology and Microbiology Department, Brookings SD 57007, USA
- Rachel Raths
  - South Dakota State University, Biology and Microbiology Department, Brookings SD 57007, USA
- Heike Bücking
  - University of Missouri, Division of Plant Sciences, College of Agriculture, Food and Natural Resources, Columbia, MO 65211, USA
  - South Dakota State University, Biology and Microbiology Department, Brookings SD 57007, USA

5
Abstract
In 1981, the Journal of Molecular Evolution (JME) published an article entitled "Evolutionary trees from DNA sequences: A maximum likelihood approach" by Joseph (Joe) Felsenstein (J Mol Evol 17:368-376, 1981). This groundbreaking work laid the foundation for the emerging field of statistical phylogenetics, providing a tractable way of finding maximum likelihood (ML) estimates of evolutionary trees from DNA sequence data. This paper is the second most cited (more than 9000 citations) in JME after Kimura's (J Mol Evol 16:111-120, 1980) seminal paper on a model of nucleotide substitution (with nearly 20,000 citations). On the occasion of the 50th anniversary of JME, we elaborate on the significance of Felsenstein's ML approach to estimating phylogenetic trees.
Affiliation(s)
- David Posada
  - CINBIO, Universidade de Vigo, 36310, Vigo, Spain.
  - Department of Biochemistry, Genetics, and Immunology, Universidade de Vigo, 36310, Vigo, Spain.
  - Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain.
- Keith A Crandall
  - Computational Biology Institute and Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA.
  - Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA.

6
Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genom Bioinform 2020; 2:lqaa041. [PMID: 33575594 PMCID: PMC7671319 DOI: 10.1093/nargab/lqaa041] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 01/17/2020] [Revised: 05/18/2020] [Accepted: 06/04/2020] [Indexed: 12/15/2022]
Abstract
Molecular phylogenetics plays a key role in comparative genomics and has increasingly significant impacts on science, industry, government, public health and society. In this paper, we posit that the current phylogenetic protocol is missing two critical steps, and that their absence allows model misspecification and confirmation bias to unduly influence phylogenetic estimates. Based on the potential offered by well-established but under-used procedures, such as assessment of phylogenetic assumptions and tests of goodness of fit, we introduce a new phylogenetic protocol that will reduce confirmation bias and increase the accuracy of phylogenetic estimates.
Affiliation(s)
- Lars S Jermiin
  - CSIRO Land & Water, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - School of Biology & Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
- Renee A Catullo
  - CSIRO Land & Water, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - School of Science and Health & Hawkesbury Institute of the Environment, Western Sydney University, Penrith, NSW 2751, Australia
- Barbara R Holland
  - School of Natural Sciences, University of Tasmania, Hobart, TAS 7001, Australia

7
Naser-Khdour S, Minh BQ, Zhang W, Stone EA, Lanfear R. The Prevalence and Impact of Model Violations in Phylogenetic Analysis. Genome Biol Evol 2019; 11:3341-3352. [PMID: 31536115 PMCID: PMC6893154 DOI: 10.1093/gbe/evz193] [Citation(s) in RCA: 84] [Impact Index Per Article: 14.0] [Accepted: 09/03/2019] [Indexed: 12/24/2022]
Abstract
In phylogenetic inference, we commonly use models of substitution which assume that sequence evolution is stationary, reversible, and homogeneous (SRH). Although the use of such models is often criticized, the extent of SRH violations and their effects on phylogenetic inference of tree topologies and edge lengths are not well understood. Here, we introduce and apply the maximal matched-pairs tests of homogeneity to assess the scale and impact of SRH model violations on 3,572 partitions from 35 published phylogenetic data sets. We show that roughly one-quarter of all the partitions we analyzed (23.5%) reject the SRH assumptions, and that for 25% of data sets, tree topologies inferred from all partitions differ significantly from topologies inferred using the subset of partitions that do not reject the SRH assumptions. This proportion increases when comparing trees inferred using the subset of partitions that rejects the SRH assumptions, to those inferred from partitions that do not reject the SRH assumptions. These results suggest that the extent and effects of model violation in phylogenetics may be substantial. They highlight the importance of testing for model violations and possibly excluding partitions that violate models prior to tree reconstruction. Our results also suggest that further effort in developing models that do not require SRH assumptions could lead to large improvements in the accuracy of phylogenomic inference. The scripts necessary to perform the analysis are available in https://github.com/roblanf/SRHtests, and the new tests we describe are available as a new option in IQ-TREE (http://www.iqtree.org).
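Matched-pairs tests of homogeneity compare the two off-diagonal halves of the divergence (pairwise count) matrix of two aligned sequences. As a hedged illustration, the sketch below computes the classical Bowker statistic for symmetry on a single pair of sequences. It is not the maximal matched-pairs procedure of the paper nor the IQ-TREE implementation; the function name and the gap handling are assumptions. Under the null hypothesis the statistic is approximately chi-squared distributed with the returned degrees of freedom.

```python
def bowker_symmetry(seq_x, seq_y, alphabet="ACGT"):
    """Return (statistic, degrees_of_freedom) of Bowker's test for symmetry.

    Columns with characters outside `alphabet` are ignored.
    """
    idx = {ch: i for i, ch in enumerate(alphabet)}
    k = len(alphabet)
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(seq_x, seq_y):
        if a in idx and b in idx:
            counts[idx[a]][idx[b]] += 1

    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total = counts[i][j] + counts[j][i]
            if total > 0:  # pairs never observed carry no information
                stat += (counts[i][j] - counts[j][i]) ** 2 / total
                df += 1
    return stat, df
```

A large statistic relative to its degrees of freedom suggests that the two sequences did not evolve under the same stationary, reversible and homogeneous conditions.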
Affiliation(s)
- Suha Naser-Khdour
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Bui Quang Minh
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
  - Research School of Computer Science, Australian National University, Canberra, Australian Capital Territory, Australia
- Wenqi Zhang
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Eric A Stone
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Robert Lanfear
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia

8
Kaehler BD, Yap VB, Huttley GA. Standard Codon Substitution Models Overestimate Purifying Selection for Nonstationary Data. Genome Biol Evol 2018; 9:134-149. [PMID: 28175284 PMCID: PMC5381540 DOI: 10.1093/gbe/evw308] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Accepted: 01/02/2017] [Indexed: 01/28/2023]
Abstract
Estimation of natural selection on protein-coding sequences is a key comparative genomics approach for de novo prediction of lineage-specific adaptations. Selective pressure is measured on a per-gene basis by comparing the rate of nonsynonymous substitutions to the rate of synonymous substitutions. All published codon substitution models have been time-reversible and thus assume that sequence composition does not change over time. We previously demonstrated that if time-reversible DNA substitution models are applied in the presence of changing sequence composition, the number of substitutions is systematically biased towards overestimation. We extend these findings to the case of codon substitution models and further demonstrate that the ratio of nonsynonymous to synonymous rates of substitution tends to be underestimated over three data sets of mammals, vertebrates, and insects. Our basis for comparison is a nonstationary codon substitution model that allows sequence composition to change. Goodness-of-fit results demonstrate that our new model tends to fit the data better. Direct measurement of nonstationarity shows that bias in estimates of natural selection and genetic distance increases with the degree of violation of the stationarity assumption. Additionally, inferences drawn under time-reversible models are systematically affected by compositional divergence. As genomic sequences accumulate at an accelerating rate, the importance of accurate de novo estimation of natural selection increases. Our results establish that our new model provides a more robust perspective on this fundamental quantity.
Affiliation(s)
- Benjamin D Kaehler
  - Research School of Biology, College of Medicine, Biology, and Environment, Australian National University, Canberra, ACT, Australia
- Von Bing Yap
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
- Gavin A Huttley
  - Research School of Biology, College of Medicine, Biology, and Environment, Australian National University, Canberra, ACT, Australia

9
Abstract
Most phylogenetic methods are model-based and depend on models of evolution designed to approximate the evolutionary processes. Several methods have been developed to identify suitable models of evolution for phylogenetic analysis of alignments of nucleotide or amino acid sequences and some of these methods are now firmly embedded in the phylogenetic protocol. However, in a disturbingly large number of cases, it appears that these models were used without acknowledgement of their inherent shortcomings. In this chapter, we discuss the problem of model selection and show how some of the inherent shortcomings may be identified and overcome.
Affiliation(s)
- Vivek Jayaswal
  - School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD, Australia
- Faisal M Ababneh
  - Department of Mathematics & Statistics, Al-Hussein Bin Talal University, Ma'an, Jordan
- John Robinson
  - School of Mathematics & Statistics, University of Sydney, Sydney, NSW, Australia

10
Klopfstein S, Vilhelmsen L, Ronquist F. A Nonstationary Markov Model Detects Directional Evolution in Hymenopteran Morphology. Syst Biol 2015; 64:1089-103. [PMID: 26272507 PMCID: PMC4604834 DOI: 10.1093/sysbio/syv052] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Received: 01/17/2015] [Accepted: 07/17/2015] [Indexed: 11/13/2022]
Abstract
Directional evolution has played an important role in shaping the morphological, ecological, and molecular diversity of life. However, standard substitution models assume stationarity of the evolutionary process over the time scale examined, thus impeding the study of directionality. Here we explore a simple, nonstationary model of evolution for discrete data, which assumes that the state frequencies at the root differ from the equilibrium frequencies of the homogeneous evolutionary process along the rest of the tree (i.e., the process is nonstationary, nonreversible, but homogeneous). Within this framework, we develop a Bayesian approach for testing directional versus stationary evolution using a reversible-jump algorithm. Simulations show that when only data from extant taxa are available, the success in inferring directionality is strongly dependent on the evolutionary rate, the shape of the tree, the relative branch lengths, and the number of taxa. Given suitable evolutionary rates (0.1-0.5 expected substitutions between root and tips), accounting for directionality improves tree inference and often allows correct rooting of the tree without the use of an outgroup. As an empirical test, we apply our method to study directional evolution in hymenopteran morphology. We focus on three character systems: wing veins, muscles, and sclerites. We find strong support for a trend toward loss of wing veins and muscles, while stationarity cannot be ruled out for sclerites. Adding fossil and time information in a total-evidence dating approach, we show that accounting for directionality results in more precise estimates not only of the ancestral state at the root of the tree, but also of the divergence times. Our model relaxes the assumption of stationarity and reversibility by adding a minimum of additional parameters, and is thus well suited to studying the nature of the evolutionary process in data sets of limited size, such as morphology and ecology.
Affiliation(s)
- Seraina Klopfstein
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-104 05 Stockholm, Sweden
  - The University of Adelaide, ACEBB, Adelaide SA 5005, Australia
  - Natural History Museum, Department of Invertebrates, CH-3005 Bern, Switzerland
- Lars Vilhelmsen
  - Biosystematics, Natural History Museum of Denmark, DK-2100 Copenhagen Ø, Denmark
- Fredrik Ronquist
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-104 05 Stockholm, Sweden

11
Kaehler BD, Yap VB, Zhang R, Huttley GA. Genetic distance for a general non-stationary Markov substitution process. Syst Biol 2015; 64:281-93. [PMID: 25503772 PMCID: PMC4380038 DOI: 10.1093/sysbio/syu106] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 09/12/2013] [Accepted: 12/01/2014] [Indexed: 11/18/2022]
Abstract
The genetic distance between biological sequences is a fundamental quantity in molecular evolution. It pertains to questions of rates of evolution, existence of a molecular clock, and phylogenetic inference. Under the class of continuous-time substitution models, the distance is commonly defined as the expected number of substitutions at any site in the sequence. We eschew the almost ubiquitous assumptions of evolution under stationarity and time-reversible conditions and extend the concept of the expected number of substitutions to nonstationary Markov models where the only remaining constraint is of time homogeneity between nodes in the tree. Our measure of genetic distance reduces to the standard formulation if the data in question are consistent with the stationarity assumption. We apply this general model to samples from across the tree of life to compare distances so obtained with those from the general time-reversible model, with and without rate heterogeneity across sites, and the paralinear distance, an empirical pairwise method explicitly designed to address nonstationarity. We discover that estimates from both variants of the general time-reversible model and the paralinear distance systematically overestimate genetic distance and departure from the molecular clock. The magnitude of the distance bias is proportional to departure from stationarity, which we demonstrate to be associated with longer edge lengths. The marked improvement in consistency between the general nonstationary Markov model and sequence alignments leads us to conclude that analyses of evolutionary rates and phylogenies will be substantively improved by application of this model.
Affiliation(s)
- Benjamin D Kaehler
  - John Curtin School of Medical Research, Australian National University, Canberra, ACT, 2600, Australia
- Von Bing Yap
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546, Singapore
- Rongli Zhang
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546, Singapore
- Gavin A Huttley
  - John Curtin School of Medical Research, Australian National University, Canberra, ACT, 2600, Australia

12
Jayaswal V, Wong TKF, Robinson J, Poladian L, Jermiin LS. Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages. Syst Biol 2014; 63:726-42. [PMID: 24927722 DOI: 10.1093/sysbio/syu036] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/14/2022]
Abstract
Molecular phylogenetic studies of homologous sequences of nucleotides often assume that the underlying evolutionary process was globally stationary, reversible, and homogeneous (SRH), and that a model of evolution with one or more site-specific and time-reversible rate matrices (e.g., the GTR rate matrix) is enough to accurately model the evolution of data over the whole tree. However, an increasing body of data suggests that evolution under these conditions is an exception, rather than the norm. To address this issue, several non-SRH models of molecular evolution have been proposed, but they either ignore heterogeneity in the substitution process across sites (HAS) or assume it can be modeled accurately using the Γ distribution. As an alternative to these models of evolution, we introduce a family of mixture models that approximate HAS without the assumption of an underlying predefined statistical distribution. This family of mixture models is combined with non-SRH models of evolution that account for heterogeneity in the substitution process across lineages (HAL). We also present two algorithms for searching model space and identifying an optimal model of evolution that is less likely to over- or underparameterize the data. The performance of the two new algorithms was evaluated using alignments of nucleotides with 10 000 sites simulated under complex non-SRH conditions on a 25-tipped tree. The algorithms were found to be very successful, identifying the correct HAL model with a 75% success rate (the average success rate for assigning rate matrices to the tree's 48 edges was 99.25%) and, for the correct HAL model, identifying the correct HAS model with a 98% success rate. Finally, parameter estimates obtained under the correct HAL-HAS model were found to be accurate and precise.
The merits of our new algorithms were illustrated with an analysis of 42 337 second codon sites extracted from a concatenation of 106 alignments of orthologous genes encoded by the nuclear genomes of Saccharomyces cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. castellii, S. kluyveri, S. bayanus, and Candida albicans. Our results show that second codon sites in the ancestral genome of these species contained 49.1% invariable sites, 39.6% variable sites belonging to one rate category (V1), and 11.3% variable sites belonging to a second rate category (V2). The ancestral nucleotide content was found to differ markedly across these three sets of sites, and the evolutionary processes operating at the variable sites were found to be non-SRH and best modeled by a combination of eight edge-specific rate matrices (four for V1 and four for V2). The number of substitutions per site at the variable sites also differed markedly, with sites belonging to V1 evolving slower than those belonging to V2 along the lineages separating the seven species of Saccharomyces. Finally, sites belonging to V1 appeared to have ceased evolving along the lineages separating S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus, implying that they might have become so selectively constrained that they could be considered invariable sites in these species.
Affiliation(s)
- Vivek Jayaswal
  - School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia
  - School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia
  - CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
  - Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Thomas K F Wong
  - School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia
  - School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia
  - CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
  - Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- John Robinson
  - School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia
  - School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia
  - CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
  - Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Leon Poladian
  - School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia
  - School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia
  - CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
  - Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia
- Lars S Jermiin
  - School of Biomedical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia
  - School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia
  - CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
  - Centre for Mathematical Biology, University of Sydney, Sydney, NSW 2006, Australia

13
Groussin M, Boussau B, Gouy M. A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Syst Biol 2013; 62:523-38. [PMID: 23475623 PMCID: PMC3676677 DOI: 10.1093/sysbio/syt016] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Indexed: 11/13/2022]
Abstract
Most models of nucleotide or amino acid substitution used in phylogenetic studies assume that the evolutionary process has been homogeneous across lineages and that composition of nucleotides or amino acids has remained the same throughout the tree. These oversimplified assumptions are refuted by the observation that compositional variability characterizes extant biological sequences. Branch-heterogeneous models of protein evolution that account for compositional variability have been developed, but are not yet in common use because of the large number of parameters required, leading to high computational costs and potential overparameterization. Here, we present a new branch-nonhomogeneous and nonstationary model of protein evolution that captures more accurately the high complexity of sequence evolution. This model, henceforth called Correspondence and likelihood analysis (COaLA), makes use of a correspondence analysis to reduce the number of parameters to be optimized through maximum likelihood, focusing on most of the compositional variation observed in the data. The model was thoroughly tested on both simulated and biological data sets to show its high performance in terms of data fitting and CPU time. COaLA efficiently estimates ancestral amino acid frequencies and sequences, making it relevant for studies aiming at reconstructing and resurrecting ancestral amino acid sequences. Finally, we applied COaLA on a concatenate of universal amino acid sequences to confirm previous results obtained with a nonhomogeneous Bayesian model regarding the early pattern of adaptation to optimal growth temperature, supporting the mesophilic nature of the Last Universal Common Ancestor.
Affiliation(s)
- M Groussin
- Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, UMR5558, Villeurbanne, France.
|
14
|
Holland BR, Jarvis PD, Sumner JG. Low-Parameter Phylogenetic Inference Under the General Markov Model. Syst Biol 2012; 62:78-92. [DOI: 10.1093/sysbio/sys072] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Indexed: 11/14/2022]
Affiliation(s)
- Barbara R. Holland
- School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
- Peter D. Jarvis
- School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
- Jeremy G. Sumner
- School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
|
15
|
Regier JC, Zwick A. Sources of signal in 62 protein-coding nuclear genes for higher-level phylogenetics of arthropods. PLoS One 2011; 6:e23408. [PMID: 21829732 PMCID: PMC3150433 DOI: 10.1371/journal.pone.0023408] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Received: 01/24/2011] [Accepted: 07/15/2011] [Indexed: 11/25/2022]
Abstract
BACKGROUND This study aims to investigate the strength of various sources of phylogenetic information that led to recent seemingly robust conclusions about higher-level arthropod phylogeny and to assess the role of excluding or downweighting synonymous change for arriving at those conclusions. METHODOLOGY/PRINCIPAL FINDINGS The current study analyzes DNA sequences from 68 gene segments of 62 distinct protein-coding nuclear genes for 80 species. Gene segments analyzed individually support numerous nodes recovered in combined-gene analyses, but few of the higher-level nodes of greatest current interest. However, neither is there support for conflicting alternatives to these higher-level nodes. Gene segments with higher rates of nonsynonymous change tend to be more informative overall, but those with lower rates tend to provide stronger support for deeper nodes. Higher-level nodes with bootstrap values in the 80% - 99% range for the complete data matrix are markedly more sensitive to substantial drops in their bootstrap percentages after character subsampling than those with 100% bootstrap, suggesting that these nodes are likely not to have been strongly supported with many fewer data than in the full matrix. Data set partitioning of total data by (mostly) synonymous and (mostly) nonsynonymous change improves overall node support, but the result remains much inferior to analysis of (unpartitioned) nonsynonymous change alone. Clusters of genes with similar nonsynonymous rate properties (e.g., faster vs. slower) show some distinct patterns of node support but few conflicts. Synonymous change is shown to contribute little, if any, phylogenetic signal to the support of higher-level nodes, but it does contribute nonphylogenetic signal, probably through its underlying heterogeneous nucleotide composition. Analysis of seemingly conservative indels does not prove useful. 
CONCLUSIONS Generating a robust molecular higher-level phylogeny of Arthropoda is currently possible with large amounts of data and an exclusive reliance on nonsynonymous change.
Affiliation(s)
- Jerome C. Regier
- Institute for Bioscience and Biotechnology Research, University of Maryland, College Park, Maryland, United States of America
- Department of Entomology, University of Maryland, College Park, Maryland, United States of America
- Center for Biosystems Research, University of Maryland Biotechnology Institute, College Park, Maryland, United States of America
- Andreas Zwick
- Center for Biosystems Research, University of Maryland Biotechnology Institute, College Park, Maryland, United States of America
- Entomology, State Museum of Natural History, Stuttgart, Germany
|
16
|
Jayaswal V, Ababneh F, Jermiin LS, Robinson J. Reducing Model Complexity of the General Markov Model of Evolution. Mol Biol Evol 2011; 28:3045-59. [DOI: 10.1093/molbev/msr128] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022]
|
17
|
Berger SA, Krompass D, Stamatakis A. Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst Biol 2011; 60:291-302. [PMID: 21436105 PMCID: PMC3078422 DOI: 10.1093/sysbio/syr010] [Citation(s) in RCA: 348] [Impact Index Per Article: 24.9] [Received: 03/07/2010] [Revised: 06/08/2010] [Accepted: 01/24/2011] [Indexed: 11/23/2022]
Abstract
We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.
Affiliation(s)
- Simon A. Berger
- The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
- Denis Krompass
- The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
- Alexandros Stamatakis
- The Exelixis Lab, Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, D-69118 Heidelberg, Germany
|
18
|
Jayaswal V, Jermiin LS, Poladian L, Robinson J. Two stationary nonhomogeneous Markov models of nucleotide sequence evolution. Syst Biol 2010; 60:74-86. [PMID: 21081482 DOI: 10.1093/sysbio/syq076] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]
Abstract
The general Markov model (GMM) of nucleotide substitution does not assume the evolutionary process to be stationary, reversible, or homogeneous. The GMM can be simplified by assuming the evolutionary process to be stationary. A stationary GMM is appropriate for analyses of phylogenetic data sets that are compositionally homogeneous; a data set is considered to be compositionally homogeneous if a statistical test does not detect significant differences in the marginal distributions of the sequences. Though the general time-reversible (GTR) model assumes stationarity, it also assumes reversibility and homogeneity. We propose two new stationary and nonhomogeneous models--one constrains the GMM to be reversible, whereas the other does not. The two models, coupled with the GTR model, comprise a set of nested models that can be used to test the assumptions of reversibility and homogeneity for stationary processes. The two models are extended to incorporate invariable sites and used to analyze a seven-taxon hominoid data set that displays compositional homogeneity. We show that within the class of stationary models, a nonhomogeneous model fits the hominoid data better than the GTR model. We note that if one considers a wider set of models that are not constrained to be stationary, then an even better fit can be obtained for the hominoid data. However, the methods for reducing model complexity from an extremely large set of nonstationary models are yet to be developed.
Affiliation(s)
- Vivek Jayaswal
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW 2006, Australia
|
19
|
Jermiin LS, Ho JWK, Lau KW, Jayaswal V. SeqVis: a tool for detecting compositional heterogeneity among aligned nucleotide sequences. Methods Mol Biol 2009; 537:65-91. [PMID: 19378140 DOI: 10.1007/978-1-59745-251-9_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 02/19/2023]
Abstract
Compositional heterogeneity is a poorly appreciated attribute of aligned nucleotide and amino acid sequences. It is a common property of molecular phylogenetic data, and it has been found to occur across sequences and/or across sites. Most molecular phylogenetic methods assume that the sequences have evolved under globally stationary, reversible, and homogeneous conditions, implying that the sequences should be compositionally homogeneous. The presence of the above-mentioned compositional heterogeneity implies that the sequences must have evolved under more general conditions than is commonly assumed. Consequently, there is a need for reliable methods to detect under what conditions alignments of nucleotides or amino acids may have evolved. In this chapter, we describe one such program. SeqVis is designed to survey aligned nucleotide sequences. We discuss pros-et-cons of this program in the context of other methods to detect compositional heterogeneity and violated phylogenetic assumptions. The benefits provided by SeqVis are demonstrated in two studies of alignments of nucleotides, one of which contained 7542 nucleotides from 53 species.
Affiliation(s)
- Lars Sommer Jermiin
- School of Biological Sciences, Centre for Mathematical Biology and Sydney Bioinformatics, University of Sydney, Sydney, Australia
|
20
|
Pitman Medal. Aust N Z J Stat 2009. [DOI: 10.1111/j.1467-842x.2009.00539.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
21
|
Oscamou M, McDonald D, Yap VB, Huttley GA, Lladser ME, Knight R. Comparison of methods for estimating the nucleotide substitution matrix. BMC Bioinformatics 2008; 9:511. [PMID: 19046431 PMCID: PMC2655096 DOI: 10.1186/1471-2105-9-511] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/01/2008] [Accepted: 12/01/2008] [Indexed: 11/23/2022]
Abstract
BACKGROUND The nucleotide substitution rate matrix is a key parameter of molecular evolution. Several methods for inferring this parameter have been proposed, with different mathematical bases. These methods include counting sequence differences and taking the log of the resulting probability matrices, methods based on Markov triples, and maximum likelihood methods that infer the substitution probabilities that lead to the most likely model of evolution. However, the speed and accuracy of these methods has not been compared. RESULTS Different methods differ in performance by orders of magnitude (ranging from 1 ms to 10 s per matrix), but differences in accuracy of rate matrix reconstruction appear to be relatively small. Encouragingly, relatively simple and fast methods can provide results at least as accurate as far more complex and computationally intensive methods, especially when the sequences to be compared are relatively short. CONCLUSION Based on the conditions tested, we recommend the use of method of Gojobori et al. (1982) for long sequences (> 600 nucleotides), and the method of Goldman et al. (1996) for shorter sequences (< 600 nucleotides). The method of Barry and Hartigan (1987) can provide somewhat more accuracy, measured as the Euclidean distance between the true and inferred matrices, on long sequences (> 2000 nucleotides) at the expense of substantially longer computation time. The availability of methods that are both fast and accurate will allow us to gain a global picture of change in the nucleotide substitution rate matrix on a genomewide scale across the tree of life.
Affiliation(s)
- Maribeth Oscamou
- Department of Applied Mathematics, University of Colorado, Boulder, CO, USA
- Daniel McDonald
- Department of Computer Science, University of Colorado, Boulder, CO, USA
- Von Bing Yap
- Department of Statistics and Applied Probability, National University of Singapore, 21 Lower Kent Ridge Road 119077, Singapore
- Gavin A Huttley
- John Curtin School of Medical Research, Australian National University, Canberra, Australia
- Manuel E Lladser
- Department of Applied Mathematics, University of Colorado, Boulder, CO, USA
- Rob Knight
- Department of Chemistry & Biochemistry, University of Colorado, Boulder, CO, USA
|
22
|
Beiko RG, Doolittle WF, Charlebois RL. The Impact of Reticulate Evolution on Genome Phylogeny. Syst Biol 2008; 57:844-56. [DOI: 10.1080/10635150802559265] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.2] [Indexed: 10/21/2022]
Affiliation(s)
- Robert G. Beiko
- Faculty of Computer Science, Dalhousie University, and Institute for Molecular Bioscience/ARC Centre for Bioinformatics, Brisbane, Australia
- W. Ford Doolittle
- Genome Atlantic, Department of Biochemistry & Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
- Robert L. Charlebois
- Genome Atlantic, Department of Biochemistry & Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
|
23
|
Hyman IT, Ho SY, Jermiin LS. Molecular phylogeny of Australian Helicarionidae, Euconulidae and related groups (Gastropoda: Pulmonata: Stylommatophora) based on mitochondrial DNA. Mol Phylogenet Evol 2007; 45:792-812. [DOI: 10.1016/j.ympev.2007.08.018] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 10/12/2006] [Revised: 07/27/2007] [Accepted: 08/07/2007] [Indexed: 10/22/2022]
|
24
|
Allman ES, Rhodes JA. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math Biosci 2007; 211:18-33. [PMID: 17964612 DOI: 10.1016/j.mbs.2007.09.001] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Received: 02/24/2007] [Revised: 07/14/2007] [Accepted: 09/13/2007] [Indexed: 11/25/2022]
Abstract
The general Markov plus invariable sites (GM+I) model of biological sequence evolution is a two-class model in which an unknown proportion of sites are not allowed to change, while the remainder undergo substitutions according to a Markov process on a tree. For statistical use it is important to know if the model is identifiable; can both the tree topology and the numerical parameters be determined from a joint distribution describing sequences only at the leaves of the tree? We establish that for generic parameters both the tree and all numerical parameter values can be recovered, up to clearly understood issues of 'label swapping'. The method of analysis is algebraic, using phylogenetic invariants to study the variety defined by the model. Simple rational formulas, expressed in terms of determinantal ratios, are found for recovering numerical parameters describing the invariable sites.
Affiliation(s)
- Elizabeth S Allman
- Department of Mathematics and Statistics, University of Alaska Fairbanks, Fairbanks, AK 99775, USA.
|