1
Hassler GW, Magee A, Zhang Z, Baele G, Lemey P, Ji X, Fourment M, Suchard MA. Data integration in Bayesian phylogenetics. Annual Review of Statistics and Its Application 2022; 10:353-377. [PMID: 38774036 PMCID: PMC11108065 DOI: 10.1146/annurev-statistics-033021-112532] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/24/2024]
Abstract
Researchers studying the evolution of viral pathogens and other organisms increasingly encounter and use large and complex data sets from multiple different sources. Statistical research in Bayesian phylogenetics has risen to this challenge. Researchers use phylogenetics not only to reconstruct the evolutionary history of a group of organisms, but also to understand the processes that guide its evolution and spread through space and time. To this end, it is now the norm to integrate numerous sources of data. For example, epidemiologists studying the spread of a virus through a region incorporate data including genetic sequences (e.g. DNA), time, location (both continuous and discrete) and environmental covariates (e.g. social connectivity between regions) into a coherent statistical model. Evolutionary biologists routinely do the same with genetic sequences, location, time, fossil and modern phenotypes, and ecological covariates. These complex, hierarchical models readily accommodate both discrete and continuous data and have enormous combined discrete/continuous parameter spaces including, at a minimum, phylogenetic tree topologies and branch lengths. The increased size and complexity of these statistical models have spurred advances in computational methods to make them tractable. We discuss both the modeling and computational advances below, as well as unsolved problems and areas of active research.
Affiliation(s)
- Gabriel W Hassler
  - Department of Computational Medicine, University of California, Los Angeles, USA, 90095
- Andrew Magee
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
- Zhenyu Zhang
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
- Guy Baele
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
- Philippe Lemey
  - Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven, Belgium, 3000
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, USA, 70118
- Mathieu Fourment
  - Australian Institute for Microbiology and Infection, University of Technology Sydney, Ultimo NSW, Australia, 2007
- Marc A Suchard
  - Department of Computational Medicine, University of California, Los Angeles, USA, 90095
  - Department of Biostatistics, University of California, Los Angeles, USA, 90095
  - Department of Human Genetics, University of California, Los Angeles, USA, 90095
2
On equal-input and monotone Markov matrices. Adv Appl Probab 2022. [DOI: 10.1017/apr.2021.39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
The practically important classes of equal-input and of monotone Markov matrices are revisited, with special focus on embeddability, infinite divisibility, and mutual relations. Several uniqueness results for the classic Markov embedding problem are obtained in the process. To achieve our results, we need to employ various algebraic and geometric tools, including commutativity, permutation invariance, and convexity. Of particular relevance in several demarcation results are Markov matrices that are idempotents.
3
Ardiyansyah M, Kosta D, Kubjas K. The model-specific Markov embedding problem for symmetric group-based models. J Math Biol 2021; 83:33. [PMID: 34499233 PMCID: PMC8429190 DOI: 10.1007/s00285-021-01656-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2020] [Revised: 08/04/2021] [Accepted: 08/17/2021] [Indexed: 11/26/2022]
Abstract
We study model embeddability, which is a variation of the famous embedding problem in probability theory, when apart from the requirement that the Markov matrix is the matrix exponential of a rate matrix, we additionally ask that the rate matrix follows the model structure. We provide a characterisation of model embeddable Markov matrices corresponding to symmetric group-based phylogenetic models. In particular, we provide necessary and sufficient conditions in terms of the eigenvalues of symmetric group-based matrices. To showcase our main result on model embeddability, we provide an application to hachimoji models, which are eight-state models for synthetic DNA. Moreover, our main result on model embeddability enables us to compute the volume of the set of model embeddable Markov matrices relative to the volume of other relevant sets of Markov matrices within the model.
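As a small illustration of what "the Markov matrix is the matrix exponential of a rate matrix" means, the sketch below works through a two-state case rather than the symmetric group-based models of the paper. The closed-form generator in `rate_matrix_2x2`, the helper names, and the truncated Taylor series used for the exponential are illustrative assumptions, not taken from the article.

```python
import math


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, terms=60):
    """Approximate exp(q) with the truncated Taylor series sum_k q^k / k!."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, q)
        term = [[v / k for v in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def rate_matrix_2x2(a, b):
    """Candidate generator for M = [[1-a, a], [b, 1-b]] (assumes 0 < a + b < 1)."""
    s = a + b
    if not 0.0 < s < 1.0:
        raise ValueError("this sketch only handles 0 < a + b < 1")
    scale = -math.log(1.0 - s) / s
    return [[-a * scale, a * scale], [b * scale, -b * scale]]


a, b = 0.2, 0.3
M = [[1 - a, a], [b, 1 - b]]
Q = rate_matrix_2x2(a, b)
M_back = mat_exp(Q)
# Largest entrywise difference between M and exp(Q).
max_err = max(abs(M[i][j] - M_back[i][j]) for i in range(2) for j in range(2))
```

Here `Q` has zero row sums and non-negative off-diagonal entries, so it is a valid rate matrix, and `max_err` measures how closely its exponential reproduces `M`. The model-specific question studied in the paper adds the further requirement that `Q` respect the structure of the chosen substitution model.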
Affiliation(s)
- Muhammad Ardiyansyah
  - Department of Mathematics and Systems Analysis, Aalto University, Espoo, Finland
- Dimitra Kosta
  - School of Mathematics, University of Edinburgh, Edinburgh, UK
- Kaie Kubjas
  - Department of Mathematics and Systems Analysis, Aalto University, Espoo, Finland
4
Abstract
In 1981, the Journal of Molecular Evolution (JME) published an article entitled "Evolutionary trees from DNA sequences: A maximum likelihood approach" by Joseph (Joe) Felsenstein (J Mol Evol 17:368-376, 1981). This groundbreaking work laid the foundation for the emerging field of statistical phylogenetics, providing a tractable way of finding maximum likelihood (ML) estimates of evolutionary trees from DNA sequence data. This paper is the second most cited (more than 9000 citations) in JME after Kimura's (J Mol Evol 16:111-120, 1980) seminal paper on a model of nucleotide substitution (with nearly 20,000 citations). On the occasion of the 50th anniversary of JME, we elaborate on the significance of Felsenstein's ML approach to estimating phylogenetic trees.
Affiliation(s)
- David Posada
  - CINBIO, Universidade de Vigo, 36310, Vigo, Spain.
  - Department of Biochemistry, Genetics, and Immunology, Universidade de Vigo, 36310, Vigo, Spain.
  - Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain.
- Keith A Crandall
  - Computational Biology Institute and Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA.
  - Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, The George Washington University, Washington, DC, 20052, USA.
5
Hannaford NE, Heaps SE, Nye TMW, Williams TA, Embley TM. Incorporating compositional heterogeneity into Lie Markov models for phylogenetic inference. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
6
Phylosymmetric Algebras: Mathematical Properties of a New Tool in Phylogenetics. Bull Math Biol 2020; 82:151. [PMID: 33221986 PMCID: PMC7680336 DOI: 10.1007/s11538-020-00832-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2020] [Accepted: 11/02/2020] [Indexed: 11/04/2022]
Abstract
In phylogenetics, it is of interest for rate matrix sets to satisfy closure under matrix multiplication as this makes finding the set of corresponding transition matrices possible without having to compute matrix exponentials. It is also advantageous to have a small number of free parameters as this, in applications, will result in a reduction in computation time. We explore a method of building a rate matrix set from a rooted tree structure by assigning rates to internal tree nodes and states to the leaves, then defining the rate of change between two states as the rate assigned to the most recent common ancestor of those two states. We investigate the properties of these matrix sets from both a linear algebra and a graph theory perspective and show that any rate matrix set generated this way is closed under matrix multiplication. The consequences of setting two rates assigned to internal tree nodes to be equal are then considered. This methodology could be used to develop parameterised models of amino acid substitution which have a small number of parameters but convey biological meaning.
7
Shore JA, Sumner JG, Holland BR. The impracticalities of multiplicatively-closed codon models: a retreat to linear alternatives. J Math Biol 2020; 81:549-573. [PMID: 32710155 DOI: 10.1007/s00285-020-01519-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2019] [Revised: 06/09/2020] [Indexed: 10/23/2022]
Abstract
A matrix Lie algebra is a linear space of matrices closed under the operation $[A, B] = AB - BA$. The "Lie closure" of a set of matrices is the smallest matrix Lie algebra which contains the set. In the context of Markov chain theory, if a set of rate matrices form a Lie algebra, their corresponding Markov matrices are closed under matrix multiplication; this has been found to be a useful property in phylogenetics. Inspired by previous research involving Lie closures of DNA models, it was hypothesised that finding the Lie closure of a codon model could help to solve the problem of mis-estimation of the non-synonymous/synonymous rate ratio, $\omega$. We propose two different methods of finding a linear space from a model: the first is the linear closure, which is the smallest linear space which contains the model, and the second is the linear version, which changes multiplicative constraints in the model to additive ones. We then find the Lie closure of each of these linear spaces. Under both methods, it was found that closed codon models would require thousands of parameters, and that any partial solution to this problem that was of a reasonable size violated stochasticity. Investigation of toy models indicated that finding the Lie closure of matrix linear spaces which deviated only slightly from a simple model resulted in a Lie closure that was close to having the maximum number of parameters possible. Given that Lie closures are not practical, we propose further consideration of the two variants of linearly closed models.
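To make the closure condition concrete, the following sketch computes the commutator XY - YX for the two elementary generators of the general two-state rate-matrix model and checks that the result stays in their linear span. The two-state example, the row-sums-to-zero convention, and the helper names are assumptions chosen for illustration; the codon models discussed in the abstract are far larger.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def commutator(x, y):
    """Return the Lie bracket xy - yx."""
    xy, yx = mat_mul(x, y), mat_mul(y, x)
    n = len(x)
    return [[xy[i][j] - yx[i][j] for j in range(n)] for i in range(n)]


# Elementary generators (rows sum to zero): L1 moves state 0 -> 1,
# L2 moves state 1 -> 0. Every 2-state rate matrix is a*L1 + b*L2.
L1 = [[-1, 1], [0, 0]]
L2 = [[0, 0], [1, -1]]

bracket = commutator(L1, L2)
# The bracket equals L2 - L1, a linear combination of the generators,
# so the span of {L1, L2} is closed under the bracket.
expected = [[L2[i][j] - L1[i][j] for j in range(2)] for i in range(2)]
```

Because `bracket` equals `expected`, the span of `L1` and `L2` is a two-dimensional matrix Lie algebra. For a model whose bracket leaves the span, the new matrix would have to be added and the check repeated, which is how a Lie closure can grow well beyond the original parameter count.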
8
Naser-Khdour S, Minh BQ, Zhang W, Stone EA, Lanfear R. The Prevalence and Impact of Model Violations in Phylogenetic Analysis. Genome Biol Evol 2019; 11:3341-3352. [PMID: 31536115 PMCID: PMC6893154 DOI: 10.1093/gbe/evz193] [Citation(s) in RCA: 59] [Impact Index Per Article: 11.8] [Accepted: 09/03/2019] [Indexed: 12/24/2022]
Abstract
In phylogenetic inference, we commonly use models of substitution which assume that sequence evolution is stationary, reversible, and homogeneous (SRH). Although the use of such models is often criticized, the extent of SRH violations and their effects on phylogenetic inference of tree topologies and edge lengths are not well understood. Here, we introduce and apply the maximal matched-pairs tests of homogeneity to assess the scale and impact of SRH model violations on 3,572 partitions from 35 published phylogenetic data sets. We show that roughly one-quarter of all the partitions we analyzed (23.5%) reject the SRH assumptions, and that for 25% of data sets, tree topologies inferred from all partitions differ significantly from topologies inferred using the subset of partitions that do not reject the SRH assumptions. This proportion increases when comparing trees inferred using the subset of partitions that rejects the SRH assumptions, to those inferred from partitions that do not reject the SRH assumptions. These results suggest that the extent and effects of model violation in phylogenetics may be substantial. They highlight the importance of testing for model violations and possibly excluding partitions that violate models prior to tree reconstruction. Our results also suggest that further effort in developing models that do not require SRH assumptions could lead to large improvements in the accuracy of phylogenomic inference. The scripts necessary to perform the analysis are available in https://github.com/roblanf/SRHtests, and the new tests we describe are available as a new option in IQ-TREE (http://www.iqtree.org).
Affiliation(s)
- Suha Naser-Khdour
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Bui Quang Minh
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
  - Research School of Computer Science, Australian National University, Canberra, Australian Capital Territory, Australia
- Wenqi Zhang
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Eric A Stone
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
- Robert Lanfear
  - Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
9
Lie-Markov Models Derived from Finite Semigroups. Bull Math Biol 2018; 81:361-383. [PMID: 30073568 DOI: 10.1007/s11538-018-0455-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/12/2017] [Accepted: 06/08/2018] [Indexed: 10/28/2022]
Abstract
We present and explore a general method for deriving a Lie-Markov model from a finite semigroup. If the degree of the semigroup is k, the resulting model is a continuous-time Markov chain on k-states and, as a consequence of the product rule in the semigroup, satisfies the property of multiplicative closure. This means that the product of any two probability substitution matrices taken from the model produces another substitution matrix also in the model. We show that our construction is a natural generalization of the concept of group-based models.
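Multiplicative closure can be checked numerically on a familiar group-based model. The sketch below uses the Kimura 3ST model, whose substitution matrices are convex combinations of the four permutation matrices of the Klein four-group, as a stand-in example; the function names and the parameter values are illustrative assumptions and do not reproduce the semigroup construction of the paper.

```python
def perm_matrix(perm):
    """Permutation matrix sending state i to state perm[i]."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# The four elements of the Klein four-group acting on four states.
IDENT = perm_matrix([0, 1, 2, 3])
P1 = perm_matrix([1, 0, 3, 2])
P2 = perm_matrix([2, 3, 0, 1])
P3 = perm_matrix([3, 2, 1, 0])


def k3st(p1, p2, p3):
    """Kimura 3ST substitution matrix with off-diagonal probabilities p1, p2, p3."""
    p0 = 1.0 - p1 - p2 - p3
    return [[p0 * IDENT[i][j] + p1 * P1[i][j] + p2 * P2[i][j] + p3 * P3[i][j]
             for j in range(4)] for i in range(4)]


def has_k3st_pattern(m, tol=1e-12):
    """Check that m equals the K3ST matrix rebuilt from its own first row."""
    rebuilt = k3st(m[0][1], m[0][2], m[0][3])
    return all(abs(m[i][j] - rebuilt[i][j]) < tol
               for i in range(4) for j in range(4))


A = k3st(0.10, 0.05, 0.02)
B = k3st(0.20, 0.10, 0.03)
product = mat_mul(A, B)
closed = has_k3st_pattern(product)
```

The product stays in the model because the permutation matrices themselves multiply according to the group law, so the product of two combinations is again a combination of the same four matrices. The semigroup construction described in the abstract relies on the same mechanism with a product rule that need not be invertible.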
10
Sundberg H, Kruys Å, Bergsten J, Ekman S. Position specificity in the genus Coreomyces (Laboulbeniomycetes, Ascomycota). Fungal Syst Evol 2018; 1:217-228. [PMID: 32490367 PMCID: PMC7259236 DOI: 10.3114/fuse.2018.01.09] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/07/2022]
Abstract
To study position specificity in the insect-parasitic fungal genus Coreomyces (Laboulbeniaceae, Laboulbeniales), we sampled corixid hosts (Corixidae, Heteroptera) in southern Scandinavia. We detected Coreomyces thalli in five different positions on the hosts. Thalli from the various positions grouped in four distinct clusters in the resulting gene trees, distinctly so in the ITS and LSU of the nuclear ribosomal DNA, less so in the SSU of the nuclear ribosomal DNA and the mitochondrial ribosomal DNA. Thalli from the left side of abdomen grouped in a single cluster, and so did thalli from the ventral right side. Thalli in the mid-ventral position turned out to be a mix of three clades, while thalli growing dorsally grouped with thalli from the left and right abdominal clades. The mid-ventral and dorsal positions were found in male hosts only. The position on the left hemelytron was shared by members from two sister clades. Statistical analyses demonstrate a significant positive correlation between clade and position on the host, but also a weak correlation between host sex and clade membership. These results indicate that sex-of-host specificity may be a non-existent extreme in a continuum, where instead weak preferences for one host sex may turn out to be frequent.
Affiliation(s)
- H Sundberg
  - Systematic Biology, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
- Å Kruys
  - Museum of Evolution, Uppsala University, Uppsala, Sweden
- J Bergsten
  - Department of Zoology, Swedish Museum of Natural History, Stockholm, Sweden
- S Ekman
  - Museum of Evolution, Uppsala University, Uppsala, Sweden
11
Embeddability of Kimura 3ST Markov matrices. J Theor Biol 2018; 445:128-135. [PMID: 29462627 DOI: 10.1016/j.jtbi.2018.02.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/05/2017] [Revised: 01/24/2018] [Accepted: 02/05/2018] [Indexed: 01/05/2023]
Abstract
In this note, we characterize the embeddability of generic Kimura 3ST Markov matrices in terms of their eigenvalues. As a consequence, we are able to compute the volume of such matrices relative to the volume of all Markov matrices within the model. We also provide examples showing that, in general, mutation rates are not identifiable from substitution probabilities. These examples also illustrate that symmetries between mutation probabilities do not necessarily arise from symmetries between the corresponding mutation rates.
12
Sumner JG, Taylor A, Holland BR, Jarvis PD. Developing a statistically powerful measure for quartet tree inference using phylogenetic identities and Markov invariants. J Math Biol 2017; 75:1619-1654. [PMID: 28434023 DOI: 10.1007/s00285-017-1129-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 08/21/2016] [Revised: 04/06/2017] [Indexed: 11/24/2022]
Abstract
Recently there has been renewed interest in phylogenetic inference methods based on phylogenetic invariants, alongside the related Markov invariants. Broadly speaking, both these approaches give rise to polynomial functions of sequence site patterns that, in expectation value, either vanish for particular evolutionary trees (in the case of phylogenetic invariants) or have well understood transformation properties (in the case of Markov invariants). While both approaches have been valued for their intrinsic mathematical interest, it is not clear how they relate to each other, and to what extent they can be used as practical tools for inference of phylogenetic trees. In this paper, by focusing on the special case of binary sequence data and quartets of taxa, we are able to view these two different polynomial-based approaches within a common framework. To motivate the discussion, we present three desirable statistical properties that we argue any invariant-based phylogenetic method should satisfy: (1) sensible behaviour under reordering of input sequences; (2) stability as the taxa evolve independently according to a Markov process; and (3) explicit dependence on the assumption of a continuous-time process. Motivated by these statistical properties, we develop and explore several new phylogenetic inference methods. In particular, we develop a statistically bias-corrected version of the Markov invariants approach which satisfies all three properties. We also extend previous work by showing that the phylogenetic invariants can be implemented in such a way as to satisfy property (3). A simulation study shows that, in comparison to other methods, our new proposed approach based on bias-corrected Markov invariants is extremely powerful for phylogenetic inference. The binary case is of particular theoretical interest as, in this case only, the Markov invariants can be expressed as linear combinations of the phylogenetic invariants. A wider implication of this is that, for models with more than two states (for example, DNA sequence alignments with four-state models), we find that methods which rely on phylogenetic invariants are incapable of satisfying all three of the stated statistical properties. This is because in these cases the relevant Markov invariants belong to a class of polynomials independent from the phylogenetic invariants.
Affiliation(s)
- Jeremy G Sumner
  - School of Physical Sciences, University of Tasmania, Hobart, Australia.
- Barbara R Holland
  - School of Physical Sciences, University of Tasmania, Hobart, Australia
- Peter D Jarvis
  - School of Physical Sciences, University of Tasmania, Hobart, Australia
13
Dimensional Reduction for the General Markov Model on Phylogenetic Trees. Bull Math Biol 2017; 79:619-634. [PMID: 28188429 DOI: 10.1007/s11538-017-0249-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/11/2016] [Accepted: 01/19/2017] [Indexed: 10/20/2022]
Abstract
We present a method of dimensional reduction for the general Markov model of sequence evolution on a phylogenetic tree. We show that taking certain linear combinations of the associated random variables (site pattern counts) reduces the dimensionality of the model from exponential in the number of extant taxa, to quadratic in the number of taxa, while retaining the ability to statistically identify phylogenetic divergence events. A key feature is the identification of an invariant subspace which depends only bilinearly on the model parameters, in contrast to the usual multi-linear dependence in the full space. We discuss potential applications including the computation of split (edge) weights on phylogenetic trees from observed sequence data.
14
Tugume AK, Mukasa SB, Valkonen JPT. Mixed Infections of Four Viruses, the Incidence and Phylogenetic Relationships of Sweet Potato Chlorotic Fleck Virus (Betaflexiviridae) Isolates in Wild Species and Sweetpotatoes in Uganda and Evidence of Distinct Isolates in East Africa. PLoS One 2016; 11:e0167769. [PMID: 28005969 PMCID: PMC5179071 DOI: 10.1371/journal.pone.0167769] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 08/13/2016] [Accepted: 11/18/2016] [Indexed: 01/05/2023]
Abstract
Viruses infecting wild flora may have a significant negative impact on nearby crops, and vice-versa. Only limited information is available on wild species able to host economically important viruses that infect sweetpotatoes (Ipomoea batatas). In this study, Sweet potato chlorotic fleck virus (SPCFV; Carlavirus, Betaflexiviridae) and Sweet potato chlorotic stunt virus (SPCSV; Crinivirus, Closteroviridae) were surveyed in wild plants of family Convolvulaceae (genera Astripomoea, Ipomoea, Hewittia and Lepistemon) in Uganda. Plants belonging to 26 wild species, including annuals, biannuals and perennials from four agro-ecological zones, were observed for virus-like symptoms in 2004 and 2007 and sampled for virus testing. SPCFV was detected in 84 (2.9%) of 2864 plants tested from 17 species. SPCSV was detected in 66 (5.4%) of the 1224 plants from 12 species sampled in 2007. Some SPCSV-infected plants were also infected with Sweet potato feathery mottle virus (SPFMV; Potyvirus, Potyviridae; 1.3%), Sweet potato mild mottle virus (SPMMV; Ipomovirus, Potyviridae; 0.5%) or both (0.4%), but none of these three viruses were detected in SPCFV-infected plants. Co-infection of SPFMV with SPMMV was detected in 1.2% of plants sampled. Virus-like symptoms were observed in 367 wild plants (12.8%), of which 42 plants (11.4%) were negative for the viruses tested. Almost all (92.4%) the 419 sweetpotato plants sampled from fields close to the tested wild plants displayed virus-like symptoms, and 87.1% were infected with one or more of the four viruses. Phylogenetic and evolutionary analyses of the 3'-proximal genomic region of SPCFV, including the silencing suppressor (NaBP)- and coat protein (CP)-coding regions implicated strong purifying selection on the CP and NaBP, and that the SPCFV strains from East Africa are distinguishable from those from other continents. 
However, the strains from wild species and sweetpotato were indistinguishable, suggesting reciprocal movement of SPCFV between wild and cultivated Convolvulaceae plants in the field.
Affiliation(s)
- Arthur K. Tugume
  - Department of Agricultural Sciences, Faculty of Agriculture and Forestry, University of Helsinki, Helsinki, Finland
  - Department of Plant Sciences, Microbiology and Biotechnology, School of Biosciences, College of Natural Sciences, Makerere University, Kampala, Uganda
- Settumba B. Mukasa
  - Department of Agricultural Production, School of Agricultural Sciences, College of Agricultural and Environmental Sciences, Makerere University, Kampala, Uganda
- Jari P. T. Valkonen
  - Department of Agricultural Sciences, Faculty of Agriculture and Forestry, University of Helsinki, Helsinki, Finland
15
Tseng SH, Chen HY, Wang RC. Lie algebra solution of Lie-Markov model: Application of freshness K-value. Journal of Statistics & Management Systems 2016. [DOI: 10.1080/09720510.2016.1224470] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
Affiliation(s)
- Shih-Hsien Tseng
  - 200 Chung Pei Road, Chung Li District., Taoyuan City, Taiwan 32023, R.O.C. Chung Yuan Christian University, Department of Business Administration
- Hsiao-Yin Chen
  - No.1 Kainan Rd, Shinshing Tsuen, Luchu Shiang, Taoyuan, Taiwan, R.O.C. Kainan University, Department of Business and Entrepreneurial Management
- Ruei-Ci Wang
  - Gottåkrav. 28, 236 41 Höllviken, Sweden, Minastir Asset Management AB, Quantitative Developer
16
House T. Lie Algebra Solution of Population Models Based on Time-Inhomogeneous Markov Chains. J Appl Probab 2016. [DOI: 10.1239/jap/1339878799] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022]
Abstract
Many natural populations are well modelled through time-inhomogeneous stochastic processes. Such processes have been analysed in the physical sciences using a method based on Lie algebras, but this methodology is not widely used for models with ecological, medical, and social applications. In this paper we present the Lie algebraic method, and apply it to three biologically well-motivated examples. The result of this is a solution form that is often highly computationally advantageous.
18
Jarvis PD, Sumner JG. Matrix group structure and Markov invariants in the strand symmetric phylogenetic substitution model. J Math Biol 2015; 73:259-82. [PMID: 26660305 DOI: 10.1007/s00285-015-0951-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 10/30/2014] [Revised: 08/25/2015] [Indexed: 11/29/2022]
Abstract
We consider the continuous-time presentation of the strand symmetric phylogenetic substitution model (in which rate parameters are unchanged under nucleotide permutations given by Watson-Crick base conjugation). Algebraic analysis of the model's underlying structure as a matrix group leads to a change of basis where the rate generator matrix is given by a two-part block decomposition. We apply representation theoretic techniques and, for any (fixed) number of phylogenetic taxa L and polynomial degree D of interest, provide the means to classify and enumerate the associated Markov invariants. In particular, in the quadratic and cubic cases we prove there are precisely [Formula: see text] and [Formula: see text] linearly independent Markov invariants, respectively. Additionally, we give the explicit polynomial forms of the Markov invariants for (i) the quadratic case with any number of taxa L, and (ii) the cubic case in the special case of a three-taxon phylogenetic tree. We close by showing our results are of practical interest since the quadratic Markov invariants provide independent estimates of phylogenetic distances based on (i) substitution rates within Watson-Crick conjugate pairs, and (ii) substitution rates across conjugate base pairs.
Affiliation(s)
- Peter D Jarvis
  - School of Physical Sciences, University of Tasmania, Private Bag 37, GPO, Hobart, TAS, 7001, Australia
- Jeremy G Sumner
  - School of Physical Sciences, University of Tasmania, Private Bag 37, GPO, Hobart, TAS, 7001, Australia.
19
Arenas M. Trends in substitution models of molecular evolution. Front Genet 2015; 6:319. [PMID: 26579193 PMCID: PMC4620419 DOI: 10.3389/fgene.2015.00319] [Citation(s) in RCA: 79] [Impact Index Per Article: 8.8] [Received: 07/27/2015] [Accepted: 10/09/2015] [Indexed: 11/13/2022]
Abstract
Substitution models of evolution describe the process of genetic variation through fixed mutations and constitute the basis of the evolutionary analysis at the molecular level. Almost 40 years after the development of first substitution models, highly sophisticated, and data-specific substitution models continue emerging with the aim of better mimicking real evolutionary processes. Here I describe current trends in substitution models of DNA, codon and amino acid sequence evolution, including advantages and pitfalls of the most popular models. The perspective concludes that despite the large number of currently available substitution models, further research is required for more realistic modeling, especially for DNA coding and amino acid data. Additionally, the development of more accurate complex models should be coupled with new implementations and improvements of methods and frameworks for substitution model selection and downstream evolutionary analysis.
Affiliation(s)
- Miguel Arenas
  - Institute of Molecular Pathology and Immunology of the University of Porto Porto, Portugal
20
Woodhams MD, Fernández-Sánchez J, Sumner JG. A New Hierarchy of Phylogenetic Models Consistent with Heterogeneous Substitution Rates. Syst Biol 2015; 64:638-50. [PMID: 25858352 PMCID: PMC4468350 DOI: 10.1093/sysbio/syv021] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 12/04/2014] [Accepted: 03/30/2015] [Indexed: 11/14/2022]
Abstract
When the process underlying DNA substitutions varies across evolutionary history, some standard Markov models underlying phylogenetic methods are mathematically inconsistent. The most prominent example is the general time-reversible model (GTR) together with some, but not all, of its submodels. To rectify this deficiency, nonhomogeneous Lie Markov models have been identified as the class of models that are consistent in the face of a changing process of DNA substitutions regardless of taxon sampling. Some well-known models in popular use are within this class, but are either overly simplistic (e.g., the Kimura two-parameter model) or overly complex (the general Markov model). On a diverse set of biological data sets, we test a hierarchy of Lie Markov models spanning the full range of parameter richness. Compared against the benchmark of the ever-popular GTR model, we find that as a whole the Lie Markov models perform well, with the best performing models having 8–10 parameters and the ability to recognize the distinction between purines and pyrimidines.
Affiliation(s)
- Michael D Woodhams
- School of Physical Sciences, University of Tasmania, Hobart, TAS 7005, Australia and Departament de Matemàtica Aplicada I, Universitat Politècnica de Catalunya, Barcelona, Spain
- Jesús Fernández-Sánchez
- School of Physical Sciences, University of Tasmania, Hobart, TAS 7005, Australia and Departament de Matemàtica Aplicada I, Universitat Politècnica de Catalunya, Barcelona, Spain
- Jeremy G Sumner
- School of Physical Sciences, University of Tasmania, Hobart, TAS 7005, Australia and Departament de Matemàtica Aplicada I, Universitat Politècnica de Catalunya, Barcelona, Spain
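
The closure problem that the abstract above attributes to GTR can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions, not the authors' procedure. It uses a hypothetical 3-state chain instead of four nucleotides so that Kolmogorov's cycle criterion reduces to a single equation, and the function names and parameter values are invented for the example. It builds two time-reversible rate matrices in the GTR style (symmetric exchangeabilities times stationary frequencies) and shows that their sum is no longer time-reversible. Because a Lie Markov model must at least be closed under addition of its rate matrices, this failure is one simple way to see why a reversible family with free frequencies cannot be such a model.

```python
from fractions import Fraction as F

def reversible_rates(s, pi):
    """Off-diagonal rates q[i][j] = s[i][j] * pi[j] for a symmetric s (GTR-style).

    The diagonal is left at zero because the cycle criterion below
    only involves off-diagonal rates.
    """
    n = len(pi)
    return [[s[i][j] * pi[j] if i != j else F(0) for j in range(n)]
            for i in range(n)]

def cycle_gap(q):
    """Kolmogorov cycle criterion for a 3-state chain with positive rates.

    The gap is zero if and only if the chain is time-reversible.
    """
    return q[0][1] * q[1][2] * q[2][0] - q[0][2] * q[2][1] * q[1][0]

def add(qa, qb):
    """Entrywise sum of two rate matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(qa, qb)]

# Hypothetical parameter values, chosen only for illustration.
s1 = [[F(0), F(1), F(1)], [F(1), F(0), F(1)], [F(1), F(1), F(0)]]
pi1 = [F(1, 2), F(1, 4), F(1, 4)]
s2 = [[F(0), F(2), F(1)], [F(2), F(0), F(1)], [F(1), F(1), F(0)]]
pi2 = [F(1, 3), F(1, 3), F(1, 3)]

q1 = reversible_rates(s1, pi1)
q2 = reversible_rates(s2, pi2)
q_sum = add(q1, q2)

# Each factor is reversible (gap 0); the sum is not (gap 7/144).
print(cycle_gap(q1), cycle_gap(q2), cycle_gap(q_sum))
```

Exact rational arithmetic is used so that the zero and nonzero gaps are not blurred by floating-point rounding.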
21
Sumner JG, Jarvis PD, Holland BR. A tensorial approach to the inversion of group-based phylogenetic models. BMC Evol Biol 2014; 14:236. [PMID: 25472897 PMCID: PMC4268818 DOI: 10.1186/s12862-014-0236-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/15/2014] [Accepted: 11/06/2014] [Indexed: 11/16/2022] Open
Abstract
Background: Hadamard conjugation is part of the standard mathematical armoury in the analysis of molecular phylogenetic methods. For group-based models, the approach provides a one-to-one correspondence between the so-called “edge length” and “sequence” spectrum on a phylogenetic tree. The Hadamard conjugation has been used in diverse phylogenetic applications not only for inference but also as an important conceptual tool for thinking about molecular data, leading to generalizations beyond strictly tree-like evolutionary modelling. Results: For general group-based models of phylogenetic branching processes, we reformulate the problem of constructing a one-to-one correspondence between pattern probabilities and edge parameters. This takes a classic result previously shown through use of Fourier analysis and presents it in the language of tensors and group representation theory. This derivation makes it clear why the inversion is possible, because, under their usual definition, group-based models are defined for abelian groups only. Conclusion: We provide an inversion of group-based phylogenetic models that can be implemented using matrix multiplication between rectangular matrices indexed by ordered partitions of varying sizes. Our approach provides additional context for the construction of phylogenetic probability distributions on network structures, and highlights the potential limitations of restricting to group-based models in this setting.
Affiliation(s)
- Jeremy G Sumner
- School of Physical Sciences, University of Tasmania, Hobart TAS 7001, Australia.
22
Fernández-Sánchez J, Sumner JG, Jarvis PD, Woodhams MD. Lie Markov models with purine/pyrimidine symmetry. J Math Biol 2014; 70:855-91. [PMID: 24723068 DOI: 10.1007/s00285-014-0773-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/14/2012] [Revised: 02/13/2014] [Indexed: 10/25/2022]
Abstract
Continuous-time Markov chains are a standard tool in phylogenetic inference. If homogeneity is assumed, the chain is formulated by specifying time-independent rates of substitutions between states in the chain. In applications, there are usually extra constraints on the rates, depending on the situation. If a model is formulated in this way, it is possible to generalise it and allow for an inhomogeneous process, with time-dependent rates satisfying the same constraints. It is then useful to require that, under some time restrictions, there exists a homogeneous average of this inhomogeneous process within the same model. This leads to the definition of "Lie Markov models" which, as we will show, are precisely the class of models where such an average exists. These models form Lie algebras and hence concepts from Lie group theory are central to their derivation. In this paper, we concentrate on applications to phylogenetics and nucleotide evolution, and derive the complete hierarchy of Lie Markov models that respect the grouping of nucleotides into purines and pyrimidines, that is, models with purine/pyrimidine symmetry. We also discuss how to handle the subtleties of applying Lie group methods, most naturally defined over the complex field, to the stochastic case of a Markov process, where parameter values are restricted to be real and positive. In particular, we explore the geometric embedding of the cone of stochastic rate matrices within the ambient space of the associated complex Lie algebra.
Affiliation(s)
- Jesús Fernández-Sánchez
- Departament de Matemàtica Aplicada I, Universitat Politècnica de Catalunya, Barcelona, Spain
23
Verbyla KL, Yap VB, Pahwa A, Shao Y, Huttley GA. The embedding problem for Markov models of nucleotide substitution. PLoS One 2013; 8:e69187. [PMID: 23935949 PMCID: PMC3728303 DOI: 10.1371/journal.pone.0069187] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 12/11/2012] [Accepted: 06/10/2013] [Indexed: 11/18/2022] Open
Abstract
Continuous-time Markov processes are often used to model the complex natural phenomenon of sequence evolution. To make the process of sequence evolution tractable, simplifying assumptions are often made about the sequence properties and the underlying process. The validity of one such assumption, time-homogeneity, has never been explored. Violations of this assumption can be found by identifying non-embeddability. A process is non-embeddable if it cannot be embedded in a continuous time-homogeneous Markov process. In this study, non-embeddability was demonstrated to exist when modelling sequence evolution with Markov models. Evidence of non-embeddability was found primarily at the third codon position, possibly resulting from changes in mutation rate over time. Outgroup edges and those with a deeper time depth were found to have an increased probability of the underlying process being non-embeddable. Overall, low levels of non-embeddability were detected when examining individual edges of triads across a diverse set of alignments. Subsequent phylogenetic reconstruction analyses demonstrated that non-embeddability could affect the correct prediction of phylogenies, but at extremely low levels. Despite the existence of non-embeddability, there is minimal evidence of violations of the local time homogeneity assumption and consequently the impact is likely to be minor.
Affiliation(s)
- Klara L. Verbyla
- Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- CSIRO Mathematics, Informatics and Statistics, CSIRO, Canberra, Australian Capital Territory, Australia
- Von Bing Yap
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Anuj Pahwa
- Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Yunli Shao
- Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
- Gavin A. Huttley
- Computational Genomics Laboratory, John Curtin School of Medical Research, The Australian National University, Canberra, Australian Capital Territory, Australia
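
The notion of embeddability used in the abstract above can be made concrete in the smallest possible case. The sketch below is a 2-state toy analogue and not the procedure used in the paper, which concerns 4-state nucleotide matrices where the check is considerably harder. It relies on the classical fact that a 2×2 transition matrix with off-diagonal entries `a` and `b` is embeddable exactly when its determinant `1 - a - b` is positive. The function names `embed_2x2` and `expm_2x2` are hypothetical and introduced only for this example.

```python
import math

def embed_2x2(a, b):
    """Return a rate matrix Q with exp(Q) = [[1-a, a], [b, 1-b]], or None.

    Assumes 0 <= a <= 1 and 0 <= b <= 1. In the 2-state case the matrix is
    embeddable exactly when its determinant 1 - a - b is positive.
    """
    det = 1.0 - a - b
    if det <= 0.0:
        return None                      # no time-homogeneous process fits
    if a + b == 0.0:
        return [[0.0, 0.0], [0.0, 0.0]]  # identity matrix: Q = 0
    c = -math.log(det) / (a + b)         # scaling of the generator
    return [[-c * a, c * a], [c * b, -c * b]]

def expm_2x2(q, terms=40):
    """Matrix exponential by a truncated Taylor series (fine for small rates)."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    power = [[1.0, 0.0], [0.0, 1.0]]     # holds Q^k / k!
    for k in range(1, terms):
        power = [[sum(power[i][m] * q[m][j] for m in range(2)) / k
                  for j in range(2)] for i in range(2)]
        result = [[result[i][j] + power[i][j] for j in range(2)]
                  for i in range(2)]
    return result

q = embed_2x2(0.2, 0.1)       # embeddable: determinant 0.7 > 0
p = expm_2x2(q)               # should reproduce [[0.8, 0.2], [0.1, 0.9]]
not_embeddable = embed_2x2(0.7, 0.6)   # determinant -0.3, so None
```

Recomputing the transition matrix from the recovered generator is a simple sanity check that the two agree.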
24
Lie geometry of 2×2 Markov matrices. J Theor Biol 2013; 327:88-90. [DOI: 10.1016/j.jtbi.2013.01.026] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/21/2012] [Accepted: 01/29/2013] [Indexed: 11/22/2022]
25
Groussin M, Boussau B, Gouy M. A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Syst Biol 2013; 62:523-38. [PMID: 23475623 PMCID: PMC3676677 DOI: 10.1093/sysbio/syt016] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Indexed: 11/13/2022] Open
Abstract
Most models of nucleotide or amino acid substitution used in phylogenetic studies assume that the evolutionary process has been homogeneous across lineages and that composition of nucleotides or amino acids has remained the same throughout the tree. These oversimplified assumptions are refuted by the observation that compositional variability characterizes extant biological sequences. Branch-heterogeneous models of protein evolution that account for compositional variability have been developed, but are not yet in common use because of the large number of parameters required, leading to high computational costs and potential overparameterization. Here, we present a new branch-nonhomogeneous and nonstationary model of protein evolution that captures more accurately the high complexity of sequence evolution. This model, henceforth called Correspondence and likelihood analysis (COaLA), makes use of a correspondence analysis to reduce the number of parameters to be optimized through maximum likelihood, focusing on most of the compositional variation observed in the data. The model was thoroughly tested on both simulated and biological data sets to show its high performance in terms of data fitting and CPU time. COaLA efficiently estimates ancestral amino acid frequencies and sequences, making it relevant for studies aiming at reconstructing and resurrecting ancestral amino acid sequences. Finally, we applied COaLA on a concatenate of universal amino acid sequences to confirm previous results obtained with a nonhomogeneous Bayesian model regarding the early pattern of adaptation to optimal growth temperature, supporting the mesophilic nature of the Last Universal Common Ancestor.
Affiliation(s)
- M Groussin
- Laboratoire de Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, UMR5558, Villeurbanne, France.
26
Casanellas M, Fernández-Sánchez J, Kedzierska AM. The space of phylogenetic mixtures for equivariant models. Algorithms Mol Biol 2012. [PMID: 23190710 PMCID: PMC3608327 DOI: 10.1186/1748-7188-7-33] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/20/2022] Open
Abstract
Background: The selection of an evolutionary model to best fit given molecular data is usually a heuristic choice. In his seminal book, J. Felsenstein suggested that certain linear equations satisfied by the expected probabilities of patterns observed at the leaves of a phylogenetic tree could be used for model selection. It remained an open question, however, whether these equations were sufficient to fully characterize the evolutionary model under consideration. Results: Here we prove that, for most equivariant models of evolution, the space of distributions satisfying these linear equations coincides with the space of distributions arising from mixtures of trees. In other words, we prove that the evolution of an observed multiple sequence alignment can be modeled by a mixture of phylogenetic trees under an equivariant evolutionary model if and only if the distribution of patterns at its columns satisfies the linear equations mentioned above. Moreover, we provide a set of linearly independent equations defining this space of phylogenetic mixtures for each equivariant model and for any number of taxa. Lastly, we use these results to perform a study of identifiability of phylogenetic mixtures. Conclusions: The space of phylogenetic mixtures under equivariant models is a linear space that fully characterizes the evolutionary model. We provide an explicit algorithm to obtain the equations defining these spaces for a number of models and taxa. Its implementation has proved to be a powerful tool for model selection.
27
Holland BR, Jarvis PD, Sumner JG. Low-Parameter Phylogenetic Inference Under the General Markov Model. Syst Biol 2012; 62:78-92. [DOI: 10.1093/sysbio/sys072] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022] Open
Affiliation(s)
- Barbara R. Holland
- School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
- Peter D. Jarvis
- School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
- Jeremy G. Sumner
- School of Mathematics and Physics, University of Tasmania, Hobart 7001, Australia
28
Doerr D, Gronau I, Moran S, Yavneh I. Stochastic errors vs. modeling errors in distance based phylogenetic reconstructions. Algorithms Mol Biol 2012; 7:22. [PMID: 22938153 PMCID: PMC3538584 DOI: 10.1186/1748-7188-7-22] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/23/2011] [Accepted: 06/28/2012] [Indexed: 11/24/2022] Open
Abstract
Background: Distance-based phylogenetic reconstruction methods use evolutionary distances between species in order to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method. Results: This paper continues the line of research which attempts to fit, for each given set of input sequences, a distance function that maximizes the expected topological accuracy of the reconstructed tree. We focus here on the effect of systematic error caused by assuming an inadequate model, but also consider the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error based on the notion of deviation from additivity, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the Jukes-Cantor distance function when applied to data generated according to Kimura's two-parameter model with a transition-transversion bias. We provide both a theoretical derivation for this case and a detailed simulation study on quartet trees. Conclusions: We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enable statistically inconsistent reconstruction methods to outperform consistent methods.
Affiliation(s)
- Daniel Doerr
- Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Ilan Gronau
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, USA
- Shlomo Moran
- Computer Science Department, Technion - Israel Institute of Technology, Haifa, Israel
- Irad Yavneh
- Computer Science Department, Technion - Israel Institute of Technology, Haifa, Israel
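
The systematic error described in the abstract above, where the Jukes-Cantor distance is applied to data generated under Kimura's two-parameter (K2P) model, can be sketched numerically in the infinite-sequence limit. The code below is an illustrative sketch and not the authors' framework. The quantity `additivity_gap` is a simple stand-in chosen for this example and may differ from the paper's formal definition of deviation from additivity. The parameterization is also an assumption: `kappa` is the transition rate divided by the rate of each transversion type, with the total rate scaled so that `d` is the expected number of substitutions per site.

```python
import math

def k2p_p_distance(d, kappa):
    """Expected proportion of differing sites under K2P at true distance d."""
    beta = 1.0 / (kappa + 2.0)   # rate of each of the two transversion types
    alpha = kappa * beta         # transition rate, so alpha + 2 * beta = 1
    transitions = (0.25 + 0.25 * math.exp(-4.0 * beta * d)
                   - 0.5 * math.exp(-2.0 * (alpha + beta) * d))
    transversions = 0.5 - 0.5 * math.exp(-4.0 * beta * d)
    return transitions + transversions

def jc_distance(p):
    """Jukes-Cantor distance computed from a proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def additivity_gap(d1, d2, kappa):
    """JC estimate for a path of length d1 + d2 minus the sum of the JC
    estimates for its two parts. Zero means the estimate is additive."""
    def estimate(d):
        return jc_distance(k2p_p_distance(d, kappa))
    return estimate(d1 + d2) - estimate(d1) - estimate(d2)

gap_no_bias = additivity_gap(0.3, 0.4, kappa=1.0)     # K2P reduces to JC
gap_with_bias = additivity_gap(0.3, 0.4, kappa=10.0)  # strong transition bias
print(gap_no_bias, gap_with_bias)
```

With `kappa = 1` the K2P model coincides with Jukes-Cantor and the gap vanishes up to rounding. With a transition bias the Jukes-Cantor estimate is a concave function of the true distance, so long paths are underestimated relative to the sum of their parts and the gap is negative.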
29
Jarvis PD, Sumner JG. Markov invariants for phylogenetic rate matrices derived from embedded submodels. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:828-836. [PMID: 22331860 DOI: 10.1109/tcbb.2012.24] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 05/31/2023]
Abstract
We consider novel phylogenetic models with rate matrices that arise via the embedding of a progenitor model on a small number of character states into a target model on a larger number of character states. Adapting representation-theoretic results from recent investigations of Markov invariants for the general rate matrix model, we give a prescription for identifying and counting Markov invariants for such “symmetric embedded” models, and we provide enumerations of these for the first few cases with a small number of character states. The simplest example is a target model on three states, constructed from a general 2-state model: the “2 → 3” embedding. We show that for 2 taxa, there exist two invariants of quadratic degree that can be used to directly infer pairwise distances from observed sequences under this model. A simple simulation study verifies their theoretical expected values and suggests that, given the appropriateness of the model class, they have statistical properties superior to those of the standard (log) Det invariant (which is of cubic degree for this case).
Affiliation(s)
- Peter D Jarvis
- School of Mathematics and Physics, University of Tasmania, Private Bag 37, Hobart Tas 7001, Australia.
30
Zou L, Susko E, Field C, Roger AJ. Fitting nonstationary general-time-reversible models to obtain edge-lengths and frequencies for the Barry-Hartigan model. Syst Biol 2012; 61:927-40. [PMID: 22508720 DOI: 10.1093/sysbio/sys046] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022] Open
Abstract
Among models of nucleotide evolution, the Barry and Hartigan (BH) model (also known as the General Markov Model) is very flexible as it allows separate arbitrary substitution matrices along edges. For a given tree, the estimates of the BH model are a set of joint probability matrices, each giving the pairwise frequencies of nucleotides at the ends of the edge. We have previously shown that, due to an identifiability problem, these cannot be expected to consistently estimate the actual pairwise frequencies. A further consequence is that internal node frequency estimates are likely to be incorrect. Here we define a nonstationary GTR model for each edge that we refer to as the NSGTR model. We fit the NSGTR model by minimizing the sums of squares between the estimates of transition probabilities under the NSGTR model and the estimates provided by a fitted BH model. This NSGTR model provides estimates that avoid the identifiability difficulties of the BH model while closely fitting it. With the best-fitting NSGTR estimates, we are able to get interpretable frequency vectors at internal nodes as well as edge length estimates that are otherwise not yielded by the BH model. These edge lengths are interpretable as the expected number of substitutions along an edge for the model. We also show that for a nonstationary continuous-time model these are not the same as the edge length parameters for conventional substitution matrices that are output by nonstationary model phylogenetic estimation programs such as nhPhyML.
Affiliation(s)
- Liwen Zou
- Bioinformatics Research Center, Department of Genetics, North Carolina State University, NC, USA
31
Sumner JG, Jarvis PD, Fernández-Sánchez J, Kaine BT, Woodhams MD, Holland BR. Is the general time-reversible model bad for molecular phylogenetics? Syst Biol 2012; 61:1069-74. [PMID: 22442193 DOI: 10.1093/sysbio/sys042] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Indexed: 11/14/2022] Open
Affiliation(s)
- Jeremy G Sumner
- School of Mathematics and Physics, University of Tasmania, Hobart 7001,