1
Baños H, Susko E, Roger AJ. Is Over-parameterization a Problem for Profile Mixture Models? Syst Biol 2024; 73:53-75. [PMID: 37843172 PMCID: PMC11129589 DOI: 10.1093/sysbio/syad063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2022] [Revised: 09/12/2023] [Accepted: 10/13/2023] [Indexed: 10/17/2023]
Abstract
Biochemical constraints on the admissible amino acids at specific sites in proteins lead to heterogeneity of the amino acid substitution process over sites in alignments. It is well known that phylogenetic models of protein sequence evolution that do not account for site heterogeneity are prone to long-branch attraction (LBA) artifacts. Profile mixture models were developed to model heterogeneity of preferred amino acids at sites via a finite distribution of site classes each with a distinct set of equilibrium amino acid frequencies. However, it is unknown whether the large number of parameters in such models associated with the many amino acid frequency vectors can adversely affect tree topology estimates because of over-parameterization. Here, we demonstrate theoretically that for long sequences, over-parameterization does not create problems for estimation with profile mixture models. Under mild conditions, tree, amino acid frequencies, and other model parameters converge to true values as sequence length increases, even when there are large numbers of components in the frequency profile distributions. Because large sample theory does not necessarily imply good behavior for shorter alignments we explore the performance of these models with short alignments simulated with tree topologies that are prone to LBA artifacts. We find that over-parameterization is not a problem for complex profile mixture models even when there are many amino acid frequency vectors. In fact, simple models with few site classes behave poorly. Interestingly, we also found that misspecification of the amino acid frequency vectors does not lead to increased LBA artifacts as long as the estimated cumulative distribution function of the amino acid frequencies at sites adequately approximates the true one. In contrast, misspecification of the amino acid exchangeability rates can severely negatively affect parameter estimation. 
Finally, we explore the effects of including in the profile mixture model an additional "F-class" representing the overall frequencies of amino acids in the data set. Surprisingly, the F-class does not help parameter estimation significantly and can decrease the probability of correct tree estimation, depending on the scenario, even though it tends to improve likelihood scores.
Affiliation(s)
- Hector Baños
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Andrew J Roger
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Institute for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada

2
Burgstaller-Muehlbacher S, Crotty SM, Schmidt HA, Reden F, Drucks T, von Haeseler A. ModelRevelator: Fast phylogenetic model estimation via deep learning. Mol Phylogenet Evol 2023; 188:107905. [PMID: 37595933 DOI: 10.1016/j.ympev.2023.107905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 08/12/2023] [Indexed: 08/20/2023]
Abstract
Selecting the best model of sequence evolution for a multiple-sequence-alignment (MSA) constitutes the first step of phylogenetic tree reconstruction. Common approaches for inferring nucleotide models typically apply maximum likelihood (ML) methods, with discrimination between models determined by one of several information criteria. This requires tree reconstruction and optimisation which can be computationally expensive. We demonstrate that neural networks can be used to perform model selection, without the need to reconstruct trees, optimise parameters, or calculate likelihoods. We introduce ModelRevelator, a model selection tool underpinned by two deep neural networks. The first neural network, NNmodelfind, recommends one of six commonly used models of sequence evolution, ranging in complexity from Jukes and Cantor to General Time Reversible. The second, NNalphafind, recommends whether or not a Γ-distributed rate heterogeneous model should be incorporated, and if so, provides an estimate of the shape parameter, ɑ. Users can simply input an MSA into ModelRevelator, and swiftly receive output recommending the evolutionary model, inclusive of the presence or absence of rate heterogeneity, and an estimate of ɑ. We show that ModelRevelator performs comparably with likelihood-based methods and the recently published machine learning method ModelTeller over a wide range of parameter settings, with significant potential savings in computational effort. Further, we show that this performance is not restricted to the alignments on which the networks were trained, but is maintained even on unseen empirical data. We expect that ModelRevelator will provide a valuable alternative for phylogeneticists, especially where traditional methods of model selection are computationally prohibitive.
Affiliation(s)
- Sebastian Burgstaller-Muehlbacher
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
- Stephen M Crotty
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - ARC Centre of Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Adelaide, SA 5005, Australia
- Heiko A Schmidt
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
- Franziska Reden
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
- Tamara Drucks
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
  - Research Unit Machine Learning, TU Wien, 1040 Vienna, Austria
- Arndt von Haeseler
  - Center for Integrative Bioinformatics Vienna, Max Perutz Labs, University of Vienna and Medical University of Vienna, Vienna BioCenter (VBC) 5, 1030 Vienna, Austria
  - Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Währinger Straße 29, 1090 Vienna, Austria

3
Lartillot N. Identifying the Best Approximating Model in Bayesian Phylogenetics: Bayes Factors, Cross-Validation or wAIC? Syst Biol 2023; 72:616-638. [PMID: 36810802 PMCID: PMC10276628 DOI: 10.1093/sysbio/syad004] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/25/2022] [Revised: 01/20/2023] [Accepted: 02/17/2023] [Indexed: 02/23/2023]
Abstract
There is still no consensus as to how to select models in Bayesian phylogenetics, and more generally in applied Bayesian statistics. Bayes factors are often presented as the method of choice, yet other approaches have been proposed, such as cross-validation or information criteria. Each of these paradigms raises specific computational challenges, but they also differ in their statistical meaning, being motivated by different objectives: either testing hypotheses or finding the best-approximating model. These alternative goals entail different compromises, and as a result, Bayes factors, cross-validation, and information criteria may be valid for addressing different questions. Here, the question of Bayesian model selection is revisited, with a focus on the problem of finding the best-approximating model. Several model selection approaches were re-implemented, numerically assessed and compared: Bayes factors, cross-validation (CV), in its different forms (k-fold or leave-one-out), and the widely applicable information criterion (wAIC), which is asymptotically equivalent to leave-one-out cross-validation (LOO-CV). Using a combination of analytical results and empirical and simulation analyses, it is shown that Bayes factors are unduly conservative. In contrast, CV represents a more adequate formalism for selecting the model returning the best approximation of the data-generating process and the most accurate estimates of the parameters of interest. Among alternative CV schemes, LOO-CV and its asymptotic equivalent represented by the wAIC, stand out as the best choices, conceptually and computationally, given that both can be simultaneously computed based on standard Markov chain Monte Carlo runs under the posterior distribution. [Bayes factor; cross-validation; marginal likelihood; model comparison; wAIC.].
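The abstract notes that wAIC can be computed from standard MCMC output. As a rough illustration only, the sketch below uses the commonly cited formulation of wAIC (log pointwise predictive density minus the summed posterior variance of the per-site log-likelihoods). It is not taken from the paper; the function names and the input layout `pointwise_loglik[s][i]` (posterior draw `s`, alignment site `i`) are assumptions made for this example.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(pointwise_loglik):
    """Compute wAIC from per-site log-likelihoods over posterior draws.

    pointwise_loglik[s][i] is the log-likelihood of site i under draw s
    (hypothetical layout; at least two draws are needed for the variance).
    """
    n_sites = len(pointwise_loglik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters (variance form)
    for i in range(n_sites):
        column = [draw[i] for draw in pointwise_loglik]
        lppd += log_mean_exp(column)
        p_waic += variance(column)  # sample variance across draws
    elpd = lppd - p_waic
    return {"lppd": lppd, "p_waic": p_waic, "elpd_waic": elpd, "waic": -2.0 * elpd}
```

Under this convention a lower `waic` (equivalently, a higher `elpd_waic`) indicates better estimated predictive fit; some software reports the score on the log scale rather than multiplied by -2.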
Affiliation(s)
- Nicolas Lartillot
  - Université de Lyon, Université Lyon 1, CNRS, VetAgro Sup, Laboratoire de Biométrie et Biologie Evolutive, UMR5558, Villeurbanne, France

4
Abstract
This paper deals with the problem of choosing the optimum criterion for selecting the best model out of a set of overlapping binary models. The criteria we studied were the well-known AIC and SBIC, and a third one called C2. Special attention was paid to the setting where neither of the competing models was correctly specified. This situation has not been studied very much but it is the most common case in empirical works. The theoretical study we carried out allowed us to conclude that, in general terms, all criteria perform well. A Monte Carlo exercise corroborated those results.

5
Crotty SM, Holland BR. Comparing partitioned models to mixture models: Do information criteria apply? Syst Biol 2022; 71:1541-1548. [PMID: 35041002 PMCID: PMC9558833 DOI: 10.1093/sysbio/syac003] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/16/2021] [Revised: 12/15/2021] [Accepted: 01/10/2022] [Indexed: 12/01/2022]
Abstract
The use of information criteria to distinguish between phylogenetic models has become ubiquitous within the field. However, the variety and complexity of available models are much greater now than when these practices were established. The literature shows an increasing trajectory of healthy skepticism with regard to the use of information theory-based model selection within phylogenetics. We add to this by analyzing the specific case of comparison between partition and mixture models. We argue from a theoretical basis that information criteria are inherently more likely to favor partition models over mixture models, and we then demonstrate this through simulation. Based on our findings, we suggest that partition and mixture models are not suitable for information-theory based model comparison. [AIC, BIC; information criteria; maximum likelihood; mixture models; partitioned model; phylogenetics.]
Affiliation(s)
- Stephen M Crotty
  - School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna and Medical University of Vienna, Vienna, Austria
  - ARC Centre of Excellence for Mathematical and Statistical Frontiers, The University of Adelaide, Adelaide, SA, Australia
- Barbara R Holland
  - School of Natural Sciences (Mathematics), University of Tasmania, Hobart, TAS 7001, Australia

6
Seo TK, Gascuel O, Thorne JL. Measuring Phylogenetic Information of Incomplete Sequence Data. Syst Biol 2021; 71:630-648. [PMID: 34469581 DOI: 10.1093/sysbio/syab073] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/02/2021] [Revised: 08/26/2021] [Accepted: 08/27/2021] [Indexed: 11/13/2022]
Abstract
Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the Effective Sequence Length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification.
Affiliation(s)
- Tae-Kun Seo
  - Department of Biological Sciences, Korea Polar Research Institute, 26 Songdomirae-ro, Yeonsu-gu, Incheon 21990, Republic of Korea
  - Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur and CNRS, Paris, France [Sabbatical affiliation of T-K S]
- Olivier Gascuel
  - Unité Bioinformatique Evolutive, C3BI USR 3756, Institut Pasteur and CNRS, Paris, France [Sabbatical affiliation of T-K S]
  - Institut de Systématique, Évolution, Biodiversité (ISYEB - UMR 7205, CNRS, Muséum National d'Histoire Naturelle, SU, EPHE, UA), Paris, France [Current affiliation of O.G.]
- Jeffrey L Thorne
  - Departments of Biological Sciences and Statistics, North Carolina State University, Raleigh NC 27695-7566 U.S.A

7
A unified resource and configurable model of the synapse proteome and its role in disease. Sci Rep 2021; 11:9967. [PMID: 33976238 PMCID: PMC8113277 DOI: 10.1038/s41598-021-88945-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/05/2021] [Accepted: 04/15/2021] [Indexed: 02/03/2023]
Abstract
Genes encoding synaptic proteins are highly associated with neuronal disorders many of which show clinical co-morbidity. We integrated 58 published synaptic proteomic datasets that describe over 8000 proteins and combined them with direct protein-protein interactions and functional metadata to build a network resource that reveals the shared and unique protein components that underpin multiple disorders. All the data are provided in a flexible and accessible format to encourage custom use.

8
Evidence for sponges as sister to all other animals from partitioned phylogenomics with mixture models and recoding. Nat Commun 2021; 12:1783. [PMID: 33741994 PMCID: PMC7979703 DOI: 10.1038/s41467-021-22074-7] [Citation(s) in RCA: 48] [Impact Index Per Article: 16.0] [Received: 08/07/2020] [Accepted: 02/24/2021] [Indexed: 11/08/2022]
Abstract
Resolving the relationships between the major lineages in the animal tree of life is necessary to understand the origin and evolution of key animal traits. Sponges, characterized by their simple body plan, were traditionally considered the sister group of all other animal lineages, implying a gradual increase in animal complexity from unicellularity to complex multicellularity. However, the availability of genomic data has sparked tremendous controversy as some phylogenomic studies support comb jellies taking this position, requiring secondary loss or independent origins of complex traits. Here we show that incorporating site-heterogeneous mixture models and recoding into partitioned phylogenomics alleviates systematic errors that hamper commonly-applied phylogenetic models. Testing on real datasets, we show a great improvement in model-fit that attenuates branching artefacts induced by systematic error. We reanalyse key datasets and show that partitioned phylogenomics does not support comb jellies as sister to other animals at either the supermatrix or partition-specific level.

9
Simões TR, Caldwell MW, Pierce SE. Sphenodontian phylogeny and the impact of model choice in Bayesian morphological clock estimates of divergence times and evolutionary rates. BMC Biol 2020; 18:191. [PMID: 33287835 PMCID: PMC7720557 DOI: 10.1186/s12915-020-00901-5] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 06/24/2020] [Accepted: 10/16/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND The vast majority of all life that ever existed on earth is now extinct and several aspects of their evolutionary history can only be assessed by using morphological data from the fossil record. Sphenodontian reptiles are a classic example, having an evolutionary history of at least 230 million years, but currently represented by a single living species (Sphenodon punctatus). Hence, it is imperative to improve the development and implementation of probabilistic models to estimate evolutionary trees from morphological data (e.g., morphological clocks), which has direct benefits to understanding relationships and evolutionary patterns for both fossil and living species. However, the impact of model choice on morphology-only datasets has been poorly explored. RESULTS Here, we investigate the impact of a wide array of model choices on the inference of evolutionary trees and macroevolutionary parameters (divergence times and evolutionary rates) using a new data matrix on sphenodontian reptiles. Specifically, we tested different clock models, clock partitioning, taxon sampling strategies, sampling for ancestors, and variations on the fossilized birth-death (FBD) tree model parameters through time. We find a strong impact on divergence times and background evolutionary rates when applying widely utilized approaches, such as allowing for ancestors in the tree and the inappropriate assumption of diversification parameters being constant through time. We compare those results with previous studies on the impact of model choice to molecular data analysis and provide suggestions for improving the implementation of morphological clocks. Optimal model combinations find the radiation of most major lineages of sphenodontians to be in the Triassic and a gradual but continuous drop in morphological rates of evolution across distinct regions of the phenotype throughout the history of the group. 
CONCLUSIONS We provide a new hypothesis of sphenodontian classification, along with detailed macroevolutionary patterns in the evolutionary history of the group. Importantly, we provide suggestions to avoid overestimated divergence times and biased parameter estimates using morphological clocks. Partitioning relaxed clocks offers methodological limitations, but those can be at least partially circumvented to reveal a detailed assessment of rates of evolution across the phenotype and tests of evolutionary mosaicism.
Affiliation(s)
- Tiago R Simões
  - Museum of Comparative Zoology & Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, 02138, USA
- Michael W Caldwell
  - Department of Biological Sciences, University of Alberta, Edmonton, Alberta, T6G 2E9, Canada
  - Department of Earth and Atmospheric Sciences, University of Alberta, Edmonton, Alberta, T6G 2E9, Canada
- Stephanie E Pierce
  - Museum of Comparative Zoology & Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, 02138, USA

10
Susko E, Roger AJ. On the Use of Information Criteria for Model Selection in Phylogenetics. Mol Biol Evol 2020; 37:549-562. [PMID: 31688943 DOI: 10.1093/molbev/msz228] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 11/15/2022]
Abstract
The information criteria Akaike information criterion (AIC), AICc, and Bayesian information criterion (BIC) are widely used for model selection in phylogenetics, however, their theoretical justification and performance have not been carefully examined in this setting. Here, we investigate these methods under simple and complex phylogenetic models. We show that AIC can give a biased estimate of its intended target, the expected predictive log likelihood (EPLnL) or, equivalently, expected Kullback-Leibler divergence between the estimated model and the true distribution for the data. Reasons for bias include commonly occurring issues such as small edge-lengths or, in mixture models, small weights. The use of partitioned models is another issue that can cause problems with information criteria. We show that for partitioned models, a different BIC correction is required for it to be a valid approximation to a Bayes factor. The commonly used AICc correction is not clearly defined in partitioned models and can actually create a substantial bias when the number of parameters gets large as is the case with larger trees and partitioned models. Bias-corrected cross-validation corrections are shown to provide better approximations to EPLnL than AIC. We also illustrate how EPLnL, the estimation target of AIC, can sometimes favor an incorrect model and give reasons for why selection of incorrectly under-partitioned models might be desirable in partitioned model settings.
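For orientation, the three criteria named in this abstract can be written in their standard textbook forms. The sketch below is illustrative only and does not implement the corrections the paper proposes. The function names are hypothetical, `k` is the number of free parameters, and `n` is the sample size; taking `n` to be the number of alignment sites is a common convention assumed here, not something stated in the abstract.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: -2 lnL + 2k."""
    return -2.0 * log_lik + 2.0 * k


def aicc(log_lik, k, n):
    """Small-sample corrected AIC; undefined when n <= k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_lik, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian information criterion: -2 lnL + k ln(n)."""
    return -2.0 * log_lik + k * math.log(n)


# Hypothetical values for illustration: lnL = -100, 5 parameters, 106 sites.
scores = {
    "AIC": aic(-100.0, 5),
    "AICc": aicc(-100.0, 5, 106),
    "BIC": bic(-100.0, 5, 106),
}
```

The AICc correction term grows without bound as `k` approaches `n - 1`, which is one way to see why the abstract flags it as problematic when the parameter count gets large.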
Affiliation(s)
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
- Andrew J Roger
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada

11
Wang HC, Susko E, Roger AJ. The Relative Importance of Modeling Site Pattern Heterogeneity Versus Partition-Wise Heterotachy in Phylogenomic Inference. Syst Biol 2020; 68:1003-1019. [PMID: 31140564 DOI: 10.1093/sysbio/syz021] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 07/16/2018] [Revised: 02/04/2019] [Accepted: 04/09/2019] [Indexed: 12/18/2022]
Abstract
Large taxa-rich genome-scale data sets are often necessary for resolving ancient phylogenetic relationships. But accurate phylogenetic inference requires that they are analyzed with realistic models that account for the heterogeneity in substitution patterns amongst the sites, genes and lineages. Two kinds of adjustments are frequently used: models that account for heterogeneity in amino acid frequencies at sites in proteins, and partitioned models that accommodate the heterogeneity in rates (branch lengths) among different proteins in different lineages (protein-wise heterotachy). Although partitioned and site-heterogeneous models are both widely used in isolation, their relative importance to the inference of correct phylogenies has not been carefully evaluated. We conducted several empirical analyses and a large set of simulations to compare the relative performances of partitioned models, site-heterogeneous models, and combined partitioned site heterogeneous models. In general, site-homogeneous models (partitioned or not) performed worse than site heterogeneous, except in simulations with extreme protein-wise heterotachy. Furthermore, simulations using empirically-derived realistic parameter settings showed a marked long-branch attraction (LBA) problem for analyses employing protein-wise partitioning even when the generating model included partitioning. This LBA problem results from a small sample bias compounded over many single protein alignments. In some cases, this problem was ameliorated by clustering similarly-evolving proteins together into larger partitions using the PartitionFinder method. Similar results were obtained under simulations with larger numbers of taxa or heterogeneity in simulating topologies over genes. 
For an empirical Microsporidia test data set, all but one tested site-heterogeneous models (with or without partitioning) obtain the correct Microsporidia+Fungi grouping, whereas site-homogenous models (with or without partitioning) did not. The single exception was the fully partitioned site-heterogeneous analysis that succumbed to the compounded small sample LBA bias. In general unless protein-wise heterotachy effects are extreme, it is more important to model site-heterogeneity than protein-wise heterotachy in phylogenomic analyses. Complete protein-wise partitioning should be avoided as it can lead to a serious LBA bias. In cases of extreme protein-wise heterotachy, approaches that cluster similarly-evolving proteins together and coupled with site-heterogeneous models work well for phylogenetic estimation.
Affiliation(s)
- Huai-Chun Wang
  - Department of Mathematics and Statistics, Dalhousie University, 6316 Coburg Road, Halifax, Nova Scotia B3H 4R2, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada
- Edward Susko
  - Department of Mathematics and Statistics, Dalhousie University, 6316 Coburg Road, Halifax, Nova Scotia B3H 4R2, Canada
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada
- Andrew J Roger
  - Centre for Comparative Genomics and Evolutionary Bioinformatics, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada
  - Department of Biochemistry and Molecular Biology, Dalhousie University, 5850 College Street, Halifax, Nova Scotia B3H 4R2, Canada

12
Smith SA, Walker-Hale N, Walker JF, Brown JW. Phylogenetic Conflicts, Combinability, and Deep Phylogenomics in Plants. Syst Biol 2019; 69:579-592. [DOI: 10.1093/sysbio/syz078] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 11/12/2018] [Revised: 10/16/2019] [Accepted: 11/18/2019] [Indexed: 11/13/2022]
Abstract
Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a data set in order to resolve recalcitrant relationships and, importantly, identify what the data set is unable to resolve. These procedures extend upon methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant data set. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific data set to address deep phylogenetic relationships while also identifying the inferential boundaries of the data set. [Angiosperms; coalescent; gene-tree conflict; genomics; phylogenetics; phylogenomics.]
Affiliation(s)
- Stephen A Smith
  - Department of Ecology and Evolutionary Biology, University of Michigan, 1105 North University Ave, Biological Sciences Building, Ann Arbor, MI 48109-1085, USA
- Nathanael Walker-Hale
  - Department of Ecology and Evolutionary Biology, University of Michigan, 1105 North University Ave, Biological Sciences Building, Ann Arbor, MI 48109-1085, USA
  - Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EA, Cambridge, UK
- Joseph F Walker
  - Department of Ecology and Evolutionary Biology, University of Michigan, 1105 North University Ave, Biological Sciences Building, Ann Arbor, MI 48109-1085, USA
  - Sainsbury Laboratory (SLCU), University of Cambridge, Bateman St, Cambridge CB2 1LR, Cambridge, UK
- Joseph W Brown
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, Sheffield, UK