1
Silvestro D, Latrille T, Salamin N. Toward a Semi-Supervised Learning Approach to Phylogenetic Estimation. Syst Biol 2024; 73:789-806. [PMID: 38916476 PMCID: PMC11639169 DOI: 10.1093/sysbio/syae029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/21/2024] [Accepted: 06/24/2024] [Indexed: 06/26/2024] Open
Abstract
Models have always been central to inferring molecular evolution and to reconstructing phylogenetic trees. Their use typically involves the development of a mechanistic framework reflecting our understanding of the underlying biological processes, such as nucleotide substitutions, and the estimation of model parameters by maximum likelihood or Bayesian inference. However, deriving and optimizing the likelihood of the data is not always possible under complex evolutionary scenarios or even tractable for large datasets, often leading to unrealistic simplifying assumptions in the fitted models. To overcome this issue, we coupled stochastic simulations of genome evolution with a new supervised deep-learning model to infer key parameters of molecular evolution. Our model is designed to directly analyze multiple sequence alignments and estimate per-site evolutionary rates and divergence without requiring a known phylogenetic tree. The accuracy of our predictions matched that of likelihood-based phylogenetic inference when rate heterogeneity followed a simple gamma distribution, but it strongly exceeded it under more complex patterns of rate variation, such as codon models. Our approach is highly scalable and can be efficiently applied to genomic data, as we showed on a dataset of 26 million nucleotides from the clownfish clade. Our simulations also showed that the integration of per-site rates obtained by deep learning within a Bayesian framework led to significantly more accurate phylogenetic inference, particularly with respect to the estimated branch lengths. We thus propose that future advancements in phylogenetic analysis will benefit from a semi-supervised learning approach that combines deep-learning estimation of substitution rates, which allows for more flexible models of rate variation, and probabilistic inference of the phylogenetic tree, which guarantees interpretability and a rigorous assessment of statistical support.
Affiliation(s)
- Daniele Silvestro
  - Department of Biology, University of Fribourg and Swiss Institute of Bioinformatics, 1700 Fribourg, Switzerland
  - Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, 40530 Gothenburg, Sweden
- Thibault Latrille
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
- Nicolas Salamin
  - Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
2
Moffett AS, Cui G, Thomas PJ, Hunt WD, McCarty NA, Westafer RS, Eckford AW. Permissive and nonpermissive channel closings in CFTR revealed by a factor graph inference algorithm. Biophysical Reports 2022; 2:100083. [PMID: 36425670 PMCID: PMC9680790 DOI: 10.1016/j.bpr.2022.100083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2022] [Accepted: 10/13/2022] [Indexed: 06/16/2023]
Abstract
The closing of the gated ion channel in the cystic fibrosis transmembrane conductance regulator can be categorized as nonpermissive to reopening, which involves the unbinding of ADP or ATP, or permissive, which does not. Identifying the type of closing is of interest as interactions with nucleotides can be affected in mutants or by introducing agonists. However, all closings are electrically silent and difficult to differentiate. For single-channel patch-clamp traces, we show that the type of the closing can be accurately determined by an inference algorithm implemented on a factor graph, which we demonstrate using both simulated and lab-obtained patch-clamp traces.
Affiliation(s)
- Alexander S. Moffett
  - Department of Electrical Engineering and Computer Science, York University, Toronto, ON, Canada
- Guiying Cui
  - Emory + Children’s Center for Cystic Fibrosis and Airways Disease Research, Emory University School of Medicine and Children’s Healthcare of Atlanta, Atlanta, Georgia
- Peter J. Thomas
  - Department of Mathematics, Applied Mathematics, and Statistics, Case Western Reserve University, Cleveland, Ohio
- William D. Hunt
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
- Nael A. McCarty
  - Emory + Children’s Center for Cystic Fibrosis and Airways Disease Research, Emory University School of Medicine and Children’s Healthcare of Atlanta, Atlanta, Georgia
- Andrew W. Eckford
  - Department of Electrical Engineering and Computer Science, York University, Toronto, ON, Canada
3
May MR, Contreras DL, Sundue MA, Nagalingum NS, Looy CV, Rothfels CJ. Inferring the Total-Evidence Timescale of Marattialean Fern Evolution in the Face of Model Sensitivity. Syst Biol 2021; 70:1232-1255. [PMID: 33760075 PMCID: PMC8513765 DOI: 10.1093/sysbio/syab020] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/29/2020] [Revised: 03/09/2021] [Accepted: 03/22/2021] [Indexed: 11/24/2022] Open
Abstract
Phylogenetic divergence-time estimation has been revolutionized by two recent developments: 1) total-evidence dating (or "tip-dating") approaches that allow for the incorporation of fossils as tips in the analysis, with their phylogenetic and temporal relationships to the extant taxa inferred from the data and 2) the fossilized birth-death (FBD) class of tree models that capture the processes that produce the tree (speciation, extinction, and fossilization) and thus provide a coherent and biologically interpretable tree prior. To explore the behavior of these methods, we apply them to marattialean ferns, a group that was dominant in Carboniferous landscapes prior to declining to its modest extant diversity of slightly over 100 species. We show that tree models have a dramatic influence on estimates of both divergence times and topological relationships. This influence is driven by the strong, counter-intuitive informativeness of the uniform tree prior, and the inherent nonidentifiability of divergence-time models. In contrast to the strong influence of the tree models, we find minor effects of differing the morphological transition model or the morphological clock model. We compare the performance of a large pool of candidate models using a combination of posterior-predictive simulation and Bayes factors. Notably, an FBD model with epoch-specific speciation and extinction rates was strongly favored by Bayes factors. Our best-fitting model infers stem and crown divergences for the Marattiales in the mid-Devonian and Late Cretaceous, respectively, with elevated speciation rates in the Mississippian and elevated extinction rates in the Cisuralian leading to a peak diversity of ~2800 species at the end of the Carboniferous, representing the heyday of the Psaroniaceae. This peak is followed by the rapid decline and ultimate extinction of the Psaroniaceae, with their descendants, the Marattiaceae, persisting at approximately stable levels of diversity until the present. This general diversification pattern appears to be insensitive to potential biases in the fossil record; despite the preponderance of available fossils being from Pennsylvanian coal balls, incorporating fossilization-rate variation does not improve model fit. In addition, by incorporating temporal data directly within the model and allowing for the inference of the phylogenetic position of the fossils, our study makes the surprising inference that the clade of extant Marattiales is relatively young, younger than any of the fossils historically thought to be congeneric with extant species. This result is a dramatic demonstration of the dangers of node-based approaches to divergence-time estimation, where the assignment of fossils to particular clades is made a priori (earlier node-based studies that constrained the minimum ages of extant genera based on these fossils resulted in much older age estimates than in our study) and of the utility of explicit models of morphological evolution and lineage diversification. [Bayesian model comparison; Carboniferous; divergence-time estimation; fossil record; fossilized birth-death; lineage diversification; Marattiales; models of morphological evolution; Psaronius; RevBayes.]
Affiliation(s)
- Michael R May
  - Department of Integrative Biology, University of California, Berkeley, 3040 Valley Life Sciences Building #3140, Berkeley, CA 94720, USA
  - University Herbarium, University of California, Berkeley, 1001 Valley Life Sciences Building #2465, Berkeley, CA 94720, USA
- Dori L Contreras
  - Department of Paleontology, Perot Museum of Nature and Science, 2201 N. Field Street, Dallas TX 75201, USA
- Michael A Sundue
  - Department of Plant Biology, University of Vermont, 111 Jeffords Hall, 63 Carrigan Drive, Burlington, VT 05405, USA
  - The Pringle Herbarium, University of Vermont, 305 Jeffords Hall, 63 Carrigan Drive, Burlington, VT 05405, USA
- Nathalie S Nagalingum
  - Department of Botany, California Academy of Sciences, Golden Gate Park, 55 Music Concourse Drive, San Francisco, CA 94118, USA
- Cindy V Looy
  - Department of Integrative Biology, University of California, Berkeley, 3040 Valley Life Sciences Building #3140, Berkeley, CA 94720, USA
  - University Herbarium, University of California, Berkeley, 1001 Valley Life Sciences Building #2465, Berkeley, CA 94720, USA
  - Museum of Paleontology, University of California, 1101 Valley Life Sciences Building, Berkeley, CA 94720, USA
- Carl J Rothfels
  - Department of Integrative Biology, University of California, Berkeley, 3040 Valley Life Sciences Building #3140, Berkeley, CA 94720, USA
  - University Herbarium, University of California, Berkeley, 1001 Valley Life Sciences Building #2465, Berkeley, CA 94720, USA
4
Magee AF, Hilton SK, DeWitt WS. Robustness of phylogenetic inference to model misspecification caused by pairwise epistasis. Mol Biol Evol 2021; 38:4603-4615. [PMID: 34043795 PMCID: PMC8476159 DOI: 10.1093/molbev/msab163] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/17/2022] Open
Abstract
Likelihood-based phylogenetic inference posits a probabilistic model of character state change along branches of a phylogenetic tree. These models typically assume statistical independence of sites in the sequence alignment. This is a restrictive assumption that facilitates computational tractability, but ignores how epistasis, the effect of genetic background on mutational effects, influences the evolution of functional sequences. We consider the effect of using a misspecified site-independent model on the accuracy of Bayesian phylogenetic inference in the setting of pairwise-site epistasis. Previous work has shown that as alignment length increases, tree reconstruction accuracy also increases. Here, we present a simulation study demonstrating that accuracy increases with alignment size even if the additional sites are epistatically coupled. We introduce an alignment-based test statistic that is a diagnostic for pairwise epistasis and can be used in posterior predictive checks.
Affiliation(s)
- Andrew F Magee
  - Department of Biology
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA
- Sarah K Hilton
  - Department of Genome Sciences, University of Washington, Seattle, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA
- William S DeWitt
  - Department of Genome Sciences, University of Washington, Seattle, USA
  - Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA
5
Nasir A, Mughal F, Caetano-Anollés G. The tree of life describes a tripartite cellular world. Bioessays 2021; 43:e2000343. [PMID: 33837594 DOI: 10.1002/bies.202000343] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/25/2020] [Revised: 03/11/2021] [Accepted: 03/15/2021] [Indexed: 12/28/2022]
Abstract
The canonical view of a 3-domain (3D) tree of life was recently challenged by the discovery of Asgardarchaeota encoding eukaryote signature proteins (ESPs), which were treated as missing links of a 2-domain (2D) tree. Here we revisit the debate. We discuss methodological limitations of building trees with alignment-dependent approaches, which often fail to satisfactorily address the problem of "gaps." In addition, most phylogenies are reconstructed unrooted, neglecting the power of direct rooting methods. Alignment-free methodologies lift most difficulties but require employing realistic evolutionary models. We argue that the discoveries of Asgards and ESPs, by themselves, do not rule out the 3D tree, which is strongly supported by comparative and evolutionary genomic analyses and vast genomic and biochemical superkingdom distinctions. Given uncertainties of retrodiction and interpretation difficulties, we conclude that the 3D view has not been falsified but instead has been strengthened by genomic analyses. In turn, the objections to the 2D model have not been lifted. The debate remains open. Also see the video abstract here: https://youtu.be/-6TBN0bubI8.
Affiliation(s)
- Arshan Nasir
  - Theoretical Biology and Biophysics (T-6), Los Alamos National Laboratory, Los Alamos, New Mexico, USA
- Fizza Mughal
  - Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Gustavo Caetano-Anollés
  - Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
6
Meyer X. Adaptive Tree Proposals for Bayesian Phylogenetic Inference. Syst Biol 2021; 70:1015-1032. [PMID: 33515248 PMCID: PMC8357345 DOI: 10.1093/sysbio/syab004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/02/2019] [Revised: 01/07/2021] [Accepted: 01/17/2021] [Indexed: 11/14/2022] Open
Abstract
Bayesian inference of phylogeny with MCMC plays a key role in the study of evolution. Yet, this method still suffers from a practical challenge identified more than two decades ago: designing tree topology proposals that efficiently sample tree spaces. In this article, I introduce the concept of adaptive tree proposals for unrooted topologies, that is, tree proposals adapting to the posterior distribution as it is estimated. I use this concept to elaborate two adaptive variants of existing proposals and an adaptive proposal based on a novel design philosophy in which the structure of the proposal is informed by the posterior distribution of trees. I investigate the performance of these proposals by first presenting a metric that captures the performance of each proposal within a mixture of proposals. Using this metric, I compare the performance of the adaptive proposals to the performance of standard and parsimony-guided proposals on 11 empirical datasets. Using adaptive proposals led to consistent performance gains and resulted in up to 18-fold increases in mixing efficiency and 6-fold increases in convergence rate without increasing the computational cost of these analyses.
Affiliation(s)
- X Meyer
  - Department of Integrative Biology, University of California, Berkeley, California 94720, USA