1
|
Dai J, Rubel T, Han Y, Molloy EK. Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem. Algorithms Mol Biol 2024; 19:2. [PMID: 38191515 PMCID: PMC10775561 DOI: 10.1186/s13015-023-00249-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Accepted: 12/10/2023] [Indexed: 01/10/2024] Open
Abstract
The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set. This technique has yet to be used for parsimony problems where the input are binary characters, sometimes with missing values. Here, we introduce the clade-constrained character parsimony problem and present an algorithm that solves this problem for the Dollo criterion score in [Formula: see text] time, where n is the number of leaves, k is the number of characters, and [Formula: see text] is the set of clades used as constraints. Dollo parsimony, which requires traits/mutations to be gained at most once but allows them to be lost any number of times, is widely used for tumor phylogenetics as well as species phylogenetics, for example analyses of low-homoplasy retroelement insertions across the vertebrate tree of life. This motivated us to implement our algorithm in a software package, called Dollo-CDP, and evaluate its utility for analyzing retroelement insertion presence / absence patterns for bats, birds, toothed whales as well as simulated data. Our results show that Dollo-CDP can improve upon heuristic search from a single starting tree, often recovering a better scoring tree. Moreover, Dollo-CDP scales to data sets with much larger numbers of taxa than branch-and-bound while still having an optimality guarantee, albeit a more restricted one. Lastly, we show that our algorithm for Dollo parsimony can easily be adapted to Camin-Sokal parsimony but not Fitch parsimony.
Collapse
Affiliation(s)
- Junyan Dai
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Tobias Rubel
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Yunheng Han
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Erin K Molloy
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- University of Maryland Institute for Advanced Computer Studies, College Park, MD, USA.
| |
Collapse
|
2
|
Han Y, Molloy EK. Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model. Algorithms Mol Biol 2023; 18:19. [PMID: 38041123 PMCID: PMC10691101 DOI: 10.1186/s13015-023-00248-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Accepted: 11/19/2023] [Indexed: 12/03/2023] Open
Abstract
Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) in light of these barriers. We consider a popular tumor phylogenetics model, in which mutations arise on a (highly unresolved) tree and then (unbiased) errors and missing values are introduced. Quartets are then implied by mutations present in two cells and absent from two cells. Our main result is that the most probable quartet identifies the unrooted model tree on four cells. This motivates seeking a tree such that the number of quartets shared between it and the input mutations is maximized. We prove an optimal solution to this problem is a consistent estimator of the unrooted cell lineage tree; this guarantee includes the case where the model tree is highly unresolved, with error defined as the number of false negative branches. Lastly, we outline how quartet-based methods might be employed when there are copy number aberrations and other challenges specific to tumor phylogenetics.
Collapse
Affiliation(s)
- Yunheng Han
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Erin K Molloy
- Department of Computer Science, University of Maryland, College Park, MD, USA.
- University of Maryland Institute for Advanced Computer Studies, College Park, MD, USA.
| |
Collapse
|
3
|
Simmons MP, Goloboff PA, Stöver BC, Springer MS, Gatesy J. Quantification of congruence among gene trees with polytomies using overall success of resolution for phylogenomic coalescent analyses. Cladistics 2023; 39:418-436. [PMID: 37096985 DOI: 10.1111/cla.12540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2022] [Revised: 02/22/2023] [Accepted: 03/24/2023] [Indexed: 04/26/2023] Open
Abstract
Gene-tree-inference error can cause species-tree-inference artefacts in summary phylogenomic coalescent analyses. Here we integrate two ways of accommodating these inference errors: collapsing arbitrarily or dubiously resolved gene-tree branches, and subsampling gene trees based on their pairwise congruence. We tested the effect of collapsing gene-tree branches with 0% approximate-likelihood-ratio-test (SH-like aLRT) support in likelihood analyses and strict consensus trees for parsimony, and then subsampled those partially resolved trees based on congruence measures that do not penalize polytomies. For this purpose we developed a new TNT script for congruence sorting (congsort), and used it to calculate topological incongruence for eight phylogenomic datasets using three distance measures: standard Robinson-Foulds (RF) distances; overall success of resolution (OSR), which is based on counting both matching and contradicting clades; and RF contradictions, which only counts contradictory clades. As expected, we found that gene-tree incongruence was often concentrated in clades that are arbitrarily or dubiously resolved and that there was greater congruence between the partially collapsed gene trees and the coalescent and concatenation topologies inferred from those genes. Coalescent branch lengths typically increased as the most incongruent gene trees were excluded, although branch supports typically did not. We investigated two successful and complementary approaches to prioritizing genes for investigation of alignment or homology errors. Coalescent-tree clades that contradicted concatenation-tree clades were generally less robust to gene-tree subsampling than congruent clades. Our preferred approach to collapsing likelihood gene-tree clades (0% SH-like aLRT support) and subsampling those trees (OSR) generally outperformed competing approaches for a large fungal dataset with respect to branch lengths, support and congruence. We recommend widespread application of this approach (and strict consensus trees for parsimony-based analyses) for improving quantification of gene-tree congruence/conflict, estimating coalescent branch lengths, testing robustness of coalescent analyses to gene-tree-estimation error, and improving topological robustness of summary coalescent analyses. This approach is quick and easy to implement, even for huge datasets.
Collapse
Affiliation(s)
- Mark P Simmons
- Department of Biology, Colorado State University, Fort Collins, CO, 80523, USA
| | - Pablo A Goloboff
- CONICET, INSUE, Fundación Miguel Lillo, Miguel Lillo 251, 4000, S.M. de Tucumán, Argentina
| | - Ben C Stöver
- Institute for Evolution and Biodiversity, WMU Münster, 48149, Münster, Germany
| | - Mark S Springer
- Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA, 92521, USA
| | - John Gatesy
- Division of Vertebrate Zoology, American Museum of Natural History, New York, NY, 10024, USA
| |
Collapse
|
4
|
Hatami E, Jones KE, Kilian N. New Insights Into the Relationships Within Subtribe Scorzonerinae (Cichorieae, Asteraceae) Using Hybrid Capture Phylogenomics (Hyb-Seq). FRONTIERS IN PLANT SCIENCE 2022; 13:851716. [PMID: 35873957 PMCID: PMC9298463 DOI: 10.3389/fpls.2022.851716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/10/2022] [Accepted: 05/18/2022] [Indexed: 06/15/2023]
Abstract
Subtribe Scorzonerinae (Cichorieae, Asteraceae) contains 12 main lineages and approximately 300 species. Relationships within the subtribe, either at inter- or intrageneric levels, were largely unresolved in phylogenetic studies to date, due to the lack of phylogenetic signal provided by traditional Sanger sequencing markers. In this study, we employed a phylogenomics approach (Hyb-Seq) that targets 1,061 nuclear-conserved ortholog loci designed for Asteraceae and obtained chloroplast coding regions as a by-product of off-target reads. Our objectives were to evaluate the potential of the Hyb-Seq approach in resolving the phylogenetic relationships across the subtribe at deep and shallow nodes, investigate the relationships of major lineages at inter- and intrageneric levels, and examine the impact of the different datasets and approaches on the robustness of phylogenetic inferences. We analyzed three nuclear datasets: exon only, excluding all potentially paralogous loci; exon only, including loci that were only potentially paralogous in 1-3 samples; exon plus intron regions (supercontigs); and the plastome CDS region. Phylogenetic relationships were reconstructed using both multispecies coalescent and concatenation (Maximum Likelihood and Bayesian analyses) approaches. Overall, our phylogenetic reconstructions recovered the same monophyletic major lineages found in previous studies and were successful in fully resolving the backbone phylogeny of the subtribe, while the internal resolution of the lineages was comparatively poor. The backbone topologies were largely congruent among all inferences, but some incongruent relationships were recovered between nuclear and plastome datasets, which are discussed and assumed to represent cases of cytonuclear discordance. Considering the newly resolved phylogenies, a new infrageneric classification of Scorzonera in its revised circumscription is proposed.
Collapse
Affiliation(s)
- Elham Hatami
- Department of Biology, Faculty of Science, Shahid Bahonar University of Kerman, Kerman, Iran
| | - Katy E. Jones
- Botanic Garden and Botanical Museum Berlin, Freie Universität Berlin, Berlin, Germany
| | - Norbert Kilian
- Botanic Garden and Botanical Museum Berlin, Freie Universität Berlin, Berlin, Germany
| |
Collapse
|
5
|
Gatesy J, Springer MS. Phylogenomic Coalescent Analyses of Avian Retroelements Infer Zero-Length Branches at the Base of Neoaves, Emergent Support for Controversial Clades, and Ancient Introgressive Hybridization in Afroaves. Genes (Basel) 2022; 13:genes13071167. [PMID: 35885951 PMCID: PMC9324441 DOI: 10.3390/genes13071167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2022] [Revised: 06/20/2022] [Accepted: 06/21/2022] [Indexed: 01/25/2023] Open
Abstract
Retroelement insertions (RIs) are low-homoplasy characters that are ideal data for addressing deep evolutionary radiations, where gene tree reconstruction errors can severely hinder phylogenetic inference with DNA and protein sequence data. Phylogenomic studies of Neoaves, a large clade of birds (>9000 species) that first diversified near the Cretaceous−Paleogene boundary, have yielded an array of robustly supported, contradictory relationships among deep lineages. Here, we reanalyzed a large RI matrix for birds using recently proposed quartet-based coalescent methods that enable inference of large species trees including branch lengths in coalescent units, clade-support, statistical tests for gene flow, and combined analysis with DNA-sequence-based gene trees. Genome-scale coalescent analyses revealed extremely short branches at the base of Neoaves, meager branch support, and limited congruence with previous work at the most challenging nodes. Despite widespread topological conflicts with DNA-sequence-based trees, combined analyses of RIs with thousands of gene trees show emergent support for multiple higher-level clades (Columbea, Passerea, Columbimorphae, Otidimorphae, Phaethoquornithes). RIs express asymmetrical support for deep relationships within the subclade Afroaves that hints at ancient gene flow involving the owl lineage (Strigiformes). Because DNA-sequence data are challenged by gene tree-reconstruction error, analysis of RIs represents one approach for improving gene tree-based methods when divergences are deep, internodes are short, terminal branches are long, and introgressive hybridization further confounds species−tree inference.
Collapse
Affiliation(s)
- John Gatesy
- Division of Vertebrate Zoology, American Museum of Natural History, New York, NY 10024, USA
- Correspondence:
| | - Mark S. Springer
- Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA;
| |
Collapse
|
6
|
Contradictory Phylogenetic Signals in the Laurasiatheria Anomaly Zone. Genes (Basel) 2022; 13:genes13050766. [PMID: 35627151 PMCID: PMC9141728 DOI: 10.3390/genes13050766] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2022] [Revised: 04/12/2022] [Accepted: 04/21/2022] [Indexed: 02/04/2023] Open
Abstract
Relationships among laurasiatherian clades represent one of the most highly disputed topics in mammalian phylogeny. In this study, we attempt to disentangle laurasiatherian interordinal relationships using two independent genome-level approaches: (1) quantifying retrotransposon presence/absence patterns, and (2) comparisons of exon datasets at the levels of nucleotides and amino acids. The two approaches revealed contradictory phylogenetic signals, possibly due to a high level of ancestral incomplete lineage sorting. The positions of Eulipotyphla and Chiroptera as the first and second earliest divergences were consistent across the approaches. However, the phylogenetic relationships of Perissodactyla, Cetartiodactyla, and Ferae, were contradictory. While retrotransposon insertion analyses suggest a clade with Cetartiodactyla and Ferae, the exon dataset favoured Cetartiodactyla and Perissodactyla. Future analyses of hitherto unsampled laurasiatherian lineages and synergistic analyses of retrotransposon insertions, exon and conserved intron/intergenic sequences might unravel the conflicting patterns of relationships in this major mammalian clade.
Collapse
|
7
|
Storer JM, Hubley R, Rosen J, Smit AFA. Methodologies for the De novo Discovery of Transposable Element Families. Genes (Basel) 2022; 13:709. [PMID: 35456515 PMCID: PMC9025800 DOI: 10.3390/genes13040709] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2022] [Revised: 04/14/2022] [Accepted: 04/15/2022] [Indexed: 02/07/2023] Open
Abstract
The discovery and characterization of transposable element (TE) families are crucial tasks in the process of genome annotation. Careful curation of TE libraries for each organism is necessary as each has been exposed to a unique and often complex set of TE families. De novo methods have been developed; however, a fully automated and accurate approach to the development of complete libraries remains elusive. In this review, we cover established methods and recent developments in de novo TE analysis. We also present various methodologies used to assess these tools and discuss opportunities for further advancement of the field.
Collapse
Affiliation(s)
| | | | | | - Arian F. A. Smit
- Institute for Systems Biology, Seattle, WA 98109, USA; (J.M.S.); (R.H.); (J.R.)
| |
Collapse
|
8
|
SINE-Based Phylogenomics Reveal Extensive Introgression and Incomplete Lineage Sorting in Myotis. Genes (Basel) 2022; 13:genes13030399. [PMID: 35327953 PMCID: PMC8951037 DOI: 10.3390/genes13030399] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2022] [Revised: 02/20/2022] [Accepted: 02/22/2022] [Indexed: 01/08/2023] Open
Abstract
Using presence/absence data from over 10,000 Ves SINE insertions, we reconstructed a phylogeny for 11 Myotis species. With nearly one-third of individual Ves gene trees discordant with the overall species tree, phylogenetic conflict appears to be rampant in this genus. From the observed conflict, we infer that ILS is likely a major contributor to the discordance. Much of the discordance can be attributed to the hypothesized split between the Old World and New World Myotis clades and with the first radiation of Myotis within the New World. Quartet asymmetry tests reveal signs of introgression between Old and New World taxa that may have persisted until approximately 8 MYA. Our introgression tests also revealed evidence of both historic and more recent, perhaps even contemporary, gene flow among Myotis species of the New World. Our findings suggest that hybridization likely played an important role in the evolutionary history of Myotis and may still be happening in areas of sympatry. Despite limitations arising from extreme discordance, our SINE-based phylogeny better resolved deeper relationships (particularly the positioning of M. brandtii) and was able to identify potential introgression pathways among the Myotis species sampled.
Collapse
|
9
|
Simmons MP, Springer MS, Gatesy J. Gene-tree misrooting drives conflicts in phylogenomic coalescent analyses of palaeognath birds. Mol Phylogenet Evol 2021; 167:107344. [PMID: 34748873 DOI: 10.1016/j.ympev.2021.107344] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Revised: 10/08/2021] [Accepted: 11/02/2021] [Indexed: 10/19/2022]
Abstract
Phylogenomic analyses of ancient rapid radiations can produce conflicting results that are driven by differential sampling of taxa and characters as well as the limitations of alternative analytical methods. We re-examine basal relationships of palaeognath birds (ratites and tinamous) using recently published datasets of nucleotide characters from 20,850 loci as well as 4301 retroelement insertions. The original studies attributed conflicting resolutions of rheas in their inferred coalescent and concatenation trees to concatenation failing in the anomaly zone. By contrast, we find that the coalescent-based resolution of rheas is premised upon extensive gene-tree estimation errors. Furthermore, retroelement insertions contain much more conflict than originally reported and multiple insertion loci support the basal position of rheas found in concatenation trees, while none were reported in the original publication. We demonstrate how even remarkable congruence in phylogenomic studies may be driven by long-branch misplacement of a divergent outgroup, highly incongruent gene trees, differential taxon sampling that can result in gene-tree misrooting errors that bias species-tree inference, and gross homology errors. What was previously interpreted as broad, robustly supported corroboration for a single resolution in coalescent analyses may instead indicate a common bias that taints phylogenomic results across multiple genome-scale datasets. The updated retroelement dataset now supports a species tree with branch lengths that suggest an ancient anomaly zone, and both concatenation and coalescent analyses of the huge nucleotide datasets fail to yield coherent, reliable results in this challenging phylogenetic context.
Collapse
Affiliation(s)
- Mark P Simmons
- Department of Biology, Colorado State University, Fort Collins, CO 80523, USA.
| | - Mark S Springer
- Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, CA 92521, USA
| | - John Gatesy
- Division of Vertebrate Zoology and Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY 10024, USA
| |
Collapse
|