Reference Citation Analysis

For: Dasarathy G, Nowak R, Roch S. Data Requirement for Phylogenetic Inference from Multiple Loci: A New Distance Method. IEEE/ACM Trans Comput Biol Bioinform 2015;12:422-432. [PMID: 26357228 DOI: 10.1109/tcbb.2014.2361685] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 06/05/2023]

Cited by Other Article(s)

Hill M, Legried B, Roch S. Species tree estimation under joint modeling of coalescence and duplication: Sample complexity of quartet methods. Ann Appl Probab 2022. [DOI: 10.1214/22-aap1799] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]

Dasarathy G, Mossel E, Nowak R, Roch S. A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements. J Math Biol 2022;84:36. [PMID: 35394192 PMCID: PMC9258723 DOI: 10.1007/s00285-022-01731-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/04/2020] [Revised: 02/15/2022] [Accepted: 02/17/2022] [Indexed: 10/18/2022]

Abstract

Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated with each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard but unsatisfactory assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.

Identifiability of species network topologies from genomic sequences using the logDet distance. J Math Biol 2022;84:35. [PMID: 35385988 DOI: 10.1007/s00285-022-01734-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/03/2021] [Revised: 01/12/2022] [Accepted: 03/02/2022] [Indexed: 10/18/2022]

Sanderson MJ, Búrquez A, Copetti D, McMahon MM, Zeng Y, Wojciechowski MF. Origin and diversification of the saguaro cactus (Carnegiea gigantea): a within-species phylogenomic analysis. Syst Biol 2022;71:1178-1194. [PMID: 35244183 DOI: 10.1093/sysbio/syac017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2020] [Revised: 02/18/2022] [Accepted: 02/25/2022] [Indexed: 11/14/2022]

Abstract

Reconstructing accurate historical relationships within a species poses numerous challenges, not least in many plant groups in which gene flow is high enough to extend well beyond species boundaries. Nonetheless, the extent of tree-like history within a species is an empirical question on which it is now possible to bring large amounts of genome sequence to bear. We assess phylogenetic structure across the geographic range of the saguaro cactus, an emblematic member of Cactaceae, a clade known for extensive hybridization and porous species boundaries. Using 200 Gb of whole genome resequencing data from 20 individuals sampled from 10 localities, we assembled two data sets comprising 150,000 biallelic single nucleotide polymorphisms (SNPs) from protein coding sequences. From these we inferred within-species trees and evaluated their significance and robustness using five qualitatively different inference methods. Despite the low sequence diversity, large census population sizes, and presence of wide-ranging pollen and seed dispersal agents, phylogenetic trees were well resolved and highly consistent across both data sets and all methods. We inferred that the most likely root, based on marginal likelihood comparisons, is to the east and south of the region of highest genetic diversity, which lies along the coast of the Gulf of California in Sonora, Mexico. Together with striking decreases in marginal likelihood found to the north, this supports hypotheses that saguaro's current range reflects post-glacial expansion from the refugia in the south of its range. We conclude with observations about practical and theoretical issues raised by phylogenomic data sets within species, in which SNP-based methods must be used rather than gene tree methods that are widely used when sequence divergence is higher. These include computational scalability, inference of gene flow, and proper assessment of statistical support in the presence of linkage effects.

Rabiee M, Mirarab S. OUP accepted manuscript. Bioinformatics 2022;38:i413-i421. [PMID: 35758818 PMCID: PMC9235488 DOI: 10.1093/bioinformatics/btac265] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]

Mirarab S, Nakhleh L, Warnow T. Multispecies Coalescent: Theory and Applications in Phylogenetics. Annual Review of Ecology, Evolution, and Systematics 2021. [DOI: 10.1146/annurev-ecolsys-012121-095340] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/09/2022]

New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet 2021;37:174-187. [DOI: 10.1016/j.tig.2020.08.012] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 07/08/2020] [Revised: 08/13/2020] [Accepted: 08/19/2020] [Indexed: 12/18/2022]

Molloy EK, Warnow T. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol Biol 2019;14:14. [PMID: 31360216 PMCID: PMC6642500 DOI: 10.1186/s13015-019-0151-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 05/05/2019] [Accepted: 06/13/2019] [Indexed: 12/26/2022]

Abstract

BACKGROUND

Divide-and-conquer methods, which divide the species set into overlapping subsets, construct a tree on each subset, and then combine the subset trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of such approaches.

RESULTS

In this paper, we introduce a divide-and-conquer approach that does not require supertree estimation: we divide the species set into pairwise disjoint subsets, construct a tree on each subset using a base method, and then combine the subset trees using a distance matrix. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of Neighbor Joining (NJ); thus, NJMerge can be viewed either as a method for improving traditional NJ or as a method for scaling the base method to larger datasets. We prove that NJMerge can be used to create divide-and-conquer pipelines that are statistically consistent under some models of evolution. We also report the results of an extensive simulation study evaluating NJMerge on multi-locus datasets with up to 1000 species. We found that NJMerge sometimes improved the accuracy of traditional NJ and substantially reduced the running time of three popular species tree methods (ASTRAL-III, SVDquartets, and "concatenation" using RAxML) without sacrificing accuracy. Finally, although NJMerge can fail to return a tree, in our experiments, NJMerge failed on only 11 out of 2560 test cases.

CONCLUSIONS

Theoretical and empirical results suggest that NJMerge is a valuable technique for large-scale phylogeny estimation, especially when computational resources are limited. NJMerge is freely available on GitHub (http://github.com/ekmolloy/njmerge).

Roch S, Nute M, Warnow T. Long-Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology-Based Summary Methods. Syst Biol 2019;68:281-297. [PMID: 30247732 DOI: 10.1093/sysbio/syy061] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 03/16/2018] [Accepted: 09/12/2018] [Indexed: 11/13/2022]

Allman ES, Long C, Rhodes JA. Species Tree Inference from Genomic Sequences Using the Log-Det Distance. SIAM Journal on Applied Algebra and Geometry 2019;3:107-127. [PMID: 33163826 PMCID: PMC7643864 DOI: 10.1137/18m1194134] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 05/28/2023]

Sarmashghi S, Bohmann K, Gilbert MTP, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol 2019;20:34. [PMID: 30760303 PMCID: PMC6374904 DOI: 10.1186/s13059-019-1632-4] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 03/04/2018] [Accepted: 01/16/2019] [Indexed: 01/10/2023]

Malinsky M, Svardal H, Tyers AM, Miska EA, Genner MJ, Turner GF, Durbin R. Whole-genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. Nat Ecol Evol 2018;2:1940-1955. [PMID: 30455444 PMCID: PMC6443041 DOI: 10.1038/s41559-018-0717-x] [Citation(s) in RCA: 257] [Impact Index Per Article: 42.8] [Received: 12/04/2017] [Accepted: 10/10/2018] [Indexed: 12/30/2022]

Shekhar S, Roch S, Mirarab S. Species Tree Estimation Using ASTRAL: How Many Genes Are Enough? IEEE/ACM Trans Comput Biol Bioinform 2018;15:1738-1747. [PMID: 28976320 DOI: 10.1109/tcbb.2017.2757930] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 06/07/2023]

SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Mol Phylogenet Evol 2018. [DOI: 10.1016/j.ympev.2018.03.006] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Indexed: 11/22/2022]

Nute M, Chou J, Molloy EK, Warnow T. The performance of coalescent-based species tree estimation methods under models of missing data. BMC Genomics 2018;19:286. [PMID: 29745854 PMCID: PMC5998899 DOI: 10.1186/s12864-018-4619-8] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Indexed: 11/25/2022]

Durden C, Sullivant S. Identifiability of Phylogenetic Parameters from k-mer Data Under the Coalescent. Bull Math Biol 2018;81:431-451. [PMID: 29392644 DOI: 10.1007/s11538-018-0399-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2017] [Accepted: 01/19/2018] [Indexed: 11/30/2022]

Mossel E, Roch S. Distance-based species tree estimation under the coalescent: Information-theoretic trade-off between number of loci and sequence length. Ann Appl Probab 2017. [DOI: 10.1214/16-aap1273] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]

Molloy EK, Warnow T. To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods. Syst Biol 2017;67:285-303. [DOI: 10.1093/sysbio/syx077] [Citation(s) in RCA: 138] [Impact Index Per Article: 19.7] [Received: 12/08/2016] [Accepted: 09/13/2017] [Indexed: 01/27/2023]

Bhattacharyya S, Mukherjee J. IDXL: Species Tree Inference Using Internode Distance and Excess Gene Leaf Count. J Mol Evol 2017;85:57-78. [PMID: 28835989 DOI: 10.1007/s00239-017-9807-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/22/2016] [Accepted: 08/09/2017] [Indexed: 11/28/2022]

Rusinko J, McPartlon M. Species tree estimation using Neighbor Joining. J Theor Biol 2017;414:5-7. [DOI: 10.1016/j.jtbi.2016.11.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 08/04/2016] [Revised: 10/28/2016] [Accepted: 11/03/2016] [Indexed: 12/21/2022]

Sayyari E, Mirarab S. Anchoring quartet-based phylogenetic distances and applications to species tree reconstruction. BMC Genomics 2016;17:783. [PMID: 28185574 PMCID: PMC5123309 DOI: 10.1186/s12864-016-3098-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Indexed: 11/10/2022]

Uricchio LH, Warnow T, Rosenberg NA. An analytical upper bound on the number of loci required for all splits of a species tree to appear in a set of gene trees. BMC Bioinformatics 2016;17:417. [PMID: 28185570 PMCID: PMC5123308 DOI: 10.1186/s12859-016-1266-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022]

Vachaspati P, Warnow T. ASTRID: Accurate Species TRees from Internode Distances. BMC Genomics 2015;16 Suppl 10:S3. [PMID: 26449326 PMCID: PMC4602181 DOI: 10.1186/1471-2164-16-s10-s3] [Citation(s) in RCA: 129] [Impact Index Per Article: 14.3] [Indexed: 12/15/2022]

Bayzid MS, Mirarab S, Boussau B, Warnow T. Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses. PLoS One 2015;10:e0129183. [PMID: 26086579 PMCID: PMC4472720 DOI: 10.1371/journal.pone.0129183] [Citation(s) in RCA: 84] [Impact Index Per Article: 9.3] [Received: 12/16/2014] [Accepted: 05/05/2015] [Indexed: 11/19/2022]

Abstract

Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.

Warnow T. Concatenation Analyses in the Presence of Incomplete Lineage Sorting. PLoS Curr 2015;7:ecurrents.currents.tol.8d41ac0f13d1abedf4c4a59f5d17b1f7. [PMID: 26064786 PMCID: PMC4450984 DOI: 10.1371/currents.tol.8d41ac0f13d1abedf4c4a59f5d17b1f7] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Indexed: 11/19/2022]