1
|
Speeding up iterative applications of the BUILD supertree algorithm. PeerJ 2024; 12:e16624. [PMID: 38188165 PMCID: PMC10768670 DOI: 10.7717/peerj.16624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 11/16/2023] [Indexed: 01/09/2024] Open
Abstract
The Open Tree of Life (OToL) project produces a supertree that summarizes phylogenetic knowledge from tree estimates published in the primary literature. The supertree construction algorithm iteratively calls Aho's Build algorithm thousands of times in order to assess the compatibility of different phylogenetic groupings. We describe an incrementalized version of the Build algorithm that is able to share work between successive calls to Build. We provide details that allow a programmer to implement the incremental algorithm BuildInc, including pseudo-code and a description of data structures. We assess the effect of BuildInc on our supertree algorithm by analyzing simulated data and by analyzing a supertree problem taken from the OpenTree 13.4 synthesis tree. We find that BuildInc provides up to 550-fold speedup for our supertree algorithm.
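The compatibility check that Build performs can be illustrated with a minimal Python sketch of the classic, non-incremental algorithm on rooted triplets. This is not the authors' BuildInc and not code from otcetera; the function name `build`, the triplet encoding, and the nested-tuple tree output are assumptions made for illustration.

```python
def build(leaves, triplets):
    """Return a nested-tuple tree displaying all triplets, or None if they conflict.

    Each triplet (a, b, c) encodes ab|c: a and b share an ancestor that excludes c.
    (Illustrative encoding, not taken from the paper.)
    """
    leaves = frozenset(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only triplets whose three leaves are all in the current leaf set.
    active = [t for t in triplets if leaves.issuperset(t)]
    # Union-find over leaves: join a and b for every active triplet ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in active:
        parent[find(a)] = find(b)
    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None  # a single component means the triplets are incompatible
    subtrees = []
    for comp in sorted(components.values(), key=sorted):
        sub = build(comp, active)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


print(build("abcd", [("a", "b", "c"), ("c", "d", "a")]))
print(build("abc", [("a", "b", "c"), ("b", "c", "a")]))
```

Each call rebuilds the connected components from scratch, which is the repeated work that an incremental variant would aim to share between successive calls.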
|
2
|
OpenTree: A Python Package for Accessing and Analyzing Data from the Open Tree of Life. Syst Biol 2021; 70:1295-1301. [PMID: 33970279 PMCID: PMC8513759 DOI: 10.1093/sysbio/syab033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/15/2020] [Revised: 04/27/2021] [Accepted: 05/03/2021] [Indexed: 11/14/2022] Open
Abstract
The Open Tree of Life project constructs a comprehensive, dynamic, and digitally available tree of life by synthesizing published phylogenetic trees along with taxonomic data. Open Tree of Life provides web-service application programming interfaces (APIs) to make the tree estimate, unified taxonomy, and input phylogenetic data available to anyone. Here, we describe the Python package opentree, which provides a user-friendly Python wrapper for these APIs and a set of scripts and tutorials for straightforward downstream data analyses. We demonstrate the utility of these tools by generating an estimate of the phylogenetic relationships of all bird families, and by capturing a phylogenetic estimate for all taxa observed at the University of California Merced Vernal Pools and Grassland Reserve. [Evolution; open science; phylogenetics; Python; taxonomy.]
|
3
|
Incorporating the speciation process into species delimitation. PLoS Comput Biol 2021; 17:e1008924. [PMID: 33983918 PMCID: PMC8118268 DOI: 10.1371/journal.pcbi.1008924] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 07/22/2020] [Accepted: 03/29/2021] [Indexed: 11/22/2022] Open
Abstract
The "multispecies" coalescent (MSC) model that underlies many genomic species-delimitation approaches is problematic because it does not distinguish between genetic structure associated with species versus that of populations within species. Consequently, as both the genomic and spatial resolution of data increase, a proliferation of artifactual species results as within-species population lineages, detected due to restrictions in gene flow, are identified as distinct species. The toll of this extends beyond systematic studies, getting magnified across the many disciplines that rely upon an accurate framework of identified species. Here we present the first of a new class of approaches that addresses this issue by incorporating an extended speciation process for species delimitation. We model the formation of population lineages and their subsequent development into independent species as separate processes and provide for a way to incorporate current understanding of the species boundaries in the system through specification of species identities of a subset of population lineages. As a result, species boundaries and within-species lineage boundaries can be discriminated across the entire system, and species identities can be assigned to the remaining lineages of unknown affinities with quantified probabilities. In addition to the identification of species units in nature, the primary goal of species delimitation, the incorporation of a speciation model also allows us insights into the links between population and species-level processes. By explicitly accounting for restrictions in gene flow not only between, but also within, species, we also address the limits of genetic data for delimiting species. Specifically, while genetic data alone is not sufficient for accurate delimitation, when considered in conjunction with other information we are able to not only learn about species boundaries, but also about the tempo of the speciation process itself.
|
4
|
Genome-wide genotyping estimates mating system parameters and paternity in the island species Tolpis succulenta. American Journal of Botany 2020; 107:1189-1197. [PMID: 32864742 DOI: 10.1002/ajb2.1515] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/21/2019] [Accepted: 02/22/2020] [Indexed: 06/11/2023]
Abstract
PREMISE: The mating system has profound consequences, not only for ecology and evolution, but also for the conservation of threatened or endangered species. Unfortunately, small populations are difficult to study owing to limits on sample size and genetic marker diversity. Here, we estimated mating system parameters in three small populations of an island plant using genomic genotyping. Although self-incompatible (SI) species are known to often set some self-seed, little is known about how "leaky SI" affects selfing rates in nature or the role that multiple paternity plays in small populations. METHODS: We generalized the BORICE mating system program to determine the siring pattern within maternal families. We applied this algorithm to maternal families from three populations of Tolpis succulenta from Madeira Island and genotyped the progeny using RADseq. We applied BORICE to estimate each individual offspring as outcrossed or selfed, the paternity of each outcrossed offspring, and the level of inbreeding of each maternal plant. RESULTS: Despite a functional self-incompatibility system, these data establish T. succulenta as a pseudo-self-compatible (PSC) species. Two of 75 offspring were strongly indicated as products of self-fertilization. Despite selfing, all adult maternal plants were fully outbred. There was high differentiation among and low variation within populations, consistent with a history of genetic isolation of these small populations. There were generally multiple sires per maternal family. Twenty-two percent of sib contrasts (between outcrossed offspring within maternal families) shared the same sire. CONCLUSIONS: Genome-wide genotyping, combined with appropriate analytical methods, enables estimation of mating system and multiple paternity in small populations. These data address questions about the evolution of reproductive traits and the conservation of threatened populations.
|
5
|
A supertree pipeline for summarizing phylogenetic and taxonomic information for millions of species. PeerJ 2017; 5:e3058. [PMID: 28265520 PMCID: PMC5335690 DOI: 10.7717/peerj.3058] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 10/16/2016] [Accepted: 02/02/2017] [Indexed: 11/20/2022] Open
Abstract
We present a new supertree method that enables rapid estimation of a summary tree on the scale of millions of leaves. This supertree method summarizes a collection of input phylogenies and an input taxonomy. We introduce formal goals and criteria for such a supertree to satisfy in order to transparently and justifiably represent the input trees. In addition to producing a supertree, our method computes annotations that describe which groupings in the input trees support and conflict with each group in the supertree. We compare our supertree construction method to a previously published supertree construction method by assessing their performance on input trees used to construct the Open Tree of Life version 4, and find that our method increases the number of displayed input splits from 35,518 to 39,639 and decreases the number of conflicting input splits from 2,760 to 1,357. The new supertree method also improves on the previous supertree construction method in that it produces no unsupported branches and avoids unnecessary polytomies. This pipeline is currently used by the Open Tree of Life project to produce all of the versions of the project's "synthetic tree" starting at version 5. This software pipeline is called "propinquity". It relies heavily on "otcetera", a set of C++ tools that perform most of the steps of the pipeline. All of the components are free software and are available on GitHub.
|
6
|
Twisted trees and inconsistency of tree estimation when gaps are treated as missing data - The impact of model mis-specification in distance corrections. Mol Phylogenet Evol 2015; 93:289-95. [PMID: 26256643 DOI: 10.1016/j.ympev.2015.07.027] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 04/28/2015] [Revised: 07/09/2015] [Accepted: 07/21/2015] [Indexed: 10/23/2022]
Abstract
Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two "long branch repulsion" trees will be preferred over the true tree - though these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a "twisted Farris-zone" tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.
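To make the notion of a distance correction concrete, here is a small Python sketch using the Jukes-Cantor (JC69) model. JC69 is chosen here as a familiar example and is an assumption of this sketch, not the model analysed in the paper; the function names are hypothetical. A correctly specified correction recovers the true distance exactly, whereas the uncorrected proportion of differing sites is a concave function of the true distance.

```python
import math


def jc69_expected_p(d):
    """Expected proportion of differing sites at true JC69 distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """JC69-corrected distance for an observed proportion of differing sites p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for d in (0.1, 0.5, 1.0):
    p = jc69_expected_p(d)
    # The matching correction is linear in (here, equal to) the true distance.
    print(d, round(p, 4), round(jc69_distance(p), 4))
```

If the data were generated under a different process than the one assumed by the correction, the round trip would no longer return the true distance, which is the kind of mis-specification the abstract discusses.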
|
7
|
Phylesystem: a git-based data store for community-curated phylogenetic estimates. Bioinformatics 2015; 31:2794-800. [PMID: 25940563 PMCID: PMC4547614 DOI: 10.1093/bioinformatics/btv276] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 01/15/2015] [Accepted: 04/27/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct. RESULTS: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git's version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the 'phylesystem-api', which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements. AVAILABILITY AND IMPLEMENTATION: Source code for the web service layer is available at https://github.com/OpenTreeOfLife/phylesystem-api. The data store can be cloned from: https://github.com/OpenTreeOfLife/phylesystem. A web application that uses the phylesystem web services is deployed at http://tree.opentreeoflife.org/curator. Code for that tool is available from https://github.com/OpenTreeOfLife/opentree. CONTACT: mtholder@gmail.com.
|
8
|
Phycas: Software for Bayesian Phylogenetic Analysis. Syst Biol 2015; 64:525-31. [DOI: 10.1093/sysbio/syu132] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 08/13/2014] [Accepted: 12/24/2014] [Indexed: 12/15/2022] Open
|
9
|
An Algorithm for Calculating the Probability of Classes of Data Patterns on a Genealogy. PLOS Currents 2012; 4:e4fd1286980c08. [PMID: 23868168 PMCID: PMC3712476 DOI: 10.1371/4fd1286980c08] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Abstract
Felsenstein's pruning algorithm allows one to calculate the probability of any particular data pattern arising on a phylogeny given a model of character evolution. Here we present a similar dynamic programming algorithm. Our algorithm treats the tree and model as known. The algorithm makes it feasible to calculate the probability that a randomly selected character will be a member of a particular class of character patterns. Specifically, we are interested in binning patterns by the number of parsimony steps and the set of states observed at the tips of the tree. This algorithm was developed to expand the range of data set sizes that can be used with Waddell et al.'s marginal testing approach for assessing the adequacy of a model. The algorithms introduced can also be used in likelihood calculations which correct for ascertainment biases. For example, Lewis introduced an Mkv model which corrects for the lack of constant sites. The probability of a constant pattern arising can be calculated using the algorithm that we present, or by enumerating all possible constant patterns and calculating the probability of each one. Because the number of constant data patterns is small, both methods are efficient. However, elaborations of the Mkv model (such as those in Nylander et al.) require calculating the probability of parsimony-uninformative patterns arising. For large trees and characters with many possible character states, the number of possible parsimony-uninformative patterns is immense. In these cases, the algorithms introduced here will be more efficient. The algorithm has been implemented in open source software written in C++.
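The enumeration approach mentioned above for constant patterns can be sketched with a plain Felsenstein pruning pass under an Mk model. This Python sketch is an illustration only: it is not the authors' pattern-class algorithm or their C++ implementation, and the tree encoding, function names, branch lengths, and the uniform root distribution are all assumptions.

```python
import itertools
import math


def mk_transition(k, t):
    """Return (P(same state), P(one specific other state)) over branch length t under Mk."""
    decay = math.exp(-k * t / (k - 1))
    return 1.0 / k + (k - 1) / k * decay, (1.0 - decay) / k


def partials(node, pattern, k):
    """Pruning step: likelihood of the tip data below node, for each state at node."""
    if isinstance(node, str):  # a tip: 1 for the observed state, 0 otherwise
        return [1.0 if s == pattern[node] else 0.0 for s in range(k)]
    result = [1.0] * k
    for child, t in node:
        child_partial = partials(child, pattern, k)
        same, diff = mk_transition(k, t)
        for s in range(k):
            result[s] *= sum((same if s == c else diff) * child_partial[c] for c in range(k))
    return result


def pattern_probability(tree, pattern, k):
    """Probability of one tip pattern, assuming a uniform root-state distribution."""
    return sum(partials(tree, pattern, k)) / k


# Hypothetical three-tip tree: an internal node is a list of (child, branch_length) pairs.
tree = [([("A", 0.1), ("B", 0.2)], 0.05), ("C", 0.3)]
tips = ["A", "B", "C"]
k = 3

# Enumerate the k constant patterns and add up their probabilities.
p_constant = sum(pattern_probability(tree, dict.fromkeys(tips, s), k) for s in range(k))
# Sanity check: probabilities over all k**3 patterns should sum to 1.
p_total = sum(
    pattern_probability(tree, dict(zip(tips, states)), k)
    for states in itertools.product(range(k), repeat=len(tips))
)
print(p_constant, p_total)
```

Enumerating only the k constant patterns stays cheap, but enumerating every parsimony-uninformative pattern in the same way grows quickly with the number of tips and states, which is the case the abstract's algorithm targets.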
|
10
|
Evidence for climate-driven diversification? A caution for interpreting ABC inferences of simultaneous historical events. Evolution 2012; 67:991-1010. [PMID: 23550751 DOI: 10.1111/j.1558-5646.2012.01840.x] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Indexed: 11/26/2022]
Abstract
Approximate Bayesian computation (ABC) is rapidly gaining popularity in population genetics. One example, msBayes, infers the distribution of divergence times among pairs of taxa, allowing phylogeographers to test hypotheses about historical causes of diversification in co-distributed groups of organisms. Using msBayes, we infer the distribution of divergence times among 22 pairs of populations of vertebrates distributed across the Philippine Archipelago. Our objective was to test whether sea-level oscillations during the Pleistocene caused diversification across the islands. To guide interpretation of our results, we perform a suite of simulation-based power analyses. Our empirical results strongly support a recent simultaneous divergence event for all 22 taxon pairs, consistent with the prediction of the Pleistocene-driven diversification hypothesis. However, our empirical estimates are sensitive to changes in prior distributions, and our simulations reveal low power of the method to detect random variation in divergence times and bias toward supporting clustered divergences. Our results demonstrate that analyses exploring power and prior sensitivity should accompany ABC model selection inferences. The problems we identify are potentially mitigable with uniform priors over divergence models (rather than classes of models) and more flexible prior distributions on demographic and divergence-time parameters.
|
11
|
Phylogenetic assessment of filoviruses: how many lineages of Marburg virus? Ecol Evol 2012; 2:1826-33. [PMID: 22957185 PMCID: PMC3433987 DOI: 10.1002/ece3.297] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 03/02/2012] [Revised: 05/07/2012] [Accepted: 05/08/2012] [Indexed: 11/14/2022] Open
Abstract
Filoviruses have to date been considered as consisting of one diverse genus (Ebola viruses) and one undifferentiated genus (Marburg virus). We reconsider this idea by means of detailed phylogenetic analyses of sequence data available for the Filoviridae: using coalescent simulations, we ascertain that two Marburg isolates (termed the "RAVN" strain) represent a quite-distinct lineage that should be considered in studies of biogeography and host associations, and may merit recognition at the level of species. In contrast, filovirus isolates recently obtained from bat tissues are not distinct from previously known strains, and should be considered as drawn from the same population. Implications for understanding the transmission geography and host associations of these viruses are discussed.
|
12
|
NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Syst Biol 2012; 61:675-89. [PMID: 22357728 PMCID: PMC3376374 DOI: 10.1093/sysbio/sys025] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Received: 05/16/2011] [Revised: 07/29/2011] [Accepted: 02/07/2012] [Indexed: 12/13/2022] Open
Abstract
In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input-output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.
|
13
|
The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 2012; 21:769-85. [PMID: 22528593 PMCID: PMC3403413 DOI: 10.1002/pro.2071] [Citation(s) in RCA: 140] [Impact Index Per Article: 11.7] [Received: 03/02/2012] [Revised: 03/22/2012] [Accepted: 03/23/2012] [Indexed: 12/20/2022]
Abstract
The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundations for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy and the state of the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion and domain rearrangements and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics as a deviation from static considerations of protein structure are also important. Protein expression level is known to be a major determinant of evolutionary rate and several considerations including selection at the mRNA level and the role of interaction specificity are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry are presented, towards an aim of ultimately generating better models for biological inference and prediction.
|
14
|
SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol 2011; 61:90-106. [PMID: 22139466 DOI: 10.1093/sysbio/syr095] [Citation(s) in RCA: 220] [Impact Index Per Article: 16.9] [Indexed: 11/14/2022] Open
Abstract
Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. 
First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.
|
15
|
Abstract
We introduce a new model for relaxing the assumption of a strict molecular clock for use as a prior in Bayesian methods for divergence time estimation. Lineage-specific rates of substitution are modeled using a Dirichlet process prior (DPP), a type of stochastic process that assumes lineages of a phylogenetic tree are distributed into distinct rate classes. Under the Dirichlet process, the number of rate classes, assignment of branches to rate classes, and the rate value associated with each class are treated as random variables. The performance of this model was evaluated by conducting analyses on data sets simulated under a range of different models. We compared the Dirichlet process model with two alternative models for rate variation: the strict molecular clock and the independent rates model. Our results show that divergence time estimation under the DPP provides robust estimates of node ages and branch rates without significantly reducing power. Further analyses were conducted on a biological data set, and we provide examples of ways to summarize Markov chain Monte Carlo samples under this model.
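How a Dirichlet process prior partitions branches into rate classes can be sketched with the Chinese restaurant process, one standard way of sampling from it. This Python sketch is an illustration under stated assumptions: the function names, the concentration parameter value, and the gamma base distribution for class rates are hypothetical choices, not details taken from the paper.

```python
import random


def sample_rate_classes(n_branches, alpha, rng):
    """Assign branches to rate classes with the Chinese restaurant process."""
    assignments = []
    class_sizes = []
    for _ in range(n_branches):
        # Existing class j has weight class_sizes[j]; opening a new class has weight alpha.
        weights = class_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(class_sizes):
            class_sizes.append(1)
        else:
            class_sizes[choice] += 1
        assignments.append(choice)
    return assignments


def sample_branch_rates(n_branches, alpha, rng, shape=2.0, scale=0.5):
    """Draw one rate per class from an assumed gamma base distribution."""
    assignments = sample_rate_classes(n_branches, alpha, rng)
    class_rates = [rng.gammavariate(shape, scale) for _ in range(max(assignments) + 1)]
    return assignments, [class_rates[c] for c in assignments]


rng = random.Random(1)
classes, rates = sample_branch_rates(10, alpha=1.0, rng=rng)
print(classes)
print([round(r, 3) for r in rates])
```

The number of classes is not fixed in advance: it emerges from the draws, and larger values of `alpha` tend to produce more classes. A full analysis would sample these assignments jointly with the tree and node ages by Markov chain Monte Carlo rather than drawing them once.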
|
16
|
BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics. Syst Biol 2011; 61:170-3. [PMID: 21963610 PMCID: PMC3243739 DOI: 10.1093/sysbio/syr100] [Citation(s) in RCA: 366] [Impact Index Per Article: 28.2] [Indexed: 11/14/2022] Open
Abstract
Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.
|
17
|
Estimating phylogenetic trees from pairwise likelihoods and posterior probabilities of substitution counts. J Theor Biol 2011; 280:159-66. [PMID: 21540039 DOI: 10.1016/j.jtbi.2011.04.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/14/2010] [Revised: 02/20/2011] [Accepted: 04/08/2011] [Indexed: 10/18/2022]
Abstract
The field of phylogenetic tree estimation has been dominated by three broad classes of methods: distance-based approaches, parsimony and likelihood-based methods (including maximum likelihood (ML) and Bayesian approaches). Here we introduce two new approaches to tree inference: pairwise likelihood estimation and a distance-based method that estimates the number of substitutions along the paths through the tree. Our results include the derivation of the formulae for the probability that two leaves will be identical at a site given a number of substitutions along the path connecting them. We also derive the posterior probability of the number of substitutions along a path between two sequences. The calculations for the posterior probabilities are exact for group-based, symmetric models of character evolution, but are only approximate for more general models.
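For intuition about the probability that two leaves are identical at a site given the number of substitutions on the path between them, consider a fully symmetric k-state model in which each substitution moves to one of the other k - 1 states with equal probability. That model is an assumption of this Python sketch, and the closed form below is the standard result for it rather than the paper's own derivation; the function names are hypothetical.

```python
def prob_identical(n_subs, k=4):
    """Closed form: P(both ends of a path match | n_subs substitutions), symmetric k-state model."""
    return 1.0 / k + (k - 1) / k * (-1.0 / (k - 1)) ** n_subs


def prob_identical_by_recursion(n_subs, k=4):
    """Same quantity, built up one substitution at a time as a cross-check."""
    same = 1.0  # with zero substitutions the two ends are certainly identical
    for _ in range(n_subs):
        # To match after one more substitution, the ends must currently differ
        # and the substitution must land on the matching one of k - 1 targets.
        same = (1.0 - same) / (k - 1)
    return same


for n in range(6):
    print(n, prob_identical(n), prob_identical_by_recursion(n))
```

The probability is 1 for zero substitutions, 0 for exactly one, and oscillates toward 1/k as the count grows, which is why an observed match or mismatch carries information about the substitution count.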
|
18
|
What's in a Likelihood? Simple Models of Protein Evolution and the Contribution of Structurally Viable Reconstructions to the Likelihood. Syst Biol 2011; 60:161-74. [DOI: 10.1093/sysbio/syq088] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022] Open
|
19
|
Abstract
We present Ginkgo, a software package for agent-based, forward-time simulations of genealogies of multiple unlinked loci from diploid populations. Ginkgo simulates the evolution of one or more species on a spatially explicit landscape of cells. The user of the software can specify the geographical and environmental characteristics of the landscape, and these properties can change according to a prespecified schedule. The geographical elements modelled include the arrangement of cells and movement rates between particular cells. Each species has a function that can calculate a fitness score for any combination of an individual organism's phenotype and environmental characteristics. The user can control the number of fitness factors (the dimensionality of the cell-specific fitness factors and the individuals' phenotypic vectors) and the weighting of each of these dimensions in the fitness calculation. Cell-specific fitness trait optima can be specified across the landscape to mimic differences in habitat. In addition to their differing fitness functions, species can differ in terms of their vagility and fecundity. Genealogies and occurrence data can be produced at any time during the simulation in NEXUS and ESRI ASCII Grid formats, respectively.
|
20
|
The phylogenetic position of Myxozoa: exploring conflicting signals in phylogenomic and ribosomal data sets. Mol Biol Evol 2010; 27:2733-46. [PMID: 20576761 DOI: 10.1093/molbev/msq159] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
Myxozoans are a diverse group of microscopic endoparasites that have been the focus of much controversy regarding their phylogenetic position. Two dramatically different hypotheses have been put forward regarding the placement of Myxozoa within Metazoa. One hypothesis, supported by ribosomal DNA (rDNA) data, places Myxozoa as a sister taxon to Bilateria. The alternative hypothesis, supported by phylogenomic data and morphology, places Myxozoa within Cnidaria. Here, we investigate these conflicting hypotheses and explore the effects of missing data, model choice, and inference methods, all of which can have an effect in placing highly divergent taxa. In addition, we identify subsets of the data that most influence the placement of Myxozoa and explore their effects by removing them from the data sets. Assembling the largest taxonomic sampling of myxozoans and cnidarians to date, with a comprehensive sampling of other metazoans for 18S and 28S nuclear rDNA sequences, we recover a well-supported placement of Myxozoa as an early diverging clade of Bilateria. By conducting parametric bootstrapping, we find that the bilaterian placement of Buddenbrockia could not alone be explained by long-branch attraction. After trimming a published phylogenomic data set, to circumvent problems of missing data, we recover the myxozoan Buddenbrockia plumatellae as a medusozoan cnidarian. In further explorations of these data sets, we find that removal of just a few identified sites under a maximum likelihood criterion employing the Whelan and Goldman amino acid substitution model changes the placement of Buddenbrockia from within Cnidaria to the alternative hypothesis at the base of Bilateria. Under a Bayesian criterion employing the CAT model, the cnidarian placement is more resilient to data removal, but under one test, a well-supported early diverging bilaterian position for Buddenbrockia is recovered.
Our results confirm the existence of two relatively stable placements for myxozoans and demonstrate that conflicting signal exists not only between the two types of data but also within the phylogenomic data set. These analyses underscore the importance of careful model selection, taxon and data sampling, and in-depth data exploration when investigating the phylogenetic placement of highly divergent taxa.
|
21
|
|
22
|
The Akaike information criterion will not choose the no common mechanism model. Syst Biol 2010; 59:477-85. [PMID: 20547783 DOI: 10.1093/sysbio/syq028] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
23
|
Abstract
DendroPy is a cross-platform library for the Python programming language that provides for object-oriented reading, writing, simulation and manipulation of phylogenetic data, with an emphasis on phylogenetic tree operations. DendroPy uses a splits-hash mapping to perform rapid calculations of tree distances, similarities and shape under various metrics. It contains rich simulation routines to generate trees under a number of different phylogenetic and coalescent models. DendroPy's data simulation and manipulation facilities, in conjunction with its support of a broad range of phylogenetic data formats (NEXUS, Newick, PHYLIP, FASTA, NeXML, etc.), allow it to serve a useful role in various phyloinformatics and phylogeographic pipelines. Availability: The stable release of the library is available for download and automated installation through the Python Package Index site (http://pypi.python.org/pypi/DendroPy), while the active development source code repository is available to the public from GitHub (http://github.com/jeetsukumaran/DendroPy).
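The idea of hashing splits to compare trees quickly can be sketched in plain Python. The example below is not DendroPy's implementation and does not use its API; it is a minimal standalone illustration in which each bipartition is encoded as an integer bitmask, stored in a set, and the Robinson-Foulds distance is the size of the symmetric difference of the two sets. The nested-tuple tree representation and the function names `split_masks` and `rf_distance` are assumptions made for this sketch, and both trees are assumed to contain exactly the same taxa.

```python
def split_masks(tree, index):
    """Collect the non-trivial bipartitions of a nested-tuple tree as bitmasks.

    `index` maps each taxon name to a bit position. Every mask is normalized so
    that the taxon at bit 0 is on the "0" side, which makes a bipartition and
    its complement hash to the same integer.
    """
    full = (1 << len(index)) - 1
    splits = set()

    def visit(node):
        if isinstance(node, str):  # leaf: a single taxon
            return 1 << index[node]
        mask = 0
        for child in node:
            mask |= visit(child)
        normalized = mask if mask & 1 == 0 else full ^ mask
        # Keep only splits with at least two taxa on each side.
        inside = bin(normalized).count("1")
        if inside >= 2 and len(index) - inside >= 2:
            splits.add(normalized)
        return mask

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b, taxa):
    """Unweighted Robinson-Foulds distance between two trees on the same taxa."""
    index = {name: i for i, name in enumerate(sorted(taxa))}
    return len(split_masks(tree_a, index) ^ split_masks(tree_b, index))


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(t1, t2, taxa))
```

Because each split is a single integer, set membership tests are constant time on average, so comparing two trees costs time roughly proportional to the number of nodes.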
|
24
|
Evaluating the robustness of phylogenetic methods to among-site variability in substitution processes. Philos Trans R Soc Lond B Biol Sci 2008; 363:4013-21. [PMID: 18852108 PMCID: PMC2607409 DOI: 10.1098/rstb.2008.0162] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Computer simulations provide a flexible method for assessing the power and robustness of phylogenetic inference methods. Unfortunately, simulated data are often obviously atypical of data encountered in studies of molecular evolution. Unrealistic simulations can lead to conclusions that are irrelevant to real-data analyses or can provide a biased view of which methods perform well. Here, we present a software tool designed to generate data under a complex codon model that allows each residue in the protein sequence to have a different set of equilibrium amino acid frequencies. The software can obtain maximum-likelihood estimates of the parameters of the Halpern and Bruno model from empirical data and a fixed tree; given an arbitrary tree and a fixed set of parameters, the software can then simulate artificial datasets. We present the results of a simulation experiment using randomly generated tree shapes and substitution parameters estimated from 1610 mammalian cytochrome b sequences. We tested tree inference at the amino acid, nucleotide and codon levels and under parsimony, maximum-likelihood, Bayesian and distance criteria (for a total of more than 650 analyses on each dataset). Based on these simulations, nucleotide-level analyses seem to be more accurate than amino acid and codon analyses. The performance of distance-based phylogenetic methods appears to be quite sensitive to the choice of model and the form of rate heterogeneity used. Further studies are needed to assess the generality of these conclusions. For example, fitting parameters of the Halpern and Bruno model to sequences from other genes will reveal the extent to which our conclusions were influenced by the choice of cytochrome b. Incorporating codon bias and more sources of heterogeneity into the simulator will be crucial to determining whether the current results are caused by a bias in the current simulation study in favour of nucleotide analyses.
|
25
|
A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics. Syst Biol 2008; 57:814-21. [PMID: 18853367 DOI: 10.1080/10635150802422308] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022] Open
|
26
|
The Posterior and the Prior in Bayesian Phylogenetics. ANNUAL REVIEW OF ECOLOGY EVOLUTION AND SYSTEMATICS 2006. [DOI: 10.1146/annurev.ecolsys.37.091305.110021] [Citation(s) in RCA: 142] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
27
|
|
28
|
Abstract
Bayesian phylogenetic analyses are now very popular in systematics and molecular evolution because they allow the use of much more realistic models than currently possible with maximum likelihood methods. There are, however, a growing number of examples in which large Bayesian posterior clade probabilities are associated with very short branch lengths and low values for non-Bayesian measures of support such as nonparametric bootstrapping. For the four-taxon case when the true tree is the star phylogeny, Bayesian analyses become increasingly unpredictable in their preference for one of the three possible resolved tree topologies as data set size increases. This leads to the prediction that hard (or near-hard) polytomies in nature will cause unpredictable behavior in Bayesian analyses, with arbitrary resolutions of the polytomy receiving very high posterior probabilities in some cases. We present a simple solution to this problem involving a reversible-jump Markov chain Monte Carlo (MCMC) algorithm that allows exploration of all of tree space, including unresolved tree topologies with one or more polytomies. The reversible-jump MCMC approach allows prior distributions to place some weight on less-resolved tree topologies, which eliminates misleadingly high posteriors associated with arbitrary resolutions of hard polytomies. Fortunately, assigning some prior probability to polytomous tree topologies does not appear to come with a significant cost in terms of the ability to assess the level of support for edges that do exist in the true tree. Methods are discussed for applying arbitrary prior distributions to tree topologies of varying resolution, and an empirical example showing evidence of polytomies is analyzed and discussed.
|
29
|
Abstract
We investigated the usefulness of a parallel genetic algorithm for phylogenetic inference under the maximum-likelihood (ML) optimality criterion. Parallelization was accomplished by assigning each "individual" in the genetic algorithm "population" to a separate processor so that the number of processors used was equal to the size of the evolving population (plus one additional processor for the control of operations). The genetic algorithm incorporated branch-length and topological mutation, recombination, selection on the ML score, and (in some cases) migration and recombination among subpopulations. We tested this parallel genetic algorithm with large (228 taxa) data sets of both empirically observed DNA sequence data (for angiosperms) and simulated DNA sequence data. For both observed and simulated data, search-time improvement was nearly linear with respect to the number of processors, so the parallelization strategy appears to be highly effective at improving computation time for large phylogenetic problems using the genetic algorithm. We also explored various ways of optimizing and tuning the parameters of the genetic algorithm. Under the conditions of our analyses, we did not find the best-known solution using the genetic algorithm approach before terminating each run. We discuss some possible limitations of the current implementation of this genetic algorithm as well as avenues for its future improvement.
|
30
|
|
31
|
Abstract
During the period of September 1997 through July 1998, two coelacanth fishes were captured off Manado Tua Island, Sulawesi, Indonesia. These specimens were caught almost 10,000 km from the only other known population of living coelacanths, Latimeria chalumnae, near the Comores. The Indonesian fish was described recently as a new species, Latimeria menadoensis, based on morphological differentiation and DNA sequence divergence in fragments of the cytochrome b and 12S rRNA genes. We have obtained the sequence of 4,823 bp of mitochondrial DNA from the same specimen, including the entire genes for cytochrome b, 12S rRNA, 16S rRNA, four tRNAs, and the control region. The sequence is 4.1% different from the published sequence of an animal captured from the Comores, indicating substantial divergence between the Indonesian and Comorean populations. Nine morphological and meristic differences are purported to distinguish L. menadoensis and L. chalumnae, based on comparison of a single specimen of L. menadoensis to a description of five individuals of L. chalumnae from the Comores. A survey of the literature provided data on 4 of the characters used to distinguish L. menadoensis from L. chalumnae from an additional 16 African coelacanths; for all 4 characters, the Indonesian sample was within the range of variation reported for the African specimens. Nonetheless, L. chalumnae and L. menadoensis appear to be separate species based on divergence of mitochondrial DNA.
|
32
|
Caprine microsatellite dinucleotide repeat polymorphisms at the SR-CRSP21, SR-CRSP22, SR-CRSP23, SR-CRSP24, SR-CRSP25, SR-CRSP26 and SR-CRSP27 loci. Anim Genet 1997; 28:380-1. [PMID: 9363617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
|
33
|
Bovine microsatellite dinucleotide repeat polymorphisms at the TEXAN11, TEXAN12, TEXAN13, TEXAN14 and TEXAN15 loci. Anim Genet 1995; 26:201-2. [PMID: 7793692 DOI: 10.1111/j.1365-2052.1995.tb03165.x] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
|
34
|
Bovine microsatellite dinucleotide repeat polymorphisms at the TEXAN16, TEXAN17, TEXAN18, TEXAN19 and TEXAN20 loci. Anim Genet 1995; 26:208-9. [PMID: 7793700 DOI: 10.1111/j.1365-2052.1995.tb03174.x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
|
35
|
Bovine microsatellite mononucleotide and dinucleotide repeat polymorphisms at the TEXAN6, TEXAN7, TEXAN8, TEXAN9 and TEXAN10 loci. Anim Genet 1995; 26:128-9. [PMID: 7733502 DOI: 10.1111/j.1365-2052.1995.tb02654.x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
|
36
|
Bovine microsatellite dinucleotide repeat polymorphisms at the TEXAN-1, TEXAN-2, TEXAN-3, TEXAN-4 and TEXAN-5 loci. Anim Genet 1994; 25:201. [PMID: 7943968 DOI: 10.1111/j.1365-2052.1994.tb00123.x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
|