1
|
Roestel JA, Wiersema JH, Jansen RK, Borsch T, Gruenstaeudl M. On the importance of sequence alignment inspections in plastid phylogenomics - an example from revisiting the relationships of the water-lilies. Cladistics 2024; 40:469-495. [PMID: 38761095 DOI: 10.1111/cla.12584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Revised: 04/27/2024] [Accepted: 04/29/2024] [Indexed: 05/20/2024] Open
Abstract
The water-lily clade represents the second earliest-diverging branch of angiosperms. Most of its species belong to Nymphaeaceae, of which the "core Nymphaeaceae"-comprising the genera Euryale, Nymphaea and Victoria-is the most diverse clade. Despite previous molecular phylogenetic studies on the core Nymphaeaceae, various aspects of their evolutionary relationships have remained unresolved. The length-variable introns and intergenic spacers are known to contain most of the sequence variability within the water-lily plastomes. Despite the challenges with multiple sequence alignment, any new molecular phylogenetic investigation on the core Nymphaeaceae should focus on these noncoding plastome regions. For example, a new plastid phylogenomic study on the core Nymphaeaceae should generate DNA sequence alignments of all plastid introns and intergenic spacers based on the principle of conserved sequence motifs. In this investigation, we revisit the phylogenetic history of the core Nymphaeaceae by employing such an approach. Specifically, we use a plastid phylogenomic analysis strategy in which all coding and noncoding partitions are separated and then undergo software-driven DNA sequence alignment, followed by a motif-based alignment inspection and adjustment. This approach allows us to increase the reliability of the character base compared to the default practice of aligning complete plastomes through software algorithms alone. Our approach produces significantly different phylogenetic tree reconstructions for several of the plastome regions under study. The results of these reconstructions underscore that Nymphaea is paraphyletic in its current circumscription, that each of the five subgenera of Nymphaea is monophyletic, and that the subgenus Nymphaea is sister to all other subgenera of Nymphaea. Our results also clarify many evolutionary relationships within the Nymphaea subgenera Brachyceras, Hydrocallis and Nymphaea. In closing, we discuss whether the phylogenetic reconstructions obtained through our motif-based alignment adjustments are in line with morphological evidence on water-lily evolution.
Collapse
Affiliation(s)
- Jessica A Roestel
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
| | - John H Wiersema
- Department of Botany, National Museum of Natural History - Smithsonian Institution, Washington, DC, 37012, USA
| | - Robert K Jansen
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, 78712, USA
| | - Thomas Borsch
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
- Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, 14195, Berlin, Germany
| | - Michael Gruenstaeudl
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
- Department of Biological Sciences, Fort Hays State University, Hays, KS, 67601, USA
| |
Collapse
|
2
|
Escobari B, Borsch T, Quedensley TS, Gruenstaeudl M. Plastid phylogenomics of the Gynoxoid group (Senecioneae, Asteraceae) highlights the importance of motif-based sequence alignment amid low genetic distances. AMERICAN JOURNAL OF BOTANY 2021; 108:2235-2256. [PMID: 34636417 DOI: 10.1002/ajb2.1775] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/23/2021] [Accepted: 08/12/2021] [Indexed: 06/13/2023]
Abstract
PREMISE The genus Gynoxys and relatives form a species-rich lineage of Andean shrubs and trees with low genetic distances within the sunflower subtribe Tussilaginineae. Previous molecular phylogenetic investigations of the Tussilaginineae have included few, if any, representatives of this Gynoxoid group or reconstructed ambiguous patterns of relationships for it. METHODS We sequenced complete plastid genomes of 21 species of the Gynoxoid group and related Tussilaginineae and conducted detailed comparisons of the phylogenetic relationships supported by the gene, intron, and intergenic spacer partitions of these genomes. We also evaluated the impact of manual, motif-based adjustments of automatic DNA sequence alignments on phylogenetic tree inference. RESULTS Our results indicate that the inclusion of all plastid genome partitions is needed to infer well-supported phylogenetic trees of the Gynoxoid group. Whole plastome-based tree inference suggests that the genera Gynoxys and Nordenstamia are polyphyletic and form the core clade of the Gynoxoid group. This clade is sister to a clade of Aequatorium and Paragynoxys and also includes some but not all representatives of Paracalia. CONCLUSIONS The concatenation and combined analysis of all plastid genome partitions and the construction of manually-curated, motif-based DNA sequence alignments are found to be instrumental in the recovery of well-supported relationships of the Gynoxoid group. We demonstrate that the correct assessment of homology in genome-level plastid sequence data sets is crucial for subsequent phylogeny reconstruction and that the manual post-processing of multiple sequence alignments improves the reliability of such reconstructions amid low genetic distances between taxa.
Collapse
Affiliation(s)
- Belen Escobari
- Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, Berlin, 14195, Germany
- Herbario Nacional de Bolivia, Universidad Mayor de San Andres, Casilla, La Paz, 10077, Bolivia
| | - Thomas Borsch
- Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, Berlin, 14195, Germany
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
| | - Taylor S Quedensley
- Department of Biology, Texas Christian University, Fort Worth, TX, 76109, USA
| | - Michael Gruenstaeudl
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, Berlin, 14195, Germany
| |
Collapse
|
3
|
Lehtonen S. Phenotypic characters of static homology increase phylogenetic stability under direct optimization of otherwise dynamic homology characters. Cladistics 2021; 36:617-626. [PMID: 34618977 DOI: 10.1111/cla.12438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/20/2020] [Indexed: 11/29/2022] Open
Abstract
Direct optimization of unaligned sequence characters provides a natural framework to explore the sensitivity of phylogenetic hypotheses to variation in analytical parameters. Phenotypic data, when combined into such analyses, are typically analyzed with static homology correspondences unlike the dynamic homology sequence data. Static homology characters may be expected to constrain the direct optimization and thus, potentially increase the similarity of phylogenetic hypotheses under different cost sets. However, whether a total-evidence approach increases the phylogenetic stability or not remains empirically largely unexplored. Here, I studied the impact of static homology data on sensitivity using six empirical data sets composed of several molecular markers and phenotypic data. The inclusion of static homology phenotypic data increased the average stability of phylogenetic hypothesis in five out of the six data sets. To investigate if any static homology characters would have similar effect, the analyses were repeated with randomized phenotypic data, and with one of the molecular markers fixed as static homology characters. These analyses had, on average, almost no effect on the phylogenetic stability, although the randomized phenotypic data sometimes resulted in even higher stability than empirical phenotypic data. The impact was related to the strength of the phylogenetic signal in the phenotypic data: higher average jackknife support of the phenotypic tree correlated with stronger stabilizing effect in the total-evidence analysis. Phenotypic data with a strong signal made the total-evidence trees topologically more similar to the phenotypic trees, thus, they constrained the dynamic homology correspondences of the sequence data. Characters that increase phylogenetic stability are particularly valuable for phylogenetic inference. These results indicate an important role and additive value of phenotypic data in increasing the stability of phylogenetic hypotheses in total-evidence analyses.
Collapse
Affiliation(s)
- Samuli Lehtonen
- Biodiversity Unit, University of Turku, Turku, FI-20014, Finland
| |
Collapse
|
4
|
Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019; 68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023] Open
Abstract
The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
Collapse
Affiliation(s)
- Michael Nute
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
| | - Ehsan Saleh
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
| | - Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA.,Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1205 W. Clark St., Urbana, IL 61801, USA.,National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
5
|
Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 2015; 16:108. [PMID: 25888064 PMCID: PMC4395974 DOI: 10.1186/s12859-015-0516-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2014] [Accepted: 02/24/2015] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign .
Collapse
Affiliation(s)
- Joseph L Herman
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
- Division of Mathematical Biology, National Institute of Medical Research,, The Ridgeway, London, NW7 1AA, UK.
| | - Ádám Novák
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
| | - Rune Lyngsø
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
| | - Adrienn Szabó
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
| | - István Miklós
- Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary.
- Department of Stochastics, Rényi Institute, Reáltanoda u. 13-15, Budapest, 1053, Hungary.
| | - Jotun Hein
- Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK.
| |
Collapse
|
6
|
Simmons MP. Limitations of locally sampled characters in phylogenetic analyses of sparse supermatrices. Mol Phylogenet Evol 2014; 74:1-14. [DOI: 10.1016/j.ympev.2014.01.030] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2013] [Revised: 01/30/2014] [Accepted: 01/30/2014] [Indexed: 11/16/2022]
|
7
|
An artifact caused by undersampling optimal trees in supermatrix analyses of locally sampled characters. Mol Phylogenet Evol 2013; 69:265-75. [DOI: 10.1016/j.ympev.2013.06.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2013] [Revised: 05/28/2013] [Accepted: 06/01/2013] [Indexed: 11/22/2022]
|
8
|
Simmons MP, Norton AP. Quantification and relative severity of inflated branch-support values generated by alternative methods: An empirical example. Mol Phylogenet Evol 2013; 67:277-96. [DOI: 10.1016/j.ympev.2013.01.020] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2012] [Revised: 01/18/2013] [Accepted: 01/31/2013] [Indexed: 10/27/2022]
|
9
|
Nagy LG, Kocsubé S, Csanádi Z, Kovács GM, Petkovits T, Vágvölgyi C, Papp T. Re-mind the gap! Insertion - deletion data reveal neglected phylogenetic potential of the nuclear ribosomal internal transcribed spacer (ITS) of fungi. PLoS One 2012; 7:e49794. [PMID: 23185439 PMCID: PMC3501463 DOI: 10.1371/journal.pone.0049794] [Citation(s) in RCA: 83] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2012] [Accepted: 10/12/2012] [Indexed: 01/09/2023] Open
Abstract
Rapidly evolving, indel-rich phylogenetic markers play a pivotal role in our understanding of the relationships at multiple levels of the tree of life. There is extensive evidence that indels provide conserved phylogenetic signal, however, the range of phylogenetic depths for which gaps retain tree signal has not been investigated in detail. Here we address this question using the fungal internal transcribed spacer (ITS), which is central in many phylogenetic studies, molecular ecology, detection and identification of pathogenic and non-pathogenic species. ITS is repeatedly criticized for indel-induced alignment problems and the lack of phylogenetic resolution above species level, although these have not been critically investigated. In this study, we examined whether the inclusion of gap characters in the analyses shifts the phylogenetic utility of ITS alignments towards earlier divergences. By re-analyzing 115 published fungal ITS alignments, we found that indels are slightly more conserved than nucleotide substitutions, and when included in phylogenetic analyses, improved the resolution and branch support of phylogenies across an array of taxonomic ranges and extended the resolving power of ITS towards earlier nodes of phylogenetic trees. Our results reconcile previous contradicting evidence for the effects of data exclusion: in the case of more sophisticated indel placement, the exclusion of indel-rich regions from the analyses results in a loss of tree resolution, whereas in the case of simpler alignment methods, the exclusion of gapped sites improves it. Although the empirical datasets do not provide to measure alignment accuracy objectively, our results for the ITS region are consistent with previous simulations studies alignment algorithms. We suggest that sophisticated alignment algorithms and the inclusion of indels make the ITS region and potentially other rapidly evolving indel-rich loci valuable sources of phylogenetic information, which can be exploited at multiple taxonomic levels.
Collapse
Affiliation(s)
- László G Nagy
- University of Szeged, Faculty of Science and Informatics, Department of Microbiology, Szeged, Hungary.
| | | | | | | | | | | | | |
Collapse
|
10
|
Can sensitivity analysis help to detect long-branch attraction? Mol Phylogenet Evol 2011; 61:899-903. [DOI: 10.1016/j.ympev.2011.08.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2011] [Revised: 08/01/2011] [Accepted: 08/04/2011] [Indexed: 11/19/2022]
|
11
|
Simmons MP. Radical instability and spurious branch support by likelihood when applied to matrices with non-random distributions of missing data. Mol Phylogenet Evol 2011; 62:472-84. [PMID: 22067131 DOI: 10.1016/j.ympev.2011.10.017] [Citation(s) in RCA: 58] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2011] [Revised: 10/10/2011] [Accepted: 10/23/2011] [Indexed: 10/16/2022]
Abstract
Non-random distributions of missing data are a general problem for likelihood-based statistical analyses, including those in a phylogenetic context. Extensive non-randomly distributed missing data are particularly problematic in supermatrix analyses that include many terminals and/or loci. It has been widely reported that missing data can lead to loss of resolution, but only very rarely create misleading or otherwise unsupported results in a parsimony context. Yet this does not hold for all parametric-based analyses because of their assumption of homogeneity across characters and lineages, which can lead to both long-branch attraction and long-branch repulsion. Contrived examples were used to demonstrate that non-random distributions of missing data, even without rate heterogeneity among characters and a well fitting model, can provide misleading likelihood-based topologies and branch-support values that are radically unstable based on slight modifications to character sampling. The same can occur despite complete absence of parsimony-informative characters. Otherwise unsupported resolution and high branch support for these clades were found to occur frequently in 22 empirical examples derived from a published supermatrix. Partitioning characters based on the distribution of missing data helped to decrease, but did not eliminate, these artifacts. These artifacts were exacerbated by low quality tree searches, particularly when holding only a single optimal tree that must be fully resolved.
Collapse
Affiliation(s)
- Mark P Simmons
- Department of Biology, Colorado State University, Fort Collins, CO 80523-1878, USA.
| |
Collapse
|
12
|
Simmons MP. Misleading results of likelihood-based phylogenetic analyses in the presence of missing data. Cladistics 2011; 28:208-222. [DOI: 10.1111/j.1096-0031.2011.00375.x] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
|
13
|
Simmons MP, Freudenstein JV. Spurious 99% bootstrap and jackknife support for unsupported clades. Mol Phylogenet Evol 2011; 61:177-91. [DOI: 10.1016/j.ympev.2011.06.003] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2011] [Revised: 05/25/2011] [Accepted: 06/08/2011] [Indexed: 11/27/2022]
|