1
de Oliveira Martins L, Bloomfield S, Stoakes E, Grant AJ, Page AJ, Mather AE. Tatajuba: exploring the distribution of homopolymer tracts. NAR Genom Bioinform 2022; 4:lqac003. [PMID: 35118377 PMCID: PMC8808543 DOI: 10.1093/nargab/lqac003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Revised: 11/18/2021] [Accepted: 01/05/2022] [Indexed: 11/14/2022]
Abstract
Length variation of homopolymeric tracts, which induces phase variation, is known to regulate gene expression leading to phenotypic variation in a wide range of bacterial species. There is no specialized bioinformatics software which can, at scale, exhaustively explore and describe these features from sequencing data. Identifying these is non-trivial as sequencing and bioinformatics methods are prone to introducing artefacts when presented with homopolymeric tracts due to the decreased base diversity. We present tatajuba, which can automatically identify potential homopolymeric tracts and help predict their putative phenotypic impact, allowing for rapid investigation. We use it to detect all tracts in two separate datasets, one of Campylobacter jejuni and one of three Bordetella species, and to highlight those tracts that are polymorphic across samples. With this we confirm homopolymer tract variation with phenotypic impact found in previous studies and additionally find many more with potential variability. The software is written in C and is available under the open source licence GNU GPLv3.
Affiliation(s)
- Samuel Bloomfield
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Emily Stoakes
- Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge CB3 0ES, UK
- Andrew J Grant
- Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge CB3 0ES, UK
- Andrew J Page
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
- Alison E Mather
- Quadram Institute Bioscience, Norwich Research Park, Norwich NR4 7UQ, UK
2
Jiang X, Edwards SV, Liu L. The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets. Syst Biol 2021; 69:795-812. [PMID: 32011711 PMCID: PMC7302055 DOI: 10.1093/sysbio/syaa008] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 04/01/2019] [Revised: 12/24/2019] [Accepted: 01/02/2020] [Indexed: 11/30/2022]
Abstract
A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically congruent gene trees suggest that a poor fit of substitution models, rejected by 44% of loci, and concatenation models, rejected by 38% of loci, is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation.
Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference. [Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation.]
Affiliation(s)
- Xiaodong Jiang
- Department of Statistics, University of Georgia, 310 Herty Drive, Athens, GA 30602, USA
- Scott V Edwards
- Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard, 26 Oxford Street, Cambridge, MA 02138, USA
- Liang Liu
- Department of Statistics, University of Georgia, 310 Herty Drive, Athens, GA 30602, USA
- Institute of Bioinformatics, University of Georgia, 120 Green Street, Athens, GA 30602, USA
3
Shepherd DA, Klaere S. How Well Does Your Phylogenetic Model Fit Your Data? Syst Biol 2018; 68:157-167. [DOI: 10.1093/sysbio/syy066] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 12/11/2016] [Accepted: 10/11/2018] [Indexed: 12/27/2022]
Affiliation(s)
- Daisy A Shepherd
- Department of Statistics, The University of Auckland, Auckland, New Zealand
- Steffen Klaere
- Department of Statistics, The University of Auckland, Auckland, New Zealand
- School of Biological Sciences, The University of Auckland, Auckland, New Zealand
4
Xi Z, Liu L, Davis CC. The Impact of Missing Data on Species Tree Estimation. Mol Biol Evol 2015; 33:838-60. [DOI: 10.1093/molbev/msv266] [Citation(s) in RCA: 101] [Impact Index Per Article: 11.2] [Indexed: 01/03/2023]
5
Goremykin VV, Nikiforova SV, Cavalieri D, Pindo M, Lockhart P. The Root of Flowering Plants and Total Evidence. Syst Biol 2015; 64:879-91. [DOI: 10.1093/sysbio/syv028] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Received: 10/13/2014] [Accepted: 05/05/2015] [Indexed: 11/14/2022]
6
Gutiérrez EE, Anderson RP, Voss RS, Ochoa-G. J, Aguilera M, Jansa SA. Phylogeography of Marmosa robinsoni: insights into the biogeography of dry forests in northern South America. J Mammal 2014. [DOI: 10.1644/14-mamm-a-069] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/17/2022]
7
Ekman S, Blaalid R. The Devil in the Details: Interactions between the Branch-Length Prior and Likelihood Model Affect Node Support and Branch Lengths in the Phylogeny of the Psoraceae. Syst Biol 2011; 60:541-61. [DOI: 10.1093/sysbio/syr022] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 11/13/2022]
Affiliation(s)
- Stefan Ekman
- Museum of Evolution, Uppsala University, Norbyvägen 16, SE-752 36 Uppsala, Sweden
- Department of Biology, University of Bergen, PO Box 7800, N-5020 Bergen, Norway
- Rakel Blaalid
- Department of Biology, University of Bergen, PO Box 7800, N-5020 Bergen, Norway
- Department of Biology, University of Oslo, PO Box 1066 Blindern, N-0316 Oslo, Norway
8
Li C, Lu G, Ortí G. Optimal data partitioning and a test case for ray-finned fishes (Actinopterygii) based on ten nuclear loci. Syst Biol 2010; 57:519-39. [PMID: 18622808 DOI: 10.1080/10635150802206883] [Citation(s) in RCA: 152] [Impact Index Per Article: 10.9] [Indexed: 10/21/2022]
Abstract
Data partitioning, the combined phylogenetic analysis of homogeneous blocks of data, is a common strategy used to accommodate heterogeneities in complex multilocus data sets. Variation in evolutionary rates and substitution patterns among sites are typically addressed by partitioning data by gene, codon position, or both. Excessive partitioning of the data, however, could lead to overparameterization; therefore, it seems critical to define the minimum numbers of partitions necessary to improve the overall fit of the model. We propose a new method, based on cluster analysis, to find an optimal partitioning strategy for multilocus protein-coding data sets. A heuristic exploration of alternative partitioning schemes, based on Bayesian and maximum likelihood (ML) criteria, is shown here to produce an optimal number of partitions. We tested this method using sequence data of 10 nuclear genes collected from 52 ray-finned fish (Actinopterygii) and four tetrapods. The concatenated sequences included 7995 nucleotide sites maximally split into 30 partitions defined a priori based on gene and codon position. Our results show that a model based on only 10 partitions defined by cluster analysis performed better than partitioning by both gene and codon position. Alternative data partitioning schemes also are shown to affect the topologies resulting from phylogenetic analysis, especially when Bayesian methods are used, suggesting that overpartitioning may be of major concern. The phylogenetic relationships among the major clades of ray-finned fish were assessed using the best data-partitioning schemes under ML and Bayesian methods. Some significant results include the monophyly of "Holostei" (Amia and Lepisosteus), the sister-group relationships between (1) esociforms and salmoniforms and (2) osmeriforms and stomiiforms, the polyphyly of Perciformes, and a close relationship of cichlids and atherinomorphs.
Affiliation(s)
- Chenhong Li
- School of Biological Sciences, University of Nebraska, Lincoln, NE 68588, USA.
9
Waddell PJ, Ota R, Penny D. Measuring fit of sequence data to phylogenetic model: gain of power using marginal tests. J Mol Evol 2009; 69:289-99. [PMID: 19851702 DOI: 10.1007/s00239-009-9268-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 03/15/2009] [Accepted: 07/28/2009] [Indexed: 11/29/2022]
Abstract
Testing fit of data to model is fundamentally important to any science, but publications in the field of phylogenetics rarely do this. Such analyses discard fundamental aspects of science as prescribed by Karl Popper. Indeed, not without cause, Popper (Unended quest: an intellectual autobiography. Fontana, London, 1976) once argued that evolutionary biology was unscientific as its hypotheses were untestable. Here we trace developments in assessing fit from Penny et al. (Nature 297:197-200, 1982) to the present. We compare the general log-likelihood ratio statistic (the G or G² statistic) between the evolutionary tree model and the multinomial model with that of marginalized tests applied to an alignment (using placental mammal coding sequence data). It is seen that the most general test does not reject the fit of data to model (P ≈ 0.5), but the marginalized tests do. Tests on pairwise frequency (F) matrices strongly (P < 0.001) reject the most general phylogenetic (GTR) models commonly in use. It is also clear (P < 0.01) that the sequences are not stationary in their nucleotide composition. Deviations from stationarity and homogeneity seem to be unevenly distributed amongst taxa; not necessarily those expected from examining other regions of the genome. By marginalizing the 4^t patterns of the i.i.d. model to observed and expected parsimony counts, that is, from constant sites, to singletons, to parsimony informative characters of a minimum possible length, then the likelihood ratio test regains power, and it too rejects the evolutionary model with P << 0.001. Given such behavior over relatively recent evolutionary time, readers in general should maintain a healthy skepticism of results, as the scale of the systematic errors in published trees may really be far larger than the analytical methods (e.g., bootstrap) report.
Affiliation(s)
- Peter J Waddell
- Department of Biological Sciences, Purdue University, West Lafayette, IN 47906, USA.
10
Cheng F, Hartmann S, Gupta M, Ibrahim JG, Vision TJ. A hierarchical model for incomplete alignments in phylogenetic inference. Bioinformatics 2009; 25:592-8. [PMID: 19147663 PMCID: PMC2647833 DOI: 10.1093/bioinformatics/btp015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 09/25/2008] [Revised: 01/05/2009] [Accepted: 01/05/2009] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Full-length DNA and protein sequences that span the entire length of a gene are ideally used for multiple sequence alignments (MSAs) and the subsequent inference of their relationships. Frequently, however, MSAs contain a substantial amount of missing data. For example, expressed sequence tags (ESTs), which are partial sequences of expressed genes, are the predominant source of sequence data for many organisms. The patterns of missing data typical for EST-derived alignments greatly compromise the accuracy of estimated phylogenies. RESULTS: We present a statistical method for inferring phylogenetic trees from EST-based incomplete MSA data. We propose a class of hierarchical models for modeling pairwise distances between the sequences, and develop a fully Bayesian approach for estimation of the model parameters. Once the distance matrix is estimated, the phylogenetic tree may be constructed by applying neighbor-joining (or any other algorithm of choice). We also show that maximizing the marginal likelihood from the Bayesian approach yields similar results to a profile likelihood estimation. The proposed methods are illustrated using simulated protein families, for which the true phylogeny is known, and one real protein family. AVAILABILITY: R code for fitting these models is available from: http://people.bu.edu/gupta/software.htm.
Affiliation(s)
- Fuxia Cheng
- Department of Mathematics, Illinois State University, Normal, IL, USA
11
Hartmann S, Vision TJ. Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment? BMC Evol Biol 2008; 8:95. [PMID: 18366758 PMCID: PMC2359737 DOI: 10.1186/1471-2148-8-95] [Citation(s) in RCA: 89] [Impact Index Per Article: 5.6] [Received: 10/27/2007] [Accepted: 03/26/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets. RESULTS: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.
CONCLUSION: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.
Affiliation(s)
- Stefanie Hartmann
- Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA
- Institute for Biochemistry and Biology, Karl-Liebknecht-Strasse 24-25, University of Potsdam, 14476 Potsdam, Germany
- Todd J Vision
- Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA
12
Gatesy J. A tenth crucial question regarding model use in phylogenetics. Trends Ecol Evol 2007; 22:509-10. [PMID: 17804115 DOI: 10.1016/j.tree.2007.08.002] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 04/04/2007] [Revised: 07/09/2007] [Accepted: 08/20/2007] [Indexed: 11/19/2022]
13
White WT, Hills SF, Gaddam R, Holland BR, Penny D. Treeness triangles: visualizing the loss of phylogenetic signal. Mol Biol Evol 2007; 24:2029-39. [PMID: 17630280 DOI: 10.1093/molbev/msm139] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 12/21/2022]
Abstract
It is well known that molecular data "saturates" with increasing sequence divergence (thereby losing phylogenetic information) and that in addition the accumulation of misleading information due to chance similarities or to systematic bias may accompany saturation as well. Exploratory data analysis methods that can quantify the extent of signal loss or convergence for a given data set are scarce. Such methods are needed because genomics delivers very long sequence alignments spanning substantial phylogenetic depth, where site saturation may be compounded by systematic biases or other alternative signals. Here we introduce the Treeness Triangle (TT) graph, in which signals detectable by Hadamard (spectral) analysis are summed into 3 categories: those supporting 1) external and 2) internal branches in the optimal tree, in addition to 3) the residuals (potential internal branches not present in the optimal tree). These 3 values are plotted in a standard ternary coordinate system. The approach is illustrated with simulated and real data sets, the latter from complete chloroplast genomes, where potential problems of paralogy or lateral gene acquisition can be excluded. The TT uncovers the divergence-dependent loss of phylogenetic signal as subsets of chloroplast genomes are investigated that span increasingly deeper evolutionary timescales. The rate of signal loss (or signal retention) varies with the gene and/or the method of analysis.
Affiliation(s)
- W T White
- Allan Wilson Center for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand