1
Zou Y, Zhang Z, Zeng Y, Hu H, Hao Y, Huang S, Li B. Common Methods for Phylogenetic Tree Construction and Their Implementation in R. Bioengineering (Basel) 2024; 11:480. [PMID: 38790347 PMCID: PMC11117635 DOI: 10.3390/bioengineering11050480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Revised: 05/04/2024] [Accepted: 05/07/2024] [Indexed: 05/26/2024] Open
Abstract
Phylogenetic trees reflect the evolutionary relationships between species or gene families and play a critical role in modern biological research. In this review, we summarize common methods for constructing phylogenetic trees, including distance methods, maximum parsimony, maximum likelihood, Bayesian inference, and tree-integration methods (supermatrix and supertree). We discuss the advantages, shortcomings, and applications of each method and provide code for constructing phylogenetic trees from molecular data using packages and algorithms in R. This review aims to provide comprehensive guidance and reference for researchers seeking to construct phylogenetic trees while also promoting further development and innovation in this field. By offering a clear and concise overview of the different methods available, we hope to enable researchers to select the most appropriate approach for their specific research questions and datasets.
Affiliation(s)
- Yue Zou
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Zixuan Zhang
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Yujie Zeng
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Hanyue Hu
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Youjin Hao
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
- Sheng Huang
  - Animal Nutrition Institute, Chongqing Academy of Animal Science, Chongqing 402460, China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing 401331, China
2
Steenwyk JL, Li Y, Zhou X, Shen XX, Rokas A. Incongruence in the phylogenomics era. Nat Rev Genet 2023; 24:834-850. [PMID: 37369847 DOI: 10.1038/s41576-023-00620-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Accepted: 05/19/2023] [Indexed: 06/29/2023]
Abstract
Genome-scale data and the development of novel statistical phylogenetic approaches have greatly aided the reconstruction of a broad sketch of the tree of life and resolved many of its branches. However, incongruence - the inference of conflicting evolutionary histories - remains pervasive in phylogenomic data, hampering our ability to reconstruct and interpret the tree of life. Biological factors, such as incomplete lineage sorting, horizontal gene transfer, hybridization, introgression, recombination and convergent molecular evolution, can lead to gene phylogenies that differ from the species tree. In addition, analytical factors, including stochastic, systematic and treatment errors, can drive incongruence. Here, we review these factors, discuss methodological advances to identify and handle incongruence, and highlight avenues for future research.
Affiliation(s)
- Jacob L Steenwyk
  - Howard Hughes Medical Institute and the Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA
- Yuanning Li
  - Institute of Marine Science and Technology, Shandong University, Qingdao, China
- Xiaofan Zhou
  - Guangdong Laboratory for Lingnan Modern Agriculture, Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, South China Agricultural University, Guangzhou, China
- Xing-Xing Shen
  - Key Laboratory of Biology of Crop Pathogens and Insects of Zhejiang Province, Institute of Insect Sciences, Zhejiang University, Hangzhou, China
- Antonis Rokas
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
  - Vanderbilt Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN, USA
  - Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
3
Ahdritz G, Bouatta N, Kadyan S, Jarosch L, Berenberg D, Fisk I, Watkins AM, Ra S, Bonneau R, AlQuraishi M. OpenProteinSet: Training data for structural biology at scale. ARXIV 2023:arXiv:2308.05326v1. [PMID: 37608940 PMCID: PMC10441447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Affiliation(s)
- Nazim Bouatta
  - Laboratory of Systems Pharmacology, Harvard Medical School
- Daniel Berenberg
  - Prescient Design, Genentech & Department of Computer Science, New York University
4
Nayeem MA, Samudro NA, Rahman MS, Rahman MS. MAMMLE: A Framework for Phylogeny Estimation Based on Multiobjective Application-aware Multiple Sequence Alignment and Maximum Likelihood Ensemble. J Comput Biol 2023; 30:245-249. [PMID: 36706434 DOI: 10.1089/cmb.2021.0533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023] Open
Abstract
Motivation: Phylogenetic trees are often inferred from a multiple sequence alignment (MSA) where the tree accuracy is heavily impacted by the nature of estimated alignment. Carefully equipping an MSA tool with multiple application-aware objectives positively impacts its capability to yield better trees. Results: We introduce Multiobjective Application-aware Multiple Sequence Alignment and Maximum Likelihood Ensemble (MAMMLE), a framework for inferring better phylogenetic trees from unaligned sequences by hybridizing two MSA tools [i.e., Multiple Sequence Comparison by Log-Expectation (MUSCLE) and Multiple Alignment using Fast Fourier Transform (MAFFT)] with multiobjective optimization strategy and leveraging multiple maximum likelihood hypotheses. In our experiments, MAMMLE exhibits 5.57% (4.77%) median improvement (deterioration) over MUSCLE on 50.34% (37.41%) of instances.
5
Shen C, Park M, Warnow T. WITCH: Improved Multiple Sequence Alignment Through Weighted Consensus Hidden Markov Model Alignment. J Comput Biol 2022; 29:782-801. [PMID: 35575747 DOI: 10.1089/cmb.2021.0585] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 01/19/2023] Open
Abstract
Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k>1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
Affiliation(s)
- Chengze Shen
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Minhyuk Park
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Tandy Warnow
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
6
Nayeem MA, Bayzid MS, Rahman AH, Shahriyar R, Rahman MS. Multiobjective Formulation of Multiple Sequence Alignment for Phylogeny Inference. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:2775-2786. [PMID: 33044939 DOI: 10.1109/tcyb.2020.3020308] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Multiple sequence alignment (MSA) is a preliminary task for estimating phylogenies. It is used for homology inference among the sequences of a set of species. Generally, the MSA task is handled as a single-objective optimization process. The alignments computed under one criterion may be different from the alignments generated by other criteria, inferring discordant homologies and thus leading to different hypothesized evolutionary histories relating the sequences. The multiobjective (MO) formulation of MSA has recently been advocated by several researchers to address this issue. An MO approach independently optimizes multiple (often conflicting) objective functions at the same time and outputs a set of competitive alignments. However, no conceptual or experimental rationale from a real-world application perspective has been reported so far for any MO formulation of MSA. This article investigates the impact of MO formulation in the context of an important scientific problem, namely, phylogeny estimation. Employing popular evolutionary MO algorithms, we show that: 1) trees inferred based on alignments produced by the existing MSA methods used in practice are substantially worse in quality than the trees inferred based on the alignments output by an MO algorithm and 2) even high-quality alignments (according to popular measures available in the literature) may fail to achieve acceptable accuracy in generating phylogenetic trees. Thus, we essentially ask the following natural question: "can a phylogeny-aware (i.e., application-aware) metric guide in selecting appropriate MO formulations to ensure better phylogeny estimation?" Here, we report a carefully designed extensive experimental study that positively answers this question.
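The "set of competitive alignments" that an MO approach outputs is commonly understood as a Pareto (non-dominated) set. The following is a minimal Python sketch of that filtering step only. It is not the article's algorithm: the candidate names, the two objectives, and their scores are hypothetical, and both objectives are assumed to be maximized.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is no worse than b on every objective and
    strictly better on at least one (all objectives maximized)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the sorted names of candidates not dominated by any other."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical alignments scored on two made-up objectives.
candidates = {
    "aln_a": (0.90, 0.40),
    "aln_b": (0.70, 0.80),
    "aln_c": (0.60, 0.70),  # dominated by aln_b
    "aln_d": (0.90, 0.30),  # dominated by aln_a
}

front = pareto_front(candidates)
print(front)
```

Neither `aln_a` nor `aln_b` is better on both objectives, so both remain in the front. Choosing among them requires an additional criterion, which is the role the abstract assigns to a phylogeny-aware metric.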
7
Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022] Open
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Qiang Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
8
Spielman SJ, Miraglia ML. Relative model selection of evolutionary substitution models can be sensitive to multiple sequence alignment uncertainty. BMC Ecol Evol 2021; 21:214. [PMID: 34844571 PMCID: PMC8628390 DOI: 10.1186/s12862-021-01931-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/19/2021] [Accepted: 10/06/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Multiple sequence alignments (MSAs) represent the fundamental unit of data inputted to most comparative sequence analyses. In phylogenetic analyses in particular, errors in MSA construction have the potential to induce further errors in downstream analyses such as phylogenetic reconstruction itself, ancestral state reconstruction, and divergence time estimation. In addition to providing phylogenetic methods with an MSA to analyze, researchers must also specify a suitable evolutionary model for the given analysis. Most commonly, researchers apply relative model selection to select a model from a candidate set and then provide both the MSA and the selected model as input to subsequent analyses. While the influence of MSA errors has been explored for most stages of phylogenetics pipelines, the potential effects of MSA uncertainty on the relative model selection procedure itself have not been explored. RESULTS: We assessed the consistency of relative model selection when presented with multiple perturbed versions of a given MSA. We find that while relative model selection is mostly robust to MSA uncertainty, in a substantial proportion of circumstances, relative model selection identifies distinct best-fitting models from different MSAs created from the same set of sequences. We find that this issue is more pervasive for nucleotide data compared to amino-acid data. However, we also find that it is challenging to predict whether relative model selection will be robust or sensitive to uncertainty in a given MSA. CONCLUSIONS: We find that MSA uncertainty can affect virtually all steps of phylogenetic analysis pipelines to a greater extent than has previously been recognized, including relative model selection.
Affiliation(s)
- Molly L Miraglia
  - Department of Molecular and Cellular Biosciences, Rowan University, Glassboro, NJ, 08028, USA
  - Fox Chase Cancer Center, Philadelphia, PA, 19111, USA
9
Shipunov A, Fernández-Alonso JL, Hassemer G, Alp S, Lee HJ, Pay K. Molecular and Morphological Data Improve the Classification of Plantagineae (Lamiales). PLANTS (BASEL, SWITZERLAND) 2021; 10:2299. [PMID: 34834664 PMCID: PMC8625185 DOI: 10.3390/plants10112299] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/09/2021] [Revised: 10/12/2021] [Accepted: 10/13/2021] [Indexed: 06/13/2023]
Abstract
The tribe Plantagineae (Lamiales) is a group of plants with worldwide distribution, notorious for its complicated taxonomy and still unresolved natural history. We describe the results of a broadly sampled phylogenetic study of the tribe. The expanded sampling dataset is based on the trnL-F spacer, rbcL, and ITS2 markers across all three included genera (Aragoa, Littorella and Plantago) and makes this the most comprehensive study to date. The other dataset uses five markers and provides remarkably good resolution throughout the tree, including support for all of the major clades. In addition to the molecular phylogeny, a morphology database of 114 binary characters was assembled to provide comparison with the molecular phylogeny and to develop a means to assign species not sampled in the molecular analysis to their most closely related species that were sampled. Based on the molecular phylogeny and the assignment algorithm to place unsampled species, a key to sections is presented, and a revised classification of the tribe is provided. We also include the description of new species from North America.
Affiliation(s)
- Alexey Shipunov
  - Department of Biology, Minot State University, Minot, ND 58707, USA
- Gustavo Hassemer
  - Três Lagoas Campus, Federal University of Mato Grosso do Sul, Três Lagoas CEP 79610-100, Brazil
- Sean Alp
  - Department of Biology, Minot State University, Minot, ND 58707, USA
- Hye Ji Lee
  - Department of Biology, Minot State University, Minot, ND 58707, USA
- Kyle Pay
  - Department of Biology, Minot State University, Minot, ND 58707, USA
10
Understanding the Genetic Diversity of Picobirnavirus: A Classification Update Based on Phylogenetic and Pairwise Sequence Comparison Approaches. Viruses 2021; 13:v13081476. [PMID: 34452341 PMCID: PMC8402817 DOI: 10.3390/v13081476] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/27/2021] [Revised: 07/20/2021] [Accepted: 07/23/2021] [Indexed: 11/29/2022] Open
Abstract
Picobirnaviruses (PBVs) are small, double stranded RNA viruses with an ability to infect a myriad of hosts and possessing a high degree of genetic diversity. PBVs are currently classified into two genogroups based upon classification of a 200 nt sequence of RdRp. We demonstrate here that this phylogenetic marker is saturated, affected by homoplasy, and has high phylogenetic noise, resulting in 34% unsolved topologies. By contrast, full-length RdRp sequences provide reliable topologies that allow ancestralism of members to be correctly inferred. MAFFT alignment and maximum likelihood trees were established as the optimal methods to determine phylogenetic relationships, providing complete resolution of PBV RdRp and capsid taxa, each into three monophyletic groupings. Pairwise distance calculations revealed these lineages represent three species. For RdRp, the application of cutoffs determined by theoretical taxonomic distributions indicates that there are five genotypes in species 1, eight genotypes in species 2, and three genotypes in species 3. Capsids were also divided into three species, but sequences did not segregate into statistically supported subdivisions, indicating that diversity is lower than RdRp. We thus propose the adoption of a new nomenclature to indicate the species of each segment (e.g., PBV-C1R2).
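As an illustration of the kind of pairwise distance calculation mentioned in this abstract, here is a minimal Python sketch of the uncorrected p-distance between aligned sequences. The abstract does not state which distance measure or cutoffs the authors used, so the measure, the toy sequences, and the handling of gaps (columns with a gap in either sequence are skipped) are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable (ungapped) columns")
    return differences / compared


# Hypothetical aligned sequences, not data from the article.
aligned = {
    "seq1": "ATGCCGTA",
    "seq2": "ATGCCGTT",
    "seq3": "TTGACG-A",
}

distances: Dict[Tuple[str, str], float] = {
    (m, n): p_distance(aligned[m], aligned[n])
    for m, n in combinations(sorted(aligned), 2)
}

for pair, d in distances.items():
    print(pair, round(d, 3))
```

A classification cutoff would then be applied to these values, with pairs below the threshold assigned to the same group. The threshold itself has to come from the data, as the abstract describes.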
11
Neuwald AF, Kolaczkowski BD, Altschul SF. eCOMPASS: evaluative comparison of multiple protein alignments by statistical score. Bioinformatics 2021; 37:3456-3463. [PMID: 33983436 PMCID: PMC8545322 DOI: 10.1093/bioinformatics/btab374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2020] [Revised: 03/31/2021] [Accepted: 05/12/2021] [Indexed: 11/21/2022] Open
Abstract
Motivation: Detecting subtle biologically relevant patterns in protein sequences often requires the construction of a large and accurate multiple sequence alignment (MSA). Methods for constructing MSAs are usually evaluated using benchmark alignments, which, however, typically contain very few sequences and are therefore inappropriate when dealing with large numbers of proteins. Results: eCOMPASS addresses this problem using a statistical measure of relative alignment quality based on direct coupling analysis (DCA): to maintain protein structural integrity over evolutionary time, substitutions at one residue position typically result in compensating substitutions at other positions. eCOMPASS computes the statistical significance of the congruence between high scoring directly coupled pairs and 3D contacts in corresponding structures, which depends upon properly aligned homologous residues. We illustrate eCOMPASS using both simulated and real MSAs. Availability and implementation: The eCOMPASS executable, C++ open source code and input data sets are available at https://www.igs.umaryland.edu/labs/neuwald/software/compass. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andrew F Neuwald
  - Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Bryan D Kolaczkowski
  - Department of Microbiology & Cell Science, University of Florida, Gainesville, FL 32611, USA
- Stephen F Altschul
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
12
Zhan Q, Fu Y, Jiang Q, Liu B, Peng J, Wang Y. SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically. Protein Pept Lett 2020; 27:295-302. [PMID: 31385760 DOI: 10.2174/0929866526666190806143959] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/06/2019] [Revised: 04/26/2019] [Accepted: 06/14/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms refine an alignment by dividing it into two groups horizontally and then realigning the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. OBJECTIVE: In this article, our motivation is to develop a novel refinement method based on splitting-splicing vertically. METHODS: Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle, realign the middle part alone, and splice the realigned middle parts with the other two initial pieces to obtain a refined alignment. In the realign procedure of our method, the aligner will only focus on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. RESULTS: We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that given appropriate proportions to split the initial alignment, the average scores are increased comparably or slightly after using our method. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. CONCLUSION: The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.
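The split, realign, and splice steps described in METHODS can be sketched in Python as follows. This is an illustration of the data flow only, not the authors' implementation: the function names, the column indices that define the three parts, and the toy sequences are hypothetical, and `pad_realign` is a trivial stand-in where a real MSA tool would be called.

```python
from typing import Callable, List


def pad_realign(seqs: List[str]) -> List[str]:
    """Placeholder for a real aligner (an assumption for this sketch):
    left-justify the ungapped fragments and pad with gaps to equal length."""
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, "-") for s in seqs]


def split_realign_splice(
    alignment: List[str],
    start: int,
    end: int,
    realign: Callable[[List[str]], List[str]] = pad_realign,
) -> List[str]:
    """Split an alignment vertically at columns [start, end), strip gaps
    from the middle part, realign it alone, and splice the result back
    between the untouched left and right parts."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and equal length")
    if not 0 <= start <= end <= len(alignment[0]):
        raise ValueError("invalid split columns")
    left = [row[:start] for row in alignment]
    middle = [row[start:end].replace("-", "") for row in alignment]
    right = [row[end:] for row in alignment]
    realigned = realign(middle)
    return [l + m + r for l, m, r in zip(left, realigned, right)]


# Hypothetical three-sequence alignment; columns 2..5 form the middle part.
alignment = ["ACG-T-GA", "AC-GT-GA", "ACGGTTGA"]
refined = split_realign_splice(alignment, start=2, end=6)
for row in refined:
    print(row)
```

Because only gap characters are removed and reinserted, each row's residues are unchanged. Any improvement in a real setting would come from the aligner passed as `realign` and from the choice of split proportions, which the abstract notes must be chosen appropriately.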
Affiliation(s)
- Qing Zhan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yilei Fu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
13
Jermiin LS, Catullo RA, Holland BR. A new phylogenetic protocol: dealing with model misspecification and confirmation bias in molecular phylogenetics. NAR Genom Bioinform 2020; 2:lqaa041. [PMID: 33575594 PMCID: PMC7671319 DOI: 10.1093/nargab/lqaa041] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/17/2020] [Revised: 05/18/2020] [Accepted: 06/04/2020] [Indexed: 12/15/2022] Open
Abstract
Molecular phylogenetics plays a key role in comparative genomics and has increasingly significant impacts on science, industry, government, public health and society. In this paper, we posit that the current phylogenetic protocol is missing two critical steps, and that their absence allows model misspecification and confirmation bias to unduly influence phylogenetic estimates. Based on the potential offered by well-established but under-used procedures, such as assessment of phylogenetic assumptions and tests of goodness of fit, we introduce a new phylogenetic protocol that will reduce confirmation bias and increase the accuracy of phylogenetic estimates.
Affiliation(s)
- Lars S Jermiin
  - CSIRO Land & Water, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - School of Biology & Environment Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
- Renee A Catullo
  - CSIRO Land & Water, Canberra, ACT 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
  - School of Science and Health & Hawkesbury Institute of the Environment, Western Sydney University, Penrith, NSW 2751, Australia
- Barbara R Holland
  - School of Natural Sciences, University of Tasmania, Hobart, TAS 7001, Australia
14
Alkhamis MA, Li C, Torremorell M. Animal Disease Surveillance in the 21st Century: Applications and Robustness of Phylodynamic Methods in Recent U.S. Human-Like H3 Swine Influenza Outbreaks. Front Vet Sci 2020; 7:176. [PMID: 32373634 PMCID: PMC7186338 DOI: 10.3389/fvets.2020.00176] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/01/2019] [Accepted: 03/16/2020] [Indexed: 11/22/2022] Open
Abstract
Emerging and endemic animal viral diseases continue to impose substantial impacts on animal and human health. Most current and past molecular surveillance studies of animal diseases investigated spatio-temporal and evolutionary dynamics of the viruses in a disjointed analytical framework, ignoring many uncertainties, and drew joint conclusions from both analytical approaches. Phylodynamic methods offer a uniquely integrated platform capable of inferring complex epidemiological and evolutionary processes from the phylogeny of viruses in populations using a single Bayesian statistical framework. In this study, we reviewed and outlined basic concepts and aspects of phylodynamic methods and attempted to summarize essential components of the methodology in one analytical pipeline to facilitate the proper use of the methods by animal health researchers. Also, we challenged the robustness of the posterior evolutionary parameters, inferred by the commonly used phylodynamic models, using hemagglutinin (HA) and polymerase basic 2 (PB2) segments of the currently circulating human-like H3 swine influenza (SI) viruses isolated in the United States and multiple priors. Subsequently, we compared similarities and differences between the posterior parameters inferred from sequence data using multiple phylodynamic models. Our suggested phylodynamic approach attempts to reduce the impact of its inherent limitations to offer less biased and biologically plausible inferences about the pathogen evolutionary characteristics to properly guide intervention activities. We also pinpointed requirements and challenges for integrating phylodynamic methods in routine animal disease surveillance activities.
Affiliation(s)
- Moh A Alkhamis
  - Department of Epidemiology and Biostatistics, Faculty of Public Health, Health Sciences Center, Kuwait University, Kuwait City, Kuwait
  - Department of Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota, St. Paul, MN, United States
- Chong Li
  - Department of Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota, St. Paul, MN, United States
- Montserrat Torremorell
  - Department of Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota, St. Paul, MN, United States
15
|
Shipunov A, Carr S, Furniss S, Pay K, Pirani JR. First Phylogeny of Bitterbush Family, Picramniaceae (Picramniales). Plants (Basel) 2020; 9:E284. [PMID: 32098193 PMCID: PMC7076446 DOI: 10.3390/plants9020284] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/17/2019] [Revised: 02/10/2020] [Accepted: 02/19/2020] [Indexed: 11/17/2022]
Abstract
Picramniaceae is the only member of Picramniales, which is sister to the clade (Sapindales (Huerteales (Malvales, Brassicales))) in the rosids-malvids. Not much is known about most aspects of their ecology, geography, and morphology. The family is restricted to the American tropics. Picramniaceae representatives are rich in secondary metabolites; some species are known to be important for pharmaceutical purposes. Traditionally, Picramniaceae was classified as a subfamily of Simaroubaceae, but since 1995 it has been segregated as a family containing two genera, Picramnia and Alvaradoa, with the recent addition of a third genus, Nothotalisia, described in 2011. Only a few species of the family have been the subject of DNA-related research, and fewer than half of the species have been included in morphological phylogenetic analyses. It is clear that Picramniaceae remains a largely under-researched plant group. Here we present the first molecular phylogenetic tree of the group, based on both chloroplast and nuclear markers widely adopted in plant DNA barcoding. The main findings are: the family and its genera are monophyletic, and Picramnia is sister to the two other genera; some clades corroborate previous assumptions of relationships made on a morphological or geographical basis, while most parts of the molecular topology suggest high levels of homoplasy in the morphological evolution of Picramnia.
Affiliation(s)
- Shyla Carr
- Minot State University, Minot, ND 58707, USA
- Kyle Pay
- Minot State University, Minot, ND 58707, USA
|
16
|
Six Impossible Things before Breakfast: Assumptions, Models, and Belief in Molecular Dating. Trends Ecol Evol 2019; 34:474-486. [PMID: 30904189 DOI: 10.1016/j.tree.2019.01.017] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 12/18/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 01/16/2023]
Abstract
Confidence in molecular dating analyses has grown with the increasing sophistication of the methods. Some problematic cases where molecular dates disagreed with paleontological estimates appear to have been resolved with a growing agreement between molecules and fossils. But we cannot relax just yet. The growing analytical sophistication of many molecular dating methods relies on an increasingly large number of assumptions about evolutionary history and processes. Many of these assumptions are based on statistical tractability rather than being informed by improved understanding of molecular evolution, yet changing the assumptions can influence molecular dates. How can we tell if the answers we get are driven more by the assumptions we make than by the molecular data being analyzed?
|
17
|
Chang JM, Floden EW, Herrero J, Gascuel O, Di Tommaso P, Notredame C. Incorporating alignment uncertainty into Felsenstein's phylogenetic bootstrap to improve its reliability. Bioinformatics 2019; 37:1506-1514. [PMID: 30726875 PMCID: PMC8275982 DOI: 10.1093/bioinformatics/btz082] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/10/2018] [Revised: 12/12/2018] [Accepted: 02/05/2019] [Indexed: 12/30/2022] Open
Abstract
Motivation: Most evolutionary analyses are based on a pre-estimated multiple sequence alignment (MSA). Wong et al. established that multiple sequence alignment induces uncertainty when reconstructing phylogenies. They showed that in many cases different aligners produce different phylogenies, with no simple objective criterion sufficient to distinguish among these alternatives. Results: We demonstrate that incorporating MSA-induced uncertainty into bootstrap sampling can significantly increase the correlation between clade correctness and its corresponding bootstrap value. Our procedure involves concatenating several alternative multiple sequence alignments of the same sequences, produced using different commonly used aligners. We then draw bootstrap replicates while favoring columns of the more unique aligner among the concatenated aligners. We named this concatenation and bootstrapping method Weighted Partial Super Bootstrap (wpSBOOT). We show on three simulated datasets of 16, 32 and 64 tips that our method improves the predictive power of bootstrap values. We also used as a benchmark an empirical collection of 853 1-to-1 orthologous genes from seven yeast species and found that wpSBOOT significantly improves discrimination capacity between topologically correct and incorrect trees. Bootstrap values of wpSBOOT are comparable to similar readouts estimated using a single method. However, for trees reduced by 50% and 95% bootstrap thresholds, wpSBOOT yields the lowest Type I error (fewer false positives). Availability: The automated generation of replicates has been implemented in the T-Coffee package, which is available as open-source freeware from www.tcoffee.org. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jia-Ming Chang
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Evan W Floden
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Javier Herrero
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Olivier Gascuel
- Unité Bioinformatique Evolutive, Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI)-USR 3756 CNRS and Institut Pasteur, Paris, France
- Paolo Di Tommaso
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Cedric Notredame
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
|