1
Park M, Warnow T. HMMerge: an ensemble method for multiple sequence alignment. Bioinformatics Advances 2023; 3:vbad052. [PMID: 37128578 PMCID: PMC10148686 DOI: 10.1093/bioadv/vbad052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Revised: 04/06/2023] [Accepted: 04/13/2023] [Indexed: 05/03/2023]
Abstract
Motivation: Despite advances in method development for multiple sequence alignment over the last several decades, the alignment of datasets exhibiting substantial sequence length heterogeneity, especially when the input sequences include very short sequences (either as a result of sequencing technologies or of large deletions during evolution), remains an inadequately solved problem. Results: We present HMMerge, a method to compute an alignment of datasets exhibiting high sequence length heterogeneity, or to add short sequences into a given 'backbone' alignment. HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble. HMMerge differs from UPP and WITCH by building a new 'merged' HMM from the ensemble, and then using that merged HMM to align the query sequences. We show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments. Availability and implementation: HMMerge is freely available at https://github.com/MinhyukPark/HMMerge. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Minhyuk Park
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
2
Zaharias P, Warnow T. Recent progress on methods for estimating and updating large phylogenies. Philos Trans R Soc Lond B Biol Sci 2022; 377:20210244. [PMID: 35989607 PMCID: PMC9393559 DOI: 10.1098/rstb.2021.0244] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/12/2021] [Accepted: 01/07/2022] [Indexed: 12/20/2022]
Abstract
With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue 'Genomic population structures of microbial pathogens'.
Affiliation(s)
- Paul Zaharias
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Tandy Warnow
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
3
Lemoine F, Blassel L, Voznica J, Gascuel O. COVID-Align: accurate online alignment of hCoV-19 genomes using a profile HMM. Bioinformatics 2021; 37:1761-1762. [PMID: 33045068 PMCID: PMC7745650 DOI: 10.1093/bioinformatics/btaa871] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/23/2020] [Revised: 05/23/2020] [Accepted: 09/24/2020] [Indexed: 11/12/2022]
Abstract
Motivation: The first cases of the COVID-19 pandemic emerged in December 2019. Until the end of February 2020, the number of available genomes was below 1000 and their multiple alignment was easily achieved using standard approaches. Subsequently, the availability of genomes has grown dramatically. Moreover, some genomes are of low quality with sequencing/assembly errors, making accurate re-alignment of all genomes nearly impossible on a daily basis. A more efficient, yet accurate approach was clearly required to pursue all subsequent bioinformatics analyses of this crucial data. Results: hCoV-19 genomes are highly conserved, with very few indels and no recombination. This makes the profile HMM approach particularly well suited to align new genomes, add them to an existing alignment and filter problematic ones. Using a core of ∼2500 high quality genomes, we estimated a profile using HMMER, and implemented this profile in COVID-Align, a user-friendly interface to be used online or as standalone via Docker. The alignment of 1000 genomes requires ∼50 minutes on our cluster. Moreover, COVID-Align provides summary statistics, which can be used to determine the sequencing quality and evolutionary novelty of input genomes (e.g. number of new mutations and indels). Availability and implementation: https://covalign.pasteur.cloud, hub.docker.com/r/evolbioinfo/covid-align. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Frédéric Lemoine
- Unité de Bioinformatique Evolutive, USR 3756 (DBC/C3BI), Institut Pasteur & CNRS, 75015 - Paris, France; Hub de Bioinformatique et Biostatistique, USR 3756 (DBC/C3BI), Institut Pasteur & CNRS, 75015 - Paris, France
- Luc Blassel
- Unité de Bioinformatique Evolutive, USR 3756 (DBC/C3BI), Institut Pasteur & CNRS, 75015 - Paris, France; ED515, Sorbonne Université, Collège Doctoral, 75006 - Paris, France
- Jakub Voznica
- Unité de Bioinformatique Evolutive, USR 3756 (DBC/C3BI), Institut Pasteur & CNRS, 75015 - Paris, France; Université de Paris, 75006 Paris, France
- Olivier Gascuel
- Unité de Bioinformatique Evolutive, USR 3756 (DBC/C3BI), Institut Pasteur & CNRS, 75015 - Paris, France; Académie des Sciences, USR 3756, CNRS, 75015 - Paris, France
4
Abstract
Multiple sequence alignment is a core first step in many bioinformatics analyses, and errors in these alignments can have negative consequences for scientific studies. In this article, we review some of the recent literature evaluating multiple sequence alignment methods and identify specific challenges that arise when performing these evaluations. In particular, we discuss the different trends observed in simulation studies and when using biological benchmarks. Overall, we find that multiple sequence alignment, far from being a "solved problem," would benefit from new attention.
Affiliation(s)
- Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
5
Warnow T, Mirarab S. Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP. Methods Mol Biol 2021; 2231:99-119. [PMID: 33289889 DOI: 10.1007/978-1-0716-1036-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/12/2023]
Abstract
The estimation of very large multiple sequence alignments is a challenging problem that requires special techniques in order to achieve high accuracy. Here we describe two software packages, PASTA and UPP, for constructing alignments on large and ultra-large datasets. Both methods have been able to produce highly accurate alignments on 1,000,000 sequences, and trees computed on these alignments are also highly accurate. PASTA provides the best tree accuracy when the input sequences are all full-length, but UPP provides improved accuracy compared to PASTA and other methods when the input contains a large number of fragmentary sequences. Both methods are available in open source form on GitHub.
Affiliation(s)
- Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
- Siavash Mirarab
- Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA, USA
6
Portik DM, Wiens JJ. Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses? Syst Biol 2020; 70:440-462. [PMID: 32797207 DOI: 10.1093/sysbio/syaa064] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 05/15/2019] [Revised: 08/02/2020] [Accepted: 08/03/2020] [Indexed: 11/14/2022]
Abstract
Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (∼5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several "best practices" for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming].
Affiliation(s)
- Daniel M Portik
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA; California Academy of Sciences, San Francisco, CA 94118, USA
- John J Wiens
- Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ 85721, USA
7
Nute M, Saleh E, Warnow T. Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Syst Biol 2019; 68:396-411. [PMID: 30329135 PMCID: PMC6472439 DOI: 10.1093/sysbio/syy068] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/18/2018] [Revised: 09/27/2018] [Accepted: 10/11/2018] [Indexed: 01/15/2023]
Abstract
The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical coestimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical coestimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy has better precision and recall (with respect to the true alignments) than the other alignment methods on the simulated data sets but has consistently lower recall on the biological benchmarks (with respect to the reference alignments) than many of the other methods. In other words, we find that BAli-Phy systematically underaligns when operating on biological sequence data but shows no sign of this on simulated data. There are several potential causes for this change in performance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments, and future research is needed to determine the most likely explanation. We conclude with a discussion of the potential ramifications for each of these possibilities. [BAli-Phy; homology; multiple sequence alignment; protein sequences; structural alignment.]
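The precision and recall figures discussed in this abstract can be illustrated with a small sketch. It assumes one common definition based on aligned residue pairs: precision is the fraction of residue pairs in the estimated alignment that also appear in the reference, and recall is the fraction of reference pairs that the estimate recovers. The abstract does not state which definition the study used, and the function names and toy sequences below are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# A residue is identified by (sequence name, index within the ungapped sequence).
Residue = Tuple[str, int]
Pair = Tuple[Residue, Residue]


def homology_pairs(msa: Dict[str, str]) -> Set[Pair]:
    """Return every pair of residues that share a column of the alignment."""
    names = sorted(msa)
    seen = {name: 0 for name in names}  # residues consumed so far per sequence
    pairs: Set[Pair] = set()
    length = len(msa[names[0]])
    for col in range(length):
        column = []
        for name in names:
            if msa[name][col] != "-":
                column.append((name, seen[name]))
                seen[name] += 1
        # names are sorted, so each pair is always emitted in the same order
        pairs.update(combinations(column, 2))
    return pairs


def precision_recall(estimated: Dict[str, str],
                     reference: Dict[str, str]) -> Tuple[float, float]:
    """Pairwise precision and recall of an estimated alignment (assumed definition)."""
    est, ref = homology_pairs(estimated), homology_pairs(reference)
    shared = len(est & ref)
    precision = shared / len(est) if est else 0.0
    recall = shared / len(ref) if ref else 0.0
    return precision, recall


# Toy data (hypothetical): the estimate aligns fewer residues than the reference.
reference = {"A": "ACGT", "B": "AC-T"}
estimated = {"A": "ACGT-", "B": "A--CT"}
precision, recall = precision_recall(estimated, reference)
```

Under this definition, an alignment that places few residues in shared columns (underalignment) loses recall, because many reference pairs go unrecovered.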
Affiliation(s)
- Michael Nute
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 S Wright St #101, Champaign, IL 61820, USA
- Ehsan Saleh
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA
- Tandy Warnow
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1205 W. Clark St., Urbana, IL 61801, USA; National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
8
Ashkenazy H, Sela I, Levy Karin E, Landan G, Pupko T. Multiple Sequence Alignment Averaging Improves Phylogeny Reconstruction. Syst Biol 2019; 68:117-130. [PMID: 29771363 PMCID: PMC6657586 DOI: 10.1093/sysbio/syy036] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 05/12/2017] [Revised: 05/07/2018] [Accepted: 05/09/2018] [Indexed: 01/11/2023]
Abstract
The classic methodology of inferring a phylogenetic tree from sequence data is composed of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet, inferred MSAs were shown to be inaccurate and alignment errors reduce tree inference accuracy. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work we explore an approach, in which instead of relying on a single MSA, we generate a large set of alternative MSAs and concatenate them into a single SuperMSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA computed by alignment algorithms. Using simulations, we demonstrate that this approach results, on average, in more accurate trees compared to 1) using an unfiltered MSA and 2) using a single MSA with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether or not the extra effort of computing a SuperMSA and inferring a tree from it is beneficial. Based on these assessments, we expect our methodology to be useful for many cases in which diverged sequences are analyzed. The option to generate such a SuperMSA is available at http://guidance.tau.ac.il.
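The concatenation step described in this abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it assumes each alternative MSA is a mapping from taxon name to an aligned row, and the function name and toy alignments are hypothetical. How the alternative MSAs are generated is outside the scope of the sketch.

```python
from typing import Dict, List

Alignment = Dict[str, str]  # taxon name -> aligned row (gaps written as "-")


def concatenate_msas(msas: List[Alignment]) -> Alignment:
    """Join alternative alignments of the same taxa side by side into one SuperMSA."""
    if not msas:
        raise ValueError("need at least one alignment")
    taxa = sorted(msas[0])
    for msa in msas:
        if sorted(msa) != taxa:
            raise ValueError("all alignments must contain the same taxa")
        if len({len(row) for row in msa.values()}) != 1:
            raise ValueError("rows within an alignment must have equal length")
    # Each taxon's SuperMSA row is its rows from every alternative MSA, in order.
    return {taxon: "".join(msa[taxon] for msa in msas) for taxon in taxa}


# Two hypothetical alternative alignments of the same three sequences.
alt1 = {"A": "AC-GT", "B": "ACGGT", "C": "A--GT"}
alt2 = {"A": "ACG-T", "B": "ACGGT", "C": "A-G-T"}
super_msa = concatenate_msas([alt1, alt2])
```

The resulting SuperMSA has as many columns as all the input alignments combined, so columns that appear in only some of the alternatives still contribute to tree inference.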
Affiliation(s)
- Haim Ashkenazy
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
- Itamar Sela
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Eli Levy Karin
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel
- Department of Molecular Biology & Ecology of Plants, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
- Giddy Landan
- Institute of Microbiology, Christian-Albrechts-University of Kiel, 24118 Kiel, Germany
- Tal Pupko
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Tel Aviv, Israel