1
|
Ashrafzadeh S, Golding GB, Ilie S, Ilie L. Scoring alignments by embedding vector similarity. Brief Bioinform 2024; 25:bbae178. [PMID: 38695119 PMCID: PMC11063651 DOI: 10.1093/bib/bbae178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2023] [Revised: 03/20/2024] [Accepted: 03/31/2024] [Indexed: 05/05/2024] Open
Abstract
Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicate that the alignments produced using the new $E$-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various $E$-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
Collapse
Affiliation(s)
- Sepehr Ashrafzadeh
- Department of Computer Science, University of Western Ontario, London, N6A 5B7, Ontario, Canada
| | - G Brian Golding
- Department of Biology, McMaster University, Hamilton, L8S 4K1, Ontario, Canada
| | - Silvana Ilie
- Department of Mathematics, Toronto Metropolitan University, Toronto, M5B 2K3, Ontario, Canada
| | - Lucian Ilie
- Department of Computer Science, University of Western Ontario, London, N6A 5B7, Ontario, Canada
| |
Collapse
|
2
|
Khairul WM, Hashim F, Rahamathullah R, Mohammed M, Aisyah Razali S, Ahmad Tajudin Tuan Johari S, Azizan S. Exploring ethynyl-based chalcones as green semiconductor materials for optical limiting interests. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2024; 308:123776. [PMID: 38134650 DOI: 10.1016/j.saa.2023.123776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/05/2023] [Revised: 12/09/2023] [Accepted: 12/14/2023] [Indexed: 12/24/2023]
Abstract
The fabrication of molecular electronics from non-toxic functional materials which eventually would potentially able to degrade or being breaking down into safe by-products have attracted much interests in recent years. Hence, in this study, the introduction of mixed highly functional substructures of chalcone (-CO-CH=CH-) and ethynylated (C≡C) as building blocks has shown ideal performance as solution-processed thin film candidatures. Two types of derivatives, (MM-3a) and (MM-3b) repectively, showed a substantial Stokes shifts at 75 nm and 116 nm, in which such emission exhibits an intramolecular charge transfer (ICT) state and fluoresce characteristics. The density functional theory (DFT) simulation shows that MM-3a and MM-3b exhibit low energy gaps of 3.70 eV and 2.81 eV, respectively. TD-DFT computations for molecular electrostatic potential (MEP) and frontier molecular orbitals (FMO) were also used to emphasise the structure-property relationship. A solution-processed thin film with a single layer of ITO/PEDOT:PSS/MM-3a-MM-3b/Au exhibited electroluminescence behaviour with orange and purple emissions when supplied with direct current (DC) voltages. To promote the safer application of the derivatives formed, ethynylated chalcone materials underwent toxicity studies toward Acanthamoeba sp. to determine their suitability as non-toxic molecules prior to the determination as safer materials in optical limiting interests. From the preliminary test, no IC50 value was obtained for both compounds via 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl-2H-tetrazolium bromide (MTT) assay analysis and molecular docking analysis between MM-3a and MM-3b, with profilin protein exhibited weak bond interactions and attaining huge interaction distances.
Collapse
Affiliation(s)
- Wan M Khairul
- Faculty of Science and Marine Environment, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia.
| | - Fatimah Hashim
- Faculty of Science and Marine Environment, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia; Biological Security and Sustainability Research Interest Group (BIOSES), Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia.
| | - Rafizah Rahamathullah
- Faculty of Chemical Engineering & Technology, University Malaysia Perlis, Level 1, Block S2, UniCITI Alam Campus, Sungai Chuchuh, Padang Besar, 02100 Perlis, Malaysia
| | - Mas Mohammed
- Faculty of Science and Marine Environment, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia
| | - Siti Aisyah Razali
- Faculty of Science and Marine Environment, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia; Biological Security and Sustainability Research Interest Group (BIOSES), Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia
| | - Syed Ahmad Tajudin Tuan Johari
- Centre for Research in Infectious Diseases and Biotechnology, Faculty of Medicine, Universiti Sultan Zainal Abidin, Medical Campus. 20400 kuala Terengganu, Terengganu, Malaysia
| | - Suha Azizan
- Faculty of Science and Marine Environment, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia; Department of Pharmacology, Faculty of Medicine, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
| |
Collapse
|
3
|
Iovino BG, Ye Y. Protein embedding based alignment. BMC Bioinformatics 2024; 25:85. [PMID: 38413857 PMCID: PMC10900708 DOI: 10.1186/s12859-024-05699-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2023] [Accepted: 02/12/2024] [Indexed: 02/29/2024] Open
Abstract
PURPOSE Despite the many progresses with alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (so called "twilight zone") remains a difficult problem. Many alignment algorithms have been using substitution matrices since their creation in the 1970's to generate alignments, however, these matrices do not work well to score alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm but the matching score of amino acids is based on the similarity of their embeddings from a protein language model. METHODS We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances. RESULTS PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, achieving different levels of improvements of the alignment quality for pairs of sequences with different levels of similarity (over four times as well for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods. CONCLUSION Our results suggested that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.
Collapse
Affiliation(s)
- Benjamin Giovanni Iovino
- Luddy School of Informatics, Computing and Engineering, Indiana University, 700 N. Woodlawn Avenue, Bloomington, IN, 47408, USA
| | - Yuzhen Ye
- Luddy School of Informatics, Computing and Engineering, Indiana University, 700 N. Woodlawn Avenue, Bloomington, IN, 47408, USA.
| |
Collapse
|
4
|
Liu Y, Yuan H, Zhang Q, Wang Z, Xiong S, Wen N, Zhang Y. Multiple sequence alignment based on deep reinforcement learning with self-attention and positional encoding. Bioinformatics 2023; 39:btad636. [PMID: 37856335 PMCID: PMC10628385 DOI: 10.1093/bioinformatics/btad636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2022] [Revised: 07/24/2023] [Accepted: 10/17/2023] [Indexed: 10/21/2023] Open
Abstract
MOTIVATION Multiple sequence alignment (MSA) is one of the hotspots of current research and is commonly used in sequence analysis scenarios. However, there is no lasting solution for MSA because it is a Nondeterministic Polynomially complete problem, and the existing methods still have room to improve the accuracy. RESULTS We propose Deep reinforcement learning with Positional encoding and self-Attention for MSA, based on deep reinforcement learning, to enhance the accuracy of the alignment Specifically, inspired by the translation technique in natural language processing, we introduce self-attention and positional encoding to improve accuracy and reliability. Firstly, positional encoding encodes the position of the sequence to prevent the loss of nucleotide position information. Secondly, the self-attention model is used to extract the key features of the sequence. Then input the features into a multi-layer perceptron, which can calculate the insertion position of the gap according to the features. In addition, a novel reinforcement learning environment is designed to convert the classic progressive alignment into progressive column alignment, gradually generating each column's sub-alignment. Finally, merge the sub-alignment into the complete alignment. Extensive experiments based on several datasets validate our method's effectiveness for MSA, outperforming some state-of-the-art methods in terms of the Sum-of-pairs and Column scores. AVAILABILITY AND IMPLEMENTATION The process is implemented in Python and available as open-source software from https://github.com/ZhangLab312/DPAMSA.
Collapse
Affiliation(s)
- Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Hao Yuan
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Qiang Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Zixuan Wang
- College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
| | - Shuwen Xiong
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| | - Naifeng Wen
- School of Mechanical and Electrical Engineering, Dalian Minzu University, Dalian 116600, China
| | - Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China
| |
Collapse
|
5
|
Muntoni AP, Pagnani A. DCAlign v1.0: aligning biological sequences using co-evolution models and informed priors. Bioinformatics 2023; 39:btad537. [PMID: 37647658 PMCID: PMC10491954 DOI: 10.1093/bioinformatics/btad537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2022] [Revised: 06/14/2023] [Accepted: 08/29/2023] [Indexed: 09/01/2023] Open
Abstract
SUMMARY DCAlign is a new alignment method able to cope with the conservation and the co-evolution signals that characterize the columns of multiple sequence alignments of homologous sequences. However, the pre-processing steps required to align a candidate sequence are computationally demanding. We show in v1.0 how to dramatically reduce the overall computing time by including an empirical prior over an informative set of variables mirroring the presence of insertions and deletions. AVAILABILITY AND IMPLEMENTATION DCAlign v1.0 is implemented in Julia and it is fully available at https://github.com/infernet-h2020/DCAlign.
Collapse
Affiliation(s)
- Anna Paola Muntoni
- Italian Institute for Genomic Medicine, IRCCS Candiolo, I-10060 Candiolo (TO), Italy
- Politecnico di Torino, I-10129 Torino, Italy
| | - Andrea Pagnani
- Italian Institute for Genomic Medicine, IRCCS Candiolo, I-10060 Candiolo (TO), Italy
- Politecnico di Torino, I-10129 Torino, Italy
- INFN, Sezione di Torino, Torino, Via Pietro Giuria 1, I-10125 Torino, Italy
| |
Collapse
|
6
|
João M, Sena AC, Rebello VEF. On closing the inopportune gap with consistency transformation and iterative refinement. PLoS One 2023; 18:e0287483. [PMID: 37440507 PMCID: PMC10343097 DOI: 10.1371/journal.pone.0287483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Accepted: 06/06/2023] [Indexed: 07/15/2023] Open
Abstract
The problem of aligning multiple biological sequences has fascinated scientists for a long time. Over the last four decades, tens of heuristic-based Multiple Sequence Alignment (MSA) tools have been proposed, the vast majority being built on the concept of Progressive Alignment. It is known, however, that this approach suffers from an inherent drawback regarding the inadvertent insertion of gaps when aligning sequences. Two well-known corrective solutions have frequently been adopted to help mitigate this: Consistency Transformation and Iterative Refinement. This paper takes a tool-independent technique-oriented look at the alignment quality benefits of these two strategies using problem instances from the HOMSTRAD and BAliBASE benchmarks. Eighty MSA aligners have been used to compare 4 classes of heuristics: Progressive Alignments, Iterative Alignments, Consistency-based Alignments, and Consistency-based Progressive Alignments with Iterative Refinement. Statistically, while both Consistency-based classes are better for alignments with low similarity, for sequences with higher similarity, the differences between the classes are less clear. Iterative Refinement has its own drawbacks resulting in there being statistically little advantage for Progressive Aligners to adopt this technique either with Consistency Transformation or without. Nevertheless, all 4 classes are capable of bettering each other, depending on the instance problem. This further motivates the development of MSA frameworks, such as the one being developed for this research, which simultaneously contemplate multiple classes and techniques in their attempt to uncover better solutions.
Collapse
Affiliation(s)
- Mario João
- Medical Sciences College, State University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
- Institute of Computing, Fluminense Federal University, Niterói, Rio de Janeiro, Brazil
| | - Alexandre C Sena
- Institute of Mathematics and Statistics, State University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
| | - Vinod E F Rebello
- Institute of Computing, Fluminense Federal University, Niterói, Rio de Janeiro, Brazil
| |
Collapse
|
7
|
Zong F, Long C, Hu W, Chen S, Dai W, Xiao ZX, Cao Y. Abalign: a comprehensive multiple sequence alignment platform for B-cell receptor immune repertoires. Nucleic Acids Res 2023:7173809. [PMID: 37207341 DOI: 10.1093/nar/gkad400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2023] [Revised: 04/23/2023] [Accepted: 05/08/2023] [Indexed: 05/21/2023] Open
Abstract
The utilization of high-throughput sequencing (HTS) for B-cell receptor (BCR) immune repertoire analysis has become widespread in the fields of adaptive immunity and antibody drug development. However, the sheer volume of sequences generated by these experiments presents a challenge in data processing. Specifically, multiple sequence alignment (MSA), a critical aspect of BCR analysis, remains inadequate for handling massive BCR sequencing data and lacks the ability to provide immunoglobulin-specific information. To address this gap, we introduce Abalign, a standalone program specifically designed for ultrafast MSA of BCR/antibody sequences. Benchmark tests demonstrate that Abalign achieves comparable or even better accuracy than state-of-the-art MSA tools, and shows remarkable advantages in terms of speed and memory consumption, reducing the time required for high-throughput analysis from weeks to hours. In addition to its alignment capabilities, Abalign offers a broad range of BCR analysis features, including extracting BCRs, constructing lineage trees, assigning VJ genes, analyzing clonotypes, profiling mutations, and comparing BCR immune repertoires. With its user-friendly graphic interface, Abalign can be easily run on personal computers instead of computing clusters. Overall, Abalign is an easy-to-use and effective tool that enables researchers to analyze massive BCR/antibody sequences, leading to new discoveries in the field of immunoinformatics. The software is freely available at http://cao.labshare.cn/abalign/.
Collapse
Affiliation(s)
- Fanjie Zong
- Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
- Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China
| | - Chenyu Long
- Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
- Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China
| | - Wanxin Hu
- Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
- Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China
| | - Shuang Chen
- Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
| | - Wentao Dai
- NHC Key Laboratory of Reproduction Regulation & Shanghai-MOST Key Laboratory of Health and Disease Genomics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
| | - Zhi-Xiong Xiao
- Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
| | - Yang Cao
- Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
- Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China
| |
Collapse
|
8
|
Nayeem MA, Samudro NA, Rahman MS, Rahman MS. MAMMLE: A Framework for Phylogeny Estimation Based on Multiobjective Application-aware Multiple Sequence Alignment and Maximum Likelihood Ensemble. J Comput Biol 2023; 30:245-249. [PMID: 36706434 DOI: 10.1089/cmb.2021.0533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open
Abstract
Motivation: Phylogenetic trees are often inferred from a multiple sequence alignment (MSA) where the tree accuracy is heavily impacted by the nature of estimated alignment. Carefully equipping an MSA tool with multiple application-aware objectives positively impacts its capability to yield better trees. Results: We introduce Multiobjective Application-aware Multiple Sequence Alignment and Maximum Likelihood Ensemble (MAMMLE), a framework for inferring better phylogenetic trees from unaligned sequences by hybridizing two MSA tools [i.e., Multiple Sequence Comparison by Log-Expectation (MUSCLE) and Multiple Alignment using Fast Fourier Transform (MAFFT)] with multiobjective optimization strategy and leveraging multiple maximum likelihood hypotheses. In our experiments, MAMMLE exhibits 5.57% (4.77%) median improvement (deterioration) over MUSCLE on 50.34% (37.41%) of instances.
Collapse
|
9
|
Benítez-Hidalgo A, Aldana-Montes JF, Navas-Delgado I, Roldán-García MDM. SALON ontology for the formal description of sequence alignments. BMC Bioinformatics 2023; 24:69. [PMID: 36849882 PMCID: PMC9972671 DOI: 10.1186/s12859-023-05190-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2021] [Accepted: 02/15/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND Information provided by high-throughput sequencing platforms allows the collection of content-rich data about biological sequences and their context. Sequence alignment is a bioinformatics approach to identifying regions of similarity in DNA, RNA, or protein sequences. However, there is no consensus about the specific common terminology and representation for sequence alignments. Thus, automatically linking the wide existing knowledge about the sequences with the alignments is challenging. RESULTS The Sequence Alignment Ontology (SALON) defines a helpful vocabulary for representing and semantically annotating pairwise and multiple sequence alignments. SALON is an OWL 2 ontology that supports automated reasoning for alignments validation and retrieving complementary information from public databases under the Open Linked Data approach. This will reduce the effort needed by scientists to interpret the sequence alignment results. CONCLUSIONS SALON defines a full range of controlled terminology in the domain of sequence alignments. It can be used as a mediated schema to integrate data from different sources and validate acquired knowledge.
Collapse
Affiliation(s)
- Antonio Benítez-Hidalgo
- Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain. .,University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain. .,Instituto de Investigación Biomédica de Málaga - IBIMA, Málaga, Spain.
| | - José F. Aldana-Montes
- grid.10215.370000 0001 2298 7828Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain ,grid.10215.370000 0001 2298 7828University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain ,grid.452525.1Instituto de Investigación Biomédica de Málaga – IBIMA, Málaga, Spain
| | - Ismael Navas-Delgado
- grid.10215.370000 0001 2298 7828Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain ,grid.10215.370000 0001 2298 7828University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain ,grid.452525.1Instituto de Investigación Biomédica de Málaga – IBIMA, Málaga, Spain
| | - María del Mar Roldán-García
- grid.10215.370000 0001 2298 7828Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain ,grid.10215.370000 0001 2298 7828University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain ,grid.452525.1Instituto de Investigación Biomédica de Málaga – IBIMA, Málaga, Spain
| |
Collapse
|
10
|
Kuang M, Zhang Y, Lam TW, Ting HF. MLProbs: A Data-Centric Pipeline for Better Multiple Sequence Alignment. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:524-533. [PMID: 35120007 DOI: 10.1109/tcbb.2022.3148382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
In this paper, we explore using the data-centric approach to tackle the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach explores using classification models trained from existing benchmark data to guide the construction. We identified two simple classifications to help us choose a better alignment tool and determine whether and how much to carry out realignment. We show that shallow machine-learning algorithms suffice to train sensitive models for these classifications. Based on these models, we implemented a new multiple sequence alignment pipeline, called MLProbs. Compared with 10 other popular alignment tools over four benchmark databases (namely, BAliBASE, OXBench, OXBench-X and SABMark), MLProbs consistently gives the highest TC score. More importantly, MLProbs shows non-trivial improvement for protein families with low similarity; in particular, when evaluated against the 1,356 protein families with similarity ≤ 50%, MLProbs achieves a TC score of 56.93, while the next best three tools are in the range of [55.41, 55.91] (increased by more than 1.8%). We also compared the performance of MLProbs and other MSA tools in two real-life applications - Phylogenetic Tree Construction Analysis and Protein Secondary Structure Prediction - and MLProbs also had the best performance. In our study, we used only shallow machine-learning algorithms to train our models. It would be interesting to study whether deep-learning methods can help make further improvements, so we suggest some possible research directions in the conclusion section.
Collapse
|
11
|
Becker F, Stanke M. learnMSA: learning and aligning large protein families. Gigascience 2022; 11:6833031. [PMID: 36399060 PMCID: PMC9673500 DOI: 10.1093/gigascience/giac104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Revised: 09/01/2022] [Accepted: 10/06/2022] [Indexed: 11/19/2022] Open
Abstract
Background The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments. Results We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum–Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU. Conclusions Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
Collapse
Affiliation(s)
- Felix Becker
- Institute of Mathematics and Computer Science, University of Greifswald , Walther-Rathenau-Straße 47, 17489 Greifswald , Germany
| | - Mario Stanke
- Institute of Mathematics and Computer Science, University of Greifswald , Walther-Rathenau-Straße 47, 17489 Greifswald , Germany
| |
Collapse
|
12
|
Seo TK, Redelings BD, Thorne JL. Correlations between alignment gaps and nucleotide substitution or amino acid replacement. Proc Natl Acad Sci U S A 2022; 119:e2204435119. [PMID: 35972964 PMCID: PMC9407537 DOI: 10.1073/pnas.2204435119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2022] [Accepted: 07/11/2022] [Indexed: 11/18/2022] Open
Abstract
To assess the conventional treatment in evolutionary inference of alignment gaps as missing data, we propose a simple nonparametric test of the null hypothesis that the locations of alignment gaps are independent of the nucleotide substitution or amino acid replacement process. When we apply the test to 1,390 protein alignments that are informed by protein tertiary structure and use a 5% significance level, the null hypothesis of independence between amino acid replacement and gap location is rejected for ∼65% of datasets. Via simulations that include substitution and insertion-deletion, we show that the test performs well with true alignments. When we simulate according to the null hypothesis and then apply the test to optimal alignments that are inferred by each of four widely used software packages, the null hypothesis is rejected too frequently. Via further simulations and analyses, we show that the overly frequent rejections of the null hypothesis are not solely due to weaknesses of widely used software for finding optimal alignments. Instead, our evidence suggests that optimal alignments are unrepresentative of true alignments and that biased evolutionary inferences may result from relying upon individual optimal alignments.
Collapse
Affiliation(s)
- Tae-Kun Seo
- Division of Life Sciences, Korea Polar Research Institute, Yeonsu-gu, Incheon 21990, Republic of Korea
| | - Benjamin D. Redelings
- Biology Department, Duke University, Durham, NC 27708
- Ronin Institute, Durham, NC 27705
- Department of Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS 66045
| | - Jeffrey L. Thorne
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695
- Department of Statistics, North Carolina State University, Raleigh, NC 27695
| |
Collapse
|
13
|
Nayeem MA, Bayzid MS, Rahman AH, Shahriyar R, Rahman MS. Multiobjective Formulation of Multiple Sequence Alignment for Phylogeny Inference. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:2775-2786. [PMID: 33044939 DOI: 10.1109/tcyb.2020.3020308] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Multiple sequence alignment (MSA) is a preliminary task for estimating phylogenies. It is used for homology inference among the sequences of a set of species. Generally, the MSA task is handled as a single-objective optimization process. The alignments computed under one criterion may be different from the alignments generated by other criteria, inferring discordant homologies and thus leading to different hypothesized evolutionary histories relating the sequences. The multiobjective (MO) formulation of MSA has recently been advocated by several researchers, to address this issue. An MO approach independently optimizes multiple (often conflicting) objective functions at the same time and outputs a set of competitive alignments. However, no conceptual or experimental rational from a real-world application perspective has been reported so far for any MO formulation of MSA. This article work investigates the impact of MO formulation in the context of an important scientific problem, namely, phylogeny estimation. Employing popular evolutionary MO algorithms, we show that: 1) trees inferred based on alignments produced by the existing MSA methods used in practice are substantially worse in quality than the trees inferred based on the alignment's output by an MO algorithm and 2) even high-quality alignments (according to popular measures available in the literature) may fail to achieve acceptable accuracy in generating phylogenetic trees. Thus, we essentially ask the following natural question: "can a phylogeny-aware (i.e., application-aware) metric guide in selecting appropriate MO formulations to ensure better phylogeny estimation?" Here, we report a carefully designed extensive experimental study that positively answers this question.
Collapse
|
14
|
Chao J, Tang F, Xu L. Developments in Algorithms for Sequence Alignment: A Review. Biomolecules 2022; 12:biom12040546. [PMID: 35454135 PMCID: PMC9024764 DOI: 10.3390/biom12040546] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2022] [Revised: 03/29/2022] [Accepted: 03/31/2022] [Indexed: 01/27/2023] Open
Abstract
The continuous development of sequencing technologies has enabled researchers to obtain large amounts of biological sequence data, and this has resulted in increasing demands for software that can perform sequence alignment fast and accurately. A number of algorithms and tools for sequence alignment have been designed to meet the various needs of biologists. Here, the ideas that prevail in the research of sequence alignment and some quality estimation methods for multiple sequence alignment tools are summarized.
Collapse
Affiliation(s)
- Jiannan Chao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China;
| | - Furong Tang
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China;
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen 518055, China
- Correspondence:
| |
Collapse
|
15
|
Shrestha B, Adhikari B. Scoring protein sequence alignments using deep Learning. Bioinformatics 2022; 38:2988-2995. [PMID: 35385080 DOI: 10.1093/bioinformatics/btac210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2021] [Revised: 04/01/2022] [Accepted: 04/05/2022] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND A high-quality sequence alignment (SA) is the most important input feature for accurate protein structure prediction. For a protein sequence, there are many methods to generate a SA. However, when given a choice of more than one SA for a protein sequence, there are no methods to predict which SA may lead to more accurate models without actually building the models. In this work, we describe a method to predict the quality of a protein's SA. METHODS We created our own dataset by generating a variety of SAs for a set of 1,351 representative proteins and investigated various deep learning architectures to predict the local distance difference test (lDDT) scores of distance maps predicted with SAs as the input. These lDDT scores serve as indicators of the quality of the SAs. RESULTS Using two independent test datasets consisting of CASP13 and CASP14 targets, we show that our method is effective for scoring and ranking SAs when a pool of SAs is available for a protein sequence. With an example, we further discuss that SA selection using our method can lead to improved structure prediction. AVAILABILITY Code and datasets are available at https://github.com/ba-lab/Alignment-Score/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Bikash Shrestha
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63132, USA
| | - Badri Adhikari
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63132, USA
| |
Collapse
|
16
|
Kostenko DO, Korotkov EV. Application of the MAHDS Method for Multiple Alignment of Highly Diverged Amino Acid Sequences. Int J Mol Sci 2022; 23:ijms23073764. [PMID: 35409125 PMCID: PMC8998981 DOI: 10.3390/ijms23073764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2022] [Revised: 03/23/2022] [Accepted: 03/23/2022] [Indexed: 12/10/2022] Open
Abstract
The aim of this work was to compare the multiple alignment methods MAHDS, T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK in their ability to align highly divergent amino acid sequences. To accomplish this, we created test amino acid sequences with an average number of substitutions per amino acid (x) from 0.6 to 5.6, a total of 81 sets. Comparison of the performance of sequence alignments constructed by MAHDS and previously developed algorithms using the CS and Z score criteria and the benchmark alignment database (BAliBASE) indicated that, although the quality of the alignments built with MAHDS was somewhat lower than that of the other algorithms, it was compensated by greater statistical significance. MAHDS could construct statistically significant alignments of artificial sequences with x ≤ 4.8, whereas the other algorithms (T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK) could not perform that at x > 2.4. The application of MAHDS to align 21 families of highly diverged proteins (identity < 20%) from Pfam and HOMSTRAD databases showed that it could calculate statistically significant alignments in cases when the other methods failed. Thus, MAHDS could be used to construct statistically significant multiple alignments of highly divergent protein sequences, which accumulated multiple mutations during evolution.
Collapse
|
17
|
Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022] Open
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Collapse
Affiliation(s)
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China.,School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
| | - Qiang Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Jiliu Zhou
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
| |
Collapse
|
18
|
Nayeem MA, Bayzid MS, Samudro NA, Rahman MS, Rahman MS. PASTA with many application-aware optimization criteria for alignment based phylogeny inference. Comput Biol Chem 2022; 98:107661. [DOI: 10.1016/j.compbiolchem.2022.107661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2021] [Revised: 02/22/2022] [Accepted: 02/27/2022] [Indexed: 11/25/2022]
|
19
|
Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022; 30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]
|
20
|
TwinCons: Conservation score for uncovering deep sequence similarity and divergence. PLoS Comput Biol 2021; 17:e1009541. [PMID: 34714829 PMCID: PMC8580257 DOI: 10.1371/journal.pcbi.1009541] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 11/10/2021] [Accepted: 10/06/2021] [Indexed: 11/19/2022] Open
Abstract
We have developed the program TwinCons, to detect noisy signals of deep ancestry of proteins or nucleic acids. As input, the program uses a composite alignment containing pre-defined groups, and mathematically determines a 'cost' of transforming one group to the other at each position of the alignment. The output distinguishes conserved, variable and signature positions. A signature is conserved within groups but differs between groups. The method automatically detects continuous characteristic stretches (segments) within alignments. TwinCons provides a convenient representation of conserved, variable and signature positions as a single score, enabling the structural mapping and visualization of these characteristics. Structure is more conserved than sequence. TwinCons highlights alternative sequences of conserved structures. Using TwinCons, we detected highly similar segments between proteins from the translation and transcription systems. TwinCons detects conserved residues within regions of high functional importance for the ribosomal RNA (rRNA) and demonstrates that signatures are not confined to specific regions but are distributed across the rRNA structure. The ability to evaluate both nucleic acid and protein alignments allows TwinCons to be used in combined sequence and structural analysis of signatures and conservation in rRNA and in ribosomal proteins (rProteins). TwinCons detects a strong sequence conservation signal between bacterial and archaeal rProteins related by circular permutation. This conserved sequence is structurally colocalized with conserved rRNA, indicated by TwinCons scores of rRNA alignments of bacterial and archaeal groups. This combined analysis revealed deep co-evolution of rRNA and rProtein buried within the deepest branching points in the tree of life.
Collapse
|
21
|
Pajkos M, Dosztányi Z. Functions of intrinsically disordered proteins through evolutionary lenses. PROGRESS IN MOLECULAR BIOLOGY AND TRANSLATIONAL SCIENCE 2021; 183:45-74. [PMID: 34656334 DOI: 10.1016/bs.pmbts.2021.06.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Protein sequences are the result of an evolutionary process that involves the balancing act of experimenting with novel mutations and selecting out those that have an undesirable functional outcome. In the case of globular proteins, the function relies on a well-defined conformation, therefore, there is a strong evolutionary pressure to preserve the structure. However, different evolutionary rules might apply for the group of intrinsically disordered regions and proteins (IDR/IDPs) that exist as an ensemble of fluctuating conformations. The function of IDRs can directly originate from their disordered state or arise through different types of molecular recognition processes. There is an amazing variety of ways IDRs can carry out their functions, and this is also reflected in their evolutionary properties. In this chapter we give an overview of the different types of evolutionary behavior of disordered proteins and associated functions in normal and disease settings.
Collapse
Affiliation(s)
- Mátyás Pajkos
- Department of Biochemistry, ELTE Eötvös Loránd University, Budapest, Hungary
| | - Zsuzsanna Dosztányi
- Department of Biochemistry, ELTE Eötvös Loránd University, Budapest, Hungary.
| |
Collapse
|
22
|
Lladós J, Cores F, Guirado F, Lérida JL. Accurate consistency-based MSA reducing the memory footprint. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 208:106237. [PMID: 34198017 DOI: 10.1016/j.cmpb.2021.106237] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/13/2020] [Accepted: 06/08/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVE The emergence of Next-Generation sequencing has created a push for faster and more accurate multiple sequence alignment tools. The growing number of sequences and their longer sizes, which require the use of increased system resources and produce less accurate results, are heavily challenging to these applications. Consistency-based methods have the most intensive CPU and memory usage requirements. We hypothesize that library reductions can enhance the scalability and performance of consistency-based multiple sequence alignment tools; however, we have previously shown a noticeable impact on the accuracy when extreme reductions were performed. METHODS In this study, we propose Matrix-Based T-Coffee, a consistency-based method that uses library reductions in conjunction with a complementary objective function. The proposed method, implemented in T-Coffee, can mitigate the accuracy loss caused by low memory resources. RESULTS The use of a complementary objective function with a library reduction of ≥30% improved the accuracy of T-Coffee. Interestingly, ≥50% library reduction achieved lower execution times and better overall scalability. CONCLUSIONS Matrix-Based T-Coffee benefits from accurate alignments while achieving better scalability. This leads to a reduction in memory footprint and execution time. In addition, these enhancements could be applied to other aligners based on consistency.
Collapse
Affiliation(s)
- Jordi Lladós
- INSPIRES Research Center, Universitat de Lleida. Jaume II. 69, 25001 Lleida, Spain.
| | - Fernando Cores
- INSPIRES Research Center, Universitat de Lleida. Jaume II. 69, 25001 Lleida, Spain.
| | - Fernando Guirado
- INSPIRES Research Center, Universitat de Lleida. Jaume II. 69, 25001 Lleida, Spain.
| | - Josep L Lérida
- INSPIRES Research Center, Universitat de Lleida. Jaume II. 69, 25001 Lleida, Spain.
| |
Collapse
|
23
|
Zhang C, Zhao Y, Braun EL, Mirarab S. TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13696] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Affiliation(s)
- Chao Zhang
- Bioinformatics and Systems Biology Program University of California San Diego CA USA
| | - Yiming Zhao
- Electrical and Computer Engineering Department University of California San Diego CA USA
| | - Edward L. Braun
- Department of Biology and Genetics Institute University of Florida Gainesville FL USA
| | - Siavash Mirarab
- Electrical and Computer Engineering Department University of California San Diego CA USA
| |
Collapse
|
24
|
Abadi S, Avram O, Rosset S, Pupko T, Mayrose I. ModelTeller: Model Selection for Optimal Phylogenetic Reconstruction Using Machine Learning. Mol Biol Evol 2021; 37:3338-3352. [PMID: 32585030 DOI: 10.1093/molbev/msaa154] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task consume computational resources for long processing time, they are not always feasible, and sometimes depend on preliminary assumptions which do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model and thus the prediction according to features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.
Collapse
Affiliation(s)
- Shiran Abadi
- School of Plant Sciences and Food security, Tel-Aviv University, Tel-Aviv, Israel
| | - Oren Avram
- School of Molecular Cell Biology & Biotechnology, Tel-Aviv University, Tel-Aviv, Israel
| | - Saharon Rosset
- Department of Statistics and Operations Research, School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv, Israel
| | - Tal Pupko
- School of Molecular Cell Biology & Biotechnology, Tel-Aviv University, Tel-Aviv, Israel
| | - Itay Mayrose
- School of Plant Sciences and Food security, Tel-Aviv University, Tel-Aviv, Israel
| |
Collapse
|
25
|
Azouri D, Abadi S, Mansour Y, Mayrose I, Pupko T. Harnessing machine learning to guide phylogenetic-tree search algorithms. Nat Commun 2021; 12:1983. [PMID: 33790270 PMCID: PMC8012635 DOI: 10.1038/s41467-021-22073-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2020] [Accepted: 02/26/2021] [Indexed: 02/01/2023] Open
Abstract
Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides means to safely discard a large set of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.
Collapse
Affiliation(s)
- Dana Azouri
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Shiran Abadi
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Yishay Mansour
- Balvatnik School of Computer Science, Tel-Aviv University, Ramat Aviv, Tel-Aviv, Israel
| | - Itay Mayrose
- School of Plant Sciences and Food Security, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Ramat Aviv, Tel-Aviv, Israel.
| |
Collapse
|
26
|
Chockalingam SP, Pannu J, Hooshmand S, Thankachan SV, Aluru S. An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction. BMC Bioinformatics 2020; 21:404. [PMID: 33203364 PMCID: PMC7672814 DOI: 10.1186/s12859-020-03738-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2020] [Accepted: 09/04/2020] [Indexed: 11/10/2022] Open
Abstract
Abstract
Background
Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n logkn) time and hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced.
Results
In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy, runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.
Conclusions
Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
Collapse
|
27
|
Trivedi R, Nagarajaram HA. Substitution scoring matrices for proteins - An overview. Protein Sci 2020; 29:2150-2163. [PMID: 32954566 DOI: 10.1002/pro.3954] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 01/17/2023]
Abstract
Sequence analysis is the primary and simplest approach to discover structural, functional and evolutionary details of related proteins. All the alignment based approaches of sequence analysis make use of amino acid substitution matrices, and the accuracy of the results largely depends on the type of scoring matrices used to perform alignment tasks. An amino acid substitution matrix is a 20 × 20 matrix in which the individual elements encapsulate the rates at which each of the 20 amino acid residues in proteins are substituted by other amino acid residues over time. In contrast to most globular/ordered proteins whose amino acids composition is considered as standard, there are several classes of proteins (e.g., transmembrane proteins) in which certain types of amino acid (e.g., hydrophobic residues) are enriched. These compositional differences among various classes of proteins are manifested in their underlying residue substitution frequencies. Therefore, each of the compositionally distinct class of proteins or protein segments should be studied using specific scoring matrices that reflect their distinct residue substitution pattern. In this review, we describe the development and application of various substitution scoring matrices peculiar to proteins with standard and biased compositions. Along with most commonly used standard matrices (PAM, BLOSUM, MD and VTML) that act as default parameters in various homologs search and alignment tools, different substitution scoring matrices specific to compositionally distinct class of proteins are discussed in detail.
Collapse
Affiliation(s)
- Rakesh Trivedi
- Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Uppal, Hyderabad, Telangana, India.,Graduate School, Manipal Academy of Higher Education, Manipal, Karnataka, India
| | - Hampapathalu Adimurthy Nagarajaram
- Laboratory of Computational Biology, Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, India.,Centre for Modelling, Simulation and Design, University of Hyderabad, Hyderabad, Telangana, India
| |
Collapse
|
28
|
Evidence of absence treated as absence of evidence: The effects of variation in the number and distribution of gaps treated as missing data on the results of standard maximum likelihood analysis. Mol Phylogenet Evol 2020; 154:106966. [PMID: 32971285 DOI: 10.1016/j.ympev.2020.106966] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2020] [Revised: 08/15/2020] [Accepted: 09/15/2020] [Indexed: 11/23/2022]
Abstract
Although numerous studies have demonstrated the theoretical and empirical importance of treating gaps as insertion/deletion (indel) events in phylogenetic analyses, the standard approach to maximum likelihood (ML) analysis employed in the vast majority of empirical studies codes gaps as nucleotides of unknown identity ("missing data"). Therefore, it is imperative to understand the empirical consequences of different numbers and distributions of gaps treated as missing data. We evaluated the effects of variation in the number and distribution of gaps (i.e., no base, coded as IUPAC "." or "-") treated as missing data (i.e., any base, coded as "?" or IUPAC "N") in standard ML analysis. We obtained alignments with variable numbers and arrangements of gaps by aligning seven diverse empirical datasets under different gap opening costs using MAFFT. We selected the optimal substitution model for each alignment using the corrected Akaike Information Criterion in jModelTest2 and searched for optimal trees using GARLI. We also employed a Monte Carlo approach to randomly replace nucleotides with gaps (treated as missing data) in an empirical dataset to understand more precisely the effects of varying their number and distribution. To compare alignments, we developed four new indices and used several existing measures to quantify the number and distribution of gaps in all alignments. Our most important finding is that ML scores correlate negatively with gap opening costs and the amount of missing data. However, this negative relationship is not due to the increase in missing data per se-which increases ML scores-but instead to the effect of gaps on nucleotide homology. These variables also cause significant but largely unpredictable effects on tree topology.
Collapse
|
29
|
Polyanovsky V, Lifanov A, Esipova N, Tumanyan V. The ranging of amino acids substitution matrices of various types in accordance with the alignment accuracy criterion. BMC Bioinformatics 2020; 21:294. [PMID: 32921315 PMCID: PMC7489204 DOI: 10.1186/s12859-020-03616-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2020] [Accepted: 06/18/2020] [Indexed: 11/15/2022] Open
Abstract
Background The alignment of character sequences is important in bioinformatics. The quality of this procedure is determined by the substitution matrix and parameters of the insertion-deletion penalty function. These matrices are derived from sequence alignment and thus reflect the evolutionary process. Currently, in addition to evolutionary matrices, a large number of different background matrices have been obtained. To make an optimal choice of the substitution matrix and the penalty parameters, we conducted a numerical experiment using a representative sample of existing matrices of various types and origins. Results We tested both the classical evolutionary matrix series (PAM, Blosum, VTML, Pfasum); structural alignment based matrices, contact energy matrix, and matrix based on the properties of the genetic code. This study presents results for two test set types: first, we simulated sequences that reflect the divergent evolution; second, we performed tests on Balibase sequences. In both cases, we obtained the dependences of the alignment quality (Accuracy, Confidence) on the evolutionary distance between sequences and the evolutionary distance to which the substitution matrices correspond. Optimization of a combination of matrices and the penalty parameters was carried out for local and global alignment on the values of penalty function parameters. Consequently, we found that the best alignment quality is achieved with matrices corresponding to the largest evolutionary distance. These matrices prove to be universal, i.e. suitable for aligning sequences separated by both large and small evolutionary distances. We analysed the correspondence of the correlation coefficients of matrices to the alignment quality. It was found that matrices showing high quality alignment have an above average correlation value, but the converse is not true. Conclusions This study showed that the best alignment quality is achieved with evolutionary matrices designed for long distances: Gonnet, VTML250, PAM250, MIQS, and Pfasum050. The same property is inherent in matrices not only of evolutionary origin, but also of another background corresponding to a large evolutionary distance. Therefore, matrices based on structural data show alignment quality close enough to its value for evolutionary matrices. This agrees with the idea that the spatial structure is more conservative than the protein sequence.
Collapse
|
30
|
Dijkstra MJJ, van der Ploeg AJ, Feenstra KA, Fokkink WJ, Abeln S, Heringa J. Tailor-made multiple sequence alignments using the PRALINE 2 alignment toolkit. Bioinformatics 2020; 35:5315-5317. [PMID: 31368486 PMCID: PMC6954659 DOI: 10.1093/bioinformatics/btz572] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2019] [Revised: 05/29/2019] [Accepted: 07/29/2019] [Indexed: 12/03/2022] Open
Abstract
Summary PRALINE 2 is a toolkit for custom multiple sequence alignment workflows. It can be used to incorporate sequence annotations, such as secondary structure or (DNA) motifs, into the alignment scoring, as well as to customize many other aspects of a progressive multiple alignment workflow. Availability and implementation PRALINE 2 is implemented in Python and available as open source software on GitHub: https://github.com/ibivu/PRALINE/.
Collapse
Affiliation(s)
- Maurits J J Dijkstra
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| | - Atze J van der Ploeg
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| | - K Anton Feenstra
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| | - Wan J Fokkink
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| | - Sanne Abeln
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| | - Jaap Heringa
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
| |
Collapse
|
31
|
Carpentier M, Chomilier J. Protein multiple alignments: sequence-based versus structure-based programs. Bioinformatics 2020; 35:3970-3980. [PMID: 30942864 DOI: 10.1093/bioinformatics/btz236] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2018] [Revised: 03/05/2019] [Accepted: 04/02/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature yet not alignment programs based on structure or both sequence and structure. In the present article we wish to evaluate the added value provided through considering structures. RESULTS We compared the multiple alignments resulting from 25 programs either based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs whatever the databases. Two programs lead, MAMMOTH and MATRAS, nevertheless the performances of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP are better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, or for residues in regular secondary structures or buried ones. Concerning gap management, sequence-based programs set less gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs. AVAILABILITY AND IMPLEMENTATION All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mathilde Carpentier
- Institut Systématique Evolution Biodiversité (ISYEB), Sorbonne Université, MNHN, CNRS, EPHE, Paris, France
| | - Jacques Chomilier
- Sorbonne Université, MNHN, CNRS, IRD, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie (IMPMC), BiBiP, Paris, France
| |
Collapse
|
32
|
A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm. Soft comput 2020. [DOI: 10.1007/s00500-020-04917-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
33
|
Maldonado E, Antunes A. LMAP_S: Lightweight Multigene Alignment and Phylogeny eStimation. BMC Bioinformatics 2019; 20:739. [PMID: 31888452 PMCID: PMC6937843 DOI: 10.1186/s12859-019-3292-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2019] [Accepted: 11/26/2019] [Indexed: 01/22/2023] Open
Abstract
Background Recent advances in genome sequencing technologies and the cost drop in high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when considering extensive amount of data. Thus, it becomes necessary the development of a tool that removes the potential source of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations. Results We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package, designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing, among which file and directory organization, execution, manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software, proved to be efficient throughout the workflow, including, the (unlimited) handling of more than 20 datasets. Conclusions We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate multiple datasets MSAs and PTs in a high-throughput fashion. LMAP_S integrates more than 25 software providing overall more than 65 algorithm choices distributed in five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. LMAP_S package is released under GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.
Collapse
Affiliation(s)
- Emanuel Maldonado
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal
| | - Agostinho Antunes
- CIIMAR/CIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450-208, Porto, Portugal. .,Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, 4169-007, Porto, Portugal.
| |
Collapse
|
34
|
Ali RH, Bogusz M, Whelan S. Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments. Mol Biol Evol 2019; 36:2340-2351. [PMID: 31209473 PMCID: PMC6933875 DOI: 10.1093/molbev/msz142] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022] Open
Abstract
Multiple sequence alignment (MSA) is ubiquitous in evolution and bioinformatics. MSAs are usually taken to be a known and fixed quantity on which to perform downstream analysis despite extensive evidence that MSA accuracy and uncertainty affect results. These errors are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error. Although popular, this approach has had mixed success and several studies have even suggested that filtering might be detrimental to phylogenetic studies. We present a graph-based clustering method to address MSA uncertainty and error in the software Divvier (available at https://github.com/simonwhelan/Divvier), which uses a probabilistic model to identify clusters of characters that have strong statistical evidence of shared homology. These clusters can then be used to either filter characters from the MSA (partial filtering) or represent each of the clusters in a new column (divvying). We validate Divvier through its performance on real and simulated benchmarks, finding Divvier substantially outperforms existing filtering software by retaining more true pairwise homologies calls and removing more false positive pairwise homologies. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduces the variation in tree estimates caused by MSA uncertainty.
Collapse
Affiliation(s)
- Raja Hashim Ali
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
- Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan
| | - Marcin Bogusz
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
| | - Simon Whelan
- Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
| |
Collapse
|
35
|
Liao Y, Schaeffer RD, Pei J, Grishin NV. A sequence family database built on ECOD structural domains. Bioinformatics 2019; 34:2997-3003. [PMID: 29659718 DOI: 10.1093/bioinformatics/bty214] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2018] [Accepted: 04/03/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation The ECOD database classifies protein domains based on their evolutionary relationships, considering both remote and close homology. The family group in ECOD provides classification of domains that are closely related to each other based on sequence similarity. Due to different perspectives on domain definition, direct application of existing sequence domain databases, such as Pfam, to ECOD struggles with several shortcomings. Results We created multiple sequence alignments and profiles from ECOD domains with the help of structural information in alignment building and boundary delineation. We validated the alignment quality by scoring structure superposition to demonstrate that they are comparable to curated seed alignments in Pfam. Comparison to Pfam and CDD reveals that 27 and 16% of ECOD families are new, but they are also dominated by small families, likely because of the sampling bias from the PDB database. There are 35 and 48% of families whose boundaries are modified comparing to counterparts in Pfam and CDD, respectively. Availability and implementation The new families are now integrated in the ECOD website. The aggregate HMMER profile library and alignment are available for download on ECOD website (http://prodata.swmed.edu/ecod). Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yuxing Liao
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - R Dustin Schaeffer
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.,Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Jimin Pei
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.,Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Nick V Grishin
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.,Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| |
Collapse
|
36
|
Bai N, Tang S, Yu C, Fu H, Wang C, Chen X. GMSA: A Data Sharing System for Multiple Sequence Alignment Across Multiple Users. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190111160101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
In recent years, the rapid growth of biological datasets in Bioinformatics
has made the computation of Multiple Sequence Alignment (MSA) become extremely slow. Using
the GPU to accelerate MSA has shown to be an effective approach. Moreover, there is a trend that
many bioinformatic researchers or institutes setup a shared server for remote users to submit MSA
jobs via provided web-pages or tools.
Objective:
Given the fact that different MSA jobs submitted by users often process similar datasets,
there can be an opportunity for users to share their computation results between each other,
which can avoid the redundant computation and thereby reduce the overall computing time. Furthermore,
in the heterogeneous CPU/GPU platform, many existing applications assign their computation
on GPU devices only, which leads to a waste of the CPU resources. Co-run computation
can increase the utilization of computing resources on both CPUs and GPUs by dispatching workloads
onto them simultaneously.
Methods:
In this paper, we propose an efficient MSA system called GMSA for multi-users on
shared heterogeneous CPU/GPU platforms. To accelerate the computation of jobs from multiple
users, data sharing is considered in GMSA due to the fact that different MSA jobs often have a
percentage of the same data and tasks. Additionally, we also propose a scheduling strategy based
on the similarity in datasets or tasks between MSA jobs. Furthermore, co-run computation model
is adopted to take full use of both CPUs and GPUs.
Results:
We use four protein datasets which were redesigned according to different similarity. We
compare GMSA with ClustalW and CUDA-ClustalW in multiple users scenarios. Experiments results
showed that GMSA can achieve a speedup of up to 32X.
Conclusion:
GMSA is a system designed for accelerating the computation of MSA jobs with
shared input datasets on heterogeneous CPU/GPU platforms. In this system, a strategy was proposed
and implemented to find the common datasets among jobs submitted by multiple users, and
a scheduling algorithm is presented based on it. To utilize the overall resource of both CPU and
GPU, GMSA employs the co-run computation model. Results showed that it can speed up the total
computation of jobs efficiently.
Collapse
Affiliation(s)
- Na Bai
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Shanjiang Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Ce Yu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Hao Fu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Chen Wang
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana- Champaign, Illinois, United States
| | - Xi Chen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| |
Collapse
|
37
|
Abstract
Motivation Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized the fields of multiple sequence alignment construction methods by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which make the profile-based extension of several whole-genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles. Results Here, we present genome profile alignment, an approach that aligns the profiles of WGAs and that is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined to a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profiles alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial datasets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool. Availability and implementation GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- André Hennig
- Center for Bioinformatics (ZBIT), Integrative Transcriptomics, Eberhard Karls University of Tübingen, Tübingen, Germany
| | - Kay Nieselt
- Center for Bioinformatics (ZBIT), Integrative Transcriptomics, Eberhard Karls University of Tübingen, Tübingen, Germany
| |
Collapse
|
38
|
Sievers F, Higgins DG. QuanTest2: benchmarking multiple sequence alignments using secondary structure prediction. Bioinformatics 2019; 36:90-95. [PMID: 31292629 PMCID: PMC9881607 DOI: 10.1093/bioinformatics/btz552] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2019] [Revised: 06/17/2019] [Accepted: 07/09/2019] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs score, if the results are averaged over many alignments but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. RESULTS We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. AVAILABILITY AND IMPLEMENTATION QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar, comprises of reference and non-reference sequence sets and a scoring script. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Fabian Sievers
- Conway Institute, UCD School of Medicine, University College Dublin, Belfield, Dublin 4, Ireland
| | | |
Collapse
|
39
|
Vineetha V, Biji CL, Nair AS. SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning. Sci Rep 2019; 9:6631. [PMID: 31036850 PMCID: PMC6488671 DOI: 10.1038/s41598-019-42966-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2018] [Accepted: 04/12/2019] [Indexed: 11/09/2022] Open
Abstract
Multiple sequence alignment (MSA) is an integral part of molecular biology. But handling massive number of large sequences is still a bottleneck for most of the state-of-the-art software tools. Knowledge driven algorithms utilizing features of input sequences, such as high similarity in case of DNA sequences, can help in improving the efficiency of DNA MSA to assist in phylogenetic tree construction, comparative genomics etc. This article showcases the benefit of utilizing similarity features while performing the alignment. The algorithm uses suffix tree for identifying common substrings and uses a modified Needleman-Wunsch algorithm for pairwise alignments. In order to improve the efficiency of pairwise alignments, a knowledge base is created and a supervised learning with nearest neighbor algorithm is used to guide the alignment. The algorithm provided linear complexity O(m) compared to O(m2). Comparing with state-of-the-art algorithms (e.g., HAlign II), SPARK-MSNA provided 50% improvement in memory utilization in processing human mitochondrial genome (mt. genomes, 100x, 1.1. GB) with a better alignment accuracy in terms of average SP score and comparable execution time. The algorithm is implemented on big data framework Apache Spark in order to improve the scalability. The source code & test data are available at: https://sourceforge.net/projects/spark-msna/ .
Collapse
Affiliation(s)
- V Vineetha
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India.
| | - C L Biji
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| | - Achuthsankar S Nair
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
| |
Collapse
|
40
|
Pervez MT, Shah HA, Babar ME, Naveed N, Shoaib M. SAliBASE: A Database of Simulated Protein Alignments. Evol Bioinform Online 2019; 15:1176934318821080. [PMID: 30733625 PMCID: PMC6343434 DOI: 10.1177/1176934318821080] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2018] [Accepted: 11/26/2018] [Indexed: 01/17/2023] Open
Abstract
Simulated alignments are alternatives to manually constructed multiple sequence alignments for evaluating performance of multiple sequence alignment tools. The importance of simulated sequences is recognized because their true evolutionary history is known, which is very helpful for reconstructing accurate phylogenetic trees and alignments. However, generating simulated alignments require expertise to use bioinformatics tools and consume several hours for reconstructing even a few hundreds of simulated sequences. It becomes a tedious job for an end user who needs a few datasets of variety of simulated sequences. Currently, there is no databank available which may help researchers to download simulated sequences/alignments for their study. Major focus of our study was to develop a database of simulated protein sequences (SAliBASE) based on different varying parameters such as insertion rate, deletion rate, sequence length, number of sequences, and indel size. Each dataset has corresponding alignment as well. This repository is very useful for evaluating multiple alignment methods.
Collapse
Affiliation(s)
- Muhammad Tariq Pervez
- Department of Bioinformatics and Computational Biology, Virtual University of Pakistan, Lahore, Pakistan
| | - Hayat Ali Shah
- Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
| | | | - Nasir Naveed
- Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
| | - Muhammad Shoaib
- Department of Computer Science and Engineering, UET, Lahore, Pakistan
| |
Collapse
|
41
|
PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction. Genes (Basel) 2019; 10:genes10020073. [PMID: 30678245 PMCID: PMC6410268 DOI: 10.3390/genes10020073] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2018] [Revised: 01/04/2019] [Accepted: 01/14/2019] [Indexed: 11/21/2022] Open
Abstract
Phylogenetic tree is essential to understand evolution and it is usually constructed through multiple sequence alignment, which suffers from heavy computational burdens and requires sophisticated parameter tuning. Recently, alignment free methods based on k-mer profiles or common substrings provide alternative ways to construct phylogenetic trees. However, most of these methods ignore the global similarities between sequences or some specific valuable features, e.g., frequent patterns overall datasets. To make further improvement, we propose an alignment free algorithm based on sequential pattern mining, where each sequence is converted into a binary representation of sequential patterns among sequences. The phylogenetic tree is further constructed via clustering distance matrix which is calculated from pattern vectors. To increase accuracy for highly divergent sequences, we consider pattern weight and filtering redundancy sub-patterns. Both simulated and real data demonstrates our method outperform other alignment free methods, especially for large sequence set with low similarity.
Collapse
|
42
|
Abstract
Background Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This new framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and evaluates the alignment quality according to the achievement of the biological goal, rather than the comparison on sequence level only, which averts the biases introduced by alignment scores or manual alignment templates. Compared with former studies, we calculate the cluster validity score based on sequence distances instead of clustering results. This strategy could avoid the influence brought by different clustering methods thus make results more dependable. Results Results showed that PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses on the 80 re-sampled benchmark datasets constructed by randomly choosing 90% of each dataset 10 times showed similar results. Conclusions These results validated that the drawbacks of MSA methods revealed in nucleotide level also existed in protein sequence alignment analyses and affect the accuracy of results. Electronic supplementary material The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Yingying Wang
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
| | - Hongyan Wu
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.
| | - Yunpeng Cai
- Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China.
| |
Collapse
|
43
|
Dijkstra M, Bawono P, Abeln S, Feenstra KA, Fokkink W, Heringa J. Motif-Aware PRALINE: Improving the alignment of motif regions. PLoS Comput Biol 2018; 14:e1006547. [PMID: 30383764 PMCID: PMC6233922 DOI: 10.1371/journal.pcbi.1006547] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2018] [Revised: 11/13/2018] [Accepted: 10/05/2018] [Indexed: 11/21/2022] Open
Abstract
Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, due to the fact that the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems. The most important functional parts of proteins are often small—but very specific—sequence motifs. Moreover, these motifs tend to be strongly conserved during evolution due to their functional role. Nevertheless, when trying to align protein sequences of the same family, it is often very difficult to align such motifs using standard multiple sequence alignment methods. Aligning functional residues correctly is essential to detect motif conservation, which can be used to filter out spuriously occurring motifs. Additionally, many downstream analyses, such as phylogenetics, are strongly reliant on alignment quality. We have developed a sequence alignment program named Motif-Aware PRALINE (MA-PRALINE) that incorporates information about motifs explicitly. Motifs are provided to MA-PRALINE in the PROSITE pattern syntax; it then scans the input sequences for instances of the pattern and provides a score bonus to matching sequence positions. Our method provides a reproducible alternative to editing alignments by hand in order to account for motif conservation, which is a tedious and error-prone process. We will show that MA-PRALINE allows the alignment of motif-rich regions to be fine-tuned while not degrading the rest of the alignment. MA-PRALINE is available on GitHub as open source software; this allows it to be easily tailored to similar problems. We apply MA-PRALINE on the HIV-1 envelope glycoprotein (gp120) to get an improved alignment of the N-terminal glycosylation motifs. The presence of these motifs is essential for the virus in evading the immune response of the host.
Collapse
Affiliation(s)
- Maurits Dijkstra
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- * E-mail:
| | - Punto Bawono
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Sanne Abeln
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - K. Anton Feenstra
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Wan Fokkink
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Jaap Heringa
- Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| |
Collapse
|
44
|
Chaabane L. A hybrid solver for protein multiple sequence alignment problem. J Bioinform Comput Biol 2018; 16:1850015. [DOI: 10.1142/s0219720018500154] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In this work, a novel hybrid model called PSOSA for solving multiple sequence alignment (MSA) problem is proposed. The developed approach is a combination between particle swarm optimization (PSO) algorithm and simulated annealing (SA) technique. In our PSOSA approach, PSO is exploited in global search, but it is easily trapped into local optimum and may lead to premature convergence. SA is incorporated as local improvement approach to overcome local optimum problem and intensify the search in local regions to improve solution quality. Numerical results on BAliBASE benchmark have shown the effectiveness of the proposed method and its ability to achieve good quality solutions when compared with those given by other existing methods.
Collapse
Affiliation(s)
- Lamiche Chaabane
- Department of Computer Science, Mohamed Boudiaf University, BP. 166 M’sila 28000, Algeria
| |
Collapse
|
45
|
Orlando G, Raimondi D, Khan T, Lenaerts T, Vranken WF. SVM-dependent pairwise HMM: an application to protein pairwise alignments. Bioinformatics 2018; 33:3902-3908. [PMID: 28666322 DOI: 10.1093/bioinformatics/btx391] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2016] [Accepted: 06/12/2017] [Indexed: 12/27/2022] Open
Abstract
Motivation Methods able to provide reliable protein alignments are crucial for many bioinformatics applications. In the last years many different algorithms have been developed and various kinds of information, from sequence conservation to secondary structure, have been used to improve the alignment performances. This is especially relevant for proteins with highly divergent sequences. However, recent works suggest that different features may have different importance in diverse protein classes and it would be an advantage to have more customizable approaches, capable to deal with different alignment definitions. Results Here we present Rigapollo, a highly flexible pairwise alignment method based on a pairwise HMM-SVM that can use any type of information to build alignments. Rigapollo lets the user decide the optimal features to align their protein class of interest. It outperforms current state of the art methods on two well-known benchmark datasets when aligning highly divergent sequences. Availability and implementation A Python implementation of the algorithm is available at http://ibsquare.be/rigapollo. Contact wim.vranken@vub.be. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Gabriele Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan.,Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2.,Structural Biology Research Center, VIB.,Structural Machine Learning Group, Université Libre de Bruxelles
| | - Daniele Raimondi
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan.,Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2.,Structural Biology Research Center, VIB.,Structural Machine Learning Group, Université Libre de Bruxelles
| | - Taushif Khan
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan.,Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan.,Structural Machine Learning Group, Université Libre de Bruxelles.,Artificial Intelligence Lab, Vrije Universiteit Brussel, 1050 Brussels, Belgium
| | - Wim F Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan.,Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2.,Structural Biology Research Center, VIB
| |
Collapse
|
46
|
Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Multiobjective characteristic-based framework for very-large multiple sequence alignment. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.06.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
47
|
Lladós J, Cores F, Guirado F. Scalable Consistency in T-Coffee Through Apache Spark and Cassandra Database. J Comput Biol 2018; 25:894-906. [PMID: 30004242 DOI: 10.1089/cmb.2018.0084] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing, also known as high-throughput sequencing, has increased the volume of genetic data processed by sequencers. In the bioinformatic scientific area, highly rated multiple sequence alignment tools, such as MAFFT, ProbCons, and T-Coffee (TC), use the probabilistic consistency as a prior step to the progressive alignment stage to improve the final accuracy. However, such methods are severely limited by the memory required to store the consistency information. Big data processing and persistence techniques are used to manage and store the huge amount of information that is generated. Although these techniques have significant advantages, few biological applications have adopted them. In this article, a novel approach named big data tree-based consistency objective function for alignment evaluation (BDT-Coffee) is presented. BDT-Coffee is based on the integration of consistency information through Cassandra database in TC, previously generated by the MapReduce processing paradigm, to enable large data sets to be processed with the aim of improving the performance and scalability of the original algorithm.
Collapse
Affiliation(s)
- Jordi Lladós
- INSPIRES Research Center, Universitat de Lleida , Lleida, Spain
| | - Fernando Cores
- INSPIRES Research Center, Universitat de Lleida , Lleida, Spain
| | | |
Collapse
|
48
|
Zambrano-Vega C, Nebro AJ, García-Nieto J, Aldana-Montes JF. M2Align: parallel multiple sequence alignment with a multi-objective metaheuristic. Bioinformatics 2018; 33:3011-3017. [PMID: 28541404 DOI: 10.1093/bioinformatics/btx338] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2016] [Accepted: 05/20/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation Multiple sequence alignment (MSA) is an NP-complete optimization problem found in computational biology, where the time complexity of finding an optimal alignment raises exponentially along with the number of sequences and their lengths. Additionally, to assess the quality of a MSA, a number of objectives can be taken into account, such as maximizing the sum-of-pairs, maximizing the totally conserved columns, minimizing the number of gaps, or maximizing structural information based scores such as STRIKE. An approach to deal with MSA problems is to use multi-objective metaheuristics, which are non-exact stochastic optimization methods that can produce high quality solutions to complex problems having two or more objectives to be optimized at the same time. Our motivation is to provide a multi-objective metaheuristic for MSA that can run in parallel taking advantage of multi-core-based computers. Results The software tool we propose, called M2Align (Multi-objective Multiple Sequence Alignment), is a parallel and more efficient version of the three-objective optimizer for sequence alignments MO-SAStrE, able of reducing the algorithm computing time by exploiting the computing capabilities of common multi-core CPU clusters. Our performance evaluation over datasets of the benchmark BAliBASE (v3.0) shows that significant time reductions can be achieved by using up to 20 cores. Even in sequential executions, M2Align is faster than MO-SAStrE, thanks to the encoding method used for the alignments. Availability and implementation M2Align is an open source project hosted in GitHub, where the source code and sample datasets can be freely obtained: https://github.com/KhaosResearch/M2Align. Contact antonio@lcc.uma.es. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Cristian Zambrano-Vega
- Facultad de Ciencias de la Ingeniería, Universidad Técnica Estatal de Quevedo, Quevedo, Ecuador
| | - Antonio J Nebro
- Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
| | - José García-Nieto
- Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
| | - José F Aldana-Montes
- Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
| |
Collapse
|
49
|
Amorim AR, Neves LA, Valêncio CR, Roberto GF, Zafalon GFD. An approach for COFFEE objective function to global DNA multiple sequence alignment. Comput Biol Chem 2018; 75:39-44. [PMID: 29738913 DOI: 10.1016/j.compbiolchem.2018.04.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2016] [Revised: 03/29/2018] [Accepted: 04/20/2018] [Indexed: 10/17/2022]
Abstract
Multiple sequence alignment (MSA) is one of the most important tasks in bioinformatics and it can be used to prediction of structures or functions of unknown proteins and to phylogenetic tree reconstruction. There are many heuristics to perform multiple sequence alignment, as Progressive Alignment, Ant Colony, Genetic Algorithms, among others. Along the years, some tools were proposed to perform MSA and MSA-GA is one of them. The MSA-GA is a tool based on Genetic Algorithm to perform multiple sequence alignment and its results are generally better than other well-known tools in bioinformatics, as Clustal W. The COFFEE objective function was implemented in the MSA-GA in order to allow it to produce better alignments to less similar sequence sets of proteins. Nonetheless, the COFFEE objective function is not suited do perform multiple sequence alignment of nucleotides. Thus, we have modified the COFFEE objective function, previously implemented in the MSA-GA, to allow it to obtain better results also to sequences of nucleotides. Our results have shown that our approach has achieved better results in all cases when compared with standard COFFEE and most of cases when compared with WSP for all test cases from BAliBase and BRAliBase. Moreover, our results are more reliable because their standard deviations have less variation.
Collapse
Affiliation(s)
- Anderson Rici Amorim
- Department of Computer Science and Statistics, São Paulo State University, Rua Cristovão Colombo 2265, São José do Rio Preto, São Paulo, Brazil
| | - Leandro Alves Neves
- Department of Computer Science and Statistics, São Paulo State University, Rua Cristovão Colombo 2265, São José do Rio Preto, São Paulo, Brazil
| | - Carlos Roberto Valêncio
- Department of Computer Science and Statistics, São Paulo State University, Rua Cristovão Colombo 2265, São José do Rio Preto, São Paulo, Brazil
| | - Guilherme Freire Roberto
- Department of Computer Science and Statistics, São Paulo State University, Rua Cristovão Colombo 2265, São José do Rio Preto, São Paulo, Brazil
| | - Geraldo Francisco Donegá Zafalon
- Department of Computer Science and Statistics, São Paulo State University, Rua Cristovão Colombo 2265, São José do Rio Preto, São Paulo, Brazil.
| |
Collapse
|
50
|
Rubio-Largo Á, Castelli M, Vanneschi L, Vega-Rodríguez MA. A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment. J Comput Biol 2018; 25:1009-1022. [PMID: 29671616 DOI: 10.1089/cmb.2018.0031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The alignment among three or more nucleotides/amino acids sequences at the same time is known as multiple sequence alignment (MSA), a nondeterministic polynomial time (NP)-hard optimization problem. The time complexity of finding an optimal alignment raises exponentially when the number of sequences to align increases. In this work, we deal with a multiobjective version of the MSA problem wherein the goal is to simultaneously optimize the accuracy and conservation of the alignment. A parallel version of the hybrid multiobjective memetic metaheuristics for MSA is proposed. To evaluate the parallel performance of our proposal, we have selected a pull of data sets with different number of sequences (up to 1000 sequences) and study its parallel performance against other well-known parallel metaheuristics published in the literature, such as MSAProbs, tree-based consistency objective function for alignment evaluation (T-Coffee), Clustal [Formula: see text], and multiple alignment using fast Fourier transform (MAFFT). The comparative study reveals that our parallel aligner obtains better results than MSAProbs, T-Coffee, Clustal [Formula: see text], and MAFFT. In addition, the parallel version is around 25 times faster than the sequential version with 32 cores, obtaining an efficiency around 80%.
Collapse
Affiliation(s)
| | - Mauro Castelli
- 1 NOVA IMS, Universidade Nova de Lisboa , Lisbon, Portugal
| | | | - Miguel A Vega-Rodríguez
- 2 Department of Computer and Communications Technologies, University of Extremadura , Caceres, Spain
| |
Collapse
|