Reference Citation Analysis

For: Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2006;61:127-36. [PMID: 16044462 DOI: 10.1002/prot.20527] [Citation(s) in RCA: 224] [Impact Index Per Article: 12.4] [Indexed: 11/07/2022]

Cited by Other Article(s)

Ashrafzadeh S, Golding GB, Ilie S, Ilie L. Scoring alignments by embedding vector similarity. Brief Bioinform 2024;25:bbae178. [PMID: 38695119 PMCID: PMC11063651 DOI: 10.1093/bib/bbae178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Revised: 03/20/2024] [Accepted: 03/31/2024] [Indexed: 05/05/2024] Open

Khairul WM, Hashim F, Rahamathullah R, Mohammed M, Aisyah Razali S, Ahmad Tajudin Tuan Johari S, Azizan S. Exploring ethynyl-based chalcones as green semiconductor materials for optical limiting interests. Spectrochimica Acta. Part A, Molecular and Biomolecular Spectroscopy 2024;308:123776. [PMID: 38134650 DOI: 10.1016/j.saa.2023.123776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 12/09/2023] [Accepted: 12/14/2023] [Indexed: 12/24/2023]

Iovino BG, Ye Y. Protein embedding based alignment. BMC Bioinformatics 2024;25:85. [PMID: 38413857 PMCID: PMC10900708 DOI: 10.1186/s12859-024-05699-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 02/12/2024] [Indexed: 02/29/2024] Open

Abstract

PURPOSE

Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices since their creation in the 1970s to generate alignments; however, these matrices do not work well for scoring alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm, but the matching score of amino acids is based on the similarity of their embeddings from a protein language model.
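
The idea in the paragraph above can be sketched as a Smith-Waterman recurrence in which the substitution-matrix lookup is replaced by a similarity between per-residue embedding vectors. This is a minimal illustration, not the PEbA implementation: the use of cosine similarity, the `scale` factor, the linear `gap` penalty, and the function names are all assumptions made for the sketch, and the embeddings are expected to be supplied by the caller (for example, from a protein language model).

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def embedding_local_align(emb_a, emb_b, gap=-1.0, scale=10.0):
    """Local alignment of two sequences given one embedding vector per residue.

    Returns (best_score, pairs), where pairs lists aligned residue indices (i, j).
    The match score is scale * cosine(embedding_i, embedding_j) (an assumption).
    """
    n, m = len(emb_a), len(emb_b)
    sim = [[scale * cosine(a, b) for b in emb_b] for a in emb_a]
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, end = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + sim[i - 1][j - 1],  # align i with j
                          H[i - 1][j] + gap,                    # gap in sequence B
                          H[i][j - 1] + gap)                    # gap in sequence A
            if H[i][j] > best:
                best, end = H[i][j], (i, j)
    # Trace back from the best-scoring cell until the score drops to zero.
    pairs = []
    i, j = end
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]
```

Because the score depends on the embedding of each residue in its sequence context rather than on the residue type alone, two identical amino acids can receive different scores at different positions, which a fixed substitution matrix cannot express.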

METHODS

We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances.
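
Extracting a pairwise reference from a multiple sequence alignment, as described above, amounts to taking two rows and dropping the columns in which both rows contain a gap. The sketch below shows that projection together with a percent-identity calculation. The function names and the `-` gap character are assumptions, and the identity definition used here (identical pairs divided by columns where both rows have a residue) is only one of several conventions in use.

```python
def project_pair(row_a, row_b, gap="-"):
    """Project two MSA rows to a pairwise alignment by dropping gap-only columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

def percent_identity(aln_a, aln_b, gap="-"):
    """Identical residue pairs as a percentage of columns with two residues."""
    both = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not both:
        return 0.0
    return 100.0 * sum(a == b for a, b in both) / len(both)

# Toy three-row alignment (made-up sequences, for illustration only).
msa = ["MK-LV-A", "MR-L-GA", "MKQLVGA"]
pair = project_pair(msa[0], msa[1])
identity = percent_identity(*pair)
```

Binning the extracted pairs by `identity` is one way to report results separately for different similarity ranges, such as the pairs below 10% identity mentioned in the results.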

RESULTS

PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, achieving different levels of improvement in alignment quality for pairs of sequences with different levels of similarity (performing over four times as well for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.

CONCLUSION

Our results suggest that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.

Liu Y, Yuan H, Zhang Q, Wang Z, Xiong S, Wen N, Zhang Y. Multiple sequence alignment based on deep reinforcement learning with self-attention and positional encoding. Bioinformatics 2023;39:btad636. [PMID: 37856335 PMCID: PMC10628385 DOI: 10.1093/bioinformatics/btad636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 07/24/2023] [Accepted: 10/17/2023] [Indexed: 10/21/2023] Open

Muntoni AP, Pagnani A. DCAlign v1.0: aligning biological sequences using co-evolution models and informed priors. Bioinformatics 2023;39:btad537. [PMID: 37647658 PMCID: PMC10491954 DOI: 10.1093/bioinformatics/btad537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2022] [Revised: 06/14/2023] [Accepted: 08/29/2023] [Indexed: 09/01/2023] Open

João M, Sena AC, Rebello VEF. On closing the inopportune gap with consistency transformation and iterative refinement. PLoS One 2023;18:e0287483. [PMID: 37440507 PMCID: PMC10343097 DOI: 10.1371/journal.pone.0287483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 06/06/2023] [Indexed: 07/15/2023] Open

Zong F, Long C, Hu W, Chen S, Dai W, Xiao ZX, Cao Y. Abalign: a comprehensive multiple sequence alignment platform for B-cell receptor immune repertoires. Nucleic Acids Res 2023:7173809. [PMID: 37207341 DOI: 10.1093/nar/gkad400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2023] [Revised: 04/23/2023] [Accepted: 05/08/2023] [Indexed: 05/21/2023] Open

Affiliation(s)

Fanjie Zong Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China
Chenyu Long Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China
Wanxin Hu Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China
Shuang Chen Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China
Wentao Dai NHC Key Laboratory of Reproduction Regulation & Shanghai-MOST Key Laboratory of Health and Disease Genomics, Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, China
Zhi-Xiong Xiao Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
Yang Cao Center of Growth, Metabolism and Aging, Key Laboratory of Bio-Resource and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China Animal Disease Prevention and Food Safety Key Laboratory of Sichuan Province, Microbiology and Metabolic Engineering Key Laboratory of Sichuan Province, Chengdu, China

Nayeem MA, Samudro NA, Rahman MS, Rahman MS. MAMMLE: A Framework for Phylogeny Estimation Based on Multiobjective Application-aware Multiple Sequence Alignment and Maximum Likelihood Ensemble. J Comput Biol 2023;30:245-249. [PMID: 36706434 DOI: 10.1089/cmb.2021.0533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023] Open

Benítez-Hidalgo A, Aldana-Montes JF, Navas-Delgado I, Roldán-García MDM. SALON ontology for the formal description of sequence alignments. BMC Bioinformatics 2023;24:69. [PMID: 36849882 PMCID: PMC9972671 DOI: 10.1186/s12859-023-05190-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2021] [Accepted: 02/15/2023] [Indexed: 03/01/2023] Open

Kuang M, Zhang Y, Lam TW, Ting HF. MLProbs: A Data-Centric Pipeline for Better Multiple Sequence Alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023;20:524-533. [PMID: 35120007 DOI: 10.1109/tcbb.2022.3148382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]

Becker F, Stanke M. learnMSA: learning and aligning large protein families. Gigascience 2022;11:6833031. [PMID: 36399060 PMCID: PMC9673500 DOI: 10.1093/gigascience/giac104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Revised: 09/01/2022] [Accepted: 10/06/2022] [Indexed: 11/19/2022] Open

Seo TK, Redelings BD, Thorne JL. Correlations between alignment gaps and nucleotide substitution or amino acid replacement. Proc Natl Acad Sci U S A 2022;119:e2204435119. [PMID: 35972964 PMCID: PMC9407537 DOI: 10.1073/pnas.2204435119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2022] [Accepted: 07/11/2022] [Indexed: 11/18/2022] Open

Nayeem MA, Bayzid MS, Rahman AH, Shahriyar R, Rahman MS. Multiobjective Formulation of Multiple Sequence Alignment for Phylogeny Inference. IEEE Transactions on Cybernetics 2022;52:2775-2786. [PMID: 33044939 DOI: 10.1109/tcyb.2020.3020308] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]

Chao J, Tang F, Xu L. Developments in Algorithms for Sequence Alignment: A Review. Biomolecules 2022;12:biom12040546. [PMID: 35454135 PMCID: PMC9024764 DOI: 10.3390/biom12040546] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/11/2022] [Revised: 03/29/2022] [Accepted: 03/31/2022] [Indexed: 01/27/2023] Open

Shrestha B, Adhikari B. Scoring protein sequence alignments using deep learning. Bioinformatics 2022;38:2988-2995. [PMID: 35385080 DOI: 10.1093/bioinformatics/btac210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Revised: 04/01/2022] [Accepted: 04/05/2022] [Indexed: 11/12/2022] Open

Kostenko DO, Korotkov EV. Application of the MAHDS Method for Multiple Alignment of Highly Diverged Amino Acid Sequences. Int J Mol Sci 2022;23:ijms23073764. [PMID: 35409125 PMCID: PMC8998981 DOI: 10.3390/ijms23073764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 03/23/2022] [Accepted: 03/23/2022] [Indexed: 12/10/2022] Open

Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022;23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022] Open

Nayeem MA, Bayzid MS, Samudro NA, Rahman MS, Rahman MS. PASTA with many application-aware optimization criteria for alignment based phylogeny inference. Comput Biol Chem 2022;98:107661. [DOI: 10.1016/j.compbiolchem.2022.107661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2021] [Revised: 02/22/2022] [Accepted: 02/27/2022] [Indexed: 11/25/2022]

Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022;30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]

TwinCons: Conservation score for uncovering deep sequence similarity and divergence. PLoS Comput Biol 2021;17:e1009541. [PMID: 34714829 PMCID: PMC8580257 DOI: 10.1371/journal.pcbi.1009541] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/16/2021] [Revised: 11/10/2021] [Accepted: 10/06/2021] [Indexed: 11/19/2022] Open

Pajkos M, Dosztányi Z. Functions of intrinsically disordered proteins through evolutionary lenses. Progress in Molecular Biology and Translational Science 2021;183:45-74. [PMID: 34656334 DOI: 10.1016/bs.pmbts.2021.06.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/29/2022]

Lladós J, Cores F, Guirado F, Lérida JL. Accurate consistency-based MSA reducing the memory footprint. Computer Methods and Programs in Biomedicine 2021;208:106237. [PMID: 34198017 DOI: 10.1016/j.cmpb.2021.106237] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/13/2020] [Accepted: 06/08/2021] [Indexed: 06/13/2023]

Zhang C, Zhao Y, Braun EL, Mirarab S. TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13696] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/22/2022]

Abadi S, Avram O, Rosset S, Pupko T, Mayrose I. ModelTeller: Model Selection for Optimal Phylogenetic Reconstruction Using Machine Learning. Mol Biol Evol 2021;37:3338-3352. [PMID: 32585030 DOI: 10.1093/molbev/msaa154] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 02/03/2023] Open

Abstract

Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task consume substantial computational resources and long processing times, are not always feasible, and sometimes depend on preliminary assumptions that do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model and thus the prediction according to features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.

Azouri D, Abadi S, Mansour Y, Mayrose I, Pupko T. Harnessing machine learning to guide phylogenetic-tree search algorithms. Nat Commun 2021;12:1983. [PMID: 33790270 PMCID: PMC8012635 DOI: 10.1038/s41467-021-22073-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/23/2020] [Accepted: 02/26/2021] [Indexed: 02/01/2023] Open

Chockalingam SP, Pannu J, Hooshmand S, Thankachan SV, Aluru S. An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction. BMC Bioinformatics 2020;21:404. [PMID: 33203364 PMCID: PMC7672814 DOI: 10.1186/s12859-020-03738-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2020] [Accepted: 09/04/2020] [Indexed: 11/10/2022] Open

Trivedi R, Nagarajaram HA. Substitution scoring matrices for proteins - An overview. Protein Sci 2020;29:2150-2163. [PMID: 32954566 DOI: 10.1002/pro.3954] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/17/2020] [Revised: 09/17/2020] [Accepted: 09/18/2020] [Indexed: 01/17/2023]

Evidence of absence treated as absence of evidence: The effects of variation in the number and distribution of gaps treated as missing data on the results of standard maximum likelihood analysis. Mol Phylogenet Evol 2020;154:106966. [PMID: 32971285 DOI: 10.1016/j.ympev.2020.106966] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/13/2020] [Revised: 08/15/2020] [Accepted: 09/15/2020] [Indexed: 11/23/2022]

Polyanovsky V, Lifanov A, Esipova N, Tumanyan V. The ranging of amino acids substitution matrices of various types in accordance with the alignment accuracy criterion. BMC Bioinformatics 2020;21:294. [PMID: 32921315 PMCID: PMC7489204 DOI: 10.1186/s12859-020-03616-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/11/2020] [Accepted: 06/18/2020] [Indexed: 11/15/2022] Open

Abstract

Background

The alignment of character sequences is important in bioinformatics. The quality of this procedure is determined by the substitution matrix and parameters of the insertion-deletion penalty function. These matrices are derived from sequence alignment and thus reflect the evolutionary process. Currently, in addition to evolutionary matrices, a large number of different background matrices have been obtained. To make an optimal choice of the substitution matrix and the penalty parameters, we conducted a numerical experiment using a representative sample of existing matrices of various types and origins.

Results

We tested the classical evolutionary matrix series (PAM, Blosum, VTML, Pfasum), matrices based on structural alignment, a contact energy matrix, and a matrix based on the properties of the genetic code. This study presents results for two types of test set: first, we simulated sequences that reflect divergent evolution; second, we performed tests on Balibase sequences. In both cases, we obtained the dependence of the alignment quality (Accuracy, Confidence) on the evolutionary distance between sequences and on the evolutionary distance to which the substitution matrices correspond. The combination of matrix and penalty function parameters was optimized for both local and global alignment.

Consequently, we found that the best alignment quality is achieved with matrices corresponding to the largest evolutionary distance. These matrices prove to be universal, i.e. suitable for aligning sequences separated by both large and small evolutionary distances. We analysed the correspondence of the correlation coefficients of matrices to the alignment quality. It was found that matrices showing high quality alignment have an above average correlation value, but the converse is not true.

Conclusions

This study showed that the best alignment quality is achieved with evolutionary matrices designed for long distances: Gonnet, VTML250, PAM250, MIQS, and Pfasum050. The same property holds not only for matrices of evolutionary origin but also for matrices of other backgrounds that correspond to a large evolutionary distance. Accordingly, matrices based on structural data show alignment quality close to that of evolutionary matrices. This agrees with the idea that the spatial structure is more conservative than the protein sequence.
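
The abstract above reports alignment quality as Accuracy and Confidence without defining them. One common way to compare a computed pairwise alignment with a reference is sketched below, under the assumption that Accuracy is the fraction of reference residue pairs that the test alignment recovers and Confidence is the fraction of test residue pairs that also appear in the reference. These definitions and the function names are assumptions for illustration, not taken from the abstract.

```python
def aligned_pairs(aln_a, aln_b, gap="-"):
    """Set of (i, j) residue-index pairs placed in the same alignment column."""
    pairs = set()
    i = j = 0
    for a, b in zip(aln_a, aln_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs

def accuracy_confidence(test, reference):
    """Return (accuracy, confidence) of a test alignment against a reference.

    Both arguments are (row_a, row_b) tuples of gapped strings.
    """
    t = aligned_pairs(*test)
    r = aligned_pairs(*reference)
    correct = len(t & r)
    accuracy = correct / len(r) if r else 0.0
    confidence = correct / len(t) if t else 0.0
    return accuracy, confidence

# Toy example: the test alignment misplaces one residue of each sequence.
reference = ("ACGT", "ACGT")
test = ("AC-GT", "ACG-T")
acc, conf = accuracy_confidence(test, reference)
```

Under these definitions, a conservative aligner that leaves uncertain residues unaligned can score high on Confidence while scoring lower on Accuracy, which is why the two are reported together.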

Dijkstra MJJ, van der Ploeg AJ, Feenstra KA, Fokkink WJ, Abeln S, Heringa J. Tailor-made multiple sequence alignments using the PRALINE 2 alignment toolkit. Bioinformatics 2020;35:5315-5317. [PMID: 31368486 PMCID: PMC6954659 DOI: 10.1093/bioinformatics/btz572] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/29/2019] [Revised: 05/29/2019] [Accepted: 07/29/2019] [Indexed: 12/03/2022] Open

Carpentier M, Chomilier J. Protein multiple alignments: sequence-based versus structure-based programs. Bioinformatics 2020;35:3970-3980. [PMID: 30942864 DOI: 10.1093/bioinformatics/btz236] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 10/04/2018] [Revised: 03/05/2019] [Accepted: 04/02/2019] [Indexed: 11/14/2022] Open

A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm. Soft Comput 2020. [DOI: 10.1007/s00500-020-04917-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/04/2023]

Maldonado E, Antunes A. LMAP_S: Lightweight Multigene Alignment and Phylogeny eStimation. BMC Bioinformatics 2019;20:739. [PMID: 31888452 PMCID: PMC6937843 DOI: 10.1186/s12859-019-3292-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/04/2019] [Accepted: 11/26/2019] [Indexed: 01/22/2023] Open

Abstract

Background

Recent advances in genome sequencing technologies and the falling cost of high-throughput sequencing continue to give rise to a deluge of data available for downstream analyses. Among others, evolutionary biologists often make use of genomic data to uncover phenotypic diversity and adaptive evolution in protein-coding genes. Therefore, multiple sequence alignments (MSA) and phylogenetic trees (PT) need to be estimated with optimal results. However, the preparation of an initial dataset of multiple sequence file(s) (MSF) and the steps involved can be challenging when handling extensive amounts of data. Thus, it becomes necessary to develop a tool that removes potential sources of error and automates the time-consuming steps of a typical workflow with high-throughput and optimal MSA and PT estimations.

Results

We introduce LMAP_S (Lightweight Multigene Alignment and Phylogeny eStimation), a user-friendly command-line and interactive package designed to handle an improved alignment and phylogeny estimation workflow: MSF preparation, MSA estimation, outlier detection, refinement, consensus, phylogeny estimation, comparison and editing. Within this workflow, file and directory organization, execution, and manipulation of information are automated, with minimal manual user intervention. LMAP_S was developed for the workstation multi-core environment and provides a unique advantage for processing multiple datasets. Our software proved to be efficient throughout the workflow, including the (unlimited) handling of more than 20 datasets.

Conclusions

We have developed a simple and versatile LMAP_S package enabling researchers to effectively estimate MSAs and PTs for multiple datasets in a high-throughput fashion. LMAP_S integrates more than 25 software tools, providing more than 65 algorithm choices overall, distributed across five stages. At minimum, one FASTA file is required within a single input directory. To our knowledge, no other software combines MSA and phylogeny estimation with as many alternatives and provides means to find optimal MSAs and phylogenies. Moreover, we used a case study comparing methodologies that highlighted the usefulness of our software. LMAP_S has been developed as an open-source package, allowing its integration into more complex open-source bioinformatics pipelines. The LMAP_S package is released under the GPLv3 license and is freely available at https://lmap-s.sourceforge.io/.

Ali RH, Bogusz M, Whelan S. Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments. Mol Biol Evol 2019;36:2340-2351. [PMID: 31209473 PMCID: PMC6933875 DOI: 10.1093/molbev/msz142] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Indexed: 12/16/2022] Open

Liao Y, Schaeffer RD, Pei J, Grishin NV. A sequence family database built on ECOD structural domains. Bioinformatics 2019;34:2997-3003. [PMID: 29659718 DOI: 10.1093/bioinformatics/bty214] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/20/2018] [Accepted: 04/03/2018] [Indexed: 11/12/2022] Open

Bai N, Tang S, Yu C, Fu H, Wang C, Chen X. GMSA: A Data Sharing System for Multiple Sequence Alignment Across Multiple Users. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190111160101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]

Abstract

Background

In recent years, the rapid growth of biological datasets in bioinformatics has made the computation of Multiple Sequence Alignment (MSA) extremely slow. Using the GPU to accelerate MSA has been shown to be an effective approach. Moreover, there is a trend for bioinformatics researchers and institutes to set up a shared server to which remote users submit MSA jobs via provided web pages or tools.

Objective

Because different MSA jobs submitted by users often process similar datasets, users have an opportunity to share their computation results with each other, which avoids redundant computation and thereby reduces the overall computing time. Furthermore, on heterogeneous CPU/GPU platforms, many existing applications assign their computation to GPU devices only, which wastes CPU resources. Co-run computation can increase the utilization of computing resources on both CPUs and GPUs by dispatching workloads onto them simultaneously.

Methods

In this paper, we propose an efficient MSA system called GMSA for multiple users on shared heterogeneous CPU/GPU platforms. To accelerate the computation of jobs from multiple users, GMSA uses data sharing, because different MSA jobs often have a percentage of the same data and tasks. Additionally, we propose a scheduling strategy based on the similarity in datasets or tasks between MSA jobs. Furthermore, a co-run computation model is adopted to make full use of both CPUs and GPUs.

Results

We use four protein datasets that were redesigned according to different levels of similarity. We compare GMSA with ClustalW and CUDA-ClustalW in multi-user scenarios. Experimental results showed that GMSA can achieve a speedup of up to 32X.

Conclusion

GMSA is a system designed for accelerating the computation of MSA jobs with shared input datasets on heterogeneous CPU/GPU platforms. In this system, a strategy was proposed and implemented to find the common datasets among jobs submitted by multiple users, and a scheduling algorithm based on it is presented. To utilize the overall resources of both CPU and GPU, GMSA employs the co-run computation model. Results showed that it can efficiently speed up the total computation of jobs.
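
The data-sharing idea described above can be illustrated with a small result cache shared between jobs: pairwise computations are keyed by the content of the two sequences, so a later job that contains the same pair reuses the stored result instead of recomputing it. This is a hypothetical sketch and not the GMSA implementation; `PairCache`, `run_job`, and the toy `mismatch_fraction` distance are invented for illustration, and no GPU scheduling is modelled.

```python
import hashlib

class PairCache:
    """Shared store of pairwise results, keyed by the content of both sequences."""

    def __init__(self, compute):
        self._compute = compute   # symmetric function of two sequences
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(seq_a, seq_b):
        # Sort so that (a, b) and (b, a) map to the same cache entry.
        lo, hi = sorted((seq_a, seq_b))
        return hashlib.sha256(f"{lo}|{hi}".encode()).hexdigest()

    def get(self, seq_a, seq_b):
        key = self._key(seq_a, seq_b)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute(seq_a, seq_b)
        return self._store[key]

def mismatch_fraction(seq_a, seq_b):
    """Toy stand-in for an expensive pairwise distance computation."""
    n = min(len(seq_a), len(seq_b))
    return sum(a != b for a, b in zip(seq_a, seq_b)) / n if n else 0.0

def run_job(seqs, cache):
    """Compute all pairwise distances for one job, reusing cached results."""
    return {(i, j): cache.get(seqs[i], seqs[j])
            for i in range(len(seqs)) for j in range(i + 1, len(seqs))}

cache = PairCache(mismatch_fraction)
job1 = run_job(["ACGT", "ACGA", "TCGA"], cache)   # first user
job2 = run_job(["ACGT", "ACGA", "GGGA"], cache)   # second user, overlapping data
```

A scheduler in the spirit of the abstract could order queued jobs so that those with the largest dataset overlap run consecutively, which raises the hit rate of such a cache.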

Hennig A, Nieselt K. Efficient merging of genome profile alignments. Bioinformatics 2019;35:i71-i80. [PMID: 31510683 PMCID: PMC6612806 DOI: 10.1093/bioinformatics/btz377] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022] Open

Abstract

Motivation

Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized the field of multiple sequence alignment construction by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which make the profile-based extension of several whole genomes challenging. Currently, none of the available methods offers the possibility to align or extend WGA profiles.

Results

Here, we present genome profile alignment, an approach that aligns the profiles of WGAs and that is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial datasets up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool.
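
The bidirectional mapping between sequence coordinates and alignment coordinates mentioned above can be illustrated for the simplest case of one collinear alignment block. The sketch below is not the SuperGenome data structure itself: the class and function names are hypothetical, and rearrangements, strand, and multiple blocks, which a whole-genome alignment must handle, are deliberately left out.

```python
class CoordinateMap:
    """Bidirectional map between ungapped sequence positions and alignment columns."""

    def __init__(self, aligned_row, gap="-"):
        # seq_to_aln[p] is the alignment column holding sequence position p.
        self.seq_to_aln = [col for col, ch in enumerate(aligned_row) if ch != gap]
        # aln_to_seq[c] is the sequence position in column c, or None for a gap.
        self.aln_to_seq = [None] * len(aligned_row)
        for pos, col in enumerate(self.seq_to_aln):
            self.aln_to_seq[col] = pos

def transfer(pos, source, target):
    """Map a position in one sequence to the other via the shared column.

    Returns None when the target sequence has a gap in that column.
    """
    return target.aln_to_seq[source.seq_to_aln[pos]]

# Two rows of a toy alignment block (made-up sequences).
genome_a = CoordinateMap("AC-GT")
genome_b = CoordinateMap("A-TGT")
```

Here `transfer(2, genome_a, genome_b)` follows position 2 of the first row to its alignment column and then to the matching position in the second row; the alignment column acts as the common coordinate system into which both sequences are translated.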

Availability and implementation

GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs.

Supplementary information

Supplementary data are available at Bioinformatics online.

Sievers F, Higgins DG. QuanTest2: benchmarking multiple sequence alignments using secondary structure prediction. Bioinformatics 2019;36:90-95. [PMID: 31292629 PMCID: PMC9881607 DOI: 10.1093/bioinformatics/btz552] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 03/27/2019] [Revised: 06/17/2019] [Accepted: 07/09/2019] [Indexed: 02/02/2023]

Vineetha V, Biji CL, Nair AS. SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning. Sci Rep 2019;9:6631. [PMID: 31036850 PMCID: PMC6488671 DOI: 10.1038/s41598-019-42966-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/04/2018] [Accepted: 04/12/2019] [Indexed: 11/09/2022]

Pervez MT, Shah HA, Babar ME, Naveed N, Shoaib M. SAliBASE: A Database of Simulated Protein Alignments. Evol Bioinform Online 2019;15:1176934318821080. [PMID: 30733625 PMCID: PMC6343434 DOI: 10.1177/1176934318821080] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/25/2018] [Accepted: 11/26/2018] [Indexed: 01/17/2023]

PVTree: A Sequential Pattern Mining Method for Alignment Independent Phylogeny Reconstruction. Genes (Basel) 2019;10:genes10020073. [PMID: 30678245 PMCID: PMC6410268 DOI: 10.3390/genes10020073] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/27/2018] [Revised: 01/04/2019] [Accepted: 01/14/2019] [Indexed: 11/21/2022]

Wang Y, Wu H, Cai Y. A benchmark study of sequence alignment methods for protein clustering. BMC Bioinformatics 2018;19:529. [PMID: 30598070 PMCID: PMC6311937 DOI: 10.1186/s12859-018-2524-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022]

Dijkstra M, Bawono P, Abeln S, Feenstra KA, Fokkink W, Heringa J. Motif-Aware PRALINE: Improving the alignment of motif regions. PLoS Comput Biol 2018;14:e1006547. [PMID: 30383764 PMCID: PMC6233922 DOI: 10.1371/journal.pcbi.1006547] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 01/11/2018] [Revised: 11/13/2018] [Accepted: 10/05/2018] [Indexed: 11/21/2022]

Abstract

Protein or DNA motifs are sequence regions which possess biological importance. These regions are often highly conserved among homologous sequences. The generation of multiple sequence alignments (MSAs) with a correct alignment of the conserved sequence motifs is still difficult to achieve, because the contribution of these typically short fragments is overshadowed by the rest of the sequence. Here we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm in order to address this shortcoming. This method can incorporate explicit information about the presence of externally provided sequence motifs, which is then used in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments we confirm that a good compromise can be found that improves the matching of motif regions while not significantly reducing the overall alignment quality. By estimating α on an unrelated set of reference alignments we find there is indeed a strong conservation signal for motifs. A number of typical but difficult MSA use cases are explored to exemplify the problems in correctly aligning functional sequence motifs and how the motif-aware alignment method can be employed to alleviate these problems.

The most important functional parts of proteins are often small, but very specific, sequence motifs. Moreover, these motifs tend to be strongly conserved during evolution due to their functional role. Nevertheless, when trying to align protein sequences of the same family, it is often very difficult to align such motifs using standard multiple sequence alignment methods. Aligning functional residues correctly is essential to detect motif conservation, which can be used to filter out spuriously occurring motifs. Additionally, many downstream analyses, such as phylogenetics, are strongly reliant on alignment quality. We have developed a sequence alignment program named Motif-Aware PRALINE (MA-PRALINE) that incorporates information about motifs explicitly. Motifs are provided to MA-PRALINE in the PROSITE pattern syntax; it then scans the input sequences for instances of the pattern and provides a score bonus to matching sequence positions. Our method provides a reproducible alternative to editing alignments by hand in order to account for motif conservation, which is a tedious and error-prone process. We will show that MA-PRALINE allows the alignment of motif-rich regions to be fine-tuned while not degrading the rest of the alignment. MA-PRALINE is available on GitHub as open source software; this allows it to be easily tailored to similar problems. We apply MA-PRALINE to the HIV-1 envelope glycoprotein (gp120) to get an improved alignment of the N-terminal glycosylation motifs. The presence of these motifs is essential for the virus in evading the immune response of the host.
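The motif scan and score bonus described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MA-PRALINE code: the helper names (`prosite_to_regex`, `motif_mask`, `boosted_score`), the example sequence, and the rule of adding α only when both aligned positions lie inside a motif are all hypothetical. The converter handles only the common PROSITE elements (`x`, `[...]`, `{...}`, repetition counts, and the `<`/`>` anchors), and the example uses the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}.

```python
import re
from typing import List


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        elem = elem.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]  (forbidden residues)
        elem = elem.replace("(", "{").replace(")", "}")     # x(2,4) -> x{2,4} (repetition)
        elem = elem.replace("x", ".")                        # x      -> any residue
        elem = elem.replace("<", "^").replace(">", "$")      # N- and C-terminal anchors
        parts.append(elem)
    return "".join(parts)


def motif_mask(seq: str, pattern: str) -> List[bool]:
    """Mark every sequence position covered by at least one motif instance."""
    # The lookahead lets finditer report overlapping instances as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    mask = [False] * len(seq)
    for match in regex.finditer(seq):
        for i in range(match.start(1), match.end(1)):
            mask[i] = True
    return mask


def boosted_score(base: float, a_in_motif: bool, b_in_motif: bool, alpha: float) -> float:
    """Assumed bonus rule: add alpha when both aligned positions are motif positions."""
    return base + alpha if (a_in_motif and b_in_motif) else base


# Hypothetical example sequence.
seq = "MKNGSAANPTQ"
mask = motif_mask(seq, "N-{P}-[ST]-{P}")
motif_positions = [i for i, hit in enumerate(mask) if hit]
print(motif_positions)
```

In a dynamic programming aligner, `boosted_score` would stand in for the plain substitution score lookup, so that α controls how strongly motif positions attract each other.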

Chaabane L. A hybrid solver for protein multiple sequence alignment problem. J Bioinform Comput Biol 2018;16:1850015. [DOI: 10.1142/s0219720018500154] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]

Orlando G, Raimondi D, Khan T, Lenaerts T, Vranken WF. SVM-dependent pairwise HMM: an application to protein pairwise alignments. Bioinformatics 2018;33:3902-3908. [PMID: 28666322 DOI: 10.1093/bioinformatics/btx391] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 12/29/2016] [Accepted: 06/12/2017] [Indexed: 12/27/2022]

Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Multiobjective characteristic-based framework for very-large multiple sequence alignment. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.06.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/19/2022]

Lladós J, Cores F, Guirado F. Scalable Consistency in T-Coffee Through Apache Spark and Cassandra Database. J Comput Biol 2018;25:894-906. [PMID: 30004242 DOI: 10.1089/cmb.2018.0084] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]

Zambrano-Vega C, Nebro AJ, García-Nieto J, Aldana-Montes JF. M2Align: parallel multiple sequence alignment with a multi-objective metaheuristic. Bioinformatics 2018;33:3011-3017. [PMID: 28541404 DOI: 10.1093/bioinformatics/btx338] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 12/24/2016] [Accepted: 05/20/2017] [Indexed: 11/14/2022]

Amorim AR, Neves LA, Valêncio CR, Roberto GF, Zafalon GFD. An approach for COFFEE objective function to global DNA multiple sequence alignment. Comput Biol Chem 2018;75:39-44. [PMID: 29738913 DOI: 10.1016/j.compbiolchem.2018.04.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 09/14/2016] [Revised: 03/29/2018] [Accepted: 04/20/2018] [Indexed: 10/17/2022]

Rubio-Largo Á, Castelli M, Vanneschi L, Vega-Rodríguez MA. A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment. J Comput Biol 2018;25:1009-1022. [PMID: 29671616 DOI: 10.1089/cmb.2018.0031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]