1. Santus L, Garriga E, Deorowicz S, Gudyś A, Notredame C. Towards the accurate alignment of over a million protein sequences: Current state of the art. Curr Opin Struct Biol 2023; 80:102577. [PMID: 37012200 DOI: 10.1016/j.sbi.2023.102577] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/20/2022] [Revised: 02/21/2023] [Accepted: 02/27/2023] [Indexed: 04/04/2023]
Abstract
Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over the last decade suggest that accuracy is lost when scaling up beyond a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel higher-level heuristics. This review provides an extensive critical overview of these recent methods. Using established reference datasets, we conclude that although significant progress has been achieved, a unified framework able to consistently and efficiently produce high-accuracy large-scale multiple alignments is still lacking.
2. Nayeem MA, Bayzid MS, Rahman AH, Shahriyar R, Rahman MS. Multiobjective Formulation of Multiple Sequence Alignment for Phylogeny Inference. IEEE Transactions on Cybernetics 2022; 52:2775-2786. [PMID: 33044939 DOI: 10.1109/tcyb.2020.3020308] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Multiple sequence alignment (MSA) is a preliminary task for estimating phylogenies. It is used for homology inference among the sequences of a set of species. Generally, the MSA task is handled as a single-objective optimization process. The alignments computed under one criterion may be different from the alignments generated by other criteria, inferring discordant homologies and thus leading to different hypothesized evolutionary histories relating the sequences. To address this issue, the multiobjective (MO) formulation of MSA has recently been advocated by several researchers. An MO approach independently optimizes multiple (often conflicting) objective functions at the same time and outputs a set of competitive alignments. However, no conceptual or experimental rationale from a real-world application perspective has been reported so far for any MO formulation of MSA. This article investigates the impact of MO formulation in the context of an important scientific problem, namely, phylogeny estimation. Employing popular evolutionary MO algorithms, we show that: 1) trees inferred based on alignments produced by the existing MSA methods used in practice are substantially worse in quality than the trees inferred based on the alignments output by an MO algorithm and 2) even high-quality alignments (according to popular measures available in the literature) may fail to achieve acceptable accuracy in generating phylogenetic trees. Thus, we essentially ask the following natural question: "can a phylogeny-aware (i.e., application-aware) metric guide in selecting appropriate MO formulations to ensure better phylogeny estimation?" Here, we report a carefully designed extensive experimental study that positively answers this question.
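Note: the idea that an MO approach "outputs a set of competitive alignments" can be illustrated with a plain non-dominated (Pareto) filter. The sketch below is not the authors' method; the alignment names and the two score values per alignment are hypothetical, and both objectives are assumed to be maximized.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if score vector a is no worse than b on every objective and
    strictly better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scored: List[Tuple[str, Tuple[float, ...]]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in scored
        if not any(dominates(other, scores) for _, other in scored)
    ]


# Hypothetical candidate alignments with two made-up objective scores each.
candidates = [
    ("aln_A", (0.90, 0.40)),
    ("aln_B", (0.70, 0.80)),
    ("aln_C", (0.60, 0.70)),  # dominated by aln_B
    ("aln_D", (0.90, 0.30)),  # dominated by aln_A
]

print(pareto_front(candidates))
```

The surviving alignments are the trade-off set from which a downstream criterion (for example, a phylogeny-aware metric, as the abstract suggests) would have to pick.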
3. Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022] [Open Access]
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Affiliation(s)
- Yongqing Zhang: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Qiang Zhang: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou: School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
4. A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm. Soft Comput 2020. [DOI: 10.1007/s00500-020-04917-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/04/2023]
5. Gil N, Fiser A. Identifying functionally informative evolutionary sequence profiles. Bioinformatics 2018; 34:1278-1286. [PMID: 29211823 PMCID: PMC5905606 DOI: 10.1093/bioinformatics/btx779] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/28/2017] [Accepted: 11/29/2017] [Indexed: 01/06/2023] [Open Access]
Abstract
Motivation: Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. Results: We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
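Note: "mutual information among MSA columns" can be made concrete with the textbook definition, MI(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x) p(y))], estimated from residue counts in two columns. The sketch below is only that generic calculation, not the SAMMI program; the toy alignment, the treatment of any gap character as an ordinary symbol, and the averaging over all column pairs are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def column(msa: List[str], j: int) -> str:
    """Extract column j of an alignment given as equal-length rows."""
    return "".join(row[j] for row in msa)


def mutual_information(col_a: str, col_b: str) -> float:
    """Mutual information (in bits) between two alignment columns,
    using observed frequencies as probability estimates."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((count_a[x] / n) * (count_b[y] / n)))
    return mi


def mean_pairwise_mi(msa: List[str]) -> float:
    """Average mutual information over all distinct column pairs."""
    ncols = len(msa[0])
    pairs = [(i, j) for i in range(ncols) for j in range(i + 1, ncols)]
    total = sum(mutual_information(column(msa, i), column(msa, j)) for i, j in pairs)
    return total / len(pairs)


# Toy alignment: columns 0 and 1 covary perfectly, column 2 is invariant.
toy_msa = ["AKL", "AKL", "GRL", "GRL"]
print(mutual_information(column(toy_msa, 0), column(toy_msa, 1)))
print(mean_pairwise_mi(toy_msa))
```

An invariant column carries no mutual information with any other column, which is why a score of this kind rewards alignments that are diverse yet consistently co-varying.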
Affiliation(s)
- Nelson Gil: Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Andras Fiser: Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
6. Rubio-Largo A, Vanneschi L, Castelli M, Vega-Rodriguez MA. A Characteristic-Based Framework for Multiple Sequence Aligners. IEEE Transactions on Cybernetics 2018; 48:41-51. [PMID: 27831898 DOI: 10.1109/tcyb.2016.2621129] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/06/2023]
Abstract
Multiple sequence alignment is a well-known bioinformatics problem that consists of aligning three or more biological sequences (protein or nucleic acid). In the literature, a number of tools have been proposed for dealing with this biological sequence alignment problem, such as progressive methods, consistency-based methods, or iterative methods, among others. These aligners often use a default parameter configuration for all the input sequences to align. However, the default configuration is not always the best choice: the alignment accuracy of the tool may be greatly boosted if specific parameter configurations are used, depending on the biological characteristics of the input sequences. In this paper, we propose a characteristic-based framework for multiple sequence aligners. The idea of the framework is, given an input set of unaligned sequences, to extract its characteristics and run the aligner with the best parameter configuration found for another set of unaligned sequences with similar characteristics. In order to test the framework, we have used the well-known multiple sequence comparison by log-expectation (MUSCLE) v3.8 aligner with different benchmarks, such as benchmark alignments database v3.0, protein reference alignment benchmark v4.0, and sequence alignment benchmark v1.65. The results show that the alignment accuracy and conservation of MUSCLE might be greatly improved with the proposed framework, especially in those scenarios with a low percentage of identity. The characteristic-based framework for multiple sequence aligners is freely available for downloading at http://arco.unex.es/arl/fwk-msa/cbf-msa.zip.
7. Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Using biological knowledge for multiple sequence aligner decision making. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2017.08.069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
8. Chowdhury B, Garai G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 2017; 109:419-431. [PMID: 28669847 DOI: 10.1016/j.ygeno.2017.06.007] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 04/26/2017] [Revised: 05/27/2017] [Accepted: 06/27/2017] [Indexed: 01/04/2023]
Abstract
Sequence alignment is an active research area in the field of bioinformatics. It is also a crucial task, as it guides many other tasks such as phylogenetic analysis and function and/or structure prediction of biological macromolecules like DNA, RNA, and protein. Proteins are the building blocks of every living organism. Although the protein alignment problem has been studied for several decades, every available method unfortunately produces different alignment results for a single alignment problem. Multiple sequence alignment is characterized as a computationally very complex problem. Many stochastic methods, therefore, are considered for improving the accuracy of alignment. Among them, many researchers frequently use the Genetic Algorithm. In this study, we show the different types of methods applied in alignment and the recent trends in the multiobjective genetic algorithm for solving multiple sequence alignment. Many recent studies have demonstrated considerable progress in improving alignment accuracy.
Affiliation(s)
- Biswanath Chowdhury: Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, WB, 700009, India
- Gautam Garai: Computational Sciences Division, Saha Institute of Nuclear Physics, Kolkata, WB 700064, India
9. Arribas-Gil A, Matias C. A time warping approach to multiple sequence alignment. Stat Appl Genet Mol Biol 2017; 16:133-144. [PMID: 28593899 DOI: 10.1515/sagmb-2016-0043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
We propose an approach for multiple sequence alignment (MSA) derived from the dynamic time warping viewpoint and recent techniques of curve synchronization developed in the context of functional data analysis. Starting from pairwise alignments of all the sequences (viewed as paths in a certain space), we construct a median path that represents the MSA we are looking for. We establish a proof of concept that our method could be an interesting ingredient to include in refined MSA techniques. We present a simple synthetic experiment as well as the study of a benchmark dataset, together with comparisons with two widely used MSA software packages.
10. Rani RR, Ramyachitra D. Multiple sequence alignment using multi-objective based bacterial foraging optimization algorithm. Biosystems 2016; 150:177-189. [PMID: 27784624 DOI: 10.1016/j.biosystems.2016.10.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 04/11/2016] [Revised: 10/18/2016] [Accepted: 10/18/2016] [Indexed: 10/20/2022]
Abstract
Multiple sequence alignment (MSA) is a widespread approach in computational biology and bioinformatics. MSA deals with how sequences of nucleotides or amino acids can be arranged with the best possible alignment and a minimum number of gaps between them, which points to the functional, evolutionary and structural relationships among the sequences. Computing an MSA that is accurate and yields statistically significant alignments remains a challenging task. In this work, the Bacterial Foraging Optimization Algorithm was employed to align the biological sequences, which resulted in a non-dominated optimal solution. It employs multiple objectives: maximization of similarity, non-gap percentage and conserved blocks, and minimization of gap penalty. The BAliBASE 3.0 benchmark database was used to examine the proposed algorithm against other methods. In this paper, two algorithms have been proposed: Hybrid Genetic Algorithm with Artificial Bee Colony (GA-ABC) and Bacterial Foraging Optimization Algorithm (BFO). It was found that GA-ABC performed better than the existing optimization algorithms, but the conserved blocks were still not obtained using GA-ABC. BFO was then used for the alignment and the conserved blocks were obtained. The proposed Multi-Objective Bacterial Foraging Optimization Algorithm (MO-BFO) was compared with the widely used MSA methods Clustal Omega, Kalign, MUSCLE, MAFFT, Genetic Algorithm (GA), Ant Colony Optimization (ACO), Artificial Bee Colony (ABC), Particle Swarm Optimization (PSO) and Hybrid Genetic Algorithm with Artificial Bee Colony (GA-ABC). The final results show that the proposed MO-BFO algorithm yields better alignments than most widely used methods.
Affiliation(s)
- R Ranjani Rani: Department of Computer Science, Bharathiar University, Coimbatore, Tamilnadu, India
- D Ramyachitra: Department of Computer Science, Bharathiar University, Coimbatore, Tamilnadu, India
11. Rubio-Largo Á, Vega-Rodríguez MA, González-Álvarez DL. Hybrid multiobjective artificial bee colony for multiple sequence alignment. Appl Soft Comput 2016. [DOI: 10.1016/j.asoc.2015.12.034] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 10/22/2022]
12. Ezawa K. Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map. BMC Bioinformatics 2016; 17:133. [PMID: 26992851 PMCID: PMC4799563 DOI: 10.1186/s12859-016-0945-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 11/06/2015] [Accepted: 02/11/2016] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND: Reconstruction of multiple sequence alignments (MSAs) is a crucial step in most homology-based sequence analyses, which constitute an integral part of computational biology. To improve the accuracy of this crucial step, it is essential to better characterize errors that state-of-the-art aligners typically make. For this purpose, we here introduce two tools: the complete-likelihood score and the position-shift map. RESULTS: The logarithm of the total probability of an MSA under a stochastic model of sequence evolution along a time axis via substitutions, insertions and deletions (called the "complete-likelihood score" here) can serve as an ideal score of the MSA. A position-shift map, which maps the difference in each residue's position between two MSAs onto one of them, can clearly visualize where and how MSA errors occurred and help disentangle composite errors. To characterize MSA errors using these tools, we constructed three sets of simulated MSAs of selectively neutral mammalian DNA sequences, with small, moderate and large divergences, under a stochastic evolutionary model with an empirically common power-law insertion/deletion length distribution. Then, we reconstructed MSAs using MAFFT and Prank as representative state-of-the-art single-optimum-search aligners. About 40-99% of the hundreds of thousands of gapped segments were involved in alignment errors. In a substantial fraction, from about 1/4 to over 3/4, of erroneously reconstructed segments, reconstructed MSAs by each aligner showed complete-likelihood scores not lower than those of the true MSAs. Out of the remaining errors, a majority by an iterative option of MAFFT showed discrepancies between the aligner-specific score and the complete-likelihood score, and a majority by Prank seemed due to inadequate exploration of the MSA space. Analyses by position-shift maps indicated that true MSAs are in considerable neighborhoods of reconstructed MSAs in about 80-99% of the erroneous segments for small and moderate divergences, but in only a minority for large divergences. CONCLUSIONS: The results of this study suggest that measures to further improve the accuracy of reconstructed MSAs would substantially differ depending on the types of aligners. They also re-emphasize the importance of obtaining a probability distribution of fairly likely MSAs, instead of just searching for a single optimum MSA.
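Note: the abstract describes a position-shift map as the difference in each residue's position between two MSAs, mapped onto one of them. A minimal reading of that description is sketched below; it simply subtracts raw column indices and is an assumption about the idea, not the definition or code used in the paper. The two toy alignments are hypothetical.

```python
from typing import List


def residue_columns(aligned: str, gap: str = "-") -> List[int]:
    """Column index of every non-gap residue in one aligned row."""
    return [col for col, ch in enumerate(aligned) if ch != gap]


def position_shift_map(reference: List[str], reconstructed: List[str]) -> List[List[int]]:
    """For each sequence, the per-residue column shift (reconstructed minus reference).

    Both alignments must contain the same sequences in the same order.
    """
    shifts = []
    for ref_row, rec_row in zip(reference, reconstructed):
        ref_cols = residue_columns(ref_row)
        rec_cols = residue_columns(rec_row)
        if len(ref_cols) != len(rec_cols):
            raise ValueError("rows do not contain the same number of residues")
        shifts.append([rec - ref for ref, rec in zip(ref_cols, rec_cols)])
    return shifts


# Hypothetical true and reconstructed alignments of the same two sequences.
true_msa = ["AC-GT", "ACAGT"]
reco_msa = ["ACG-T", "ACAGT"]

for row in position_shift_map(true_msa, reco_msa):
    print(row)
```

Non-zero entries mark residues that the reconstructed alignment placed in a different column, which is the kind of localized error the map is meant to expose.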
Affiliation(s)
- Kiyoshi Ezawa: Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, 820-8502, Japan; Department of Biology and Biochemistry, University of Houston, Houston, TX, 77204-5001, USA
13. Kuznetsov IB, McDuffie M. PR2ALIGN: a stand-alone software program and a web-server for protein sequence alignment using weighted biochemical properties of amino acids. BMC Res Notes 2015; 8:187. [PMID: 25947299 PMCID: PMC4477417 DOI: 10.1186/s13104-015-1152-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/06/2014] [Accepted: 04/24/2015] [Indexed: 12/04/2022] [Open Access]
Abstract
Background: Alignment of amino acid sequences is the main sequence comparison method used in computational molecular biology. The selection of the amino acid substitution matrix best suited for a given alignment problem is one of the most important decisions the user has to make. In a conventional amino acid substitution matrix all elements are fixed and their values cannot be easily adjusted. Moreover, most existing amino acid substitution matrices account for the average (dis)similarities between amino acid types and do not distinguish the contribution of a specific biochemical property to these (dis)similarities. Findings: PR2ALIGN is a stand-alone software program and a web-server that provide the functionality for implementing flexible user-specified alignment scoring functions and aligning pairs of amino acid sequences based on the comparison of the profiles of biochemical properties of these sequences. Unlike the conventional sequence alignment methods that use 20x20 fixed amino acid substitution matrices, PR2ALIGN uses a set of weighted biochemical properties of amino acids to measure the distance between pairs of aligned residues and to find an optimal minimal distance global alignment. The user can provide any number of amino acid properties and specify a weight for each property. The higher the weight for a given property, the more this property affects the final alignment. We show that in many cases the approach implemented in PR2ALIGN produces better quality pair-wise alignments than the conventional matrix-based approach. Conclusions: PR2ALIGN will be helpful for researchers who wish to align amino acid sequences by using flexible user-specified alignment scoring functions based on the biochemical properties of amino acids instead of the amino acid substitution matrix. To the best of the authors’ knowledge, there are no existing stand-alone software programs or web-servers analogous to PR2ALIGN. The software is freely available from http://pr2align.rit.albany.edu. Electronic supplementary material: The online version of this article (doi:10.1186/s13104-015-1152-6) contains supplementary material, which is available to authorized users.
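Note: the scoring idea described above (a weighted sum of property differences as the distance between two residues, minimized over a global alignment) can be sketched as follows. This is not PR2ALIGN's code or its actual scoring function; the property table covers only four residues for illustration, the absolute-difference distance, the linear gap cost and the function names are all assumptions.

```python
from typing import Dict

# Illustrative property values for four residues only (assumption for this
# sketch); a real run would load full published property scales.
PROPERTIES: Dict[str, Dict[str, float]] = {
    "hydropathy": {"A": 1.8, "L": 3.8, "K": -3.9, "D": -3.5},
    "charge": {"A": 0.0, "L": 0.0, "K": 1.0, "D": -1.0},
}


def residue_distance(a: str, b: str, weights: Dict[str, float]) -> float:
    """Weighted sum of absolute property differences between two residues."""
    return sum(w * abs(PROPERTIES[p][a] - PROPERTIES[p][b]) for p, w in weights.items())


def min_distance_global(s: str, t: str, weights: Dict[str, float], gap: float = 2.0) -> float:
    """Minimal total distance of a global alignment (Needleman-Wunsch style
    dynamic programming with a linear gap cost, minimizing instead of maximizing)."""
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j - 1] + residue_distance(s[i - 1], t[j - 1], weights),
                dp[i - 1][j] + gap,  # gap in t
                dp[i][j - 1] + gap,  # gap in s
            )
    return dp[-1][-1]


weights = {"hydropathy": 1.0, "charge": 1.0}
print(residue_distance("K", "D", weights))
print(min_distance_global("AK", "AD", weights))
```

Raising the weight of one property makes mismatches in that property more expensive, which matches the abstract's statement that a higher weight gives a property more influence on the final alignment.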
Affiliation(s)
- Igor B Kuznetsov: Cancer Research Center and Department of Epidemiology and Biostatistics, University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY, 12144, USA
- Michael McDuffie: Cancer Research Center and Department of Epidemiology and Biostatistics, University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY, 12144, USA
14. PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information. Methods Mol Biol 2014; 1079:263-71. [PMID: 24170408 DOI: 10.1007/978-1-62703-646-7_17] [Citation(s) in RCA: 178] [Impact Index Per Article: 17.8] [Indexed: 12/12/2022]
Abstract
Multiple sequence alignment (MSA) is an essential tool with many applications in bioinformatics and computational biology. Accurate MSA construction for divergent proteins remains a difficult computational task. The constantly increasing protein sequences and structures in public databases could be used to improve alignment quality. PROMALS3D is a tool for protein MSA construction enhanced with additional evolutionary and structural information from database searches. PROMALS3D automatically identifies homologs from sequence and structure databases for input proteins, derives structure-based constraints from alignments of three-dimensional structures, and combines them with sequence-based constraints of profile-profile alignments in a consistency-based framework to construct high-quality multiple sequence alignments. PROMALS3D output is a consensus alignment enriched with sequence and structural information about input proteins and their homologs. PROMALS3D Web server and package are available at http://prodata.swmed.edu/PROMALS3D.
15. Deng X, Cheng J. MSACompro: improving multiple protein sequence alignment by predicted structural features. Methods Mol Biol 2014; 1079:273-283. [PMID: 24170409 DOI: 10.1007/978-1-62703-646-7_18] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/02/2023]
Abstract
Multiple Sequence Alignment (MSA) is an essential tool in protein structure modeling, gene and protein function prediction, DNA motif recognition, phylogenetic analysis, and many other bioinformatics tasks. Therefore, improving the accuracy of multiple sequence alignment is an important long-term objective in bioinformatics. We designed and developed a new method MSACompro to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. Different from the multiple sequence alignment methods that use the tertiary structure information of some sequences, our method uses the structural information purely predicted from sequences. In this chapter, we first introduce some background and related techniques in the field of multiple sequence alignment. Then, we describe the detailed algorithm of MSACompro. Finally, we show that integrating predicted protein structural information improved the multiple sequence alignment accuracy.
Affiliation(s)
- Xin Deng: Computer Science Department, University of Missouri, Columbia, MO, USA
16. Sahraeian SME, Yoon BJ. PicXAA: a probabilistic scheme for finding the maximum expected accuracy alignment of multiple biological sequences. Methods Mol Biol 2014; 1079:203-210. [PMID: 24170404 DOI: 10.1007/978-1-62703-646-7_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
PicXAA is a probabilistic nonprogressive alignment algorithm that finds protein (or DNA) multiple sequence alignments with maximum expected accuracy. PicXAA greedily builds up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures the local similarities across sequences. PicXAA consistently yields accurate alignment results on a wide range of reference sets that have different characteristics, with especially remarkable improvements over other leading algorithms on sequence sets with high local similarities. In this chapter, we describe the overall alignment strategy used in PicXAA and discuss several important considerations for effective deployment of the algorithm.
17. Heuristic alignment methods. Methods Mol Biol 2013; 1079:29-43. [PMID: 24170393 DOI: 10.1007/978-1-62703-646-7_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
Abstract
Computation of multiple sequence alignment (MSA) is usually formulated as a combinatorial optimization problem of an objective function. Solving the problem for virtually all sensible objective functions is known to be NP-complete, implying that some heuristics must be adopted. Several general strategies have been proven effective for obtaining accurate MSAs at reasonable computational cost. This chapter is devoted to a brief summary of the most successful heuristic approaches.
18. Sievers F, Dineen D, Wilm A, Higgins DG. Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics 2013; 29:989-95. [PMID: 23428640 DOI: 10.1093/bioinformatics/btt093] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Indexed: 12/27/2022]
Abstract
MOTIVATION Recent developments in sequence alignment software have made possible multiple sequence alignments (MSAs) of >100 000 sequences in reasonable times. At present, there are no systematic analyses concerning the scalability of the alignment quality as the number of aligned sequences is increased. RESULTS We benchmarked a wide range of widely used MSA packages using a selection of protein families with some known structures and found that the accuracy of such alignments decreases markedly as the number of sequences grows. This is more or less true of all packages and protein families. The phenomenon is mostly due to the accumulation of alignment errors, rather than problems in guide-tree construction. This is partly alleviated by using iterative refinement or selectively adding sequences. The average accuracy of progressive methods by comparison with structure-based benchmarks can be improved by incorporating information derived from high-quality structural alignments of sequences with solved structures. This suggests that the availability of high quality curated alignments will have to complement algorithmic and/or software developments in the long-term. AVAILABILITY AND IMPLEMENTATION Benchmark data used in this study are available at http://www.clustal.org/omega/homfam-20110613-25.tar.gz and http://www.clustal.org/omega/bali3fam-26.tar.gz. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fabian Sievers: School of Medicine and Medical Science, Conway Institute, University College Dublin, Dublin 4, Ireland
19. A data parallel strategy for aligning multiple biological sequences on multi-core computers. Comput Biol Med 2013; 43:350-61. [PMID: 23414778 DOI: 10.1016/j.compbiomed.2012.12.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 11/22/2011] [Revised: 06/22/2012] [Accepted: 12/25/2012] [Indexed: 11/21/2022]
Abstract
In this paper, we address the large-scale biological sequence alignment problem, for which there is increasing demand in computational biology. We employ the data parallelism paradigm, which is suitable for handling large-scale processing on multi-core computers, to achieve a high degree of parallelism. Using this paradigm, we propose a general strategy that can be used to speed up any multiple sequence alignment method. We applied five different clustering algorithms in our strategy and ran rigorous tests on an 8-core computer using four traditional benchmarks and artificially generated sequences. The results show that our multi-core-based implementations can achieve up to 151-fold improvements in execution time while losing 2.19% accuracy on average. The source code of the proposed strategy, together with the test sets used in our analysis, is available on request.
20. Narimani Z, Beigy H, Abolhassani H. A new genetic algorithm for multiple sequence alignment. International Journal of Computational Intelligence and Applications 2013. [DOI: 10.1142/s146902681250023x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
Multiple sequence alignment (MSA) is one of the basic and important problems in molecular biology. MSA can be used for different purposes, including finding the conserved motifs and structurally important regions in protein sequences and determining the evolutionary distance between sequences. Aligning several sequences cannot be done in polynomial time, and therefore heuristic methods such as genetic algorithms can be used to find approximate solutions of MSA problems. Several algorithms based on genetic algorithms have been developed for this problem in recent years. Most of these algorithms use very complicated, problem-specific and time-consuming mutation operators. In this paper, we propose a new algorithm that uses a new way of population initialization and simple mutation and recombination operators. The strength of the proposed GA lies in its simple mutation operators and in a special recombination operator that does not have the problems of similar recombination operators in other GAs. The experimental results show that the proposed algorithm is capable of finding good MSAs compared with existing methods, while it uses simple operators with low computational complexity.
Affiliation(s)
- Zahra Narimani
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Hamid Beigy
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Hassan Abolhassani
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
|
21
|
Blackburne BP, Whelan S. Class of multiple sequence alignment algorithm affects genomic analysis. Mol Biol Evol 2012; 30:642-53. [PMID: 23144040 DOI: 10.1093/molbev/mss256] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.8] [Indexed: 01/07/2023] Open
Abstract
Multiple sequence alignment (MSA) is the heart of comparative sequence analysis. Recent studies demonstrate that MSA algorithms can produce different outcomes when analyzing genomes, including phylogenetic tree inference and the detection of adaptive evolution. These studies also suggest that the difference between MSA algorithms is of a similar order to the uncertainty within an algorithm and suggest integrating across this uncertainty. In this study, we examine further the problem of disagreements between MSA algorithms and how they affect downstream analyses. We also investigate whether integrating across alignment uncertainty affects downstream analyses. We address these questions by analyzing 200 chordate gene families, with properties reflecting those used in large-scale genomic analyses. We find that newly developed distance metrics reveal two significantly different classes of MSA methods (MSAMs). The similarity-based class includes progressive aligners and consistency aligners, representing many methodological innovations for sequence alignment, whereas the evolution-based class includes phylogenetically aware alignment and statistical alignment. We proceed to show that the class of an MSAM has a substantial impact on downstream analyses. For phylogenetic inference, tree estimates and their branch lengths appear highly dependent on the class of aligner used. The number of families, and the sites within those families, inferred to have undergone adaptive evolution depend on the class of aligner used. Similarity-based aligners tend to identify more adaptive evolution. We also develop and test methods for incorporating MSA uncertainty when detecting adaptive evolution but find that although accounting for MSA uncertainty does affect downstream analyses, it appears less important than the class of aligner chosen. 
Our results demonstrate the critical role that MSA methodology has on downstream analysis, highlighting that the class of aligner chosen in an analysis has a demonstrable effect on its outcome.
|
22
|
Wu M, Chatterji S, Eisen JA. Accounting for alignment uncertainty in phylogenomics. PLoS One 2012; 7:e30288. [PMID: 22272325 PMCID: PMC3260272 DOI: 10.1371/journal.pone.0030288] [Citation(s) in RCA: 127] [Impact Index Per Article: 10.6] [Received: 10/08/2011] [Accepted: 12/14/2011] [Indexed: 01/12/2023] Open
Abstract
Uncertainty in multiple sequence alignments has a large impact on phylogenetic analyses. Little has been done to evaluate the quality of individual positions in protein sequence alignments, which directly impact the accuracy of phylogenetic trees. Here we describe ZORRO, a probabilistic masking program that accounts for alignment uncertainty by assigning confidence scores to each alignment position. Using the BALIBASE database and in simulation studies, we demonstrate that masking by ZORRO significantly reduces the alignment uncertainty and improves the tree accuracy.
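The masking step can be illustrated with a small sketch. It assumes per-column confidence scores are already available. The scores, the threshold, and the function name `mask_columns` are hypothetical and do not reproduce ZORRO's probabilistic model or its interface.

```python
def mask_columns(alignment, scores, threshold):
    """Keep only alignment columns whose confidence score meets the threshold.

    alignment: list of equal-length gapped strings
    scores:    one confidence value per column (hypothetical values here)
    """
    if any(len(row) != len(scores) for row in alignment):
        raise ValueError("need exactly one score per alignment column")
    keep = [i for i, s in enumerate(scores) if s >= threshold]
    return ["".join(row[i] for i in keep) for row in alignment]


aligned = ["ACGT", "A-GT"]
column_scores = [9, 2, 7, 4]  # made-up confidence values
masked = mask_columns(aligned, column_scores, threshold=5)
```

The masked alignment, rather than the full one, would then be passed to the tree-building step.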
Affiliation(s)
- Martin Wu
- Department of Biology, University of Virginia, Charlottesville, Virginia, United States of America.
|
23
|
Lin HN, Notredame C, Chang JM, Sung TY, Hsu WL. Improving the alignment quality of consistency based aligners with an evaluation function using synonymous protein words. PLoS One 2011; 6:e27872. [PMID: 22163274 PMCID: PMC3229492 DOI: 10.1371/journal.pone.0027872] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/30/2011] [Accepted: 10/27/2011] [Indexed: 11/18/2022] Open
Abstract
Most sequence alignment tools can successfully align protein sequences with high levels of sequence identity. The accuracy of the corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues using substitution matrices, Fourier transform, sophisticated profile-profile functions, or, more recently, consistency-based approaches. In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset and supported by evolutionary analysis. These words are then used to define a position-specific substitution matrix that better reflects the biological significance of local similarity. The experimental results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non-similar proteins. We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity.
SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.
Affiliation(s)
- Hsin-Nan Lin
- Bioinformatics Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Jia-Ming Chang
- Centre for Genomic Regulation (CRG), UPF, Barcelona, Spain
- Ting-Yi Sung (corresponding author)
- Bioinformatics Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu (corresponding author)
- Bioinformatics Lab, Institute of Information Science, Academia Sinica, Taipei, Taiwan
|
24
|
Deng X, Cheng J. MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts. BMC Bioinformatics 2011; 12:472. [PMID: 22168237 PMCID: PMC3299741 DOI: 10.1186/1471-2105-12-472] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Received: 04/16/2011] [Accepted: 12/14/2011] [Indexed: 11/20/2022] Open
Abstract
Background: Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It is used in almost all bioinformatics tasks, such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields.
Results: We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method differs from multiple sequence alignment methods (e.g. 3D-Coffee) that use the tertiary structure information of some sequences, since the structural information in our method is fully predicted from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact maps to multiple sequence alignment is novel. Rigorous benchmarking of our method on the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools that do not use this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. The performance of the method is comparable to, though slightly lower than, that of the state-of-the-art method PROMALS, which uses structural features and additional homologous sequences.
Conclusion: MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.
Affiliation(s)
- Xin Deng
- Department of Computer Science, University of Missouri-Columbia, Columbia, MO 65211, USA
|
25
|
Protein sequence alignment with family-specific amino acid similarity matrices. BMC Res Notes 2011; 4:296. [PMID: 21846354 PMCID: PMC3201029 DOI: 10.1186/1756-0500-4-296] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/09/2011] [Accepted: 08/16/2011] [Indexed: 11/16/2022] Open
Abstract
Background: Alignment of amino acid sequences by means of dynamic programming is a cornerstone sequence comparison method. The quality of alignments produced by dynamic programming critically depends on the choice of the alignment scoring function. Therefore, for a specific alignment problem one needs a way of selecting the best performing scoring function. This work is focused on the issue of finding optimized protein family- and fold-specific scoring functions for global similarity matrix-based sequence alignment.
Findings: I utilize a comprehensive set of reference alignments obtained from structural superposition of homologous and analogous proteins to design a quantitative statistical framework for evaluating the performance of alignment scoring functions in global pairwise sequence alignment. This framework is applied to study how existing general-purpose amino acid similarity matrices perform on individual protein families and structural folds, and to compare them to family-specific and fold-specific matrices derived in this work. I describe an adaptive alignment procedure that automatically selects an appropriate similarity matrix and optimized gap penalties based on the properties of the sequences being aligned.
Conclusions: The results of this work indicate that using family-specific similarity matrices significantly improves the quality of the alignment of homologous sequences over the traditional sequence alignment based on a single general-purpose similarity matrix. However, using fold-specific similarity matrices can only marginally improve sequence alignment of proteins that share the same structural fold but do not share a common evolutionary origin. The family-specific matrices derived in this work and the optimized gap penalties are available at http://taurus.crc.albany.edu/fsm.
|
26
|
Sahraeian SME, Yoon BJ. PicXAA-Web: a web-based platform for non-progressive maximum expected accuracy alignment of multiple biological sequences. Nucleic Acids Res 2011; 39:W8-12. [PMID: 21515632 PMCID: PMC3125727 DOI: 10.1093/nar/gkr244] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 01/16/2023] Open
Abstract
In this article, we introduce PicXAA-Web, a web-based platform for accurate probabilistic alignment of multiple biological sequences. The core of PicXAA-Web consists of PicXAA, a multiple protein/DNA sequence alignment algorithm, and PicXAA-R, an extension of PicXAA for structural alignment of RNA sequences. Both PicXAA and PicXAA-R are probabilistic non-progressive alignment algorithms that aim to find the optimal alignment of multiple biological sequences by maximizing the expected accuracy. PicXAA and PicXAA-R greedily build up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures local similarities among sequences. PicXAA-Web integrates these two algorithms in a user-friendly web platform for accurate alignment and analysis of multiple protein, DNA and RNA sequences. PicXAA-Web can be freely accessed at http://gsp.tamu.edu/picxaa/.
|
27
|
Hijikata A, Yura K, Noguti T, Go M. Revisiting gap locations in amino acid sequence alignments and a proposal for a method to improve them by introducing solvent accessibility. Proteins 2011; 79:1868-77. [PMID: 21465562 PMCID: PMC3110861 DOI: 10.1002/prot.23011] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 09/25/2010] [Revised: 01/23/2011] [Accepted: 01/28/2011] [Indexed: 12/27/2022]
Abstract
In comparative modeling, the quality of amino acid sequence alignment still constitutes a major bottleneck in the generation of high quality models of protein three-dimensional (3D) structures. Substantial efforts have been made to improve alignment quality by revising the substitution matrix, introducing multiple sequences, replacing dynamic programming with hidden Markov models, and incorporating 3D structure information. Improvements in the gap penalty have not been a major focus, however, following the development of the affine gap penalty and of the secondary structure dependent gap penalty. We revisited the correlation between protein 3D structure and gap location in a large protein 3D structure data set, and found that the frequency of gap locations approximated to an exponential function of the solvent accessibility of the inserted residues. The nonlinearity of the gap frequency as a function of accessibility corresponded well to the relationship between residue mutation pattern and residue accessibility. By introducing this relationship into the gap penalty calculation for pairwise alignment between template and target amino acid sequences, we were able to obtain a sequence alignment much closer to the structural alignment. The quality of the alignments was substantially improved on a pair of sequences with identity in the “twilight zone” between 20 and 40%. The relocation of gaps by our new method made a significant improvement in comparative modeling, exemplified here by the Bacillus subtilis yitF protein. The method was implemented in a computer program, ALAdeGAP (ALignment with Accessibility dependent GAp Penalty), which is available at http://cib.cf.ocha.ac.jp/target_protein/. Proteins 2011; © 2011 Wiley-Liss, Inc.
Affiliation(s)
- Atsushi Hijikata
- Division of Biological Science, Graduate School of Science, Nagoya University, Furo, Chikusa, Nagoya 464-8602, Japan
|
28
|
Abstract
We present a graph-based model for representing two aligned genomic sequences. An alignment graph is a mixed graph consisting of two sets of vertices, each representing one of the input sequences, and three sets of edges. These edges allow the model to represent a number of evolutionary events. This model is used to perform sequence alignment at the level of nucleotides. We define a scoring function for alignment graphs. We show that minimizing the score is NP-complete. However, we present a dynamic programming algorithm that solves the minimization problem optimally for a certain class of alignments, called breakable arrangements. Algorithms for analyzing breakable arrangements are presented. We also present a greedy algorithm that is capable of representing reversals. We present a dynamic programming algorithm that optimally aligns two genomic sequences, when one of the input sequences is a breakable arrangement of the other. Comparing what we define as breakable arrangements to alignments generated by other algorithms, it is seen that many already aligned genomes fall into the category of being breakable. Moreover, the greedy algorithm is shown to represent reversals, besides rearrangements, mutations, and other evolutionary events.
Affiliation(s)
- Nahla A Belal
- Department of Computer Science, AAST, Alexandria, Egypt
|
29
|
Husain N, Obranic S, Koscinski L, Seetharaman J, Babic F, Bujnicki JM, Maravic-Vlahovicek G, Sivaraman J. Structural basis for the methylation of A1408 in 16S rRNA by a panaminoglycoside resistance methyltransferase NpmA from a clinical isolate and analysis of the NpmA interactions with the 30S ribosomal subunit. Nucleic Acids Res 2010; 39:1903-18. [PMID: 21062819 PMCID: PMC3061052 DOI: 10.1093/nar/gkq1033] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 01/28/2023] Open
Abstract
NpmA, a methyltransferase that confers resistance to aminoglycosides was identified in an Escherichia coli clinical isolate. It belongs to the kanamycin–apramycin methyltransferase (Kam) family and specifically methylates the 16S rRNA at the N1 position of A1408. We determined the structures of apo-NpmA and its complexes with S-adenosylmethionine (AdoMet) and S-adenosylhomocysteine (AdoHcy) at 2.4, 2.7 and 1.68 Å, respectively. We generated a number of NpmA variants with alanine substitutions and studied their ability to bind the cofactor, to methylate A1408 in the 30S subunit, and to confer resistance to kanamycin in vivo. Residues D30, W107 and W197 were found to be essential. We have also analyzed the interactions between NpmA and the 30S subunit by footprinting experiments and computational docking. Helices 24, 42 and 44 were found to be the main NpmA-binding site. Both experimental and theoretical analyses suggest that NpmA flips out the target nucleotide A1408 to carry out the methylation. NpmA is plasmid-encoded and can be transferred between pathogenic bacteria; therefore it poses a threat to the successful use of aminoglycosides in clinical practice. The results presented here will assist in the development of specific NpmA inhibitors that could restore the potential of aminoglycoside antibiotics.
Affiliation(s)
- Nilofer Husain
- Department of Biological Sciences, 14 Science drive 4, National University of Singapore, Singapore
|
30
|
Aniba MR, Poch O, Thompson JD. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Res 2010; 38:7353-63. [PMID: 20639539 PMCID: PMC2995051 DOI: 10.1093/nar/gkq625] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Received: 04/27/2010] [Revised: 06/10/2010] [Accepted: 06/29/2010] [Indexed: 11/13/2022] Open
Abstract
The post-genomic era presents many new challenges for the field of bioinformatics. Novel computational approaches are now being developed to handle the large, complex and noisy datasets produced by high throughput technologies. Objective evaluation of these methods is essential (i) to assure high quality, (ii) to identify strong and weak points of the algorithms, (iii) to measure the improvements introduced by new methods and (iv) to enable non-specialists to choose an appropriate tool. Here, we discuss the development of formal benchmarks, designed to represent the current problems encountered in the bioinformatics field. We consider several criteria for building good benchmarks and the advantages to be gained when they are used intelligently. To illustrate these principles, we present a more detailed discussion of benchmarks for multiple alignments of protein sequences. As in many other domains, significant progress has been achieved in the multiple alignment field and the datasets have become progressively more challenging as the existing algorithms have evolved. Finally, we propose directions for future developments that will ensure that the bioinformatics benchmarks correspond to the challenges posed by the high throughput data.
Affiliation(s)
- Mohamed Radhouene Aniba
- Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), Department of Structural Biology and Genomics, Institut National de la Santé et de la Recherche Médicale (INSERM), U596, The Centre National de la Recherche Scientifique (CNRS), UMR7104, F-67400 Illkirch and Université de Strasbourg, F-67000 Strasbourg, France
- Olivier Poch
- Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), Department of Structural Biology and Genomics, Institut National de la Santé et de la Recherche Médicale (INSERM), U596, The Centre National de la Recherche Scientifique (CNRS), UMR7104, F-67400 Illkirch and Université de Strasbourg, F-67000 Strasbourg, France
- Julie D. Thompson
- Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), Department of Structural Biology and Genomics, Institut National de la Santé et de la Recherche Médicale (INSERM), U596, The Centre National de la Recherche Scientifique (CNRS), UMR7104, F-67400 Illkirch and Université de Strasbourg, F-67000 Strasbourg, France
|
31
|
Sahraeian SME, Yoon BJ. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res 2010; 38:4917-28. [PMID: 20413579 PMCID: PMC2926610 DOI: 10.1093/nar/gkq255] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 12/23/2009] [Revised: 03/25/2010] [Accepted: 03/26/2010] [Indexed: 11/13/2022] Open
Abstract
Accurate tools for multiple sequence alignment (MSA) are essential for comparative studies of the function and structure of biological sequences. However, it is very challenging to develop a computationally efficient algorithm that can consistently predict accurate alignments for various types of sequence sets. In this article, we introduce PicXAA (Probabilistic Maximum Accuracy Alignment), a probabilistic non-progressive alignment algorithm that aims to find protein alignments with maximum expected accuracy. PicXAA greedily builds up the multiple alignment from sequence regions with high local similarities, thereby yielding an accurate global alignment that effectively grasps the local similarities among sequences. Evaluations on several widely used benchmark sets show that PicXAA constantly yields accurate alignment results on a wide range of reference sets, with especially remarkable improvements over other leading algorithms on sequence sets with local similarities. PicXAA source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.
Affiliation(s)
- Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
32
|
ElHefnawi MM, Zada S, El-Azab IA. Prediction of prognostic biomarkers for interferon-based therapy to hepatitis C virus patients: a meta-analysis of the NS5A protein in subtypes 1a, 1b, and 3a. Virol J 2010; 7:130. [PMID: 20550652 PMCID: PMC3238222 DOI: 10.1186/1743-422x-7-130] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 04/10/2010] [Accepted: 06/15/2010] [Indexed: 12/19/2022] Open
Abstract
Background: Hepatitis C virus (HCV) is a worldwide health problem with no vaccine, and the only approved therapy is interferon-based treatment plus ribavirin. Predicting response to treatment has health and economic impacts and is a multi-factorial problem involving both host and viral factors (e.g., age, sex, ethnicity, pre-treatment viral load, and the dynamics of the HCV non-structural protein NS5A quasispecies). We implement a novel approach for extracting features, including informative markers from mutations in the non-structural 5A protein (NS5A), specifically its interferon sensitivity determining region (ISDR) and V3 regions, and use a novel bioinformatics approach for pattern recognition on the NS5A protein and its motifs to find biomarkers for response prediction, using class association rules and comparing the predictive power of the different features.
Results: A total of 58 sequences from sustained responders and 94 from non-responders were downloaded from the HCV LANL database. Site-specific signatures for response prediction from the NS5A protein were extracted from the alignments. Class association rules were generated; e.g., sustained response is associated with position A2368T in subtype 1a (support 100% and confidence 52.19%), and in subtype 1b, response is associated with E2356G/D/K (support 76.3% and confidence 67.3%).
Conclusion: The V3 region was a more accurate biomarker than the ISDR region. Subtype-specific class association rules gave better support and confidence than profile hidden Markov model (HMM) scores, genetic distances, or the number of variable sites, and would thus aid in the prediction of prognostic biomarkers and improve the accuracy of prognosis. Site-specific class association rules in the V3 region of the NS5A protein gave the best support and confidence.
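Support and confidence for a class association rule can be computed as in the following sketch. It uses the textbook definitions (support is the fraction of all records matching both sides of the rule, confidence is the fraction of antecedent-matching records that also match the consequent). The abstract does not spell out its definitions, so this may differ from how the reported percentages were calculated. The records, field names, and function name are made up for illustration.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    antecedent and consequent are predicates taking one record.
    """
    if not records:
        raise ValueError("need at least one record")
    matching_antecedent = [r for r in records if antecedent(r)]
    matching_both = [r for r in matching_antecedent if consequent(r)]
    support = len(matching_both) / len(records)
    confidence = (
        len(matching_both) / len(matching_antecedent)
        if matching_antecedent
        else 0.0
    )
    return support, confidence


# Toy data: residue at one alignment position and the treatment outcome.
records = [
    {"residue": "T", "outcome": "responder"},
    {"residue": "T", "outcome": "responder"},
    {"residue": "T", "outcome": "non-responder"},
    {"residue": "A", "outcome": "non-responder"},
]
support, confidence = rule_metrics(
    records,
    antecedent=lambda r: r["residue"] == "T",
    consequent=lambda r: r["outcome"] == "responder",
)
```

On this toy data the rule matches two of four records, and two of the three records carrying the residue are responders.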
Affiliation(s)
- Mahmoud M ElHefnawi
- Informatics and Systems Department, Division of Engineering Research, National Research Centre, Tahrir Street, Cairo, Egypt.
|
33
|
Dessimoz C, Gil M. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol 2010; 11:R37. [PMID: 20370897 PMCID: PMC2884540 DOI: 10.1186/gb-2010-11-4-r37] [Citation(s) in RCA: 137] [Impact Index Per Article: 9.8] [Received: 08/21/2009] [Revised: 01/26/2010] [Accepted: 04/06/2010] [Indexed: 01/08/2023] Open
Abstract
Tree-based tests of alignment methods enable the evaluation of the effect of gap placement on the inference of phylogenetic relationships.
Background: The alignment of biological sequences is of chief importance to most evolutionary and comparative genomics studies, yet the two main approaches used to assess alignment accuracy have flaws: reference alignments are derived from the biased sample of proteins with known structure, and simulated data lack realism.
Results: Here, we introduce tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference. We show that (i) the current belief that consistency-based alignments outperform scoring matrix-based alignments is misguided; (ii) gaps carry substantial phylogenetic signal, but are poorly exploited by most alignment and tree building programs; (iii) even so, excluding gaps and variable regions is detrimental; (iv) disagreement among alignment programs says little about the accuracy of resulting trees.
Conclusions: This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.
Affiliation(s)
- Christophe Dessimoz
- Department of Computer Science, ETH Zurich, Universitaetstr, 6, 8092 Zürich, Switzerland.
|
34
|
Raman S, Vernon R, Thompson J, Tyka M, Sadreyev R, Pei J, Kim D, Kellogg E, DiMaio F, Lange O, Kinch L, Sheffler W, Kim BH, Das R, Grishin NV, Baker D. Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins 2010; 77 Suppl 9:89-99. [PMID: 19701941 DOI: 10.1002/prot.22540] [Citation(s) in RCA: 367] [Impact Index Per Article: 26.2] [Indexed: 11/08/2022]
Abstract
We describe predictions made using the Rosetta structure prediction methodology for the Eighth Critical Assessment of Techniques for Protein Structure Prediction. Aggressive sampling and all-atom refinement were carried out for nearly all targets. A combination of alignment methodologies was used to generate starting models from a range of templates, and the models were then subjected to Rosetta all atom refinement. For the 64 domains with readily identified templates, the best submitted model was better than the best alignment to the best template in the Protein Data Bank for 24 cases, and improved over the best starting model for 43 cases. For 13 targets where only very distant sequence relationships to proteins of known structure were detected, models were generated using the Rosetta de novo structure prediction methodology followed by all-atom refinement; in several cases the submitted models were better than those based on the available templates. Of the 12 refinement challenges, the best submitted model improved on the starting model in seven cases. These improvements over the starting template-based models and refinement tests demonstrate the power of Rosetta structure refinement in improving model accuracy.
Affiliation(s)
- Srivatsan Raman
- Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA
|
35
|
Edgar RC. Optimizing substitution matrix choice and gap parameters for sequence alignment. BMC Bioinformatics 2009; 10:396. [PMID: 19954534 PMCID: PMC2791778 DOI: 10.1186/1471-2105-10-396] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 08/10/2009] [Accepted: 12/02/2009] [Indexed: 12/04/2022] Open
Abstract
Background: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.
Results: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB.
Conclusion: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
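To make concrete how a substitution score and a gap penalty jointly determine a pair-wise global alignment, here is a minimal Needleman-Wunsch scoring sketch. It is not POP and does not use VTML200: the `identity_score` function is a stand-in for a real substitution matrix, the linear gap penalty is a simplification, and all names and parameter values are assumptions for illustration.

```python
def identity_score(x, y, match=2, mismatch=-1):
    """Stand-in for a substitution matrix lookup (illustrative values only)."""
    return match if x == y else mismatch


def global_alignment_score(a, b, score=identity_score, gap=-4):
    """Optimal global alignment score with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # substitution
                dp[i - 1][j] + gap,  # gap in b
                dp[i][j - 1] + gap,  # gap in a
            )
    return dp[-1][-1]


best = global_alignment_score("ACGT", "AGT")
```

Changing `gap` or swapping in a different `score` function changes which alignment is optimal, and that sensitivity is why parameter choice matters for accuracy.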
|
36
|
Knight HM, Pickard BS, Maclean A, Malloy MP, Soares DC, McRae AF, Condie A, White A, Hawkins W, McGhee K, van Beck M, MacIntyre DJ, Starr JM, Deary IJ, Visscher PM, Porteous DJ, Cannon RE, St Clair D, Muir WJ, Blackwood DH. A cytogenetic abnormality and rare coding variants identify ABCA13 as a candidate gene in schizophrenia, bipolar disorder, and depression. Am J Hum Genet 2009; 85:833-46. [PMID: 19944402 PMCID: PMC2790560 DOI: 10.1016/j.ajhg.2009.11.003] [Citation(s) in RCA: 89] [Impact Index Per Article: 5.9] [Received: 08/20/2009] [Revised: 10/14/2009] [Accepted: 11/02/2009] [Indexed: 01/22/2023] Open
Abstract
Schizophrenia and bipolar disorder are leading causes of morbidity across all populations, with heritability estimates of approximately 80% indicating a substantial genetic component. Population genetics and genome-wide association studies suggest an overlap of genetic risk factors between these illnesses but it is unclear how this genetic component is divided between common gene polymorphisms, rare genomic copy number variants, and rare gene sequence mutations. We report evidence that the lipid transporter gene ABCA13 is a susceptibility factor for both schizophrenia and bipolar disorder. After the initial discovery of its disruption by a chromosome abnormality in a person with schizophrenia, we resequenced ABCA13 exons in 100 cases with schizophrenia and 100 controls. Multiple rare coding variants were identified including one nonsense and nine missense mutations and compound heterozygosity/homozygosity in six cases. Variants were genotyped in additional schizophrenia, bipolar, depression (n > 1600), and control (n > 950) cohorts and the frequency of all rare variants combined was greater than controls in schizophrenia (OR = 1.93, p = 0.0057) and bipolar disorder (OR = 2.71, p = 0.00007). The population attributable risk of these mutations was 2.2% for schizophrenia and 4.0% for bipolar disorder. In a study of 21 families of mutation carriers, we genotyped affected and unaffected relatives and found significant linkage (LOD = 4.3) of rare variants with a phenotype including schizophrenia, bipolar disorder, and major depression. These data identify a candidate gene, highlight the genetic overlap between schizophrenia, bipolar disorder, and depression, and suggest that rare coding variants may contribute significantly to risk of these disorders.
Affiliation(s)
- Helen M. Knight
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Benjamin S. Pickard
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Alan Maclean
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Mary P. Malloy
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Dinesh C. Soares
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Allan F. McRae
- Queensland Institute of Medical Research, 300 Herston Road, Herston 4006, QLD, Australia
- Alison Condie
- Wellcome Trust Clinical Research Facility, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Angela White
- Wellcome Trust Clinical Research Facility, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- William Hawkins
- Wellcome Trust Clinical Research Facility, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Kevin McGhee
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Margaret van Beck
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- Donald J. MacIntyre
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- John M. Starr
- Centre for Cognitive Ageing and Cognitive Epidemiology, Geriatric Medicine Unit, University of Edinburgh, Royal Victoria Hospital, Craigleith Road, Edinburgh EH4 2DN, UK
- Ian J. Deary
- Centre for Cognitive Ageing and Cognitive Epidemiology, Department of Psychology, University of Edinburgh, 7 George Square, Edinburgh EH8 9JZ, UK
- Peter M. Visscher
- Queensland Institute of Medical Research, 300 Herston Road, Herston 4006, QLD, Australia
- David J. Porteous
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Ronald E. Cannon
- Cancer Biology Group, National Center for Toxicogenomics, NIEHS, Research Triangle Park, NC 27709, USA
- David St Clair
- Institute of Medical Sciences, University of Aberdeen, Foresterhill, Aberdeen AB25 2ZD, UK
- Walter J. Muir
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
- Douglas H.R. Blackwood
- Division of Psychiatry, University of Edinburgh, Royal Edinburgh Hospital, Edinburgh EH10 5HF, UK
- Medical Genetics, Institute of Genetics and Molecular Medicine, University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
|
37
|
Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 2009; 25:2455-65. [PMID: 19648142 PMCID: PMC2752613 DOI: 10.1093/bioinformatics/btp452] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.0] [Received: 04/03/2009] [Revised: 06/24/2009] [Accepted: 07/16/2009] [Indexed: 12/22/2022] Open
Abstract
This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches.
Affiliation(s)
- Carsten Kemena
- Centre for Genomic Regulation, Pompeu Fabra University, Carrer del Doctor Aiguader 88, 08003 Barcelona, Spain
|
38
|
Abstract
BACKGROUND: Multiple Sequence Alignment (MSA) has always been an active area of research in Bioinformatics. MSA is mainly focused on discovering biologically meaningful relationships among different sequences or proteins in order to investigate the underlying main characteristics/functions. This information is also used to generate phylogenetic trees. RESULTS: This paper presents a novel approach, namely RBT-GA, to solve the MSA problem using a hybrid solution methodology combining the Rubber Band Technique (RBT) and the Genetic Algorithm (GA) metaheuristic. RBT is inspired by the behavior of an elastic Rubber Band (RB) on a plate with several poles, which is analogous to locations in the input sequences that could potentially be biologically related. A GA attempts to mimic the evolutionary processes of life in order to locate optimal solutions in an often very complex landscape. RBT-GA is a population-based optimization algorithm designed to find the optimal alignment for a set of input protein sequences. In this novel technique, each alignment answer is modeled as a chromosome consisting of several poles in the RBT framework. These poles resemble locations in the input sequences that are most likely to be correlated and/or biologically related. A GA-based optimization process improves these chromosomes gradually, yielding a set of mostly optimal answers for the MSA problem. CONCLUSION: RBT-GA is tested with one of the well-known benchmark suites (BALiBASE 2.0) in this area. The obtained results show the superiority of the proposed technique even in the case of formidable sequences.
Affiliation(s)
- Javid Taheri
- School of Information Technologies, J12, The University of Sydney, Sydney, NSW 2006, Australia
- Albert Y Zomaya
- School of Information Technologies, J12, The University of Sydney, Sydney, NSW 2006, Australia
|
39
|
Abstract
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches), yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.
Biological sequence alignment is one of the fundamental problems in comparative genomics, yet it remains unsolved. Over sixty sequence alignment programs are listed on Wikipedia, and many new programs are published every year. However, many popular programs suffer from pathologies such as aligning unrelated sequences and producing discordant alignments in protein (amino acid) and codon (nucleotide) space, casting doubt on the accuracy of the inferred alignments. Inaccurate alignments can introduce large and unknown systematic biases into downstream analyses such as phylogenetic tree reconstruction and substitution rate estimation. We describe a new program for multiple sequence alignment which can align protein, RNA and DNA sequence and improves on the accuracy of existing approaches on benchmarks of protein and RNA structural alignments and simulated mammalian and fly genomic alignments. Our approach, which seeks to find the alignment which is closest to the truth under our statistical model, leaves unrelated sequences largely unaligned and produces concordant alignments in protein and codon space. It is fast enough for difficult problems such as aligning orthologous genomic regions or aligning hundreds or thousands of proteins. It furthermore has a companion GUI for visualizing the estimated alignment reliability.
|
40
|
Mahalakshmi A, Sujatha K, Shenbagarathai R. Molecular modeling and characterization of the B. thuringiensis and B. thuringiensis LDC-9 cytolytic proteins. J Biomol Struct Dyn 2008; 26:375-86. [PMID: 18808203 DOI: 10.1080/07391102.2008.10507252] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
Abstract
The Cyt toxins are able to lyse a wide range of cell types in vitro, unlike the Cry delta-endotoxins. They exert their activity by the formation of pores within target cell membranes. The structural information available for Cyt2Aa (PDB id: 1CBY) consists of a single domain in which two outer layers of alpha-helix wrap around a mixed beta-sheet. A beta-barrel was suggested as a possible structure of the pores. Hence, this study seeks to investigate the structural properties of other cytolytic proteins by predicting the three-dimensional (3D) model using Cyt2Aa as template. The predicted models are expected to be significantly more accurate as all the Cyt proteins showed significant similarity with the template (PDB id: 1CBY). The refined homology models revealed similar secondary structures (alpha-helices and beta-sheets) and tertiary features as Cyt2Aa. The variation in the loop regions of the tertiary structure accounts for the differential toxicity.
Affiliation(s)
- A Mahalakshmi
- PG and Research Department of Zoology and Biotechnology, Lady Doak College, Madurai-625 002, TamilNadu, India.
|
41
|
Lu Y, Sze SH. Improving accuracy of multiple sequence alignment algorithms based on alignment of neighboring residues. Nucleic Acids Res 2008; 37:463-72. [PMID: 19056820 PMCID: PMC2632924 DOI: 10.1093/nar/gkn945] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 12/04/2022] Open
Abstract
While most of the recent improvements in multiple sequence alignment accuracy are due to better use of vertical information, which includes the incorporation of consistency-based pairwise alignments and the use of profile alignments, we observe that it is possible to further improve accuracy by taking into account alignment of neighboring residues when aligning two residues, thus making better use of horizontal information. By modifying existing multiple alignment algorithms to make use of horizontal information, we show that this strategy is able to consistently improve over existing algorithms on a few sets of benchmark alignments that are commonly used to measure alignment accuracy, and the average improvements in accuracy can be as much as 1–3% on protein sequence alignment and 5–10% on DNA/RNA sequence alignment. Unlike previous algorithms, consistent average improvements can be obtained across all identity levels.
Affiliation(s)
- Yue Lu
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX 77843, USA
|
42
|
Rausch T, Emde AK, Weese D, Döring A, Notredame C, Reinert K. Segment-based multiple sequence alignment. Bioinformatics 2008; 24:i187-92. [PMID: 18689823 DOI: 10.1093/bioinformatics/btn281] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and wide-spread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far. RESULTS: We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast and (3) the consistency idea can be extended to align multiple genomic sequences. AVAILABILITY: The segment-based multiple sequence alignment tool can be downloaded from http://www.seqan.de/projects/msa.html. A novel version of T-Coffee interfaced with the tool is available from http://www.tcoffee.org. The usage of the tool is described in both documentations.
Affiliation(s)
- Tobias Rausch
- International Max Planck Research School for Computational Biology and Scientific Computing, Ihnestr 63-73, 14195 Berlin, Germany.
|
43
|
Larrea AA, Pedroso IM, Malhotra A, Myers RS. Identification of two conserved aspartic acid residues required for DNA digestion by a novel thermophilic Exonuclease VII in Thermotoga maritima. Nucleic Acids Res 2008; 36:5992-6003. [PMID: 18812402 PMCID: PMC2566859 DOI: 10.1093/nar/gkn588] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 01/08/2023] Open
Abstract
Exonuclease VII was first identified in 1974 as a DNA exonuclease that did not require any divalent cations for activity. Indeed, Escherichia coli ExoVII was identified in partially purified extracts in the presence of EDTA. ExoVII is comprised of two subunits (XseA and XseB) that are highly conserved and present in most sequenced prokaryotic genomes, but are not seen in eukaryotes. To better understand this exonuclease family, we have characterized an ExoVII homolog from Thermotoga maritima. Thermotoga maritima XseA/B homologs TM1768 and TM1769 were co-expressed and purified, and show robust nuclease activity at 80°C. This activity is magnesium dependent and is inhibited by phosphate ions, which distinguish it from E. coli ExoVII. Nevertheless, both E. coli and T. maritima ExoVII share a similar putative active site motif with two conserved aspartate residues in the large (XseA/TM1768) subunit. We show that these residues, Asp235 and Asp240, are essential for the nuclease activity of T. maritima ExoVII. We hypothesize that the ExoVII family of nucleases can be sub-divided into two sub-families based on EDTA resistance and that T. maritima ExoVII is the first member of the branch that is characterized by EDTA sensitivity and inhibition by phosphate.
Affiliation(s)
- Andres A Larrea
- Department of Biochemistry and Molecular Biology, University of Miami Miller School of Medicine, Miami, FL 33136, USA
|
44
|
Lu Y, Sze SH. Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences. J Comput Biol 2008; 15:767-77. [PMID: 18662101 DOI: 10.1089/cmb.2007.0132] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022] Open
Affiliation(s)
- Yue Lu
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, Texas
- Sing-Hoi Sze
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, Texas
- Computer Science, Texas A&M University, College Station, Texas
|
45
|
Lee BC, Park K, Kim D. Analysis of the residue-residue coevolution network and the functionally important residues in proteins. Proteins 2008; 72:863-72. [PMID: 18275083 DOI: 10.1002/prot.21972] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Indexed: 01/05/2023]
Abstract
It is a common belief that some residues of a protein are more important than others. In some cases, point mutations of some residues have a butterfly effect on the protein structure and function, but in other cases they do not. In addition, the residues important for the protein function tend to be not only conserved but also coevolved with other interacting residues in a protein. Motivated by these observations, the authors propose that there is a network composed of the residues, the residue-residue coevolution network (RRCN), where nodes are residues and links are set when the coevolutionary interaction strengths between residues are sufficiently large. The authors build the RRCN for 44 diverse protein families. The interaction strengths are calculated by using the McBASC algorithm. After constructing the RRCN, the authors identify residues that have a high degree of connectivity (hub nodes), and residues that play a central role in the network flow of information (C(I) nodes). The authors show that these residues are likely to be functionally important residues. Moreover, the C(I) nodes appear to be more relevant to the function than the hub nodes. Unlike other similar methods, the method described in this study is solely based on sequences. Therefore, the method can be applied to the function annotation of a wider range of proteins.
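The network construction described in this abstract can be sketched in a few lines: residue pairs whose coevolution score passes a cutoff become links, and residues with many links are reported as hubs. This is only an illustrative sketch. The scores below are made up rather than computed with McBASC, the names `build_network` and `hub_nodes` and both cutoff values are hypothetical, and the C(I) centrality measure from the paper is not reproduced.

```python
from collections import defaultdict


def build_network(pair_scores, threshold):
    """Link two residues when their coevolution score reaches `threshold`.

    `pair_scores` maps (residue_i, residue_j) tuples to a score; the
    result maps each linked residue to the set of its neighbours.
    """
    adjacency = defaultdict(set)
    for (i, j), strength in pair_scores.items():
        if strength >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def hub_nodes(adjacency, min_degree):
    """Return residues with at least `min_degree` links, sorted by index."""
    return sorted(
        residue
        for residue, neighbours in adjacency.items()
        if len(neighbours) >= min_degree
    )


# Hypothetical coevolution scores for residue positions 1 to 4.
scores = {(1, 2): 0.90, (1, 3): 0.80, (1, 4): 0.85, (2, 3): 0.20, (3, 4): 0.75}
network = build_network(scores, threshold=0.7)
hubs = hub_nodes(network, min_degree=3)  # residue 1 is linked to 2, 3 and 4
```

Both cutoffs are free parameters in this sketch; the abstract only states that links are set when the interaction strength is "sufficiently large".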
Affiliation(s)
- Byung-Chul Lee
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea
|
46
|
Ahola V, Aittokallio T, Vihinen M, Uusipaikka E. Model-based prediction of sequence alignment quality. Bioinformatics 2008; 24:2165-71. [DOI: 10.1093/bioinformatics/btn414] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022] Open
|
47
|
Pei J, Tang M, Grishin NV. PROMALS3D web server for accurate multiple protein sequence and structure alignments. Nucleic Acids Res 2008; 36:W30-4. [PMID: 18503087 PMCID: PMC2447800 DOI: 10.1093/nar/gkn322] [Citation(s) in RCA: 133] [Impact Index Per Article: 8.3] [Indexed: 11/25/2022] Open
Abstract
Multiple sequence alignments are essential in computational sequence and structural analysis, with applications in homology detection, structure modeling, function prediction and phylogenetic analysis. We report the PROMALS3D web server for constructing alignments for multiple protein sequences and/or structures using information from available 3D structures, database homologs and predicted secondary structures. PROMALS3D shows higher alignment accuracy than a number of other advanced methods. Input to the PROMALS3D web server can be FASTA format protein sequences, PDB format protein structures and/or user-defined alignment constraints. The output page provides alignments in several formats, including a colored alignment augmented with useful information about sequence grouping, predicted secondary structures and consensus sequences. Intermediate results of sequence and structural database searches are also available. The PROMALS3D web server is available at: http://prodata.swmed.edu/promals3d/.
Affiliation(s)
- Jimin Pei
- Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 6001 Forest Park Road, Dallas, TX 75390-9050, USA
|
48
|
Pei J. Multiple protein sequence alignment. Curr Opin Struct Biol 2008; 18:382-6. [PMID: 18485694 DOI: 10.1016/j.sbi.2008.03.007] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.1] [Received: 01/13/2008] [Accepted: 03/18/2008] [Indexed: 11/16/2022]
Abstract
Multiple sequence alignments are essential in computational analysis of protein sequences and structures, with applications in structure modeling, functional site prediction, phylogenetic analysis and sequence database searching. Constructing accurate multiple alignments for divergent protein sequences remains a difficult computational task, and alignment speed becomes an issue for large sequence datasets. Here, I review methodologies and recent advances in the multiple protein sequence alignment field, with emphasis on the use of additional sequence and structural information to improve alignment quality.
Affiliation(s)
- Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, 5323 Harry Hines Boulevard, Dallas, TX 75390, USA.
|
49
|
Orlowski J, Bujnicki JM. Structural and evolutionary classification of Type II restriction enzymes based on theoretical and experimental analyses. Nucleic Acids Res 2008; 36:3552-69. [PMID: 18456708 PMCID: PMC2441816 DOI: 10.1093/nar/gkn175] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.7] [Indexed: 01/05/2023] Open
Abstract
For a very long time, Type II restriction enzymes (REases) have been a paradigm of ORFans: proteins with no detectable similarity to each other and to any other protein in the database, despite common cellular and biochemical function. Crystallographic analyses published until January 2008 provided high-resolution structures for only 28 of 1637 Type II REase sequences available in the Restriction Enzyme database (REBASE). Among these structures, all but two possess catalytic domains with the common PD-(D/E)XK nuclease fold. Two structures are unrelated to the others: R.BfiI exhibits the phospholipase D (PLD) fold, while R.PabI has a new fold termed 'half-pipe'. Thus far, bioinformatic studies supported by site-directed mutagenesis have extended the number of tentatively assigned REase folds to five (now including also GIY-YIG and HNH folds identified earlier in homing endonucleases) and provided structural predictions for dozens of REase sequences without experimentally solved structures. Here, we present a comprehensive study of all Type II REase sequences available in REBASE together with their homologs detectable in the nonredundant and environmental samples databases at the NCBI. We present the summary and critical evaluation of structural assignments and predictions reported earlier, new classification of all REase sequences into families, domain architecture analysis and new predictions of three-dimensional folds. Among 289 experimentally characterized (not putative) Type II REases, whose apparently full-length sequences are available in REBASE, we assign 199 (69%) to contain the PD-(D/E)XK domain. The HNH domain is the second most common, with 24 (8%) members. When putative REases are taken into account, the fraction of PD-(D/E)XK and HNH folds changes to 48% and 30%, respectively. Fifty-six characterized (and 521 predicted) REases remain unassigned to any of the five REase folds identified so far, and may exhibit new architectures. 
These enzymes are proposed as the most interesting targets for structure determination by high-resolution experimental methods. Our analysis provides the first comprehensive map of sequence-structure relationships among Type II REases and will help to focus the efforts of structural and functional genomics of this large and biotechnologically important class of enzymes.
Affiliation(s)
- Jerzy Orlowski
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, PL-02-109 Warsaw, Poland
|
50
|
Perrodou E, Chica C, Poch O, Gibson TJ, Thompson JD. A new protein linear motif benchmark for multiple sequence alignment software. BMC Bioinformatics 2008; 9:213. [PMID: 18439277 PMCID: PMC2374782 DOI: 10.1186/1471-2105-9-213] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Received: 11/29/2007] [Accepted: 04/25/2008] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: Linear motifs (LMs) are abundant short regulatory sites used for modulating the functions of many eukaryotic proteins. They play important roles in post-translational modification, cell compartment targeting, docking sites for regulatory complex assembly and protein processing and cleavage. Methods for LM detection are now being developed that are strongly dependent on scores for motif conservation in homologous proteins. However, most LMs are found in natively disordered polypeptide segments that evolve rapidly, unhindered by structural constraints on the sequence. These regions of modular proteins are difficult to align using classical multiple sequence alignment programs that are specifically optimised to align the globular domains. As a consequence, poor motif alignment quality is hindering efforts to detect new LMs. RESULTS: We have developed a new benchmark, as part of the BAliBASE suite, designed to assess the ability of standard multiple alignment methods to detect and align LMs. The reference alignments are organised into different test sets representing real alignment problems and contain examples of experimentally verified functional motifs, extracted from the Eukaryotic Linear Motif (ELM) database. The benchmark has been used to evaluate and compare a number of multiple alignment programs. With distantly related proteins, the worst alignment program correctly aligns 48% of LMs compared to 73% for the best program. However, the performance of all the programs is adversely affected by the introduction of other sequences containing false positive motifs. The ranking of the alignment programs based on LM alignment quality is similar to that observed when considering full-length protein alignments; however, little correlation was observed between LM and overall alignment quality for individual alignment test cases. CONCLUSION: We have shown that none of the programs currently available is capable of reliably aligning LMs in distantly related sequences and we have highlighted a number of specific problems. The results of the tests suggest possible ways to improve program accuracy for difficult, divergent sequences.
Affiliation(s)
- Emmanuel Perrodou
- Institut de Génétique et de Biologie Moléculaire et Cellulaire, Department of Structural Biology and Genomics, F-67400 Illkirch, France.
|