1
|
Liu Y, Zhu YH, Song X, Song J, Yu DJ. Why can deep convolutional neural networks improve protein fold recognition? A visual explanation by interpretation. Brief Bioinform 2021; 22:6127449. [PMID: 33537753 DOI: 10.1093/bib/bbab001] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2020] [Revised: 12/20/2020] [Accepted: 01/01/2021] [Indexed: 01/26/2023] Open
Abstract
As an essential task in protein structure and function prediction, protein fold recognition has attracted increasing attention. The majority of the existing machine learning-based protein fold recognition approaches strongly rely on handcrafted features, which depict the characteristics of different protein folds; however, effective feature extraction methods still represent the bottleneck for further performance improvement of protein fold recognition. As a powerful feature extractor, deep convolutional neural network (DCNN) can automatically extract discriminative features for fold recognition without human intervention, which has demonstrated an impressive performance on protein fold recognition. Despite the encouraging progress, DCNN often acts as a black box, and as such, it is challenging for users to understand what really happens in DCNN and why it works well for protein fold recognition. In this study, we explore the intrinsic mechanism of DCNN and explain why it works for protein fold recognition using a visual explanation technique. More specifically, we first trained a VGGNet-based DCNN model, termed VGGNet-FE, which can extract fold-specific features from the predicted protein residue-residue contact map for protein fold recognition. Subsequently, based on the trained VGGNet-FE, we implemented a new contact-assisted predictor, termed VGGfold, for protein fold recognition; we then visualized what features were extracted by each of the convolutional layers in VGGNet-FE using a deconvolution technique. Furthermore, we visualized the high-level semantic information, termed fold-discriminative region, of a predicted contact map from the localization map obtained from the last convolutional layer of VGGNet-FE. It is visually confirmed that VGGNet-FE could effectively extract distinct fold-discriminative regions for different types of protein folds, thereby accounting for the improved performance of VGGfold for protein fold recognition. In summary, this study is of great significance for both understanding the working principle of DCNNs in protein fold recognition and exploring the relationship between the predicted protein contact map and protein tertiary structure. This proposed visualization method is flexible and applicable to address other DCNN-based bioinformatics and computational biology questions. The online web server of VGGfold is freely available at http://csbio.njust.edu.cn/bioinf/vggfold/.
Collapse
Affiliation(s)
- Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Yi-Heng Zhu
- Department of Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Wuxi, 214122, China
| | - Xiaoning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Jiangning Song
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| |
Collapse
|
2
|
Benítez-Hidalgo A, Nebro AJ, Aldana-Montes JF. Sequoya: multiobjective multiple sequence alignment in Python. Bioinformatics 2020; 36:3892-3893. [PMID: 32315391 DOI: 10.1093/bioinformatics/btaa257] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2019] [Revised: 03/10/2020] [Accepted: 04/15/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Multiple sequence alignment (MSA) consists of finding the optimal alignment of three or more biological sequences to identify highly conserved regions that may be the result of similarities and relationships between the sequences. MSA is an optimization problem with NP-hard complexity (non-deterministic polynomial-time hardness), because the time needed to find optimal alignments raises exponentially along with the number of sequences and their length. Furthermore, the problem becomes multiobjective when more than one score is considered to assess the quality of an alignment, such as maximizing the percentage of totally conserved columns and minimizing the number of gaps. Our motivation is to provide a Python tool for solving MSA problems using evolutionary algorithms, a nonexact stochastic optimization approach that has proven to be effective to solve multiobjective problems. RESULTS The software tool we have developed, called Sequoya, is written in the Python programming language, which offers a broad set of libraries for data analysis, visualization and parallelism. Thus, Sequoya offers a graphical tool to visualize the progress of the optimization in real time, the ability to guide the search toward a preferred region in run-time, parallel support to distribute the computation among nodes in a distributed computing system, and a graphical component to assist in the analysis of the solutions found at the end of the optimization. AVAILABILITY AND IMPLEMENTATION Sequoya can be freely obtained from the Python Package Index (pip) or, alternatively, it can be downloaded from Github at https://github.com/benhid/Sequoya. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Antonio Benítez-Hidalgo
- Departamento de Lenguajes y Ciencias de la Computación.,ITIS Software, Universidad de Málaga, Málaga 29071, Spain.,Instituto de Investigación Biomédica de Málaga - IBIMA, UMA, Málaga, Spain
| | - Antonio J Nebro
- Departamento de Lenguajes y Ciencias de la Computación.,ITIS Software, Universidad de Málaga, Málaga 29071, Spain.,Instituto de Investigación Biomédica de Málaga - IBIMA, UMA, Málaga, Spain
| | - José F Aldana-Montes
- Departamento de Lenguajes y Ciencias de la Computación.,ITIS Software, Universidad de Málaga, Málaga 29071, Spain.,Instituto de Investigación Biomédica de Málaga - IBIMA, UMA, Málaga, Spain
| |
Collapse
|
3
|
Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Multiobjective characteristic-based framework for very-large multiple sequence alignment. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.06.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
4
|
Danchin A, Ouzounis C, Tokuyasu T, Zucker JD. No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects. Microb Biotechnol 2018; 11:588-605. [PMID: 29806194 PMCID: PMC6011933 DOI: 10.1111/1751-7915.13284] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Science and engineering rely on the accumulation and dissemination of knowledge to make discoveries and create new designs. Discovery-driven genome research rests on knowledge passed on via gene annotations. In response to the deluge of sequencing big data, standard annotation practice employs automated procedures that rely on majority rules. We argue this hinders progress through the generation and propagation of errors, leading investigators into blind alleys. More subtly, this inductive process discourages the discovery of novelty, which remains essential in biological research and reflects the nature of biology itself. Annotation systems, rather than being repositories of facts, should be tools that support multiple modes of inference. By combining deduction, induction and abduction, investigators can generate hypotheses when accurate knowledge is extracted from model databases. A key stance is to depart from 'the sequence tells the structure tells the function' fallacy, placing function first. We illustrate our approach with examples of critical or unexpected pathways, using MicroScope to demonstrate how tools can be implemented following the principles we advocate. We end with a challenge to the reader.
Collapse
Affiliation(s)
- Antoine Danchin
- Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
- School of Biomedical Sciences, Li KaShing Faculty of Medicine, Hong Kong University, 21 Sassoon Road, Pokfulam, Hong Kong
| | - Christos Ouzounis
- Biological Computation and Process Laboratory, Centre for Research and Technology Hellas, Chemical Process and Energy Resources Institute, Thessalonica, 57001, Greece
| | - Taku Tokuyasu
- Shenzhen Institutes of Advanced Technology, Institute of Synthetic Biology, Shenzhen University Town, 1068 Xueyuan Avenue, Shenzhen, China
| | - Jean-Daniel Zucker
- Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
| |
Collapse
|
5
|
Zambrano-Vega C, Nebro AJ, García-Nieto J, Aldana-Montes JF. M2Align: parallel multiple sequence alignment with a multi-objective metaheuristic. Bioinformatics 2018; 33:3011-3017. [PMID: 28541404 DOI: 10.1093/bioinformatics/btx338] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2016] [Accepted: 05/20/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation Multiple sequence alignment (MSA) is an NP-complete optimization problem found in computational biology, where the time complexity of finding an optimal alignment raises exponentially along with the number of sequences and their lengths. Additionally, to assess the quality of a MSA, a number of objectives can be taken into account, such as maximizing the sum-of-pairs, maximizing the totally conserved columns, minimizing the number of gaps, or maximizing structural information based scores such as STRIKE. An approach to deal with MSA problems is to use multi-objective metaheuristics, which are non-exact stochastic optimization methods that can produce high quality solutions to complex problems having two or more objectives to be optimized at the same time. Our motivation is to provide a multi-objective metaheuristic for MSA that can run in parallel taking advantage of multi-core-based computers. Results The software tool we propose, called M2Align (Multi-objective Multiple Sequence Alignment), is a parallel and more efficient version of the three-objective optimizer for sequence alignments MO-SAStrE, able of reducing the algorithm computing time by exploiting the computing capabilities of common multi-core CPU clusters. Our performance evaluation over datasets of the benchmark BAliBASE (v3.0) shows that significant time reductions can be achieved by using up to 20 cores. Even in sequential executions, M2Align is faster than MO-SAStrE, thanks to the encoding method used for the alignments. Availability and implementation M2Align is an open source project hosted in GitHub, where the source code and sample datasets can be freely obtained: https://github.com/KhaosResearch/M2Align. Contact antonio@lcc.uma.es. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Cristian Zambrano-Vega
- Facultad de Ciencias de la Ingeniería, Universidad Técnica Estatal de Quevedo, Quevedo, Ecuador
| | - Antonio J Nebro
- Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
| | - José García-Nieto
- Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
| | - José F Aldana-Montes
- Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, 29071 Málaga, Spain
| |
Collapse
|
6
|
Rubio-Largo Á, Castelli M, Vanneschi L, Vega-Rodríguez MA. A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment. J Comput Biol 2018; 25:1009-1022. [PMID: 29671616 DOI: 10.1089/cmb.2018.0031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The alignment among three or more nucleotides/amino acids sequences at the same time is known as multiple sequence alignment (MSA), a nondeterministic polynomial time (NP)-hard optimization problem. The time complexity of finding an optimal alignment raises exponentially when the number of sequences to align increases. In this work, we deal with a multiobjective version of the MSA problem wherein the goal is to simultaneously optimize the accuracy and conservation of the alignment. A parallel version of the hybrid multiobjective memetic metaheuristics for MSA is proposed. To evaluate the parallel performance of our proposal, we have selected a pull of data sets with different number of sequences (up to 1000 sequences) and study its parallel performance against other well-known parallel metaheuristics published in the literature, such as MSAProbs, tree-based consistency objective function for alignment evaluation (T-Coffee), Clustal [Formula: see text], and multiple alignment using fast Fourier transform (MAFFT). The comparative study reveals that our parallel aligner obtains better results than MSAProbs, T-Coffee, Clustal [Formula: see text], and MAFFT. In addition, the parallel version is around 25 times faster than the sequential version with 32 cores, obtaining an efficiency around 80%.
Collapse
Affiliation(s)
| | - Mauro Castelli
- 1 NOVA IMS, Universidade Nova de Lisboa , Lisbon, Portugal
| | | | - Miguel A Vega-Rodríguez
- 2 Department of Computer and Communications Technologies, University of Extremadura , Caceres, Spain
| |
Collapse
|
7
|
Rubio-Largo A, Vanneschi L, Castelli M, Vega-Rodriguez MA. A Characteristic-Based Framework for Multiple Sequence Aligners. IEEE TRANSACTIONS ON CYBERNETICS 2018; 48:41-51. [PMID: 27831898 DOI: 10.1109/tcyb.2016.2621129] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
The multiple sequence alignment is a well-known bioinformatics problem that consists in the alignment of three or more biological sequences (protein or nucleic acid). In the literature, a number of tools have been proposed for dealing with this biological sequence alignment problem, such as progressive methods, consistency-based methods, or iterative methods; among others. These aligners often use a default parameter configuration for all the input sequences to align. However, the default configuration is not always the best choice, the alignment accuracy of the tool may be highly boosted if specific parameter configurations are used, depending on the biological characteristics of the input sequences. In this paper, we propose a characteristic-based framework for multiple sequence aligners. The idea of the framework is, given an input set of unaligned sequences, extract its characteristics and run the aligner with the best parameter configuration found for another set of unaligned sequences with similar characteristics. In order to test the framework, we have used the well-known multiple sequence comparison by log-expectation (MUSCLE) v3.8 aligner with different benchmarks, such as benchmark alignments database v3.0, protein reference alignment benchmark v4.0, and sequence alignment benchmark v1.65. The results shown that the alignment accuracy and conservation of MUSCLE might be greatly improved with the proposed framework, specially in those scenarios with a low percentage of identity. The characteristic-based framework for multiple sequence aligners is freely available for downloading at http://arco.unex.es/arl/fwk-msa/cbf-msa.zip.
Collapse
|
8
|
Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Using biological knowledge for multiple sequence aligner decision making. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2017.08.069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
9
|
Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Reducing Alignment Time Complexity of Ultra-Large Sets of Sequences. J Comput Biol 2017; 24:1144-1154. [PMID: 28686466 DOI: 10.1089/cmb.2017.0097] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The alignment of three or more protein or nucleotide sequences is known as Multiple Sequence Alignment problem. The complexity of this problem increases exponentially with the number of sequences; therefore, many of the current approaches published in the literature suffer a computational overhead when thousands of sequences are required to be aligned. We introduce a new approach for dealing with ultra-large sets of sequences. A two-level clustering method is considered. The first level clusters the input sequences by using their biological composition, that is, the number of positive, negative, polar, special, and hydrophobic amino acids. In the second level, each cluster is divided into different clusters according to their similarity. Then, each cluster is aligned by using any method/aligner. After aligning the centroid sequences of each second-level cluster, we extrapolate the new gaps to each cluster of sequences to obtain the final alignment. We present a study on biological data with up to ∼100,000 sequences, showing that the proposed approach is able to obtain accurate alignments in a reduced amount of time; for example, in >10,000 sequences datasets, it is able to reduce up to ∼45 times the required runtime of the well-known Kalign.
Collapse
Affiliation(s)
- Álvaro Rubio-Largo
- 1 Nova Information Management School-NOVA IMS , Universidade Nova de Lisboa, Lisboa, Portugal
| | - Leonardo Vanneschi
- 1 Nova Information Management School-NOVA IMS , Universidade Nova de Lisboa, Lisboa, Portugal
| | - Mauro Castelli
- 1 Nova Information Management School-NOVA IMS , Universidade Nova de Lisboa, Lisboa, Portugal
| | - Miguel A Vega-Rodríguez
- 2 Department of Technologies of Computers and Communications, University of Extremadura , Cáceres, Spain
| |
Collapse
|
10
|
Song YJ, Cho DH. Classification of various genomic sequences based on distribution of repeated k-word. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2017:3894-3897. [PMID: 29060748 DOI: 10.1109/embc.2017.8037707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
In order to extract phylogenetic information from DNA sequences, alignment-free methods and alignment-based methods are used. Alignment-based methods have high complexity and conventional alignment-free methods have low accuracy. In this paper, a new alignment-free method based on the distribution of repeated k-word measure is proposed. This novel measure is based on k-words and its multiple repeated words. We can get higher performance than conventional word count methods in case of using proposed scheme while maintaining total time complexity. The proposed measure shows better performance compared to conventional alignment-free methods with respect to RF distance.
Collapse
|
11
|
Zambrano-Vega C, Nebro AJ, García-Nieto J, Aldana-Montes JF. Comparing multi-objective metaheuristics for solving a three-objective formulation of multiple sequence alignment. PROGRESS IN ARTIFICIAL INTELLIGENCE 2017. [DOI: 10.1007/s13748-017-0116-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
12
|
Lelieveld SH, Schütte J, Dijkstra MJJ, Bawono P, Kinston SJ, Göttgens B, Heringa J, Bonzanni N. ConBind: motif-aware cross-species alignment for the identification of functional transcription factor binding sites. Nucleic Acids Res 2016; 44:e72. [PMID: 26721389 PMCID: PMC4856970 DOI: 10.1093/nar/gkv1518] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2014] [Revised: 12/15/2015] [Accepted: 12/16/2015] [Indexed: 12/23/2022] Open
Abstract
Eukaryotic gene expression is regulated by transcription factors (TFs) binding to promoter as well as distal enhancers. TFs recognize short, but specific binding sites (TFBSs) that are located within the promoter and enhancer regions. Functionally relevant TFBSs are often highly conserved during evolution leaving a strong phylogenetic signal. While multiple sequence alignment (MSA) is a potent tool to detect the phylogenetic signal, the current MSA implementations are optimized to align the maximum number of identical nucleotides. This approach might result in the omission of conserved motifs that contain interchangeable nucleotides such as the ETS motif (IUPAC code: GGAW). Here, we introduce ConBind, a novel method to enhance alignment of short motifs, even if their mutual sequence similarity is only partial. ConBind improves the identification of conserved TFBSs by improving the alignment accuracy of TFBS families within orthologous DNA sequences. Functional validation of the Gfi1b + 13 enhancer reveals that ConBind identifies additional functionally important ETS binding sites that were missed by all other tested alignment tools. In addition to the analysis of known regulatory regions, our web tool is useful for the analysis of TFBSs on so far unknown DNA regions identified through ChIP-sequencing.
Collapse
Affiliation(s)
- Stefan H Lelieveld
- Centre for Integrative Bioinformatics VU, VU University Amsterdam, Amsterdam 1081 HV, The Netherlands Department of Human Genetics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen 6525 GA, The Netherlands
| | - Judith Schütte
- Department of Haematology, Wellcome Trust-MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Cambridge University, Cambridge CB2 0XY, UK Klinik für Hämatologie, Universitätsklinik Essen 45147, Germany
| | - Maurits J J Dijkstra
- Centre for Integrative Bioinformatics VU, VU University Amsterdam, Amsterdam 1081 HV, The Netherlands
| | - Punto Bawono
- Centre for Integrative Bioinformatics VU, VU University Amsterdam, Amsterdam 1081 HV, The Netherlands
| | - Sarah J Kinston
- Department of Haematology, Wellcome Trust-MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Cambridge University, Cambridge CB2 0XY, UK
| | - Berthold Göttgens
- Department of Haematology, Wellcome Trust-MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Cambridge University, Cambridge CB2 0XY, UK
| | - Jaap Heringa
- Centre for Integrative Bioinformatics VU, VU University Amsterdam, Amsterdam 1081 HV, The Netherlands
| | - Nicola Bonzanni
- Centre for Integrative Bioinformatics VU, VU University Amsterdam, Amsterdam 1081 HV, The Netherlands Computational Cancer Biology Group, Division of Molecular Carcinogenesis, The Netherlands Cancer Institute, Amsterdam 1066 CX, The Netherlands ENPICOM, Eindhoven 5632 CW, The Netherlands
| |
Collapse
|
13
|
Rubio-Largo Á, Vega-Rodríguez MA, González-Álvarez DL. Hybrid multiobjective artificial bee colony for multiple sequence alignment. Appl Soft Comput 2016. [DOI: 10.1016/j.asoc.2015.12.034] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
14
|
Bawono P, van der Velde A, Abeln S, Heringa J. Quantifying the displacement of mismatches in multiple sequence alignment benchmarks. PLoS One 2015; 10:e0127431. [PMID: 25993129 PMCID: PMC4438059 DOI: 10.1371/journal.pone.0127431] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2014] [Accepted: 04/14/2015] [Indexed: 11/18/2022] Open
Abstract
Multiple Sequence Alignment (MSA) methods are typically benchmarked on sets of reference alignments. The quality of the alignment can then be represented by the sum-of-pairs (SP) or column (CS) scores, which measure the agreement between a reference and corresponding query alignment. Both the SP and CS scores treat mismatches between a query and reference alignment as equally bad, and do not take the separation into account between two amino acids in the query alignment, that should have been matched according to the reference alignment. This is significant since the magnitude of alignment shifts is often of relevance in biological analyses, including homology modeling and MSA refinement/manual alignment editing. In this study we develop a new alignment benchmark scoring scheme, SPdist, that takes the degree of discordance of mismatches into account by measuring the sequence distance between mismatched residue pairs in the query alignment. Using this new score along with the standard SP score, we investigate the discriminatory behavior of the new score by assessing how well six different MSA methods perform with respect to BAliBASE reference alignments. The SP score and the SPdist score yield very similar outcomes when the reference and query alignments are close. However, for more divergent reference alignments the SPdist score is able to distinguish between methods that keep alignments approximately close to the reference and those exhibiting larger shifts. We observed that by using SPdist together with SP scoring we were able to better delineate the alignment quality difference between alternative MSA methods. With a case study we exemplify why it is important, from a biological perspective, to consider the separation of mismatches. The SPdist scoring scheme has been implemented in the VerAlign web server (http://www.ibi.vu.nl/programs/veralignwww/). The code for calculating SPdist score is also available upon request.
Collapse
Affiliation(s)
- Punto Bawono
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
- * E-mail: (PB); (JH)
| | - Arjan van der Velde
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
| | - Sanne Abeln
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
| | - Jaap Heringa
- Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands
- Amsterdam Institute for Molecules Medicines and Systems (AIMMS), VU University Amsterdam, Amsterdam, The Netherlands
- * E-mail: (PB); (JH)
| |
Collapse
|
15
|
Abboud A, Williams VV, Weimann O. Consequences of Faster Alignment of Sequences. AUTOMATA, LANGUAGES, AND PROGRAMMING 2014. [DOI: 10.1007/978-3-662-43948-7_4] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
16
|
Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns. Bioinformatics 2013; 29:2112-21. [DOI: 10.1093/bioinformatics/btt360] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
|
17
|
Lee JH, Yi H, Jeon YS, Won S, Chun J. TBC: a clustering algorithm based on prokaryotic taxonomy. J Microbiol 2012; 50:181-5. [PMID: 22538644 DOI: 10.1007/s12275-012-1214-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2011] [Accepted: 11/07/2011] [Indexed: 12/27/2022]
Abstract
High-throughput DNA sequencing technologies have revolutionized the study of microbial ecology. Massive sequencing of PCR amplicons of the 16S rRNA gene has been widely used to understand the microbial community structure of a variety of environmental samples. The resulting sequencing reads are clustered into operational taxonomic units that are then used to calculate various statistical indices that represent the degree of species diversity in a given sample. Several algorithms have been developed to perform this task, but they tend to produce different outcomes. Herein, we propose a novel sequence clustering algorithm, namely Taxonomy-Based Clustering (TBC). This algorithm incorporates the basic concept of prokaryotic taxonomy in which only comparisons to the type strain are made and used to form species while omitting full-scale multiple sequence alignment. The clustering quality of the proposed method was compared with those of MOTHUR, BLASTClust, ESPRIT-Tree, CD-HIT, and UCLUST. A comprehensive comparison using three different experimental datasets produced by pyrosequencing demonstrated that the clustering obtained using TBC is comparable to those obtained using MOTHUR and ESPRIT-Tree and is computationally efficient. The program was written in JAVA and is available from http://sw.ezbiocloud.net/tbc.
Collapse
Affiliation(s)
- Jae-Hak Lee
- Interdisciplinary Graduate Program in Bioinformatics, Seoul National University, Seoul, 151-742, Republic of Korea
| | | | | | | | | |
Collapse
|
18
|
Jagadeesh Chandra Bose R, van der Aalst WM. Process diagnostics using trace alignment: Opportunities, issues, and challenges. INFORM SYST 2012. [DOI: 10.1016/j.is.2011.08.003] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
|
19
|
Ye X, Wang G, Altschul SF. An assessment of substitution scores for protein profile-profile comparison. Bioinformatics 2011; 27:3356-63. [PMID: 21998158 PMCID: PMC3232366 DOI: 10.1093/bioinformatics/btr565] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2011] [Revised: 09/22/2011] [Accepted: 10/06/2011] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Pairwise protein sequence alignments are generally evaluated using scores defined as the sum of substitution scores for aligning amino acids to one another, and gap scores for aligning runs of amino acids in one sequence to null characters inserted into the other. Protein profiles may be abstracted from multiple alignments of protein sequences, and substitution and gap scores have been generalized to the alignment of such profiles either to single sequences or to other profiles. Although there is widespread agreement on the general form substitution scores should take for profile-sequence alignment, little consensus has been reached on how best to construct profile-profile substitution scores, and a large number of these scoring systems have been proposed. Here, we assess a variety of such substitution scores. For this evaluation, given a gold standard set of multiple alignments, we calculate the probability that a profile column yields a higher substitution score when aligned to a related than to an unrelated column. We also generalize this measure to sets of two or three adjacent columns. This simple approach has the advantages that it does not depend primarily upon the gold-standard alignment columns with the weakest empirical support, and that it does not need to fit gap and offset costs for use with each substitution score studied. RESULTS A simple symmetrization of mean profile-sequence scores usually performed the best. These were followed closely by several specific scoring systems constructed using a variety of rationales. CONTACT altschul@ncbi.nlm.nih.gov SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xugang Ye
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | | | | |
Collapse
|
20
|
Ye X, Yu YK, Altschul SF. Compositional adjustment of Dirichlet mixture priors. J Comput Biol 2011; 17:1607-20. [PMID: 21128852 DOI: 10.1089/cmb.2010.0117] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Dirichlet mixture priors provide a Bayesian formalism for scoring alignments of protein profiles to individual sequences, which can be generalized to constructing scores for multiple-alignment columns. A Dirichlet mixture is a probability distribution over multinomial space, each of whose components can be thought of as modeling a type of protein position. Applied to the simplest case of pairwise sequence alignment, a Dirichlet mixture is equivalent to an implied symmetric substitution matrix. For alphabets of even size L, Dirichlet mixtures with L/2 components and symmetric substitution matrices have an identical number of free parameters. Although this suggests the possibility of a one-to-one mapping between the two formalisms, we show that there are some symmetric matrices no Dirichlet mixture can imply, and others implied by many distinct Dirichlet mixtures. Dirichlet mixtures are derived empirically from curated sets of multiple alignments. They imply "background" amino acid frequencies characteristic of these sets, and should thus be non-optimal for comparing proteins with non-standard composition. Given a mixture Θ, we seek an adjusted Θ' that implies the desired composition, but that minimizes an appropriate relative-entropy-based distance function. To render the problem tractable, we fix the mixture parameter as well as the sum of the Dirichlet parameters for each component, allowing only its center of mass to vary. This linearizes the constraints on the remaining parameters. An approach to finding Θ' may be based on small consecutive parameter adjustments. The relative entropy of two Dirichlet distributions separated by a small change in their parameter values implies a quadratic cost function for such changes. For a small change in implied background frequencies, this function can be minimized using the Lagrange-Newton method. We have implemented this method, and can compositionally adjust to good precision a 20-component Dirichlet mixture prior for proteins in under half a second on a standard workstation.
Collapse
Affiliation(s)
- Xugang Ye
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
| | | | | |
Collapse
|
21
|
Altschul SF, Wootton JC, Zaslavsky E, Yu YK. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 2010; 6:e1000852. [PMID: 20657661 PMCID: PMC2904766 DOI: 10.1371/journal.pcbi.1000852] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2009] [Accepted: 06/03/2010] [Indexed: 01/18/2023] Open
Abstract
Most pairwise and multiple sequence alignment programs seek alignments with optimal scores. Central to defining such scores is selecting a set of substitution scores for aligned amino acids or nucleotides. For local pairwise alignment, substitution scores are implicitly of log-odds form. We now extend the log-odds formalism to multiple alignments, using Bayesian methods to construct "BILD" ("Bayesian Integral Log-odds") substitution scores from prior distributions describing columns of related letters. This approach has been used previously only to define scores for aligning individual sequences to sequence profiles, but it has much broader applicability. We describe how to calculate BILD scores efficiently, and illustrate their uses in Gibbs sampling optimization procedures, gapped alignment, and the construction of hidden Markov model profiles. BILD scores enable automated selection of optimal motif and domain model widths, and can inform the decision of whether to include a sequence in a multiple alignment, and the selection of insertion and deletion locations. Other applications include the classification of related sequences into subfamilies, and the definition of profile-profile alignment scores. Although a fully realized multiple alignment program must rely upon more than substitution scores, many existing multiple alignment programs can be modified to employ BILD scores. We illustrate how simple BILD score based strategies can enhance the recognition of DNA binding domains, including the Api-AP2 domain in Toxoplasma gondii and Plasmodium falciparum.
Collapse
Affiliation(s)
- Stephen F Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America.
| | | | | | | |
Collapse
|
22
|
|
23
|
|
24
|
Abstract
An overview of data mining (DM) and its application to the analysis of DM and electroencephalography (EEG) is given by: (i) presenting a working definition of DM, (ii) motivating why EEG analysis is a challenging field of application for DM technology and (iii) by reviewing exemplary work on DM applied to EEG analysis. The current status of work on DM and EEG is discussed and some general conclusions are drawn.
Collapse
Affiliation(s)
- A Flexer
- Austrian Research Institute for Artificial Intelligence, Vienna, Austria.
| |
Collapse
|
25
|
Henikoff S. Comparative methods for identifying functional domains in protein sequences. BIOTECHNOLOGY ANNUAL REVIEW 1998; 1:129-47. [PMID: 9704087 DOI: 10.1016/s1387-2656(08)70050-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
Abstract
This chapter reviews the different approaches that have been applied to the problem of protein motif identification. Several methods are available for finding motifs within protein families, for clustering databases to identify family relationships and for searching databases that consist of motif representations. With the rapid expansion of sequence databases, which currently appear to represent most protein families, these methods are becoming increasingly important for interpretation of molecular sequence information.
Collapse
Affiliation(s)
- S Henikoff
- Howard Hughes Medical Institute, Fred Hutchinson Cancer Research Center, Seattle, Washington 98104, USA
| |
Collapse
|
26
|
Lawrence C, Reilly A. Likelihood Inference for Permuted Data with Application to Gene Regulation. J Am Stat Assoc 1996. [DOI: 10.1080/01621459.1996.10476665] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
27
|
Prendergast JA, Ptak C, Kornitzer D, Steussy CN, Hodgins R, Goebl M, Ellison MJ. Identification of a positive regulator of the cell cycle ubiquitin-conjugating enzyme Cdc34 (Ubc3). Mol Cell Biol 1996; 16:677-84. [PMID: 8552096 PMCID: PMC231047 DOI: 10.1128/mcb.16.2.677] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023] Open
Abstract
The Cdc34 (Ubc3) ubiquitin-conjugating enzyme from Saccharomyces cerevisiae plays an essential role in the progression of cells from the G1 to S phase of the cell division cycle. Using a high-copy suppression strategy, we have identified a yeast gene (UBS1) whose elevated expression suppresses the conditional cell cycle defects associated with cdc34 mutations. The UBS1 gene encodes a 32.2-kDa protein of previously unknown function and is identical in sequence to a genomic open reading frame on chromosome II (GenBank accession number Z36034). Several lines of evidence described here indicate that Ubs1 functions as a general positive regulator of Cdc34 activity. First, overexpression of UBS1 suppresses not only the cell proliferation and morphological defects associated with cdc34 mutants but also the inability of cdc34 mutant cells to degrade the general amino acid biosynthesis transcriptional regulator, Gcn4. Second, deletion of the UBS1 gene profoundly accentuates the cell cycle defect when placed in combination with a cdc34 temperature-sensitive allele. Finally, a comparison of the Ubs1 and Cdc34 polypeptide sequences reveals two noncontiguous regions of similarity, which, when projected onto the three-dimensional structure of a ubiquitin-conjugating enzyme, define a single region situated on its surface. While cdc34 mutations corresponding to substitutions outside this region are suppressed by UBS1 overexpression, Ubs1 fails to suppress amino acid substitutions made within this region. Taken together with other findings, the allele specificity exhibited by UBS1 expression suggests that Ubs1 regulates Cdc34 by interaction or modification.
Collapse
Affiliation(s)
- J A Prendergast
- Department of Biochemistry, University of Alberta, Edmonton, Canada
| | | | | | | | | | | | | |
Collapse
|
28
|
Hanke J, Beckmann G, Bork P, Reich JG. Self-organizing hierarchic networks for pattern recognition in protein sequence. Protein Sci 1996; 5:72-82. [PMID: 8771198 PMCID: PMC2143234 DOI: 10.1002/pro.5560050109] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023]
Abstract
We present a method based on hierarchical self-organizing maps (SOMs) for recognizing patterns in protein sequences. The method is fully automatic, does not require prealigned sequences, is insensitive to redundancy in the training set, and works surprisingly well even with small learning sets. Because it uses unsupervised neural networks, it is able to extract patterns that are not present in all of the unaligned sequences of the learning set. The identification of these patterns in sequence databases is sensitive and efficient. The procedure comprises three main training stages. In the first stage, one SOM is trained to extract common features from the set of unaligned learning sequences. A feature is a number of ungapped sequence segments (usually 4-16 residues long) that are similar to segments in most of the sequences of the learning set according to an initial similarity matrix. In the second training stage, the recognition of each individual feature is refined by selecting an optimal weighting matrix out of a variety of existing amino acid similarity matrices. In a third stage of the SOM procedure, the position of the features in the individual sequences is learned. This allows for variants with feature repeats and feature shuffling. The procedure has been successfully applied to a number of notoriously difficult cases with distinct recognition problems: helix-turn-helix motifs in DNA-binding proteins, the CUB domain of developmentally regulated proteins, and the superfamily of ribokinases. A comparison with the established database search procedure PROFILE (and with several others) led to the conclusion that the new automatic method performs satisfactorily.
Collapse
Affiliation(s)
- J Hanke
- Max-Delbrück-Center for Molecular Medicine, Department of Bioinformatics, Berlin-Buch, Germany.
| | | | | | | |
Collapse
|
29
|
Abstract
We have tested CLUSTAL W in a wide variety of situations, and it is capable of handling some very difficult protein alignment problems. If the data set consists of enough closely related sequences so that the first alignments are accurate, then CLUSTAL W will usually find an alignment that is very close to ideal. Problems can still occur if the data set includes sequences of greatly different lengths or if some sequences include long regions that are impossible to align with the rest of the data set. Trying to balance the need for long insertions and deletions in some alignments with the need to avoid them in others is still a problem. The default values for our parameters were tested empirically using test cases of sets of globular proteins where some information as to the correct alignment was available. The parameter values may not be very appropriate with nonglobular proteins. We have argued that using one weight matrix and two gap penalties is too simplistic to be of general use in the most difficult cases. We have replaced these parameters with a large number of new parameters designed primarily to help encourage gaps in loop regions. Although these new parameters are largely heuristic in nature, they perform surprisingly well and are simple to implement. The underlying speed of the progressive alignment approach is not adversely affected. The disadvantage is that the parameter space is now huge; the number of possible combinations of parameters is more than can easily be examined by hand. We justify this by asking the user to treat CLUSTAL W as a data exploration tool rather than as a definitive analysis method. It is not sensible to automatically derive multiple alignments and to trust particular algorithms as being capable of always getting the correct answer. One must examine the alignments closely, especially in conjunction with the underlying phylogenetic tree (or estimate of it) and try varying some of the parameters. Outliers (sequences that have no close relatives) should be aligned carefully, as should fragments of sequences. The program will automatically delay the alignment of any sequences that are less than 40% identical to any others until all other sequences are aligned, but this can be set from a menu by the user. It may be useful to build up an alignment of closely related sequences first and to then add in the more distant relatives one at a time or in batches, using the profile alignments and weighting scheme described earlier and perhaps using a variety of parameter settings. We give one example using SH2 domains. SH2 domains are widespread in eukaryotic signalling proteins where they function in the recognition of phosphotyrosine-containing peptides. In the chapter by Bork and Gibson ([11], this volume), Blast and pattern/profile searches were used to extract the set of known SH2 domains and to search for new members. (Profiles used in database searches are conceptually very similar to the profiles used in CLUSTAL W: see the chapters [11] and [13] for profile search methods.) The profile searches detected SH2 domains in the JAK family of protein tyrosine kinases, which were thought not to contain SH2 domains. Although the JAK family SH2 domains are rather divergent, they have the necessary core structural residues as well as the critical positively charged residue that binds phosphotyrosine, leaving no doubt that they are bona fide SH2 domains. The five new JAK family SH2 domains were added sequentially to the existing alignment of 65 SH2 domains using the CLUSTAL W profile alignment option. Figure 6 shows part of the resulting alignment. Despite their divergent sequences, the new SH2 domains have been aligned nearly perfectly with the old set. No insertions were placed in the original SH2 domains. In this example, the profile alignment procedure has produced better results than a one-step full alignment of all 70 SH2 domains, and in considerably less time. (ABSTRACT TRUNCATED)
Collapse
Affiliation(s)
- D G Higgins
- European Molecular Biology Laboratory Outstation-European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
| | | | | |
Collapse
|
30
|
Smith DK, Treutlein HR, Maurer T, Owczarek CM, Layton MJ, Nicola NA, Norton RS. Homology modelling and 1H NMR studies of human leukaemia inhibitory factor. FEBS Lett 1994; 350:275-80. [PMID: 7520873 DOI: 10.1016/0014-5793(94)00785-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
Abstract
Human leukaemia inhibitory factor (LIF) is a glycoprotein with a diverse range of activities on many cell types. A molecular model of LIF has been constructed based mainly on the structure of the related cytokine granulocyte colony-stimulating factor, and refined using simulated annealing and molecular dynamics in water. The model was stable during molecular dynamics refinement and is consistent with known stereochemical data on proteins. It has been assessed by comparison with 1H NMR data on the ionization behaviour of the six histidine residues in LIF, the imidazolium pKa values of which range from 3.6 to 7.4. These pKa values were assigned to individual histidine residues from NMR studies on a series of His-->Ala mutants. The environments of the histidine residues in the model account very well for their observed ionization behaviour. Furthermore, the model is consistent with mutagenesis studies which have defined a group of amino acid residues involved in receptor binding.
Collapse
Affiliation(s)
- D K Smith
- NMR Laboratory, Biomolecular Research Institute, Parkville, Australia
| | | | | | | | | | | | | |
Collapse
|
31
|
Wang JT, Marr TG, Shasha D, Shapiro BA, Chirn GW. Discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Res 1994; 22:2769-75. [PMID: 8052532 PMCID: PMC308246 DOI: 10.1093/nar/22.14.2769] [Citation(s) in RCA: 28] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open
Abstract
We describe a method for discovering active motifs in a set of related protein sequences. The method is an automatic two step process: (1) find candidate motifs in a small sample of the sequences; (2) test whether these motifs are approximately present in all the sequences. To reduce the running time, we develop two optimization heuristics based on statistical estimation and pattern matching techniques. Experimental results obtained by running these algorithms on generated data and functionally related proteins demonstrate the good performance of the presented method compared with visual method of O'Farrell and Leopold. By combining the discovered motifs with an existing fingerprint technique, we develop a protein classifier. When we apply the classifier to the 698 groups of related proteins in the PROSITE catalog, it gives information that is complementary to the BLOCKS protein classifier of Henikoff and Henikoff. Thus, using our classifier in conjunction with theirs, one can obtain high confidence classifications (if BLOCKS and our classifier agree) or suggest a new hypothesis (if the two disagree).
Collapse
Affiliation(s)
- J T Wang
- Department of Computer and Information Science, New Jersey Institute of Technology, Newark 07102
| | | | | | | | | |
Collapse
|
32
|
Abstract
As the explosive growth of sequence databases continues, both the content of these databases and the best strategies for exploiting them are changing. Remarkable improvements in the performance of microcomputers have put very sophisticated tools within the reach of all investigators. New scoring schemes have improved the sensitivity of searching regimens and the accuracy of alignments.
Collapse
Affiliation(s)
- R F Doolittle
- Department of Chemistry, University of California, San Diego, La Jolla 92093-0634
| |
Collapse
|
33
|
Modelling of peptide and protein structures. Amino Acids 1994; 7:175-202. [DOI: 10.1007/bf00814159] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/1993] [Accepted: 08/12/1993] [Indexed: 10/26/2022]
|
34
|
Abstract
We study the computational complexity of two popular problems in multiple sequence alignment: multiple alignment with SP-score and multiple tree alignment. It is shown that the first problem is NP-complete and the second is MAX SNP-hard. The complexity of tree alignment with a given phylogeny is also considered.
Collapse
Affiliation(s)
- L Wang
- Department of Electrical and Computer Engineering, McMaster University, Hamilton, Ontario, Canada
| | | |
Collapse
|
35
|
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993; 262:208-14. [PMID: 8211139 DOI: 10.1126/science.8211139] [Citation(s) in RCA: 1214] [Impact Index Per Article: 39.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]
Abstract
A wealth of protein and DNA sequence data is being generated by genome projects and other sequencing efforts. A crucial barrier to deciphering these sequences and understanding the relations among them is the difficulty of detecting subtle local residue patterns common to multiple sequences. Such patterns frequently reflect similar molecular structures and biological properties. A mathematical definition of this "local multiple alignment" problem suitable for full computer automation has been used to develop a new and sensitive algorithm, based on the statistical method of iterative sampling. This algorithm finds an optimized local alignment model for N sequences in N-linear time, requiring only seconds on current workstations, and allows the simultaneous detection and optimization of multiple patterns and pattern repeats. The method is illustrated as applied to helix-turn-helix proteins, lipocalins, and prenyltransferases.
Collapse
Affiliation(s)
- C E Lawrence
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
| | | | | | | | | | | |
Collapse
|
36
|
Abstract
This article presents a new method for the comparison of multiple macromolecular sequences. It is based on a hierarchical sequence synthesis procedure that does not require any a priori knowledge of the molecular structure of the sequences or the phylogenetic relations among the sequences. It differs from the existing methods as it has the capability of: (i) generating a statistical-structural model of the sequences through a synthesis process that detects homologous groups of the sequences, and (ii) aligning the sequences while the taxonomic tree of the sequences is being constructed in one single phase. It produces superior results when compared with some existing methods.
Collapse
Affiliation(s)
- A K Wong
- Department of Systems Design Engineering, University of Waterloo, Canada
| | | | | |
Collapse
|
37
|
Brodsky LI, Drachev AL, Leontovich AM, Feranchuk SI. A novel method of multiple alignment of biopolymer sequences. Biosystems 1993; 30:65-79. [PMID: 8374082 DOI: 10.1016/0303-2647(93)90063-i] [Citation(s) in RCA: 23] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
A novel algorithm for multiple alignment of biological sequences is suggested. At the first step the DotHelix procedure is employed for construction of motifs, i.e. continuous fragments of local similarity of various "thickness" and strength, and then these motifs are concatenated into chains consistent with the order of letters in the sequences. The algorithm is implemented in the MA-Tools program of the GeneBee package. An example illustrating the effectivity of the algorithm is presented.
Collapse
|
38
|
Gusfield D. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull Math Biol 1993; 55:141-54. [PMID: 7680269 DOI: 10.1007/bf02460299] [Citation(s) in RCA: 110] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
Multiple string (sequence) alignment is a difficult and important problem in computational biology, where it is central in two related tasks: finding highly conserved subregions or embedded patterns of a set of biological sequences (strings of DNA, RNA or amino acids), and inferring the evolutionary history of a set of taxa from their associated biological sequences. Several precise measures have been proposed for evaluating the goodness of a multiple alignment, but no efficient methods are known which compute the optimal alignment for any of these measures in any but small cases. In this paper, we consider two previously proposed measures, and give two computationaly efficient multiple alignment methods (one for each measure) whose deviation from the optimal value is guaranteed to be less than a factor of two. This is the novel feature of thse methods. but the methods have additional virtues as well. For both methods, the guaranteed bounds are much smaller than two when the number of strings is small (1.33 for three strings of any length); for one of the methods we give a related randomized method which is much faster and which gives, with high probability, multiple alignments with fairly small error bounds; and for the other measure, the method given yields a non-obvious lower bound on the optimal alignment.
Collapse
Affiliation(s)
- D Gusfield
- Computer Science Division, University of California, Davis 95616-8755
| |
Collapse
|
39
|
Bains W. Alignment and comparison of multiple related DNA sequences can be improved by consideration of known patterns of mutation. DNA SEQUENCE : THE JOURNAL OF DNA SEQUENCING AND MAPPING 1993; 3:267-76. [PMID: 8400356 DOI: 10.3109/10425179309020823] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
The rate at which bases are replaced in mammalian DNA is influenced by the sequence involved. I have modified a multiple sequence alignment and comparison program to take account of the observed patterns of base replacement in mammals, and used the program to analyse 100 human Alu sequences. The results show that using a sequence matching matrix based on the actual rate of base replacement, as opposed to arbitrary base matching matrix, gives significant differences in the way that sequences are aligned and compared with each other. Use of the observed matching matrix gives significantly better discrimination between sub-families of Alu sequences than conventional methods, and shows that comparison systems which take account of the mechanism of mutation give more biologically realistic results.
Collapse
Affiliation(s)
- W Bains
- PA Consulting Group, Melbourn, Royston, Herts, UK
| |
Collapse
|
40
|
Abstract
Signature sequences are contiguous patterns of amino acids 10-50 residues long that are associated with a particular structure or function in proteins. These may be of three types (by our nomenclature): superfamily signatures, remnant homologies, and motifs. We have performed a systematic search through a database of protein sequences to automatically and preferentially find remnant homologies and motifs. This was accomplished in three steps: 1. We generated a nonredundant sequence database. 2. We used BLAST3 (Altschul and Lipman, Proc. Natl. Acad. Sci. U.S.A. 87:5509-5513, 1990) to generate local pairwise and triplet sequence alignments for every protein in the database vs. every other. 3. We selected "interesting" alignments and grouped them into clusters. We find that most of the clusters contain segments from proteins which share a common structure or function. Many of them correspond to signatures previously noted in the literature. We discuss three previously recognized motifs in detail (FAD/NAD-binding, ATP/GTP-binding, and cytochrome b5-like domains) to demonstrate how the alignments generated by our procedure are consistent with previous work and make structural and functional sense. We also discuss two signatures (for N-acetyltransferases and glycerol-phosphate binding) which to our knowledge have not been previously recognized.
Collapse
Affiliation(s)
- R P Sheridan
- Medical Research Division, Lederle Laboratories, American Cyanamid Corp., Pearl River, New York 10965
| | | |
Collapse
|
41
|
Beanland TJ, Howe CJ. The inference of evolutionary trees from molecular data. COMPARATIVE BIOCHEMISTRY AND PHYSIOLOGY. B, COMPARATIVE BIOCHEMISTRY 1992; 102:643-59. [PMID: 1395500 DOI: 10.1016/0305-0491(92)90061-u] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/26/2022]
Abstract
1. Procedures for multiple alignment of sequence data, subsequent phylogenetic inference, and testing of the trees derived are presented. 2. The assumptions underlying different approaches and the extent to which they are valid are discussed.
Collapse
Affiliation(s)
- T J Beanland
- Department of Biochemistry, University of Cambridge, U.K
| | | |
Collapse
|
42
|
Landès C, Hénaut A, Risler JL. A comparison of several similarity indices used in the classification of protein sequences: a multivariate analysis. Nucleic Acids Res 1992; 20:3631-7. [PMID: 1641329 PMCID: PMC334011 DOI: 10.1093/nar/20.14.3631] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022] Open
Abstract
The present work describes an attempt to identify reliable criteria which could be used as distance indices between protein sequences. Seven different criteria have been tested: i and ii) the scores of the alignments as given by the BESTFIT and the FASTA programs; iii) the ratio parameter, i.e. the BESTFIT score divided by the length of the aligned peptides; iv and v) the statistical significance (Z-scores) of the scores calculated by BESTFIT and FASTA, as obtained by comparison with shuffled sequences; vi) the Z-scores provided by the program RELATE which performs a segment-by-segment comparison of 2 sequences, and vii) an original distance index calculated by the program DOCMA from all the pairwise dotplots between the sequences. These 7 criteria have been tested against the aminoacid sequences of 39 globins and those of the 20 aminoacyl-tRNA synthetases from E. coli. The distances between the sequences were analyzed by the multivariate analysis techniques. The results show that the distances calculated from the scores of the pairwise alignments are not adequately sensitive. The Z-score from RELATE is not selective enough and too demanding in computer time. Three criteria gave a classification consistent with the known similarities between the sequences in the sets, namely the Z-scores from BESTFIT and FASTA and the multiple dotplot comparison distance index from DOCMA.
Collapse
Affiliation(s)
- C Landès
- Centre de Génétique Moléculaire du CNRS, Gif sur Yvette, France
| | | | | |
Collapse
|
43
|
Abstract
Multiple sequence comparison refers to the search for similarity in three or more sequences. This article presents a survey of the exhaustive (optimal) and heuristic (possibly sub-optimal) methods developed for the comparison of multiple macromolecular sequences. Emphasis is given to the different approaches of the heuristic methods. Four distance measures derived from information engineering and genetic studies are introduced for the comparison between two alignments of sequences. The use of entropy, which plays a central role in information theory as measures of information, choice and uncertainty, is proposed as a simple measure for the evaluation of the optimality of an alignment in the absence of any a priori knowledge about the structures of the sequences being compared. This article also gives two examples of comparison between alternative alignments of the same set of 5SRNAs as obtained by several different heuristic methods.
Collapse
Affiliation(s)
- S C Chan
- Department of Systems Design Engineering, University of Waterloo, Canada
| | | | | |
Collapse
|
44
|
Abstract
A simple and efficient method is described for analyzing quantitatively multiple protein sequence alignments and finding the most conserved blocks as well as the maxima of divergence within the set of aligned sequences. It consists of calculating the mean distance and the root-mean-square distance in each column of the multiple alignment, averaging the values in a window of defined length and plotting the results as a function of the position of the window. Due attention is paid to the presence of gaps in the columns. Several examples are provided, using the sequences of several cytochromes c, serine proteases, lysozymes and globins. Two distance matrices are compared, namely the matrix derived by Gribskov and Burgess from the Dayhoff matrix, and the Risler Structural Superposition Matrix. In each case, the divergence plots effectively point to the specific residues which are known to be essential for the catalytic activity of the proteins. In addition, the regions of maximum divergence are clearly delineated. Interestingly, they are generally observed in positions immediately flanking the most conserved blocks. The method should therefore be useful for delineating the peptide segments which will be good candidates for site-directed mutagenesis and for visualizing the evolutionary constraints along homologous polypeptide chains.
Collapse
Affiliation(s)
- S Brouillet
- Centre de Génétique Moléculaire du CNRS, Laboratoire Associé à l'Université P et M Curie, Gif-sur-Yvette, France
| | | | | | | |
Collapse
|
45
|
Leung MY, Blaisdell BE, Burge C, Karlin S. An efficient algorithm for identifying matches with errors in multiple long molecular sequences. J Mol Biol 1991; 221:1367-78. [PMID: 1942056 PMCID: PMC4076298 DOI: 10.1016/0022-2836(91)90938-3] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
An efficient algorithm is described for finding matches, repeats and other word relations, allowing for errors, in large data sets of long molecular sequences. The algorithm entails hashing on fixed-size words in conjunction with the use of a linked list connecting all occurrences of the same word. The average memory and run time requirement both increase almost linearly with the total sequence length. Some results of the program's performance on a database of Escherichia coli DNA sequences are presented.
Collapse
Affiliation(s)
- M Y Leung
- Division of Mathematics, Computer Science and Statistics, University of Texas, San Antonio 78249-0664
| | | | | | | |
Collapse
|
46
|
Abstract
Multiple sequence alignment can be a useful technique for studying molecular evolution, as well as for analyzing relationships between structure or function and primary sequence. We have developed for this purpose an interactive program, MACAW (Multiple Alignment Construction and Analysis Workbench), that allows the user to construct multiple alignments by locating, analyzing, editing, and combining "blocks" of aligned sequence segments. MACAW incorporates several novel features. (1) Regions of local similarity are located by a new search algorithm that avoids many of the limitations of previous techniques. (2) The statistical significance of blocks of similarity is evaluated using a recently developed mathematical theory. (3) Candidate blocks may be evaluated for potential inclusion in a multiple alignment using a variety of visualization tools. (4) A user interface permits each block to be edited by moving its boundaries or by eliminating particular segments, and blocks may be linked to form a composite multiple alignment. No completely automatic program is likely to deal effectively with all the complexities of the multiple alignment problem; by combining a powerful similarity search algorithm with flexible editing, analysis and display tools, MACAW allows the alignment strategy to be tailored to the problem at hand.
Collapse
Affiliation(s)
- G D Schuler
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
| | | | | |
Collapse
|
47
|
Vyas NK, Vyas MN, Quiocho FA. Comparison of the periplasmic receptors for L-arabinose, D-glucose/D-galactose, and D-ribose. Structural and Functional Similarity. J Biol Chem 1991. [DOI: 10.1016/s0021-9258(19)67776-8] [Citation(s) in RCA: 56] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022] Open
|
48
|
Nickell C, Anderson WF, Lloyd RS. Substitution of basic amino acids within endonuclease V enhances nontarget DNA binding. J Biol Chem 1991. [DOI: 10.1016/s0021-9258(19)67642-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022] Open
|
49
|
|
50
|
|