1
|
Llinares-López F, Berthet Q, Blondel M, Teboul O, Vert JP. Deep embedding and alignment of protein sequences. Nat Methods 2023; 20:104-111. [PMID: 36522501 DOI: 10.1038/s41592-022-01700-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2021] [Accepted: 10/24/2022] [Indexed: 12/23/2022]
Abstract
Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
Collapse
|
2
|
Pevzner P, Vingron M, Reidys C, Sun F, Istrail S. Michael Waterman's Contributions to Computational Biology and Bioinformatics. J Comput Biol 2022; 29:601-615. [PMID: 35727100 DOI: 10.1089/cmb.2022.29066.pp] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
On the occasion of Dr. Michael Waterman's 80th birthday, we review his major contributions to the field of computational biology and bioinformatics including the famous Smith-Waterman algorithm for sequence alignment, the probability and statistics theory related to sequence alignment, algorithms for sequence assembly, the Lander-Waterman model for genome physical mapping, combinatorics and predictions of ribonucleic acid structures, word counting statistics in molecular sequences, alignment-free sequence comparison, and algorithms for haplotype block partition and tagSNP selection related to the International HapMap Project. His books Introduction to Computational Biology: Maps, Sequences and Genomes for graduate students and Computational Genome Analysis: An Introduction geared toward undergraduate students played key roles in computational biology and bioinformatics education. We also highlight his efforts of building the computational biology and bioinformatics community as the founding editor of the Journal of Computational Biology and a founding member of the International Conference on Research in Computational Molecular Biology (RECOMB).
Collapse
Affiliation(s)
- Pavel Pevzner
- Department of Computer Science and Engineering, University of California San Diego, San Diego, California, USA
| | - Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Christian Reidys
- Department of Mathematics, Biocomplexity Institute & Initiative, University of Virginia, Charlottesville, Virginia, USA
| | - Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, USA
| | - Sorin Istrail
- Department of Computer Science, Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, USA
| |
Collapse
|
3
|
Root-Bernstein R, Churchill B. Co-Evolution of Opioid and Adrenergic Ligands and Receptors: Shared, Complementary Modules Explain Evolution of Functional Interactions and Suggest Novel Engineering Possibilities. Life (Basel) 2021; 11:life11111217. [PMID: 34833093 PMCID: PMC8623292 DOI: 10.3390/life11111217] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2021] [Revised: 10/29/2021] [Accepted: 11/03/2021] [Indexed: 12/14/2022] Open
Abstract
Cross-talk between opioid and adrenergic receptors is well-characterized and involves second messenger systems, the formation of receptor heterodimers, and the presence of extracellular allosteric binding regions for the complementary ligand; however, the evolutionary origins of these interactions have not been investigated. We propose that opioid and adrenergic ligands and receptors co-evolved from a common set of modular precursors so that they share binding functions. We demonstrate the plausibility of this hypothesis through a review of experimental evidence for molecularly complementary modules and report unexpected homologies between the two receptor types. Briefly, opioids form homodimers also bind adrenergic compounds; opioids bind to conserved extracellular regions of adrenergic receptors while adrenergic compounds bind to conserved extracellular regions of opioid receptors; opioid-like modules appear in both sets of receptors within key ligand-binding regions. Transmembrane regions associated with homodimerization of each class of receptors are also highly conserved across receptor types and implicated in heterodimerization. This conservation of multiple functional modules suggests opioid–adrenergic ligand and receptor co-evolution and provides mechanisms for explaining the evolution of their crosstalk. These modules also suggest the structure of a primordial receptor, providing clues for engineering receptor functions.
Collapse
|
4
|
Xu Z, Yang Y, Huang B. A teaching approach from the exhaustive search method to the Needleman-Wunsch algorithm. BIOCHEMISTRY AND MOLECULAR BIOLOGY EDUCATION : A BIMONTHLY PUBLICATION OF THE INTERNATIONAL UNION OF BIOCHEMISTRY AND MOLECULAR BIOLOGY 2017; 45:194-204. [PMID: 27740737 DOI: 10.1002/bmb.21027] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/04/2016] [Revised: 08/12/2016] [Accepted: 09/06/2016] [Indexed: 06/06/2023]
Abstract
The Needleman-Wunsch algorithm has become one of the core algorithms in bioinformatics; however, this programming requires more suitable explanations for students with different major backgrounds. In supposing sample sequences and using a simple store system, the connection between the exhaustive search method and the Needleman-Wunsch algorithm was analyzed to more thoroughly explain this algorithm. The present study could benefit the teaching and learning of the Needleman-Wunsch algorithm. © 2016 by The International Union of Biochemistry and Molecular Biology, 45(3):194-204, 2017.
Collapse
Affiliation(s)
- Zhongneng Xu
- Department of Ecology, Jinan University, Guangzhou, 510632, China
| | - Yayun Yang
- Department of Ecology, Jinan University, Guangzhou, 510632, China
| | - Beibei Huang
- Departament d'Enginyeria Quimica, Universitat Rovira i Virgili, 26 Av. dels Paisos Catalans, 43007, Tarragona, Spain
- Department of Experimental Therapeutics, The University of Texas M. D. Anderson Cancer Center, Unit 36, Houston, TX 77030, USA
| |
Collapse
|
5
|
Abstract
Many bioinformatics problems, such as sequence alignment, gene prediction, phylogenetic tree estimation and RNA secondary structure prediction, are often affected by the 'uncertainty' of a solution, that is, the probability of the solution is extremely small. This situation arises for estimation problems on high-dimensional discrete spaces in which the number of possible discrete solutions is immense. In the analysis of biological data or the development of prediction algorithms, this uncertainty should be handled carefully and appropriately. In this review, I will explain several methods to combat this uncertainty, presenting a number of examples in bioinformatics. The methods include (i) avoiding point estimation, (ii) maximum expected accuracy (MEA) estimations and (iii) several strategies to design a pipeline involving several prediction methods. I believe that the basic concepts and ideas described in this review will be generally useful for estimation problems in various areas of bioinformatics.
Collapse
|
6
|
Zhou H, Skolnick J. Template-based protein structure modeling using TASSER(VMT.). Proteins 2011; 80:352-61. [PMID: 22105797 DOI: 10.1002/prot.23183] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2011] [Revised: 08/25/2011] [Accepted: 09/04/2011] [Indexed: 12/29/2022]
Abstract
Template-based protein structure modeling is commonly used for protein structure prediction. Based on the observation that multiple template-based methods often perform better than single template-based methods, we further explore the use of a variable number of multiple templates for a given target in the latest variant of TASSER, TASSER(VMT) . We first develop an algorithm that improves the target-template alignment for a given template. The improved alignment, called the SP(3) alternative alignment, is generated by a parametric alignment method coupled with short TASSER refinement on models selected using knowledge-based scores. The refined top model is then structurally aligned to the template to produce the SP(3) alternative alignment. Templates identified using SP(3) threading are combined with the SP(3) alternative and HHEARCH alignments to provide target alignments to each template. These template models are then grouped into sets containing a variable number of template/alignment combinations. For each set, we run short TASSER simulations to build full-length models. Then, the models from all sets of templates are pooled, and the top 20-50 models selected using FTCOM ranking method. These models are then subjected to a single longer TASSER refinement run for final prediction. We benchmarked our method by comparison with our previously developed approach, pro-sp(3) -TASSER, on a set with 874 easy and 318 hard targets. The average GDT-TS score improvements for the first model are 3.5 and 4.3% for easy and hard targets, respectively. When tested on the 112 CASP9 targets, our method improves the average GDT-TS scores as compared to pro-sp3-TASSER by 8.2 and 9.3% for the 80 easy and 32 hard targets, respectively. It also shows slightly better results than the top ranked CASP9 Zhang-Server, QUARK and HHpredA methods. The program is available for download at http://cssb.biology.gatech.edu/.
Collapse
Affiliation(s)
- Hongyi Zhou
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318
| | | |
Collapse
|
7
|
Parametric Analysis of Alignment and Phylogenetic Uncertainty. Bull Math Biol 2011; 73:795-810. [DOI: 10.1007/s11538-010-9610-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2009] [Accepted: 11/04/2010] [Indexed: 10/18/2022]
|
8
|
Hower V, Heitsch CE. Parametric Analysis of RNA Branching Configurations. Bull Math Biol 2011; 73:754-76. [DOI: 10.1007/s11538-010-9607-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2009] [Accepted: 11/04/2010] [Indexed: 01/30/2023]
|
9
|
Yakovlev VV, Roytberg MA. Increasing the accuracy of global alignment of amino acid sequences by constructing a set of alignment candidates. Biophysics (Nagoya-shi) 2010. [DOI: 10.1134/s0006350910060011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
|
10
|
Didier G. Parametric Maximum Parsimonious Reconstruction on Trees. Bull Math Biol 2010; 73:1477-502. [DOI: 10.1007/s11538-010-9574-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2009] [Accepted: 07/05/2010] [Indexed: 11/30/2022]
|
11
|
Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics 2010; 11:80. [PMID: 20144198 PMCID: PMC2829014 DOI: 10.1186/1471-2105-11-80] [Citation(s) in RCA: 145] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2009] [Accepted: 02/09/2010] [Indexed: 11/25/2022] Open
Abstract
Background Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours http://last.cbrc.jp/.
Collapse
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo 135-0064, Japan.
| | | | | |
Collapse
|
12
|
Edgar RC. Optimizing substitution matrix choice and gap parameters for sequence alignment. BMC Bioinformatics 2009; 10:396. [PMID: 19954534 PMCID: PMC2791778 DOI: 10.1186/1471-2105-10-396] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2009] [Accepted: 12/02/2009] [Indexed: 12/04/2022] Open
Abstract
Background While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. Results POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. Conclusion The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data is freely available from the author's web site at http://www.drive5.com/pop.
Collapse
|
13
|
Zheng W, Friedman AM, Bailey-Kellogg C. Algorithms for joint optimization of stability and diversity in planning combinatorial libraries of chimeric proteins. J Comput Biol 2009; 16:1151-68. [PMID: 19645597 DOI: 10.1089/cmb.2009.0090] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
In engineering protein variants by constructing and screening combinatorial libraries of chimeric proteins, two complementary and competing goals are desired: the new proteins must be similar enough to the evolutionarily-selected wild-type proteins to be stably folded, and they must be different enough to display functional variation. We present here the first method, Staversity, to simultaneously optimize stability and diversity in selecting sets of breakpoint locations for site-directed recombination. Our goal is to uncover all "undominated" breakpoint sets, for which no other breakpoint set is better in both factors. Our first algorithm finds the undominated sets serving as the vertices of the lower envelope of the two-dimensional (stability and diversity) convex hull containing all possible breakpoint sets. Our second algorithm identifies additional breakpoint sets in the concavities that are either undominated or dominated only by undiscovered breakpoint sets within a distance bound computed by the algorithm. Both algorithms are efficient, requiring only time polynomial in the numbers of residues and breakpoints, while characterizing a space defined by an exponential number of possible breakpoint sets. We applied Staversity to identify 2-10 breakpoint plans for different sets of parent proteins taken from the purE family, as well as for parent proteins TEM-1 and PSE-4 from the beta-lactamase family. The average normalized distance between our plans and the lower bound for optimal plans is around 2%. Our plans dominate most (60-90% on average for each parent set) of the plans found by other possible approaches, random sampling or explicit optimization for stability with implicit optimization for diversity. The identified breakpoint sets provide a compact representation of good plans, enabling a protein engineer to understand and account for the trade-offs between two key considerations in combinatorial chimeragenesis.
Collapse
Affiliation(s)
- Wei Zheng
- Department of Computer Science, Dartmouth College , Hanover, New Hampshire, USA
| | | | | |
Collapse
|
14
|
Zhou H, Skolnick J. Protein structure prediction by pro-Sp3-TASSER. Biophys J 2009; 96:2119-27. [PMID: 19289038 DOI: 10.1016/j.bpj.2008.12.3898] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2008] [Revised: 11/12/2008] [Accepted: 12/03/2008] [Indexed: 12/29/2022] Open
Abstract
An automated protein structure prediction algorithm, pro-sp3-Threading/ASSEmbly/Refinement (TASSER), is described and benchmarked. Structural templates are identified using five different scoring functions derived from the previously developed threading methods PROSPECTOR_3 and SP(3). Top templates identified by each scoring function are combined to derive contact and distant restraints for subsequent model refinement by short TASSER simulations. For Medium/Hard targets (those with moderate to poor quality templates and/or alignments), alternative template alignments are also generated by parametric alignment and the top models selected by TASSER-QA are included in the contact and distance restraint derivation. Then, multiple short TASSER simulations are used to generate an ensemble of full-length models. Subsequently, the top models are selected from the ensemble by TASSER-QA and used to derive TASSER contacts and distant restraints for another round of full TASSER refinement. The final models are selected from both rounds of TASSER simulations by TASSER-QA. We compare pro-sp3-TASSER with our previously developed MetaTASSER method (enhanced with chunk-TASSER for Medium/Hard targets) on a representative test data set of 723 proteins <250 residues in length. For the 348 proteins classified as easy targets (those templates with good alignments and global structure similarity to the target), the cumulative TM-score of the best of top five models by pro-sp3-TASSER shows a 2.1% improvement over MetaTASSER. For the 155/220 medium/hard targets, the improvements in TM-score are 2.8% and 2.2%, respectively. All improvements are statistically significant. More importantly, the number of foldable targets (those having models whose TM-score to native >0.4 in the top five clusters) increases from 472 to 497 for all targets, and the relative increases for medium and hard targets are 10% and 15%, respectively. A server that implements the above algorithm is available at http://cssb.biology.gatech.edu/skolnick/webservice/pro-sp3-TASSER/. The source code is also available upon request.
Collapse
Affiliation(s)
- Hongyi Zhou
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
| | | |
Collapse
|
15
|
Kim E, Kececioglu J. Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:546-556. [PMID: 18989042 DOI: 10.1109/tcbb.2008.57] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25 percent.
Collapse
Affiliation(s)
- Eagu Kim
- Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA
| | | |
Collapse
|
16
|
Mount DW. Using gaps and gap penalties to optimize pairwise sequence alignments. ACTA ACUST UNITED AC 2008; 2008:pdb.top40. [PMID: 21356856 DOI: 10.1101/pdb.top40] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
INTRODUCTIONTo obtain the best possible alignment between two sequences, it is necessary to include gaps in sequence alignments and use gap penalties. For aligning DNA sequences, a simple positive score for matches and a negative score for mismatches and gaps are most often used. To score matches and mismatches in alignments of proteins, it is necessary to know how often one amino acid is substituted for another in related proteins. In addition, a method is needed to account for insertions and deletions that sometimes appear in related DNA or protein sequences. To accommodate such sequence variations, gaps that appear in sequence alignments are given a negative penalty score reflecting the fact that they are not expected to occur very often. Mathematically speaking, it is very difficult to produce the best-possible alignment, either global or local, unless gaps are included in the alignment. This article discusses how to use gaps and gap penalties to optimize pairwise sequence alignments.
Collapse
|
17
|
Choi S, Jeon J, Yang JS, Kim S. Common occurrence of internal repeat symmetry in membrane proteins. Proteins 2008; 71:68-80. [PMID: 17932930 DOI: 10.1002/prot.21656] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Symmetry plays significant roles in protein structure and function. Particularly, symmetric interfaces are known to act as switches for two-state conformational change. Membrane proteins often undergo two-state conformational change during the transport process of ion channels or the active/inactive transitions in receptors. Here, we provide the first comprehensive analyses of internal repeat symmetry in membrane proteins. We examined the known membrane protein structures and found that, remarkably, nearly half of them have internal repeat symmetry. Moreover, we found that the conserved cores of these internal repeats are positioned at the interface of symmetric units when they are mapped on structures. Because of the large sequence divergence that occurs between internal repeats, the inherent symmetry present in protein sequences often has only been detected after structure determination. We therefore developed a sensitive procedure to predict the internal repeat symmetry from sequence information and identified 4653 proteins that are likely to have internal repeat symmetry.
Collapse
Affiliation(s)
- Sungwon Choi
- Division of Molecular and Life Science, Pohang University of Science and Technology, Pohang 790-784, Korea
| | | | | | | |
Collapse
|
18
|
Abstract
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.
Collapse
Affiliation(s)
- Xiaoqiu Huang
- Department of Computer Science, Iowa State University, Ames, Iowa 50011-1040, USA.
| |
Collapse
|
19
|
Abstract
Protein sequence alignment is the task of identifying evolutionarily or structurally related positions in a collection of amino acid sequences. Although the protein alignment problem has been studied for several decades, many recent studies have demonstrated considerable progress in improving the accuracy or scalability of multiple and pairwise alignment tools, or in expanding the scope of tasks handled by an alignment program. In this chapter, we review state-of-the-art protein sequence alignment and provide practical advice for users of alignment tools.
Collapse
Affiliation(s)
- Chuong B Do
- Computer Science Department, Stanford University, Stanford, CA, USA
| | | |
Collapse
|
20
|
Frith MC, Valen E, Krogh A, Hayashizaki Y, Carninci P, Sandelin A. A code for transcription initiation in mammalian genomes. Genes Dev 2008; 18:1-12. [PMID: 18032727 PMCID: PMC2134772 DOI: 10.1101/gr.6831208] [Citation(s) in RCA: 189] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2007] [Accepted: 10/14/2007] [Indexed: 11/24/2022]
Abstract
Genome-wide detection of transcription start sites (TSSs) has revealed that RNA Polymerase II transcription initiates at millions of positions in mammalian genomes. Most core promoters do not have a single TSS, but an array of closely located TSSs with different rates of initiation. As a rule, genes have more than one such core promoter; however, defining the boundaries between core promoters is not trivial. These discoveries prompt a re-evaluation of our models for transcription initiation. We describe a new framework for understanding the organization of transcription initiation. We show that initiation events are clustered on the chromosomes at multiple scales-clusters within clusters-indicating multiple regulatory processes. Within the smallest of such clusters, which can be interpreted as core promoters, the local DNA sequence predicts the relative transcription start usage of each nucleotide with a remarkable 91% accuracy, implying the existence of a DNA code that determines TSS selection. Conversely, the total expression strength of such clusters is only partially determined by the local DNA sequence. Thus, the overall control of transcription can be understood as a combination of large- and small-scale effects; the selection of transcription start sites is largely governed by the local DNA sequence, whereas the transcriptional activity of a locus is regulated at a different level; it is affected by distal features or events such as enhancers and chromatin remodeling.
Collapse
Affiliation(s)
- Martin C. Frith
- Genome Exploration Research Group (Genome Network Project Core Group), RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
- ARC Centre in Bioinformatics, Institute for Molecular Bioscience, University of Queensland, Brisbane, Qld 4072, Australia
| | - Eivind Valen
- The Bioinformatics Centre, Department of Molecular Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 København N, Denmark
| | - Anders Krogh
- The Bioinformatics Centre, Department of Molecular Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 København N, Denmark
| | - Yoshihide Hayashizaki
- Genome Exploration Research Group (Genome Network Project Core Group), RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
- Genome Science Laboratory, Discovery Research Institute, RIKEN Wako Institute, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
| | - Piero Carninci
- Genome Exploration Research Group (Genome Network Project Core Group), RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
- Genome Science Laboratory, Discovery Research Institute, RIKEN Wako Institute, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
| | - Albin Sandelin
- The Bioinformatics Centre, Department of Molecular Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaløes Vej 5, DK-2200 København N, Denmark
| |
Collapse
|
21
|
Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res 2007; 18:298-309. [PMID: 18073381 DOI: 10.1101/gr.6725608] [Citation(s) in RCA: 90] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human-mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman-Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
Collapse
Affiliation(s)
- Gerton Lunter
- MRC Functional Genetics Unit, Department of Physiology, Anatomy, and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom.
| | | | | | | | | | | |
Collapse
|
22
|
Chivian D, Baker D. Homology modeling using parametric alignment ensemble generation with consensus and energy-based model selection. Nucleic Acids Res 2006; 34:e112. [PMID: 16971460 PMCID: PMC1635247 DOI: 10.1093/nar/gkl480] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The accuracy of a homology model based on the structure of a distant relative or other topologically equivalent protein is primarily limited by the quality of the alignment. Here we describe a systematic approach for sequence-to-structure alignment, called ‘K*Sync’, in which alignments are generated by dynamic programming using a scoring function that combines information on many protein features, including a novel measure of how obligate a sequence region is to the protein fold. By systematically varying the weights on the different features that contribute to the alignment score, we generate very large ensembles of diverse alignments, each optimal under a particular constellation of weights. We investigate a variety of approaches to select the best models from the ensemble, including consensus of the alignments, a hydrophobic burial measure, low- and high-resolution energy functions, and combinations of these evaluation methods. The effect on model quality and selection resulting from loop modeling and backbone optimization is also studied. The performance of the method on a benchmark set is reported and shows the approach to be effective at both generating and selecting accurate alignments. The method serves as the foundation of the homology modeling module in the Robetta server.
Collapse
Affiliation(s)
- Dylan Chivian
- Department of Biochemistry, University of WashingtonSeattle, WA, USA
| | - David Baker
- Department of Biochemistry, University of WashingtonSeattle, WA, USA
- Howard Hughes Medical Institute, SeattleWA, USA
- To whom correspondence should be addressed at Department of Biochemistry and HHMI, University of Washington, Box 357350, Seattle, WA 98195, USA. Tel: +1 206 543 1295; Fax: +1 206 685 1792;
| |
Collapse
|
23
|
Dewey CN, Huggins PM, Woods K, Sturmfels B, Pachter L. Parametric alignment of Drosophila genomes. PLoS Comput Biol 2006; 2:e73. [PMID: 16789815 PMCID: PMC1480539 DOI: 10.1371/journal.pcbi.0020073] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2005] [Accepted: 05/10/2006] [Indexed: 12/29/2022] Open
Abstract
The classic algorithms of Needleman-Wunsch and Smith-Waterman find a maximum a posteriori probability alignment for a pair hidden Markov model (PHMM). To process large genomes that have undergone complex genome rearrangements, almost all existing whole genome alignment methods apply fast heuristics to divide genomes into small pieces that are suitable for Needleman-Wunsch alignment. In these alignment methods, it is standard practice to fix the parameters and to produce a single alignment for subsequent analysis by biologists. As the number of alignment programs applied on a whole genome scale continues to increase, so does the disagreement in their results. The alignments produced by different programs vary greatly, especially in non-coding regions of eukaryotic genomes where the biologically correct alignment is hard to find. Parametric alignment is one possible remedy. This methodology resolves the issue of robustness to changes in parameters by finding all optimal alignments for all possible parameters in a PHMM. Our main result is the construction of a whole genome parametric alignment of Drosophila melanogaster and Drosophila pseudoobscura. This alignment draws on existing heuristics for dividing whole genomes into small pieces for alignment, and it relies on advances we have made in computing convex polytopes that allow us to parametrically align non-coding regions using biologically realistic models. We demonstrate the utility of our parametric alignment for biological inference by showing that cis-regulatory elements are more conserved between Drosophila melanogaster and Drosophila pseudoobscura than previously thought. We also show how whole genome parametric alignment can be used to quantitatively assess the dependence of branch length estimates on alignment parameters.
Collapse
Affiliation(s)
- Colin N Dewey
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, California, United States of America
| | - Peter M Huggins
- Department of Mathematics, University of California Berkeley, Berkeley, California, United States of America
| | - Kevin Woods
- Department of Mathematics, University of California Berkeley, Berkeley, California, United States of America
| | - Bernd Sturmfels
- Department of Mathematics, University of California Berkeley, Berkeley, California, United States of America
| | - Lior Pachter
- Department of Mathematics, University of California Berkeley, Berkeley, California, United States of America
- * To whom correspondence should be addressed. E-mail:
| |
Collapse
|
24
|
Kececioglu J, Kim E. Simple and Fast Inverse Alignment. LECTURE NOTES IN COMPUTER SCIENCE 2006. [DOI: 10.1007/11732990_37] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
|
25
|
Abstract
We describe a novel model and algorithm for simultaneously estimating multiple molecular sequence alignments and the phylogenetic trees that relate the sequences. Unlike current techniques that base phylogeny estimates on a single estimate of the alignment, we take alignment uncertainty into account by considering all possible alignments. Furthermore, because the alignment and phylogeny are constructed simultaneously, a guide tree is not needed. This sidesteps the problem in which alignments created by progressive alignment are biased toward the guide tree used to generate them. Joint estimation also allows us to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) to group sister taxa in the phylogeny. Our indel model makes use of affine gap penalties and considers indels of multiple letters. We make the simplifying assumption that the indel process is identical on all branches. As a result, the probability of a gap is independent of branch length. We use a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. We describe a new MCMC transition kernel that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Our software implementation can estimate alignment uncertainty and we describe a method for summarizing this uncertainty in a single plot.
Collapse
Affiliation(s)
- Benjamin D Redelings
- Department of Biomathematics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1766, USA
| | | |
Collapse
|
26
|
Abstract
One of the major successes in computational biology has been the unification, by using the graphical model formalism, of a multitude of algorithms for annotating and comparing biological sequences. Graphical models that have been applied to these problems include hidden Markov models for annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single algorithm, the sum-product algorithm, solves many of the inference problems that are associated with different statistical models. This article introduces the polytope propagation algorithm for computing the Newton polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference calculations for graphical models.
Collapse
Affiliation(s)
- Lior Pachter
- Department of Mathematics, University of California, Berkeley, CA 94720, USA.
| | | |
Collapse
|
27
|
Abstract
This article presents a unified mathematical framework for inference in graphical models, building on the observation that graphical models are algebraic varieties. From this geometric viewpoint, observations generated from a model are coordinates of a point in the variety, and the sum-product algorithm is an efficient tool for evaluating specific coordinates. Here, we address the question of how the solutions to various inference problems depend on the model parameters. The proposed answer is expressed in terms of tropical algebraic geometry. The Newton polytope of a statistical model plays a key role. Our results are applied to the hidden Markov model and the general Markov model on a binary tree.
Collapse
Affiliation(s)
- Lior Pachter
- Department of Mathematics, University of California, Berkeley, CA 94720, USA
| | | |
Collapse
|
28
|
|
29
|
Zhang K, Sun F, Waterman MS, Chen T. Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data. Am J Hum Genet 2003; 73:63-73. [PMID: 12802783 PMCID: PMC1180591 DOI: 10.1086/376437] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2003] [Accepted: 04/10/2003] [Indexed: 11/03/2022] Open
Abstract
Recent studies have shown that the human genome has a haplotype block structure such that it can be decomposed into large blocks with high linkage disequilibrium (LD) and relatively limited haplotype diversity, separated by short regions of low LD. One of the practical implications of this observation is that only a small fraction of all the single-nucleotide polymorphisms (SNPs) (referred as "tag SNPs") can be chosen for mapping genes responsible for human complex diseases, which can significantly reduce genotyping effort, without much loss of power. Algorithms have been developed to partition haplotypes into blocks with the minimum number of tag SNPs for an entire chromosome. In practice, investigators may have limited resources, and only a certain number of SNPs can be genotyped. In the present article, we first formulate this problem as finding a block partition with a fixed number of tag SNPs that can cover the maximal percentage of the whole genome, and we then develop two dynamic programming algorithms to solve this problem. The algorithms are sufficiently flexible to permit knowledge of functional polymorphisms to be considered. We apply the algorithms to a data set of SNPs on human chromosome 21, combining the information of coding and noncoding regions. We study the density of SNPs in intergenic regions, introns, and exons, and we find that the SNP density in intergenic regions is similar to that in introns and is higher than that in exons, results that are consistent with previous studies. We also calculate the distribution of block break points in intergenic regions, genes, exons, and coding regions and do not find any significant differences.
Collapse
Affiliation(s)
- Kui Zhang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, 1042 W. 36th Place DRB-290, Los Angeles, CA 90089-1113, USA
| | | | | | | |
Collapse
|
30
|
Hoofer SR, Bussche RAVD. Molecular Phylogenetics of the Chiropteran Family Vespertilionidae. ACTA CHIROPTEROLOGICA 2003. [DOI: 10.3161/001.005.s101] [Citation(s) in RCA: 128] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
31
|
Abstract
A major bottleneck in comparative modeling is the alignment quality; this is especially true for proteins whose distant relationships could be reliably recognized only by recent advances in fold recognition. The best algorithms excel in recognizing distant homologs but often produce incorrect alignments for over 50% of protein pairs in large fold-prediction benchmarks. The alignments obtained by sequence-sequence or sequence-structure matching algorithms differ significantly from the structural alignments. To study this problem, we developed a simplified method to explicitly enumerate all possible alignments for a pair of proteins. This allowed us to estimate the number of significantly different alignments for a given scoring method that score better than the structural alignment. Using several examples of distantly related proteins, we show that for standard sequence-sequence alignment methods, the number of significantly different alignments is usually large, often about 10(10) alternatives. This distance decreases when the alignment method is improved, but the number is still too large for the brute force enumeration approach. More effective strategies were needed, so we evaluated and compared two well-known approaches for searching the space of suboptimal alignments. We combined their best features and produced a hybrid method, which yielded alignments that surpassed the original alignments for about 50% of protein pairs with minimal computational effort.
Collapse
Affiliation(s)
- Lukasz Jaroszewski
- Program in Bioinformatics and Biological Complexity, The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA
| | | | | |
Collapse
|
32
|
|
33
|
|
34
|
Drasdo D, Hwa T, Lässig M. Scaling laws and similarity detection in sequence alignment with gaps. J Comput Biol 2000; 7:115-41. [PMID: 10890391 DOI: 10.1089/10665270050081414] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
We study the problem of similarity detection by sequence alignment with gaps, using a recently established theoretical framework based on the morphology of alignment paths. Alignments of sequences without mutual correlations are found to have scale-invariant statistics. This is the basis for a scaling theory of alignments of correlated sequences. Using a simple Markov model of evolution, we generate sequences with well-defined mutual correlations and quantify the fidelity of an alignment in an unambiguous way. The scaling theory predicts the dependence of the fidelity on the alignment parameters and on the statistical evolution parameters characterizing the sequence correlations. Specific criteria for the optimal choice of alignment parameters emerge from this theory. The results are verified by extensive numerical simulations.
Collapse
Affiliation(s)
- D Drasdo
- Max-Planck Institut für Kolloid- und Grenzflächenforschung, Potsdam, Germany
| | | | | |
Collapse
|
35
|
Abstract
Gaps result from the alignment of sequences of unequal length during primary homology assessment. Viewed as character states originating from particular biological events (mutation), gaps contain historical information suitable for phylogenetic analysis. The effect of gaps as a source of phylogenetic data is explored via sensitivity analysis and character congruence among different data partitions. Example data sets are provided to show that gaps contain important phylogenetic information not recovered by those methods that omit gaps in their calculations. However, gap cost schemes are arbitrary (although they must be explicit) and thus data exploration is a necessity of molecular analyses, while character congruence is necessary as an external criterion for hypothesis decision.
Collapse
Affiliation(s)
- G Giribet
- Department of Invertebrates, American Museum of Natural History, Central Park West at 79th Street, New York, New York 10024, USA
| | | |
Collapse
|
36
|
|
37
|
|
38
|
Abstract
Molecular biology is becoming a computationally intense realm of contemporary science and faces some of the current grand scientific challenges. In its context, tools that identify, store, compare and analyze effectively large and growing numbers of bio-sequences are found of increasingly crucial importance. Biosequences are routinely compared or aligned, in a variety of ways, to infer common ancestry, to detect functional equivalence, or simply while searching for similar entries in a database. A considerable body of knowledge has accumulated on sequence alignment during the past few decades. Without pretending to be exhaustive, this paper attempts a survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions. The paper is based on presentations and literature given at the Workshop on Sequence Alignment held at Princeton, N.J., in November 1994, as part of the DIMACS Special Year on Mathematical Support for Molecular Biology.
Collapse
Affiliation(s)
- A Apostolico
- Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA.
| | | |
Collapse
|
39
|
Abstract
The multiple sequence alignment problem is applicable and important in various fields in molecular biology such as the prediction of three-dimensional structures of proteins and the inference of phylogenetic trees. However, the optimal alignment based on the scoring criterion is not always biologically the most significant alignment. We here propose two flexible and efficient approaches to solve this problem. One approach is to provide many suboptimal alignments as alternatives for the optimal one. It has been considered almost impossible to investigate such suboptimal alignments of more than two sequences because of the enormous size of the problem. We propose techniques for enumeration of suboptimal alignments using the Eppstein algorithm. We also discuss what kind of suboptimal alignment is unnecessary to enumerate and propose an efficient enumeration algorithm to enumerate only necessary alignments. The other approach is parametric analysis. The obtained optimal solution with fixed parameters such as gap penalties is not always the biologically best alignment. Thus, it is required to vary parameters and check how the optimal alignments change. The way to vary parameters has been studied well on the problem of two sequences, but not on the multiple alignment problem because of the difficulty of computing the optimal solution. We propose techniques for this parametric multiple alignment problem and examine the features of alignments obtained by various parametric analyses. For both approaches, this paper performs experiments on various groups of actual protein sequences and examines the efficiency of these algorithms and properties of sequence groups.
Collapse
Affiliation(s)
- T Shibuya
- Department of Information Science, Faculty of Science, University of Tokyo, Japan
| | | |
Collapse
|
40
|
Komatsoulis GA, Waterman MS. A new computational method for detection of chimeric 16S rRNA artifacts generated by PCR amplification from mixed bacterial populations. Appl Environ Microbiol 1997; 63:2338-46. [PMID: 9172353 PMCID: PMC168526 DOI: 10.1128/aem.63.6.2338-2346.1997] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/04/2023] Open
Abstract
A new computational method (chimeric alignment) has been developed to detect chimeric 16S rRNA artifacts generated during PCR amplification from mixed bacterial populations. In contrast to other nearest-neighbor methods (e.g., CHECK_CHIMERA) that define sequence similarity by k-tuple matching, the chimeric alignment method uses the score from dynamic programming alignments. Further, the chimeric alignments are displayed to the user to assist in sequence classification. The distribution of improvement scores for 500 authentic, nonchimeric sequences and 300 artificial chimeras (constructed from authentic sequences) was used to study the sensitivity and accuracy of both chimeric alignment and CHECK_CHIMERA. At a constant rate of authentic sequence misclassification (5%), chimeric alignment incorrectly classified 13% of the artificial chimeras versus 14% for CHECK_CHIMERA. Interestingly, only 1% of nonchimeras and 10% of chimeras were misclassified by both programs, suggesting that optimum performance is obtained by using the two methods to assign sequences to three classes: high-probability nonchimeras, high-probability chimeras, and sequences that need further study by other means. This study suggests that k-tuple-based matching methods are more sensitive than alignment-based methods when there is significant parental sequence similarity, while the opposite becomes true as the sequences become more distantly related. The software and a World Wide Web-based server are available at http://www-hto.usc.edu/software/mglobal CHI.
Collapse
Affiliation(s)
- G A Komatsoulis
- Department of Mathematics, University of Southern California, Los Angeles 90089-1113, USA.
| | | |
Collapse
|
41
|
Koch I, Lengauer T, Wanke E. An algorithm for finding maximal common subtopologies in a set of protein structures. J Comput Biol 1996; 3:289-306. [PMID: 8811488 DOI: 10.1089/cmb.1996.3.289] [Citation(s) in RCA: 85] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
Abstract
For the comparison and analysis of protein structures, it is of interest to find maximal common substructures in a given set of proteins. This question is also relevant for motif definition and structure classification. In this paper we describe first a new suitable representation of the secondary structure topology of a protein by an undirected labeled graph. Based on this representation we developed a new fast algorithm that finds all common subtopologies in a set of protein structures. Our method is based on the algorithm by Bron and Kerbosch (1973), which enumerates all maximal cliques in a graph. The main improvement of our algorithm is to restrict the search process to cliques that represent connected substructures. This restriction reduces the number of cliques to be considered during the search process and the size of the search tree drastically. Thus we are able to handle large proteins. Experiments show the efficiency and superiority of our algorithm in comparison with other existing algorithms basing on graph-theoretical methods.
Collapse
Affiliation(s)
- I Koch
- Institute for Algorithms and Scientific Computing SCAI.ALG, GMD--German National Research Center for Information Technology, Sankt Augustin, Germany
| | | | | |
Collapse
|
42
|
Agarwal P, States DJ. A Bayesian evolutionary distance for parametrically aligned sequences. J Comput Biol 1996; 3:1-17. [PMID: 8697232 DOI: 10.1089/cmb.1996.3.1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023] Open
Abstract
There is an inherent relationship between the process of pairwise sequence alignment and the estimation of evolutionary distance. This relationship is explored and made explicit. Assuming an evolutionary model and given a specific pattern of observed base mismatches, the relative probabilities of evolution at each evolutionary distance are computed using a Bayesian framework. The mean or the median of this probability distribution provides a robust estimate of the central value. The evolutionary distance has traditionally been computed as zero for an observed homology of 20 bases with no mismatches; we prove that it is highly probable that the distance is greater than 0.01. The mean of the distribution is 0.047, which is a better estimate of the evolutionary distance. Bayesian estimates of the evolutionary distance incorporate arbitrary prior information about variable mutation rates both over time and along sequence position, thus requiring only a weak form of the molecular-clock hypothesis. The endpoints of the similarity between genomic DNA sequences are often ambiguous. The probability of evolution at each evolutionary distance can be estimated over the entire set of alignments by choosing the best alignment at each distance and the corresponding probability of duplication at that evolutionary distance. A central value of this distribution provides a robust evolutionary distance estimate. We provide an efficient algorithm for computing the parametric alignment, considering evolutionary distance as the only parameter. These techniques and estimates are used to infer the duplication history of the genomic sequence in C. elegans and in S. cerevisiae. Our results indicate that repeats discovered using a single scoring matrix show a considerable bias in subsequent evolutionary distance estimates.
Collapse
Affiliation(s)
- P Agarwal
- Institute for Biomedical Computing, Washington University, St. Louis, MO 63110, USA
| | | |
Collapse
|
43
|
Abstract
Recently algorithms for parametric alignment (Waterman et al., 1992, Natl Acad. Sci. USA 89, 6090-6093; Gusfield et al., 1992, Proceedings of the Third Annual ACM-SIAM Discrete Algorithms) find optimal scores for all penalty parameters, both for global and local sequence alignment. This paper reviews those techniques. Then in the main part of this paper dynamic programming methods are used to compute ensemble alignment, finding all alignment scores for all parameters. Both global and local ensemble alignments are studied, and parametric alignment is used to compute near optimal ensemble alignments.
Collapse
Affiliation(s)
- M S Waterman
- Department of Mathematics, University of Southern California, Los Angeles 90089-1113
| |
Collapse
|
44
|
Abstract
The development of extremely powerful computer programs and the ready availability of microcomputers has revealed several computational problems with data analysis. These problems occur in the handling of systematic data in general and molecular systematic data in particular. This paper examines three areas of controversy in molecular systematics resulting from increased computer power. We start by examining the first step in DNA sequence analysis, the establishment of homology via sequence alignment. Next we examine several problems in phylogenetic analysis that have arisen in the last few years due to use of the PAUP (Swofford, 1991), HENNIG86 (Farris, 1988), and PHYLIP programs. These problems include limitations on the number of taxa examined in a given analysis and the accuracy of the parsimony trees in such analyses. The final subject is an examination of programs used for assessing tree robustness. We concentrate on certain programs (such as MALIGN (Wheeler and Gladstein, 1993), PAUP (Swofford, 1991), HENNIG86 (Farris, 1988), PHYLIP (Felsenstein, 1990), CLADOS (Nixon, 1993), MacClade (Maddison and Maddison, 1993), etc.), but similar comments about other programs could also be made.
Collapse
Affiliation(s)
- R DeSalle
- Department of Entomology, American Museum of Natural History, New York, NY 10024
| | | | | |
Collapse
|
45
|
Abstract
The discussion of molecular sequence alignment is becoming more prominent in studies of molecular systematics and evolution. As the basis for initial homology statements, alignment is crucial to comparative molecular biology. Although fundamental, alignment is not a process which necessarily yields objective, precise results. Ambiguities can appear in alignment due to a number of factors. Three such sources of ambiguity are discussed here. These are ambiguity in the establishment of alignment parameters, pair-wise order and individual "path" variation. The first arises from the necessary but empirically untestable assumptions of gap costs and other factors which are required to align sequences objectively. The second is due to the possible existence of non-unique solutions to the same alignment parameters in heuristic and exact solutions. The third is a result of multiple optimal paths within single alignments, potentially generating huge numbers of equally costly but unique alignments. Some of the problems with and several possible solutions to the difficult situation of non-unique alignments are discussed.
Collapse
Affiliation(s)
- W C Wheeler
- Department of Invertebrates, American Museum of Natural History, New York, NY 10024-5192
| |
Collapse
|
46
|
Abstract
A near-optimal alignment between a pair of sequences is an alignment whose score lies within the neighborhood of the optimal score. We present an efficient method for representing all alignments whose score is within any given delta from the optimal score. The representation is a compact graph that makes it easy to impose additional biological constraints and select one desirable alignment from the large set of alignments. We study the combinatorial nature of near-optimal alignments, and define a set of "canonical" near-optimal alignments. We then show how to enumerate near-optimal alignments efficiently in order of their score, and count their number. When applied to comparisons of two distantly related proteins, near-optimal alignments reveal that the most conserved regions among the near-optimal alignments are the highly structured regions in the proteins. We also show that by counting the number of near optimal alignments as a function of the distance from the optimal score, we can select a good set of parameters that best constraints the biologically relevant alignments.
Collapse
Affiliation(s)
- D Naor
- Department of Biochemistry, Stanford University School of Medicine, CA 94305-5307, USA
| | | |
Collapse
|
47
|
Abstract
Recent developments in the statistical analysis of DNA sequences are reviewed. The pace with which sequence data are being generated and analysed has increased with the growth of the human genome project. Two areas of activity are emphasized: attention to error rates in recorded sequences, and heterogeneity in structure of sequences. There is now empirical evidence suggesting error rates in the range 0.1%-1%, and such rates will affect evolutionary studies since these are about the rates at which DNA sequences from different individuals are expected to differ. Heterogeneity for such quantities as base composition, or lengths between successive subsequences of specified types, may be sufficient to account for observed long-range correlations between bases. The need for statistical models and analyses of DNA sequence data will continue, and will offer interesting challenges.
Collapse
Affiliation(s)
- B S Weir
- Department of Statistics, North Carolina State University, Raleigh 27695-8203
| |
Collapse
|