1
Abdullahi KB. Kabirian-based optinalysis: A conceptually grounded framework for symmetry/asymmetry, similarity/dissimilarity and identity/unidentity estimations in mathematical structures and biological sequences. MethodsX 2023; 11:102400. [PMID: 37928104 PMCID: PMC10622715 DOI: 10.1016/j.mex.2023.102400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Accepted: 09/24/2023] [Indexed: 11/07/2023] Open
Abstract
This paper introduces "Kabirian-based optinalysis (KBO)," a pioneering framework that addresses the longstanding challenges in estimating symmetry/asymmetry, similarity/dissimilarity, and identity/unidentity within mathematical structures and biological sequences. The existing methods often lack a strong theoretical foundation, leading to inconsistencies and limitations. Kabirian-based optinalysis draws inspiration from isomorphism and automorphism, providing a theoretically grounded framework that unifies estimation methodologies. It introduces the concept of optiscale, autoreflective pairing, isoreflective pairing, and others ensuring invariance and robustness under various mathematical transformations and establishing functional bijectivity for isomorphic or automorphic structures. This not only overcomes previous limitations but also offers precise and interpretable estimations. Additionally, the framework introduces "geometrical pairwise analysis" to improve sensitivity to position-specific and character-specific variations in biological sequences. This novel approach enhances the accuracy of sequence similarity assessments, surpassing the constraints of conventional methods. The novelty of this work extends beyond mathematics and biology, impacting diverse fields such as computer science, data analysis, pattern recognition, and evolutionary biology. 
Kabirian-based optinalysis presents a holistic and theoretically grounded solution that has the potential to revolutionize the analysis of complex structures and sequences, opening new horizons for interdisciplinary research.
• Inspired by automorphism and isomorphism, Kabirian-based optinalysis introduces a new paradigm-shifting and unified approach to estimations in mathematical structures and biological sequences with a solid conceptual and theoretical foundation.
• The GPA method enhances pairwise sequence similarity estimation by being sensitive to position-specific and character-specific variations and providing a comprehensive characterization of these features.
Affiliation(s)
- Kabir Bindawa Abdullahi
  - Department of Biology, Faculty of Natural and Applied Sciences, Umaru Musa Yar'adua University, P.M.B., Katsina, Katsina State 2218, Nigeria
2
Minkin I, Medvedev P. Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ. Nat Commun 2020; 11:6327. [PMID: 33303762 PMCID: PMC7728760 DOI: 10.1038/s41467-020-19777-8] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 09/28/2019] [Accepted: 10/29/2020] [Indexed: 11/29/2022] Open
Abstract
Multiple whole-genome alignment is a challenging problem in bioinformatics. Despite many successes, current methods are not able to keep up with the growing number, length, and complexity of assembled genomes, especially when computational resources are limited. Approaches based on compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks have potential for scalability, but current methods do not scale to mammalian genomes. We present an algorithm, SibeliaZ-LCB, for identifying collinear blocks in closely related genomes based on analysis of the de Bruijn graph. We further incorporate this into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows run-time improvements over other methods while maintaining accuracy. On sixteen recently-assembled strains of mice, SibeliaZ runs in under 16 hours on a single machine, while other tools did not run to completion for eight mice within a week. SibeliaZ makes a significant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms on a single machine. Multiple whole-genome alignment is a challenging problem in bioinformatics, especially when computational resources are limited. Here the authors present SibeliaZ, an algorithm and software based on analysis of de Bruijn graphs, which provides improved computational efficiency and scalability.
Affiliation(s)
- Ilia Minkin
  - Department of Computer Science and Engineering, The Pennsylvania State University, 506 Wartik Lab University Park, University Park, PA, 16802, USA
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, 506 Wartik Lab University Park, University Park, PA, 16802, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, 506 Wartik Lab University Park, University Park, PA, 16802, USA
  - Center for Computational Biology and Bioinformatics, The Pennsylvania State University, 506 Wartik Lab University Park, University Park, PA, 16802, USA
3
Vialle RA, Tamuri AU, Goldman N. Alignment Modulates Ancestral Sequence Reconstruction Accuracy. Mol Biol Evol 2019; 35:1783-1797. [PMID: 29618097 PMCID: PMC5995191 DOI: 10.1093/molbev/msy055] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Indexed: 12/17/2022] Open
Abstract
Accurate reconstruction of ancestral states is a critical evolutionary analysis when studying ancient proteins and comparing biochemical properties between parental or extinct species and their extant relatives. It relies on multiple sequence alignment (MSA) which may introduce biases, and it remains unknown how MSA methodological approaches impact ancestral sequence reconstruction (ASR). Here, we investigate how MSA methodology modulates ASR using a simulation study of various evolutionary scenarios. We evaluate the accuracy of ancestral protein sequence reconstruction for simulated data and compare reconstruction outcomes using different alignment methods. Our results reveal biases introduced not only by aligner algorithms and assumptions, but also tree topology and the rate of insertions and deletions. Under many conditions we find no substantial differences between the MSAs. However, increasing the difficulty for the aligners can significantly impact ASR. The MAFFT consistency aligners and PRANK variants exhibit the best performance, whereas FSA displays limited performance. We also discover a bias towards reconstructed sequences longer than the true ancestors, deriving from a preference for inferring insertions, in almost all MSA methodological approaches. In addition, we find measures of MSA quality generally correlate highly with reconstruction accuracy. Thus, we show MSA methodological differences can affect the quality of reconstructions and propose MSA methods should be selected with care to accurately determine ancestral states with confidence.
Affiliation(s)
- Ricardo Assunção Vialle
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
  - Department of Genetics and Molecular Biology, Laboratory of Human and Medical Genetics, Federal University of Pará, Belém, Pará, Brazil
- Asif U Tamuri
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - Research IT Services, University College London, London, United Kingdom
- Nick Goldman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, United Kingdom
4
Abstract
Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level between two or more genomes. It combines aspects of both colinear sequence alignment and gene orthology prediction and is typically more challenging to address than either of these tasks due to the size and complexity of whole genomes. Despite the difficulty of this problem, numerous methods have been developed for its solution because WGAs are valuable for genome-wide analyses such as phylogenetic inference, genome annotation, and function prediction. In this chapter, we discuss the meaning and significance of WGA and present an overview of the methods that address it. We also examine the problem of evaluating whole-genome aligners and offer a set of methodological challenges that need to be tackled in order to make most effective use of our rapidly growing databases of whole genomes.
Affiliation(s)
- Colin N Dewey
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
5
Shim H, Larget B. BayesCAT: Bayesian co-estimation of alignment and tree. Biometrics 2017; 74:270-279. [DOI: 10.1111/biom.12640] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 12/01/2014] [Revised: 06/01/2016] [Accepted: 06/01/2016] [Indexed: 11/30/2022]
Affiliation(s)
- Heejung Shim
  - Department of Statistics; Purdue University; West Lafayette Indiana U.S.A
- Bret Larget
  - Departments of Statistics and of Botany; University of Wisconsin; Madison Wisconsin U.S.A
6
Ye Y, Lam TW, Ting HF. PnpProbs: a better multiple sequence alignment tool by better handling of guide trees. BMC Bioinformatics 2016; 17 Suppl 8:285. [PMID: 27585754 PMCID: PMC5009527 DOI: 10.1186/s12859-016-1121-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: This paper describes a new MSA tool called PnpProbs, which constructs better multiple sequence alignments by better handling of guide trees. It classifies sequences into two types: normally related and distantly related. For normally related sequences, it uses an adaptive approach to construct the guide tree needed for progressive alignment; it first estimates the input's discrepancy by computing the standard deviation of their percent identities, and based on this estimate, it chooses the better method to construct the guide tree. For distantly related sequences, PnpProbs abandons the guide tree and uses instead some non-progressive alignment method to generate the alignment. RESULTS: To evaluate PnpProbs, we have compared it with thirteen other popular MSA tools, and PnpProbs has the best alignment scores in all but one test. We have also used it for phylogenetic analysis, and found that the phylogenetic trees constructed from PnpProbs' alignments are closest to the model trees. CONCLUSIONS: By combining the strength of the progressive and non-progressive alignment methods, we have developed an MSA tool called PnpProbs. We have compared PnpProbs with thirteen other popular MSA tools and our results showed that our tool usually constructed the best alignments.
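The adaptive step in this abstract (measure the spread of pairwise percent identities, then pick a guide-tree method) can be illustrated with a minimal sketch. This is not the PnpProbs implementation: the function names, the use of pre-aligned equal-length strings, the population standard deviation, the threshold value, and the two method labels are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import pstdev


def percent_identity(a: str, b: str) -> float:
    """Percent of columns, ignoring any column with a gap, where two aligned strings match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def identity_spread(aligned: list[str]) -> float:
    """Population standard deviation of all pairwise percent identities."""
    identities = [percent_identity(a, b) for a, b in combinations(aligned, 2)]
    return pstdev(identities)


def choose_guide_tree_method(aligned: list[str], threshold: float = 10.0) -> str:
    """Pick one of two hypothetical guide-tree methods from the identity spread.

    The threshold and the labels "method_A" / "method_B" are placeholders,
    not values taken from the paper.
    """
    return "method_A" if identity_spread(aligned) < threshold else "method_B"
```

For three toy sequences `["ACGT", "ACGT", "ACGA"]` the pairwise identities are 100, 75 and 75 percent, so the spread is non-zero and the choice depends on where the assumed threshold sits.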
Affiliation(s)
- Yongtao Ye
  - HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
- Tak-Wah Lam
  - HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
- Hing-Fung Ting
  - HKU-BGI Bioinformatics Algorithms & Core Technology Research Lab, Computer Science Department, University of Hong Kong, Hong Kong, China
7
Barquist L, Burge SW, Gardner PP. Studying RNA Homology and Conservation with Infernal: From Single Sequences to RNA Families. Current Protocols in Bioinformatics 2016; 54:12.13.1-12.13.25. [PMID: 27322404 PMCID: PMC5010141 DOI: 10.1002/cpbi.4] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
Abstract
Emerging high-throughput technologies have led to a deluge of putative non-coding RNA (ncRNA) sequences identified in a wide variety of organisms. Systematic characterization of these transcripts will be a tremendous challenge. Homology detection is critical to making maximal use of functional information gathered about ncRNAs: identifying homologous sequence allows us to transfer information gathered in one organism to another quickly and with a high degree of confidence. ncRNA presents a challenge for homology detection, as the primary sequence is often poorly conserved and de novo secondary structure prediction and search remain difficult. This unit introduces methods developed by the Rfam database for identifying "families" of homologous ncRNAs starting from single "seed" sequences, using manually curated sequence alignments to build powerful statistical models of sequence and structure conservation known as covariance models (CMs), implemented in the Infernal software package. We provide a step-by-step iterative protocol for identifying ncRNA homologs and then constructing an alignment and corresponding CM. We also work through an example for the bacterial small RNA MicA, discovering a previously unreported family of divergent MicA homologs in genus Xenorhabdus in the process. © 2016 by John Wiley & Sons, Inc.
Affiliation(s)
- Lars Barquist
  - Institute for Molecular Infection Biology, University of Würzburg, Würzburg, D-97080 Germany
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
- Sarah W. Burge
  - Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA United Kingdom; Fax: +44 (0)1223 494919
- Paul P. Gardner
  - School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
  - Biomolecular Interaction Centre, University of Canterbury, Private Bag 4800, Christchurch, New Zealand
8
Katoh K, Standley DM. A simple method to control over-alignment in the MAFFT multiple sequence alignment program. Bioinformatics 2016; 32:1933-42. [PMID: 27153688 PMCID: PMC4920119 DOI: 10.1093/bioinformatics/btw108] [Citation(s) in RCA: 318] [Impact Index Per Article: 39.8] [Received: 10/05/2015] [Accepted: 02/19/2016] [Indexed: 12/17/2022] Open
Abstract
Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment is recently becoming greater, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kazutaka Katoh
  - Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
- Daron M Standley
  - Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
  - Institute for Virus Research, Kyoto University, Kyoto 606-8507, Japan
9
Ip CL, Loose M, Tyson JR, de Cesare M, Brown BL, Jain M, Leggett RM, Eccles DA, Zalunin V, Urban JM, Piazza P, Bowden RJ, Paten B, Mwaigwisya S, Batty EM, Simpson JT, Snutch TP, Birney E, Buck D, Goodwin S, Jansen HJ, O'Grady J, Olsen HE. MinION Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Res 2015; 4:1075. [PMID: 26834992 PMCID: PMC4722697 DOI: 10.12688/f1000research.7201.1] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Accepted: 10/09/2015] [Indexed: 11/20/2022] Open
Abstract
The advent of a miniaturized DNA sequencing device with a high-throughput contextual sequencing capability embodies the next generation of large scale sequencing tools. The MinION™ Access Programme (MAP) was initiated by Oxford Nanopore Technologies™ in April 2014, giving public access to their USB-attached miniature sequencing device. The MinION Analysis and Reference Consortium (MARC) was formed by a subset of MAP participants, with the aim of evaluating and providing standard protocols and reference data to the community. Envisaged as a multi-phased project, this study provides the global community with the Phase 1 data from MARC, where the reproducibility of the performance of the MinION was evaluated at multiple sites. Five laboratories on two continents generated data using a control strain of Escherichia coli K-12, preparing and sequencing samples according to a revised ONT protocol. Here, we provide the details of the protocol used, along with a preliminary analysis of the characteristics of typical runs including the consistency, rate, volume and quality of data produced. Further analysis of the Phase 1 data presented here, and additional experiments in Phase 2 of E. coli from MARC are already underway to identify ways to improve and enhance MinION performance.
Affiliation(s)
- Camilla L.C. Ip
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Matthew Loose
  - School of Life Sciences, Queens Medical Centre, University of Nottingham, Nottingham, UK
- John R. Tyson
  - Michael Smith Laboratories and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, Canada
- Miten Jain
  - University of California, Santa Cruz, Santa Cruz, CA, USA
- David A. Eccles
  - Malaghan Institute of Medical Research, Wellington, New Zealand
- Vadim Zalunin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
- John M. Urban
  - Division of Biology and Medicine, Brown University, Providence, RI, USA
- Paolo Piazza
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Rory J. Bowden
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Benedict Paten
  - University of California, Santa Cruz, Santa Cruz, CA, USA
- Elizabeth M. Batty
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Jared T. Simpson
  - Informatics and Biocomputing, Ontario Institute for Cancer Research, ON, Canada
- Terrance P. Snutch
  - Michael Smith Laboratories and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, Canada
- Ewan Birney
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
- David Buck
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Sara Goodwin
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Justin O'Grady
  - Norwich Medical School, University of East Anglia, Norwich, UK
- Hugh E. Olsen
  - University of California, Santa Cruz, Santa Cruz, CA, USA
- MinION Analysis and Reference Consortium
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
  - School of Life Sciences, Queens Medical Centre, University of Nottingham, Nottingham, UK
  - Michael Smith Laboratories and Djavad Mowafaghian Centre for Brain Health, University of British Columbia, Vancouver, Canada
  - Virginia Commonwealth University, Richmond, VA, USA
  - University of California, Santa Cruz, Santa Cruz, CA, USA
  - The Genome Analysis Centre, Norwich Research Park, Norwich, UK
  - Malaghan Institute of Medical Research, Wellington, New Zealand
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
  - Division of Biology and Medicine, Brown University, Providence, RI, USA
  - Norwich Medical School, University of East Anglia, Norwich, UK
  - Informatics and Biocomputing, Ontario Institute for Cancer Research, ON, Canada
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
  - ZF-screens B.V., Leiden, Netherlands
10
Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J. Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 2015; 16:108. [PMID: 25888064 PMCID: PMC4395974 DOI: 10.1186/s12859-015-0516-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/24/2014] [Accepted: 02/24/2015] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign.
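To make the central idea of this abstract concrete, the following minimal sketch builds a DAG whose nodes are alignment columns from a few sampled alignments, estimates each column's empirical frequency, and counts the start-to-end paths. It is an illustration under stated assumptions, not the WeaveAlign implementation: the column encoding (a tuple of residue indices, with `None` for a gap), the `START`/`END` sentinels, and all function names are hypothetical.

```python
from collections import Counter, defaultdict
from functools import lru_cache

START, END = "START", "END"  # hypothetical sentinel nodes


def build_alignment_dag(samples):
    """Merge sampled alignments (each a list of hashable columns) into one graph.

    Returns the successor sets and the fraction of samples containing each column.
    """
    edges = defaultdict(set)
    counts = Counter()
    for columns in samples:
        path = [START, *columns, END]
        for u, v in zip(path, path[1:]):
            edges[u].add(v)
        counts.update(set(columns))
    freqs = {col: n / len(samples) for col, n in counts.items()}
    return edges, freqs


def count_paths(edges):
    """Number of distinct START-to-END paths, i.e. alignments encoded by the graph."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == END:
            return 1
        return sum(paths_from(nxt) for nxt in edges.get(node, ()))

    return paths_from(START)


# Two sampled alignments of two length-4 sequences; each column is
# (index in sequence X, index in sequence Y), with None marking a gap.
sample_a = [(0, 0), (1, 1), (2, 2), (3, 3)]
sample_b = [(0, 0), (1, None), (None, 1), (2, 2), (3, None), (None, 3)]

edges, freqs = build_alignment_dag([sample_a, sample_b])
print(count_paths(edges))  # paths encoded by the merged graph
print(freqs[(1, 1)])       # empirical frequency of one column
```

Because the two samples share the columns `(0, 0)` and `(2, 2)`, the merged graph contains paths that mix segments from both samples, which is the sense in which the graph encodes more alignments than were sampled.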
Affiliation(s)
- Joseph L Herman
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK
  - Division of Mathematical Biology, National Institute of Medical Research, The Ridgeway, London, NW7 1AA, UK
- Ádám Novák
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK
- Rune Lyngsø
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK
- Adrienn Szabó
  - Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary
- István Miklós
  - Institute of Computer Science and Control, Hungarian Academy of Sciences, Lagymanyosi u. 11., Budapest, 1111, Hungary
  - Department of Stochastics, Rényi Institute, Reáltanoda u. 13-15, Budapest, 1053, Hungary
- Jotun Hein
  - Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, UK
11
Jain M, Fiddes IT, Miga KH, Olsen HE, Paten B, Akeson M. Improved data analysis for the MinION nanopore sequencer. Nat Methods 2015; 12:351-6. [PMID: 25686389 PMCID: PMC4907500 DOI: 10.1038/nmeth.3290] [Citation(s) in RCA: 377] [Impact Index Per Article: 41.9] [Received: 12/12/2014] [Accepted: 01/20/2015] [Indexed: 12/31/2022]
Abstract
The Oxford Nanopore MinION sequences individual DNA molecules using an array of pores that read nucleotide identities based on ionic current steps. We evaluated and optimized MinION performance using M13 genomic dsDNA. Using expectation-maximization (EM) we obtained robust maximum likelihood (ML) estimates for read insertion, deletion and substitution error rates (4.9%, 7.8%, and 5.1% respectively). We found that 99% of high-quality ‘2D’ MinION reads mapped to reference at a mean identity of 85%. We present a MinION-tailored tool for single nucleotide variant (SNV) detection that uses ML parameter estimates and marginalization over many possible read alignments to achieve precision and recall of up to 99%. By pairing our high-confidence alignment strategy with long MinION reads, we resolved the copy number for a cancer/testis gene family (CT47) within an unresolved region of human chromosome Xq24.
Affiliation(s)
- Miten Jain
  - UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Ian T Fiddes
  - UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Karen H Miga
  - UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Hugh E Olsen
  - UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Mark Akeson
  - UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
  - Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
12
Nánási M, Vinař T, Brejová B. Probabilistic approaches to alignment with tandem repeats. Algorithms Mol Biol 2014; 9:3. [PMID: 24580741 PMCID: PMC3975930 DOI: 10.1186/1748-7188-9-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/09/2013] [Accepted: 02/24/2014] [Indexed: 11/16/2022] Open
Abstract
Background: Short tandem repeats are ubiquitous in genomic sequences and due to their complex evolutionary history pose a challenge for sequence alignment tools. Results: To better account for the presence of tandem repeats in pairwise sequence alignments, we propose a simple tractable pair hidden Markov model that explicitly models their presence. Using the framework of gain functions, we design several optimization criteria for decoding this model and describe resulting decoding algorithms, ranging from the traditional Viterbi and posterior decoding to block-based decoding algorithms tailored to our model. We compare the accuracy of individual decoding algorithms on simulated and real data and find that our approach is superior to the classical three-state pair HMM. Conclusions: Our study illustrates versatility of pair hidden Markov models coupled with appropriate decoding criteria as a modeling tool for capturing complex sequence features.
13
Sahraeian SME, Yoon BJ. PicXAA: a probabilistic scheme for finding the maximum expected accuracy alignment of multiple biological sequences. Methods Mol Biol 2014; 1079:203-210. [PMID: 24170404 DOI: 10.1007/978-1-62703-646-7_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
PicXAA is a probabilistic nonprogressive alignment algorithm that finds protein (or DNA) multiple sequence alignments with maximum expected accuracy. PicXAA greedily builds up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures the local similarities across sequences. PicXAA constantly yields accurate alignment results on a wide range of reference sets that have different characteristics, with especially remarkable improvements over other leading algorithms on sequence sets with high local similarities. In this chapter, we describe the overall alignment strategy used in PicXAA and discuss several important considerations for effective deployment of the algorithm.
14
Roozgard A, Barzigar N, Wang S, Jiang X, Cheng S. Empirical Transition Probability Indexing Sparse-Coding Belief Propagation (ETPI-SCoBeP) Genome Sequence Alignment. Cancer Inform 2014; 13:159-65. [PMID: 25983537 PMCID: PMC4426956 DOI: 10.4137/cin.s13887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2014] [Revised: 10/09/2014] [Accepted: 10/10/2014] [Indexed: 11/29/2022] Open
Abstract
The advance in human genome sequencing technology has significantly reduced the cost of data generation and overwhelms the computing capability of sequence analysis. Efficiency, efficacy, and scalability remain challenging in sequence alignment, which is an important and foundational operation for genome data analysis. In this paper, we propose a two-stage approach to tackle this problem. In the preprocessing step, we match blocks of reference and target sequences based on the similarities between their empirical transition probability distributions using belief propagation. We then conduct a refined match using our recently published sparse-coding belief propagation (SCoBeP) technique. Our experimental results demonstrated robustness in nucleotide sequence alignment, and our results are competitive to those of the SOAP aligner and the BWA algorithm. Moreover, compared to SCoBeP alignment, the proposed technique can handle sequences of much longer lengths.
Affiliation(s)
- Aminmohammad Roozgard
  - School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, USA
- Nafise Barzigar
  - School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, USA
- Shuang Wang
  - Division of Biomedical Informatics, University of California, San Diego, CA, USA
- Xiaoqian Jiang
  - Division of Biomedical Informatics, University of California, San Diego, CA, USA
- Samuel Cheng
  - School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, USA
15
Fernandes CA, Comparetti EJ, Borges RJ, Huancahuire-Vega S, Ponce-Soto LA, Marangoni S, Soares AM, Fontes MR. Structural bases for a complete myotoxic mechanism: Crystal structures of two non-catalytic phospholipases A2-like from Bothrops brazili venom. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2013; 1834:2772-81. [DOI: 10.1016/j.bbapap.2013.10.009] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 09/06/2013] [Revised: 10/07/2013] [Accepted: 10/12/2013] [Indexed: 11/16/2022]
|
16
|
Vieira LF, Magro AJ, Fernandes CA, de Souza BM, Cavalcante WL, Palma MS, Rosa JC, Fuly AL, Fontes MR, Gallacci M, Butzke DS, Calderon LA, Stábeli RG, Giglio JR, Soares AM. Biochemical, functional, structural and phylogenetic studies on Intercro, a new isoform phospholipase A2 from Crotalus durissus terrificus snake venom. Biochimie 2013; 95:2365-75. [DOI: 10.1016/j.biochi.2013.08.028] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/10/2012] [Accepted: 08/25/2013] [Indexed: 10/26/2022]
|
17
|
Cho SJ, Vallès Y, Weisblat DA. Differential expression of conserved germ line markers and delayed segregation of male and female primordial germ cells in a hermaphrodite, the leech Helobdella. Mol Biol Evol 2013; 31:341-54. [PMID: 24217283 PMCID: PMC3907050 DOI: 10.1093/molbev/mst201] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 01/16/2023]
Abstract
In sexually reproducing animals, primordial germ cells (PGCs) are often set aside early in embryogenesis, a strategy that minimizes the risk of genomic damage associated with replication and mitosis during the cell cycle. Here, we have used germ line markers (piwi, vasa, and nanos) and microinjected cell lineage tracers to show that PGC specification in the leech genus Helobdella follows a different scenario: in this hermaphrodite, the male and female PGCs segregate from somatic lineages only after more than 20 rounds of zygotic mitosis; the male and female PGCs share the same (mesodermal) cell lineage for 19 rounds of zygotic mitosis. Moreover, while all three markers are expressed in both male and female reproductive tissues of the adult, they are expressed differentially between the male and female PGCs of the developing embryo: piwi and vasa are expressed preferentially in female PGCs at a time when nanos is expressed preferentially in male PGCs. A priori, the delayed segregation of male and female PGCs from somatic tissues and from one another increases the probability of mutations affecting both male and female PGCs of a given individual. We speculate that this suite of features, combined with a capacity for self-fertilization, may contribute to the dramatically rearranged genome of Helobdella robusta relative to other animals.
Affiliation(s)
- Sung-Jin Cho
- Department of Molecular and Cell Biology, LSA, University of California, Berkeley
|
18
|
Heuristic alignment methods. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2013; 1079:29-43. [PMID: 24170393 DOI: 10.1007/978-1-62703-646-7_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
Abstract
Computation of multiple sequence alignment (MSA) is usually formulated as a combinatorial optimization problem of an objective function. Solving the problem for virtually all sensible objective functions is known to be NP-complete, implying that some heuristics must be adopted. Several general strategies have proven effective for obtaining accurate MSAs at reasonable computational cost. This chapter is devoted to a brief summary of the most successful heuristic approaches.
|
19
|
Baumler DJ, Ma B, Reed JL, Perna NT. Inferring ancient metabolism using ancestral core metabolic models of enterobacteria. BMC SYSTEMS BIOLOGY 2013; 7:46. [PMID: 23758866 PMCID: PMC3694032 DOI: 10.1186/1752-0509-7-46] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 12/03/2012] [Accepted: 06/06/2013] [Indexed: 11/30/2022]
Abstract
Background: Enterobacteriaceae diversified from an ancestral lineage ~300-500 million years ago (mya) into a wide variety of free-living and host-associated lifestyles. Nutrient availability varies across niches, and evolution of metabolic networks likely played a key role in adaptation. Results: Here we use a paleo systems biology approach to reconstruct and model metabolic networks of ancestral nodes of the enterobacteria phylogeny to investigate metabolism of ancient microorganisms and evolution of the networks. Specifically, we identified orthologous genes across genomes of 72 free-living enterobacteria (16 genera), and constructed core metabolic networks capturing conserved components for ancestral lineages leading to E. coli/Shigella (~10 mya), E. coli/Shigella/Salmonella (~100 mya), and all enterobacteria (~300-500 mya). Using these models we analyzed the capacity for carbon, nitrogen, phosphorus, sulfur, and iron utilization in aerobic and anaerobic conditions, identified conserved and differentiating catabolic phenotypes, and validated predictions by comparison to experimental data from extant organisms. Conclusions: This is a novel approach using quantitative ancestral models to study metabolic network evolution and may be useful for identification of new targets to control infectious diseases caused by enterobacteria.
Affiliation(s)
- David J Baumler
- Genome Center of Wisconsin, University of Wisconsin-Madison, Madison, Wisconsin, USA.
|
20
|
Anderson JWJ, Novák Á, Sükösd Z, Golden M, Arunapuram P, Edvardsson I, Hein J. Quantifying variances in comparative RNA secondary structure prediction. BMC Bioinformatics 2013; 14:149. [PMID: 23634662 PMCID: PMC3667108 DOI: 10.1186/1471-2105-14-149] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 11/30/2012] [Accepted: 03/21/2013] [Indexed: 11/11/2022]
Abstract
Background: With the advancement of next-generation sequencing and transcriptomics technologies, regulatory effects involving RNA, in particular RNA structural changes, are being detected. These results often rely on RNA secondary structure predictions. However, current approaches to RNA secondary structure modelling produce predictions with a high variance in predictive accuracy, and we have little quantifiable knowledge about the reasons for these variances. Results: In this paper we explore a number of factors which can contribute to poor RNA secondary structure prediction quality. We establish a quantified relationship between alignment quality and loss of accuracy. Furthermore, we define two new measures to quantify uncertainty in alignment-based structure predictions. One of the measures improves on the “reliability score” reported by PPfold, and considers alignment uncertainty as well as base-pair probabilities. The other measure considers the information entropy for SCFGs over a space of input alignments. Conclusions: Our predictive accuracy improves on the PPfold reliability score. We can successfully characterize many of the underlying reasons for and variances in poor prediction. However, there is still variability unaccounted for, which we therefore suggest comes from the RNA secondary structure predictive model itself.
|
21
|
Salvador GHM, Fernandes CAH, Magro AJ, Marchi-Salvador DP, Cavalcante WLG, Fernandez RM, Gallacci M, Soares AM, Oliveira CLP, Fontes MRM. Structural and phylogenetic studies with MjTX-I reveal a multi-oligomeric toxin--a novel feature in Lys49-PLA2s protein class. PLoS One 2013; 8:e60610. [PMID: 23573271 PMCID: PMC3616104 DOI: 10.1371/journal.pone.0060610] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 12/18/2012] [Accepted: 02/28/2013] [Indexed: 11/19/2022]
Abstract
The mortality caused by snakebites is more damaging than many tropical diseases, such as dengue haemorrhagic fever, cholera, leishmaniasis, schistosomiasis and Chagas disease. For this reason, snakebite envenoming adversely affects health services of tropical and subtropical countries and is recognized as a neglected disease by the World Health Organization. One of the main components of snake venoms is the Lys49-phospholipases A2, which is catalytically inactive but possesses other toxic and pharmacological activities. Preliminary studies with MjTX-I from Bothrops moojeni snake venom revealed intriguing new structural and functional characteristics compared to other bothropic Lys49-PLA2s. We present in this article a comprehensive study with MjTX-I using several techniques, including crystallography, small angle X-ray scattering, analytical size-exclusion chromatography, dynamic light scattering, myographic studies, bioinformatics and molecular phylogenetic analyses. Based on all these experiments, we demonstrated that MjTX-I is probably a unique Lys49-PLA2, which may adopt different oligomeric forms depending on the physical-chemical environment. Furthermore, we showed that its myotoxic activity is dramatically low compared to other Lys49-PLA2s, probably due to the novel oligomeric conformations and important mutations in the C-terminal region of the protein. The phylogenetic analysis also showed that this toxin is clearly distinct from other bothropic Lys49-PLA2s, in conformity with the peculiar oligomeric characteristics of MjTX-I and possible emergence of new functionalities in response to environmental changes and adaptation to new prey.
Affiliation(s)
- Guilherme H. M. Salvador
- Depto. de Física e Biofísica, Instituto de Biociências, Universidade Estadual Paulista–UNESP, Botucatu, SP, Brazil
- Carlos A. H. Fernandes
- Depto. de Física e Biofísica, Instituto de Biociências, Universidade Estadual Paulista–UNESP, Botucatu, SP, Brazil
- Angelo J. Magro
- Depto. de Física e Biofísica, Instituto de Biociências, Universidade Estadual Paulista–UNESP, Botucatu, SP, Brazil
- Daniela P. Marchi-Salvador
- Depto. de Física e Biofísica, Instituto de Biociências, Universidade Estadual Paulista–UNESP, Botucatu, SP, Brazil
- Walter L. G. Cavalcante
- Depto. de Física e Biofísica, Instituto de Biociências, Universidade Estadual Paulista–UNESP, Botucatu, SP, Brazil
- Depto. de Farmacologia, Universidade Estadual Paulista – UNESP, Botucatu, SP, Brazil
- Roberto M. Fernandez
- Depto. de Física e Biofísica, Instituto de Biociências, Universidade Estadual Paulista–UNESP, Botucatu, SP, Brazil
- Márcia Gallacci
- Depto. de Farmacologia, Universidade Estadual Paulista – UNESP, Botucatu, SP, Brazil
- Andreimar M. Soares
- Fundação Oswaldo Cruz – FIOCRUZ Rondônia and Centro de Estudos de Biomoléculas Aplicadas – CEBio, Universidade Federal de Rondônia – UNIR, Porto Velho, RO, Brazil
- Cristiano L. P. Oliveira
- Depto. de Física Experimental, Instituto de Física, Universidade de São Paulo – USP, São Paulo, SP, Brazil
- Marcos R. M. Fontes
- Depto. de Física e Biofísica, Instituto de Biociências, Universidade Estadual Paulista–UNESP, Botucatu, SP, Brazil
|
22
|
Abstract
We address the problem of the joint statistical inference of phylogenetic trees and multiple sequence alignments from unaligned molecular sequences. This problem is generally formulated in terms of string-valued evolutionary processes along the branches of a phylogenetic tree. The classic evolutionary process, the TKF91 model [Thorne JL, Kishino H, Felsenstein J (1991) J Mol Evol 33(2):114-124] is a continuous-time Markov chain model composed of insertion, deletion, and substitution events. Unfortunately, this model gives rise to an intractable computational problem: The computation of the marginal likelihood under the TKF91 model is exponential in the number of taxa. In this work, we present a stochastic process, the Poisson Indel Process (PIP), in which the complexity of this computation is reduced to linear. The Poisson Indel Process is closely related to the TKF91 model, differing only in its treatment of insertions, but it has a global characterization as a Poisson process on the phylogeny. Standard results for Poisson processes allow key computations to be decoupled, which yields the favorable computational profile of inference under the PIP model. We present illustrative experiments in which Bayesian inference under the PIP model is compared with separate inference of phylogenies and alignments.
Affiliation(s)
- Alexandre Bouchard-Côté
- Department of Statistics, University of British Columbia, Vancouver, BC, Canada V6T 1Z4
- Michael I. Jordan
- Departments of Statistics and Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
|
23
|
Hamada M, Asai K. A classification of bioinformatics algorithms from the viewpoint of maximizing expected accuracy (MEA). J Comput Biol 2012; 19:532-49. [PMID: 22313125 DOI: 10.1089/cmb.2011.0197] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
Many estimation problems in bioinformatics are formulated as point estimation problems in a high-dimensional discrete space. In general, it is difficult to design reliable estimators for this type of problem, because the number of possible solutions is immense, which leads to an extremely low probability for every solution, even for the one with the highest probability. Therefore, maximum score and maximum likelihood estimators do not work well in this situation although they are widely employed in a number of applications. Maximizing expected accuracy (MEA) estimation, in which accuracy measures of the target problem and the entire distribution of solutions are considered, is a more successful approach. In this review, we provide an extensive discussion of algorithms and software based on MEA. We describe how a number of algorithms used in previous studies can be classified from the viewpoint of MEA. We believe that this review will be useful not only for users wishing to utilize software to solve the estimation problems appearing in this article, but also for developers wishing to design algorithms on the basis of MEA.
Affiliation(s)
- Michiaki Hamada
- Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan.
|
24
|
Wu M, Chatterji S, Eisen JA. Accounting for alignment uncertainty in phylogenomics. PLoS One 2012; 7:e30288. [PMID: 22272325 PMCID: PMC3260272 DOI: 10.1371/journal.pone.0030288] [Citation(s) in RCA: 127] [Impact Index Per Article: 10.6] [Received: 10/08/2011] [Accepted: 12/14/2011] [Indexed: 01/12/2023]
Abstract
Uncertainty in multiple sequence alignments has a large impact on phylogenetic analyses. Little has been done to evaluate the quality of individual positions in protein sequence alignments, which directly impact the accuracy of phylogenetic trees. Here we describe ZORRO, a probabilistic masking program that accounts for alignment uncertainty by assigning confidence scores to each alignment position. Using the BALIBASE database and in simulation studies, we demonstrate that masking by ZORRO significantly reduces the alignment uncertainty and improves the tree accuracy.
Affiliation(s)
- Martin Wu
- Department of Biology, University of Virginia, Charlottesville, Virginia, United States of America.
|
25
|
Löytynoja A. Alignment methods: strategies, challenges, benchmarking, and comparative overview. Methods Mol Biol 2012; 855:203-35. [PMID: 22407710 DOI: 10.1007/978-1-61779-582-4_7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 01/27/2023]
Abstract
Comparative evolutionary analyses of molecular sequences are solely based on the identities and differences detected between homologous characters. Errors in this homology statement, that is errors in the alignment of the sequences, are likely to lead to errors in the downstream analyses. Sequence alignment and phylogenetic inference are tightly connected and many popular alignment programs use the phylogeny to divide the alignment problem into smaller tasks. They then neglect the phylogenetic tree, however, and produce alignments that are not evolutionarily meaningful. The use of phylogeny-aware methods reduces the error but the resulting alignments, with evolutionarily correct representation of homology, can challenge the existing practices and methods for viewing and visualising the sequences. The inter-dependency of alignment and phylogeny can be resolved by joint estimation of the two; methods based on statistical models allow for inferring the alignment parameters from the data and correctly take into account the uncertainty of the solution but remain computationally challenging. Widely used alignment methods are based on heuristic algorithms and unlikely to find globally optimal solutions. The whole concept of one correct alignment for the sequences is questionable, however, as there typically exist vast numbers of alternative, roughly equally good alignments that should also be considered. This uncertainty is hidden by many popular alignment programs and is rarely correctly taken into account in the downstream analyses. The quest for finding and improving the alignment solution is complicated by the lack of suitable measures of alignment goodness. The difficulty of comparing alternative solutions also affects benchmarks of alignment methods and the results strongly depend on the measure used. As the effects of alignment error cannot be predicted, comparing the alignments' performance in downstream analyses is recommended.
Affiliation(s)
- Ari Löytynoja
- European Bioinformatics Institute (EMBL), Hinxton, UK.
|
26
|
Wang LS, Leebens-Mack J, Kerr Wall P, Beckmann K, dePamphilis CW, Warnow T. The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1108-1119. [PMID: 21566256 DOI: 10.1109/tcbb.2009.68] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Indexed: 05/30/2023]
Abstract
Multiple sequence alignment is typically the first step in estimating phylogenetic trees, with the assumption being that as alignments improve, so will phylogenetic reconstructions. Over the last decade or so, new multiple sequence alignment methods have been developed to improve comparative analyses of protein structure, but these new methods have not been typically used in phylogenetic analyses. In this paper, we report on a simulation study that we performed to evaluate the consequences of using these new multiple sequence alignment methods in terms of the resultant phylogenetic reconstruction. We find that while alignment accuracy is positively correlated with phylogenetic accuracy, the amount of improvement in phylogenetic estimation that results from an improved alignment can range from quite small to substantial. We observe that phylogenetic accuracy is most highly correlated with alignment accuracy when sequences are most difficult to align, and that variation in alignment accuracy can have little impact on phylogenetic accuracy when alignment error rates are generally low. We discuss these observations and implications for future work.
Affiliation(s)
- Li-San Wang
- Department of Pathology and Laboratory Medicine and Penn Center for Bioinformatics, 1424 Blockley Hall, 423 Guardian Drive, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
27
|
Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D. Cactus: Algorithms for genome multiple sequence alignment. Genome Res 2011; 21:1512-28. [PMID: 21665927 DOI: 10.1101/gr.123356.111] [Citation(s) in RCA: 162] [Impact Index Per Article: 12.5] [Indexed: 11/24/2022]
Abstract
Much attention has been given to the problem of creating reliable multiple sequence alignments in a model incorporating substitutions, insertions, and deletions. Far less attention has been paid to the problem of optimizing alignments in the presence of more general rearrangement and copy number variation. Using Cactus graphs, recently introduced for representing sequence alignments, we describe two complementary algorithms for creating genomic alignments. We have implemented these algorithms in the new "Cactus" alignment program. We test Cactus using the Evolver genome evolution simulator, a comprehensive new tool for simulation, and show using these and existing simulations that Cactus significantly outperforms all of its peers. Finally, we make an empirical assessment of Cactus's ability to properly align genes and find interesting cases of intra-gene duplication within the primates.
Affiliation(s)
- Benedict Paten
- Center for Biomolecular Science and Engineering, University of California-Santa Cruz, CA 95064, USA.
|
28
|
Roskin KM, Paten B, Haussler D. Meta-alignment with crumble and prune: partitioning very large alignment problems for performance and parallelization. BMC Bioinformatics 2011; 12:144. [PMID: 21569267 PMCID: PMC3114744 DOI: 10.1186/1471-2105-12-144] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 10/27/2010] [Accepted: 05/10/2011] [Indexed: 11/10/2022]
Abstract
Background: Continuing research into the global multiple sequence alignment problem has resulted in more sophisticated and principled alignment methods. Unfortunately these new algorithms often require large amounts of time and memory to run, making it nearly impossible to run these algorithms on large datasets. As a solution, we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, more tractable sub-problems. We call Crumble and Prune meta-alignment methods because they use existing alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied together to provide better scaling in terms of sequence length and in sequence depth. Both methods partition the problem such that many of the sub-problems can be solved independently. The results are then combined to form a solution to the full alignment problem. Results: Crumble and Prune each provide a significant performance improvement with little loss of accuracy. In some cases, a gain in accuracy was observed. Crumble and Prune were tested on real and simulated data. Furthermore, we have implemented a system called Job-tree that allows hierarchical sub-problems to be solved in parallel on a compute cluster, significantly shortening the run-time. Conclusions: These methods enabled us to solve gigabase alignment problems. These methods could enable a new generation of biologically realistic alignment algorithms to be applied to real world, large scale alignment problems.
Affiliation(s)
- Krishna M Roskin
- Department of Computer Science, Univ. of California, Santa Cruz, USA.
|
29
|
Hudek AK, Brown DG. FEAST: sensitive local alignment with multiple rates of evolution. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:698-709. [PMID: 20733242 DOI: 10.1109/tcbb.2010.76] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/29/2023]
Abstract
We present a pairwise local aligner, FEAST, which uses two new techniques: a sensitive extension algorithm for identifying homologous subsequences, and a descriptive probabilistic alignment model. We also present a new procedure for training alignment parameters and apply it to the human and mouse genomes, producing a better parameter set for these sequences. Our extension algorithm identifies homologous subsequences by considering all evolutionary histories. It has higher maximum sensitivity than Viterbi extensions, and better balances specificity. We model alignments with several submodels, each with unique statistical properties, describing strongly similar and weakly similar regions of homologous DNA. Training parameters using two submodels produces superior alignments, even when we align with only the parameters from the weaker submodel. Our extension algorithm combined with our new parameter set achieves sensitivity 0.59 on synthetic tests. In contrast, LASTZ with default settings achieves sensitivity 0.35 with the same false positive rate. Using the weak submodel as parameters for LASTZ increases its sensitivity to 0.59 with high error. FEAST is available at http://monod.uwaterloo.ca/feast/.
Affiliation(s)
- Alexander K Hudek
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada.
|
30
|
Sahraeian SME, Yoon BJ. PicXAA-Web: a web-based platform for non-progressive maximum expected accuracy alignment of multiple biological sequences. Nucleic Acids Res 2011; 39:W8-12. [PMID: 21515632 PMCID: PMC3125727 DOI: 10.1093/nar/gkr244] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 01/16/2023]
Abstract
In this article, we introduce PicXAA-Web, a web-based platform for accurate probabilistic alignment of multiple biological sequences. The core of PicXAA-Web consists of PicXAA, a multiple protein/DNA sequence alignment algorithm, and PicXAA-R, an extension of PicXAA for structural alignment of RNA sequences. Both PicXAA and PicXAA-R are probabilistic non-progressive alignment algorithms that aim to find the optimal alignment of multiple biological sequences by maximizing the expected accuracy. PicXAA and PicXAA-R greedily build up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures local similarities among sequences. PicXAA-Web integrates these two algorithms in a user-friendly web platform for accurate alignment and analysis of multiple protein, DNA and RNA sequences. PicXAA-Web can be freely accessed at http://gsp.tamu.edu/picxaa/.
|
31
|
Markova-Raina P, Petrov D. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Res 2011; 21:863-74. [PMID: 21393387 DOI: 10.1101/gr.115949.110] [Citation(s) in RCA: 108] [Impact Index Per Article: 8.3] [Indexed: 12/25/2022]
Abstract
We investigate the effect of aligner choice on inferences of positive selection using site-specific models of molecular evolution. We find that independently of the choice of aligner, the rate of false positives is unacceptably high. Our study is a whole-genome analysis of all protein-coding genes in 12 Drosophila genomes annotated in either all 12 species (~6690 genes) or in the six melanogaster group species. We compare six popular aligners: PRANK, T-Coffee, ClustalW, ProbCons, AMAP, and MUSCLE, and find that the aligner choice strongly influences the estimates of positive selection. Differences persist when we use (1) different stringency cutoffs, (2) different selection inference models, (3) alignments with or without gaps, and/or additional masking, (4) per-site versus per-gene statistics, (5) closely related melanogaster group species versus more distant 12 Drosophila genomes. Furthermore, we find that these differences are consequential for downstream analyses such as determination of over/under-represented GO terms associated with positive selection. Visual analysis indicates that most sites inferred as positively selected are, in fact, misaligned at the codon level, resulting in false positive rates of 48%-82%. PRANK, which has been reported to outperform other aligners in simulations, performed best in our empirical study as well. Unfortunately, PRANK still had a high false positive rate of 50%-55%, which is unacceptable for most applications. We identify misannotations and indels, many of which appear to be located in disordered protein regions, as primary culprits for the high misalignment-related error levels and discuss possible workaround approaches to this apparently pervasive problem in genome-wide evolutionary analyses.
Affiliation(s)
- Penka Markova-Raina
- Department of Biology, Stanford University, Stanford, California 94305, USA.
|
32
|
Sahraeian SME, Yoon BJ. PicXAA-R: efficient structural alignment of multiple RNA sequences using a greedy approach. BMC Bioinformatics 2011; 12 Suppl 1:S38. [PMID: 21342569 PMCID: PMC3044294 DOI: 10.1186/1471-2105-12-s1-s38] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Indexed: 01/30/2023]
Abstract
Background: Accurate and efficient structural alignment of non-coding RNAs (ncRNAs) has attracted increasing attention as recent studies have unveiled the significance of ncRNAs in living organisms. Because Sankoff-style structural alignment algorithms cannot efficiently handle multiple sequences, progressive schemes are mostly used to reduce the complexity. However, this approach tends to propagate early-stage errors throughout the entire process, thereby degrading the quality of the final alignment. For multiple protein sequence alignment, we have recently proposed PicXAA, which constructs an accurate alignment in a non-progressive fashion. Results: Here, we propose PicXAA-R as an extension to PicXAA for greedy structural alignment of ncRNAs. PicXAA-R efficiently captures both folding information within each sequence and local similarities between sequences. It uses a set of probabilistic consistency transformations to improve the posterior base-pairing and base alignment probabilities using the information of all sequences in the alignment. Using a graph-based scheme, we greedily build up the structural alignment from sequence regions with high base-pairing and base alignment probabilities. Conclusions: Several experiments on datasets with different characteristics confirm that PicXAA-R is one of the fastest algorithms for structural alignment of multiple RNAs and it consistently yields accurate alignment results, especially for datasets with locally similar sequences. PicXAA-R source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.
|
33
|
Santos-Filho NA, Fernandes CAH, Menaldo DL, Magro AJ, Fortes-Dias CL, Estevão-Costa MI, Fontes MRM, Santos CR, Murakami MT, Soares AM. Molecular cloning and biochemical characterization of a myotoxin inhibitor from Bothrops alternatus snake plasma. Biochimie 2010; 93:583-92. [PMID: 21144879 DOI: 10.1016/j.biochi.2010.11.016] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 07/20/2010] [Accepted: 11/26/2010] [Indexed: 10/18/2022]
Abstract
Phospholipases A₂ (PLA₂s) are important components of Bothrops snake venoms, that can induce several effects on envenomations such as myotoxicity, inhibition or induction of platelet aggregation and edema. It is known that venomous and non-venomous snakes present PLA₂ inhibitory proteins (PLIs) in their blood plasma. An inhibitory protein that neutralizes the enzymatic and toxic activities of several PLA₂s from Bothrops venoms was isolated from Bothrops alternatus snake plasma by affinity chromatography using the immobilized myotoxin BthTX-I on CNBr-activated Sepharose. Biochemical characterization of this inhibitory protein, denominated αBaltMIP, showed it to be a glycoprotein with Mr of ~24,000 for the monomeric subunit. CD spectra of the PLA₂/inhibitor complexes are considerably different from those corresponding to the individual proteins and data deconvolution suggests that the complexes had a relative gain of helical structure elements in comparison to the individual protomers, which may indicate a more compact structure upon complexation. Theoretical and experimental structural studies performed in order to obtain insights into the structural features of αBaltMIP indicated that this molecule may potentially trimerize in solution, thus strengthening the hypothesis previously raised by other authors about snake PLIs oligomerization.
Affiliation(s)
- Norival A Santos-Filho
- Departamento de Bioquímica e Imunologia, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, FMRP-USP, Ribeirão Preto-SP, Brazil.
|
34
|
dos Santos JI, Cintra-Francischinelli M, Borges RJ, Fernandes CAH, Pizzo P, Cintra ACO, Braz ASK, Soares AM, Fontes MRM. Structural, functional, and bioinformatics studies reveal a new snake venom homologue phospholipase A2 class. Proteins 2010; 79:61-78. [DOI: 10.1002/prot.22858] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 05/18/2010] [Revised: 07/22/2010] [Accepted: 08/13/2010] [Indexed: 11/09/2022]
|
35
|
Sahraeian SME, Yoon BJ. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res 2010; 38:4917-28. [PMID: 20413579 PMCID: PMC2926610 DOI: 10.1093/nar/gkq255] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 12/23/2009] [Revised: 03/25/2010] [Accepted: 03/26/2010] [Indexed: 11/13/2022] Open
Abstract
Accurate tools for multiple sequence alignment (MSA) are essential for comparative studies of the function and structure of biological sequences. However, it is very challenging to develop a computationally efficient algorithm that can consistently predict accurate alignments for various types of sequence sets. In this article, we introduce PicXAA (Probabilistic Maximum Accuracy Alignment), a probabilistic non-progressive alignment algorithm that aims to find protein alignments with maximum expected accuracy. PicXAA greedily builds up the multiple alignment from sequence regions with high local similarities, thereby yielding an accurate global alignment that effectively grasps the local similarities among sequences. Evaluations on several widely used benchmark sets show that PicXAA constantly yields accurate alignment results on a wide range of reference sets, with especially remarkable improvements over other leading algorithms on sequence sets with local similarities. PicXAA source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.
Affiliation(s)
- Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
36
|
Sengupta R, Bastola DR, Ali HH. Classification and identification of fungal sequences using characteristic restriction endonuclease cut order. J Bioinform Comput Biol 2010; 8:181-98. [PMID: 20401943 DOI: 10.1142/s0219720010004616] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 07/24/2009] [Revised: 10/18/2009] [Accepted: 10/18/2009] [Indexed: 11/18/2022]
Abstract
Restriction Fragment Length Polymorphism (RFLP) is a powerful molecular tool that is extensively used in the molecular fingerprinting and epidemiological studies of microorganisms. In a wet-lab setting, the DNA is cut with one or more restriction enzymes and subjected to gel electrophoresis to obtain signature fragment patterns, which is utilized in the classification and identification of organisms. This wet-lab approach may not be practical when the experimental data set includes a large number of genetic sequences and a wide pool of restriction enzymes to choose from. In this study, we introduce a novel concept of Enzyme Cut Order - a biological property-based characteristic of DNA sequences which can be defined and analyzed computationally without any alignment algorithm. In this alignment-free approach, a similarity matrix is developed based on the pairwise Longest Common Subsequences (LCS) of the Enzyme Cut Orders. The choice of an ideal set of restriction enzymes used for analysis is augmented by using genetic algorithms. The results obtained from this approach using internal transcribed spacer regions of rDNA from fungi as the target sequence show that the phylogenetically-related organisms form a single cluster and successful grouping of phylogenetically close or distant organisms is dependent on the choice of restriction enzymes used in the analysis. Additionally, comparison of trees obtained with this alignment-free and the legacy method revealed highly similar tree topologies. This novel alignment-free method, which utilizes the Enzyme Cut Order and restriction enzyme profile, is a reliable alternative to local or global alignment-based classification and identification of organisms.
Affiliation(s)
- Rajib Sengupta
- College of Information Science and Technology, University of Nebraska, Omaha, NE 68182, USA
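The similarity step described in the abstract above (a matrix built from pairwise Longest Common Subsequences of Enzyme Cut Orders) can be illustrated with a minimal Python sketch. The representation of a cut order as a list of enzyme names, the function names `lcs_length` and `similarity_matrix`, and the normalization by the length of the longer cut order are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences (classic DP)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, 1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_matrix(cut_orders):
    """Symmetric matrix of LCS-based similarities between cut orders.

    Assumption: similarity = LCS length / length of the longer cut order.
    """
    n = len(cut_orders)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = cut_orders[i], cut_orders[j]
            longest = max(len(a), len(b))
            score = lcs_length(a, b) / longest if longest else 1.0
            sim[i][j] = sim[j][i] = score
    return sim


# Hypothetical cut orders: the order in which enzymes cut along each sequence.
orders = [
    ["EcoRI", "HindIII", "BamHI"],
    ["EcoRI", "BamHI"],
    ["HindIII", "EcoRI"],
]
matrix = similarity_matrix(orders)
```

A matrix like this could then be passed to any standard clustering or tree-building routine; which one the authors used is not stated in the abstract.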
|
37
|
Dessimoz C, Gil M. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol 2010; 11:R37. [PMID: 20370897 PMCID: PMC2884540 DOI: 10.1186/gb-2010-11-4-r37] [Citation(s) in RCA: 137] [Impact Index Per Article: 9.8] [Received: 08/21/2009] [Revised: 01/26/2010] [Accepted: 04/06/2010] [Indexed: 01/08/2023] Open
Abstract
Tree-based tests of alignment methods enable the evaluation of the effect of gap placement on the inference of phylogenetic relationships. Background The alignment of biological sequences is of chief importance to most evolutionary and comparative genomics studies, yet the two main approaches used to assess alignment accuracy have flaws: reference alignments are derived from the biased sample of proteins with known structure, and simulated data lack realism. Results Here, we introduce tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference. We show that (i) the current belief that consistency-based alignments outperform scoring matrix-based alignments is misguided; (ii) gaps carry substantial phylogenetic signal, but are poorly exploited by most alignment and tree building programs; (iii) even so, excluding gaps and variable regions is detrimental; (iv) disagreement among alignment programs says little about the accuracy of resulting trees. Conclusions This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.
Affiliation(s)
- Christophe Dessimoz
- Department of Computer Science, ETH Zurich, Universitaetstr, 6, 8092 Zürich, Switzerland.
|
38
|
Brandalise M, Severino FE, Maluf MP, Maia IG. The promoter of a gene encoding an isoflavone reductase-like protein in coffee (Coffea arabica) drives a stress-responsive expression in leaves. Plant Cell Rep 2009; 28:1699-708. [PMID: 19756631 DOI: 10.1007/s00299-009-0769-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 06/26/2009] [Revised: 08/12/2009] [Accepted: 08/20/2009] [Indexed: 05/12/2023]
Abstract
A cDNA clone (designated CaIRL) encoding an isoflavone reductase-like protein from coffee (Coffea arabica) was retrieved during a search for genes showing organ/tissue-specific expression among the expressed sequence tags (EST) of the Brazilian coffee EST database. The CaIRL cDNA contains a single open reading frame of 946 nucleotides (nt) encoding 314 amino acids (predicted molecular weight of 34 kDa). Several features identified the predicted CaIRL protein as a new member of the PIP family of NADPH-dependent reductases. Expression studies demonstrated that CaIRL is expressed exclusively in coffee leaves and its transcript level is markedly increased in response to fungal infection and mechanical injury. Analysis of transgenic tobacco plants harboring a CaIRL 5'-flanking region (862 nt) fused to uidA reporter gene (GUS) confirmed the responsiveness of the putative promoter to abiotic stress in wounded leaves. In turn, a 5' deletion to -404 completely abolished promoter activation by abiotic stimulus in transgenic plants. The lack of GUS expression in non-wounded leaf tissues in transgenic tobacco was in contrast to the basal level of CaIRL expression observed in non-stressed healthy coffee leaves.
Affiliation(s)
- Marcos Brandalise
- Departamento de Genética, Instituto de Biociências, UNESP, Botucatu, SP, Brazil
|
39
|
Bradley RK, Holmes I. Evolutionary triplet models of structured RNA. PLoS Comput Biol 2009; 5:e1000483. [PMID: 19714212 PMCID: PMC2725318 DOI: 10.1371/journal.pcbi.1000483] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 08/21/2008] [Accepted: 07/23/2009] [Indexed: 12/31/2022] Open
Abstract
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a "transducer composition" algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). 
For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.
Affiliation(s)
- Robert K. Bradley
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
- Ian Holmes
- Biophysics Graduate Group, University of California, Berkeley, California, United States of America
- Department of Bioengineering, University of California, Berkeley, California, United States of America
|
40
|
Abstract
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/. Biological sequence alignment is one of the fundamental problems in comparative genomics, yet it remains unsolved. Over sixty sequence alignment programs are listed on Wikipedia, and many new programs are published every year. However, many popular programs suffer from pathologies such as aligning unrelated sequences and producing discordant alignments in protein (amino acid) and codon (nucleotide) space, casting doubt on the accuracy of the inferred alignments. 
Inaccurate alignments can introduce large and unknown systematic biases into downstream analyses such as phylogenetic tree reconstruction and substitution rate estimation. We describe a new program for multiple sequence alignment which can align protein, RNA and DNA sequence and improves on the accuracy of existing approaches on benchmarks of protein and RNA structural alignments and simulated mammalian and fly genomic alignments. Our approach, which seeks to find the alignment which is closest to the truth under our statistical model, leaves unrelated sequences largely unaligned and produces concordant alignments in protein and codon space. It is fast enough for difficult problems such as aligning orthologous genomic regions or aligning hundreds or thousands of proteins. It furthermore has a companion GUI for visualizing the estimated alignment reliability.
|
41
|
Ashkenazy H, Unger R, Kliger Y. Optimal data collection for correlated mutation analysis. Proteins 2009; 74:545-55. [PMID: 18655065 DOI: 10.1002/prot.22168] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/08/2022]
Abstract
The main objective of correlated mutation analysis (CMA) is to predict intraprotein residue-residue interactions from sequence alone. Despite considerable progress in algorithms and computer capabilities, the performance of CMA methods remains quite low. Here we examine whether, and to what extent, the quality of CMA methods depends on the sequences that are included in the multiple sequence alignment (MSA). The results revealed a strong correlation between the number of homologs in an MSA and CMA prediction strength. Furthermore, although many of the current methods include only orthologs in the MSA, we found that it is beneficial to include both orthologs and paralogs. Remarkably, even remote homologs contribute to the improved accuracy. Based on our findings we put forward an automated data collection procedure, with a minimal coverage of 50% between the query protein and its orthologs and paralogs. This procedure improves accuracy even in the absence of manual curation. In this era of massive sequencing and exploding sequence data, our results suggest that correlated mutation-based methods have not reached their inherent performance limitations and that the role of CMA in structural biology is far from being fulfilled.
|
42
|
Paten B, Herrero J, Beal K, Birney E. Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics 2008; 25:295-301. [PMID: 19056777 DOI: 10.1093/bioinformatics/btn630] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Indexed: 11/13/2022]
Abstract
MOTIVATION Multiple sequence alignment is a cornerstone of comparative genomics. Much work has been done to improve methods for this task, particularly for the alignment of small sequences, and especially for amino acid sequences. However, less work has been done to make promising small-scale methods practical for the alignment of much larger genomic sequences. RESULTS We take the method of probabilistic consistency alignment and make it practical for the alignment of large genomic sequences. In so doing we develop a set of new technical methods, combined in a framework we term 'sequence progressive alignment', because it allows us to iteratively compute an alignment by passing over the input sequences from left to right. The result is that we massively decrease the memory consumption of the program relative to a naive implementation. The general engineering of the challenges faced in scaling such a computationally intensive process offers valuable lessons for planning related large-scale sequence analysis algorithms. We also further show the strong performance of Pecan using an extended analysis of ancient repeat alignments. Pecan is now one of the default alignment programs that has been, and is being, used by a number of whole-genome comparative genomic projects. AVAILABILITY The Pecan program is freely available at http://www.ebi.ac.uk/~bjp/pecan/. Pecan whole genome alignments can be found in the Ensembl genome browser.
Affiliation(s)
- Benedict Paten
- Department of Engineering, University of California, Santa Cruz CA, USA.
|
43
|
Rausch T, Emde AK, Weese D, Döring A, Notredame C, Reinert K. Segment-based multiple sequence alignment. Bioinformatics 2008; 24:i187-92. [PMID: 18689823 DOI: 10.1093/bioinformatics/btn281] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Many multiple sequence alignment tools have been developed in the past, progressing either in speed or alignment accuracy. Given the importance and wide-spread use of alignment tools, progress in both categories is a contribution to the community and has driven research in the field so far. RESULTS We introduce a graph-based extension to the consistency-based, progressive alignment strategy. We apply the consistency notion to segments instead of single characters. The main problem we solve in this context is to define segments of the sequences in such a way that a graph-based alignment is possible. We implemented the algorithm using the SeqAn library and report results on amino acid and DNA sequences. The benefit of our approach is threefold: (1) sequences with conserved blocks can be rapidly aligned, (2) the implementation is conceptually easy, generic and fast and (3) the consistency idea can be extended to align multiple genomic sequences. AVAILABILITY The segment-based multiple sequence alignment tool can be downloaded from http://www.seqan.de/projects/msa.html. A novel version of T-Coffee interfaced with the tool is available from http://www.tcoffee.org. The usage of the tool is described in both documentations.
Affiliation(s)
- Tobias Rausch
- International Max Planck Research School for Computational Biology and Scientific Computing, Ihnestr 63-73, 14195 Berlin, Germany.
|
44
|
Bradley RK, Pachter L, Holmes I. Specific alignment of structured RNA: stochastic grammars and sequence annealing. Bioinformatics 2008; 24:2677-83. [PMID: 18796475 DOI: 10.1093/bioinformatics/btn495] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 01/22/2023]
Abstract
MOTIVATION Whole-genome screens suggest that eukaryotic genomes are dense with non-coding RNAs (ncRNAs). We introduce a novel approach to RNA multiple alignment which couples a generative probabilistic model of sequence and structure with an efficient sequence annealing approach for exploring the space of multiple alignments. This leads to a new software program, Stemloc-AMA, that is both accurate and specific in the alignment of multiple related RNA sequences. RESULTS When tested on the benchmark datasets BRalibase II and BRalibase 2.1, Stemloc-AMA has comparable sensitivity to and better specificity than the best competing methods. We use a large-scale random sequence experiment to show that while most alignment programs maximize sensitivity at the expense of specificity, even to the point of giving complete alignments of non-homologous sequences, Stemloc-AMA aligns only sequences with detectable homology and leaves unrelated sequences largely unaligned. Such accurate and specific alignments are crucial for comparative-genomics analysis, from inferring phylogeny to estimating substitution rates across different lineages. AVAILABILITY Stemloc-AMA is available from http://biowiki.org/StemLocAMA as part of the dart software package for sequence analysis.
Affiliation(s)
- Robert K Bradley
- Biophysics Graduate Group, Department of Mathematics and Department of Bioengineering, University of California, Berkeley, CA 94720, USA
|
45
|
Sanchez-Villeda H, Schroeder S, Flint-Garcia S, Guill KE, Yamasaki M, McMullen MD. DNAAlignEditor: DNA alignment editor tool. BMC Bioinformatics 2008; 9:154. [PMID: 18366684 PMCID: PMC2322986 DOI: 10.1186/1471-2105-9-154] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 10/04/2007] [Accepted: 03/19/2008] [Indexed: 12/02/2022] Open
Abstract
Background With advances in DNA re-sequencing methods and Next-Generation parallel sequencing approaches, there has been a large increase in genomic efforts to define and analyze the sequence variability present among individuals within a species. For very polymorphic species such as maize, this has led to a need for intuitive, user-friendly software that aids the biologist, often with naïve programming capability, in tracking, editing, displaying, and exporting multiple individual sequence alignments. To fill this need we have developed a novel DNA alignment editor. Results We have generated a nucleotide sequence alignment editor (DNAAlignEditor) that provides an intuitive, user-friendly interface for manual editing of multiple sequence alignments with functions for input, editing, and output of sequence alignments. The color-coding of nucleotide identity and the display of associated quality scores aid in the manual alignment editing process. DNAAlignEditor works as a client/server tool having two main components: a relational database that collects the processed alignments and a user interface connected to the database through universal data access connectivity drivers. DNAAlignEditor can be used either as a stand-alone application or as a network application with multiple users concurrently connected. Conclusion We anticipate that this software will be of general interest to biologists and population geneticists in editing DNA sequence alignments and analyzing natural sequence variation regardless of species, and will be particularly useful for manual alignment editing of sequences in species with high levels of polymorphism.
|
47
|
Abstract
Protein sequence alignment is the task of identifying evolutionarily or structurally related positions in a collection of amino acid sequences. Although the protein alignment problem has been studied for several decades, many recent studies have demonstrated considerable progress in improving the accuracy or scalability of multiple and pairwise alignment tools, or in expanding the scope of tasks handled by an alignment program. In this chapter, we review state-of-the-art protein sequence alignment and provide practical advice for users of alignment tools.
Affiliation(s)
- Chuong B Do
- Computer Science Department, Stanford University, Stanford, CA, USA
|
48
|
Ruby JG, Stark A, Johnston WK, Kellis M, Bartel DP, Lai EC. Evolution, biogenesis, expression, and target predictions of a substantially expanded set of Drosophila microRNAs. Genome Res 2007; 17:1850-64. [PMID: 17989254 DOI: 10.1101/gr.6597907] [Citation(s) in RCA: 462] [Impact Index Per Article: 27.2] [Indexed: 12/30/2022]
Abstract
MicroRNA (miRNA) genes give rise to small regulatory RNAs in a wide variety of organisms. We used computational methods to predict miRNAs conserved among Drosophila species and large-scale sequencing of small RNAs from Drosophila melanogaster to experimentally confirm and complement these predictions. In addition to validating 20 of our top 45 predictions for novel miRNA loci, the large-scale sequencing identified many miRNAs that had not been predicted. In total, 59 novel genes were identified, increasing our tally of confirmed fly miRNAs to 148. The large-scale sequencing also refined the identities of previously known miRNAs and provided insights into their biogenesis and expression. Many miRNAs were expressed in particular developmental contexts, with a large cohort of miRNAs expressed primarily in imaginal discs. Conserved miRNAs typically were expressed more broadly and robustly than were nonconserved miRNAs, and those conserved miRNAs with more restricted expression tended to have fewer predicted targets than those expressed more broadly. Predicted targets for the expanded set of microRNAs substantially increased and revised the miRNA-target relationships that appear conserved among the fly species. Insights were also provided into miRNA gene evolution, including evidence for emergent regulatory function deriving from the opposite arm of the miRNA hairpin, exemplified by mir-10, and even the opposite strand of the DNA, exemplified by mir-iab-4.
Affiliation(s)
- J Graham Ruby
- Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA
|
49
|
Martin W, Roettger M, Lockhart PJ. A reality check for alignments and trees. Trends Genet 2007; 23:478-80. [PMID: 17825944 DOI: 10.1016/j.tig.2007.08.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 04/04/2007] [Revised: 07/16/2007] [Accepted: 08/22/2007] [Indexed: 11/18/2022]
Abstract
Making multiple sequence alignments is one of the more commonplace procedures in modern biology. Multiple alignments are typically generated by feeding sequences into the alignment program from the N-terminus to the C-terminus. Recent results show that if the same sequences are processed from the C- to the N-terminus, a different alignment is often obtained. Because phylogenetic trees are built from alignments, the resulting trees can also differ. The new findings highlight sequence alignment as a crucial step in molecular evolutionary studies and provide straightforward measures to assess alignment reliability.
Affiliation(s)
- William Martin
- Institute of Botany III, University of Düsseldorf, Düsseldorf, Germany.
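The reversal check described in the abstract above (align the sequences as given, align them again after reversing each one, and compare the results) can be sketched in Python. This is a simplified pairwise stand-in, not the multiple-alignment procedure the abstract refers to: the Needleman-Wunsch aligner, its scoring values, the fixed tie-breaking order, and the names `needleman_wunsch`, `aligned_pairs`, and `heads_tails_agreement` are all assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback with a fixed tie-breaking order: diagonal, then up, then left.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def aligned_pairs(aln_a, aln_b):
    """Set of (i, j) residue-index pairs that share an alignment column."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(aln_a, aln_b):
        if x != "-" and y != "-":
            pairs.add((i, j))
        i += (x != "-")
        j += (y != "-")
    return pairs


def heads_tails_agreement(a, b):
    """Fraction of aligned residue pairs shared by forward and reversed runs."""
    _, fa, fb = needleman_wunsch(a, b)
    _, ra, rb = needleman_wunsch(a[::-1], b[::-1])
    heads = aligned_pairs(fa, fb)
    # Map indices from the reversed sequences back to the original coordinates.
    tails = {(len(a) - 1 - i, len(b) - 1 - j) for i, j in aligned_pairs(ra, rb)}
    union = heads | tails
    return len(heads & tails) / len(union) if union else 1.0


agreement = heads_tails_agreement("GATTACA", "GCATGCU")
```

With this scoring the optimal score is the same in both directions; only the choice among equally good alignments can differ, so an agreement below 1.0 points at columns the aligner placed arbitrarily.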
|