1
|
Krause GR, Shands W, Wheeler TJ. Sensitive and error-tolerant annotation of protein-coding DNA with BATH. Bioinformatics Advances 2024; 4:vbae088. [PMID: 38966592 PMCID: PMC11223822 DOI: 10.1093/bioadv/vbae088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 05/03/2024] [Accepted: 06/10/2024] [Indexed: 07/06/2024]
Abstract
Summary: We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based translated sequence annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and produces superior accuracy to all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long-read sequencing data and in the context of pseudogenes. Availability and implementation: The software is available at https://github.com/TravisWheelerLab/BATH.
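To illustrate why frameshift-inducing indels matter for translated annotation, the following is a minimal sketch, not BATH's algorithm. It translates a short, made-up DNA string with the standard genetic code, deletes one nucleotide, and shows that the downstream peptide only reappears in a different reading frame. The function name `translate` and the example sequence are illustrative assumptions.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna: str) -> str:
    """Translate DNA in frame 0, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Hypothetical coding fragment: ATG GCC AAG CTG GAA ACC
original = "ATGGCCAAGCTGGAAACC"
# Simulate a sequencing error: delete the nucleotide at index 4.
mutated = original[:4] + original[5:]

print(translate(original))     # intact frame
print(translate(mutated))      # frame 0 diverges after the deletion
print(translate(mutated[2:]))  # downstream residues reappear in another frame
```

A frame-by-frame translated search sees each of these peptides separately, which is why a single indel can split or hide a match; a frameshift-aware aligner instead models the shift within one alignment.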
Affiliation(s)
- Genevieve R Krause
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
- Department of Computer Science, University of Montana, Missoula, MT 59812, United States
| | - Walt Shands
- Department of Computer Science, University of Montana, Missoula, MT 59812, United States
- Genomics Institute, UC Santa Cruz, Santa Cruz, CA 95060, United States
| | - Travis J Wheeler
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
- Department of Computer Science, University of Montana, Missoula, MT 59812, United States
| |
|
2
|
Krause GR, Shands W, Wheeler TJ. Sensitive and error-tolerant annotation of protein-coding DNA with BATH. bioRxiv 2024:2023.12.31.573773. [PMID: 38260252 PMCID: PMC10802276 DOI: 10.1101/2023.12.31.573773] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/24/2024]
Abstract
We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and produces superior accuracy to all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long read sequencing data and in the context of pseudogenes.
Affiliation(s)
- Genevieve R Krause
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
- Department of Computer Science, University of Montana, Missoula, Montana, USA
| | - Walt Shands
- Department of Computer Science, University of Montana, Missoula, Montana, USA
- UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
| | - Travis J Wheeler
- R. Ken Coit College of Pharmacy, University of Arizona, Tucson, Arizona, USA
- Department of Computer Science, University of Montana, Missoula, Montana, USA
| |
|
3
|
Frith MC. Paleozoic Protein Fossils Illuminate the Evolution of Vertebrate Genomes and Transposable Elements. Mol Biol Evol 2022; 39:6555113. [PMID: 35348724 PMCID: PMC9004415 DOI: 10.1093/molbev/msac068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
Genomes hold a treasure trove of protein fossils: fragments of formerly protein-coding DNA, which mainly come from transposable elements (TEs) or host genes. These fossils reveal ancient evolution of TEs and genomes, and many fossils have been exapted to perform diverse functions important for the host's fitness. However, old and highly-degraded fossils are hard to identify, standard methods (e.g. BLAST) are not optimized for this task, and few Paleozoic protein fossils have been found. Here, a recently optimized method is used to find protein fossils in vertebrate genomes. It finds Paleozoic fossils predating the amphibian/amniote divergence from most major TE categories, including virus-related Polinton and Gypsy elements. It finds 10 fossils in the human genome (8 from TEs and 2 from host genes) that predate the last common ancestor of all jawed vertebrates, probably from the Ordovician period. It also finds types of transposon and retrotransposon not found in human before. These fossils have extreme sequence conservation, indicating exaptation: some have evidence of gene-regulatory function, and they tend to lie nearest to developmental genes. Some ancient fossils suggest "genome tectonics", where two fragments of one TE have drifted apart by up to megabases, possibly explaining gene deserts and large introns. This paints a picture of great TE diversity in our aquatic ancestors, with patchy TE inheritance by later vertebrates, producing new genes and regulatory elements on the way. Host-gene fossils too have contributed anciently-conserved DNA segments. This paves the way to further studies of ancient protein fossils.
Affiliation(s)
- Martin C Frith
- Artificial Intelligence Research Center, AIST, Tokyo, Japan; Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan; Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo, Japan
| |
|
4
|
Kiryu H, Ichikawa Y, Kojima Y. TMRS: an algorithm for computing the time to the most recent substitution event from a multiple alignment column. Algorithms Mol Biol 2019; 14:23. [PMID: 31832082 PMCID: PMC6859643 DOI: 10.1186/s13015-019-0158-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2018] [Accepted: 11/04/2019] [Indexed: 11/10/2022]
Abstract
Background: As the number of sequenced genomes grows, researchers have access to an increasingly rich source for discovering detailed evolutionary information. However, the computational technologies for inferring biologically important evolutionary events are not sufficiently developed. Results: We present algorithms to estimate the evolutionary time ($t_{\text{MRS}}$) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. As the confidence in estimated $t_{\text{MRS}}$ values varies depending on gap fractions and nucleotide patterns of alignment columns, we also compute the standard deviation $\sigma$ of $t_{\text{MRS}}$ by using a dynamic programming algorithm. We identified a number of human genomic sites at which the last substitutions occurred between two speciation events in the human lineage with confidence. A large fraction of such sites have substitutions that occurred between the concestor nodes of Hominoidea and Euarchontoglires. We investigated the correlation between tissue-specific transcribed enhancers and the distribution of the sites with specific substitution time intervals, and found that brain-specific transcribed enhancers are threefold enriched in the density of substitutions in the human lineage relative to expectations. Conclusions: We have presented algorithms to estimate the evolutionary time ($t_{\text{MRS}}$) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. Our algorithms will be useful for Evo-Devo studies, as they facilitate screening potential genomic sites that have played an important role in the acquisition of unique biological features by target species. Electronic supplementary material: The online version of this article (10.1186/s13015-019-0158-3) contains supplementary material, which is available to authorized users.
|
5
|
Mahlich Y, Steinegger M, Rost B, Bromberg Y. HFSP: high speed homology-driven function annotation of proteins. Bioinformatics 2019; 34:i304-i312. [PMID: 29950013 PMCID: PMC6022561 DOI: 10.1093/bioinformatics/bty262] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 11/20/2022]
Abstract
Motivation: The rapid drop in sequencing costs has produced many more (predicted) protein sequences than can feasibly be functionally annotated with wet-lab experiments. Thus, many computational methods have been developed for this purpose. Most of these methods employ homology-based inference, approximated via sequence alignments, to transfer functional annotations between proteins. The increase in the number of available sequences, however, has drastically increased the search space, thus significantly slowing down alignment methods. Results: Here we describe homology-derived functional similarity of proteins (HFSP), a novel computational method that uses results of a high-speed alignment algorithm, MMseqs2, to infer functional similarity of proteins on the basis of their alignment length and sequence identity. We show that our method is accurate (85% precision) and fast (more than 40-fold speed increase over state-of-the-art). HFSP can help correct at least a 16% error in legacy curations, even for a resource of as high quality as Swiss-Prot. These findings suggest HFSP as an ideal resource for large-scale functional annotation efforts. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yannick Mahlich
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA; Computational Biology & Bioinformatics - i12 Informatics, Technical University of Munich (TUM), Munich, Germany; Institute for Advanced Study, Technical University of Munich (TUM), Munich, Germany
| | - Martin Steinegger
- Computational Biology & Bioinformatics - i12 Informatics, Technical University of Munich (TUM), Munich, Germany; Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany; Department of Chemistry, Seoul National University, Seoul, Korea
| | - Burkhard Rost
- Computational Biology & Bioinformatics - i12 Informatics, Technical University of Munich (TUM), Munich, Germany; Institute for Advanced Study, Technical University of Munich (TUM), Munich, Germany; TUM School of Life Sciences Weihenstephan (WZW), Technical University Munich (TUM), Freising, Germany; Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA; New York Consortium on Membrane Protein Structure (NYCOMPS), New York, NY, USA
| | - Yana Bromberg
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, USA; Institute for Advanced Study, Technical University of Munich (TUM), Munich, Germany; Department of Genetics, Human Genetics Institute, Rutgers University, Piscataway, NJ, USA
| |
|
6
|
Abstract
Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level between two or more genomes. It combines aspects of both colinear sequence alignment and gene orthology prediction and is typically more challenging to address than either of these tasks due to the size and complexity of whole genomes. Despite the difficulty of this problem, numerous methods have been developed for its solution because WGAs are valuable for genome-wide analyses such as phylogenetic inference, genome annotation, and function prediction. In this chapter, we discuss the meaning and significance of WGA and present an overview of the methods that address it. We also examine the problem of evaluating whole-genome aligners and offer a set of methodological challenges that need to be tackled in order to make most effective use of our rapidly growing databases of whole genomes.
Affiliation(s)
- Colin N Dewey
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA.
| |
|
7
|
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 2017; 35:1026-1028. [PMID: 29035372 DOI: 10.1038/nbt.3988] [Citation(s) in RCA: 1427] [Impact Index Per Article: 203.9] [Indexed: 11/08/2022]
|
8
|
Lim K, Yamada KD, Frith MC, Tomii K. Protein sequence-similarity search acceleration using a heuristic algorithm with a sensitive matrix. J Struct Funct Genomics 2017; 17:147-154. [PMID: 28083762 PMCID: PMC5274646 DOI: 10.1007/s10969-016-9210-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/31/2015] [Accepted: 12/05/2016] [Indexed: 12/28/2022]
Abstract
Protein database search for public databases is a fundamental step in the target selection of proteins in structural and functional genomics and also for inferring protein structure, function, and evolution. Most database search methods employ amino acid substitution matrices to score amino acid pairs. The choice of substitution matrix strongly affects homology detection performance. We earlier proposed a substitution matrix named MIQS that was optimized for distant protein homology search. Herein we further evaluate MIQS in combination with LAST, a heuristic and fast database search tool with a tunable sensitivity parameter m, where larger m denotes higher sensitivity. Results show that MIQS substantially improves the homology detection and alignment quality performance of LAST across diverse m parameters. Against a protein database consisting of approximately 15 million sequences, LAST with m = 10^5 achieves better homology detection performance than BLASTP, and completes the search 20 times faster. Compared to the most sensitive existing methods being used today, CS-BLAST and SSEARCH, LAST with MIQS and m = 10^6 shows comparable homology detection performance at 2.0 and 3.9 times greater speed, respectively. Results demonstrate that MIQS-powered LAST is a time-efficient method for sensitive and accurate homology search.
Affiliation(s)
- Kyungtaek Lim
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
| | - Kazunori D Yamada
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Graduate School of Information Sciences, Tohoku University, 6-3-9 Aramaki-Aza-Aoba, Aoba-ku, Sendai, 980-8579, Japan
| | - Martin C Frith
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Department of Computational Biology and Medical Sciences, University of Tokyo, 5-1-5 Kashiwa-no-ha, Kashiwa, Chiba, 277-8561, Japan
| | - Kentaro Tomii
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
- Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
| |
|
9
|
Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AFA, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic Acids Res 2015; 44:D81-9. [PMID: 26612867 PMCID: PMC4702899 DOI: 10.1093/nar/gkv1272] [Citation(s) in RCA: 409] [Impact Index Per Article: 45.4] [Received: 10/02/2015] [Accepted: 11/03/2015] [Indexed: 11/20/2022]
Abstract
Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.
Affiliation(s)
- Robert Hubley
- Institute for Systems Biology, Seattle, WA 98109, USA
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1RQ, UK
| | - Jody Clements
- HHMI Janelia Research Campus, Ashburn, VA 20147, USA
| | - Sean R Eddy
- Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA
| | - Thomas A Jones
- Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA
| | - Weidong Bao
- Genetic Information Research Institute, Los Altos, CA 94022, USA
| | | | | |
|
10
|
Hoen DR, Hickey G, Bourque G, Casacuberta J, Cordaux R, Feschotte C, Fiston-Lavier AS, Hua-Van A, Hubley R, Kapusta A, Lerat E, Maumus F, Pollock DD, Quesneville H, Smit A, Wheeler TJ, Bureau TE, Blanchette M. A call for benchmarking transposable element annotation methods. Mob DNA 2015; 6:13. [PMID: 26244060 PMCID: PMC4524446 DOI: 10.1186/s13100-015-0044-6] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Received: 06/25/2015] [Accepted: 07/22/2015] [Indexed: 12/31/2022]
Abstract
DNA derived from transposable elements (TEs) constitutes large parts of the genomes of complex eukaryotes, with major impacts not only on genomic research but also on how organisms evolve and function. Although a variety of methods and tools have been developed to detect and annotate TEs, there are as yet no standard benchmarks, that is, no standard way to measure or compare their accuracy. This lack of accuracy assessment calls into question conclusions from a wide range of research that depends explicitly or implicitly on TE annotation. In the absence of standard benchmarks, toolmakers are impeded in improving their tools, annotators cannot properly assess which tools might best suit their needs, and downstream researchers cannot judge how accuracy limitations might impact their studies. We therefore propose that the TE research community create and adopt standard TE annotation benchmarks, and we call for other researchers to join the authors in making this long-overdue effort a success.
Affiliation(s)
- Douglas R Hoen
- School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada ; Department of Biology, McGill University, Stewart Biology Bldg., 1205 Ave. du Docteur-Penfield, Montréal, Québec H3A 1B1 Canada
| | - Glenn Hickey
- School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada ; McGill Centre for Bioinformatics, McGill University, Montréal, Québec Canada
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montréal, Québec Canada ; McGill University and Génome Québec Innovation Center, Montréal, Québec Canada
| | - Josep Casacuberta
- Centre for Research in Agricultural Genomics CSIC-IRTA-UAB-UB, 08193 Barcelona, Spain
| | - Richard Cordaux
- Université de Poitiers, UMR CNRS 7267 Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, 5 Rue Albert Turpin, 86073 Poitiers Cedex 9, France
| | - Cédric Feschotte
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112 USA
| | - Anna-Sophie Fiston-Lavier
- Institut des Sciences de l'Evolution de Montpellier (ISE-M), Equipe Evolution, Vecteurs, Adaptation et Symbiose, UMR5554 CNRS-Université Montpellier, Montpellier, 34090 cedex 05 France
| | - Aurélie Hua-Van
- Laboratoire Evolution, Génomes, Comportement Ecologie, CNRS-Université Paris-Sud (UMR 9191)-IRD (UMR 247)-Université Paris-Saclay, F-91198 Gif-sur-Yvette, France
| | - Robert Hubley
- Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109 USA
| | - Aurélie Kapusta
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112 USA
| | - Emmanuelle Lerat
- Laboratoire Biometrie et Biologie Evolutive, Universite Claude Bernard-Lyon 1, UMR-CNRS 5558-Bat. Mendel, 43 bd du 11 novembre 1918, 69622 Villeurbanne cedex, France
| | - Florian Maumus
- INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles, 78026 France
| | - David D Pollock
- University of Colorado School of Medicine, Aurora, CO 80045 USA
| | - Hadi Quesneville
- INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles, 78026 France
| | - Arian Smit
- Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109 USA
| | - Travis J Wheeler
- Department of Computer Science, University of Montana, Missoula, MT 59812 USA
| | - Thomas E Bureau
- Department of Biology, McGill University, Stewart Biology Bldg., 1205 Ave. du Docteur-Penfield, Montréal, Québec H3A 1B1 Canada
| | - Mathieu Blanchette
- School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada ; McGill Centre for Bioinformatics, McGill University, Montréal, Québec Canada
| |
|
11
|
Frith MC, Kawaguchi R. Split-alignment of genomes finds orthologies more accurately. Genome Biol 2015; 16:106. [PMID: 25994148 PMCID: PMC4464727 DOI: 10.1186/s13059-015-0670-9] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Received: 04/06/2015] [Accepted: 05/08/2015] [Indexed: 04/29/2023]
Abstract
We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
| | - Risa Kawaguchi
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan; Department of Computational Biology, Faculty of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8561, Japan.
| |
|
12
|
Abstract
It is well known that remnants of partial or whole copies of mitochondrial DNA, known as Nuclear MiTochondrial sequences (NUMTs), are found in nuclear genomes. Since whole genome sequences have become available, many bioinformatics studies have identified putative NUMTs and from those attempted to infer the factors involved in NUMT creation. These studies conclude that NUMTs represent randomly chosen regions of the mitochondrial genome. There is less consensus regarding the nuclear insertion sites of NUMTs - previous studies have discussed the possible role of retrotransposons, but some recent ones have reported no correlation or even anti-correlation between NUMT sites and retrotransposons. These studies have generally defined NUMT sites using BLAST with default parameters. We analyze a redefined set of human NUMTs, computed with a carefully considered protocol. We discover that the inferred insertion points of NUMTs have a strong tendency to have high-predicted DNA curvature, occur in experimentally defined open chromatin regions and often occur immediately adjacent to A + T oligomers. We also show clear evidence that their flanking regions are indeed rich in retrotransposons. Finally we show that parts of the mitochondrial genome D-loop are under-represented as a source of NUMTs in primate evolution.
Affiliation(s)
- Junko Tsuji
- Department of Computational Biology, Graduate School of Frontier Science, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
| | | | | | | |
|
13
|
Frith MC. Gentle masking of low-complexity sequences improves homology search. PLoS One 2011; 6:e28819. [PMID: 22205972 PMCID: PMC3242753 DOI: 10.1371/journal.pone.0028819] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 10/17/2011] [Accepted: 11/15/2011] [Indexed: 11/19/2022]
Abstract
Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with “gentle” masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is $\min(0, S)$, where $S$ is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to “harsh” masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search.
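The gentle-masking rule can be sketched in a few lines. This is a minimal illustration, assuming the rule is that a pair involving a masked letter scores min(0, S), where S is the unmasked score, and assuming the common convention that lowercase letters mark masked positions. The +1/-1 scoring scheme and the function names are hypothetical choices for the example, not taken from the paper.

```python
def unmasked_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Toy substitution score S that ignores letter case (assumed scheme)."""
    return match if a.upper() == b.upper() else mismatch

def gentle_score(a: str, b: str) -> int:
    """Score one aligned pair; lowercase letters are treated as masked."""
    s = unmasked_score(a, b)
    masked = a.islower() or b.islower()
    # A masked letter can never contribute a positive score,
    # but mismatches involving it are still penalized.
    return min(0, s) if masked else s

def ungapped_score(x: str, y: str) -> int:
    """Sum pair scores over two equal-length, gap-free sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(gentle_score(a, b) for a, b in zip(x, y))

# The masked 'atat' tract adds nothing to the score of a self-alignment.
print(ungapped_score("ACGTatatGG", "ACGTatatGG"))
```

Under this rule a low-complexity tract cannot create a high-scoring alignment on its own, yet an alignment seeded in unmasked sequence can still extend through it without an artificial penalty on matching letters.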
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Koto-ku, Tokyo, Japan.
| |
|
14
|
Kiryu H. Sufficient statistics and expectation maximization algorithms in phylogenetic tree models. Bioinformatics 2011; 27:2346-53. [PMID: 21757463 DOI: 10.1093/bioinformatics/btr420] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Measuring evolutionary conservation is a routine step in the identification of functional elements in genome sequences. Although a number of studies have proposed methods that use the continuous time Markov models (CTMMs) to find evolutionarily constrained elements, their probabilistic structures have been less frequently investigated. RESULTS: In this article, we investigate a sufficient statistic for CTMMs. The statistic is composed of the fractional duration of nucleotide characters over evolutionary time, F(d), and the number of substitutions occurring in phylogenetic trees, N(s). We first derive basic properties of the sufficient statistic. Then, we derive an expectation maximization (EM) algorithm for estimating the parameters of a phylogenetic model, which iteratively computes the expectation values of the sufficient statistic. We show that the EM algorithm exhibits much faster convergence than other optimization methods that use numerical gradient descent algorithms. Finally, we investigate the genome-wide distribution of fractional duration time F(d) which, unlike the number of substitutions N(s), has rarely been investigated. We show that F(d) has evolutionary information that is distinct from that in N(s), which may be useful for detecting novel types of evolutionary constraints existing in the human genome. AVAILABILITY: The C++ source code of the 'Fdur' software is available at http://www.ncrna.org/software/fdur/. CONTACT: kiryu-h@k.u-tokyo.ac.jp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hisanori Kiryu
- Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan.
| |
|
15
|
Hudek AK, Brown DG. FEAST: sensitive local alignment with multiple rates of evolution. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:698-709. [PMID: 20733242 DOI: 10.1109/tcbb.2010.76] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/29/2023]
Abstract
We present a pairwise local aligner, FEAST, which uses two new techniques: a sensitive extension algorithm for identifying homologous subsequences, and a descriptive probabilistic alignment model. We also present a new procedure for training alignment parameters and apply it to the human and mouse genomes, producing a better parameter set for these sequences. Our extension algorithm identifies homologous subsequences by considering all evolutionary histories. It has higher maximum sensitivity than Viterbi extensions, and better balances specificity. We model alignments with several submodels, each with unique statistical properties, describing strongly similar and weakly similar regions of homologous DNA. Training parameters using two submodels produces superior alignments, even when we align with only the parameters from the weaker submodel. Our extension algorithm combined with our new parameter set achieves sensitivity 0.59 on synthetic tests. In contrast, LASTZ with default settings achieves sensitivity 0.35 with the same false positive rate. Using the weak submodel as parameters for LASTZ increases its sensitivity to 0.59 with high error. FEAST is available at http://monod.uwaterloo.ca/feast/.
Affiliation(s)
- Alexander K Hudek
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada.
|
16
|
Söding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol 2011; 21:404-11. [PMID: 21458982 DOI: 10.1016/j.sbi.2011.03.005] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2011] [Revised: 03/01/2011] [Accepted: 03/09/2011] [Indexed: 11/26/2022]
Abstract
Protein sequence comparison methods have grown increasingly sensitive during the last decade and can often identify distantly related proteins sharing a common ancestor some 3 billion years ago. Although cellular function is not conserved so long, molecular functions and structures of protein domains often are. In combination with a domain-centered approach to function and structure prediction, modern remote homology detection methods have a great and largely underexploited potential for elucidating protein functions and evolution. Advances during the last few years include nonlinear scoring functions combining various sequence features, the use of sequence context information, and powerful new software packages. Since progress depends on realistically assessing new and existing methods and published benchmarks are often hard to compare, we propose 10 rules of good-practice benchmarking.
Affiliation(s)
- Johannes Söding
- Gene Center and Center for Integrated Protein Science, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, Munich, Germany.
|
17
|
Selective loss of glycogen synthase kinase-3α in birds reveals distinct roles for GSK-3 isozymes in tau phosphorylation. FEBS Lett 2011; 585:1158-62. [PMID: 21419127 DOI: 10.1016/j.febslet.2011.03.025] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2011] [Revised: 03/06/2011] [Accepted: 03/11/2011] [Indexed: 01/05/2023]
Abstract
Mammalian glycogen synthase kinase-3 (GSK-3), a critical regulator in neuronal signaling, cognition, and behavior, exists as two isozymes, GSK-3α and GSK-3β. Their distinct biological functions remain largely unknown. Here, we examined the evolutionary significance of each of these isozymes. Surprisingly, we found that unlike other vertebrates that harbor both GSK-3 genes, the GSK-3α gene is missing in birds. GSK-3-mediated tau phosphorylation was significantly lower in adult bird brains than in mouse brains, a phenomenon that was reproduced in GSK-3α knockout mouse brains. Tau phosphorylation was detected in brains from bird embryos, suggesting that GSK-3 isozymes play distinct roles in tau phosphorylation during development. Birds are natural GSK-3α knockout organisms and may serve as a novel model to study the distinct functions of GSK-3 isozymes.
|
18
|
Nakato R, Gotoh O. Cgaln: fast and space-efficient whole-genome alignment. BMC Bioinformatics 2010; 11:224. [PMID: 20433723 PMCID: PMC2873541 DOI: 10.1186/1471-2105-11-224] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2010] [Accepted: 04/30/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Whole-genome sequence alignment is an essential process for extracting valuable information about the functions, evolution, and peculiarities of genomes under investigation. As available genomic sequence data accumulate rapidly, there is great demand for tools that can compare whole-genome sequences within practical amounts of time and space. However, most existing genomic alignment tools can treat sequences that are only a few Mb long at once, and no state-of-the-art alignment program can align large sequences such as mammalian genomes directly on a conventional standalone computer. RESULTS We previously proposed the CGAT (Coarse-Grained AlignmenT) algorithm, which performs an alignment job in two steps: first at the block level and then at the nucleotide level. The former is "coarse-grained" alignment that can explore genomic rearrangements and reduce the sizes of the regions to be analyzed in the next step. The latter is detailed alignment within limited regions. In this paper, we present an update of the algorithm and the open-source program, Cgaln, that implements the algorithm. We compared the performance of Cgaln with those of other programs on whole genomic sequences of several bacteria and of some mammalian chromosome pairs. The results showed that Cgaln is several times faster and more memory-efficient than the best existing programs, while its sensitivity and accuracy are comparable to those of the best programs. Cgaln takes less than 13 hours to finish an alignment between the whole genomes of human and mouse in a single run on a conventional desktop computer with a single CPU and 2 GB memory. CONCLUSIONS Cgaln is not only fast and memory efficient but also effective in coping with genomic rearrangements. Our results show that Cgaln is very effective for comparison of large genomes, especially of intact chromosomal sequences. 
We believe that Cgaln provides a novel viewpoint for reducing computational complexity and will contribute to various fields of genome science.
Affiliation(s)
- Ryuichiro Nakato
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto-shi, Kyoto 606-8501, Japan
|
19
|
Grabherr MG, Russell P, Meyer M, Mauceli E, Alföldi J, Di Palma F, Lindblad-Toh K. Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics 2010; 26:1145-51. [PMID: 20208069 DOI: 10.1093/bioinformatics/btq102] [Citation(s) in RCA: 188] [Impact Index Per Article: 13.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Comparative genomics heavily relies on alignments of large and often complex DNA sequences. From an engineering perspective, the problem here is to provide maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes). RESULTS Satsuma addresses all three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous 'battleship'-like search that allows for aligning two entire fish genomes (470 and 217 Mb) in 120 CPU hours using 15 processors on a single machine. AVAILABILITY Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines can be freely downloaded under the LGPL license from http://www.broadinstitute.org/science/programs/genome-biology/spines/.
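The cross-correlation idea in item (i) can be illustrated with a small sketch. This is not Satsuma's implementation, whose details the abstract does not give; it only shows the general technique of counting nucleotide matches at every relative shift of two sequences via the correlation theorem. The function names `fft` and `match_counts_by_shift` are hypothetical, and a plain recursive radix-2 FFT is used only so the example needs nothing beyond the standard library.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts_by_shift(query, target):
    """Return {shift: matches}, counting positions i with query[i] == target[i + shift]."""
    # Pad to a power of two that is large enough to avoid circular wrap-around.
    size = 1
    while size < len(query) + len(target):
        size *= 2
    totals = [0.0] * size
    for base in "ACGT":
        # Indicator signals: 1.0 where the sequence carries this base, else 0.0.
        q = [1.0 if c == base else 0.0 for c in query] + [0.0] * (size - len(query))
        t = [1.0 if c == base else 0.0 for c in target] + [0.0] * (size - len(target))
        fq, ft = fft(q), fft(t)
        # Correlation theorem: corr = IFFT(conj(FFT(q)) * FFT(t)).
        corr = fft([a.conjugate() * b for a, b in zip(fq, ft)], invert=True)
        for k in range(size):
            totals[k] += corr[k].real / size  # divide by size to normalize the inverse
    # Negative shifts are stored at the end of the circular result.
    return {
        shift: round(totals[shift % size])
        for shift in range(-(len(query) - 1), len(target))
    }


counts = match_counts_by_shift("ACGT", "TTACGTTT")
best_shift = max(counts, key=counts.get)
print(best_shift, counts[best_shift])
```

Summing the four per-base correlations gives the total number of identical positions at each shift in O(n log n) time, rather than the O(n^2) cost of comparing every shift directly.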
Affiliation(s)
- Manfred G Grabherr
- Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02142, USA.
|
20
|
Frith MC, Hamada M, Horton P. Parameters for accurate genome alignment. BMC Bioinformatics 2010; 11:80. [PMID: 20144198 PMCID: PMC2829014 DOI: 10.1186/1471-2105-11-80] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2009] [Accepted: 02/09/2010] [Indexed: 11/25/2022] Open
Abstract
Background Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Tokyo 135-0064, Japan.
|