1
Kimbrel JA, Jeffrey BM, Ward CS. Prokaryotic Genome Annotation. Methods Mol Biol 2021; 2349:193-214. [PMID: 34718997 DOI: 10.1007/978-1-0716-1585-0_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2023]
Abstract
In the last decade, the high throughput and relatively low cost of short-read sequencing technologies have revolutionized prokaryotic genomics. This has led to an exponential increase in the number of bacterial and archaeal genome sequences available, as well as a corresponding increase in the number of genome assembly and annotation tools developed. Together, these hardware and software technologies have given scientists unprecedented options to study their chosen microbial systems without the need for large teams of bioinformaticists or supercomputing facilities. While these analysis tools largely fall into only a few categories, each may have different requirements, caveats and file formats, and some may be rarely updated or even abandoned. And so, despite the apparent ease of sequencing and analyzing a prokaryotic genome, it is no wonder that the budding genomicist may quickly become overwhelmed. Here, we aim to provide the reader with an overview of genome annotation and its most important considerations, as well as an easy-to-follow protocol to get started with annotating a prokaryotic genome.
Affiliation(s)
- Jeffrey A Kimbrel
- Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA, USA
- Brendan M Jeffrey
- Bioinformatics and Computational Biosciences Branch, Rocky Mountain Laboratories, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, USA
- Christopher S Ward
- Biosciences and Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA, USA
- Department of Biological Sciences, Bowling Green State University, Bowling Green, OH, USA

2
Chan KL, Rosli R, Tatarinova TV, Hogan M, Firdaus-Raih M, Low ETL. Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data. BMC Bioinformatics 2017; 18:1426. [PMID: 28466793 PMCID: PMC5333190 DOI: 10.1186/s12859-016-1426-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed by various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion.
RESULTS: We present an automated gene prediction pipeline, Seqping, which uses self-training HMM models and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using the GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program to combine predictions from the three tools in association with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO's plantae dataset. Our evaluation shows that Seqping was able to generate better gene predictions compared to three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) using their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure).
CONCLUSIONS: Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs.
Affiliation(s)
- Kuang-Lim Chan
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
- Rozana Rosli
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
- Tatiana V. Tatarinova
- Center for Personalized Medicine and Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Michael Hogan
- Orion Genomics, 4041 Forest Park Avenue, St. Louis, MO 63108, USA
- Mohd Firdaus-Raih
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
- Eng-Ti Leslie Low
- Advanced Biotechnology and Breeding Center, Malaysian Palm Oil Board, 6 Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia

3
Abstract
Gene finding is the process of identifying genome sequence regions representing stretches of DNA that encode biologically active products, such as proteins or functional noncoding RNAs. As this is usually the first step in the analysis of any novel genomic sequence or resequenced sample of well-known organisms, it is a very important issue, as all downstream analyses depend on the results. This chapter describes the biological basis for gene finding, and the programs and computational approaches that are available for the automated identification of protein-coding genes. For bacterial, archaeal, and eukaryotic genomes, as well as for multi-species sequence data originating from environmental community studies, the state of the art in automated gene finding is described.
Affiliation(s)
- Alice Carolyn McHardy
- Department for Algorithmic Bioinformatics, Heinrich Heine University, Düsseldorf, Germany
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Andreas Kloetgen
- Department for Algorithmic Bioinformatics, Heinrich Heine University, Düsseldorf, Germany
- Department of Pediatric Oncology, Hematology and Clinical Immunology, Heinrich Heine University, Düsseldorf, Germany

4
Zickmann F, Renard BY. MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms. Bioinformatics 2015; 31:i106-15. [PMID: 26072472 PMCID: PMC4765881 DOI: 10.1093/bioinformatics/btv236] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 01/08/2023]
Abstract
Summary: Ongoing advances in high-throughput technologies have facilitated accurate proteomic measurements and provide a wealth of information at the genomic and transcript level. In proteogenomics, this multi-omics data is combined to analyze unannotated organisms and to allow more accurate sample-specific predictions. Existing analysis methods still mainly depend on six-frame translations or reference protein databases that are extended by transcriptomic information or known single nucleotide polymorphisms (SNPs). However, six-frame translations introduce an artificial sixfold increase of the target database, and SNP integration requires a suitable database summarizing results from previous experiments. We overcome these limitations by introducing MSProGene, a new method for integrative proteogenomic analysis based on customized RNA-Seq driven transcript databases. MSProGene is independent of existing reference databases or annotated SNPs and avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm. It thereby also allows resolving the ambiguity of shared peptides for protein inference. We applied MSProGene to three datasets and show that it facilitates a database-independent, reliable yet accurate prediction at the gene and protein level and additionally identifies novel genes.
Availability and implementation: MSProGene is written in Java and Python. It is open source and available at http://sourceforge.net/projects/msprogene/.
Contact: renardb@rki.de
Affiliation(s)
- Franziska Zickmann
- Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany
- Bernhard Y Renard
- Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany

5
Ahn J, Xiao X. RASER: reads aligner for SNPs and editing sites of RNA. Bioinformatics 2015; 31:3906-13. [PMID: 26323713 DOI: 10.1093/bioinformatics/btv505] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 02/25/2015] [Accepted: 08/23/2015] [Indexed: 12/30/2022]
Abstract
MOTIVATION: Accurate identification of genetic variants such as single-nucleotide polymorphisms (SNPs) or RNA editing sites from RNA-Seq reads is important, yet challenging, because it necessitates a very low false-positive rate in read mapping. Although many read aligners are available, no single aligner was specifically developed or tested as an effective tool for SNP and RNA editing prediction.
RESULTS: We present RASER, an accurate read aligner with novel mapping schemes and an index tree structure that aims to reduce false-positive mappings due to the existence of highly similar regions. We demonstrate that RASER shows the best mapping accuracy compared with other popular algorithms and the highest sensitivity in identifying multiply mapped reads. As a result, RASER displays superb efficacy in unbiased mapping of the alternative alleles of SNPs and in identification of RNA editing sites.
AVAILABILITY AND IMPLEMENTATION: RASER is written in C++ and freely available for download at https://github.com/jaegyoonahn/RASER.
Affiliation(s)
- Jaegyoon Ahn
- Department of Integrative Biology and Physiology and the Molecular Biology Institute, University of California Los Angeles, Los Angeles, CA 90095, USA
- Xinshu Xiao
- Department of Integrative Biology and Physiology and the Molecular Biology Institute, University of California Los Angeles, Los Angeles, CA 90095, USA

6
Spies D, Ciaudo C. Dynamics in Transcriptomics: Advancements in RNA-seq Time Course and Downstream Analysis. Comput Struct Biotechnol J 2015; 13:469-77. [PMID: 26430493 PMCID: PMC4564389 DOI: 10.1016/j.csbj.2015.08.004] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Received: 06/01/2015] [Revised: 08/05/2015] [Accepted: 08/07/2015] [Indexed: 12/17/2022]
Abstract
Analysis of gene expression has contributed to a plethora of biological and medical research studies. Microarrays have been intensively used for the profiling of gene expression during diverse developmental processes, treatments and diseases. New massively parallel sequencing methods, often referred to as RNA-sequencing (RNA-seq), are extensively improving our understanding of gene regulation and signaling networks. Computational methods developed originally for microarray analysis can now be optimized and applied to genome-wide studies to gain a better understanding of the whole transcriptome. This review addresses current challenges in RNA-seq analysis and specifically focuses on new bioinformatics tools developed for time series experiments. Furthermore, possible improvements in analysis and data integration, as well as future applications of differential expression analysis, are discussed.
Affiliation(s)
- Daniel Spies
- Swiss Federal Institute of Technology Zurich, Department of Biology, Institute of Molecular Health Sciences, Zurich, Otto-Stern Weg 7, 8093 Zurich, Switzerland
- Life Science Zurich Graduate School, Molecular Life Science Program, University of Zurich, Institute of Molecular Life Sciences, Winterthurerstrasse 190, 8057 Zurich, Switzerland
- Constance Ciaudo
- Swiss Federal Institute of Technology Zurich, Department of Biology, Institute of Molecular Health Sciences, Zurich, Otto-Stern Weg 7, 8093 Zurich, Switzerland

7
Zickmann F, Renard BY. IPred - integrating ab initio and evidence based gene predictions to improve prediction accuracy. BMC Genomics 2015; 16:134. [PMID: 25766582 PMCID: PMC4345001 DOI: 10.1186/s12864-015-1315-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 09/02/2014] [Accepted: 02/03/2015] [Indexed: 11/21/2022]
Abstract
Background: Gene prediction is a challenging but crucial part of most genome analysis pipelines. Various methods have evolved that predict genes ab initio on reference sequences or evidence based with the help of additional information, such as RNA-Seq reads or EST libraries. However, none of these strategies is bias-free and one method alone does not necessarily provide a complete set of accurate predictions.
Results: We present IPred (Integrative gene Prediction), a method to integrate ab initio and evidence based gene identifications to complement the advantages of different prediction strategies. IPred builds on the output of gene finders and generates a new combined set of gene identifications, representing the integrated evidence of the single method predictions.
Conclusion: We evaluate IPred in simulations and real data experiments on Escherichia coli and human data. We show that IPred improves the prediction accuracy in comparison to single method predictions and to existing methods for prediction combination.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1315-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Franziska Zickmann
- Research Group Bioinformatics (NG4), Robert Koch-Institute, Berlin, Germany
- Bernhard Y Renard
- Research Group Bioinformatics (NG4), Robert Koch-Institute, Berlin, Germany

8
Fawal N, Li Q, Mathé C, Dunand C. Automatic multigenic family annotation: risks and solutions. Trends Genet 2014; 30:323-5. [DOI: 10.1016/j.tig.2014.06.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 04/30/2014] [Revised: 06/23/2014] [Accepted: 06/24/2014] [Indexed: 12/17/2022]

9
Sallet E, Gouzy J, Schiex T. EuGene-PP: a next-generation automated annotation pipeline for prokaryotic genomes. Bioinformatics 2014; 30:2659-61. [PMID: 24880686 DOI: 10.1093/bioinformatics/btu366] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 02/06/2023]
Abstract
It is now easy and increasingly usual to produce oriented RNA-Seq data as a prokaryotic genome is being sequenced. However, this information is usually just used for expression quantification. EuGene-PP is a fully automated pipeline for structural annotation of prokaryotic genomes. It integrates protein similarities, statistical information and any oriented expression information (RNA-Seq or tiling arrays) through a variety of file formats to produce a qualitatively enriched annotation that includes coding regions as well as (possibly antisense) non-coding genes and transcription start sites.
AVAILABILITY AND IMPLEMENTATION: EuGene-PP is open-source software based on EuGene-P, integrating a Galaxy configuration. EuGene-PP can be downloaded at eugene.toulouse.inra.fr.
Affiliation(s)
- Erika Sallet
- Laboratoire Interactions Plantes Micro-organismes (LIPM) UMR441/2594, INRA/CNRS, F-31320 and INRA, Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, Castanet-Tolosan F-31326, France
- Jérôme Gouzy
- Laboratoire Interactions Plantes Micro-organismes (LIPM) UMR441/2594, INRA/CNRS, F-31320 and INRA, Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, Castanet-Tolosan F-31326, France
- Thomas Schiex
- Laboratoire Interactions Plantes Micro-organismes (LIPM) UMR441/2594, INRA/CNRS, F-31320 and INRA, Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, Castanet-Tolosan F-31326, France