1
Pardo-Palacios FJ, Wang D, Reese F, Diekhans M, Carbonell-Sala S, Williams B, Loveland JE, De María M, Adams MS, Balderrama-Gutierrez G, Behera AK, Gonzalez Martinez JM, Hunt T, Lagarde J, Liang CE, Li H, Meade MJ, Moraga Amador DA, Prjibelski AD, Birol I, Bostan H, Brooks AM, Çelik MH, Chen Y, Du MRM, Felton C, Göke J, Hafezqorani S, Herwig R, Kawaji H, Lee J, Li JL, Lienhard M, Mikheenko A, Mulligan D, Nip KM, Pertea M, Ritchie ME, Sim AD, Tang AD, Wan YK, Wang C, Wong BY, Yang C, Barnes I, Berry AE, Capella-Gutierrez S, Cousineau A, Dhillon N, Fernandez-Gonzalez JM, Ferrández-Peral L, Garcia-Reyero N, Götz S, Hernández-Ferrer C, Kondratova L, Liu T, Martinez-Martin A, Menor C, Mestre-Tomás J, Mudge JM, Panayotova NG, Paniagua A, Repchevsky D, Ren X, Rouchka E, Saint-John B, Sapena E, Sheynkman L, Smith ML, Suner MM, Takahashi H, Youngworth IA, Carninci P, Denslow ND, Guigó R, Hunter ME, Maehr R, Shen Y, Tilgner HU, Wold BJ, Vollmers C, Frankish A, Au KF, Sheynkman GM, Mortazavi A, Conesa A, Brooks AN. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Nat Methods 2024:10.1038/s41592-024-02298-3. [PMID: 38849569 DOI: 10.1038/s41592-024-02298-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Accepted: 05/03/2024] [Indexed: 06/09/2024]
Abstract
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Affiliation(s)
| | - Dingjie Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Fairlie Reese
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Sílvia Carbonell-Sala
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Brian Williams
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Jane E Loveland
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | - Maite De María
- Department of Physiological Sciences, College of Veterinary Medicine, University of Florida, Gainesville, FL, USA
- Cherokee Nation System Solutions, contractor to the US Geological Survey-Wetland and Aquatic Research Center, Gainesville, FL, USA
| | - Matthew S Adams
- Department of Molecular Cell and Developmental Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Gabriela Balderrama-Gutierrez
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
| | - Amit K Behera
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Jose M Gonzalez Martinez
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | - Toby Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | - Julien Lagarde
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Flomics Biotech, SL, Barcelona, Spain
| | - Cindy E Liang
- Department of Molecular Cell and Developmental Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Haoran Li
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Marcus Jerryd Meade
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
| | - David A Moraga Amador
- Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, USA
| | - Andrey D Prjibelski
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Center for Bioinformatics and Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Hamed Bostan
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, USA
| | - Ashley M Brooks
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, USA
| | - Muhammed Hasan Çelik
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
| | - Ying Chen
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
| | - Mei R M Du
- Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
| | - Colette Felton
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Jonathan Göke
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore
| | - Saber Hafezqorani
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Ralf Herwig
- Department Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
| | - Hideya Kawaji
- Research Center for Genome & Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
| | - Joseph Lee
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
| | - Jian-Liang Li
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, USA
| | - Matthias Lienhard
- Department Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
| | - Alla Mikheenko
- Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London, UK
| | - Dennis Mulligan
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Mihaela Pertea
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Matthew E Ritchie
- Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Victoria, Australia
| | - Andre D Sim
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
| | - Alison D Tang
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Yuk Kei Wan
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Changqing Wang
- Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
| | - Brandon Y Wong
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Chen Yang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - If Barnes
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | - Andrew E Berry
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | | | - Alyssa Cousineau
- Program in Molecular Medicine, Diabetes Center of Excellence, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Namrita Dhillon
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | | | - Luis Ferrández-Peral
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
| | - Natàlia Garcia-Reyero
- Energy, Installations & Environment, Office of the Assistant Secretary of Defense, Washington, DC, USA
| | | | | | | | | | | | | | - Jorge Mestre-Tomás
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
| | - Jonathan M Mudge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | - Nedka G Panayotova
- Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, USA
| | - Alejandro Paniagua
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
| | | | - Xingjie Ren
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
| | - Eric Rouchka
- Department of Biochemistry & Molecular Genetics, University of Louisville, Louisville, KY, USA
| | - Brandon Saint-John
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Enrique Sapena
- European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Leon Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
| | - Melissa Laird Smith
- Department of Biochemistry & Molecular Genetics, University of Louisville, Louisville, KY, USA
| | - Marie-Marthe Suner
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | - Hazuki Takahashi
- Center for Integrative Medical Sciences, Laboratory for Transcriptome Technology, RIKEN, Yokohama, Japan
| | | | - Piero Carninci
- Center for Integrative Medical Sciences, Laboratory for Transcriptome Technology, RIKEN, Yokohama, Japan
- Human Technopole, Milano, Italy
| | - Nancy D Denslow
- Department of Physiological Sciences, College of Veterinary Medicine, University of Florida, Gainesville, FL, USA
- Center for Environmental and Human Toxicology, Department of Physiological Sciences, University of Florida, Gainesville, FL, USA
| | - Roderic Guigó
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Spain
| | - Margaret E Hunter
- US Geological Survey, Wetland and Aquatic Research Center, Gainesville, FL, USA
| | - Rene Maehr
- Program in Molecular Medicine, Diabetes Center of Excellence, University of Massachusetts Chan Medical School, Worcester, MA, USA
| | - Yin Shen
- Institute for Human Genetics, Department of Neurology, University of California, San Francisco, San Francisco, CA, USA
| | - Hagen U Tilgner
- Brain and Mind Research Institute and Center for Neurogenetics, Weill Cornell Medicine, New York City, NY, USA
| | - Barbara J Wold
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Christopher Vollmers
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus Hinxton, Cambridge, UK
| | - Kin Fai Au
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Gloria M Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- UVA Cancer Center, University of Virginia, Charlottesville, VA, USA
| | - Ali Mortazavi
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, CA, USA
| | - Ana Conesa
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
- Microbiology and Cell Science Department, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, FL, USA
| | - Angela N Brooks
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
2
Sullivan DK, Min KHJ, Hjörleifsson KE, Luebbert L, Holley G, Moses L, Gustafsson J, Bray NL, Pimentel H, Booeshaghi AS, Melsted P, Pachter L. kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq. bioRxiv 2024:2023.11.21.568164. [PMID: 38045414 PMCID: PMC10690192 DOI: 10.1101/2023.11.21.568164] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 12/05/2023]
Abstract
The term "RNA-seq" refers to a collection of assays based on sequencing experiments that involve quantifying RNA species from bulk tissue, from single cells, or from single nuclei. The kallisto, bustools, and kb-python programs are free, open-source software tools for performing this analysis that together can produce gene expression quantification from raw sequencing reads. The quantifications can be individualized for multiple cells, multiple samples, or both. Additionally, these tools allow gene expression values to be classified as originating from nascent RNA species or mature RNA species, making this workflow amenable to both cell-based and nucleus-based assays. This protocol describes in detail how to use kallisto and bustools in conjunction with a wrapper, kb-python, to preprocess RNA-seq data.
Affiliation(s)
- Delaney K Sullivan
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
| | | | | | - Laura Luebbert
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
| | | | - Lambda Moses
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
| | | | - Nicolas L Bray
- Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
| | - Harold Pimentel
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
| | - A Sina Booeshaghi
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
| | - Páll Melsted
- deCODE Genetics/Amgen Inc., Reykjavik, Iceland
- School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
| | - Lior Pachter
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125, USA
3
Pardo-Palacios FJ, Wang D, Reese F, Diekhans M, Carbonell-Sala S, Williams B, Loveland JE, De María M, Adams MS, Balderrama-Gutierrez G, Behera AK, Gonzalez JM, Hunt T, Lagarde J, Liang CE, Li H, Jerryd Meade M, Moraga Amador DA, Prjibelski AD, Birol I, Bostan H, Brooks AM, Hasan Çelik M, Chen Y, Du MR, Felton C, Göke J, Hafezqorani S, Herwig R, Kawaji H, Lee J, Liang Li J, Lienhard M, Mikheenko A, Mulligan D, Ming Nip K, Pertea M, Ritchie ME, Sim AD, Tang AD, Kei Wan Y, Wang C, Wong BY, Yang C, Barnes I, Berry A, Capella S, Dhillon N, Fernandez-Gonzalez JM, Ferrández-Peral L, Garcia-Reyero N, Goetz S, Hernández-Ferrer C, Kondratova L, Liu T, Martinez-Martin A, Menor C, Mestre-Tomás J, Mudge JM, Panayotova NG, Paniagua A, Repchevsky D, Rouchka E, Saint-John B, Sapena E, Sheynkman L, Laird Smith M, Suner MM, Takahashi H, Youngworth IA, Carninci P, Denslow ND, Guigó R, Hunter ME, Tilgner HU, Wold BJ, Vollmers C, Frankish A, Fai Au K, Sheynkman GM, Mortazavi A, Conesa A, Brooks AN. Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. bioRxiv 2023:2023.07.25.550582. [PMID: 37546854 PMCID: PMC10402094 DOI: 10.1101/2023.07.25.550582] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 08/08/2023]
Abstract
The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Affiliation(s)
- Francisco J. Pardo-Palacios
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
- These authors contributed equally to this work
| | - Dingjie Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA
- These authors contributed equally to this work
| | - Fairlie Reese
- Developmental and Cell Biology, University of California, Irvine, Irvine, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, USA
- These authors contributed equally to this work
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, USA
- These authors contributed equally to this work
| | - Sílvia Carbonell-Sala
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Catalonia, Spain
- These authors contributed equally to this work
| | - Brian Williams
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, USA
- These authors contributed equally to this work
| | - Jane E. Loveland
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- These authors contributed equally to this work
| | - Maite De María
- Department of Physiological Sciences, College of Veterinary Medicine, University of Florida, Gainesville, USA
- Center for Environmental and Human Toxicology, University of Florida, Gainesville, USA
- These authors contributed equally to this work
| | - Matthew S. Adams
- Molecular Cell and Developmental Biology, University of California, Santa Cruz, Santa Cruz, USA
- These authors contributed equally to this work
| | - Gabriela Balderrama-Gutierrez
- Developmental and Cell Biology, University of California, Irvine, Irvine, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, USA
- These authors contributed equally to this work
| | - Amit K. Behera
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
- These authors contributed equally to this work
| | - Jose M. Gonzalez
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- These authors contributed equally to this work
| | - Toby Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- These authors contributed equally to this work
| | - Julien Lagarde
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Catalonia, Spain
- Flomics Biotech, Dr Aiguader 88, Barcelona 08003, Spain
- These authors contributed equally to this work
| | - Cindy E. Liang
- Molecular Cell and Developmental Biology, University of California, Santa Cruz, Santa Cruz, USA
- These authors contributed equally to this work
| | - Haoran Li
- Department of Biomedical Informatics, The Ohio State University, Columbus, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA
- These authors contributed equally to this work
| | - Marcus Jerryd Meade
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, USA
- These authors contributed equally to this work
| | - David A. Moraga Amador
- Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, USA
- These authors contributed equally to this work
| | - Andrey D. Prjibelski
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Center for Bioinformatics and Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
- These authors contributed equally to this work
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, Canada
| | - Hamed Bostan
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, USA
| | - Ashley M. Brooks
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, USA
| | - Muhammed Hasan Çelik
- Developmental and Cell Biology, University of California, Irvine, Irvine, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, USA
| | - Ying Chen
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
| | - Mei R.M. Du
- Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
| | - Colette Felton
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
| | - Jonathan Göke
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore
| | - Saber Hafezqorani
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, Canada
| | - Ralf Herwig
- Department Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
| | - Hideya Kawaji
- Research Center for Genome & Medical Sciences, Tokyo Metropolitan Institute of Medical Science, Tokyo, Japan
| | - Joseph Lee
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
| | - Jian Liang Li
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, USA
| | - Matthias Lienhard
- Department Computational Molecular Biology, Max-Planck-Institute for Molecular Genetics, Berlin, Germany
| | - Alla Mikheenko
- Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, London, UK
| | - Dennis Mulligan
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
| | - Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, Canada
| | - Mihaela Pertea
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, USA
| | - Matthew E. Ritchie
- Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
- Department of Medical Biology, The University of Melbourne, Parkville, Australia
| | - Andre D. Sim
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
| | - Alison D. Tang
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
| | - Yuk Kei Wan
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Changqing Wang
- Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
| | - Brandon Y. Wong
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, USA
| | - Chen Yang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, Canada
| | - If Barnes
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Andrew Berry
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | | | - Namrita Dhillon
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
| | | | - Luis Ferrández-Peral
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
| | - Natàlia Garcia-Reyero
- Environmental Laboratory, US Army Engineer Research & Development Center, Vicksburg, USA
| | | | | | | | | | | | | | - Jorge Mestre-Tomás
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
| | - Jonathan M. Mudge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Nedka G. Panayotova
- Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, USA
| | - Alejandro Paniagua
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
| | | | - Eric Rouchka
- Department of Biochemistry & Molecular Genetics, University of Louisville, Louisville, USA
| | - Brandon Saint-John
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
| | - Enrique Sapena
- European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Leon Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, USA
| | - Melissa Laird Smith
- Department of Biochemistry & Molecular Genetics, University of Louisville, Louisville, USA
| | - Marie-Marthe Suner
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Hazuki Takahashi
- Center for Integrative Medical Sciences, Laboratory for Transcriptome Technology, RIKEN, Yokohama, Japan
| | | | - Piero Carninci
- Center for Integrative Medical Sciences, Laboratory for Transcriptome Technology, RIKEN, Yokohama, Japan
- Human Technopole, Milano, Italy
| | - Nancy D. Denslow
- Department of Physiological Sciences, College of Veterinary Medicine, University of Florida, Gainesville, USA
- Center for Environmental and Human Toxicology, Department of Physiological Sciences, University of Florida, Gainesville, USA
| | - Roderic Guigó
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Catalonia, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
| | - Margaret E. Hunter
- U.S. Geological Survey, Wetland and Aquatic Research Center, Gainesville, USA
| | - Hagen U. Tilgner
- Brain and Mind Research Institute and Center for Neurogenetics, Weill Cornell Medicine, New York City, USA
| | - Barbara J. Wold
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, USA
| | - Christopher Vollmers
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Kin Fai Au
- Department of Biomedical Informatics, The Ohio State University, Columbus, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA
| | - Gloria M. Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, USA
- Center for Public Health Genomics, University of Virginia, Charlottesville, USA
- UVA Cancer Center, University of Virginia, Charlottesville, USA
| | - Ali Mortazavi
- Developmental and Cell Biology, University of California, Irvine, Irvine, USA
- Center for Complex Biological Systems, University of California, Irvine, Irvine, USA
| | - Ana Conesa
- Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain
- Microbiology and Cell Science Department, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, USA
| | - Angela N. Brooks
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, USA
- Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, USA
4
Wilbrandt J, Misof B, Panfilio KA, Niehuis O. Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models. BMC Genomics 2019; 20:753. [PMID: 31623555 PMCID: PMC6798390 DOI: 10.1186/s12864-019-6064-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Accepted: 08/27/2019] [Indexed: 02/06/2023]
Abstract
Background The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative. Results Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative for a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from either the automatically predicted or manually annotated gene models alike. Vice versa, some previously reported trends did not appear in either the automatic or manually annotated gene sets, pointing towards insect-specific gene structural peculiarities. Conclusions In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. 
We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.
Affiliation(s)
- Jeanne Wilbrandt
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany. Present address: Hoffmann Research Group, Leibniz Institute on Aging - Fritz Lipmann Institute, Beutenbergstraße 11, 07745, Jena, Germany.
| | - Bernhard Misof
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany
| | - Kristen A Panfilio
- School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, CV4 7AL, UK
| | - Oliver Niehuis
- Evolutionary Biology and Ecology, Institute of Biology I (Zoology), Albert Ludwig University, Hauptstr. 1, 79104, Freiburg, Germany
|
5
|
Ye X, Tang X, Wang X, Che J, Wu M, Liang J, Ye L, Qian Q, Li J, You Z, Zhang Y, Wang S, Zhong B. Improving Silkworm Genome Annotation Using a Proteogenomics Approach. J Proteome Res 2019; 18:3009-3019. [PMID: 31250652 DOI: 10.1021/acs.jproteome.8b00965] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/31/2022]
Abstract
The silkworm genome has been deeply sequenced and assembled, but accurate genome annotation, which is important for modern biological research, remains far from complete. To improve silkworm genome annotation, we carried out a proteogenomics analysis using 9.8 million mass spectra collected from different tissues and developmental stages of the silkworm. The results confirmed the translational products of 4307 existing gene models and identified 1701 novel genome search-specific peptides (GSSPs). Using these GSSPs, 74 novel gene-coding sequences were identified, and 121 existing gene models were corrected. We also identified 1182 novel junction peptides based on an exon-skipping database that resulted in the identification of 973 alternative splicing sites. Furthermore, we performed RNA-seq analysis to improve silkworm genome annotation at the transcriptional level. A total of 1704 new transcripts and 1136 new exons were identified, 2581 untranslated regions (UTRs) were revised, and 1301 alternative splicing (AS) genes were identified. The transcriptomics results were integrated with the proteomics data to further complement and verify the new annotations. In addition, 14 incorrect genes and 10 skipped exons were verified using the two analysis methods. Altogether, we identified 1838 new transcripts and 1593 AS genes and revised 5074 existing genes using proteogenomics and transcriptome analyses. Data are available via ProteomeXchange with identifier PXD009672. The large-scale proteogenomics and transcriptome analyses in this study will greatly improve silkworm genome annotation and contribute to future studies.
Affiliation(s)
- Xiaogang Ye
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Xiaoli Tang
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Xiaoxiao Wang
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Jiaqian Che
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Meiyu Wu
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Jianshe Liang
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Lupeng Ye
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Qiujie Qian
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Jianying Li
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Zhengying You
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Yuyu Zhang
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Shaohua Wang
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
| | - Boxiong Zhong
- College of Animal Sciences, Zhejiang University, Hangzhou, P. R. China
|
6
|
Yi F, Jia Z, Xiao Y, Ma W, Wang J. SPTEdb: a database for transposable elements in salicaceous plants. Database: The Journal of Biological Databases and Curation 2018; 2018:4925802. [PMID: 29688371 PMCID: PMC5846285 DOI: 10.1093/database/bay024] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/30/2017] [Accepted: 02/12/2018] [Indexed: 01/10/2023]
Abstract
Although transposable elements (TEs) play significant roles in the structural, functional and evolutionary dynamics of salicaceous plant genomes, the accurate identification, definition and classification of TEs are still inadequate. In this study, we identified 18 393 TEs from Populus trichocarpa, Populus euphratica and Salix suchowensis using a combination of signature-based, similarity-based and de novo methods, and annotated them into 1621 families. A comprehensive and user-friendly web-based database, SPTEdb, was constructed and made available to researchers. SPTEdb enables users to browse, retrieve and download TE sequences from the database. Several analysis tools, including BLAST, HMMER, GetORF and Cut sequence, were also integrated into SPTEdb to help users mine the TE data easily and effectively. In summary, SPTEdb will facilitate the study of TE biology and functional genomics in salicaceous plants. Database URL: http://genedenovoweb.ticp.net:81/SPTEdb/index.php
Affiliation(s)
- Fei Yi
- State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China; College of Biological and Pharmaceutical Sciences, Three Gorges University, Yichang 443002, China
| | - Zirui Jia
- State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China
| | - Yao Xiao
- State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China
| | - Wenjun Ma
- State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China
| | - Junhui Wang
- State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing 100091, China
|
7
|
Reid I. Evaluating Programs for Predicting Genes and Transcripts with RNA-Seq Support in Fungal Genomes. Methods Mol Biol 2018; 1775:209-227. [PMID: 29876820 DOI: 10.1007/978-1-4939-7804-5_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The steps needed to computationally predict genes and transcripts in fungal genomes with support from RNA-Seq data are described in detail for three prediction programs: CodingQuarry, BRAKER1, and Harfang. These programs predicted from 86% to 92% (Harfang) of the genes in a manually curated reference set for Aspergillus niger strain NRRL3. Genes with little or no RNA-Seq read coverage were predicted less successfully than genes with adequate coverage.
Affiliation(s)
- Ian Reid
- Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada.
|
8
|
Next Generation Sequencing Data and Proteogenomics. Advances in Experimental Medicine and Biology 2016; 926:11-19. [DOI: 10.1007/978-3-319-42316-6_2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/28/2022]
|
9
|
Buisine N, Ruan X, Bilesimo P, Grimaldi A, Alfama G, Ariyaratne P, Mulawadi F, Chen J, Sung WK, Liu ET, Demeneix BA, Ruan Y, Sachs LM. Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation Reach the Resolution Required for In Vivo ChIA-PET Analysis. PLoS One 2015; 10:e0137526. [PMID: 26348928 PMCID: PMC4562602 DOI: 10.1371/journal.pone.0137526] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 01/20/2015] [Accepted: 08/19/2015] [Indexed: 12/11/2022] Open
Abstract
Genome-wide functional analyses require high-resolution genome assembly and annotation. We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus tropicalis. As the available versions of the Xenopus tropicalis assembly and annotation lacked the resolution required for ChIA-PET, we improved the genome assembly version 4.1 and annotations using data derived from the paired end tag (PET) sequencing technologies and approaches (e.g., DNA-PET [gPET], RNA-PET etc.). The large insert (~10Kb, ~17Kb) paired end DNA-PET with high throughput NGS sequencing not only significantly improved genome assembly quality, but also strongly reduced genome “fragmentation”, reducing total scaffold numbers by ~60%. Next, RNA-PET technology, designed and developed for the detection of full-length transcripts and fusion mRNA in whole transcriptome studies (ENCODE consortia), was applied to capture the 5' and 3' ends of transcripts. These amendments in assembly and annotation were essential prerequisites for the ChIA-PET analysis of TH transcription regulation. Their application revealed complex regulatory configurations of target genes and the structures of the regulatory networks underlying physiological responses. Our work allowed us to improve the quality of Xenopus tropicalis genomic resources, reaching the standard required for ChIA-PET analysis of transcriptional networks. We consider that the workflow proposed offers useful conceptual and methodological guidance and can readily be applied to other non-conventional models that have low-resolution genome data.
Affiliation(s)
- Nicolas Buisine
- UMR CNRS 7221, Muséum National d'Histoire Naturelle, Paris, France
| | - Xiaoan Ruan
- The Jackson Laboratory of Genomic Medicine, Farmington, Connecticut, United States of America
- Department of Genetics and Developmental Biology, University of Connecticut, Farmington, Connecticut, United States of America
- Genome Institute of Singapore, Singapore, Singapore
| | - Patrice Bilesimo
- UMR CNRS 7221, Muséum National d'Histoire Naturelle, Paris, France
- Watchfrog S.A.S., Evry, France
| | - Alexis Grimaldi
- UMR CNRS 7221, Muséum National d'Histoire Naturelle, Paris, France
| | - Gladys Alfama
- UMR CNRS 7221, Muséum National d'Histoire Naturelle, Paris, France
| | | | | | - Jieqi Chen
- Genome Institute of Singapore, Singapore, Singapore
| | | | - Edison T. Liu
- The Jackson Laboratory of Genomic Medicine, Farmington, Connecticut, United States of America
- Department of Genetics and Developmental Biology, University of Connecticut, Farmington, Connecticut, United States of America
- Genome Institute of Singapore, Singapore, Singapore
| | | | - Yijun Ruan
- The Jackson Laboratory of Genomic Medicine, Farmington, Connecticut, United States of America
- Department of Genetics and Developmental Biology, University of Connecticut, Farmington, Connecticut, United States of America
- Genome Institute of Singapore, Singapore, Singapore
- * E-mail: (YR); (LMS)
| | - Laurent M. Sachs
- UMR CNRS 7221, Muséum National d'Histoire Naturelle, Paris, France
- * E-mail: (YR); (LMS)
|
10
|
Reid I, O’Toole N, Zabaneh O, Nourzadeh R, Dahdouli M, Abdellateef M, Gordon PMK, Soh J, Butler G, Sensen CW, Tsang A. SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models. BMC Bioinformatics 2014; 15:229. [PMID: 24980894 PMCID: PMC4084796 DOI: 10.1186/1471-2105-15-229] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 08/15/2013] [Accepted: 06/17/2014] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Locating the protein-coding genes in novel genomes is essential to understanding and exploiting the genomic information but it is still difficult to accurately predict all the genes. The recent availability of detailed information about transcript structure from high-throughput sequencing of messenger RNA (RNA-Seq) delineates many expressed genes and promises increased accuracy in gene prediction. Computational gene predictors have been intensively developed for and tested in well-studied animal genomes. Hundreds of fungal genomes are now or will soon be sequenced. The differences of fungal genomes from animal genomes and the phylogenetic sparsity of well-studied fungi call for gene-prediction tools tailored to them. RESULTS SnowyOwl is a new gene prediction pipeline that uses RNA-Seq data to train and provide hints for the generation of Hidden Markov Model (HMM)-based gene predictions and to evaluate the resulting models. The pipeline has been developed and streamlined by comparing its predictions to manually curated gene models in three fungal genomes and validated against the high-quality gene annotation of Neurospora crassa; SnowyOwl predicted N. crassa genes with 83% sensitivity and 65% specificity. SnowyOwl gains sensitivity by repeatedly running the HMM gene predictor Augustus with varied input parameters and selectivity by choosing the models with best homology to known proteins and best agreement with the RNA-Seq data. CONCLUSIONS SnowyOwl efficiently uses RNA-Seq data to produce accurate gene models in both well-studied and novel fungal genomes. The source code for the SnowyOwl pipeline (in Python) and a web interface (in PHP) is freely available from http://sourceforge.net/projects/snowyowl/.
Affiliation(s)
- Ian Reid
- Centre for Structural and Functional Genomics, Concordia University, 7141 Sherbrooke St. W, Montreal, QC H4B 1R6, Canada
| | - Nicholas O’Toole
- Centre for Structural and Functional Genomics, Concordia University, 7141 Sherbrooke St. W, Montreal, QC H4B 1R6, Canada
| | - Omar Zabaneh
- Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
| | - Reza Nourzadeh
- Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
| | - Mahmoud Dahdouli
- Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
| | - Mostafa Abdellateef
- Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
| | - Paul MK Gordon
- Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
| | - Jung Soh
- Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
| | - Gregory Butler
- Centre for Structural and Functional Genomics, Concordia University, 7141 Sherbrooke St. W, Montreal, QC H4B 1R6, Canada
| | - Christoph W Sensen
- Faculty of Medicine, Visual Genomics Centre, University of Calgary, 3330 Hospital Drive NW, Calgary, AB T2N 4N1, Canada
| | - Adrian Tsang
- Centre for Structural and Functional Genomics, Concordia University, 7141 Sherbrooke St. W, Montreal, QC H4B 1R6, Canada
|
11
|
Ashkenazi S, Snir R, Ofran Y. Assessing the relationship between conservation of function and conservation of sequence using photosynthetic proteins. Bioinformatics 2012; 28:3203-10. [PMID: 23080118 DOI: 10.1093/bioinformatics/bts608] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022]
Abstract
MOTIVATION Assessing the false positive rate of function prediction methods is difficult, as it is hard to establish that a protein does not have a certain function. To determine to what extent proteins with similar sequences have a common function, we focused on photosynthesis-related proteins. A protein that comes from a non-photosynthetic organism is, undoubtedly, not involved in photosynthesis. RESULTS We show that function diverges very rapidly: 70% of the close homologs of photosynthetic proteins come from non-photosynthetic organisms. Therefore, high sequence similarity, in most cases, is not tantamount to similar function. However, we found that many functionally similar proteins often share short sequence elements, which may correspond to a functional site and could reveal functional similarities more accurately than sequence similarity. CONCLUSIONS These results shed light on the way biological function is conserved in evolution and may help improve large-scale analysis of protein function.
Affiliation(s)
- Shaul Ashkenazi
- The Goodman Faculty of Life Sciences, Bar Ilan University, Ramat Gan 52900, Israel
|
12
|
Xu HE, Zhang HH, Han MJ, Shen YH, Huang XZ, Xiang ZH, Zhang Z. [Computational approaches for identification and classification of transposable elements in eukaryotic genomes]. Yi Chuan = Hereditas 2012; 34:1009-1019. [PMID: 22917906 DOI: 10.3724/sp.j.1005.2012.01009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Repetitive sequences (repeats) represent a significant fraction of the eukaryotic genomes and can be divided into tandem repeats, segmental duplications, and interspersed repeats on the basis of their sequence characteristics and how they are formed. Most interspersed repeats are derived from transposable elements (TEs). Eukaryotic TEs have been subdivided into two major classes according to the intermediate they use to move. The transposition and amplification of TEs have a great impact on the evolution of genes and the stability of genomes. However, identification and classification of TEs are complex and difficult due to the fact that their structure and classification are complex and diverse compared with those of other types of repeats. Here, we briefly introduced the function and classification of TEs, and summarized three different steps for identification, classification and annotation of TEs in eukaryotic genomes: (1) assembly of a repeat library, (2) repeat correction and classification, and (3) genome annotation. The existing computational approaches for each step were summarized and the advantages and disadvantages of the approaches were also highlighted in this review. To accurately identify, classify, and annotate the TEs in eukaryotic genomes requires combined methods. This review provides useful information for biologists who are not familiar with these approaches to find their way through the forest of programs.
Affiliation(s)
- Hong-En Xu
- The Institute of Sericulture and Systems Biology, Southwest University, Chongqing, China.
|
13
|
Abstract
Evolutionary genomics is a field that relies heavily upon comparing genomes, that is, the full complement of genes of one species with another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation as manual annotation is extremely time-consuming and costly. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.
Affiliation(s)
- Tyler Alioto
- Centro Nacional de Análisis Genómico, Barcelona, Spain.
|
14
|
Grassa CJ, Kulathinal RJ. Elevated Evolutionary Rates among Functionally Diverged Reproductive Genes across Deep Vertebrate Lineages. International Journal of Evolutionary Biology 2011; 2011:274975. [PMID: 21811675 PMCID: PMC3147129 DOI: 10.4061/2011/274975] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 02/01/2011] [Revised: 05/17/2011] [Accepted: 05/23/2011] [Indexed: 11/24/2022]
Abstract
Among closely related taxa, proteins involved in reproduction generally evolve more rapidly than other proteins. Here, we apply a functional and comparative genomics approach to compare functional divergence across a deep phylogenetic array of egg-laying and live-bearing vertebrate taxa. We aligned and annotated a set of 4,986 1:1:1:1:1 orthologs in Anolis carolinensis (green lizard), Danio rerio (zebrafish), Xenopus tropicalis (frog), Gallus gallus (chicken), and Mus musculus (mouse) according to function using ESTs from available reproductive (including testis and ovary) and non-reproductive tissues as well as Gene Ontology. For each species lineage, genes were further classified as tissue-specific (found in a single tissue) or tissue-expressed (found in multiple tissues). Within independent vertebrate lineages, we generally find that gonadal-specific genes evolve at a faster rate than gonadal-expressed genes and significantly faster than non-reproductive genes. Among the gonadal set, testis genes are generally more diverged than ovary genes. Surprisingly, an opposite but nonsignificant pattern is found among the subset of orthologs that remained functionally conserved across all five lineages. These contrasting evolutionary patterns found between functionally diverged and functionally conserved reproductive orthologs provide evidence for pervasive and potentially cryptic lineage-specific selective processes on ancestral reproductive systems in vertebrates.
Affiliation(s)
- Christopher J Grassa
- Department of Botany, University of British Columbia, 6270 University Boulevard, Vancouver, BC, Canada V6T 1Z4
|
15
|
Renuse S, Chaerkady R, Pandey A. Proteogenomics. Proteomics 2011; 11:620-30. [DOI: 10.1002/pmic.201000615] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.2] [Received: 09/29/2010] [Revised: 11/14/2010] [Accepted: 11/16/2010] [Indexed: 12/13/2022]
|
16
|
Orthopoxvirus genome evolution: the role of gene loss. Viruses 2010; 2:1933-1967. [PMID: 21994715 PMCID: PMC3185746 DOI: 10.3390/v2091933] [Citation(s) in RCA: 125] [Impact Index Per Article: 8.9] [Received: 07/14/2010] [Revised: 08/25/2010] [Accepted: 09/01/2010] [Indexed: 12/26/2022] Open
Abstract
Poxviruses are highly successful pathogens, known to infect a variety of hosts. The family Poxviridae includes Variola virus, the causative agent of smallpox, which has been eradicated as a public health threat but could potentially reemerge as a bioterrorist threat. The risk scenario includes other animal poxviruses and genetically engineered manipulations of poxviruses. Studies of orthologous gene sets have established the evolutionary relationships of members within the Poxviridae family. It is not clear, however, how variations between family members arose in the past, an important issue in understanding how these viruses may vary and possibly produce future threats. Using a newly developed poxvirus-specific tool, we predicted accurate gene sets for viruses with completely sequenced genomes in the genus Orthopoxvirus. Employing sensitive sequence comparison techniques together with comparison of syntenic gene maps, we established the relationships between all viral gene sets. These techniques allowed us to unambiguously identify the gene loss/gain events that have occurred over the course of orthopoxvirus evolution. It is clear that, among all existing Orthopoxvirus species, no individual species has acquired protein-coding genes unique to that species. All existing species contain only genes that are also present in members of the species Cowpox virus, and cowpox virus strains contain every gene present in any other orthopoxvirus strain. These results support a theory of reductive evolution in which the reduction in size of the core gene set of a putative ancestral virus played a critical role in speciation and confining any newly emerging virus species to a particular environmental (host or tissue) niche.
|
17
|
Reese MG, Moore B, Batchelor C, Salas F, Cunningham F, Marth GT, Stein L, Flicek P, Yandell M, Eilbeck K. A standard variation file format for human genome sequences. Genome Biol 2010; 11:R88. [PMID: 20796305 PMCID: PMC2945790 DOI: 10.1186/gb-2010-11-8-r88] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.2] [Received: 04/29/2010] [Revised: 07/26/2010] [Accepted: 08/26/2010] [Indexed: 12/03/2022] Open
Abstract
Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment.
Affiliation(s)
- Martin G Reese
- Omicia, 2200 Powell Street, Suite 525, Emeryville, CA 94608, USA.
|
18
|
Bodian DL, Klein TE. COLdb, a database linking genetic data to molecular function in fibrillar collagens. Hum Mutat 2009; 30:946-51. [PMID: 19370761 DOI: 10.1002/humu.20978] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
Fibrillar collagens are ubiquitous proteins essential for the structural integrity of bones, skin, blood vessels, and other tissues. Mutations in collagen genes result in disorders including osteogenesis imperfecta, chondrodysplasias, and Ehlers-Danlos syndromes, but the molecular basis for the heterogeneity of clinical phenotypes is not well understood. A more complete understanding of the relationship between sequence and phenotype requires synthesis of multiple facets of collagen structure and function. To facilitate such an analysis, we developed COLdb, a freely available database integrating collagen biological and physicochemical properties with known variants. A Web-based, interactive, graphical user interface displays the data as annotations on the collagen protein sequences. Collagen gene-level data are provided as custom tracks for display in the UCSC genome browser. COLdb currently includes 35,582 data points spanning collagen types I, II, and III, and, importantly, users can add their own data to the display. The database is the first comprehensive integration of disparate functional information on the three major fibrillar collagens, and the first electronic collection of mutations in the COL2A1 gene.
Affiliation(s)
- Dale L Bodian
- Genetics Department, School of Medicine, Stanford University, Stanford, CA 94305-5120, USA
|
19
|
Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, Krüger N, Sonnenburg S, Rätsch G. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res 2009; 19:2133-43. [PMID: 19564452 DOI: 10.1101/gr.090597.108] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.9] [Indexed: 11/25/2022]
Abstract
We present a highly accurate gene-prediction system for eukaryotic genomes, called mGene. It combines in an unprecedented manner the flexibility of generalized hidden Markov models (gHMMs) with the predictive power of modern machine learning methods, such as Support Vector Machines (SVMs). Its excellent performance was proved in an objective competition based on the genome of the nematode Caenorhabditis elegans. Considering the average of sensitivity and specificity, the developmental version of mGene exhibited the best prediction performance on nucleotide, exon, and transcript level for ab initio and multiple-genome gene-prediction tasks. The fully developed version shows superior performance in 10 out of 12 evaluation criteria compared with the other participating gene finders, including Fgenesh++ and Augustus. An in-depth analysis of mGene's genome-wide predictions revealed that approximately 2200 predicted genes were not contained in the current genome annotation. Testing a subset of 57 of these genes by RT-PCR and sequencing, we confirmed expression for 24 (42%) of them. mGene missed 300 annotated genes, out of which 205 were unconfirmed. RT-PCR testing of 24 of these genes resulted in a success rate of merely 8%. These findings suggest that even the gene catalog of a well-studied organism such as C. elegans can be substantially improved by mGene's predictions. We also provide gene predictions for the four nematodes C. briggsae, C. brenneri, C. japonica, and C. remanei. Comparing the resulting proteomes among these organisms and to the known protein universe, we identified many species-specific gene inventions. In a quality assessment of several available annotations for these genomes, we find that mGene's predictions are most accurate.
|
20
|
Abstract
Automated evidence-based gene building is a rapid and cost-effective way to provide reliable gene annotations on newly sequenced genomes. One of the limitations of evidence-based gene builders, however, is their requirement for transcriptional evidence – known proteins, full-length cDNAs, or expressed sequence tags (ESTs) – in the species of interest. This limitation is of particular concern for plant genomes, where the rate of genome sequencing is greatly outpacing the rate of EST- and cDNA-sequencing projects. To overcome this limitation, we have developed an evidence-based gene build system (the Gramene pipeline) that can use transcriptional evidence across related species. The Gramene pipeline uses the Ensembl computing infrastructure with a novel data processing scheme. Using the previously annotated plant genomes, the dicot Arabidopsis thaliana and the monocot Oryza sativa, we show that the cross-species ESTs from within monocot or dicot class are a valuable source of evidence for gene predictions. We also find that, using only EST and cross-species evidence, the Gramene pipeline can generate a plant gene set that is comparable in quality to the human genes based on known proteins and full-length cDNAs. We compare the Gramene pipeline to several widely used ab initio gene prediction programs in rice; this comparison shows the pipeline performs favorably at both the gene and exon levels with cross-species gene products only. We discuss the results of testing the pipeline on a 22-Mb region of the newly sequenced maize genome and discuss potential application of the pipeline to other genomes.
|
21
|
Commins J, Toft C, Fares MA. Computational biology methods and their application to the comparative genomics of endocellular symbiotic bacteria of insects. Biol Proced Online 2009; 11:52-78. [PMID: 19495914 PMCID: PMC3055744 DOI: 10.1007/s12575-009-9004-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 01/23/2009] [Accepted: 02/17/2009] [Indexed: 12/02/2022]
Abstract
Comparative genomics has become a truly tantalizing challenge in the postgenomic era. This challenge has been magnified mostly by the plethora of new genomes becoming available on a daily basis. The overwhelming list of new genomes to compare has pushed the field of bioinformatics and computational biology toward the design and development of methods capable of identifying patterns in a sea of data noise. Despite many advances in this endeavor, persistent exceptions to the general patterns continue to pose difficulties in generalizing methods for comparative genomics. In this review, we discuss the different tools devised to undertake the challenge of comparative genomics and some of the exceptions that compromise the generality of such methods. We focus on endosymbiotic bacteria of insects because of the peculiarities of their genomic dynamics when compared to free-living organisms.
Affiliation(s)
- Jennifer Commins
  - Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, Trinity College, University of Dublin, Dublin, Ireland
- Christina Toft
  - Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, Trinity College, University of Dublin, Dublin, Ireland
- Mario A Fares
  - Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, Trinity College, University of Dublin, Dublin, Ireland
|
22
|
Quantitative measures for the management and comparison of annotated genomes. BMC Bioinformatics 2009; 10:67. [PMID: 19236712 PMCID: PMC2653490 DOI: 10.1186/1471-2105-10-67] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.1] [Received: 10/20/2008] [Accepted: 02/23/2009] [Indexed: 11/22/2022]
Abstract
Background The ever-increasing number of sequenced and annotated genomes has made management of their annotations a significant undertaking, especially for large eukaryotic genomes containing many thousands of genes. Typically, changes in gene and transcript numbers are used to summarize changes from release to release, but these measures say nothing about changes to individual annotations, nor do they provide any means to identify annotations in need of manual review. Results In response, we have developed a suite of quantitative measures to better characterize changes to a genome's annotations between releases, and to prioritize problematic annotations for manual review. We have applied these measures to the annotations of five eukaryotic genomes over multiple releases – H. sapiens, M. musculus, D. melanogaster, A. gambiae, and C. elegans. Conclusion Our results provide the first detailed, historical overview of how these genomes' annotations have changed over the years, and demonstrate the usefulness of these measures for genome annotation management.
|
23
|
Abstract
The sequences of many eukaryotic genomes are now available from a personal computer to any researcher in the worldwide scientific community. However, the sequences are worthless without adequate annotation of the biologically meaningful elements. The annotation of the genes, in particular, is a challenging task that cannot be tackled without the aid of specific bioinformatics tools. In this chapter, we present a simple protocol, based mainly on the combination of the program GeneID and other computational tools, to annotate the location of a gene that was previously annotated in D. melanogaster in the recently assembled genome of D. yakuba.
Affiliation(s)
- Enrique Blanco
- Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Spain
|
24
|
Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, Stein LD. nGASP--the nematode genome annotation assessment project. BMC Bioinformatics 2008; 9:549. [PMID: 19099578 PMCID: PMC2651883 DOI: 10.1186/1471-2105-9-549] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Received: 08/12/2008] [Accepted: 12/19/2008] [Indexed: 11/15/2022]
Abstract
Background While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. Results The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. Conclusion This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
Affiliation(s)
- Avril Coghlan
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
|
25
|
Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL. De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res 2008; 19:294-305. [PMID: 19015323 DOI: 10.1101/gr.083311.108] [Citation(s) in RCA: 121] [Impact Index Per Article: 7.6] [Indexed: 11/25/2022]
Abstract
We developed a novel approach for de novo genome assembly using only sequence data from high-throughput short read sequencing technologies. By combining data generated from 454 Life Sciences (Roche) and Illumina (formerly known as Solexa sequencing) sequencing platforms, we reliably assembled genomes into large scaffolds at a fraction of the traditional cost and without use of a reference sequence. We applied this method to two isolates of the phytopathogenic bacterium Pseudomonas syringae. Sequencing and reassembly of the well-studied tomato and Arabidopsis pathogen, Pto(DC3000), facilitated development and testing of our method. Sequencing of a distantly related rice pathogen, Por(1_6), demonstrated our method's efficacy for de novo assembly of novel genomes. Our assembly of Por(1_6) yielded an N50 scaffold size of 531,821 bp with >75% of the predicted genome covered by scaffolds over 100,000 bp. One of the critical phenotypic differences between strains of P. syringae is the range of plant hosts they infect. This is largely determined by their complement of type III effector proteins. The genome of Por(1_6) is the first sequenced for a P. syringae isolate that is a pathogen of monocots, and, as might be predicted, its complement of type III effectors differs substantially from the previously sequenced isolates of this species. The genome of Por(1_6) helps to define an expansion of the P. syringae pan-genome, a corresponding contraction of the core genome, and a further diversification of the type III effector complement for this important plant pathogen species.
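For readers unfamiliar with the N50 statistic quoted in this abstract, the following is a minimal sketch of how it is commonly computed: sort the scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are illustrative assumptions, not data or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in bp (made up for illustration).
scaffold_lengths = [400_000, 300_000, 150_000, 100_000, 50_000]
print(n50(scaffold_lengths))  # prints 300000
```

In this toy assembly the two longest scaffolds already cover 700,000 of 1,000,000 bp, so the N50 is the length of the second one.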
Affiliation(s)
- Josephine A Reinhardt
- Department of Biology, University of North Carolina, Chapel Hill, North Carolina 27599, USA
|
26
|
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008; 18:1979-90. [PMID: 18757608 DOI: 10.1101/gr.081612.108] [Citation(s) in RCA: 638] [Impact Index Per Article: 39.9] [Indexed: 11/25/2022]
Abstract
We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The algorithm extends our previously developed method tested on genomes of Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster. To better reflect features of fungal gene organization, we enhanced the intron submodel to accommodate sequences with and without branch point sites. This design enables the algorithm to work equally well for species with the kinds of variations in splicing mechanisms seen in the fungal phyla Ascomycota, Basidiomycota, and Zygomycota. Upon self-training, the intron submodel switches on in several steps to reach its full complexity. We demonstrate that the algorithm accuracy, both at the exon and the whole gene level, is favorably compared to the accuracy of gene finders that employ supervised training. Application of the new method to known fungal genomes indicates substantial improvement over existing annotations. By eliminating the effort necessary to build comprehensive training sets, the new algorithm can streamline and accelerate the process of annotation in a large number of fungal genome sequencing projects.
|
27
|
Advances in the sequencing of the genome of the adenophorean nematode Trichinella spiralis. Parasitology 2008; 135:869-80. [PMID: 18598573 DOI: 10.1017/s0031182008004472] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/05/2022]
Abstract
The adenophorean nematodes are evolutionarily distant from other species in the phylum Nematoda. Interspecific comparisons of predicted proteins have supported such an ancient divergence. Accordingly, Trichinella spiralis represents a basal nematode representative for genome sequencing focused on gaining a deeper insight into the evolutionary biology of nematodes. In addition, molecular characteristics that are conserved across the phylum could be of great value for control strategies with broad application. In this review, we describe and summarize progress that has been made on the sequencing and analysis of the T. spiralis genome. The genome sequence was used in preliminary analyses for the investigation of specific questions relating to the biology of T. spiralis and, more generally, to parasitic nematodes. For instance, we evaluated an unusually large DNase II-like protein family, predicted proteins of prospective interest in the parasite-host muscle cell interaction, anthelmintic targets and prospective intestinal genes, the encoded proteins (potentially) linked to immunological control against other nematodes. The results are discussed in relation to characteristics that are broadly conserved among evolutionary distant nematodes. The results lead to expectations that this genome sequence will contribute to advances in research on T. spiralis and other parasitic nematodes.
|
28
|
Zou J, Hallen MA, Yankel CD, Endow SA. A microtubule-destabilizing kinesin motor regulates spindle length and anchoring in oocytes. J Cell Biol 2008; 180:459-66. [PMID: 18250200 PMCID: PMC2234233 DOI: 10.1083/jcb.200711031] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
The kinesin-13 motor, KLP10A, destabilizes microtubules at their minus ends in mitosis and binds to polymerizing plus ends in interphase, regulating spindle and microtubule dynamics. Little is known about kinesin-13 motors in meiosis. In this study, we report that KLP10A localizes to the unusual pole bodies of anastral Drosophila melanogaster oocyte meiosis I spindles as well as spindle fibers, centromeres, and cortical microtubules. We frequently observe the pole bodies attached to cortical microtubules, indicating that KLP10A could mediate spindle anchoring to the cortex via cortical microtubules. Oocytes treated with drugs that suppress microtubule dynamics exhibit spindles that are reoriented more vertically to the cortex than untreated controls. A dominant-negative klp10A mutant shows both reoriented and shorter oocyte spindles, implying that, unexpectedly, KLP10A may stabilize rather than destabilize microtubules, regulating spindle length and positioning the oocyte spindle. By altering microtubule dynamics, KLP10A could promote spindle reorientation upon oocyte activation.
Affiliation(s)
- Jianwei Zou
- Department of Cell Biology, Duke University Medical Center, Durham, NC 27710, USA
|
29
|
Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 2008; 450:219-32. [PMID: 17994088 DOI: 10.1038/nature06340] [Citation(s) in RCA: 462] [Impact Index Per Article: 28.9] [Received: 07/21/2007] [Accepted: 10/04/2007] [Indexed: 12/25/2022]
Abstract
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or 'evolutionary signatures', dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies.
|
30
|
Díaz-Pérez C, Cervantes C, Campos-García J, Julián-Sánchez A, Riveros-Rosas H. Phylogenetic analysis of the chromate ion transporter (CHR) superfamily. FEBS J 2007; 274:6215-27. [DOI: 10.1111/j.1742-4658.2007.06141.x] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Indexed: 11/28/2022]
|
31
|
Bowser PRF, Tobe SS. Comparative genomic analysis of allatostatin-encoding (Ast) genes in Drosophila species and prediction of regulatory elements by phylogenetic footprinting. Peptides 2007; 28:83-93. [PMID: 17175069 DOI: 10.1016/j.peptides.2006.08.033] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 04/27/2006] [Revised: 08/04/2006] [Accepted: 08/04/2006] [Indexed: 01/02/2023]
Abstract
The role of the YXFGLa family of allatostatin (AST) peptides in dipterans is not well-established. The recent completion of sequencing of genomes for multiple Drosophila species provides an opportunity to study the evolutionary variation of the allatostatins and to examine regulatory elements that control gene expression. We performed comparative analyses of Ast genes from seven Drosophila species (Drosophila melanogaster, Drosophila simulans, Drosophila ananassae, Drosophila yakuba, Drosophila pseudoobscura, Drosophila mojavensis, and Drosophila grimshawi) and used phylogenetic footprinting methods to identify conserved noncoding motifs, which are candidates for regulatory regions. The peptides encoded by the Ast precursor are nearly identical across species with the exception of AST-1, in which the leading residue may be either methionine or valine. Phylogenetic footprinting predicts as few as 3, to as many as 17 potential regulatory sites depending on the parameters used during analysis. These include a Hunchback motif approximately 1.2 kb upstream of the open reading frame (ORF), overlapping motifs for two Broad-complex isoforms in the first intron, and a CF2-II motif located in the 3'-UTR. Understanding the regulatory elements involved in Ast expression may provide insight into the function of this neuropeptide family.
Affiliation(s)
- P R F Bowser
- Department of Zoology, University of Toronto, 25 Harbord Street, Toronto, Ont. M5S 3G5, Canada
|
32
|
Christoffels A, Bartfai R, Srinivasan H, Komen H, Orban L. Comparative genomics in cyprinids: common carp ESTs help the annotation of the zebrafish genome. BMC Bioinformatics 2006; 7 Suppl 5:S2. [PMID: 17254304 PMCID: PMC1764476 DOI: 10.1186/1471-2105-7-s5-s2] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Indexed: 11/22/2022]
Abstract
Background Automatic annotation of sequenced eukaryotic genomes integrates a combination of methodologies such as ab-initio methods and alignment of homologous genes and/or proteins. For example, annotation of the zebrafish genome within Ensembl relies heavily on available cDNA and protein sequences from two distantly related fish species and other vertebrates that have diverged several hundred million years ago. The scarcity of genomic information from other cyprinids provides the impetus to leverage EST collections to understand gene structures in this diverse teleost group. Results We have generated 6,050 ESTs from the differentiating testis of common carp (Cyprinus carpio) and clustered them with 9,303 non-gonadal ESTs from CarpBase as well as 1,317 ESTs and 652 common carp mRNAs from GenBank. Over 28% of the resulting 8,663 unique transcripts are exclusively testis-derived ESTs. Moreover, 974 of these transcripts did not match any sequence in the zebrafish or fathead minnow EST collection. A total of 1,843 unique common carp sequences could be stringently mapped to the zebrafish genome (version 5), of which 1,752 matched coding sequences of zebrafish genes with or without potential splice variants. We show that 91 common carp transcripts map to intergenic and intronic regions on the zebrafish genome assembly and regions annotated with non-teleost sequences. Interestingly, an additional 42 common carp transcripts indicate the potential presence of new splicing variants not found in zebrafish databases so far. The fact that common carp transcripts help the identification or confirmation of these coding regions in zebrafish exemplifies the usefulness of sequences from closely related species for the annotation of model genomes. We also demonstrate that 5' UTR sequences of common carp and zebrafish orthologs share a significant level of similarity based on preservation of motif arrangements for as many as 10 ab-initio motifs. 
Conclusion Our data show that there is sufficient homology between the transcribed sequences of common carp and zebrafish to warrant an even deeper cyprinid transcriptome comparison. On the other hand, the comparative analysis illustrates the value in utilizing partially sequenced transcriptomes to understand gene structure in this diverse teleost group. We highlight the need for integrated resources to leverage the wealth of fragmented genomic data.
Affiliation(s)
- Alan Christoffels
  - Computational Biology Group, Temasek Life Sciences Laboratory, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore
- Richard Bartfai
  - Reproductive Genomics Group, Temasek Life Sciences Laboratory, Singapore
- Hamsa Srinivasan
  - Computational Biology Group, Temasek Life Sciences Laboratory, Singapore
- Hans Komen
  - Animal Breeding and Genetics Group, Wageningen University, Wageningen, The Netherlands
- Laszlo Orban
  - Reproductive Genomics Group, Temasek Life Sciences Laboratory, Singapore
  - Department of Biological Sciences, The National University of Singapore, Singapore
|
33
|
Pfarao: a web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase). BMC Genomics 2006; 7:300. [PMID: 17134497 PMCID: PMC1684263 DOI: 10.1186/1471-2164-7-300] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Received: 06/27/2006] [Accepted: 11/29/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. To date, there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. DESCRIPTION Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database-driven online working environment for the analysis of manually annotated protein sequences and their relationships. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows users to conveniently browse the database and to compile tabular and graphical summaries of its content. CONCLUSION We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that are generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein.
|
34
|
Ohler U. Identification of core promoter modules in Drosophila and their application in accurate transcription start site prediction. Nucleic Acids Res 2006; 34:5943-50. [PMID: 17068082 PMCID: PMC1635271 DOI: 10.1093/nar/gkl608] [Citation(s) in RCA: 83] [Impact Index Per Article: 4.6] [Indexed: 11/12/2022]
Abstract
The reliable recognition of eukaryotic RNA polymerase II core promoters, and the associated transcription start sites (TSSs) of genes, has been an ongoing challenge for computational biology. High throughput experimental methods such as tiling arrays or 5' SAGE/EST sequencing have recently led to much larger datasets of core promoters, and to the assessment that the well-known core promoter sequence elements such as the TATA box appear to be much less frequent than thought. Here, we address the co-occurrence of several previously identified core promoter sequence motifs in Drosophila melanogaster to determine frequently occurring core promoter modules. We then use this in a new strategy to model core promoters as a set of alternative submodels for different core promoter architectures reflecting these different motif modules. We show that this system improves greatly on computational promoter recognition and leads to highly accurate in silico TSS prediction. Our results indicate that at least for the case of the fruit fly, we are getting closer to an understanding of how the beginning of a gene is defined in a eukaryotic genome.
Affiliation(s)
- Uwe Ohler
- Institute for Genome Sciences and Policy, Durham, NC 27708, USA.
|
35
|
Bandyopadhyay S, Sharan R, Ideker T. Systematic identification of functional orthologs based on protein network comparison. Genome Res 2006; 16:428-35. [PMID: 16510899 PMCID: PMC1415213 DOI: 10.1101/gr.4526006] [Citation(s) in RCA: 148] [Impact Index Per Article: 8.2] [Indexed: 01/12/2023]
Abstract
Annotating protein function across species is an important task that is often complicated by the presence of large paralogous gene families. Here, we report a novel strategy for identifying functionally related proteins that supplements sequence-based comparisons with information on conserved protein-protein interactions. First, the protein interaction networks of two species are aligned by assigning proteins to sequence homology clusters using the Inparanoid algorithm. Next, probabilistic inference is performed on the aligned networks to identify pairs of proteins, one from each species, that are likely to retain the same function based on conservation of their interacting partners. Applying this method to Drosophila melanogaster and Saccharomyces cerevisiae, we analyze 121 cases for which functional orthology assignment is ambiguous when sequence similarity is used alone. In 61 of these cases, the network supports a different protein pair than that favored by sequence comparisons. These results suggest that network analysis can be used to provide a key source of information for refining sequence-based homology searches.
Affiliation(s)
- Sourav Bandyopadhyay
- Program in Bioinformatics, University of California at San Diego, La Jolla, California 92093, USA
|
36
|
|
37
|
Bajic VB, Brent MR, Brown RH, Frankish A, Harrow J, Ohler U, Solovyev VV, Tan SL. Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment. Genome Biol 2006; 7 Suppl 1:S3.1-13. [PMID: 16925837 PMCID: PMC1810552 DOI: 10.1186/gb-2006-7-s1-s3] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.9] [Indexed: 11/10/2022]
Abstract
BACKGROUND This study analyzes the predictions of a number of promoter predictors on the ENCODE regions of the human genome as part of the ENCODE Genome Annotation Assessment Project (EGASP). The systems analyzed operate on various principles and we assessed the effectiveness of different conceptual strategies used to correlate produced promoter predictions with the manually annotated 5' gene ends. RESULTS The predictions were assessed relative to the manual HAVANA annotation of the 5' gene ends. These 5' gene ends were used as the estimated reference transcription start sites. With the maximum allowed distance for predictions of 1,000 nucleotides from the reference transcription start sites, the sensitivity of predictors was in the range 32% to 56%, while the positive predictive value was in the range 79% to 93%. The average distance mismatch of predictions from the reference transcription start sites was in the range 259 to 305 nucleotides. At the same time, using transcription start site estimates from DBTSS and H-Invitational databases as promoter predictions, we obtained a sensitivity of 58%, a positive predictive value of 92%, and an average distance from the annotated transcription start sites of 117 nucleotides. In this experiment, the best performing promoter predictors were those that combined promoter prediction with gene prediction. The main reason for this is the reduced promoter search space that resulted in smaller numbers of false positive predictions. CONCLUSION The main finding, now supported by comprehensive data, is that the accuracy of human promoter predictors for high-throughput annotation purposes can be significantly improved if promoter prediction is combined with gene prediction. Based on the lessons learned in this experiment, we propose a framework for the preparation of the next similar promoter prediction assessment.
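The distance-thresholded metrics reported in this abstract can be illustrated with a small sketch. It assumes a simplified setting (a single chromosome and strand, with positions as plain integers) and one plausible matching rule: a reference TSS counts as found if any prediction lies within `max_dist` nucleotides, and a prediction counts as correct if any reference TSS lies within `max_dist`. The function names and example coordinates are hypothetical, and the exact matching rules used in the EGASP assessment may differ.

```python
from bisect import bisect_left


def nearest_distance(sorted_positions, x):
    """Distance from x to the closest value in a non-empty sorted list."""
    i = bisect_left(sorted_positions, x)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_positions[max(i - 1, 0):i + 1]
    return min(abs(x - p) for p in candidates)


def evaluate_tss(predicted, reference, max_dist=1000):
    """Return (sensitivity, positive predictive value) for TSS predictions."""
    pred = sorted(predicted)
    ref = sorted(reference)
    if not pred or not ref:
        return 0.0, 0.0
    found = sum(1 for r in ref if nearest_distance(pred, r) <= max_dist)
    correct = sum(1 for p in pred if nearest_distance(ref, p) <= max_dist)
    return found / len(ref), correct / len(pred)


# Made-up coordinates for illustration only.
reference_tss = [1000, 5000, 20000, 50000]
predicted_tss = [1200, 5900, 30000]
sens, ppv = evaluate_tss(predicted_tss, reference_tss)
print(f"{sens:.2f} {ppv:.2f}")  # prints 0.50 0.67
```

Here two of the four reference sites have a prediction within 1,000 nt (sensitivity 0.50), and two of the three predictions fall within 1,000 nt of a reference site (positive predictive value 0.67).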
Affiliation(s)
- Vladimir B Bajic
- South African National Bioinformatics Institute, University of the Western Cape, Bellville 7535, South Africa.
| | | | | | | | | | | | | | | |
Collapse
|
38
|
Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 2006; 7 Suppl 1:S2.1-31. [PMID: 16925836 PMCID: PMC1810551 DOI: 10.1186/gb-2006-7-s1-s2] [Citation(s) in RCA: 198] [Impact Index Per Article: 11.0] [Indexed: 01/28/2023]
Abstract
BACKGROUND We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. RESULTS The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. CONCLUSION This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
Affiliation(s)
- Roderic Guigó
  - Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
  - Member of the EGASP Organizing Committee
- Paul Flicek
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Josep F Abril
  - Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Alexandre Reymond
  - Center for Integrative Genomics, University of Lausanne, Switzerland
- Julien Lagarde
  - Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- France Denoeud
  - Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Stylianos Antonarakis
  - University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
- Michael Ashburner
  - Department of Genetics, University of Cambridge, Cambridge CB3 2EH, UK
  - Member of the EGASP Advisory Board
- Vladimir B Bajic
  - South African National Bioinformatics Institute (SANBI), University of Western Cape, Bellville 7535, South Africa
  - Member of the EGASP Advisory Board
- Ewan Birney
  - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
  - Member of the EGASP Organizing Committee
- Robert Castelo
  - Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Eduardo Eyras
  - Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Catherine Ucla
  - University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
- Thomas R Gingeras
  - Affymetrix Inc., Santa Clara, California 95051, USA
  - Member of the EGASP Advisory Board
- Jennifer Harrow
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
  - Member of the EGASP Organizing Committee
- Tim Hubbard
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
  - Member of the EGASP Organizing Committee
- Suzanna E Lewis
  - Department of Molecular and Cellular Biology, University of California, Berkeley, California 94792, USA
  - Member of the EGASP Advisory Board
- Martin G Reese
  - Omicia Inc., Christie Ave., Emeryville, California 94608, USA
  - Member of the EGASP Advisory Board
39
Moult J. Rigorous performance evaluation in protein structure modelling and implications for computational biology. Philos Trans R Soc Lond B Biol Sci 2006; 361:453-8. [PMID: 16524833 PMCID: PMC1609338 DOI: 10.1098/rstb.2005.1810] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
In principle, given the amino acid sequence of a protein, it is possible to compute the corresponding three-dimensional structure. Methods for modelling structure based on this premise have been under development for more than 40 years. For the past decade, a series of community wide experiments (termed Critical Assessment of Structure Prediction (CASP)) have assessed the state of the art, providing a detailed picture of what has been achieved in the field, where we are making progress, and what major problems remain. The rigorous evaluation procedures of CASP have been accompanied by substantial progress. Lessons from this area of computational biology suggest a set of principles for increasing rigor in the field as a whole.
Affiliation(s)
- John Moult
  - Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD 20850, USA.
40
Li J, Riehle MM, Zhang Y, Xu J, Oduol F, Gomez SM, Eiglmeier K, Ueberheide BM, Shabanowitz J, Hunt DF, Ribeiro JMC, Vernick KD. Anopheles gambiae genome reannotation through synthesis of ab initio and comparative gene prediction algorithms. Genome Biol 2006; 7:R24. [PMID: 16569258 PMCID: PMC1557760 DOI: 10.1186/gb-2006-7-3-r24] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 10/19/2005] [Revised: 01/19/2006] [Accepted: 02/23/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Complete genome annotation is a necessary tool as Anopheles gambiae researchers probe the biology of this potent malaria vector. RESULTS We reannotate the A. gambiae genome by synthesizing comparative and ab initio sets of predicted coding sequences (CDSs) into a single set using an exon-gene-union algorithm followed by an open-reading-frame-selection algorithm. The reannotation predicts 20,970 CDSs supported by at least two lines of evidence, and it lowers the proportion of CDSs lacking start and/or stop codons to only approximately 4%. The reannotated CDS set includes a set of 4,681 novel CDSs not represented in the Ensembl annotation but with EST support, and another set of 4,031 Ensembl-supported genes that undergo major structural and, therefore, probably functional changes in the reannotated set. The quality and accuracy of the reannotation was assessed by comparison with end sequences from 20,249 full-length cDNA clones, and evaluation of mass spectrometry peptide hit rates from an A. gambiae shotgun proteomic dataset confirms that the reannotated CDSs offer a high quality protein database for proteomics. We provide a functional proteomics annotation, ReAnoXcel, obtained by analysis of the new CDSs through the AnoXcel pipeline, which allows functional comparisons of the CDS sets within the same bioinformatic platform. CDS data are available for download. CONCLUSION Comprehensive A. gambiae genome reannotation is achieved through a combination of comparative and ab initio gene prediction algorithms.
Affiliation(s)
- Jun Li
  - Center for Microbial and Plant Genomics, and Department of Microbiology, University of Minnesota, St Paul, MN 55108, USA
- Michelle M Riehle
  - Center for Microbial and Plant Genomics, and Department of Microbiology, University of Minnesota, St Paul, MN 55108, USA
- Yan Zhang
  - Center for Microbial and Plant Genomics, and Department of Microbiology, University of Minnesota, St Paul, MN 55108, USA
- Jiannong Xu
  - Center for Microbial and Plant Genomics, and Department of Microbiology, University of Minnesota, St Paul, MN 55108, USA
- Frederick Oduol
  - Center for Microbial and Plant Genomics, and Department of Microbiology, University of Minnesota, St Paul, MN 55108, USA
- Shawn M Gomez
  - Unité de Biochimie et Biologie Moléculaire des Insectes and CNRS FRE 2849, Institut Pasteur, 75724 Paris Cedex 15, France
- Karin Eiglmeier
  - Unité de Biochimie et Biologie Moléculaire des Insectes and CNRS FRE 2849, Institut Pasteur, 75724 Paris Cedex 15, France
- Beatrix M Ueberheide
  - Department of Chemistry, McCormick Rd, University of Virginia, Charlottesville, VA 22904, USA
- Jeffrey Shabanowitz
  - Department of Chemistry, McCormick Rd, University of Virginia, Charlottesville, VA 22904, USA
- Donald F Hunt
  - Department of Chemistry, McCormick Rd, University of Virginia, Charlottesville, VA 22904, USA
- José MC Ribeiro
  - Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, USA
- Kenneth D Vernick
  - Center for Microbial and Plant Genomics, and Department of Microbiology, University of Minnesota, St Paul, MN 55108, USA
41
Abstract
According to the most recent estimates, the number of human genes is possibly--but not certainly--between 20,000 and 25,000. To contribute strategies to reduce this uncertainty, several groups working on computational gene prediction met recently at the Wellcome Trust Sanger Institute with the goal of testing and comparing predictive methods of genome annotation.
Affiliation(s)
- Roderic Guigó
  - Municipal Institute of Medical Research and Center for Genomic Regulation, University Pompeu Fabra, C/ Dr. Aiguader 80, 08003 Barcelona, Catalonia, Spain.
42
O'Neill B. Prices for Ingenuity. PLoS Biol 2005; 3:e288. [PMID: 16089506 PMCID: PMC1187858 DOI: 10.1371/journal.pbio.0030288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Competitions and community exchanges are spurring scientific progress.
43
Yu L, Haverty PM, Mariani J, Wang Y, Shen HY, Schwarzschild MA, Weng Z, Chen JF. Genetic and pharmacological inactivation of adenosine A2A receptor reveals an Egr-2-mediated transcriptional regulatory network in the mouse striatum. Physiol Genomics 2005; 23:89-102. [PMID: 16046619 DOI: 10.1152/physiolgenomics.00068.2005] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
The adenosine A2A receptor (A2AR) is highly expressed in the striatum, where it modulates motor and emotional behaviors. We used both microarray and bioinformatics analyses to compare gene expression profiles by genetic and pharmacological inactivation of A2AR and inferred an A2AR-controlled transcription network in the mouse striatum. A comparison between vehicle (VEH)-treated A2AR knockout (KO) mice (A2AR KO-VEH) and wild-type (WT) mice (WT-VEH) revealed 36 upregulated genes that were partially mimicked by treatment with SCH-58261 (SCH; an A2AR antagonist) and 54 downregulated genes that were not mimicked by SCH treatment. We validated the A2AR as a specific drug target for SCH by comparing A2AR KO-SCH and A2AR KO-VEH groups. The unique downregulation effect of A2AR KO was confirmed by comparing A2AR KO-SCH with WT-SCH gene groups. The distinct striatal gene expression profiles induced by A2AR KO and SCH should provide clues to the molecular mechanisms underlying the different phenotypes observed after genetic and pharmacological inactivation of A2AR. Furthermore, bioinformatics analysis discovered that Egr-2 binding sites were statistically overrepresented in the proximal promoters of A2AR KO-affected genes relative to the unaffected genes. This finding was further substantiated by the demonstration that the Egr-2 mRNA level increased in the striatum of both A2AR KO and SCH-treated mice and that striatal Egr-2 binding activity in the promoters of two A2AR KO-affected genes was enhanced in A2AR KO mice as assayed by chromatin immunoprecipitation. Taken together, these results strongly support the existence of an Egr-2-directed transcriptional regulatory network controlled by striatal A2ARs.
Affiliation(s)
- Liqun Yu
  - Department of Neurology, Boston University School of Medicine, Boston, Massachusetts 02118, USA
44
Ren D, Nedialkov YA, Li F, Xu D, Reimers S, Finkelstein A, Burton ZF. Spacing requirements for simultaneous recognition of the adenovirus major late promoter TATAAAAG box and initiator element. Arch Biochem Biophys 2005; 435:347-62. [PMID: 15708378 DOI: 10.1016/j.abb.2004.12.028] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/28/2004] [Revised: 12/28/2004] [Indexed: 11/18/2022]
Abstract
The distance between the TATAAAAG box and initiator element of the strong adenovirus major late promoter was systematically altered to determine the optimal spacing for simultaneous recognition of both elements. We find that the TATAAAAG element is strongly dominant over the initiator for specification of the start site. The wild type spacing of 23 base pairs between TATAAAAG and +1A is optimal for promoter strength and selective recognition of the A-start. Initiation is constrained to a window spaced 19-26 base pairs downstream of (-31)-TATAAAAG-(-24), and A-starts are favored over alternate starts only when spaced between 21 and 25 base pairs downstream of TATAAAAG. We report an expanded TATAAAAG and initiator promoter consensus for vertebrates and plants. Plant promoters of this class are (A-T)-rich and have an A-rich (non-template strand) core promoter sequence element downstream of +1A.
Affiliation(s)
- Delin Ren
  - Department of Biochemistry and Molecular Biology, Michigan State University, E. Lansing, MI 48824-1319, USA
45
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Régnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005; 23:137-44. [PMID: 15637633 DOI: 10.1038/nbt1053] [Citation(s) in RCA: 691] [Impact Index Per Article: 36.4] [Indexed: 11/09/2022]
Abstract
The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.
Affiliation(s)
- Martin Tompa
  - Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle, Washington 98195-2350, USA.
46
Abstract
Background Molecular biology has accumulated substantial amounts of data concerning functions of genes and proteins. Information relating to functional descriptions is generally extracted manually from textual data and stored in biological databases to build up annotations for large collections of gene products. Those annotation databases are crucial for the interpretation of large scale analysis approaches using bioinformatics or experimental techniques. Due to the growing accumulation of functional descriptions in biomedical literature the need for text mining tools to facilitate the extraction of such annotations is urgent. In order to make text mining tools useable in real world scenarios, for instance to assist database curators during annotation of protein function, comparisons and evaluations of different approaches on full text articles are needed. Results The Critical Assessment for Information Extraction in Biology (BioCreAtIvE) contest consists of a community wide competition aiming to evaluate different strategies for text mining tools, as applied to biomedical literature. We report on task two which addressed the automatic extraction and assignment of Gene Ontology (GO) annotations of human proteins, using full text articles. The predictions of task 2 are based on triplets of protein – GO term – article passage. The annotation-relevant text passages were returned by the participants and evaluated by expert curators of the GO annotation (GOA) team at the European Bioinformatics Institute (EBI). Each participant could submit up to three results for each sub-task comprising task 2. In total more than 15,000 individual results were provided by the participants. In addition to the annotation itself, the curators evaluated whether the protein and the GO term were correctly predicted and traceable through the submitted text fragment.
Conclusion Concepts provided by GO are currently the most extended set of terms used for annotating gene products, thus they were explored to assess how effectively text mining tools are able to extract those annotations automatically. Although the obtained results are promising, they are still far from reaching the required performance demanded by real world applications. Among the principal difficulties encountered to address the proposed task, were the complex nature of the GO terms and protein names (the large range of variants which are used to express proteins and especially GO terms in free text), and the lack of a standard training set. A range of very different strategies were used to tackle this task. The dataset generated in line with the BioCreative challenge is publicly available and will allow new possibilities for training information extraction methods in the domain of molecular biology.
Affiliation(s)
- Christian Blaschke
  - Bioalma SL, Ronda de Poniente 4- 2nd floor, Tres Cantos, E-28760, Madrid, Spain
- Eduardo Andres Leon
  - Protein Design Group, National Center of Biotechnology, CNB-CSIC, Cantoblanco, E-28049 Madrid, Spain
- Martin Krallinger
  - Protein Design Group, National Center of Biotechnology, CNB-CSIC, Cantoblanco, E-28049 Madrid, Spain
- Alfonso Valencia
  - Protein Design Group, National Center of Biotechnology, CNB-CSIC, Cantoblanco, E-28049 Madrid, Spain
47
Szafranski K, Lehmann R, Parra G, Guigo R, Glöckner G. Gene organization features in A/T-rich organisms. J Mol Evol 2005; 60:90-8. [PMID: 15696371 DOI: 10.1007/s00239-004-0201-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 12/19/2003] [Accepted: 08/18/2004] [Indexed: 10/25/2022]
Abstract
Several species have genomes in which the four nucleotides are not equally represented (Glöckner 2000). Interestingly, shifts to very high A/T or G/C levels can occur in several distinct branches of the tree of life. The underlying reasons for these shifts therefore may be of different origin. Now entire chromosome sequences from two different A/T-rich genomes, Dictyostelium discoideum and Plasmodium falciparum, are available (Bowman et al. 1999; Gardner et al. 2002; Glöckner et al. 2002). This gives us the opportunity to investigate how a high A/T content may influence the signals that are the landmarks for gene specification. We found that, in contrast with most known metazoan and plant genomes, splice signals contain little information other than the canonical GT-AG dinucleotides. Intron lengths in A/T-rich organisms, on the other hand, are comparable to those of other lower eukaryotes. Intergenic regions show, dependent on the orientation of adjacent genes, a size pattern with a ratio of 1 (3'-3') to 2 (3'-5') to 3 (5'-5'). Overall, gene organization patterns seem not to be influenced by the A/T bias. Surprisingly, the slightly higher A/T content of the P. falciparum genome compared to that of D. discoideum (80.1 versus 77.4%) is not achieved by increased A/T richness in intergenic regions. Instead both the shift of the nucleotide usage in coding regions to A/T-rich codons and the longer intergenic regions make an equal contribution to the higher A/T content in this organism.
Affiliation(s)
- Karol Szafranski
  - Department of Genome Analysis, Institute for Molecular Biotechnology Jena, Beutenbergstr. 11, D-07745 Jena, Germany
48
Martin RE, Henry RI, Abbey JL, Clements JD, Kirk K. The 'permeome' of the malaria parasite: an overview of the membrane transport proteins of Plasmodium falciparum. Genome Biol 2005; 6:R26. [PMID: 15774027 PMCID: PMC1088945 DOI: 10.1186/gb-2005-6-3-r26] [Citation(s) in RCA: 129] [Impact Index Per Article: 6.8] [Received: 11/11/2004] [Revised: 12/31/2004] [Accepted: 01/28/2005] [Indexed: 11/24/2022]
Abstract
Bioinformatic and expression analyses attribute putative functions to transporters and channels encoded by the Plasmodium falciparum genome. The malaria parasite has substantially more membrane transport proteins than previously thought. Background The uptake of nutrients, expulsion of metabolic wastes and maintenance of ion homeostasis by the intraerythrocytic malaria parasite is mediated by membrane transport proteins. Proteins of this type are also implicated in the phenomenon of antimalarial drug resistance. However, the initial annotation of the genome of the human malaria parasite Plasmodium falciparum identified only a limited number of transporters, and no channels. In this study we have used a combination of bioinformatic approaches to identify and attribute putative functions to transporters and channels encoded by the malaria parasite, as well as comparing expression patterns for a subset of these. Results A computer program that searches a genome database on the basis of the hydropathy plots of the corresponding proteins was used to identify more than 100 transport proteins encoded by P. falciparum. These include all the transporters previously annotated as such, as well as a similar number of candidate transport proteins that had escaped detection. Detailed sequence analysis enabled the assignment of putative substrate specificities and/or transport mechanisms to all those putative transport proteins previously without. The newly-identified transport proteins include candidate transporters for a range of organic and inorganic nutrients (including sugars, amino acids, nucleosides and vitamins), and several putative ion channels. The stage-dependent expression of RNAs for 34 candidate transport proteins of particular interest is compared.
Conclusion The malaria parasite possesses substantially more membrane transport proteins than was originally thought, and the analyses presented here provide a range of novel insights into the physiology of this important human pathogen.
Affiliation(s)
- Rowena E Martin
  - School of Biochemistry and Molecular Biology, Faculty of Science, The Australian National University, Canberra, ACT 0200, Australia
- Roselani I Henry
  - School of Biochemistry and Molecular Biology, Faculty of Science, The Australian National University, Canberra, ACT 0200, Australia
- Janice L Abbey
  - School of Biochemistry and Molecular Biology, Faculty of Science, The Australian National University, Canberra, ACT 0200, Australia
- John D Clements
  - School of Biochemistry and Molecular Biology, Faculty of Science, The Australian National University, Canberra, ACT 0200, Australia
  - Division of Neuroscience, The John Curtin School of Medical Research, The Australian National University, Canberra, ACT 0200, Australia
- Kiaran Kirk
  - School of Biochemistry and Molecular Biology, Faculty of Science, The Australian National University, Canberra, ACT 0200, Australia
49
Abstract
Genome comparisons are behind the powerful new annotation methods being developed to find all human genes, as well as genes from other genomes. Genomes are now frequently being studied in pairs to provide cross-comparison datasets. This 'Noah's Ark' approach often reveals unsuspected genes and may support the deletion of false-positive predictions. Joining mouse and human as the cross-comparison dataset for the first two mammals are: two Drosophila species, D. melanogaster and D. pseudoobscura; two sea squirts, Ciona intestinalis and Ciona savignyi; four yeast (Saccharomyces) species; two nematodes, Caenorhabditis elegans and Caenorhabditis briggsae; and two pufferfish (Takifugu rubripes and Tetraodon nigroviridis). Even genomes like yeast and C. elegans, which have been known for more than five years, are now being significantly improved. Methods developed for yeast or nematodes will now be applied to mouse and human, and soon to additional mammals such as rat and dog, to identify all the mammalian protein-coding genes. Current large disparities between human Unigene predictions (127,835 genes) and gene-scanning methods (45,000 genes) still need to be resolved. This will be the challenge during the next few years.
Affiliation(s)
- David R Nelson
  - Department of Molecular Sciences and The UT Center of Excellence in Genomics and Bioinformatics, University of Tennessee, Memphis, Tennessee 38163, USA
- Daniel W Nebert
  - Department of Environmental Health and Center for Environmental Genetics (CEG), University of Cincinnati Medical Center, Cincinnati, Ohio 45267-0056, USA
50