1
Brůna T, Lomsadze A, Borodovsky M. GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. Genome Res 2024; 34:757-768. [PMID: 38866548 PMCID: PMC11216313 DOI: 10.1101/gr.278373.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 05/02/2024] [Indexed: 06/14/2024]
Abstract
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic-, and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data are sufficient for making gene predictions with "high confidence." The genes situated in the genomic space between the high-confidence genes are predicted in the next stage. The set of high-confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperforms gene finders using a single type of extrinsic evidence. Comparisons with gene finders MAKER2 and TSEBRA, those that use both transcript- and protein-derived extrinsic evidence, show that GeneMark-ETP delivers state-of-the-art gene-prediction accuracy, with the margin of outperforming existing approaches increasing in its application to larger and more complex eukaryotic genomes.
Affiliation(s)
- Tomáš Brůna
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
- Alexandre Lomsadze
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
- Mark Borodovsky
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
2
Bruna T, Lomsadze A, Borodovsky M. A new gene finding tool GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. bioRxiv: the preprint server for biology 2024:2023.01.13.524024. [PMID: 36711453 PMCID: PMC9882169 DOI: 10.1101/2023.01.13.524024] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Indexed: 01/17/2023]
Abstract
Large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic- and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for making gene predictions with 'high confidence'. The genes situated in the genomic space between the high confidence genes are predicted in the next stage. The set of high confidence genes serves as an initial training set for the statistical model. Further on, the model parameters are iteratively updated in the rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions and delivers the whole complement of predicted genes. GeneMark-ETP outperformed gene finders using a single type of extrinsic evidence. Comparisons with gene finders utilizing both transcript- and protein-derived extrinsic evidence, MAKER2, and TSEBRA, demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy with the margin of outperforming existing approaches increasing in its applications to larger and more complex eukaryotic genomes.
Affiliation(s)
- Tomas Bruna
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Alexandre Lomsadze
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Mark Borodovsky
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
3
Gabriel L, Hoff KJ, Brůna T, Borodovsky M, Stanke M. TSEBRA: transcript selector for BRAKER. BMC Bioinformatics 2021; 22:566. [PMID: 34823473 PMCID: PMC8620231 DOI: 10.1186/s12859-021-04482-0] [Citation(s) in RCA: 80] [Impact Index Per Article: 26.7] [Received: 06/04/2021] [Accepted: 11/15/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. Therefore, one or the other type of extrinsic evidence would remain unexploited. RESULTS: We present TSEBRA, a software tool that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. CONCLUSION: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.
Affiliation(s)
- Lars Gabriel
  - Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany
- Katharina J. Hoff
  - Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany
- Tomáš Brůna
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332 USA
- Mark Borodovsky
  - Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA
- Mario Stanke
  - Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489 Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489 Greifswald, Germany
4
Li F, Zhao X, Li M, He K, Huang C, Zhou Y, Li Z, Walters JR. Insect genomes: progress and challenges. Insect Molecular Biology 2019; 28:739-758. [PMID: 31120160 DOI: 10.1111/imb.12599] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 11/02/2017] [Revised: 03/22/2019] [Accepted: 05/14/2019] [Indexed: 05/24/2023]
Abstract
In the wake of constant improvements in sequencing technologies, numerous insect genomes have been sequenced. Currently, 1219 insect genome-sequencing projects have been registered with the National Center for Biotechnology Information, including 401 that have genome assemblies and 155 with an official gene set of annotated protein-coding genes. Comparative genomics analysis showed that the expansion or contraction of gene families was associated with well-studied physiological traits such as immune system, metabolic detoxification, parasitism and polyphagy in insects. Here, we summarize the progress of insect genome sequencing, with an emphasis on how this impacts research on pest control. We begin with a brief introduction to the basic concepts of genome assembly, annotation and metrics for evaluating the quality of draft assemblies. We then provide an overview of genome information for numerous insect species, highlighting examples from prominent model organisms, agricultural pests and disease vectors. We also introduce the major insect genome databases. The increasing availability of insect genomic resources is beneficial for developing alternative pest control methods. However, many opportunities remain for developing data-mining tools that make maximal use of the available insect genome resources. Although rapid progress has been achieved, many challenges remain in the field of insect genomics.
Affiliation(s)
- F Li
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- X Zhao
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- M Li
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- K He
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- C Huang
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- Y Zhou
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- Z Li
  - Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, Hangzhou, China
- J R Walters
  - Ecology and Evolutionary Biology, University of Kansas, Lawrence, KS, USA
5
GAAP: A Genome Assembly + Annotation Pipeline. BioMed Research International 2019; 2019:4767354. [PMID: 31346518 PMCID: PMC6617929 DOI: 10.1155/2019/4767354] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/24/2019] [Revised: 05/20/2019] [Accepted: 05/26/2019] [Indexed: 12/24/2022]
Abstract
Genomic analysis begins with de novo assembly of short-read fragments in order to reconstruct full-length base sequences without exploiting a reference genome sequence. Then, in the annotation step, gene locations are identified within the base sequences, and the structures and functions of these genes are determined. Recently, a wide range of powerful tools have been developed and published for whole-genome analysis, enabling even individual researchers in small laboratories to perform whole-genome analyses on their objects of interest. However, these analytical tools are generally complex and use diverse algorithms, parameter setting methods, and input formats; thus, it remains difficult for individual researchers to select, utilize, and combine these tools to obtain their final results. To resolve these issues, we have developed a genome analysis pipeline (GAAP) for semiautomated, iterative, and high-throughput analysis of whole-genome data. This pipeline is designed to perform read correction, de novo genome (transcriptome) assembly, gene prediction, and functional annotation using a range of proven tools and databases. We aim to assist non-IT researchers by describing each stage of analysis in detail and discussing current approaches. We also provide practical advice on how to access and use the bioinformatics tools and databases and how to implement the provided suggestions. Whole-genome analysis of Toxocara canis is used as case study to show intermediate results at each stage, demonstrating the practicality of the proposed method.
6
Making sense of genomes of parasitic worms: Tackling bioinformatic challenges. Biotechnol Adv 2016; 34:663-686. [DOI: 10.1016/j.biotechadv.2016.03.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 11/05/2015] [Revised: 02/25/2016] [Accepted: 03/01/2016] [Indexed: 01/25/2023]
7
Pan J, Hu X, Li P, Li H, He W, Zhang Y, Lin Y. Domain adaptation via Multi-Layer Transfer Learning. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2015.12.097] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 10/22/2022]
8
9
Campbell MS, Yandell M. An Introduction to Genome Annotation. Current Protocols in Bioinformatics 2015; 52:4.1.1-4.1.17. [PMID: 26678385 DOI: 10.1002/0471250953.bi0401s52] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Abstract
Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.
Affiliation(s)
- Michael S Campbell
  - Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah
- Mark Yandell
  - Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah
  - USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah
10
McMullan M, Gardiner A, Bailey K, Kemen E, Ward BJ, Cevik V, Robert-Seilaniantz A, Schultz-Larsen T, Balmuth A, Holub E, van Oosterhout C, Jones JDG. Evidence for suppression of immunity as a driver for genomic introgressions and host range expansion in races of Albugo candida, a generalist parasite. eLife 2015; 4. [PMID: 25723966 PMCID: PMC4384639 DOI: 10.7554/elife.04550] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 09/01/2014] [Accepted: 02/26/2015] [Indexed: 12/13/2022] Open
Abstract
How generalist parasites with wide host ranges can evolve is a central question in parasite evolution. Albugo candida is an obligate biotrophic parasite that consists of many physiological races that each specialize on distinct Brassicaceae host species. By analyzing genome sequence assemblies of five isolates, we show they represent three races that are genetically diverged by ∼1%. Despite this divergence, their genomes are mosaic-like, with ∼25% being introgressed from other races. Sequential infection experiments show that infection by adapted races enables subsequent infection of hosts by normally non-infecting races. This facilitates introgression and the exchange of effector repertoires, and may enable the evolution of novel races that can undergo clonal population expansion on new hosts. We discuss recent studies on hybridization in other eukaryotes such as yeast, Heliconius butterflies, Darwin's finches, sunflowers and cichlid fishes, and the implications of introgression for pathogen evolution in an agro-ecological environment.
Affiliation(s)
- Kate Bailey
  - The Sainsbury Laboratory, Norwich, United Kingdom
- Eric Kemen
  - Max Planck Research Group Fungal Biodiversity, Max Planck Institute for Plant Breeding Research, Cologne, Germany
- Ben J Ward
  - School of Environmental Sciences, University of East Anglia, Norwich, United Kingdom
- Volkan Cevik
  - The Sainsbury Laboratory, Norwich, United Kingdom
- Torsten Schultz-Larsen
  - Department of Plant and Environmental Sciences, University of Copenhagen, Copenhagen, Denmark
- Eric Holub
  - Warwick Crop Centre, University of Warwick, School of Life Sciences, Warwick, United Kingdom
- Cock van Oosterhout
  - School of Environmental Sciences, University of East Anglia, Norwich, United Kingdom
11
Sharma R, Mishra B, Runge F, Thines M. Gene loss rather than gene gain is associated with a host jump from monocots to dicots in the Smut Fungus Melanopsichium pennsylvanicum. Genome Biol Evol 2014; 6:2034-49. [PMID: 25062916 PMCID: PMC4159001 DOI: 10.1093/gbe/evu148] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Indexed: 01/08/2023] Open
Abstract
Smut fungi are well-suited to investigate the ecology and evolution of plant pathogens, as they are strictly biotrophic, yet cultivable on media. Here we report the genome sequence of Melanopsichium pennsylvanicum, closely related to Ustilago maydis and other Poaceae-infecting smuts, but parasitic to a dicot plant. To explore the evolutionary patterns resulting from host adaptation after this huge host jump, the genome of Me. pennsylvanicum was sequenced and compared with the genomes of U. maydis, Sporisorium reilianum, and U. hordei. Although all four genomes had a similar completeness in CEGMA (Core Eukaryotic Genes Mapping Approach) analysis, gene absence was highest in Me. pennsylvanicum, and most pronounced in putative secreted proteins, which are often considered as effector candidates. In contrast, the amount of private genes was similar among the species, highlighting that gene loss rather than gene gain is the hallmark of adaptation after the host jump to the dicot host. Our analyses revealed a trend of putative effectors to be next to another putative effector, but the majority of these are not in clusters and thus the focus on pathogenicity clusters might not be appropriate for all smut genomes. Positive selection studies revealed that Me. pennsylvanicum has the highest number and proportion of genes under positive selection. In general, putative effectors showed a higher proportion of positively selected genes than noneffector candidates. The 248 putative secreted effectors found in all four smut genomes might constitute a core set needed for pathogenicity, whereas those 92 that are found in all grass-parasitic smuts but have no ortholog in Me. pennsylvanicum might constitute a set of effectors important for successful colonization of grass hosts.
Affiliation(s)
- Rahul Sharma
  - Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
  - Institute of Ecology, Evolution and Diversity, Goethe University, Frankfurt am Main, Germany
  - Senckenberg Gesellschaft für Naturforschung, Frankfurt am Main, Germany
  - Cluster for Integrative Fungal Research (IPF), Frankfurt am Main, Germany
- Bagdevi Mishra
  - Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
  - Institute of Ecology, Evolution and Diversity, Goethe University, Frankfurt am Main, Germany
  - Senckenberg Gesellschaft für Naturforschung, Frankfurt am Main, Germany
- Fabian Runge
  - Institute of Botany 210, University of Hohenheim, Stuttgart, Germany
- Marco Thines
  - Biodiversity and Climate Research Centre (BiK-F), Frankfurt am Main, Germany
  - Institute of Ecology, Evolution and Diversity, Goethe University, Frankfurt am Main, Germany
  - Senckenberg Gesellschaft für Naturforschung, Frankfurt am Main, Germany
  - Cluster for Integrative Fungal Research (IPF), Frankfurt am Main, Germany
12
van der Burgt A, Severing E, Collemare J, de Wit PJGM. Automated alignment-based curation of gene models in filamentous fungi. BMC Bioinformatics 2014; 15:19. [PMID: 24433567 PMCID: PMC3898260 DOI: 10.1186/1471-2105-15-19] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/15/2013] [Accepted: 01/11/2014] [Indexed: 11/16/2022] Open
Abstract
Background Automated gene-calling is still an error-prone process, particularly for the highly plastic genomes of fungal species. Improvement through quality control and manual curation of gene models is a time-consuming process that requires skilled biologists and is only marginally performed. The wealth of available fungal genomes has not yet been exploited by an automated method that applies quality control of gene models in order to obtain more accurate genome annotations. Results We provide a novel method named alignment-based fungal gene prediction (ABFGP) that is particularly suitable for plastic genomes like those of fungi. It can assess gene models on a gene-by-gene basis making use of informant gene loci. Its performance was benchmarked on 6,965 gene models confirmed by full-length unigenes from ten different fungi. 79.4% of all gene models were correctly predicted by ABFGP. It improves the output of ab initio gene prediction software due to a higher sensitivity and precision for all gene model components. Applicability of the method was shown by revisiting the annotations of six different fungi, using gene loci from up to 29 fungal genomes as informants. Between 7,231 and 8,337 genes were assessed by ABFGP and for each genome between 1,724 and 3,505 gene model revisions were proposed. The reliability of the proposed gene models is assessed by an a posteriori introspection procedure of each intron and exon in the multiple gene model alignment. The total number and type of proposed gene model revisions in the six fungal genomes is correlated to the quality of the genome assembly, and to sequencing strategies used in the sequencing centre, highlighting different types of errors in different annotation pipelines. The ABFGP method is particularly successful in discovering sequence errors and/or disruptive mutations causing truncated and erroneous gene models. 
Conclusions The ABFGP method is an accurate and fully automated quality control method for fungal gene catalogues that can be easily implemented into existing annotation pipelines. With the exponential release of new genomes, the ABFGP method will help decreasing the number of gene models that require additional manual curation.
Affiliation(s)
- Pierre J G M de Wit
  - Laboratory of Phytopathology, Wageningen University & Research Centre, P.O. Box 16, 6700 AA Wageningen, The Netherlands.
13
Alamancos GP, Agirre E, Eyras E. Methods to study splicing from high-throughput RNA sequencing data. Methods Mol Biol 2014; 1126:357-97. [PMID: 24549677 DOI: 10.1007/978-1-62703-980-2_26] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Indexed: 12/02/2022]
Abstract
The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data, which could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods.
Affiliation(s)
- Gael P Alamancos
  - Computational Genomics, Universitat Pompeu Fabra, Barcelona, Spain
14
Goodswen SJ, Kennedy PJ, Ellis JT. Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLoS One 2012; 7:e50609. [PMID: 23226328 PMCID: PMC3511556 DOI: 10.1371/journal.pone.0050609] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 06/17/2012] [Accepted: 10/24/2012] [Indexed: 11/25/2022] Open
Abstract
Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have presently missed capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that predicted exons by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers.
Affiliation(s)
- Stephen J. Goodswen
  - School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
- Paul J. Kennedy
  - School of Software, Faculty of Engineering and Information Technology and the Centre for Quantum Computation and Intelligent Systems at the University of Technology Sydney (UTS), New South Wales, Australia
- John T. Ellis
  - School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
15
Bernal A, Crammer K, Pereira F. Automated gene-model curation using global discriminative learning. Bioinformatics 2012; 28:1571-8. [DOI: 10.1093/bioinformatics/bts176] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022] Open
16
17
Sato A, Oshima K, Noguchi H, Ogawa M, Takahashi T, Oguma T, Koyama Y, Itoh T, Hattori M, Hanya Y. Draft genome sequencing and comparative analysis of Aspergillus sojae NBRC4239. DNA Res 2011; 18:165-76. [PMID: 21659486 PMCID: PMC3111232 DOI: 10.1093/dnares/dsr009] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022] Open
Abstract
We conducted genome sequencing of the filamentous fungus Aspergillus sojae NBRC4239 isolated from the koji used to prepare Japanese soy sauce. We used the 454 pyrosequencing technology and investigated the genome with respect to enzymes and secondary metabolites in comparison with other sequenced Aspergilli. Assembly of 454 reads generated a non-redundant sequence of 39.5 Mb possessing 13 033 putative genes and 65 scaffolds composed of 557 contigs. Of the 2847 open reading frames with Pfam domain scores of >150 found in A. sojae NBRC4239, 81.7% had a high degree of similarity with the genes of A. oryzae. Comparative analysis identified serine carboxypeptidase and aspartic protease genes unique to A. sojae NBRC4239. While A. oryzae possessed three copies of the α-amylase gene, A. sojae NBRC4239 possessed only a single copy. Comparison of 56 gene clusters for secondary metabolites between A. sojae NBRC4239 and A. oryzae revealed that 24 clusters were conserved, whereas 32 clusters differed between them, including a deletion of 18 508 bp containing mfs1, mao1, dmaT, and pks-nrps for cyclopiazonic acid (CPA) biosynthesis, which explains the lack of CPA production in A. sojae. The A. sojae NBRC4239 genome data will be useful to characterize functional features of the koji moulds used in Japanese industries.
Affiliation(s)
- Atsushi Sato
  - Research and Development Division, Kikkoman Corporation, 399 Noda, Noda City, Chiba 278-0037, Japan
18
Kemen E, Gardiner A, Schultz-Larsen T, Kemen AC, Balmuth AL, Robert-Seilaniantz A, Bailey K, Holub E, Studholme DJ, MacLean D, Jones JDG. Gene gain and loss during evolution of obligate parasitism in the white rust pathogen of Arabidopsis thaliana. PLoS Biol 2011; 9:e1001094. [PMID: 21750662 PMCID: PMC3130010 DOI: 10.1371/journal.pbio.1001094] [Citation(s) in RCA: 213] [Impact Index Per Article: 16.4] [Received: 10/27/2010] [Accepted: 05/10/2011] [Indexed: 01/21/2023] Open
Abstract
Biotrophic eukaryotic plant pathogens require a living host for their growth and form an intimate haustorial interface with parasitized cells. Evolution to biotrophy occurred independently in fungal rusts and powdery mildews, and in oomycete white rusts and downy mildews. Biotroph evolution and molecular mechanisms of biotrophy are poorly understood. It has been proposed, but not shown, that obligate biotrophy results from (i) reduced selection for maintenance of biosynthetic pathways and (ii) gain of mechanisms to evade host recognition or suppress host defence. Here we use Illumina sequencing to define the genome, transcriptome, and gene models for the obligate biotroph oomycete and Arabidopsis parasite, Albugo laibachii. A. laibachii is a member of the Chromalveolata, which incorporates Heterokonts (containing the oomycetes), Apicomplexa (which includes human parasites like Plasmodium falciparum and Toxoplasma gondii), and four other taxa. From comparisons with other oomycete plant pathogens and other chromalveolates, we reveal independent loss of molybdenum-cofactor-requiring enzymes in downy mildews, white rusts, and the malaria parasite P. falciparum. Biotrophy also requires “effectors” to suppress host defence; we reveal RXLR and Crinkler effectors shared with other oomycetes, and also discover and verify a novel class of effectors, the “CHXCs”, by showing effector delivery and effector functionality. Our findings suggest that evolution to progressively more intimate association between host and parasite results in reduced selection for retention of certain biosynthetic pathways, and particularly reduced selection for retention of molybdopterin-requiring biosynthetic pathways. These mechanisms are not only relevant to plant pathogenic oomycetes but also to human pathogens within the Chromalveolata.
Plant pathogens that cannot grow except on their hosts are called obligate biotrophs. How such biotrophy evolves is poorly understood. In this study, we sequenced the genome of the obligate biotroph white rust pathogen (Albugo laibachii, Oomycota) of Arabidopsis. From comparisons with other oomycete plant pathogens, diatoms, and the human pathogen Plasmodium falciparum, we reveal a loss of important metabolic enzymes. We also reveal the appearance of defence-suppressing “effectors”, some carrying motifs known from other oomycete effectors, and discover and experimentally verify a novel class of effectors that share a CHXC motif within 50 amino acids of the signal peptide cleavage site. Obligate biotrophy involves an intimate association within host cells at the haustorial interface (where the parasite penetrates the host cell's cell wall), where nutrients are acquired from the host and effectors are delivered to the host. We found that A. laibachii, like Hyaloperonospora arabidopsidis and Plasmodium falciparum, lacks molybdopterin-requiring biosynthetic pathways, suggesting relaxed selection for retention of, or even selection against, this pathway. We propose that when defence suppression becomes sufficiently effective, hosts become such a reliable source of nutrients that a free-living phase can be lost. These mechanisms leading to obligate biotrophy and host specificity are relevant not only to plant pathogenic oomycetes but also to human pathogens.
Affiliation(s)
- Eric Kemen
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Anastasia Gardiner
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Ariane C. Kemen
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Alexi L. Balmuth
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- The GenePool, The University of Edinburgh, Edinburgh, United Kingdom
- Kate Bailey
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Eric Holub
- School of Life Sciences, University of Warwick, Wellesbourne Campus, United Kingdom
- Dan MacLean
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Jonathan D. G. Jones
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
19
Sorber K, Dimon MT, DeRisi JL. RNA-Seq analysis of splicing in Plasmodium falciparum uncovers new splice junctions, alternative splicing and splicing of antisense transcripts. Nucleic Acids Res 2011; 39:3820-35. [PMID: 21245033 PMCID: PMC3089446 DOI: 10.1093/nar/gkq1223] [Citation(s) in RCA: 98] [Impact Index Per Article: 7.5] [Indexed: 12/24/2022] Open
Abstract
Over 50% of genes in Plasmodium falciparum, the deadliest human malaria parasite, contain predicted introns, yet experimental characterization of splicing in this organism remains incomplete. We present here a transcriptome-wide characterization of intraerythrocytic splicing events, as captured by RNA-Seq data from four timepoints of a single highly synchronous culture. Gene model-independent analysis of these data in conjunction with publicly available RNA-Seq data with HMMSplicer, an in-house developed splice site detection algorithm, revealed a total of 977 new 5' GU-AG 3' and 5 new 5' GC-AG 3' junctions absent from gene models and ESTs (11% increase to the current annotation). In addition, 310 alternative splicing events were detected in 254 (4.5%) genes, most of which truncate open reading frames. Splicing events antisense to gene models were also detected, revealing complex transcriptional arrangements within the parasite's transcriptome. Interestingly, antisense introns overlap sense introns more than would be expected by chance, perhaps indicating a functional relationship between overlapping transcripts or an inherent organizational property of the transcriptome. Independent experimental validation confirmed over 30 new antisense and alternative junctions. Thus, this largest assemblage of new and alternative splicing events to date in Plasmodium falciparum provides a more precise, dynamic view of the parasite's transcriptome.
Affiliation(s)
- Katherine Sorber
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
20
Bahl A, Davis PH, Behnke M, Dzierszinski F, Jagalur M, Chen F, Shanmugam D, White MW, Kulp D, Roos DS. A novel multifunctional oligonucleotide microarray for Toxoplasma gondii. BMC Genomics 2010; 11:603. [PMID: 20974003 PMCID: PMC3017859 DOI: 10.1186/1471-2164-11-603] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.1] [Received: 03/19/2010] [Accepted: 10/25/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Microarrays are invaluable tools for genome interrogation, SNP detection, and expression analysis, among other applications. Such broad capabilities would be of value to many pathogen research communities, although the development and use of genome-scale microarrays is often a costly undertaking. Therefore, effective methods for reducing unnecessary probes while maintaining or expanding functionality would be relevant to many investigators. RESULTS: Taking advantage of available genome sequences and annotation for Toxoplasma gondii (a pathogenic parasite responsible for illness in immunocompromised individuals) and Plasmodium falciparum (a related parasite responsible for severe human malaria), we designed a single oligonucleotide microarray capable of supporting a wide range of applications at relatively low cost, including genome-wide expression profiling for Toxoplasma, and single-nucleotide polymorphism (SNP)-based genotyping of both T. gondii and P. falciparum. Expression profiling of the three clonotypic lineages dominating T. gondii populations in North America and Europe provides a first comprehensive view of the parasite transcriptome, revealing that ~49% of all annotated genes are expressed in parasite tachyzoites (the acutely lytic stage responsible for pathogenesis) and 26% of genes are differentially expressed among strains. A novel design utilizing few probes provided high confidence genotyping, used here to resolve recombination points in the clonal progeny of sexual crosses. Recent sequencing of additional T. gondii isolates identifies >620 K new SNPs, including ~11 K that intersect with expression profiling probes, yielding additional markers for genotyping studies, and further validating the utility of a combined expression profiling/genotyping array design. Additional applications facilitating SNP and transcript discovery, alternative statistical methods for quantifying gene expression, etc. are also pursued at pilot scale to inform future array designs. CONCLUSIONS: In addition to providing an initial global view of the T. gondii transcriptome across major lineages and permitting detailed resolution of recombination points in a historical sexual cross, the multifunctional nature of this array also allowed opportunities to exploit probes for purposes beyond their intended use, enhancing analyses. This array is in widespread use by the T. gondii research community, and several aspects of the design strategy are likely to be useful for other pathogens.
Affiliation(s)
- Amit Bahl
- Genomics and Computational Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
21
De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis. PLoS Genet 2010; 6:e1000891. [PMID: 20386741 PMCID: PMC2851567 DOI: 10.1371/journal.pgen.1000891] [Citation(s) in RCA: 140] [Impact Index Per Article: 10.0] [Received: 12/24/2009] [Accepted: 03/02/2010] [Indexed: 01/09/2023] Open
Abstract
Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. 
Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative studies to address basic questions of fungal biology.
22
Inskeep WP, Rusch DB, Jay ZJ, Herrgard MJ, Kozubal MA, Richardson TH, Macur RE, Hamamura N, Jennings RD, Fouke BW, Reysenbach AL, Roberto F, Young M, Schwartz A, Boyd ES, Badger JH, Mathur EJ, Ortmann AC, Bateson M, Geesey G, Frazier M. Metagenomes from high-temperature chemotrophic systems reveal geochemical controls on microbial community structure and function. PLoS One 2010; 5:e9773. [PMID: 20333304 PMCID: PMC2841643 DOI: 10.1371/journal.pone.0009773] [Citation(s) in RCA: 134] [Impact Index Per Article: 9.6] [Received: 09/13/2009] [Accepted: 02/25/2010] [Indexed: 01/07/2023] Open
Abstract
The Yellowstone caldera contains the most numerous and diverse geothermal systems on Earth, yielding an extensive array of unique high-temperature environments that host a variety of deeply-rooted and understudied Archaea, Bacteria and Eukarya. The combination of extreme temperature and chemical conditions encountered in geothermal environments often results in considerably less microbial diversity than other terrestrial habitats and offers a tremendous opportunity for studying the structure and function of indigenous microbial communities and for establishing linkages between putative metabolisms and element cycling. Metagenome sequence (14–15,000 Sanger reads per site) was obtained for five high-temperature (>65°C) chemotrophic microbial communities sampled from geothermal springs (or pools) in Yellowstone National Park (YNP) that exhibit a wide range in geochemistry including pH, dissolved sulfide, dissolved oxygen and ferrous iron. Metagenome data revealed significant differences in the predominant phyla associated with each of these geochemical environments. Novel members of the Sulfolobales are dominant in low pH environments, while other Crenarchaeota including distantly-related Thermoproteales and Desulfurococcales populations dominate in suboxic sulfidic sediments. Several novel archaeal groups are well represented in an acidic (pH 3) Fe-oxyhydroxide mat, where a higher O2 influx is accompanied with an increase in archaeal diversity. The presence or absence of genes and pathways important in S oxidation-reduction, H2-oxidation, and aerobic respiration (terminal oxidation) provide insight regarding the metabolic strategies of indigenous organisms present in geothermal systems. Multiple-pathway and protein-specific functional analysis of metagenome sequence data corroborated results from phylogenetic analyses and clearly demonstrate major differences in metabolic potential across sites. 
The distribution of functional genes involved in electron transport is consistent with the hypothesis that geochemical parameters (e.g., pH, sulfide, Fe, O2) control microbial community structure and function in YNP geothermal springs.
Affiliation(s)
- William P. Inskeep
- Thermal Biology Institute and Department of Land Resources and Environmental Sciences, Montana State University, Bozeman, Montana, United States of America
- * E-mail: (WPI); (DBR)
- Douglas B. Rusch
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- * E-mail: (WPI); (DBR)
- Zackary J. Jay
- Thermal Biology Institute and Department of Land Resources and Environmental Sciences, Montana State University, Bozeman, Montana, United States of America
- Mark A. Kozubal
- Thermal Biology Institute and Department of Land Resources and Environmental Sciences, Montana State University, Bozeman, Montana, United States of America
- Richard E. Macur
- Thermal Biology Institute and Department of Land Resources and Environmental Sciences, Montana State University, Bozeman, Montana, United States of America
- Natsuko Hamamura
- Center for Marine Environmental Studies, Ehime University, Matsuyama, Japan
- Ryan deM. Jennings
- Thermal Biology Institute and Department of Land Resources and Environmental Sciences, Montana State University, Bozeman, Montana, United States of America
- Bruce W. Fouke
- University of Illinois, Urbana, Illinois, United States of America
- Frank Roberto
- Idaho National Laboratory, Idaho Falls, Idaho, United States of America
- Mark Young
- Thermal Biology Institute and Department of Plant Sciences and Plant Pathology, Montana State University, Bozeman, Montana, United States of America
- Ariel Schwartz
- Synthetic Genomics Inc., La Jolla, California, United States of America
- Eric S. Boyd
- Thermal Biology Institute and Department of Microbiology, Montana State University, Bozeman, Montana, United States of America
- Jonathan H. Badger
- J. Craig Venter Institute, Rockville, Maryland, United States of America
- Eric J. Mathur
- Synthetic Genomics Inc., La Jolla, California, United States of America
- Alice C. Ortmann
- Department of Marine Science, University of South Alabama, Mobile, Alabama, United States of America
- Mary Bateson
- Thermal Biology Institute and Department of Plant Sciences and Plant Pathology, Montana State University, Bozeman, Montana, United States of America
- Gill Geesey
- Thermal Biology Institute and Department of Microbiology, Montana State University, Bozeman, Montana, United States of America
- Marvin Frazier
- J. Craig Venter Institute, Rockville, Maryland, United States of America
23
Madupu R, Brinkac LM, Harrow J, Wilming LG, Böhme U, Lamesch P, Hannick LI. Meeting report: a workshop on Best Practices in Genome Annotation. Database: The Journal of Biological Databases and Curation 2010; 2010:baq001. [PMID: 20428316 PMCID: PMC2860899 DOI: 10.1093/database/baq001] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 10/12/2009] [Revised: 01/08/2010] [Accepted: 01/11/2010] [Indexed: 01/28/2023]
Abstract
Efforts to annotate the genomes of a wide variety of model organisms are currently carried out by sequencing centers, model organism databases and academic/institutional laboratories around the world. Different annotation methods and tools have been developed over time to meet the needs of biologists faced with the task of annotating biological data. While standardized methods are essential for consistent curation within each annotation group, methods and tools can differ between groups, especially when the groups are curating different organisms. Biocurators from several institutes met at the Third International Biocuration Conference in Berlin, Germany, April 2009 and hosted the ‘Best Practices in Genome Annotation: Inference from Evidence’ workshop to share their strategies, pipelines, standards and tools. This article documents the material presented in the workshop.
Affiliation(s)
- Ramana Madupu
- Informatics, J. Craig Venter Institute, Rockville, MD 20850 USA, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK and The Arabidopsis Information Resource, Carnegie Institution of Washington, Stanford, CA 94305 USA
24
Romero-Zaliz R, Rubio-Escudero C, Zwir I, del Val C. Optimization of multi-classifiers for computational biology: application to gene finding and expression. Theor Chem Acc 2009. [DOI: 10.1007/s00214-009-0648-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/21/2022]
25
Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, Stein LD. nGASP--the nematode genome annotation assessment project. BMC Bioinformatics 2008; 9:549. [PMID: 19099578 PMCID: PMC2651883 DOI: 10.1186/1471-2105-9-549] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Received: 08/12/2008] [Accepted: 12/19/2008] [Indexed: 11/15/2022] Open
Abstract
Background: While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase. Results: The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders. Conclusion: This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
Affiliation(s)
- Avril Coghlan
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
26
Chen Z, Harb OS, Roos DS. In silico identification of specialized secretory-organelle proteins in apicomplexan parasites and in vivo validation in Toxoplasma gondii. PLoS One 2008; 3:e3611. [PMID: 18974850 PMCID: PMC2575384 DOI: 10.1371/journal.pone.0003611] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 06/06/2008] [Accepted: 10/06/2008] [Indexed: 12/04/2022] Open
Abstract
Apicomplexan parasites, including the human pathogens Toxoplasma gondii and Plasmodium falciparum, employ specialized secretory organelles (micronemes, rhoptries, dense granules) to invade and survive within host cells. Because molecules secreted from these organelles function at the host/parasite interface, their identification is important for understanding invasion mechanisms, and central to the development of therapeutic strategies. Using a computational approach based on predicted functional domains, we have identified more than 600 candidate secretory organelle proteins in twelve apicomplexan parasites. Expression in transgenic T. gondii of eight proteins identified in silico confirms that all enter into the secretory pathway, and seven target to apical organelles associated with invasion. An in silico approach intended to identify possible host interacting proteins yields a dataset enriched in secretory/transmembrane proteins, including most of the antigens known to be engaged by apicomplexan parasites during infection. These domain pattern and projected interactome approaches significantly expand the repertoire of proteins that may be involved in host parasite interactions.
Affiliation(s)
- ZhongQiang Chen
- Department of Biology, Penn Genomic Frontiers Institute, and the Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Omar S. Harb
- Department of Biology, Penn Genomic Frontiers Institute, and the Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- * E-mail: (DSR); (OSH)
- David S. Roos
- Department of Biology, Penn Genomic Frontiers Institute, and the Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- * E-mail: (DSR); (OSH)
27
Liu Q, Crammer K, Pereira FCN, Roos DS. Reranking candidate gene models with cross-species comparison for improved gene prediction. BMC Bioinformatics 2008; 9:433. [PMID: 18854050 PMCID: PMC2587481 DOI: 10.1186/1471-2105-9-433] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 06/18/2008] [Accepted: 10/14/2008] [Indexed: 11/10/2022] Open
Abstract
Background: Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features. Results: We have investigated the utility of a simple post-processing step for selecting among a set of alternative gene models, using global scoring rules to rerank competing models for more accurate prediction. For each gene locus, we first generate the K best candidate gene models using the gene finder Evigan, and then rerank these models using comparisons with putative orthologous genes from closely-related species. Candidate gene models with lower scores in the original gene finder may be selected if they exhibit strong similarity to probable orthologs in coding sequence, splice site location, or signal peptide occurrence. Experiments on Drosophila melanogaster demonstrate that reranking based on cross-species comparison outperforms the best gene models identified by Evigan alone, and also outperforms the comparative gene finders GeneWise and Augustus+. Conclusion: Reranking gene models with cross-species comparison improves gene prediction accuracy. This straightforward method can be readily adapted to incorporate additional lines of evidence, as it requires only a ranked source of candidate gene models.
Affiliation(s)
- Qian Liu
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA.