1
|
Darbinian N, Gallia GL, Darbinyan A, Vadachkoria E, Merabova N, Moore A, Goetzl L, Amini S, Selzer ME. Effects of In Utero EtOH Exposure on 18S Ribosomal RNA Processing: Contribution to Fetal Alcohol Spectrum Disorder. Int J Mol Sci 2023; 24:13714. [PMID: 37762017 PMCID: PMC10531167 DOI: 10.3390/ijms241813714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 08/28/2023] [Accepted: 08/29/2023] [Indexed: 09/29/2023] Open
Abstract
Fetal alcohol spectrum disorders (FASD) are leading causes of neurodevelopmental disability. The mechanisms by which alcohol (EtOH) disrupts fetal brain development are incompletely understood, as are the genetic factors that modify individual vulnerability. Because the phenotype abnormalities of FASD are so varied and widespread, we investigated whether fetal exposure to EtOH disrupts ribosome biogenesis and the processing of pre-ribosomal RNAs and ribosome assembly, by determining the effect of exposure to EtOH on the developmental expression of 18S rRNA and its cleaved forms, members of a novel class of short non-coding RNAs (srRNAs). In vitro neuronal cultures and fetal brains (11-22 weeks) were collected according to an IRB-approved protocol. Twenty EtOH-exposed brains from the first and second trimester were compared with ten unexposed controls matched for gestational age and fetal gender. Twenty fetal-brain-derived exosomes (FB-Es) were isolated from matching maternal blood. RNA was isolated using Qiagen RNA isolation kits. Fetal brain srRNA expression was quantified by ddPCR. srRNAs were expressed in the human brain and FB-Es during fetal development. EtOH exposure slightly decreased srRNA expression (1.1-fold; p = 0.03). Addition of srRNAs to in vitro neuronal cultures inhibited EtOH-induced caspase-3 activation (1.6-fold, p = 0.002) and increased cell survival (4.7%, p = 0.034). The addition of exogenous srRNAs reversed the EtOH-mediated downregulation of srRNAs (2-fold, p = 0.002). EtOH exposure suppressed expression of srRNAs in the developing brain, increased activity of caspase-3, and inhibited neuronal survival. Exogenous srRNAs reversed this effect, possibly by stabilizing endogenous srRNAs, or by increasing the association of cellular proteins with srRNAs, modifying gene transcription. Finally, the reduction in 18S rRNA levels correlated closely with the reduction in fetal eye diameter, an anatomical hallmark of FASD. 
The findings suggest a potential mechanism for EtOH-mediated neurotoxicity via alterations in 18S rRNA processing and the use of FB-Es for early diagnosis of FASD. Ribosome biogenesis may be a novel target to ameliorate FASD in utero or after birth. These findings are consistent with observations that gene-environment interactions contribute to FASD vulnerability.
Affiliation(s)
- Nune Darbinian
- Center for Neural Repair and Rehabilitation Shriners Hospitals Pediatric Research Center, Lewis Katz School of Medicine, Temple University, Philadelphia, PA 19140, USA
- Gary L. Gallia
- Department of Neurosurgery, Johns Hopkins Hospital, Baltimore, MD 21287, USA
- Armine Darbinyan
- Department of Pathology, Yale University School of Medicine, New Haven, CT 06520, USA
- Ekaterina Vadachkoria
- Center for Neural Repair and Rehabilitation Shriners Hospitals Pediatric Research Center, Lewis Katz School of Medicine, Temple University, Philadelphia, PA 19140, USA
- Nana Merabova
- Center for Neural Repair and Rehabilitation Shriners Hospitals Pediatric Research Center, Lewis Katz School of Medicine, Temple University, Philadelphia, PA 19140, USA
- Medical College of Wisconsin-Prevea Health, Green Bay, WI 54304, USA
- Amos Moore
- Center for Neural Repair and Rehabilitation Shriners Hospitals Pediatric Research Center, Lewis Katz School of Medicine, Temple University, Philadelphia, PA 19140, USA
- Laura Goetzl
- Department of Obstetrics & Gynecology, University of Texas, Houston, TX 77030, USA
- Shohreh Amini
- Department of Biology, College of Science and Technology, Temple University, Philadelphia, PA 19122, USA
- Michael E. Selzer
- Center for Neural Repair and Rehabilitation Shriners Hospitals Pediatric Research Center, Lewis Katz School of Medicine, Temple University, Philadelphia, PA 19140, USA
- Departments of Neurology and Neural Sciences, Lewis Katz School of Medicine at Temple University, Philadelphia, PA 19140, USA
|
2
|
Abstract
Gene finding is the process of identifying genome sequence regions representing stretches of DNA that encode biologically active products, such as proteins or functional noncoding RNAs. As this is usually the first step in the analysis of any novel genomic sequence or resequenced sample of well-known organisms, it is a very important issue, as all downstream analyses depend on the results. This chapter describes the biological basis for gene finding, and the programs and computational approaches that are available for the automated identification of protein-coding genes. For bacterial, archaeal, and eukaryotic genomes, as well as for multi-species sequence data originating from environmental community studies, the state of the art in automated gene finding is described.
Affiliation(s)
- Alice Carolyn McHardy
- Department for Algorithmic Bioinformatics, Heinrich Heine University, Düsseldorf, Germany.
- Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany.
- Andreas Kloetgen
- Department for Algorithmic Bioinformatics, Heinrich Heine University, Düsseldorf, Germany
- Department of Pediatric Oncology, Hematology and Clinical Immunology, Heinrich Heine University, Düsseldorf, Germany
|
3
|
Coding sequence density estimation via topological pressure. J Math Biol 2014; 70:45-69. [PMID: 24448658 DOI: 10.1007/s00285-014-0754-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 04/09/2013] [Revised: 12/31/2013] [Indexed: 10/25/2022]
Abstract
We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the 'weighted information content' of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the 'coarse scale' problem of predicting CDS density. Inspired again by ergodic theory, the weightings of the nucleotide triplets obtained from our training procedure are used to define a probability distribution on finite sequences, which can be used to distinguish between intron and exon sequences from the human genome of lengths between 750 and 5,000 bp. At the end of the paper, we explain the theoretical underpinning for our approach, which is the theory of Thermodynamic Formalism from the dynamical systems literature. Mathematica and MATLAB implementations of our method are available at http://sourceforge.net/projects/topologicalpres/.
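The notion of a triplet-weighted information content can be illustrated with a short Python sketch. This is only a simplified finite-word analogue of the classical pressure formula from ergodic theory, assuming additive triplet weights and a fixed subword length n. The paper's exact definition, normalization and trained parameters are not reproduced here, and the function name `pressure_like_score` and the all-zero weights are hypothetical.

```python
import math
from itertools import product

# One weight per nucleotide triplet (64 in total). Zeros are a placeholder
# assumption; the paper trains these values against observed CDS density.
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]
ZERO_WEIGHTS = {t: 0.0 for t in TRIPLETS}

def pressure_like_score(word, weights, n):
    """Return (1/n) * log4 of the sum, over distinct length-n subwords of
    `word`, of exp(sum of the triplet weights read along the subword)."""
    if n < 3 or n > len(word):
        raise ValueError("n must satisfy 3 <= n <= len(word)")
    # Distinct subwords only, as in the classical definition over a language.
    subwords = {word[i:i + n] for i in range(len(word) - n + 1)}
    total = 0.0
    for u in subwords:
        weight_sum = sum(weights[u[k:k + 3]] for k in range(len(u) - 2))
        total += math.exp(weight_sum)
    return math.log(total, 4) / n

low = pressure_like_score("AAAAAAAA", ZERO_WEIGHTS, 3)   # one distinct subword
high = pressure_like_score("ACGTACGT", ZERO_WEIGHTS, 3)  # four distinct subwords
```

With all weights set to zero the score reduces to a count of distinct subwords on a logarithmic scale, so the repetitive word scores lower than the varied one; non-zero weights would shift the contribution of each triplet.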
|
4
|
ASPic-GeneID: a lightweight pipeline for gene prediction and alternative isoforms detection. BioMed Research International 2013; 2013:502827. [PMID: 24308000 PMCID: PMC3838850 DOI: 10.1155/2013/502827] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/16/2013] [Revised: 08/01/2013] [Accepted: 08/04/2013] [Indexed: 12/31/2022]
Abstract
New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take intron annotations as evidence. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.
|
5
|
|
6
|
Specht M, Stanke M, Terashima M, Naumann-Busch B, Janssen I, Höhner R, Hom EFY, Liang C, Hippler M. Concerted action of the new Genomic Peptide Finder and AUGUSTUS allows for automated proteogenomic annotation of the Chlamydomonas reinhardtii genome. Proteomics 2011; 11:1814-23. [PMID: 21432999 DOI: 10.1002/pmic.201000621] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 09/30/2010] [Revised: 01/31/2011] [Accepted: 02/11/2011] [Indexed: 12/24/2022]
Abstract
The use and development of post-genomic tools naturally depends on large-scale genome sequencing projects. The usefulness of post-genomic applications is dependent on the accuracy of genome annotations, for which the correct identification of intron-exon borders in complex genomes of eukaryotic organisms is often an error-prone task. Although automated algorithms for predicting intron-exon structures are available, supporting exon evidence is necessary to achieve comprehensive genome annotation. Besides cDNA and EST support, peptides identified via MS/MS can be used as extrinsic evidence in a proteogenomic approach. We describe an improved version of the Genomic Peptide Finder (GPF), which aligns de novo predicted amino acid sequences to the genomic DNA sequence of an organism while correcting for peptide sequencing errors and accounting for the possibility of splicing. We have coupled GPF and the gene finding program AUGUSTUS in a way that provides automatic structural annotations of the Chlamydomonas reinhardtii genome, using highly unbiased GPF evidence. A comparison of the AUGUSTUS gene set incorporating GPF evidence to the standard JGI FM4 (Filtered Models 4) gene set reveals 932 GPF peptides that are not contained in the Filtered Models 4 gene set. Furthermore, the GPF evidence improved the AUGUSTUS gene models by altering 65 gene models and adding three previously unidentified genes.
Affiliation(s)
- Michael Specht
- Institute of Plant Biology and Biotechnology, University of Münster, Münster, Germany
|
7
|
Romero-Zaliz R, Rubio-Escudero C, Zwir I, del Val C. Optimization of multi-classifiers for computational biology: application to gene finding and expression. Theor Chem Acc 2009. [DOI: 10.1007/s00214-009-0648-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/21/2022]
|
8
|
Quantitative measures for the management and comparison of annotated genomes. BMC Bioinformatics 2009; 10:67. [PMID: 19236712 PMCID: PMC2653490 DOI: 10.1186/1471-2105-10-67] [Citation(s) in RCA: 91] [Impact Index Per Article: 6.1] [Received: 10/20/2008] [Accepted: 02/23/2009] [Indexed: 11/22/2022] Open
Abstract
Background The ever-increasing number of sequenced and annotated genomes has made management of their annotations a significant undertaking, especially for large eukaryotic genomes containing many thousands of genes. Typically, changes in gene and transcript numbers are used to summarize changes from release to release, but these measures say nothing about changes to individual annotations, nor do they provide any means to identify annotations in need of manual review. Results In response, we have developed a suite of quantitative measures to better characterize changes to a genome's annotations between releases, and to prioritize problematic annotations for manual review. We have applied these measures to the annotations of five eukaryotic genomes over multiple releases – H. sapiens, M. musculus, D. melanogaster, A. gambiae, and C. elegans. Conclusion Our results provide the first detailed, historical overview of how these genomes' annotations have changed over the years, and demonstrate the usefulness of these measures for genome annotation management.
|
9
|
Jiang X, Lavenier D, Yau SST. Coding region prediction based on a universal DNA sequence representation method. J Comput Biol 2009; 15:1237-56. [PMID: 19040362 DOI: 10.1089/cmb.2008.0041] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022] Open
Abstract
Graphical representation of DNA sequences provides a simple and intuitive way of viewing, anchoring, and comparing various gene structures, so a simple and non-degenerate method is attractive to both biologists and computational biologists. In this study, a universal graphical representation method for DNA sequences based on S.S.-T. Yau's method is presented. The method adopts a trigonometric function to represent the four nucleotides A, G, C, and T. Some interesting characteristics of the universal representation are introduced. We exploit frequency analysis with our representation method on DNA sequences, demonstrating possible applications in coding region prediction, and sequence analysis. Based on the statistically experimental results from this frequency analysis, a simple coding region predictor and an optimized one are presented. An experiment on the broadly accepted ROSETTA data set demonstrates that the performance of the optimized predictor is comparable to that of other popular methods.
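Frequency analysis for coding-region prediction typically relies on the period-3 signal that protein-coding DNA tends to show. The Python sketch below measures that signal with a discrete Fourier transform at frequency 1/3. It is an illustration under stated assumptions: it uses plain 0/1 indicator sequences per nucleotide instead of the trigonometric representation described in the abstract, and the function name `period3_power` is hypothetical.

```python
import cmath

def period3_power(seq):
    """Sum, over A, C, G and T, of the squared DFT magnitude at frequency 1/3,
    computed on the 0/1 indicator sequence of each nucleotide."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # DFT coefficient at frequency 1/3 for the indicator of `base`.
        coeff = sum(cmath.exp(-2j * cmath.pi * n / 3)
                    for n, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total

periodic = period3_power("ATG" * 10)   # strict period-3 composition
flat = period3_power("A" * 30)         # no period-3 structure
```

A simple predictor along these lines would slide a window along the sequence and flag windows whose period-3 power exceeds a threshold; the window length and threshold would have to be chosen empirically.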
Affiliation(s)
- Xianyang Jiang
- Institute of Microelectronics and Information Technology, Wuhan University, Wuhan, China
|
10
|
Wilming L, Harrow J. Gene Annotation Methods. Bioinformatics 2009. [DOI: 10.1007/978-0-387-92738-1_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
|
11
|
Altman RB, Bergman CM, Blake J, Blaschke C, Cohen A, Gannon F, Grivell L, Hahn U, Hersh W, Hirschman L, Jensen LJ, Krallinger M, Mons B, O'Donoghue SI, Peitsch MC, Rebholz-Schuhmann D, Shatkay H, Valencia A. Text mining for biology--the way forward: opinions from leading scientists. Genome Biol 2008; 9 Suppl 2:S7. [PMID: 18834498 PMCID: PMC2559991 DOI: 10.1186/gb-2008-9-s2-s7] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.9] [Indexed: 11/23/2022] Open
Abstract
This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress.
Affiliation(s)
- Russ B Altman
- Stanford University, Stanford, California, 94305-5444, USA
|
12
|
Liu Q, Mackey AJ, Roos DS, Pereira FCN. Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction. Bioinformatics 2008; 24:597-605. [PMID: 18187439 DOI: 10.1093/bioinformatics/btn004] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022]
Abstract
MOTIVATION The increasing diversity and variable quality of evidence relevant to gene annotation argues for a probabilistic framework that automatically integrates such evidence to yield candidate gene models. RESULTS Evigan is an automated gene annotation program for eukaryotic genomes, employing probabilistic inference to integrate multiple sources of gene evidence. The probabilistic model is a dynamic Bayes network whose parameters are adjusted to maximize the probability of observed evidence. Consensus gene predictions are then derived by maximum likelihood decoding, yielding n-best models (with probabilities for each). Evigan is capable of accommodating a variety of evidence types, including (but not limited to) gene models computed by diverse gene finders, BLAST hits, EST matches, and splice site predictions; learned parameters encode the relative quality of evidence sources. Since separate training data are not required (apart from the training sets used by individual gene finders), Evigan is particularly attractive for newly sequenced genomes where little or no reliable manually curated annotation is available. The ability to produce a ranked list of alternative gene models may facilitate identification of alternatively spliced transcripts. Experimental application to ENCODE regions of the human genome, and the genomes of Plasmodium vivax and Arabidopsis thaliana show that Evigan achieves better performance than any of the individual data sources used as evidence. AVAILABILITY The source code is available at http://www.seas.upenn.edu/~strctlrn/evigan/evigan.html.
Affiliation(s)
- Qian Liu
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia PA 19104, USA.
|
13
|
Applying negative rule mining to improve genome annotation. BMC Bioinformatics 2007; 8:261. [PMID: 17659089 PMCID: PMC1940032 DOI: 10.1186/1471-2105-8-261] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 03/22/2007] [Accepted: 07/21/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items. RESULTS Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower. CONCLUSION Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection.
|
14
|
Affiliation(s)
- Dmitrij Frishman
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
|
15
|
Pighetti GM, Rambeaud M. Genome conservation between the bovine and human interleukin-8 receptor complex: improper annotation of bovine interleukin-8 receptor b identified. Vet Immunol Immunopathol 2006; 114:335-40. [PMID: 16982101 DOI: 10.1016/j.vetimm.2006.08.008] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 05/01/2006] [Revised: 07/28/2006] [Accepted: 08/14/2006] [Indexed: 11/29/2022]
Abstract
Interleukin (IL)-8 and its receptors, CXCR1 and CXCR2, are key regulators of inflammation. However, knowledge of these receptors at the genomic level is limited or absent in cattle. Therefore, our objective was to identify bovine orthologs of human CXCR1 and CXCR2. Alignment of bovine CXCR2 reference mRNA to the bovine genome revealed two regions of similarity on BTA2 approximately 20 kb apart and on opposite strands. Comparison with the human genome suggested the more centromeric region to be CXCR2 and the more telomeric region to be CXCR1, which contradicts the current annotation of the bovine CXCR2 reference mRNA. This observation was verified by sequencing RT-PCR products of specific regions within each predicted IL-8 receptor and comparing with human sequences using ClustalW. Further examination of coding and non-coding regions within the IL-8 receptor genome complex revealed that both bovine and canine CXCR1 and CXCR2 genes had more conserved sequences in common with the human genes than either mouse or rat, and may offer more suitable animal models for certain applications. This molecular information provides a stepping stone for greater understanding of the role each IL-8 receptor plays in inflammation and will enhance our ability to develop strategies against inflammatory-based diseases.
Affiliation(s)
- Gina M Pighetti
- Department of Animal Science, 114 McCord Hall, 2640 Morgan Circle, The University of Tennessee, Knoxville, TN 37996, USA.
|
16
|
Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol 2006; 7 Suppl 1:S12.1-14. [PMID: 16925834 PMCID: PMC1810549 DOI: 10.1186/gb-2006-7-s1-s12] [Citation(s) in RCA: 453] [Impact Index Per Article: 25.2] [Indexed: 11/28/2022] Open
Abstract
Background Regions covering one percent of the genome, selected by ENCODE for extensive analysis, were annotated by the HAVANA/Gencode group with high quality transcripts, thus defining a benchmark. The ENCODE Genome Annotation Assessment Project (EGASP) competition aimed at reproducing Gencode and finding new genes. The organizers evaluated the protein predictions in depth. We present a complementary analysis of the mRNAs, including alternative transcript variants. Results We evaluate 25 gene tracks from the University of California Santa Cruz (UCSC) genome browser. We either distinguish or collapse the alternative splice variants, and compare the genomic coordinates of exons, introns and nucleotides. Whole mRNA models, seen as chains of introns, are sorted to find the best matching pairs, and compared so that each mRNA is used only once. At the mRNA level, AceView is by far the closest to Gencode: the vast majority of transcripts of the two methods, including alternative variants, are identical. At the protein level, however, due to a lack of experimental data, our predictions differ: Gencode annotates proteins in only 41% of the mRNAs whereas AceView does so in virtually all. We describe the driving principles of AceView, and how, by performing hand-supervised automatic annotation, we solve the combinatorial splicing problem and summarize all of GenBank, dbEST and RefSeq into a genome-wide non-redundant but comprehensive cDNA-supported transcriptome. AceView accuracy is now validated by Gencode. Conclusion Relative to a consensus mRNA catalog constructed from all evidence-based annotations, Gencode and AceView have 81% and 84% sensitivity, and 74% and 73% specificity, respectively. This close agreement validates a richer view of the human transcriptome, with three to five times more transcripts than in UCSC Known Genes (sensitivity 28%), RefSeq (sensitivity 21%) or Ensembl (sensitivity 19%).
Affiliation(s)
- Danielle Thierry-Mieg
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.
|
17
|
Stanke M, Tzvetkova A, Morgenstern B. AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol 2006; 7 Suppl 1:S11.1-8. [PMID: 16925833 PMCID: PMC1810548 DOI: 10.1186/gb-2006-7-s1-s11] [Citation(s) in RCA: 204] [Impact Index Per Article: 11.3] [Indexed: 11/21/2022] Open
Abstract
Background A large number of gene prediction programs for the human genome exist. These annotation tools use a variety of methods and data sources. In the recent ENCODE genome annotation assessment project (EGASP), some of the most commonly used and recently developed gene-prediction programs were systematically evaluated and compared on test data from the human genome. AUGUSTUS was among the tools that were tested in this project. Results AUGUSTUS can be used as an ab initio program, that is, as a program that uses only one single genomic sequence as input information. In addition, it is able to combine information from the genomic sequence under study with external hints from various sources of information. For EGASP, we used genomic sequence alignments as well as alignments to expressed sequence tags (ESTs) and protein sequences as additional sources of information. Within the category of ab initio programs AUGUSTUS predicted significantly more genes correctly than any other ab initio program. At the same time it predicted the smallest number of false positive genes and the smallest number of false positive exons among all ab initio programs. The accuracy of AUGUSTUS could be further improved when additional extrinsic data, such as alignments to EST, protein and/or genomic sequences, was taken into account. Conclusion AUGUSTUS turned out to be the most accurate ab initio gene finder among the tested tools. Moreover it is very flexible because it can take information from several sources simultaneously into consideration.
Affiliation(s)
- Mario Stanke
- Institut für Mikrobiologie und Genetik, Universität Göttingen, Goldschmidtstrasse, 37077 Göttingen, Germany.
|
18
|
Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol 2006; 7 Suppl 1:S10.1-12. [PMID: 16925832 PMCID: PMC1810547 DOI: 10.1186/gb-2006-7-s1-s10] [Citation(s) in RCA: 491] [Impact Index Per Article: 27.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation. RESULTS The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base-pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software. CONCLUSION We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome.
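Nucleotide-level accuracy figures of this kind can be computed as in the following Python sketch. It assumes the convention common in the gene-prediction literature, where sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP), which is usually called precision elsewhere. The coordinates are made-up half-open intervals, and the helper names are hypothetical; this is not the evaluation code used in EGASP.

```python
def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}

def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level."""
    ann, pred = covered(annotated), covered(predicted)
    true_pos = len(ann & pred)                      # coding in both sets
    sensitivity = true_pos / len(ann) if ann else 0.0
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity

# Toy example: two annotated coding exons and two predicted ones.
annotated = [(100, 200), (300, 400)]
predicted = [(110, 200), (300, 410)]
sn, sp = nucleotide_accuracy(annotated, predicted)
```

Position sets keep the sketch short; for genome-scale data an interval-sweep or bit-vector approach would be more memory-efficient.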
Affiliation(s)
- Victor Solovyev
- Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK.
|
19
|
Wei C, Brent MR. Using ESTs to improve the accuracy of de novo gene prediction. BMC Bioinformatics 2006; 7:327. [PMID: 16817966 PMCID: PMC1534067 DOI: 10.1186/1471-2105-7-327] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Received: 03/28/2006] [Accepted: 07/03/2006] [Indexed: 12/01/2022] Open
Abstract
Background ESTs are a tremendous resource for determining the exon-intron structures of genes, but even extensive EST sequencing tends to leave many exons and genes untouched. Gene prediction systems based exclusively on EST alignments miss these exons and genes, leading to poor sensitivity. De novo gene prediction systems, which ignore ESTs in favor of genomic sequence, can predict such "untouched" exons, but they are less accurate when predicting exons to which ESTs align. TWINSCAN is the most accurate de novo gene finder available for nematodes and N-SCAN is the most accurate for mammals, as measured by exact CDS gene prediction and exact exon prediction. Results TWINSCAN_EST is a new system that successfully combines EST alignments with TWINSCAN. On the whole C. elegans genome TWINSCAN_EST shows 14% improvement in sensitivity and 13% in specificity in predicting exact gene structures compared to TWINSCAN without EST alignments. Not only are the structures revealed by EST alignments predicted correctly, but these also constrain the predictions without alignments, improving their accuracy. For the human genome, we used the same approach with N-SCAN, creating N-SCAN_EST. On the whole genome, N-SCAN_EST produced a 6% improvement in sensitivity and 1% in specificity of exact gene structure predictions compared to N-SCAN. Conclusion TWINSCAN_EST and N-SCAN_EST are more accurate than TWINSCAN and N-SCAN, while retaining their ability to discover novel genes to which no ESTs align. Thus, we recommend using the EST versions of these programs to annotate any genome for which EST information is available. TWINSCAN_EST and N-SCAN_EST are part of the TWINSCAN open source software package.
Affiliation(s)
- Chaochun Wei
- Laboratory for Computational Genomics and Department of Computer Science and Engineering, Washington University, One Brookings Drive, St. Louis, MO 63130, USA
- Michael R Brent
- Laboratory for Computational Genomics and Department of Computer Science and Engineering, Washington University, One Brookings Drive, St. Louis, MO 63130, USA
20
Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 2006; 34:W435-9. [PMID: 16845043 PMCID: PMC1538822 DOI: 10.1093/nar/gkl200] [Citation(s) in RCA: 1457] [Impact Index Per Article: 80.9] [Received: 02/14/2006] [Revised: 03/21/2006] [Accepted: 03/21/2006] [Indexed: 11/13/2022]
Abstract
AUGUSTUS is a software tool for gene prediction in eukaryotes based on a Generalized Hidden Markov Model, a probabilistic model of a sequence and its gene structure. Like most existing gene finders, the first version of AUGUSTUS returned one transcript per predicted gene and ignored the phenomenon of alternative splicing. Herein, we present a WWW server for an extended version of AUGUSTUS that is able to predict multiple splice variants. To our knowledge, this is the first ab initio gene finder that can predict multiple transcripts. In addition, we offer a motif searching facility, where user-defined regular expressions can be searched against putative proteins encoded by the predicted genes. The AUGUSTUS web interface and the downloadable open-source stand-alone program are freely available from http://augustus.gobics.de.
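The motif searching facility described above, in which user-defined regular expressions are searched against putative proteins, can be illustrated with a short sketch. The Python below uses the standard `re` module; it is not the AUGUSTUS server's implementation, and the function name `search_motif`, the transcript identifiers, and the protein sequences are hypothetical. The example pattern is the familiar N-glycosylation consensus (N, then any residue except P, then S or T, then any residue except P) written in Python regex syntax.

```python
import re
from typing import Dict, List, Tuple


def search_motif(pattern: str, proteins: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Scan each protein sequence for non-overlapping matches of a regular expression.

    Returns (protein id, 1-based start position, matched text) for every hit.
    """
    regex = re.compile(pattern)
    hits = []
    for name, sequence in proteins.items():
        for match in regex.finditer(sequence):
            hits.append((name, match.start() + 1, match.group()))
    return hits


# Hypothetical translations of two predicted splice variants of one gene.
putative_proteins = {
    "g1.t1": "MKNGSAPLLNPTQ",
    "g1.t2": "MKAGSAPLL",
}

# N-glycosylation consensus expressed as a Python regular expression.
n_glyc = r"N[^P][ST][^P]"

for protein_id, position, text in search_motif(n_glyc, putative_proteins):
    print(protein_id, position, text)
```

In this toy input only the first variant contains the motif: the asparagine at position 10 is followed by a proline and so does not match, and the second variant has no asparagine at all. Note that `finditer` reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead pattern.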
Affiliation(s)
- Mario Stanke
- Institut für Mikrobiologie und Genetik, Abteilung Bioinformatik, Goldschmidtstrasse 1, 37077 Göttingen, Germany.
21
Agrawal R, Stormo GD. Using mRNAs lengths to accurately predict the alternatively spliced gene products in Caenorhabditis elegans. Bioinformatics 2006; 22:1239-44. [PMID: 16595562 DOI: 10.1093/bioinformatics/btl076] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
Motivation: Computational gene prediction methods are an important component of whole genome analyses. While ab initio gene finders have demonstrated major improvements in accuracy, the most reliable methods are evidence-based gene predictors. These algorithms can rely on several different sources of evidence including predictions from multiple ab initio gene finders, matches to known proteins, sequence conservation and partial cDNAs to predict the final product. Despite the success of these algorithms, prediction of complete gene structures, especially for alternatively spliced products, remains a difficult task.
Results: LOCUS (Length Optimized Characterization of Unknown Spliceforms) is a new evidence-based gene finding algorithm which integrates a length-constraint into a dynamic programming-based framework for prediction of gene products. On a Caenorhabditis elegans test set of alternatively spliced internal exons, its performance exceeds that of current ab initio gene finders and in most cases can accurately predict the correct form of all the alternative products. As the length information used by the algorithm can be obtained in a high-throughput fashion, we propose that integration of such information into a gene-prediction pipeline is feasible and doing so may improve our ability to fully characterize the complete set of mRNAs for a genome.
Availability: LOCUS is available from http://ural.wustl.edu/software.html
Affiliation(s)
- Ritesh Agrawal
- Department of Genetics, Washington University School of Medicine 660 S. Euclid, Campus Box 8232, St. Louis, MO 63110, USA
22
Abstract
Driven by competition, automation, and technology, the genomics community has far exceeded its ambition to sequence the human genome by 2005. By analyzing mammalian genomes, we have shed light on the history of our DNA sequence, determined that alternatively spliced RNAs and retroposed pseudogenes are incredibly abundant, and glimpsed the apparently huge number of non-coding RNAs that play significant roles in gene regulation. Ultimately, genome science is likely to provide comprehensive catalogs of these elements. However, the methods we have been using for most of the last 10 years will not yield even one complete open reading frame (ORF) for every gene, the first plateau on the long climb toward a comprehensive catalog. These strategies (sequencing randomly selected cDNA clones, aligning protein sequences identified in other organisms, sequencing more genomes, and manual curation) will have to be supplemented by large-scale amplification and sequencing of specific predicted mRNAs. The steady improvements in gene prediction that have occurred over the last 10 years have increased the efficacy of this approach and decreased its cost. In this Perspective, I review the state of gene prediction roughly 10 years ago, summarize the progress that has been made since, argue that the primary ORF identification methods we have relied on so far are inadequate, and recommend a path toward completing the Catalog of Protein Coding Genes, Version 1.0.
Affiliation(s)
- Michael R Brent
- Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA.