1
Zong S, Deng S, Chen K, Wu JQ. Identification of key factors regulating self-renewal and differentiation in EML hematopoietic precursor cells by RNA-sequencing analysis. J Vis Exp 2014:e52104. [PMID: 25407807 DOI: 10.3791/52104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/17/2022]
Abstract
Hematopoietic stem cells (HSCs) are used clinically for transplantation treatment to rebuild a patient's hematopoietic system in many diseases such as leukemia and lymphoma. Elucidating the mechanisms controlling HSC self-renewal and differentiation is important for the application of HSCs in research and clinical use. However, it is not possible to obtain large quantities of HSCs due to their inability to proliferate in vitro. To overcome this hurdle, we used a mouse bone marrow derived cell line, the EML (Erythroid, Myeloid, and Lymphocytic) cell line, as a model system for this study. RNA-sequencing (RNA-Seq) has been increasingly used to replace microarrays for gene expression studies. We report here a detailed method of using RNA-Seq technology to investigate the potential key factors in regulation of EML cell self-renewal and differentiation. The protocol provided in this paper is divided into three parts. The first part explains how to culture EML cells and separate Lin-CD34+ and Lin-CD34- cells. The second part of the protocol offers detailed procedures for total RNA preparation and the subsequent library construction for high-throughput sequencing. The last part describes the method for RNA-Seq data analysis and explains how to use the data to identify differentially expressed transcription factors between Lin-CD34+ and Lin-CD34- cells. The most significantly differentially expressed transcription factors were identified as the potential key regulators controlling EML cell self-renewal and differentiation. In the discussion section of this paper, we highlight the key steps for successful performance of this experiment. In summary, this paper offers a method of using RNA-Seq technology to identify potential regulators of self-renewal and differentiation in EML cells. The key factors identified are subjected to downstream functional analysis in vitro and in vivo.
Affiliation(s)
- Shan Zong
- The Vivian L. Smith Department of Neurosurgery, Center for Stem Cell and Regenerative Medicine, University of Texas Health Science Center, The University of Texas Graduate School of Biomedical Sciences at Houston
- Shuyun Deng
- The Vivian L. Smith Department of Neurosurgery, Center for Stem Cell and Regenerative Medicine, University of Texas Health Science Center, The University of Texas Graduate School of Biomedical Sciences at Houston
- Kenian Chen
- The Vivian L. Smith Department of Neurosurgery, Center for Stem Cell and Regenerative Medicine, University of Texas Health Science Center, The University of Texas Graduate School of Biomedical Sciences at Houston
- Jia Qian Wu
- The Vivian L. Smith Department of Neurosurgery, Center for Stem Cell and Regenerative Medicine, University of Texas Health Science Center, The University of Texas Graduate School of Biomedical Sciences at Houston
2
ASPic-GeneID: a lightweight pipeline for gene prediction and alternative isoforms detection. Biomed Res Int 2013; 2013:502827. [PMID: 24308000 PMCID: PMC3838850 DOI: 10.1155/2013/502827] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/16/2013] [Revised: 08/01/2013] [Accepted: 08/04/2013] [Indexed: 12/31/2022]
Abstract
New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take intron annotations as evidence. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.
3
Prasad TSK, Harsha HC, Keerthikumar S, Sekhar NR, Selvan LDN, Kumar P, Pinto SM, Muthusamy B, Subbannayya Y, Renuse S, Chaerkady R, Mathur PP, Ravikumar R, Pandey A. Proteogenomic Analysis of Candida glabrata using High Resolution Mass Spectrometry. J Proteome Res 2011; 11:247-60. [DOI: 10.1021/pr200827k] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Indexed: 01/23/2023]
Affiliation(s)
- T. S. Keshava Prasad
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Centre of Excellence in Bioinformatics, Bioinformatics Centre, School of Life Sciences, Pondicherry University, Puducherry 605 014, India
- Manipal University, Madhav Nagar, Manipal, Karnataka 576104, India
- Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- H. C. Harsha
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Nirujogi Raja Sekhar
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Centre of Excellence in Bioinformatics, Bioinformatics Centre, School of Life Sciences, Pondicherry University, Puducherry 605 014, India
- Lakshmi Dhevi N. Selvan
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Praveen Kumar
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Sneha M. Pinto
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Manipal University, Madhav Nagar, Manipal, Karnataka 576104, India
- Babylakshmi Muthusamy
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Centre of Excellence in Bioinformatics, Bioinformatics Centre, School of Life Sciences, Pondicherry University, Puducherry 605 014, India
- Yashwanth Subbannayya
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Rajiv Gandhi University of Health Sciences, Jayanagar, Bangalore 560 041, India
- Santosh Renuse
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Raghothama Chaerkady
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Premendu P. Mathur
- Centre of Excellence in Bioinformatics, Bioinformatics Centre, School of Life Sciences, Pondicherry University, Puducherry 605 014, India
- Raju Ravikumar
- Department of Neuromicrobiology, National Institute of Mental Health and Neuro Sciences, Bangalore 560 029, India
4
Wu JQ, Du J, Rozowsky J, Zhang Z, Urban AE, Euskirchen G, Weissman S, Gerstein M, Snyder M. Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome. Genome Biol 2008; 9:R3. [PMID: 18173853 PMCID: PMC2395237 DOI: 10.1186/gb-2008-9-1-r3] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.2] [Received: 11/07/2007] [Revised: 12/06/2007] [Accepted: 01/03/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Recent studies of the mammalian transcriptome have revealed a large number of additional transcribed regions and extraordinary complexity in transcript diversity. However, there is still much uncertainty regarding precisely what portion of the genome is transcribed, the exact structures of these novel transcripts, and the levels of the transcripts produced. RESULTS We have interrogated the transcribed loci in 420 selected ENCyclopedia Of DNA Elements (ENCODE) regions using rapid amplification of cDNA ends (RACE) sequencing. We analyzed annotated known gene regions, but primarily we focused on novel transcriptionally active regions (TARs), which were previously identified by high-density oligonucleotide tiling arrays and on random regions that were not believed to be transcribed. We found RACE sequencing to be very sensitive and were able to detect low levels of transcripts in specific cell types that were not detectable by microarrays. We also observed many instances of sense-antisense transcripts; further analysis suggests that many of the antisense transcripts (but not all) may be artifacts generated from the reverse transcription reaction. Our results show that the majority of the novel TARs analyzed (60%) are connected to other novel TARs or known exons. Of previously unannotated random regions, 17% were shown to produce overlapping transcripts. Furthermore, it is estimated that 9% of the novel transcripts encode proteins. CONCLUSION We conclude that RACE sequencing is an efficient, sensitive, and highly accurate method for characterization of the transcriptome of specific cell/tissue types. Using this method, it appears that much of the genome is represented in polyA+ RNA. Moreover, a fraction of the novel RNAs can encode protein and are likely to be functional.
Affiliation(s)
- Jia Qian Wu
- Molecular, Cellular and Developmental Biology Department, KBT918, Yale University, New Haven, Connecticut 06511, USA
5
Brent MR. Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 2008; 9:62-73. [DOI: 10.1038/nrg2220] [Citation(s) in RCA: 117] [Impact Index Per Article: 7.3] [Indexed: 11/09/2022]
6
Bioinformatic prediction and analysis of eukaryotic protein kinases in the rat genome. Gene 2007; 410:147-53. [PMID: 18201844 DOI: 10.1016/j.gene.2007.12.003] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 10/04/2007] [Revised: 12/03/2007] [Accepted: 12/04/2007] [Indexed: 01/29/2023]
Abstract
Eukaryotic protein kinases, which contain a conserved catalytic domain, represent one of the largest superfamilies of eukaryotic proteins and play distinct roles in cell signaling and disease. The near completion of the rat genome sequencing project enables the evaluation of a near-complete set of rat protein kinases. Publicly accessible genetic sequence databases were searched for rat protein kinases, and 515 eukaryotic protein kinases, 40 atypical protein kinases and 45 kinase pseudogenes were identified. The rat has 509 putative protein kinases orthologous to human kinases. Unlike microtubule affinity-regulating kinases, the rat has a few more kinases, in addition to the orthologous pairs of mouse kinases. The comparison of 11 different eukaryotic species revealed the evolutionary conservation of this diverse family of proteins. Evolutionary rate studies of human disease-associated and non-disease-associated kinases suggested that relatively uniform selective pressures have been applied to these kinase classes. This bioinformatic study of the rat protein kinases provides a suitable framework for further characterization of the functional and structural properties of these protein kinases.
7
Siepel A, Diekhans M, Brejová B, Langton L, Stevens M, Comstock CLG, Davis C, Ewing B, Oommen S, Lau C, Yu HC, Li J, Roe BA, Green P, Gerhard DS, Temple G, Haussler D, Brent MR. Targeted discovery of novel human exons by comparative genomics. Genome Res 2007; 17:1763-73. [PMID: 17989246 PMCID: PMC2099585 DOI: 10.1101/gr.7128207] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.3] [Received: 09/10/2007] [Accepted: 10/15/2007] [Indexed: 01/20/2023]
Abstract
A complete and accurate set of human protein-coding gene annotations is perhaps the single most important resource for genomic research after the human-genome sequence itself, yet the major gene catalogs remain incomplete and imperfect. Here we describe a genome-wide effort, carried out as part of the Mammalian Gene Collection (MGC) project, to identify human genes not yet in the gene catalogs. Our approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT-PCR. We have identified 734 novel gene fragments (NGFs) containing 2188 exons with, at most, weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. However, they suggest that hundreds, not thousands, of protein-coding genes are completely missing from the current gene catalogs.
Affiliation(s)
- Adam Siepel
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA.
8
Buza TJ, McCarthy FM, Burgess SC. Experimental-confirmation and functional-annotation of predicted proteins in the chicken genome. BMC Genomics 2007; 8:425. [PMID: 18021451 PMCID: PMC2204016 DOI: 10.1186/1471-2164-8-425] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 08/07/2007] [Accepted: 11/19/2007] [Indexed: 11/11/2022]
Abstract
Background The chicken genome was sequenced because of its phylogenetic position as a non-mammalian vertebrate, its use as a biomedical model especially to study embryology and development, its role as a source of human disease organisms and its importance as the major source of animal derived food protein. However, genomic sequence data is, in itself, of limited value; generally it is not equivalent to understanding biological function. The benefit of having a genome sequence is that it provides a basis for functional genomics. However, the sequence data currently available is poorly structurally and functionally annotated and many genes do not have standard nomenclature assigned. Results We analysed eight chicken tissues and improved the chicken genome structural annotation by providing experimental support for the in vivo expression of 7,809 computationally predicted proteins, including 30 chicken proteins that were only electronically predicted or hypothetical translations in human. To improve functional annotation (based on Gene Ontology), we mapped these identified proteins to their human and mouse orthologs and used this orthology to transfer Gene Ontology (GO) functional annotations to the chicken proteins. The 8,213 orthology-based GO annotations that we produced represent an 8% increase in currently available chicken GO annotations. Orthologous chicken products were also assigned standardized nomenclature based on current chicken nomenclature guidelines. Conclusion We demonstrate the utility of high-throughput expression proteomics for rapid experimental structural annotation of a newly sequenced eukaryote genome. These experimentally-supported predicted proteins were further annotated by assigning the proteins with standardized nomenclature and functional annotation. This method is widely applicable to a diverse range of species. 
Moreover, information from one genome can be used to improve the annotation of other genomes and inform gene prediction algorithms.
Affiliation(s)
- Teresia J Buza
- Department of Basic Sciences, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762, USA.
9
Abstract
The Drosophila species comparative genome database DroSpeGe has provided genome researchers with rapid, usable access to 12 new and old Drosophila genomes since its inception in 2004. Scientists can use, with minimal computing expertise, the wealth of new genome information for developing new insights into insect evolution. New genome assemblies provided by several sequencing centers have been annotated with known model organism gene homologies and gene predictions to provide basic comparative data. TeraGrid supplies the shared cyberinfrastructure for the primary computations. This genome database includes homologies to Drosophila melanogaster and eight other eukaryote model genomes, and gene predictions from several groups. BLAST searches of the newest assemblies are integrated with genome maps. GBrowse maps provide detailed views of cross-species aligned genomes. BioMart provides for data mining of annotations and sequences. Common chromosome maps identify major synteny among species. Potential gain and loss of genes is suggested by Gene Ontology groupings for genes of the new species. Summaries of essential genome statistics include sizes, genes found and predicted, homology among genomes, phylogenetic trees of species and comparisons of several gene predictions for sensitivity and specificity in finding new and known genes.
Affiliation(s)
- Donald G Gilbert
- Department of Biology, Indiana University, Bloomington, IN 47405, USA.
10
Keibler E, Arumugam M, Brent MR. The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized and pair HMMs. Bioinformatics 2007; 23:545-54. [PMID: 17237054 DOI: 10.1093/bioinformatics/btl659] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION Hidden Markov models (HMMs) and generalized HMMs have been successfully applied to many problems, but the standard Viterbi algorithm for computing the most probable interpretation of an input sequence (known as decoding) requires memory proportional to the length of the sequence, which can be prohibitive. Existing approaches to reducing memory usage either sacrifice optimality or trade increased running time for reduced memory. RESULTS We developed two novel decoding algorithms, Treeterbi and Parallel Treeterbi, and implemented them in the TWINSCAN/N-SCAN gene-prediction system. The worst case asymptotic space and time are the same as for standard Viterbi, but in practice, Treeterbi optimally decodes arbitrarily long sequences with generalized HMMs in bounded memory without increasing running time. Parallel Treeterbi uses the same ideas to split optimal decoding across processors, dividing latency to completion by approximately the number of available processors with constant average overhead per processor. Using these algorithms, we were able to optimally decode all human chromosomes with N-SCAN, which increased its accuracy relative to heuristic solutions. We also implemented Treeterbi for Pairagon, our pair HMM based cDNA-to-genome aligner. AVAILABILITY The TWINSCAN/N-SCAN/PAIRAGON open source software package is available from http://genes.cse.wustl.edu.
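For context, the memory cost mentioned in this abstract can be seen in a textbook Viterbi decoder. The sketch below is the standard algorithm for an ordinary HMM, not Treeterbi and not the TWINSCAN/N-SCAN implementation; the function name, the dictionary-based model layout, and the toy probabilities are illustrative assumptions.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (textbook Viterbi)."""
    # best[s]: probability of the best path that ends in state s at the current step.
    best = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # back holds one dict of predecessor pointers per later observation,
    # so it grows linearly with len(obs).
    back = []
    for o in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p, arg = max((prev[r] * trans_p[r][s], r) for r in states)
            best[s] = p * emit_p[s][o]
            pointers[s] = arg
        back.append(pointers)
    # Trace the pointers back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy two-state model (hypothetical numbers, for illustration only).
STATES = ["A", "B"]
START = {"A": 0.6, "B": 0.4}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"x": 0.5, "y": 0.4, "z": 0.1}, "B": {"x": 0.1, "y": 0.3, "z": 0.6}}

print(viterbi(["x", "y", "z"], STATES, START, TRANS, EMIT))  # ['A', 'A', 'B']
```

The `back` list gains one entry per observation, which is the sequence-length-proportional memory that the abstract describes as prohibitive for very long inputs.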
Affiliation(s)
- Evan Keibler
- Laboratory for Computational Genomics, Campus Box 1045, Washington University, St. Louis, MO 63130, USA
11
Experimental validation of novel genes predicted in the un-annotated regions of the Arabidopsis genome. BMC Genomics 2007; 8:18. [PMID: 17229318 PMCID: PMC1783852 DOI: 10.1186/1471-2164-8-18] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 09/15/2006] [Accepted: 01/17/2007] [Indexed: 11/10/2022]
Abstract
Background Several lines of evidence support the existence of novel genes and other transcribed units which have not yet been annotated in the Arabidopsis genome. Two gene prediction programs which make use of comparative genomic analysis, Twinscan and EuGene, have recently been deployed on the Arabidopsis genome. The ability of these programs to make use of sequence data from other species has allowed both Twinscan and EuGene to predict over 1000 genes that are intergenic with respect to the most recent annotation release. A high throughput RACE pipeline was utilized in an attempt to verify the structure and expression of these novel genes. Results 1,071 un-annotated loci were targeted by RACE, and full length sequence coverage was obtained for 35% of the targeted genes. We have verified the structure and expression of 378 genes that were not present within the most recent release of the Arabidopsis genome annotation. These 378 genes represent a structurally diverse set of transcripts and encode a functionally diverse set of proteins. Conclusion We have investigated the accuracy of the Twinscan and EuGene gene prediction programs and found them to be reliable predictors of gene structure in Arabidopsis. Several hundred previously un-annotated genes were validated by this work. Based upon this information derived from these efforts it is likely that the Arabidopsis genome annotation continues to overlook several hundred protein coding genes.
12
Tenney AE, Wu JQ, Langton L, Klueh P, Quatrano R, Brent MR. A tale of two templates: automatically resolving double traces has many applications, including efficient PCR-based elucidation of alternative splices. Genome Res 2007; 17:212-8. [PMID: 17210930 PMCID: PMC1781353 DOI: 10.1101/gr.5661407] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 01/27/2023]
Abstract
Trace Recalling is a novel method for deconvoluting double traces that result from simultaneously sequencing two DNA templates. Trace Recalling identifies up to two bases at each position of such a trace. The resulting ambiguity sequence is aligned to the genome, identifying one template sequence. A second template sequence is then inferred from this alignment. This technique makes possible many exciting biological applications. Here we present two such applications, alternate splice finding and elucidation of multiple insertion sites in a random insertional mutagenesis library. Our results demonstrate that RT-PCR followed by Trace Recalling is a more efficient and cost effective way to find alternate splices than traditional methods. We also present a method for mapping double-insertion events in a random insertional-mutagenesis library.
Affiliation(s)
- Aaron E. Tenney
- Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA
- Jia Qian Wu
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06620-8103, USA
- Laura Langton
- Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA
- Paul Klueh
- Department of Biology, Washington University, St. Louis, Missouri 63130, USA
- Ralph Quatrano
- Department of Biology, Washington University, St. Louis, Missouri 63130, USA
- Michael R. Brent
- Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA
- Corresponding author. Fax (314) 935-7302
13
Flicek P, Brent MR. Using several pair-wise informant sequences for de novo prediction of alternatively spliced transcripts. Genome Biol 2006; 7 Suppl 1:S8.1-9. [PMID: 16925842 PMCID: PMC1810557 DOI: 10.1186/gb-2006-7-s1-s8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 01/21/2023]
Abstract
Background As part of the ENCODE Genome Annotation Assessment Project (EGASP), we developed the MARS extension to the Twinscan algorithm. MARS is designed to find human alternatively spliced transcripts that are conserved in only one or a limited number of extant species. MARS is able to use an arbitrary number of informant sequences and predicts a number of alternative transcripts at each gene locus. Results MARS uses the mouse, rat, dog, opossum, chicken, and frog genome sequences as pairwise informant sources for Twinscan and combines the resulting transcript predictions into genes based on coding (CDS) region overlap. Based on the EGASP assessment, MARS is one of the more accurate dual-genome prediction programs. Compared to the GENCODE annotation, we find that predictive sensitivity increases, while specificity decreases, as more informant species are used. MARS correctly predicts alternatively spliced transcripts for 11 of the 236 multi-exon GENCODE genes that are alternatively spliced in the coding region of their transcripts. For these genes a total of 24 correct transcripts are predicted. Conclusion The MARS algorithm is able to predict alternatively spliced transcripts without the use of expressed sequence information, although the number of loci in which multiple predicted transcripts match multiple alternatively spliced transcripts in the GENCODE annotation is relatively small.
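The step of combining transcript predictions into genes by coding-region overlap can be illustrated with a simple interval-grouping sketch. This is not the MARS implementation: it assumes each transcript's CDS is reduced to a single inclusive span on one strand, and the function name and tuple layout are hypothetical.

```python
def group_by_cds_overlap(transcripts):
    """Group transcripts whose CDS spans overlap into genes.

    transcripts: list of (name, cds_start, cds_end), inclusive coordinates.
    Returns a list of genes, each a list of transcript names.
    """
    genes = []
    current, current_end = [], None
    # Sweep left to right; a transcript joins the open gene if it starts
    # before the furthest CDS end seen so far in that gene.
    for name, start, end in sorted(transcripts, key=lambda t: (t[1], t[2])):
        if current and start <= current_end:
            current.append(name)
            current_end = max(current_end, end)
        else:
            if current:
                genes.append(current)
            current, current_end = [name], end
    if current:
        genes.append(current)
    return genes


predictions = [("t1", 100, 200), ("t2", 150, 300), ("t3", 400, 500)]
print(group_by_cds_overlap(predictions))  # [['t1', 't2'], ['t3']]
```

Grouping is transitive here: two transcripts that do not overlap each other still land in the same gene if a third transcript bridges them.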
Affiliation(s)
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
14
Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 2006; 7 Suppl 1:S2.1-31. [PMID: 16925836 PMCID: PMC1810551 DOI: 10.1186/gb-2006-7-s1-s2] [Citation(s) in RCA: 198] [Impact Index Per Article: 11.0] [Indexed: 01/28/2023]
Abstract
BACKGROUND We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. RESULTS The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. CONCLUSION This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
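As an illustration of nucleotide-level accuracy, the sketch below computes sensitivity and specificity from predicted and annotated coding intervals. The abstract does not give the formulas; the sketch assumes the convention common in gene-prediction evaluation, where specificity is TP / (TP + FP) (called precision in other fields). The function names and interval format are hypothetical.

```python
def positions(intervals):
    """Expand inclusive (start, end) intervals into a set of nucleotide positions."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def nucleotide_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for two sets of coding positions.

    Assumed convention: sensitivity = TP / (TP + FN), specificity = TP / (TP + FP).
    """
    tp = len(predicted & annotated)   # coding in both prediction and annotation
    fp = len(predicted - annotated)   # predicted coding, not annotated
    fn = len(annotated - predicted)   # annotated coding, missed by prediction
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


annotated = positions([(1, 10)])
predicted = positions([(1, 5)])
print(nucleotide_accuracy(predicted, annotated))  # (0.5, 1.0)
```

In this toy case the prediction covers half of the annotated coding bases and nothing outside them, so sensitivity is 0.5 and specificity is 1.0.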
Affiliation(s)
- Roderic Guigó
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Member of the EGASP Organizing Committee
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Josep F Abril
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, Switzerland
- Julien Lagarde
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- France Denoeud
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Stylianos Antonarakis
- University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
- Michael Ashburner
- Department of Genetics, University of Cambridge, Cambridge CB3 2EH, UK
- Member of the EGASP Advisory Board
- Vladimir B Bajic
- South African National Bioinformatics Institute (SANBI), University of Western Cape, Bellville 7535, South Africa
- Member of the EGASP Advisory Board
- Ewan Birney
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Member of the EGASP Organizing Committee
- Robert Castelo
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Eduardo Eyras
- Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain
- Catherine Ucla
- University of Geneva Medical School and University Hospitals of Geneva, 1211 Geneva, Switzerland
- Thomas R Gingeras
- Affymetrix Inc., Santa Clara, California 95051, USA
- Member of the EGASP Advisory Board
- Jennifer Harrow
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Member of the EGASP Organizing Committee
- Tim Hubbard
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Member of the EGASP Organizing Committee
- Suzanna E Lewis
- Department of Molecular and Cellular Biology, University of California, Berkeley, California 94792, USA
- Member of the EGASP Advisory Board
- Martin G Reese
- Omicia Inc., Christie Ave., Emeryville, California 94608, USA
- Member of the EGASP Advisory Board
15
|
Windsor AJ, Mitchell-Olds T. Comparative genomics as a tool for gene discovery. Curr Opin Biotechnol 2006; 17:161-7. [PMID: 16459073 DOI: 10.1016/j.copbio.2006.01.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Received: 10/25/2005] [Revised: 12/20/2005] [Accepted: 01/20/2006] [Indexed: 01/21/2023]
Abstract
With the increasing availability of data from multiple eukaryotic genome sequencing projects, attention has focused on interspecific comparisons to discover novel genes and transcribed genomic sequences. Generally, these extrinsic strategies combine ab initio gene prediction with expression and/or homology data to identify conserved gene candidates between two or more genomes. Interspecific sequence analyses have proven invaluable for the improvement of existing annotations, automation of annotation, and identification of novel coding regions and splice variants. Further, comparative genomic approaches hold the promise of improved prediction of terminal or small exons, microRNA precursors, and small peptide-encoding open reading frames--sequence elements that are difficult to identify through purely intrinsic methodologies in the absence of experimental data.
Affiliation(s)
- Aaron J Windsor
  - Max-Planck-Institut fuer chemische Oekologie, Abteilung Genetik und Evolution, Hans-Knoell-Strasse 8, D-07745 Jena, Germany.
16
Brzoska PM, Brown C, Cassel M, Ceccardi T, Di Francisco V, Dubman A, Evans J, Fang R, Harris M, Hoover J, Hu F, Larry C, Li P, Malicdem M, Maltchenko S, Shannon M, Perkins S, Poulter K, Webster-Laig M, Xiao C, Young S, Spier G, Guegler K, Gilbert D, Samaha RR. An efficient and high-throughput approach for experimental validation of novel human gene predictions. Genomics 2006; 87:437-45. [PMID: 16406193 DOI: 10.1016/j.ygeno.2005.11.016] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 08/02/2005] [Revised: 10/26/2005] [Accepted: 11/24/2005] [Indexed: 11/29/2022]
Abstract
A highly automated RT-PCR-based approach has been established to validate novel human gene predictions with no prior experimental evidence of mRNA splicing (ab initio predictions). Ab initio gene predictions were selected for high-throughput validation using predicted protein classification, sequence similarity to other genomes, colocalization with an MPSS tag, or microarray expression. Initial microarray prioritization followed by RT-PCR validation was the most efficient combination, resulting in approximately 35% of the ab initio predictions being validated by RT-PCR. Of the 7252 novel genes that were prioritized and processed, 796 constituted real transcripts. In addition, high-throughput RACE successfully extended the 5' and/or 3' ends of >60% of RT-PCR-validated genes. Reevaluation of these transcripts produced 574 novel transcripts using RefSeq as a reference. RT-PCR sequencing in combination with RACE on ab initio gene predictions could be used to define the transcriptome across all species.
Affiliation(s)
- Pius M Brzoska
  - Applied Biosystems, 850 Lincoln Center Drive, Foster City, CA 94404, USA
17
Abstract
Driven by competition, automation, and technology, the genomics community has far exceeded its ambition to sequence the human genome by 2005. By analyzing mammalian genomes, we have shed light on the history of our DNA sequence, determined that alternatively spliced RNAs and retroposed pseudogenes are incredibly abundant, and glimpsed the apparently huge number of non-coding RNAs that play significant roles in gene regulation. Ultimately, genome science is likely to provide comprehensive catalogs of these elements. However, the methods we have been using for most of the last 10 years will not yield even one complete open reading frame (ORF) for every gene--the first plateau on the long climb toward a comprehensive catalog. These strategies--sequencing randomly selected cDNA clones, aligning protein sequences identified in other organisms, sequencing more genomes, and manual curation--will have to be supplemented by large-scale amplification and sequencing of specific predicted mRNAs. The steady improvements in gene prediction that have occurred over the last 10 years have increased the efficacy of this approach and decreased its cost. In this Perspective, I review the state of gene prediction roughly 10 years ago, summarize the progress that has been made since, argue that the primary ORF identification methods we have relied on so far are inadequate, and recommend a path toward completing the Catalog of Protein Coding Genes, Version 1.0.
Affiliation(s)
- Michael R Brent
  - Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA.
18
Eyras E, Reymond A, Castelo R, Bye JM, Camara F, Flicek P, Huckle EJ, Parra G, Shteynberg DD, Wyss C, Rogers J, Antonarakis SE, Birney E, Guigo R, Brent MR. Gene finding in the chicken genome. BMC Bioinformatics 2005; 6:131. [PMID: 15924626 PMCID: PMC1174864 DOI: 10.1186/1471-2105-6-131] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Received: 10/26/2004] [Accepted: 05/30/2005] [Indexed: 11/24/2022] Open
Abstract
Background: Despite the continuous production of genome sequence for a number of organisms, reliable, comprehensive, and cost effective gene prediction remains problematic. This is particularly true for genomes for which there is not a large collection of known gene sequences, such as the recently published chicken genome. We used the chicken sequence to test comparative and homology-based gene-finding methods followed by experimental validation as an effective genome annotation method. Results: We performed experimental evaluation by RT-PCR of three different computational gene finders, Ensembl, SGP2 and TWINSCAN, applied to the chicken genome. A Venn diagram was computed and each component of it was evaluated. The results showed that de novo comparative methods can identify up to about 700 chicken genes with no previous evidence of expression, and can correctly extend about 40% of homology-based predictions at the 5' end. Conclusions: De novo comparative gene prediction followed by experimental verification is effective at enhancing the annotation of the newly sequenced genomes provided by standard homology-based methods.
Affiliation(s)
- Eduardo Eyras
  - Research Group in Biomedical Informatics, Institut Municipal d'Investigacio Medica/Universitat Pompeu Fabra/Centre de Regulacio Genomica, E08003 Barcelona, Catalonia, Spain
- Alexandre Reymond
  - Department of Genetic Medicine and Development, University of Geneva, Medical School and University Hospital of Geneva, CMU, 1, rue Michel Servet, 1211 Geneva, Switzerland
  - Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
- Robert Castelo
  - Research Group in Biomedical Informatics, Institut Municipal d'Investigacio Medica/Universitat Pompeu Fabra/Centre de Regulacio Genomica, E08003 Barcelona, Catalonia, Spain
- Jacqueline M Bye
  - The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Francisco Camara
  - Research Group in Biomedical Informatics, Institut Municipal d'Investigacio Medica/Universitat Pompeu Fabra/Centre de Regulacio Genomica, E08003 Barcelona, Catalonia, Spain
- Paul Flicek
  - Laboratory for Computational Genomics and Department of Computer Science, Campus Box 1045, Washington University, One Brookings Drive, St Louis, Missouri 63130, USA
- Elizabeth J Huckle
  - The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Genis Parra
  - Research Group in Biomedical Informatics, Institut Municipal d'Investigacio Medica/Universitat Pompeu Fabra/Centre de Regulacio Genomica, E08003 Barcelona, Catalonia, Spain
- David D Shteynberg
  - Laboratory for Computational Genomics and Department of Computer Science, Campus Box 1045, Washington University, One Brookings Drive, St Louis, Missouri 63130, USA
- Carine Wyss
  - Department of Genetic Medicine and Development, University of Geneva, Medical School and University Hospital of Geneva, CMU, 1, rue Michel Servet, 1211 Geneva, Switzerland
- Jane Rogers
  - The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Stylianos E Antonarakis
  - Department of Genetic Medicine and Development, University of Geneva, Medical School and University Hospital of Geneva, CMU, 1, rue Michel Servet, 1211 Geneva, Switzerland
- Ewan Birney
  - EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Roderic Guigo
  - Research Group in Biomedical Informatics, Institut Municipal d'Investigacio Medica/Universitat Pompeu Fabra/Centre de Regulacio Genomica, E08003 Barcelona, Catalonia, Spain
- Michael R Brent
  - Laboratory for Computational Genomics and Department of Computer Science, Campus Box 1045, Washington University, One Brookings Drive, St Louis, Missouri 63130, USA
19
Castelo R, Reymond A, Wyss C, Câmara F, Parra G, Antonarakis SE, Guigó R, Eyras E. Comparative gene finding in chicken indicates that we are closing in on the set of multi-exonic widely expressed human genes. Nucleic Acids Res 2005; 33:1935-9. [PMID: 15809229 PMCID: PMC1074396 DOI: 10.1093/nar/gki328] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 01/08/2023] Open
Abstract
The recent availability of the chicken genome sequence poses the question of whether there are human protein-coding genes conserved in chicken that are currently not included in the human gene catalog. Here, we show, using comparative gene finding followed by experimental verification of exon pairs by RT-PCR, that the addition to the multi-exonic subset of this catalog could be as little as 0.2%, suggesting that we may be closing in on the human gene set. Our protocol, however, has two shortcomings: (i) the bioinformatic screening of the predicted genes, applied to filter out false positives, cannot handle intronless genes; and (ii) the experimental verification could fail to identify expression at a specific developmental time. This highlights the importance of developing methods that could provide a reliable estimate of the number of these two types of genes.
Affiliation(s)
- Robert Castelo
  - Research Unit on Biomedical Informatics, Institut Municipal d'Investigació Mèdica/Universitat Pompeu Fabra, Centre de Regulació Genòmica E08003 Barcelona, Spain.
20
Wei C, Lamesch P, Arumugam M, Rosenberg J, Hu P, Vidal M, Brent MR. Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions. Genome Res 2005; 15:577-82. [PMID: 15805498 PMCID: PMC1074372 DOI: 10.1101/gr.3329005] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Received: 11/06/2004] [Accepted: 01/26/2005] [Indexed: 11/25/2022]
Abstract
The genome of Caenorhabditis elegans was the first animal genome to be sequenced. Although considerable effort has been devoted to annotating it, the standard WormBase annotation contains thousands of predicted genes for which there is no cDNA or EST evidence. We hypothesized that a more complete experimental annotation could be obtained by creating a more accurate gene-prediction program and then amplifying and sequencing predicted genes. Our approach was to adapt the TWINSCAN gene prediction system to C. elegans and C. briggsae and to improve its splice site and intron-length models. The resulting system has 60% sensitivity and 58% specificity in exact prediction of open reading frames (ORFs), and hence, proteins--the best results we are aware of for any multicellular organism. We then attempted to amplify, clone, and sequence 265 TWINSCAN-predicted ORFs that did not overlap WormBase gene annotations. The success rate was 55%, adding 146 genes that were completely absent from WormBase to the ORF clone collection (ORFeome). The same procedure had a 7% success rate on 90 WormBase "predicted" genes that do not overlap TWINSCAN predictions. These results indicate that the accuracy of WormBase could be significantly increased by replacing its partially curated predicted genes with TWINSCAN predictions. The technology described in this study will continue to drive the C. elegans ORFeome toward completion and contribute to the annotation of the three Caenorhabditis species currently being sequenced. The results also suggest that this technology can significantly improve our knowledge of the "parts list" for even the best-studied model organisms.
Affiliation(s)
- Chaochun Wei
  - Laboratory for Computational Genomics and Department of Computer Science and Engineering, Washington University, St. Louis, Missouri 63130, USA
21
Ren P, Roncaglia P, Springer DJ, Fan J, Chaturvedi V. Genomic organization and expression of 23 new genes from MATalpha locus of Cryptococcus neoformans var. gattii. Biochem Biophys Res Commun 2005; 326:233-41. [PMID: 15567176 DOI: 10.1016/j.bbrc.2004.11.017] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 11/02/2004] [Indexed: 11/22/2022]
Abstract
The pathogenic yeast Cryptococcus neoformans (Cn) causes cryptococcosis, a life-threatening disease of the brain. Molecular studies of Cn variety gattii have lagged behind those of the other two varieties (var. grubii and var. neoformans), although they have distinct biology and disease patterns. We focused on gene discovery in the MATalpha locus because it predominates in clinical strains. A var. gattii cosmid library was screened with DNA probes from the other two varieties. Two positive clones were sequenced to identify ORFs based on similarities to known proteins and to ESTs, both by bioinformatic analysis and manually by a curator. Approximately 76 kb of sequenced DNA revealed 23 genes and ORFs. The existence of predicted genes was verified by RT-PCR analyses designed to amplify spliced sequences. The results confirmed that the transcripts were expressed at both 30 and 37 degrees C. The var. gattii MATalpha locus genes showed rearrangements in order and orientation vis-a-vis the other two varieties. Mating-specific genes showed higher nonsynonymous mutation rates, and gene trees showed var. gattii strains in a distinct clade. The identification of the largest number, thus far, of var. gattii structural genes should set the stage for future molecular pathogenesis studies.
Affiliation(s)
- Ping Ren
  - Mycology Laboratory, Wadsworth Center, New York State Department of Health, Albany, NY, USA
22
Abstract
De novo gene predictors are programs that predict the exon-intron structures of genes using the sequences of one or more genomes as their only input. In the past two years, dual-genome de novo predictors, which exploit local rates and patterns of mutation inferred from alignments between two genomes, have led to significant improvements in accuracy. Systems that exploit more than two genomes simultaneously have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements. Dual-genome de novo prediction for compact eukaryotic genomes such as those of Arabidopsis thaliana and Caenorhabditis elegans is already quite accurate. Although mammalian gene prediction lags behind in accuracy, it is yielding ever more useful results. Coupled with significant improvements in pseudogene detection methods, which have eliminated many false positives, we have reached the point where de novo gene predictions are being used as hypotheses to drive experimental annotation via systematic RT-PCR and sequencing.
Affiliation(s)
- Michael R Brent
  - Laboratory for Computational Genomics, Campus Box 1045, Washington University, One Brookings Drive, St Louis, Missouri 63130, USA.
23
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 2004; 432:695-716. [PMID: 15592404 DOI: 10.1038/nature03154] [Citation(s) in RCA: 1953] [Impact Index Per Article: 97.7] [Received: 07/19/2004] [Accepted: 11/01/2004] [Indexed: 12/28/2022]
Abstract
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
24
Tenney AE, Brown RH, Vaske C, Lodge JK, Doering TL, Brent MR. Gene prediction and verification in a compact genome with numerous small introns. Genome Res 2004; 14:2330-5. [PMID: 15479946 PMCID: PMC525692 DOI: 10.1101/gr.2816704] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Indexed: 11/24/2022]
Abstract
The genomes of clusters of related eukaryotes are now being sequenced at an increasing rate, creating a need for accurate, low-cost annotation of exon-intron structures. In this paper, we demonstrate that reverse transcription-polymerase chain reaction (RT-PCR) and direct sequencing based on predicted gene structures satisfy this need, at least for single-celled eukaryotes. The TWINSCAN gene prediction algorithm was adapted for the fungal pathogen Cryptococcus neoformans by using a precise model of intron lengths in combination with ungapped alignments between the genome sequences of the two closely related Cryptococcus varieties. This approach resulted in approximately 60% of known genes being predicted exactly right at every coding base and splice site. When previously unannotated TWINSCAN predictions were tested by RT-PCR and direct sequencing, 75% of targets spanning two predicted introns were amplified and produced high-quality sequence. When targets spanning the complete predicted open reading frame were tested, 72% of them amplified and produced high-quality sequence. We conclude that sequencing a small number of expressed sequence tags (ESTs) to provide training data, running TWINSCAN on an entire genome, and then performing RT-PCR and direct sequencing on all of its predictions would be a cost-effective method for obtaining an experimentally verified genome annotation.
Affiliation(s)
- Aaron E Tenney
  - Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA
25
Wang M, Buhler J, Brent MR. The effects of evolutionary distance on TWINSCAN, an algorithm for pair-wise comparative gene prediction. Cold Spring Harbor Symposia on Quantitative Biology 2004; 68:125-30. [PMID: 15338610 DOI: 10.1101/sqb.2003.68.125] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/24/2022]
Affiliation(s)
- M Wang
  - Department of Computer Science and Engineering, Washington University, St. Louis, Missouri 63130, USA