1
|
Demain AL, Vandamme EJ, Collins J, Buchholz K. History of Industrial Biotechnology. Ind Biotechnol (New Rochelle N Y) 2016. [DOI: 10.1002/9783527807796.ch1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/24/2022] Open
Affiliation(s)
- Arnold L. Demain
- Drew University; Charles A. Dana Research Institute for Scientists Emeriti (R.I.S.E.); 36, Madison Ave Madison NJ 07940 USA
- Erick J. Vandamme
- Ghent University; Department of Biochemical and Microbial Technology; Belgium
- John Collins
- Science historian; Leipziger Straße 82A; 38124 Braunschweig Germany
- Klaus Buchholz
- Technical University Braunschweig; Institute of Chemical Engineering; Hans-Sommer-Str. 10 38106 Braunschweig Germany
|
2
|
Bermudez-Santana CI. APLICACIONES DE LA BIOINFORMÁTICA EN LA MEDICINA: EL GENOMA HUMANO. ¿CÓMO PODEMOS VER TANTO DETALLE? [Applications of bioinformatics in medicine: the human genome. How can we see so much detail?] Acta Biológica Colombiana 2016. [DOI: 10.15446/abc.v21n1supl.51233] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022] Open
Abstract
Bioinformatics is a new field that supports part of the biological research aimed at identifying gene variants that can be discovered from whole-genome studies. With this motivation, an overview of the main contributions of bioinformatics to the sequencing of the first human genome is presented. The main computational advances developed to meet the demands of next-generation sequencing methods for re-sequencing a human genome are also summarized. Finally, some of the new challenges that must be faced to apply personalized genomics to the development of medicine are introduced.
|
3
|
Santos A, Tsafou K, Stolte C, Pletscher-Frankild S, O’Donoghue SI, Jensen LJ. Comprehensive comparison of large-scale tissue expression datasets. PeerJ 2015; 3:e1054. [PMID: 26157623 PMCID: PMC4493645 DOI: 10.7717/peerj.1054] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.7] [Received: 02/16/2015] [Accepted: 06/04/2015] [Indexed: 01/01/2023] Open
Abstract
For tissues to carry out their functions, they rely on the right proteins to be present. Several high-throughput technologies have been used to map out which proteins are expressed in which tissues; however, the data have not previously been systematically compared and integrated. We present a comprehensive evaluation of tissue expression data from a variety of experimental techniques and show that these agree surprisingly well with each other and with results from literature curation and text mining. We further found that most datasets support the assumed but not demonstrated distinction between tissue-specific and ubiquitous expression. By developing comparable confidence scores for all types of evidence, we show that it is possible to improve both quality and coverage by combining the datasets. To facilitate use and visualization of our work, we have developed the TISSUES resource (http://tissues.jensenlab.org), which makes all the scored and integrated data available through a single user-friendly web interface.
Affiliation(s)
- Alberto Santos
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Kalliopi Tsafou
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Christian Stolte
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
- Sune Pletscher-Frankild
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Seán I. O’Donoghue
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
- Garvan Institute of Medical Research, Sydney, Australia
- Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
|
4
|
Wu PY, Phan JH, Wang MD. Assessing the impact of human genome annotation choice on RNA-seq expression estimates. BMC Bioinformatics 2013; 14 Suppl 11:S8. [PMID: 24564364 PMCID: PMC3816316 DOI: 10.1186/1471-2105-14-s11-s8] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 11/20/2022] Open
Abstract
Background: Genome annotation is a crucial component of RNA-seq data analysis. Much effort has been devoted to producing an accurate and rational annotation of the human genome. An annotated genome provides a comprehensive catalogue of genomic functional elements. Currently, at least six human genome annotations are publicly available, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Characteristics of these annotations differ because of variations in annotation strategies and information sources. When performing RNA-seq data analysis, researchers need to choose a genome annotation. However, the effect of genome annotation choice on downstream RNA-seq expression estimates is still unclear. This study (1) investigates the effect of different genome annotations on RNA-seq quantification and (2) provides guidelines for choosing a genome annotation based on research focus.
Results: We define the complexity of human genome annotations in terms of the number of genes, isoforms, and exons. This definition facilitates an investigation of potential relationships between complexity and variations in RNA-seq quantification. We apply several evaluation metrics to demonstrate the impact of genome annotation choice on RNA-seq expression estimates. In the mapping stage, the least complex genome annotation, RefSeq Genes, appears to have the highest percentage of uniquely mapped short sequence reads. In the quantification stage, RefSeq Genes results in the most stable expression estimates in terms of the average coefficient of variation over all genes. Stable expression estimates in the quantification stage translate to accurate statistics for detecting differentially expressed genes. We observe that RefSeq Genes produces the most accurate fold-change measures with respect to a ground truth of RT-qPCR gene expression estimates.
Conclusions: Based on the observed variations in the mapping, quantification, and differential expression calling stages, we demonstrate that the selection of human genome annotation results in different gene expression estimates. When conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation may be preferred. However, simpler genome annotations may limit opportunities for identifying or characterizing novel transcriptional or regulatory mechanisms. When conducting research that aims to be more exploratory, a more complex genome annotation may be preferred.
|
5
|
Expressed sequence tags of the peanut pod nematode Ditylenchus africanus: the first transcriptome analysis of an Anguinid nematode. Mol Biochem Parasitol 2009; 167:32-40. [PMID: 19383517 DOI: 10.1016/j.molbiopara.2009.04.004] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.5] [Received: 10/28/2008] [Revised: 04/07/2009] [Accepted: 04/12/2009] [Indexed: 11/20/2022]
Abstract
In this study, 4847 expressed sequence tags (ESTs) from mixed stages of the migratory plant-parasitic nematode Ditylenchus africanus (peanut pod nematode) were investigated. It is the first molecular survey of a nematode that belongs to the family Anguinidae (order Rhabditida, superfamily Sphaerularioidea). The sequences were clustered into 2596 unigenes, of which 43% did not show any homology to known protein, nucleotide, nematode EST or plant-parasitic nematode genome sequences. Gene ontology mapping revealed that most putative proteins are involved in developmental and reproductive processes. In addition, unigenes involved in oxidative stress as well as in anhydrobiosis, such as LEA (late embryogenesis abundant protein) and trehalose-6-phosphate synthase, were identified. Other tags showed homology to genes previously described as being involved in parasitism (expansin, SEC-2, calreticulin, 14-3-3b and various allergen proteins). In situ hybridization revealed that the expression of a putative expansin and a venom allergen protein was restricted to the gland cell area of the nematode, in agreement with their presumed role in parasitism. Furthermore, seven putative novel candidate parasitism genes were identified based on the prediction of a signal peptide in the corresponding protein sequence and homologous ESTs exclusively in parasitic nematodes. These genes are interesting for further research and functional characterization. Finally, 34 unigenes were retained as good target candidates for future RNAi experiments, because of their nematode-specific nature and the observed lethal phenotypes of Caenorhabditis elegans homologs.
|
6
|
Salzburger W, Renn SCP, Steinke D, Braasch I, Hofmann HA, Meyer A. Annotation of expressed sequence tags for the East African cichlid fish Astatotilapia burtoni and evolutionary analyses of cichlid ORFs. BMC Genomics 2008; 9:96. [PMID: 18298844 PMCID: PMC2279125 DOI: 10.1186/1471-2164-9-96] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Received: 10/11/2007] [Accepted: 02/25/2008] [Indexed: 11/13/2022] Open
Abstract
Background: The cichlid fishes in general, and the exceptionally diverse East African haplochromine cichlids in particular, are famous examples of adaptive radiation and explosive speciation. Here we report the collection and annotation of more than 12,000 expressed sequence tags (ESTs) generated from three different cDNA libraries obtained from the East African haplochromine cichlid species Astatotilapia burtoni and Metriaclima zebra.
Results: We first annotated more than 12,000 newly generated cichlid ESTs using the Gene Ontology classification system. For evolutionary analyses, we combined these ESTs with all available sequence data for haplochromine cichlids, which resulted in a total of more than 45,000 ESTs. The ESTs represent a broad range of molecular functions and biological processes. We compared the haplochromine ESTs to sequence data available for other fish model systems such as pufferfish (Takifugu rubripes and Tetraodon nigroviridis), trout, and zebrafish. We characterized genes that show a faster or slower rate of base substitutions in haplochromine cichlids compared to other fish species, as this is indicative of a relaxed or reinforced selection regime. Four of these genes showed the signature of positive selection as revealed by calculating Ka/Ks ratios.
Conclusion: About 22% of the surveyed ESTs were found to have cichlid-specific rate differences, suggesting that these genes might play a role in lineage-specific characteristics of cichlids. We also conclude that the four genes with a Ka/Ks ratio greater than one appear to be good candidate genes for further work on the genetic basis of the evolutionary success of haplochromine cichlid fishes.
Affiliation(s)
- Walter Salzburger
- Lehrstuhl für Zoologie und Evolutionsbiologie, Department of Biology, University of Konstanz, 78467 Konstanz, Germany.
|
7
|
Abstract
In recent years, genome-wide detection of alternative splicing based on Expressed Sequence Tag (EST) sequence alignments with mRNA and genomic sequences has dramatically expanded our understanding of the role of alternative splicing in functional regulation. This chapter reviews the data, methodology, and technical challenges of these genome-wide analyses of alternative splicing, and briefly surveys some of the uses to which such alternative splicing databases have been put. For example, with proper alternative splicing database schema design, it is possible to query genome-wide for alternative splicing patterns that are specific to particular tissues, disease states (e.g., cancer), gender, or developmental stages. EST alignments can be used to estimate exon inclusion or exclusion level of alternatively spliced exons and evolutionary changes for various species can be inferred from exon inclusion level. Such databases can also help automate design of probes for RT-PCR and microarrays, enabling high throughput experimental measurement of alternative splicing.
|
8
|
Arhondakis S, Clay O, Bernardi G. Compositional properties of human cDNA libraries: practical implications. FEBS Lett 2006; 580:5772-8. [PMID: 17022979 DOI: 10.1016/j.febslet.2006.09.034] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 06/30/2006] [Revised: 09/12/2006] [Accepted: 09/19/2006] [Indexed: 01/28/2023]
Abstract
The strikingly wide and bimodal gene distribution exhibited by the human genome has prompted us to study the correlations between EST-counts (expression levels) and base composition of genes, especially since existing data are contradictory. Here we investigate how cDNA library preparation affects the GC distributions of ESTs and/or genes found in the library, and address consequences for expression studies. We observe that strongly anomalous GC distributions often indicate experimental biases or deficits during their preparation. We propose the use of compositional distributions of raw ESTs from a cDNA library, and/or of the genes they represent, as a simple and effective tool for quality control.
Affiliation(s)
- Stilianos Arhondakis
- Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, 80121 Naples, Italy
|
9
|
Affiliation(s)
- Simon Gregory
- Duke University Medical Center Durham North Carolina
- John Gilbert
- Duke University Medical Center Durham North Carolina
|
10
|
Abstract
Motivation: mRNA sequences and expressed sequence tags represent some of the most abundant experimental data for identifying genes and alternatively spliced products in metazoans. These transcript sequences are frequently studied by aligning them to a genomic sequence template. For existing programs, error-prone, polymorphic and cross-species data, as well as non-canonical splice sites, still present significant barriers to producing accurate, complete alignments.
Results: We took a novel approach to spliced alignment that meaningfully combined information from sequence similarity with that obtained from PSSM splice site models. Scoring systems were chosen to maximize their power of discrimination, and dynamic programming (DP) was employed to guarantee optimal solutions would be found. The resultant program, EXALIN, performed better than other popular tools tested under a wide range of conditions that included detection of micro-exons and human-mouse cross-species comparisons. For improved speed with only a marginal decrease in splice site prediction accuracy, EXALIN could perform limited DP guided by a result from BLASTN.
Availability: The source code, binaries, scripts, scoring matrices and splice site models for human, mouse, rice and Caenorhabditis elegans utilized in this study are posted at http://blast.wustl.edu/exalin. The software (scripts, source code and binaries) is copyrighted but free for all to use.
Affiliation(s)
- Miao Zhang
- Department of Genetics, School of Medicine, Washington University-St Louis, 4566 Scott Avenue, St Louis, MO 63110, USA
|
11
|
Larsson TP, Murray CG, Hill T, Fredriksson R, Schiöth HB. Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery. FEBS Lett 2005; 579:690-8. [PMID: 15670830 DOI: 10.1016/j.febslet.2004.12.046] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Received: 11/15/2004] [Revised: 12/13/2004] [Accepted: 12/13/2004] [Indexed: 11/25/2022]
Abstract
Large amounts of refined sequence material in the form of predicted, curated and annotated genes and expressed sequence tags (ESTs) have recently been added to the NCBI databases. We matched the transcript sequences of RefSeq, Ensembl and dbEST in an attempt to provide an updated overview of how many unique human genes can be found. The results indicate that there are about 25,000 unique genes in the union of RefSeq and Ensembl with 12-18% and 8-13% of the genes in each set unique to the other set, respectively. About 20% of all genes had splice variants. A considerable number of ESTs (2,200,000) do not match the identified genes, and we used an in-house pipeline to identify 22 novel genes from Genscan predictions that have considerable EST coverage. The study provides an insight into the current status of human gene catalogues and shows that considerable refinement of methods and datasets is needed to come to a conclusive gene count.
Affiliation(s)
- Thomas P Larsson
- Department of Neuroscience, Uppsala University, BMC Box 593, 751 24 Uppsala, Sweden.
|
12
|
Sogayar MC, Camargo AA, Bettoni F, Carraro DM, Pires LC, Parmigiani RB, Ferreira EN, de Sá Moreira E, do Rosário D de O Latorre M, Simpson AJG, Cruz LO, Degaki TL, Festa F, Massirer KB, Sogayar MC, Filho FC, Camargo LP, Cunha MAV, De Souza SJ, Faria M, Giuliatti S, Kopp L, de Oliveira PSL, Paiva PB, Pereira AA, Pinheiro DG, Puga RD, S de Souza JE, Albuquerque DM, Andrade LEC, Baia GS, Briones MRS, Cavaleiro-Luna AMS, Cerutti JM, Costa FF, Costanzi-Strauss E, Espreafico EM, Ferrasi AC, Ferro ES, Fortes MAHZ, Furchi JRF, Giannella-Neto D, Goldman GH, Goldman MHS, Gruber A, Guimarães GS, Hackel C, Henrique-Silva F, Kimura ET, Leoni SG, Macedo C, Malnic B, Manzini B CV, Marie SKN, Martinez-Rossi NM, Menossi M, Miracca EC, Nagai MA, Nobrega FG, Nobrega MP, Oba-Shinjo SM, Oliveira MK, Orabona GM, Otsuka AY, Paço-Larson ML, Paixão BMC, Pandolfi JRC, Pardini MIMC, Passos Bueno MR, Passos GAS, Pesquero JB, Pessoa JG, Rahal P, Rainho CA, Reis CP, Ricca TI, Rodrigues V, Rogatto SR, Romano CM, Romeiro JG, Rossi A, Sá RG, Sales MM, Sant'Anna SC, Santarosa PL, Segato F, Silva WA, Silva IDCG, Silva NP, Soares-Costa A, Sonati MF, Strauss BE, Tajara EH, Valentini SR, Villanova FE, Ward LS, Zanette DL. A transcript finishing initiative for closing gaps in the human transcriptome. Genome Res 2004; 14:1413-23. [PMID: 15197164 PMCID: PMC442158 DOI: 10.1101/gr.2111304] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Received: 10/23/2003] [Accepted: 03/12/2004] [Indexed: 11/24/2022]
Abstract
We report the results of a transcript finishing initiative, undertaken for the purpose of identifying and characterizing novel human transcripts, in which RT-PCR was used to bridge gaps between paired EST clusters, mapped against the genomic sequence. Each pair of EST clusters selected for experimental validation was designated a transcript finishing unit (TFU). A total of 489 TFUs were selected for validation, and an overall efficiency of 43.1% was achieved. We generated a total of 59,975 bp of transcribed sequences organized into 432 exons, contributing to the definition of the structure of 211 human transcripts. The structure of several transcripts reported here was confirmed during the course of this project, through the generation of their corresponding full-length cDNA sequences. Nevertheless, for 21% of the validated TFUs, a full-length cDNA sequence is not yet available in public databases, and the structure of 69.2% of these TFUs was not correctly predicted by computer programs. The TF strategy provides a significant contribution to the definition of the complete catalog of human genes and transcripts, because it appears to be particularly useful for identification of low abundance transcripts expressed in a restricted set of tissues as well as for the delineation of gene boundaries and alternatively spliced isoforms.
|
13
|
Close J, Game L, Clark B, Bergounioux J, Gerovassili A, Thein SL. Genome annotation of a 1.5 Mb region of human chromosome 6q23 encompassing a quantitative trait locus for fetal hemoglobin expression in adults. BMC Genomics 2004; 5:33. [PMID: 15169551 PMCID: PMC441375 DOI: 10.1186/1471-2164-5-33] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.1] [Received: 02/13/2004] [Accepted: 05/31/2004] [Indexed: 12/24/2022] Open
Abstract
Background: Heterocellular hereditary persistence of fetal hemoglobin (HPFH) is a common multifactorial trait characterized by a modest increase of fetal hemoglobin levels in adults. We previously localized a Quantitative Trait Locus for HPFH in an extensive Asian-Indian kindred to chromosome 6q23. As part of the strategy of positional cloning and a means towards identification of the specific genetic alteration in this family, a thorough annotation of the candidate interval based on a strategy of in silico / wet biology approach with comparative genomics was conducted.
Results: The ~1.5 Mb candidate region was shown to contain five protein-coding genes. We discovered a very large uncharacterized gene containing WD40 and SH3 domains (AHI1), and extended the annotation of four previously characterized genes (MYB, ALDH8A1, HBS1L and PDE7B). We also identified several genes that do not appear to be protein coding, and generated 17 kb of novel transcript sequence data from re-sequencing 97 EST clones.
Conclusion: Detailed and thorough annotation of this 1.5 Mb interval in 6q confirms a high level of aberrant transcripts in testicular tissue. The candidate interval was shown to exhibit an extraordinary level of alternate splicing – 19 transcripts were identified for the 5 protein coding genes, but it appears that a significant portion (14/19) of these alternate transcripts did not have an open reading frame, hence their functional role is questionable. These transcripts may result from aberrant rather than regulated splicing.
Affiliation(s)
- James Close
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- SANE POWIC, Warneford Hospital, Department of Psychiatry, University of Oxford, Oxford, OX3 7JX, UK
- Laurence Game
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- CSC-IC Microarray Centre, 2nd floor, L-block, Room 221, Imperial College Faculty of Medicine, Hammersmith Hospital Campus, Du Cane Road, London, W12 0NN, UK
- Barnaby Clark
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- Jean Bergounioux
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- Unité de soins intensif pédiatrique, Hôpital Universitaire Krémlin Bicêtre, 63 av. Gabriel Péri, 94270 Le Krémlin Bicêtre, France
- Ageliki Gerovassili
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
- Swee Lay Thein
- Department of Haematological Medicine, GKT School of Medicine, King's Denmark Hill Campus, Bessemer Road, London, SE5 9PJ, UK
|
14
|
Clark MS, Edwards YJK, Peterson D, Clifton SW, Thompson AJ, Sasaki M, Suzuki Y, Kikuchi K, Watabe S, Kawakami K, Sugano S, Elgar G, Johnson SL. Fugu ESTs: new resources for transcription analysis and genome annotation. Genome Res 2003; 13:2747-53. [PMID: 14613980 PMCID: PMC403817 DOI: 10.1101/gr.1691503] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.7] [Received: 06/25/2003] [Accepted: 09/10/2003] [Indexed: 10/26/2022]
Abstract
The draft Fugu rubripes genome was released in 2002, at which time relatively few cDNAs were available to aid in the annotation of genes. The data presented here describe the sequencing and analysis of 24,398 expressed sequence tags (ESTs) generated from 15 different adult and juvenile Fugu tissues, 74% of which matched protein database entries. Analysis of the EST data compared with the Fugu genome data predicts that approximately 10,116 gene tags have been generated, covering almost one-third of Fugu predicted genes. This represents a remarkable economy of effort. Comparison with the Washington University zebrafish EST assemblies indicates strong conservation within fish species, but significant differences remain. This potentially represents divergence of sequence in the 5' terminal exons and UTRs between these two fish species, although clearly, complete EST data sets are not available for either species. This project provides new Fugu resources, and the analysis adds significant weight to the argument that EST programs remain an essential resource for genome exploitation and annotation. This is particularly timely with the increasing availability of draft genome sequence from different organisms and the mounting emphasis on gene function and regulation.
Affiliation(s)
- Melody S Clark
- MRC Rosalind Franklin Centre for Genomics Research, (formerly known as the MRC UK HGMP Resource Centre), Genome Campus, Hinxton, Cambridge, CB10 1SB, UK.
|
15
|
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 2003; 31:5654-66. [PMID: 14500829 PMCID: PMC206470 DOI: 10.1093/nar/gkg770] [Citation(s) in RCA: 1309] [Impact Index Per Article: 62.3] [Indexed: 11/13/2022] Open
Abstract
The spliced alignment of expressed sequence data to genomic sequence has proven a key tool in the comprehensive annotation of genes in eukaryotic genomes. A novel algorithm was developed to assemble clusters of overlapping transcript alignments (ESTs and full-length cDNAs) into maximal alignment assemblies, thereby comprehensively incorporating all available transcript data and capturing subtle splicing variations. Complete and partial gene structures identified by this method were used to improve The Institute for Genomic Research Arabidopsis genome annotation (TIGR release v.4.0). The alignment assemblies permitted the automated modeling of several novel genes and >1000 alternative splicing variations as well as updates (including UTR annotations) to nearly half of the approximately 27 000 annotated protein coding genes. The algorithm of the Program to Assemble Spliced Alignments (PASA) tool is described, as well as the results of automated updates to Arabidopsis gene annotations.
Affiliation(s)
- Brian J Haas
- The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
|
16
|
Chen T, Reith ME, Ross NW, MacRae TH. Expressed sequence tag (EST)-based characterization of gene regulation in Artemia larvae. Invertebr Reprod Dev 2003. [DOI: 10.1080/07924259.2003.9652551] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
|
17
|
Huntley D, Hummerich H, Smedley D, Kittivoravitkul S, McCarthy M, Little P, Sergot M. GANESH: software for customized annotation of genome regions. Genome Res 2003; 13:2195-202. [PMID: 12952886 PMCID: PMC403729 DOI: 10.1101/gr.698103] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022]
Abstract
GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database, and are updated automatically on a regular schedule so that they are always immediately available in their most up-to-date versions. A Java front-end, executed as a stand alone application or web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes.
Affiliation(s)
- Derek Huntley
- Department of Computing, Imperial College, London SW7 2AZ, UK
|
18
|
Abstract
The draft of the human genome sequence is still incomplete. The outstanding tasks include filling in some gaps, finalizing the assembly of short sequences, improving sequence accuracy and correctly identifying coding regions. However, a closely related problem that receives little attention is the substantial number of incorrect annotations that have penetrated some of the widely used databases. This article illustrates this problem using the example of ubiquitin genes, and draws some conclusions that apply to false annotations in other short open reading frames (ORFs). Although the focus is on the human genome, other genomes are equally prone to similar propagation of false annotations.
Affiliation(s)
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem, 91904, Israel.
|
19
|
Silva APM, Salim ACM, Bulgarelli A, de Souza JES, Osório E, Caballero OL, Iseli C, Stevenson BJ, Jongeneel CV, de Souza SJ, Simpson AJG, Camargo AA. Identification of 9 novel transcripts and two RGSL genes within the hereditary prostate cancer region (HPC1) at 1q25. Gene 2003; 310:49-57. [PMID: 12801632 DOI: 10.1016/s0378-1119(03)00501-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
Abstract
We applied a systematic bioinformatics approach, followed by careful manual inspection and experimental validation, to identify additional expressed sequences located at the Hereditary Prostate Cancer Region (HPC1) between D1S2818 and D1S1642 on chromosome 1q25. All transcripts already described for the 1q25 region were identified and we were able to define 11 additional expressed sequences within this region (three full-length cDNA clone sequences and eight ESTs), increasing the gene count in this region by 38%. Five out of the 11 expressed sequences identified were shown to be expressed in prostate tissue and thus represent novel disease gene candidates for the HPC1 region. Here, we report a detailed characterization of these five novel disease gene candidates, their expression pattern in various tissues, their genomic organization and functional annotation. Two candidates (RGSL1 and RGSL2) correspond to novel members of the RGS family, which is involved in the regulation of G-protein signaling. RGSL1 and RGSL2 expression was detected by real-time polymerase chain reaction in normal prostate tissue, but could not be detected in prostate tumor cell lines, suggesting they might have a role in prostate cancer.
Affiliation(s)
- Ana Paula M Silva
- Ludwig Institute for Cancer Research, Rua Antonio Prudente 109, 4th floor, 01509-010, São Paulo, SP, Brazil
|
20
|
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.5] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022] Open
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
|
21
|
Abstract
The availability of the human genomic sequence is changing the way in which biological questions are addressed. Based on the prediction of genes from nucleotide sequences, homologies among their encoded amino acids can be analyzed and used to place them in distinct families. This serves as a first step in building hypotheses for testing the structural and functional properties of previously uncharacterized paralogous genes. As genomic information from more organisms becomes available, these hypotheses can be refined through comparative genomics and phylogenetic studies. Instead of the traditional single-gene approach in endocrine research, we are beginning to gain an understanding of entire mammalian genomes, thus providing the basis to reveal subfamilies and pathways for genes involved in ligand signaling. The present review provides selective examples of postgenomic approaches in the analysis of novel genes involved in hormonal signaling and their chromosomal locations, polymorphisms, splicing variants, differential expression, and physiological function. In the postgenomic era, scientists will be able to move from a gene-by-gene approach to a reconstructionistic one by reading the encyclopedia of life from a global perspective. Eventually, a community-based approach will yield new insights into the complexity of intercellular communications, thereby offering us an understanding of hormonal physiology and pathophysiology.
Affiliation(s)
- Chandra P Leo
- Division of Reproductive Biology, Department of Gynecology and Obstetrics, Stanford University School of Medicine, Stanford, California 94305-5317, USA
|
22
|
Wei L, Liu Y, Dubchak I, Shon J, Park J. Comparative genomics approaches to study organism similarities and differences. J Biomed Inform 2002; 35:142-50. [PMID: 12474427 DOI: 10.1016/s1532-0464(02)00506-3] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.9] [Indexed: 10/27/2022]
Abstract
Comparative genomics is a large-scale, holistic approach that compares two or more genomes to discover the similarities and differences between the genomes and to study the biology of the individual genomes. Comparative studies can be performed at different levels of the genomes to obtain multiple perspectives about the organisms. We discuss in detail the type of analyses that offer significant biological insights in the comparisons of (1) genome structure including overall genome statistics, repeats, genome rearrangement at both DNA and gene level, synteny, and breakpoints; (2) coding regions including gene content, protein content, orthologs, and paralogs; and (3) noncoding regions including the prediction of regulatory elements. We also briefly review the currently available computational tools in comparative genomics such as algorithms for genome-scale sequence alignment, gene identification, and nonhomology-based function prediction.
Affiliation(s)
- Liping Wei
- Nexus Genomics, Inc., 229 Polaris Ave., Suite 6, Mountain View, CA 94043, USA.
|
23
|
Abstract
The advent of whole-genome data resources--not only sequence but also other genome-scale data collections such as gene expression, protein interaction, and genetic variation--is having two marked, complementary effects on the relatively new discipline of bioinformatics. First, the veritable flood of data is creating a need and demand for new tools for dealing adequately with the deluge, and, second, the unprecedented extent, diversity, and impending completeness of the data sets are creating opportunities for new approaches to discovery based on computational methods.
Affiliation(s)
- D B Searls
- Bioinformatics Department, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania 19406, USA.
|
24
|
Camargo AA, de Souza SJ, Brentani RR, Simpson AJG. Human gene discovery through experimental definition of transcribed regions of the human genome. Curr Opin Chem Biol 2002; 6:13-6. [PMID: 11827817 DOI: 10.1016/s1367-5931(01)00279-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 10/17/2022]
Abstract
The sequencing of the human genome has failed to realize its primary goal: the identification of all human genes. We have learned that genes can only be identified with certainty within this vast and information-sparse structure by comparison with transcript sequences. Significantly more sequence data of this kind is required before we can claim to have deciphered our genetic blueprint.
Affiliation(s)
- Anamaria A Camargo
- The Ludwig Institute for Cancer Research, Rua Professor Antonio Prudente, 109, 4th floor, São Paulo, 01509-010, SP, Brazil
|
25
|
Das M, Burge CB, Park E, Colinas J, Pelletier J. Assessment of the total number of human transcription units. Genomics 2001; 77:71-8. [PMID: 11543635 DOI: 10.1006/geno.2001.6620] [Citation(s) in RCA: 44] [Impact Index Per Article: 1.9] [Indexed: 11/22/2022]
Abstract
Variation in the estimates of the number of genes encoded by the human genome (28,000-120,000) attests to the difficulty of systematically identifying human genes. Sequencing of human chromosome 22 (Chr22) provided the first comprehensive, unbiased view of an entire human chromosome, and intensive analysis of this sequence identified 545 genes and 134 pseudogenes that had similarity or identity to known proteins and/or ESTs and which were listed in the gene annotation (http://www.sanger.ac.uk/HGP/Chr22). This analysis yielded an estimate of approximately 36,000 functional expressed genes in the human genome (and 9000 pseudogenes). However, a key uncertainty in this estimate was that hundreds of additional genes beyond those annotated in the Chr22 sequence are predicted by the gene prediction program Genscan, an unknown number of which might represent additional expressed genes. To determine what fraction of these "predicted novel genes" (PNGs) represents expressed human genes, we used a sensitive RT-PCR assay to detect predicted transcripts in 17 tissues and one cell line. Our results indicate that at least 5000-9000 additional human genes which lack similarity to known genes or proteins exist in the human genome, increasing baseline gene estimates to approximately 41,000-45,000.
Affiliation(s)
- M Das
- Department of Biochemistry, McGill University, Rm 810, 3655 Drummond St., Montreal, Quebec, H3G 1Y6, Canada
|
26
|
Abstract
The recent release of the draft sequence and the eventual completion of the human genome present the scientific community with a rich source of data to mine. Yet, these data are content poor in the absence of additional correlative information. Expressed sequence tag (EST) datasets and their associated gene indices have existed for many years, and represent the first attempt at understanding the complexity of the genome. These datasets remain extremely important as information sources and, in particular, as tools for analyzing the completed genomes. Here, we discuss the nature of ESTs and their associated tools and gene-indexing databases. In particular, we will compare three EST gene indices (UNIGENE, Merck Gene Index Version 2.0 and Doubletwist CAT), discuss how these gene indices are applied for both genome analysis and drug discovery, and demonstrate their importance as a complementary dataset to the annotated human genome.
Affiliation(s)
- J Yuan
- Department of Bioinformatics, Merck & Co., Inc., P.O. Box 2000-RY80-A1, Rahway, NJ 07065, USA.
|
27
|
Abstract
The wealth of information from various genome sequencing projects provides the biologist with a new perspective from which to analyze, and design experiments with, mammalian systems. The complexity of the information, however, requires new software tools, and numerous such tools are now available. Which type and which specific system is most effective depends, in part, upon how much sequence is to be analyzed and with what level of experimental support. Here we survey a number of mammalian genomic sequence analysis systems with respect to the data they provide and the ease of their use. The hope is to aid the experimental biologist in choosing the most appropriate tool for their analyses.
Affiliation(s)
- A Fortna
- Eleanor Roosevelt Institute, 1899 Gaylord St, Denver, CO 80206-1210, USA
|
28
|
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, 
Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J. Initial sequencing and analysis of the human genome. Nature 2001; 409:860-921. [PMID: 11237011 DOI: 10.1038/35057062] [Citation(s) in RCA: 14728] [Impact Index Per Article: 640.3] [Indexed: 12/11/2022]
Abstract
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
Affiliation(s)
- E S Lander
- Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, MA 02142, USA.
|
29
|
Abstract
'The right drug for the right patient at the right time'. Is this a realistic goal for today's pharmaceutical industry and tomorrow's medical practitioner? Or merely an over-simplistic refrain that can only ever be an unfulfilled dream? Here we discuss the reality behind the dream and illustrate how the analysis of genetic variation is a complex science that has the capacity to make significant contributions to drug discovery and development strategies. An understanding of the impact of human variation must be a central consideration in the future practice of pharmaceutical R&D.
|
30
|
Ruiz A, Pujana MA, Estivill X. Isolation and characterisation of a novel human gene (C9orf11) on chromosome 9p21, a region frequently deleted in human cancer. Biochim Biophys Acta 2000; 1517:128-34. [PMID: 11118625 DOI: 10.1016/s0167-4781(00)00272-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/21/2022]
Abstract
The chromosome 9p21 region has been described to be frequently deleted in several neoplasias. The cyclin dependent kinase inhibitor 2A (CDKN2A or P16) gene was cloned in this region and identified as a tumour suppressor gene. However, much evidence indicates the existence of another tumour suppressor gene located proximal to the CDKN2A gene, which could be involved in cutaneous malignant melanoma (CMM) initiation. In the present report we have further investigated this 9p21 chromosomal region and cloned and characterised a novel gene within it (C9orf11). This gene shares no similarities to any known gene or predicted protein representing a novel human gene. Nevertheless, a putative leucine zipper pattern is located at the C-terminal end of the predicted protein, suggesting that it could dimerise. C9orf11 encodes for a protein of 294 amino acids with a predicted molecular mass of 32.8 kDa. C9orf11 is organised in eight exons that encompass a region of approx. 13 kb. Expression analysis demonstrates that C9orf11 is highly expressed in testis, although minor expression was seen in other tissues. Mutations in the C9orf11 gene were not detected in CMM families that were negative for CDKN2A mutations. Two SNPs for the C9orf11 gene have been identified, which could be used in segregation or association studies for other disorders.
Affiliation(s)
- A Ruiz
- Medical and Molecular Genetics Centre - IRO, Hospital Duran i Reynals, Autovia de Castelldefels km 2,7, 08907 L'Hospitalet de Llobregat, Barcelona, Catalonia, Spain
|
31
|
Abstract
High-throughput gene sequencing has revolutionized the process used to identify novel molecular targets for drug discovery. Thousands of new gene sequences have been generated but only a limited number of these can be converted into validated targets likely to be involved in disease. We describe here some of the approaches used at SmithKline Beecham to select and validate novel targets. These include the identification of selective tissue gene product expression, such as for cathepsin K, a novel osteoclast-specific cysteine protease. We also describe the discovery and functional characterization of novel members of the G-protein coupled receptor superfamily and their pairing with natural ligands. Lastly, we discuss the promises of gene microarrays and proteomics, developing technologies that allow the parallel analyses of tissue expression patterns of thousands of genes or proteins, respectively.
Affiliation(s)
- C Debouck
- Discovery Chemistry & Platform Technologies, SmithKline Beecham Pharmaceuticals, Research & Development, King of Prussia, Pennsylvania 19406, USA.
|
32
|
Lai CH, Chou CY, Ch'ang LY, Liu CS, Lin W. Identification of novel human genes evolutionarily conserved in Caenorhabditis elegans by comparative proteomics. Genome Res 2000; 10:703-13. [PMID: 10810093 PMCID: PMC310876 DOI: 10.1101/gr.10.5.703] [Citation(s) in RCA: 342] [Impact Index Per Article: 14.3] [Indexed: 12/18/2022]
Abstract
Modern biomedical research greatly benefits from large-scale genome-sequencing projects ranging from studies of viruses, bacteria, and yeast to multicellular organisms, like Caenorhabditis elegans. Comparative genomic studies offer a vast array of prospects for identification and functional annotation of human ortholog genes. We presented a novel comparative proteomic approach for assembling human gene contigs and assisting gene discovery. The C. elegans proteome was used as an alignment template to assist in novel human gene identification from human EST nucleotide databases. Among the available 18,452 C. elegans protein sequences, our results indicate that at least 83% (15,344 sequences) of C. elegans proteome has human homologous genes, with 7,954 records of C. elegans proteins matching known human gene transcripts. Only 11% or less of C. elegans proteome contains nematode-specific genes. We found that the remaining 7,390 sequences might lead to discoveries of novel human genes, and over 150 putative full-length human gene transcripts were assembled upon further database analyses. [The sequence data described in this paper have been submitted to the
Affiliation(s)
- C H Lai
- Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan, Republic of China
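Illustrative note on the abstract above: the counts it quotes can be cross-checked with a few lines of arithmetic. This is only a minimal sketch; the variable names are made up for illustration and do not come from the paper, and the numbers are taken directly from the abstract.

```python
# Counts quoted in the Lai et al. abstract (variable names are hypothetical).
total_proteins = 18_452              # C. elegans protein sequences examined
with_human_homolog = 15_344          # sequences with human homologous genes
matching_known_transcripts = 7_954   # of those, matching known human transcripts

# Fraction of the proteome with a human homolog (the abstract says "at least 83%").
fraction_with_homolog = with_human_homolog / total_proteins

# Homologs that do not match a known human transcript (the abstract says 7,390).
remaining_candidates = with_human_homolog - matching_known_transcripts

print(f"{fraction_with_homolog:.1%}")  # prints 83.2%
print(remaining_candidates)            # prints 7390
```

Both derived values agree with the figures stated in the abstract.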
|
33
|
Abstract
Bioinformatics has, out of necessity, become a key aspect of drug discovery in the genomic revolution, contributing to both target discovery and target validation. The author describes the role that bioinformatics has played and will continue to play in response to the waves of genome-wide data sources that have become available to the industry, including expressed sequence tags, microbial genome sequences, model organism sequences, polymorphisms, gene expression data and proteomics. However, these knowledge sources must be intelligently integrated.
|
34
|
Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W. PipMaker--a web server for aligning two genomic DNA sequences. Genome Res 2000; 10:577-86. [PMID: 10779500 PMCID: PMC310868 DOI: 10.1101/gr.10.4.577] [Citation(s) in RCA: 841] [Impact Index Per Article: 35.0] [Received: 09/29/1999] [Accepted: 02/01/2000] [Indexed: 11/25/2022]
Abstract
PipMaker (http://bio.cse.psu.edu) is a World-Wide Web site for comparing two long DNA sequences to identify conserved segments and for producing informative, high-resolution displays of the resulting alignments. One display is a percent identity plot (pip), which shows both the position in one sequence and the degree of similarity for each aligning segment between the two sequences in a compact and easily understandable form. Positions along the horizontal axis can be labeled with features such as exons of genes and repetitive elements, and colors can be used to clarify and enhance the display. The web site also provides a plot of the locations of those segments in both species (similar to a dot plot). PipMaker is appropriate for comparing genomic sequences from any two related species, although the types of information that can be inferred (e.g., protein-coding regions and cis-regulatory elements) depend on the level of conservation and the time and divergence rate since the separation of the species. Gene regulatory elements are often detectable as similar, noncoding sequences in species that diverged as much as 100-300 million years ago, such as humans and mice, Caenorhabditis elegans and C. briggsae, or Escherichia coli and Salmonella spp. PipMaker supports analysis of unfinished or "working draft" sequences by permitting one of the two sequences to be in unoriented and unordered contigs.
Affiliation(s)
- S Schwartz
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park 16802, USA
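Illustrative note on the abstract above: the data behind a percent identity plot can be sketched as one (position, length, percent identity) triple per aligning segment. The sketch below is not PipMaker's code. It assumes that an "aligning segment" is a gap-free run of a pairwise alignment given as two equal-length strings with '-' for gaps, and the function name gap_free_segments is hypothetical.

```python
from typing import List, Tuple


def gap_free_segments(aligned_a: str, aligned_b: str) -> List[Tuple[int, int, float]]:
    """Return (start_in_a, length, percent_identity) for each gap-free run.

    start_in_a is the 1-based position in the ungapped first sequence,
    i.e. the value that would go on the horizontal axis of the plot.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    segments: List[Tuple[int, int, float]] = []
    pos_a = 0         # residues of sequence A consumed so far
    seg_start = None  # 1-based start in A of the current gap-free run
    seg_len = 0       # columns in the current run
    matches = 0       # identical columns in the current run

    def close_segment() -> None:
        """Record the current run (if any) and reset the counters."""
        nonlocal seg_start, seg_len, matches
        if seg_len:
            segments.append((seg_start, seg_len, 100.0 * matches / seg_len))
        seg_start, seg_len, matches = None, 0, 0

    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            # A gap in either sequence ends the current segment.
            close_segment()
            if a != "-":
                pos_a += 1
            continue
        pos_a += 1
        if seg_start is None:
            seg_start = pos_a
        seg_len += 1
        if a.upper() == b.upper():
            matches += 1

    close_segment()
    return segments
```

For example, gap_free_segments("ACGT-ACGTAC", "ACGAAAC-TAC") returns [(1, 4, 75.0), (5, 2, 100.0), (8, 3, 100.0)]. Plotting each triple as a horizontal bar at its percent identity would give a rough, unannotated analogue of the display described in the abstract.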
|
35
|
Dias Neto E, Correa RG, Verjovski-Almeida S, Briones MR, Nagai MA, da Silva W, Zago MA, Bordin S, Costa FF, Goldman GH, Carvalho AF, Matsukuma A, Baia GS, Simpson DH, Brunstein A, de Oliveira PS, Bucher P, Jongeneel CV, O'Hare MJ, Soares F, Brentani RR, Reis LF, de Souza SJ, Simpson AJ. Shotgun sequencing of the human transcriptome with ORF expressed sequence tags. Proc Natl Acad Sci U S A 2000; 97:3491-6. [PMID: 10737800 PMCID: PMC16267 DOI: 10.1073/pnas.97.7.3491] [Citation(s) in RCA: 144] [Impact Index Per Article: 6.0] [Indexed: 11/18/2022] Open
Abstract
Theoretical considerations predict that amplification of expressed gene transcripts by reverse transcription-PCR using arbitrarily chosen primers will result in the preferential amplification of the central portion of the transcript. Systematic, high-throughput sequencing of such products would result in an expressed sequence tag (EST) database consisting of central, generally coding regions of expressed genes. Such a database would add significant value to existing public EST databases, which consist mostly of sequences derived from the extremities of cDNAs, and facilitate the construction of contigs of transcript sequences. We tested our predictions, creating a database of 10,000 sequences from human breast tumors. The data confirmed the central distribution of the sequences, the significant normalization of the sequence population, the frequent extension of contigs composed of existing human ESTs, and the identification of a series of potentially important homologues of known genes. This approach should make a significant contribution to the early identification of important human genes, the deciphering of the draft human genome sequence currently being compiled, and the shotgun sequencing of the human transcriptome.
Affiliation(s)
- E Dias Neto
- Ludwig Institute for Cancer Research, São Paulo 01509-010, Brazil
|
36
|
Michalek W, Künzel G, Graner A. Sequence analysis and gene identification in a set of mapped RFLP markers in barley (Hordeum vulgare). Genome 1999; 42:849-53. [PMID: 10584307 DOI: 10.1139/g99-036] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
The "Igri/Franka" (I/F) map ranks among the most comprehensive genetic linkage maps of barley (Hordeum vulgare), containing a large number of markers derived from cDNA and genomic PstI clones. Forty-three cDNA clones and 259 genomic clones were at least partially sequenced and compared with the major data bases of protein and nucleic acid sequences. Of the cDNA clones, 53% show significant similarity to known sequences in protein data bases. A comparison of sequences from genomic clones to nucleic acid sequence data bases revealed similarities for 9% of the clones. For cDNA sequences analyzed the same way, significant similarities were observed for 35% of the clones. These results show that genomic PstI clones, although containing genes at a significant frequency, represent an inappropriate source for an efficient, systematic gene identification in barley. Sequence information obtained in the context of the present study provides a resource for the conversion of these markers into sequence-tagged site (STS) markers and their use in PCR assays.
Affiliation(s)
- W Michalek
- Institute for Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany
|
37
|
Loftus BJ, Kim UJ, Sneddon VP, Kalush F, Brandon R, Fuhrmann J, Mason T, Crosby ML, Barnstead M, Cronin L, Deslattes Mays A, Cao Y, Xu RX, Kang HL, Mitchell S, Eichler EE, Harris PC, Venter JC, Adams MD. Genome duplications and other features in 12 Mb of DNA sequence from human chromosome 16p and 16q. Genomics 1999; 60:295-308. [PMID: 10493829 DOI: 10.1006/geno.1999.5927] [Citation(s) in RCA: 105] [Impact Index Per Article: 4.2] [Indexed: 11/22/2022]
Abstract
Several publicly funded large-scale sequencing efforts have been initiated with the goal of completing the first reference human genome sequence by the year 2005. Here we present the results of analysis of 11.8 Mb of genomic sequence from chromosome 16. The apparent gene density varies throughout the region, but the number of genes predicted (84) suggests that this is a gene-poor region. This result may also suggest that the total number of human genes is likely to be at the lower end of published estimates. One of the most interesting aspects of this region of the genome is the presence of highly homologous, recently duplicated tracts of sequence distributed throughout the p-arm. Such duplications have implications for mapping and gene analysis as well as the predisposition to recurrent chromosomal structural rearrangements associated with genetic disease.
Affiliation(s)
- B J Loftus
- The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA
|
38
|
Marra M, Hillier L, Kucaba T, Allen M, Barstead R, Beck C, Blistain A, Bonaldo M, Bowers Y, Bowles L, Cardenas M, Chamberlain A, Chappell J, Clifton S, Favello A, Geisel S, Gibbons M, Harvey N, Hill F, Jackson Y, Kohn S, Lennon G, Mardis E, Martin J, Mila L, McCann R, Morales R, Pape D, Person B, Prange C, Ritter E, Soares M, Schurk R, Shin T, Steptoe M, Swaller T, Theising B, Underwood K, Wylie T, Yount T, Wilson R, Waterston R. An encyclopedia of mouse genes. Nat Genet 1999; 21:191-4. [PMID: 9988271 DOI: 10.1038/5976] [Citation(s) in RCA: 91] [Impact Index Per Article: 3.6] [Indexed: 11/09/2022]
Abstract
The laboratory mouse is the premier model system for studies of mammalian development due to the powerful classical genetic analysis possible (see also the Jackson Laboratory web site, http://www.jax.org/) and the ever-expanding collection of molecular tools. To enhance the utility of the mouse system, we initiated a program to generate a large database of expressed sequence tags (ESTs) that can provide rapid access to genes. Of particular significance was the possibility that cDNA libraries could be prepared from very early stages of development, a situation unrealized in human EST projects. We report here the development of a comprehensive database of ESTs for the mouse. The project, initiated in March 1996, has focused on 5' end sequences from directionally cloned, oligo-dT primed cDNA libraries. As of 23 October 1998, 352,040 sequences had been generated, annotated and deposited in dbEST, where they comprised 93% of the total ESTs available for mouse. EST data are versatile and have been applied to gene identification, comparative sequence analysis, comparative gene mapping and candidate disease gene identification, genome sequence annotation, microarray development and the development of gene-based map resources.
Affiliation(s)
- M Marra
- Washington University Genome Sequencing Center, St. Louis, Missouri 63108, USA.
|
39
|
Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 1998; 8:967-74. [PMID: 9750195 PMCID: PMC310774 DOI: 10.1101/gr.8.9.967] [Citation(s) in RCA: 559] [Impact Index Per Article: 21.5] [Indexed: 11/24/2022]
Abstract
We address the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors. A freely available computer program, described herein, solves the problem for a 100-kb genomic sequence in a few seconds on a workstation.
Affiliation(s)
- L Florea
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802 USA
|
40
|
Bailey LC, Fischer S, Schug J, Crabtree J, Gibson M, Overton GC. GAIA: framework annotation of genomic sequence. Genome Res 1998; 8:234-50. [PMID: 9521927 DOI: 10.1101/gr.8.3.234] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.2] [Indexed: 02/06/2023]
Abstract
As increasing amounts of genomic sequence from many organisms become available, and as DNA sequences become a primary reagent in biologic investigations, the role of annotation as a prospective guide for laboratory experiments will expand rapidly. Here we describe a process of high-throughput, reliable annotation, called framework annotation, which is designed to provide a foundation for initial biologic characterization of previously unexamined sequence. To examine this concept in practice, we have constructed Genome Annotation and Information Analysis (GAIA), a prototype software architecture that implements several elements important for framework annotation. The center of GAIA consists of an annotation database and the associated data management subsystem that forms the software bus along which other components communicate. The schema for this database defines three principal concepts: (1) Entries, consisting of sequence and associated historical data; (2) Features, comprising information of biologic interest; and (3) Experiments, describing the evidence that supports Features. The database permits tracking of annotation results over time, as well as assessment of the reliability of particular results. New framework annotation is produced by CARTA, a set of autonomous sensors that perform automatic analyses and assert results into the annotation database. These results are available via a Web-based query interface that uses graphical Java applets as well as text-based HTML pages to display data at different levels of resolution and permit interactive exploration of annotation. We present results for initial application of framework annotation to a set of test sequences, demonstrating its effectiveness in providing a starting point for biologic investigation, and discuss ways in which the current prototype can be improved. The prototype is available for public use and comment at http://www.cbil.upenn.edu/gaia.
Affiliation(s)
- L C Bailey
- Computational Biology and Informatics Laboratory, Department of Genetics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104-6021, USA.
|