151
|
Harrison PM, Echols N, Gerstein MB. Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res 2001; 29:818-30. [PMID: 11160906 PMCID: PMC30377 DOI: 10.1093/nar/29.3.818] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022]
Abstract
Pseudogenes are non-functioning copies of genes in genomic DNA, which may either result from reverse transcription from an mRNA transcript (processed pseudogenes) or from gene duplication and subsequent disablement (non-processed pseudogenes). As pseudogenes are apparently 'dead', they usually have a variety of obvious disablements (e.g., insertions, deletions, frameshifts and truncations) relative to their functioning homologs. We have derived an initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome, performing a survey in 'molecular archaeology'. Corresponding to the 18 576 annotated proteins in the worm (i.e., in Wormpep18), we have found an estimated total of 2168 pseudogenes, about one for every eight genes. Few of these appear to be processed. Details of our pseudogene assignments are available from http://bioinfo.mbb.yale.edu/genome/worm/pseudogene. The population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes: whereas the most common fold found in the worm proteome is the immunoglobulin fold, the most common 'pseudofold' is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large populations of pseudogenes. For example, one family of seven-transmembrane receptors (represented by gene B0334.7) has one pseudogene for every four genes, and another uncharacterized family (represented by gene B0403.1) is approximately two-thirds pseudogenic. Furthermore, over a hundred apparent pseudogenic fragments do not have any obvious homologs in the worm.
Affiliation(s)
- P M Harrison
- Department of Molecular Biophysics and Biochemistry, Yale University, 260 Whitney Avenue, PO Box 208114, New Haven, CT 06511-8114, USA
|
152
|
Hadano S, Yanagisawa Y, Skaug J, Fichter K, Nasir J, Martindale D, Koop BF, Scherer SW, Nicholson DW, Rouleau GA, Ikeda J, Hayden MR. Cloning and characterization of three novel genes, ALS2CR1, ALS2CR2, and ALS2CR3, in the juvenile amyotrophic lateral sclerosis (ALS2) critical region at chromosome 2q33-q34: candidate genes for ALS2. Genomics 2001; 71:200-13. [PMID: 11161814 DOI: 10.1006/geno.2000.6392] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.8] [Indexed: 11/22/2022]
Abstract
Amyotrophic lateral sclerosis is a progressive neurodegenerative disease that manifests as selective upper and lower motor neuron degeneration. The autosomal recessive form of juvenile amyotrophic lateral sclerosis (ALS2) has previously been mapped to the 1.7-cM interval flanked by D2S116 and D2S2237 on human chromosome 2q33-q34. We identified three novel full-length transcripts encoded by three distinct genes (HGMW-approved symbols ALS2CR1, ALS2CR2, and ALS2CR3) within the ALS2 critical region. The intron-exon organizations of these genes as well as those of CFLAR, CASP10, and CASP8, which were previously mapped to this region, were defined. These genes were evaluated for mutations in ALS2 patients, and no disease-associated sequence alterations in either exons or intron-exon boundaries were observed. Sequence analysis of overlapping RT-PCR products covering the whole coding sequence for each transcript revealed no aberrant mRNA sequences. These data strongly indicate that ALS2CR1, ALS2CR2, ALS2CR3, CFLAR, CASP10, and CASP8 are not causative genes for ALS2.
Affiliation(s)
- S Hadano
- NeuroGenes, International Cooperative Research Project, Japan Science and Technology Corporation, Isehara, 259-1193, Japan
|
153
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447185 DOI: 10.1002/cfg.55] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
|
154
|
Mann M, Pandey A. Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases. Trends Biochem Sci 2001; 26:54-61. [PMID: 11165518 DOI: 10.1016/s0968-0004(00)01726-6] [Citation(s) in RCA: 94] [Impact Index Per Article: 4.1] [Indexed: 11/25/2022]
Abstract
Mass spectrometry-based proteomic methodologies can be used to annotate both nucleotide and protein sequence databases. Because such data have to be derived from proteins, they can be used to identify coding regions of the genome as well as provide the complete primary sequence of proteins and their expression patterns and post-translational modifications.
Affiliation(s)
- M Mann
- Protein Interaction Laboratory (PIL), Center for Experimental Bioinformatics, University of Southern Denmark, Campusvej 55, DK-5230, and MDS-Protana, Staermosegaardsvej 6, DK-5230, Odense M, Denmark.
|
155
|
Affiliation(s)
- J Godovac-Zimmermann
- Center for Molecular Medicine, Department of Medicine, University College London, 5 University Street, London WC1E 6JJ, United Kingdom.
|
156
|
Affiliation(s)
- A J Walhout
- Dana-Farber Cancer Institute and Department of Genetics, Harvard Medical School, 44 Binney Street, Boston, Massachusetts 02115, USA
|
157
|
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J. The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res 2001; 29:159-64. [PMID: 11125077 PMCID: PMC29813 DOI: 10.1093/nar/29.1.159] [Citation(s) in RCA: 318] [Impact Index Per Article: 13.8] [Indexed: 11/14/2022]
Abstract
While genome sequencing projects are advancing rapidly, EST sequencing and analysis remains a primary research tool for the identification and categorization of gene sequences in a wide variety of species and an important resource for annotation of genomic sequence. The TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml) are a collection of species-specific databases that use a highly refined protocol to analyze EST sequences in an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed by first clustering, then assembling EST and annotated gene sequences from GenBank for the targeted species. This process produces a set of unique, high-fidelity virtual transcripts, or Tentative Consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, to provide links between orthologous and paralogous genes and as a resource for comparative sequence analysis.
Affiliation(s)
- J Quackenbush
- The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
|
158
|
Kaplan JC, Junien C. Genomics and medicine: an anticipation. From Boolean Mendelian genetics to multifactorial molecular medicine. Comptes Rendus de l'Académie des Sciences. Série III, Sciences de la Vie 2000; 323:1167-74. [PMID: 11147103 DOI: 10.1016/s0764-4469(00)01252-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/18/2022]
Abstract
The major impact of the completion of the human genome sequence will be the understanding of diseases, with deduced therapy. In the field of genetic disorders, we will complete the catalogue of monogenic diseases, also called Mendelian diseases because they obey the Boolean logic of Mendel's laws. The major challenge now is to decipher the polygenic and multifactorial etiology of common diseases, such as cancer, cardiovascular, nutritional, allergic, auto-immune and degenerative diseases. In fact, every gene, when mutated, is a potential disease gene, and we end up with the new concept of 'reverse medicine'; i.e., deriving new diseases or pathogenic pathways from the knowledge of the structure and function of every gene. By going from sequence to function (functional genomics and proteomics) we will gain insight into basic mechanisms of major functions such as cell proliferation, differentiation and development, which are perturbed in many pathological processes. By learning the meaning of some non-coding and of regulatory sequences our understanding will gain in complexity, generating a molecular and supramolecular integrated physiology, helping to build a molecular patho-physiology of the different syndromes. Besides those cognitive advances, there are also other issues at stake, such as: progress in diagnosis and prediction (predictive medicine); progress in therapy (pharmacogenomics and gene-based therapy); ethical issues; impact on business.
Affiliation(s)
- J C Kaplan
- Inserm UR129, CHU Cochin, université Paris-V, 24, rue du Faubourg-Saint-Jacques, 75014 Paris, France.
|
159
|
de Souza SJ, Camargo AA, Briones MR, Costa FF, Nagai MA, Verjovski-Almeida S, Zago MA, Andrade LE, Carrer H, El-Dorry HF, Espreafico EM, Habr-Gama A, Giannella-Neto D, Goldman GH, Gruber A, Hackel C, Kimura ET, Maciel RM, Marie SK, Martins EA, Nobrega MP, Paco-Larson ML, Pardini MI, Pereira GG, Pesquero JB, Rodrigues V, Rogatto SR, da Silva ID, Sogayar MC, de Fátima Sonati M, Tajara EH, Valentini SR, Acencio M, Alberto FL, Amaral ME, Aneas I, Bengtson MH, Carraro DM, Carvalho AF, Carvalho LH, Cerutti JM, Corrêa ML, Costa MC, Curcio C, Gushiken T, Ho PL, Kimura E, Leite LC, Maia G, Majumder P, Marins M, Matsukuma A, Melo AS, Mestriner CA, Miracca EC, Miranda DC, Nascimento AN, Nóbrega FG, Ojopi EP, Pandolfi JR, Pessoa LG, Rahal P, Rainho CA, da Rós N, de Sá RG, Sales MM, da Silva NP, Silva TC, da Silva W, Simão DF, Sousa JF, Stecconi D, Tsukumo F, Valente V, Zalcbeg H, Brentani RR, Reis FL, Dias-Neto E, Simpson AJ. Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags. Proc Natl Acad Sci U S A 2000; 97:12690-3. [PMID: 11070084 PMCID: PMC18825 DOI: 10.1073/pnas.97.23.12690] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.3] [Indexed: 11/18/2022]
Abstract
Transcribed sequences in the human genome can be identified with confidence only by alignment with sequences derived from cDNAs synthesized from naturally occurring mRNAs. We constructed a set of 250,000 cDNAs that represent partial expressed gene sequences and that are biased toward the central coding regions of the resulting transcripts. They are termed ORF expressed sequence tags (ORESTES). The 250,000 ORESTES were assembled into 81,429 contigs. Of these, 1,181 (1.45%) were found to match sequences in chromosome 22 with at least one ORESTES contig for 162 (65.6%) of the 247 known genes, for 67 (44.6%) of the 150 related genes, and for 45 of the 148 (30.4%) EST-predicted genes on this chromosome. Using a set of stringent criteria to validate our sequences, we identified a further 219 previously unannotated transcribed sequences on chromosome 22. Of these, 171 were in fact also defined by EST or full length cDNA sequences available in GenBank but not utilized in the initial annotation of the first human chromosome sequence. Thus despite representing less than 15% of all expressed human sequences in the public databases at the time of the present analysis, ORESTES sequences defined 48 transcribed sequences on chromosome 22 not defined by other sequences. All of the transcribed sequences defined by ORESTES coincided with DNA regions predicted as encoding exons by genscan (http://genes.mit.edu/GENSCAN.html).
Affiliation(s)
- S J de Souza
- Ludwig Institute for Cancer Research, São Paulo 01509-010, SP, Brazil
|
160
|
Ruvinsky I, Silver LM, Gibson-Brown JJ. Phylogenetic analysis of T-Box genes demonstrates the importance of amphioxus for understanding evolution of the vertebrate genome. Genetics 2000; 156:1249-57. [PMID: 11063699 PMCID: PMC1461312 DOI: 10.1093/genetics/156.3.1249] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.2] [Indexed: 11/12/2022]
Abstract
The duplication of preexisting genes has played a major role in evolution. To understand the evolution of genetic complexity it is important to reconstruct the phylogenetic history of the genome. A widely held view suggests that the vertebrate genome evolved via two successive rounds of whole-genome duplication. To test this model we have isolated seven new T-box genes from the primitive chordate amphioxus. We find that each amphioxus gene generally corresponds to two or three vertebrate counterparts. A phylogenetic analysis of these genes supports the idea that a single whole-genome duplication took place early in vertebrate evolution, but cannot exclude the possibility that a second duplication later took place. The origin of additional paralogs evident in this and other gene families could be the result of subsequent, smaller-scale chromosomal duplications. Our findings highlight the importance of amphioxus as a key organism for understanding evolution of the vertebrate genome.
Affiliation(s)
- I Ruvinsky
- Lewis Thomas Laboratory, Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544, USA
|
161
|
Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J. An optimized protocol for analysis of EST sequences. Nucleic Acids Res 2000; 28:3657-65. [PMID: 10982889 PMCID: PMC110731 DOI: 10.1093/nar/28.18.3657] [Citation(s) in RCA: 99] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022]
Abstract
The vast body of Expressed Sequence Tag (EST) data in the public databases provides an important resource for comparative and functional genomics studies and an invaluable tool for the annotation of genomic sequences. We have developed a rigorous protocol for reconstructing the sequences of transcribed genes from EST and gene sequence fragments. A key element in developing this protocol has been the evaluation of a number of sequence assembly programs to determine which most faithfully reproduce transcript sequences from EST data. The TIGR Gene Indices constructed using this protocol for human, mouse, rat and a variety of other plant and animal models have demonstrated their utility in a variety of applications and are freely available to the scientific research community.
Affiliation(s)
- F Liang
- The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
|
162
|
Affiliation(s)
- J M Claverie
- Structural and Genetic Information Laboratory, CNRS-AVENTIS UMR 1889, Marseille cedex 20, France.
|
163
|
Hu RM, Han ZG, Song HD, Peng YD, Huang QH, Ren SX, Gu YJ, Huang CH, Li YB, Jiang CL, Fu G, Zhang QH, Gu BW, Dai M, Mao YF, Gao GF, Rong R, Ye M, Zhou J, Xu SH, Gu J, Shi JX, Jin WR, Zhang CK, Wu TM, Huang GY, Chen Z, Chen MD, Chen JL. Gene expression profiling in the human hypothalamus-pituitary-adrenal axis and full-length cDNA cloning. Proc Natl Acad Sci U S A 2000; 97:9543-8. [PMID: 10931946 PMCID: PMC16901 DOI: 10.1073/pnas.160270997] [Citation(s) in RCA: 78] [Impact Index Per Article: 3.3] [Indexed: 11/18/2022]
Abstract
The primary neuroendocrine interface, hypothalamus and pituitary, together with adrenals, constitute the major axis responsible for the maintenance of homeostasis and the response to the perturbations in the environment. The gene expression profiling in the human hypothalamus-pituitary-adrenal axis was catalogued by generating a large amount of expressed sequence tags (ESTs), followed by bioinformatics analysis (http://www.chgc.sh.cn/ database). In total, 25,973 sequences of good quality were obtained from 31,130 clones (83.4%) from cDNA libraries of the hypothalamus, pituitary, and adrenal glands. After eliminating 5,347 sequences corresponding to repetitive elements and mtDNA, 20,626 ESTs could be assembled into 9,175 clusters (3,979, 3,074, and 4,116 clusters in hypothalamus, pituitary, and adrenal glands, respectively) when overlapping ESTs were integrated. Of these clusters, 2,777 (30.3%) corresponded to known genes, 4,165 (44.8%) to dbESTs, and 2,233 (24.3%) to novel ESTs. The gene expression profiles reflected well the functional characteristics of the three levels in the hypothalamus-pituitary-adrenal axis, because most of the 20 genes with highest expression showed statistical difference in terms of tissue distribution, including a group of tissue-specific functional markers. Meanwhile, some findings were made with regard to the physiology of the axis, and 200 full-length cDNAs of novel genes were cloned and sequenced. All of these data may contribute to the understanding of the neuroendocrine regulation of human life.
Affiliation(s)
- R M Hu
- Rui-Jin Hospital, Shanghai Institute of Endocrinology, Shanghai Second Medical University, China
|
164
|
|
165
|
|
166
|
Abstract
A recent flurry of publications and media attention has revived interest in the question of how many genes exist in the human genome. Here, I review the estimates and use genomic sequence data from human chromosomes 21 and 22 to establish my own prediction.
Affiliation(s)
- I Dunham
- The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
|