1
Lim LP, Burge CB. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci U S A 2001; 98:11193-8. [PMID: 11572975 PMCID: PMC58706 DOI: 10.1073/pnas.201407298] [Citation(s) in RCA: 256] [Impact Index Per Article: 11.1] [Received: 06/26/2001] [Accepted: 08/02/2001] [Indexed: 11/18/2022]
Abstract
Splicing of short introns by the nuclear pre-mRNA splicing machinery is thought to proceed via an "intron definition" mechanism, in which the 5' and 3' splice sites (5'ss, 3'ss, respectively) are initially recognized and paired across the intron. Here, we describe a computational analysis of sequence features involved in recognition of short introns by using available transcript data from five eukaryotes with complete or nearly complete genomic sequences. The information content of five different transcript features was measured by using methods from information theory, and Monte Carlo simulations were used to determine the amount of information required for accurate recognition of short introns in each organism. We conclude: (i) that short introns in Drosophila melanogaster and Caenorhabditis elegans contain essentially all of the information for their recognition by the splicing machinery, and computer programs that simulate splicing specificity can predict the exact boundaries of approximately 95% of short introns in both organisms; (ii) that in yeast, the 5'ss, branch signal, and 3'ss can accurately identify intron locations but do not precisely determine the location of 3' cleavage in every intron; and (iii) that the 5'ss, branch signal, and 3'ss are not sufficient to accurately identify short introns in plant and human transcripts, but that specific subsets of candidate intronic enhancer motifs can be identified in both human and Arabidopsis that contribute dramatically to the accuracy of splicing simulators.
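As a rough illustration of what "information content" of a transcript feature can mean, the sketch below sums, over the positions of a set of aligned sites, the relative entropy (in bits) between the observed base frequencies and a background distribution. The abstract does not state which measure the authors used, so the uniform background, the function name `information_content`, and the toy sequences are all assumptions made for illustration only.

```python
import math
from collections import Counter

def information_content(sequences, background=None):
    """Sum over positions of the relative entropy (in bits) between the
    observed base frequencies and a background distribution."""
    if background is None:
        # Assumption: uniform background over the four bases.
        background = {base: 0.25 for base in "ACGT"}
    length = len(sequences[0])
    total = 0.0
    for pos in range(length):
        # Count the bases observed at this aligned position.
        counts = Counter(seq[pos] for seq in sequences)
        n = sum(counts.values())
        for base, count in counts.items():
            p = count / n
            total += p * math.log2(p / background[base])
    return total

# Hypothetical aligned 5' splice-site fragments, invented for illustration.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTATGT"]
bits = information_content(toy_sites)
print(f"Toy 5'ss information content: {bits:.2f} bits")
```

Under this measure a fully conserved position contributes 2 bits and a position matching the background contributes none, so signals that are more conserved carry more information for locating an intron.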
Affiliation(s)
- L P Lim
- Department of Biology and Center for Cancer Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
2
Abstract
Variation in the estimates of the number of genes encoded by the human genome (28,000-120,000) attests to the difficulty of systematically identifying human genes. Sequencing of human chromosome 22 (Chr22) provided the first comprehensive, unbiased view of an entire human chromosome, and intensive analysis of this sequence identified 545 genes and 134 pseudogenes that had similarity or identity to known proteins and/or ESTs and which were listed in the gene annotation (http://www.sanger.ac.uk/HGP/Chr22). This analysis yielded an estimate of approximately 36,000 functional expressed genes in the human genome (and 9000 pseudogenes). However, a key uncertainty in this estimate was that hundreds of additional genes beyond those annotated in the Chr22 sequence are predicted by the gene prediction program Genscan, an unknown number of which might represent additional expressed genes. To determine what fraction of these "predicted novel genes" (PNGs) represents expressed human genes, we used a sensitive RT-PCR assay to detect predicted transcripts in 17 tissues and one cell line. Our results indicate that at least 5000-9000 additional human genes which lack similarity to known genes or proteins exist in the human genome, increasing baseline gene estimates to approximately 41,000-45,000.
Affiliation(s)
- M Das
- Department of Biochemistry, McGill University, Rm 810, 3655 Drummond St., Montreal, Quebec, H3G 1Y6, Canada
3
Abstract
With the human genome sequence approaching completion, a major challenge is to identify the locations and encoded protein sequences of all human genes. To address this problem we have developed a new gene identification algorithm, GenomeScan, which combines exon-intron and splice signal models with similarity to known protein sequences in an integrated model. Extensive testing shows that GenomeScan can accurately identify the exon-intron structures of genes in finished or draft human genome sequence with a low rate of false-positives. Application of GenomeScan to 2.7 billion bases of human genomic DNA identified at least 20,000-25,000 human genes out of an estimated 30,000-40,000 present in the genome. The results show an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.
Affiliation(s)
- R F Yeh
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
5
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, 
Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J. Initial sequencing and analysis of the human genome. Nature 2001; 409:860-921. [PMID: 11237011 DOI: 10.1038/35057062] [Citation(s) in RCA: 14509] [Impact Index Per Article: 630.8] [Indexed: 12/11/2022]
Abstract
The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
Affiliation(s)
- E S Lander
- Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, MA 02142, USA.
7
Burge PS, Pantin CF, Newton DT, Gannon PF, Bright P, Belcher J, McCoach J, Baldwin DR, Burge CB. Development of an expert system for the interpretation of serial peak expiratory flow measurements in the diagnosis of occupational asthma. Midlands Thoracic Society Research Group. Occup Environ Med 1999; 56:758-64. [PMID: 10658562 PMCID: PMC1757688 DOI: 10.1136/oem.56.11.758] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.6] [Indexed: 11/03/2022]
Abstract
If asthma is due to work exposures there must be a relation between these exposures and the asthma. Asthma causes airway hyperresponsiveness and obstruction; the obstruction can be measured with portable meters, which usually measure peak expiratory flow, or sometimes forced expiratory volume in 1 second (FEV1). These can be measured serially (for instance 2 hourly) over several weeks at and away from work. Once occupational asthma develops, the asthma will be induced by many non-specific triggers common to non-occupational asthma. The challenge is to identify changes in peak expiratory flow due to work among other non-occupational causes. Standard statistical tests have been found to be insensitive or non-specific, principally because of the variable period for deterioration to occur after exposure, and the sometimes prolonged time for recovery to occur, such that days away from work may initially have lower measurements than days at work. A computer assisted diagnostic aid (Oasys) has been developed to separate occupational from non-occupational causes of airflow obstruction. Oasys-2 is based on a discriminant analysis, and achieved a sensitivity of 75% and a specificity of at least 94%; therefore peak expiratory flow monitoring combined with Oasys-2 analysis is better to confirm than to exclude occupational asthma. A neural network version in development has improved on this. Both have been based on expert interpretation of peak flow measurements plotted as daily maximum, mean, and minimum, with the first reading at work taken as the first reading of the day. Oasys has been evaluated with independent criteria against measurements made in a wide range of occupational situations. Oasys is sufficiently developed to be the initial method for the confirmation, although less so for exclusion of occupational asthma.
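One way to see why a test with 75% sensitivity and at least 94% specificity is better at confirming than at excluding a diagnosis is to convert the two figures into likelihood ratios. The sketch below does this with the standard definitions; the function name `likelihood_ratios` is hypothetical, and treating the specificity as exactly 0.94 is an assumption, since the abstract reports it only as a lower bound.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) using the standard definitions."""
    # LR+: how much more likely a positive result is in disease than without it.
    lr_pos = sensitivity / (1.0 - specificity)
    # LR-: how much more likely a negative result is in disease than without it.
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

# Figures quoted for Oasys-2; specificity is taken at its stated lower bound.
lr_pos, lr_neg = likelihood_ratios(0.75, 0.94)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

With these inputs LR+ is about 12.5 while LR- is only about 0.27, so a positive result shifts the odds towards occupational asthma much more strongly than a negative result shifts them away.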
Affiliation(s)
- P S Burge
- Occupational Lung Disease Unit, Birmingham Heartlands Hospital, UK