1
Yang J, Liu Z, Liu H, He J, Yang J, Lin P, Wang Q, Du J, Ma W, Yin Z, Davis E, Orlowski RZ, Hou J, Yi Q. C-reactive protein promotes bone destruction in human myeloma through the CD32-p38 MAPK-Twist axis. Sci Signal 2017; 10:eaan6282. [PMID: 29233917 DOI: 10.1126/scisignal.aan6282] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 01/04/2023]
Abstract
Bone destruction is a hallmark of myeloma and affects 80% of patients. Myeloma cells promote bone destruction by activating osteoclasts. In investigating the underlying mechanism, we found that C-reactive protein (CRP), a protein secreted in increased amounts by hepatocytes in response to myeloma-derived cytokines, activated myeloma cells to promote osteoclastogenesis and bone destruction in vivo. In mice bearing human bone grafts and injected with multiple myeloma cells, CRP bound to surface CD32 (also known as FcγRII) on myeloma cells, which activated a pathway mediated by the kinase p38 MAPK and the transcription factor Twist that enhanced the cells' secretion of osteolytic cytokines. Furthermore, analysis of clinical samples from newly diagnosed myeloma patients revealed a positive correlation between the amount of serum CRP and the number of osteolytic bone lesions. These findings establish a mechanism by which myeloma cells are activated to promote bone destruction and suggest that CRP may be targeted to prevent or treat myeloma-associated bone disease in patients.
Affiliation(s)
- Jing Yang
- Guangzhou Key Laboratory of Translational Medicine on Malignant Tumor Treatment, Affiliated Cancer Hospital and Institute of Guangzhou Medical University, Guangzhou 510095, China
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Zhiqiang Liu
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Department of Physiology and Pathology, Tianjin Medical University, Tianjin 300070, China
- Huan Liu
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Jin He
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Jianling Yang
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Pei Lin
- Department of Hematopathology, Division of Pathology and Laboratory Medicine, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Qiang Wang
- Department of Cancer Biology, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Juan Du
- Department of Hematology, The Myeloma and Lymphoma Center, Changzheng Hospital, The Second Military Medical University, Shanghai 200085, China
- Wencai Ma
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Zheng Yin
- Department of Systems Medicine and Bioengineering, Houston Methodist Research Institute, Houston, TX 77030, USA
- Eric Davis
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Robert Z Orlowski
- Department of Lymphoma/Myeloma, Division of Cancer Medicine, and the Center for Cancer Immunology Research, University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Jian Hou
- Department of Hematology, The Myeloma and Lymphoma Center, Changzheng Hospital, The Second Military Medical University, Shanghai 200085, China
- Qing Yi
- Department of Cancer Biology, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
2
Huang WL, Tung CW, Liaw C, Huang HL, Ho SY. Rule-based knowledge acquisition method for promoter prediction in human and Drosophila species. ScientificWorldJournal 2014; 2014:327306. [PMID: 24955394 PMCID: PMC3927563 DOI: 10.1155/2014/327306] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/31/2013] [Accepted: 10/10/2013] [Indexed: 01/08/2023] Open
Abstract
The rapid and reliable identification of promoter regions is important when the number of genomes to be sequenced is increasing very speedily. Various methods have been developed but few methods investigate the effectiveness of sequence-based features in promoter prediction. This study proposes a knowledge acquisition method (named PromHD) based on if-then rules for promoter prediction in human and Drosophila species. PromHD utilizes an effective feature-mining algorithm and a reference feature set of 167 DNA sequence descriptors (DNASDs), comprising three descriptors of physicochemical properties (absorption maxima, molecular weight, and molar absorption coefficient), 128 top-ranked descriptors of 4-mer motifs, and 36 global sequence descriptors. PromHD identifies two feature subsets with 99 and 74 DNASDs and yields test accuracies of 96.4% and 97.5% in human and Drosophila species, respectively. Based on the 99- and 74-dimensional feature vectors, PromHD generates several if-then rules by using the decision tree mechanism for promoter prediction. The top-ranked informative rules with high certainty grades reveal that the global sequence descriptor, the length of nucleotide A at the first position of the sequence, and two physicochemical properties, absorption maxima and molecular weight, are effective in distinguishing promoters from non-promoters in human and Drosophila species, respectively.
Affiliation(s)
- Wen-Lin Huang
- Department of Management Information System, Asia Pacific Institute of Creativity, Miaoli 351, Taiwan
- Chun-Wei Tung
- School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Chyn Liaw
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Hui-Ling Huang
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
- Shinn-Ying Ho
- Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
3
Daiger SP, Sullivan LS, Bowne SJ, Birch DG, Heckenlively JR, Pierce EA, Weinstock GM. Targeted high-throughput DNA sequencing for gene discovery in retinitis pigmentosa. Adv Exp Med Biol 2010; 664:325-31. [PMID: 20238032 DOI: 10.1007/978-1-4419-1399-9_37] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Indexed: 12/05/2022]
Abstract
The causes of retinitis pigmentosa (RP) are highly heterogeneous, with mutations in more than 60 genes known to cause syndromic and non-syndromic forms of disease. The prevalence of detectable mutations in known genes ranges from 25 to 85%, depending on mode of inheritance. For example, the likelihood of detecting a disease-causing mutation in known genes in patients with autosomal dominant RP (adRP) is 60% in Americans and less in other populations. Thus many RP genes are still unknown or mutations lie outside of commonly tested regions. Furthermore, current screening strategies can be costly and time-consuming. We are developing targeted high-throughput DNA sequencing to address these problems. In this approach, a microarray with oligonucleotides targeted to hundreds of genes is used to capture sheared human DNA, and the sequence of the eluted DNA is determined by ultra-high-throughput sequencing using next-generation DNA sequencing technology. The first capture array we have designed contains 62 full-length retinal disease genes, including introns and promoter regions, and an additional 531 genes limited to exons and flanking sequences. The full-length genes include all genes known to cause at least 1% of RP or other inherited retinal diseases. All of the genes listed in the RetNet database are included on the capture array as well as many additional retinal-expressed genes. After validation studies, the first DNAs tested will be from 89 unrelated adRP families in which the prevalent RP genes have been excluded. This approach should identify new RP genes and will substantially reduce the cost per patient.
Affiliation(s)
- Stephen P Daiger
- Department of Ophthalmology and Visual Science, University of Texas Health Science Center, Houston, TX, USA.
4
Solovyev VV, Shahmuradov IA, Salamov AA. Identification of promoter regions and regulatory sites. Methods Mol Biol 2010; 674:57-83. [PMID: 20827586 DOI: 10.1007/978-1-60761-854-6_5] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.1] [Indexed: 05/27/2023]
Abstract
Promoter sequences are the main regulatory elements of gene expression. Their recognition by computer algorithms is fundamental for understanding gene expression patterns, cell specificity and development. This chapter describes the advanced approaches to identify promoters in animal, plant and bacterial sequences. Also, we discuss an approach to identify statistically significant regulatory motifs in genomic sequences.
5
Anwar F, Baker SM, Jabid T, Mehedi Hasan M, Shoyaib M, Khan H, Walshe R. Pol II promoter prediction using characteristic 4-mer motifs: a machine learning approach. BMC Bioinformatics 2008; 9:414. [PMID: 18834544 PMCID: PMC2575220 DOI: 10.1186/1471-2105-9-414] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 01/07/2008] [Accepted: 10/04/2008] [Indexed: 01/03/2023] Open
Abstract
Background Eukaryotic promoter prediction using computational analysis techniques is one of the most difficult jobs in computational genomics and is essential for constructing and understanding genetic regulatory networks. The increased availability of sequence data for various eukaryotic organisms in recent years has necessitated better tools and techniques for the prediction and analysis of promoters in eukaryotic sequences. Many promoter prediction methods and tools have been developed to date but they have yet to provide acceptable predictive performance. One obvious way to improve on current methods is to devise a better system for selecting appropriate features of promoters that distinguish them from non-promoters. Secondly, improved performance can be achieved by enhancing the predictive ability of the machine learning algorithms used. Results In this paper, a novel approach is presented in which 128 4-mer motifs in conjunction with a non-linear machine-learning algorithm utilising a Support Vector Machine (SVM) are used to distinguish between promoter and non-promoter DNA sequences. By applying this approach to plant, Drosophila, human, mouse and rat sequences, the classification model showed 7-fold cross-validation percentage accuracies of 83.81%, 94.82%, 91.25%, 90.77% and 82.35% respectively. The high sensitivity and specificity values of 0.86 and 0.90 for plant; 0.96 and 0.92 for Drosophila; 0.88 and 0.92 for human; 0.78 and 0.84 for mouse and 0.82 and 0.80 for rat demonstrate that this technique is less prone to false positive results and exhibits better performance than many other tools. Moreover, this model successfully identifies the location of the promoter using a TATA weight matrix. Conclusion The high sensitivity and specificity indicate that 4-mer frequencies in conjunction with supervised machine-learning methods can be beneficial in the identification of RNA pol II promoters compared with other methods. This approach can be extended to identify promoters in sequences for other eukaryotic genomes.
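As a rough illustration of the kind of 4-mer frequency feature this abstract refers to, the following minimal Python sketch turns a DNA sequence into a vector of relative 4-mer frequencies. It is an assumption-laden sketch, not the authors' pipeline: the function name `kmer_frequencies` is hypothetical, all 256 possible 4-mers are kept (the abstract's selection of 128 motifs is not reproduced), and no SVM is trained.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return relative frequencies of every possible k-mer over ACGT.

    Hypothetical helper for illustration only. Overlapping windows are
    counted; windows containing symbols other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        # Sequence too short or made only of ambiguous symbols.
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Toy sequence, not taken from any dataset in the paper.
features = kmer_frequencies("ACGTACGTAC")
print(len(features))
```

A vector like this could then be passed to any standard classifier; which 4-mers to keep and how to train the model are choices the abstract does not spell out.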
Affiliation(s)
- Firoz Anwar
- Department of Computer Science and Engineering, East West University, Bangladesh.
6
Yang JY, Zhou Y, Yu ZG, Anh V, Zhou LQ. Human Pol II promoter recognition based on primary sequences and free energy of dinucleotides. BMC Bioinformatics 2008; 9:113. [PMID: 18294399 PMCID: PMC2292139 DOI: 10.1186/1471-2105-9-113] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 08/13/2007] [Accepted: 02/24/2008] [Indexed: 01/29/2023] Open
Abstract
Background Promoter region plays an important role in determining where the transcription of a particular gene should be initiated. Computational prediction of eukaryotic Pol II promoter sequences is one of the most significant problems in sequence analysis. Existing promoter prediction methods are still far from being satisfactory. Results We attempt to recognize the human Pol II promoter sequences from the non-promoter sequences which are made up of exon and intron sequences. Four methods are used: two kinds of multifractal analysis performed on the numeric sequences obtained from the dinucleotide free energy, Z curve analysis and global descriptor of the promoter/non-promoter primary sequences. A total of 141 parameters are extracted from these methods and categorized into seven groups (methods). They are used to generate certain spaces and then each promoter/non-promoter sequence is represented by a point in the corresponding space. All the 120 possible combinations of the seven methods are tested. Based on Fisher's linear discriminant algorithm, with a relatively smaller number of parameters (96 and 117), we get satisfactory discriminant accuracies. Particularly, in the case of 117 parameters, the accuracies for the training and test sets reach 90.43% and 89.79%, respectively. A comparison with five other existing methods indicates that our methods have a better performance. Using the global descriptor method (36 parameters), 17 of the 18 experimentally verified promoter sequences of human chromosome 22 are correctly identified. Conclusion The high accuracies achieved suggest that the methods of this paper are useful for understanding the difficult problem of promoter prediction.
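Z curve analysis is one of the four methods named in this abstract. The sketch below computes the standard cumulative Z-curve coordinates of a sequence, where x separates purines from pyrimidines, y separates amino from keto bases, and z separates weak from strong hydrogen-bonding bases. The function name `z_curve` is hypothetical, and the abstract does not say which parameters the authors derive from the curve, so this shows only the generic transform.

```python
# Step applied to (x, y, z) for each base:
#   x_n = (A_n + G_n) - (C_n + T_n)   purine vs pyrimidine
#   y_n = (A_n + C_n) - (G_n + T_n)   amino vs keto
#   z_n = (A_n + T_n) - (C_n + G_n)   weak vs strong H-bonding
Z_STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq):
    """Return the list of cumulative (x, y, z) points, one per base.

    Hypothetical helper for illustration. Symbols other than A, C, G, T
    leave the current point unchanged.
    """
    x = y = z = 0
    points = []
    for base in seq.upper():
        dx, dy, dz = Z_STEPS.get(base, (0, 0, 0))
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


print(z_curve("ACGT"))  # [(1, 1, 1), (0, 2, 0), (1, 1, -1), (0, 0, 0)]
```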
Affiliation(s)
- Jian-Yi Yang
- School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China.
7
Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y. Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008; 18:310-23. [PMID: 18096745 PMCID: PMC2203629 DOI: 10.1101/gr.6991408] [Citation(s) in RCA: 133] [Impact Index Per Article: 7.8] [Received: 08/03/2007] [Accepted: 11/14/2007] [Indexed: 11/24/2022]
Abstract
Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts of high-quality training data and often behave like black box models that output predictions that are difficult to interpret. Here, we present a novel approach for predicting promoters in whole-genome sequences by using large-scale structural properties of DNA. Our technique requires no training, is applicable to many eukaryotic genomes, and performs extremely well in comparison with the best available promoter prediction programs. Moreover, it is fast, simple in design, and has no size constraints, and the results are easily interpretable. We compared our approach with 14 current state-of-the-art implementations using human gene and transcription start site data and analyzed the ENCODE region in more detail. We also validated our method on 12 additional eukaryotic genomes, including vertebrates, invertebrates, plants, fungi, and protists.
Affiliation(s)
- Thomas Abeel
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Yvan Saeys
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Eric Bonnet
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Pierre Rouzé
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
- Laboratoire Associé de l’INRA (France), Ghent University, 9052 Gent, Belgium
- Yves Van de Peer
- Department of Plant Systems Biology, Flanders Institute for Biotechnology (VIB), 9052 Gent, Belgium
- Department of Molecular Genetics, Ghent University, 9052 Gent, Belgium
8
Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol 2006; 7 Suppl 1:S10.1-12. [PMID: 16925832 PMCID: PMC1810547 DOI: 10.1186/gb-2006-7-s1-s10] [Citation(s) in RCA: 518] [Impact Index Per Article: 27.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation. RESULTS The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoters sequences with one false positive prediction per 2,000 base-pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software. CONCLUSION We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome.
Affiliation(s)
- Victor Solovyev
- Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK.
9
Narang V, Sung WK, Mittal A. Computational modeling of oligonucleotide positional densities for human promoter prediction. Artif Intell Med 2005; 35:107-19. [PMID: 16076553 DOI: 10.1016/j.artmed.2005.02.005] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Received: 10/30/2004] [Revised: 01/31/2005] [Accepted: 02/22/2005] [Indexed: 11/18/2022]
Abstract
OBJECTIVE The gene promoter region controls transcriptional initiation of a gene, which is the most important step in gene regulation. In-silico detection of promoter region in genomic sequences has a number of applications in gene discovery and understanding gene expression regulation. However, computational prediction of eukaryotic poly-II promoters has remained a difficult task. This paper introduces a novel statistical technique for detecting promoter regions in long genomic sequences. METHOD A number of existing techniques analyze the occurrence frequencies of oligonucleotides in promoter sequences as compared to other genomic regions. In contrast, the present work studies the positional densities of oligonucleotides in promoter sequences. The analysis does not require any non-promoter sequence dataset or any model of the background oligonucleotide content of the genome. The statistical model learnt from a dataset of promoter sequences automatically recognizes a number of transcription factor binding sites simultaneously with their occurrence positions relative to the transcription start site. Based on this model, a continuous naïve Bayes classifier is developed for the detection of human promoters and transcription start sites in genomic sequences. RESULTS The present study extends the scope of statistical models in general promoter modeling and prediction. Promoter sequence features learnt by the model correlate well with known biological facts. Results of human transcription start site prediction compare favorably with existing 2nd generation promoter prediction tools.
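To make the idea of a positional density concrete, the sketch below bins the start positions of one oligonucleotide, measured relative to the transcription start site, across a set of promoter sequences. It is a deliberately simplified histogram under stated assumptions: the function name `positional_density`, the `tss_index` and `bin_width` parameters, and the toy sequences are all hypothetical, and the continuous density model and naïve Bayes classifier described in the abstract are not reproduced.

```python
def positional_density(promoters, oligo, tss_index, bin_width=10):
    """Normalised histogram of oligo start positions relative to the TSS.

    Hypothetical helper for illustration. `tss_index` is the 0-based index
    of the TSS in every sequence (all sequences are assumed to be aligned
    on it). Keys of the result are the left edges of the position bins.
    """
    oligo = oligo.upper()
    counts = {}
    total = 0
    for seq in promoters:
        seq = seq.upper()
        start = seq.find(oligo)
        while start != -1:
            rel = start - tss_index               # negative means upstream
            left_edge = (rel // bin_width) * bin_width
            counts[left_edge] = counts.get(left_edge, 0) + 1
            total += 1
            start = seq.find(oligo, start + 1)    # allow overlapping hits
    if total == 0:
        return {}
    return {edge: n / total for edge, n in sorted(counts.items())}


# Toy sequences, not real promoters.
toy = ["GGTATAAAGGCC", "CCCCCTATAAAG"]
print(positional_density(toy, "TATAAA", tss_index=10, bin_width=5))
# {-10: 0.5, -5: 0.5}
```

Only promoter sequences are needed here, which mirrors the abstract's point that the analysis does not rely on a non-promoter dataset.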
Affiliation(s)
- Vipin Narang
- Department of Computer Science, S16 #06-02, 3 Science Drive 2, National University of Singapore, Singapore 117543, Singapore.
10
Xuan Z, Zhao F, Wang J, Chen G, Zhang MQ. Genome-wide promoter extraction and analysis in human, mouse, and rat. Genome Biol 2005; 6:R72. [PMID: 16086854 PMCID: PMC1273639 DOI: 10.1186/gb-2005-6-8-r72] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.7] [Received: 03/29/2005] [Revised: 05/23/2005] [Accepted: 07/11/2005] [Indexed: 01/27/2023] Open
Abstract
Large-scale and high-throughput genomics research needs reliable and comprehensive genome-wide promoter annotation resources. We have conducted a systematic investigation on how to improve mammalian promoter prediction by incorporating both transcript and conservation information. This enabled us to build a better multispecies promoter annotation pipeline and hence to create CSHLmpd (Cold Spring Harbor Laboratory Mammalian Promoter Database) for the biomedical research community, which can act as a starting reference system for more refined functional annotations.
Affiliation(s)
- Zhenyu Xuan
- Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
- Fang Zhao
- Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
- Jinhua Wang
- Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
- Gengxin Chen
- Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
- Michael Q Zhang
- Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
11
Zheng D, Zhang Z, Harrison PM, Karro J, Carriero N, Gerstein M. Integrated pseudogene annotation for human chromosome 22: evidence for transcription. J Mol Biol 2005; 349:27-45. [PMID: 15876366 DOI: 10.1016/j.jmb.2005.02.072] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.2] [Received: 10/20/2004] [Revised: 02/16/2005] [Accepted: 02/23/2005] [Indexed: 02/06/2023]
Abstract
Pseudogenes are inheritable genetic elements formally defined by two properties: their similarity to functioning genes and their presumed lack of activity. However, their precise characterization, particularly with respect to the latter quality, has proven elusive. An opportunity to explore this issue arises from the recent emergence of tiling-microarray data showing that intergenic regions (containing pseudogenes) are transcribed to a great degree. Here we focus on the transcriptional activity of pseudogenes on human chromosome 22. First, we integrated several sets of annotation to define a unified list of 525 pseudogenes on the chromosome. To characterize these further, we developed a comprehensive list of genomic features based on conservation in related organisms, expression evidence, and the presence of upstream regulatory sites. Of the 525 unified pseudogenes we could confidently classify 154 as processed and 49 as duplicated. Using data from tiling microarrays, especially from recent high-resolution oligonucleotide arrays, we found some evidence that up to a fifth of the 525 pseudogenes are potentially transcribed. Expressed sequence tags (EST) comparison further validated a number of these, and overall we found 17 pseudogenes with strong support for transcription. In particular, one of the pseudogenes with both EST and microarray evidence for transcription turned out to be a duplicated pseudogene in the cat eye syndrome critical region. Although we could not identify a meaningful number of transcription factor-binding sites (based on chromatin immunoprecipitation-chip data) near pseudogenes, we did find that approximately 12% of the pseudogenes had upstream CpG islands. Finally, analysis of corresponding syntenic regions in the mouse, rat and chimp genomes indicates, as previously suggested, that pseudogenes are less conserved than genes, but more preserved than the intergenic background (all notation is available from http://www.pseudogene.org).
Affiliation(s)
- Deyou Zheng
- Department of Molecular Biophysics and Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA
12
Shahmuradov IA, Solovyev VV, Gammerman AJ. Plant promoter prediction with confidence estimation. Nucleic Acids Res 2005; 33:1069-76. [PMID: 15722481 PMCID: PMC549412 DOI: 10.1093/nar/gki247] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.0] [Received: 09/28/2004] [Revised: 12/15/2004] [Accepted: 01/24/2005] [Indexed: 11/24/2022] Open
Abstract
Accurate prediction of promoters is fundamental to understanding gene expression patterns, where confidence estimation is one of the main requirements. Using recently developed transductive confidence machine (TCM) techniques, we developed a new program TSSP-TCM for the prediction of plant promoters that also provides confidence of the prediction. The program was trained on 132 and 104 sequences and tested on 40 and 25 sequences (containing TATA and TATA-less promoters, respectively) with known transcription start sites (TSSs). As negative training samples for TCM learning we used coding and intron sequences of plant genes annotated in the GenBank. In the test set of TATA promoters, the program correctly predicted TSS for 35 out of 40 (87.5%) genes with a median deviation of several base pairs from the true site location. For 25 TATA-less promoters, TSSs were predicted for 21 out of 25 (84%) genes, including 14 cases of 5 bp distance between annotated and predicted TSSs. Using TSSP-TCM program we annotated promoters in the whole Arabidopsis genome. The predicted promoters were in good agreement with the start position of known Arabidopsis mRNAs. Thus, TCM technique has produced a plant-oriented promoter prediction tool of high accuracy. TSSP-TCM program and annotated promoters are available at http://mendel.cs.rhul.ac.uk/mendel.php?topic=fgen.
Affiliation(s)
- V. V. Solovyev
- Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
- Softberry Inc., 116 Radio Circle, Suite 400, Mount Kisco, NY 10549, USA
- A. J. Gammerman
- Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK
13
Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004; 22:1467-73. [PMID: 15529174 DOI: 10.1038/nbt1032] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.4] [Indexed: 11/08/2022]
Abstract
Promoter prediction programs (PPPs) are important for in silico gene discovery without support from expressed sequence tag (EST)/cDNA/mRNA sequences, in the analysis of gene regulation and in genome annotation. Contrary to previous expectations, a comprehensive analysis of PPPs reveals that no program simultaneously achieves sensitivity and a positive predictive value >65%. PPP performances deduced from a limited number of chromosomes or smaller data sets do not hold when evaluated at the level of the whole genome, with serious inaccuracy of predictions for non-CpG-island-related promoters. Some PPPs even perform worse than, or close to, pure random guessing.
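The two evaluation measures named in this abstract have standard definitions: sensitivity is TP / (TP + FN) and positive predictive value is TP / (TP + FP). The minimal sketch below computes both from prediction counts. The function names and the example counts are hypothetical and are not figures from the paper.

```python
def sensitivity(tp, fn):
    """Fraction of true promoters that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def positive_predictive_value(tp, fp):
    """Fraction of predictions that are true promoters: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up counts for illustration only.
tp, fp, fn = 60, 40, 40
print(sensitivity(tp, fn), positive_predictive_value(tp, fp))  # 0.6 0.6
```

With these made-up counts a program would fall just short of the 65% mark on both measures, which is the kind of joint threshold the abstract reports no program exceeding.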
Affiliation(s)
- Vladimir B Bajic
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613 Singapore.
14
Down TA, Hubbard TJP. What can we learn from noncoding regions of similarity between genomes? BMC Bioinformatics 2004; 5:131. [PMID: 15369604 PMCID: PMC523850 DOI: 10.1186/1471-2105-5-131] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 12/04/2003] [Accepted: 09/15/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In addition to known protein-coding genes, large amounts of apparently non-coding sequence are conserved between the human and mouse genomes. It seems reasonable to assume that these conserved regions are more likely to contain functional elements than less-conserved portions of the genome. METHODS Here we used a motif-oriented machine learning method based on the Relevance Vector Machine algorithm to extract the strongest signal from a set of non-coding conserved sequences. RESULTS We successfully fitted models to reflect the non-coding sequences, and showed that the results were quite consistent for repeated training runs. Using the learned models to scan genomic sequence, we found that they often made predictions close to the start of annotated genes. We compared this method with other published promoter-prediction systems, and showed that the set of promoters which are detected by this method is substantially similar to that detected by existing methods. CONCLUSIONS The results presented here indicate that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome. They also lend support to the belief that there exists a substantial subset of promoter regions which share several common features including, but not restricted to, a relative abundance of CpG dinucleotides. This subset is detectable by a variety of distinct computational methods.
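The "relative abundance of CpG dinucleotides" mentioned in this abstract is often quantified with GC content and the observed/expected CpG ratio, (number of CpG × length) / (number of C × number of G). The sketch below computes both for a sequence. It illustrates that common measure only, under the assumption that it is a reasonable proxy here; it is not the Relevance Vector Machine method the authors used, and the function name `cpg_stats` is hypothetical.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence.

    Hypothetical helper for illustration. The ratio is defined as
    (#CpG * N) / (#C * #G) and is reported as 0.0 when C or G is absent.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


print(cpg_stats("CGCGCG"))  # (1.0, 2.0)
```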
Affiliation(s)
- Thomas A Down
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Tim JP Hubbard
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
|
15
|
Abstract
Fifty years after the publication of DNA structure, the whole human genome sequence will be officially finished. This achievement marks the beginning of the task to catalogue every human gene and identify each gene's function and expression patterns. Currently, researchers estimate that there are about 30,000 human genes and approximately 70% of these can be automatically predicted using a combination of ab initio and similarity-based programs. However, to experimentally investigate every gene's function, the research community requires a high-quality annotation of alternative splicing, pseudogenes, and promoter regions that can only be provided by manual intervention. Manual curation of the human genome will be a long-term project as experimental data are continually produced to confirm or refine the predictions, and new features such as noncoding RNAs and enhancers have not been fully identified. Such a highly curated human gene-set made publicly available will be a great asset for the experimental community and for future comparative genome projects.
Affiliation(s)
- Jennifer L Ashurst
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.
|
16
|
Bajic VB, Seah SH. Dragon Gene Start Finder identifies approximate locations of the 5' ends of genes. Nucleic Acids Res 2003; 31:3560-3. [PMID: 12824365 PMCID: PMC168976 DOI: 10.1093/nar/gkg570] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022]
Abstract
Recognition of gene starts is a difficult and yet unsolved problem. We present a program, Dragon Gene Start Finder (DGSF), which assesses the gene start in mammalian genomes and predicts a region which should overlap with the first exon of the gene or be in its proximity. The program has been rigorously tested on human chromosomes 4, 21 and 22, and in a strand specific search achieves an overall sensitivity of approximately 65% and a positive predictive value of approximately 78%. The sensitivity for the CpG-island related promoters is >88%. DGSF is free for academic and non-profit users at http://sdmc.lit.org.sg/promoter/dragonGSF1_0/genestart.htm; the download version of the program integrated within the TRANSPLORER package can be obtained from Biobase GmbH, at http://www.biobase.de/.
Affiliation(s)
- Vladimir B Bajic
- Knowledge Extraction Laboratory. Discovery Systems Laboratory, Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613.
|
17
|
Bajic VB, Seah SH. Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Res 2003; 13:1923-9. [PMID: 12869582 PMCID: PMC403784 DOI: 10.1101/gr.869803] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.6] [Indexed: 11/24/2022]
Abstract
We present an advanced system for recognition of gene starts in mammalian genomes. The system makes predictions of gene start location by combining information about CpG islands, transcription start sites (TSSs), and signals downstream of the predicted TSSs. The system aims at predicting a region that contains the gene start or is in its proximity. Evaluation on human chromosomes 4, 21, and 22 resulted in a sensitivity (Se) of over 65% and a positive predictive value (ppv) of approximately 78%. The system makes on average one prediction per 177,000 nucleotides on the human genome, as judged by the results on chromosome 21. Comparison of abilities to predict TSS with the two other systems on human chromosomes 4, 21, and 22 reveals that our system has superior accuracy and overall provides the most confident predictions.
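For readers unfamiliar with the two measures quoted in this abstract, the following is a minimal sketch of how sensitivity and positive predictive value are conventionally computed from prediction counts. The function names and the example counts are illustrative assumptions, not values or code from the paper.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real gene starts that were predicted: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of predictions that hit a real gene start: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts chosen only to illustrate the arithmetic.
tp, fn, fp = 65, 35, 18
print(f"Se  = {sensitivity(tp, fn):.2f}")
print(f"ppv = {positive_predictive_value(tp, fp):.2f}")
```

Sensitivity penalizes missed gene starts, while ppv penalizes spurious predictions, which is why promoter-prediction abstracts usually report the two together.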
Affiliation(s)
- Vladimir B Bajic
- Knowledge Extraction Lab, Institute for Infocomm Research, Singapore 119613.
|
18
|
Halees AS, Leyfer D, Weng Z. PromoSer: A large-scale mammalian promoter and transcription start site identification service. Nucleic Acids Res 2003; 31:3554-9. [PMID: 12824364 PMCID: PMC168956 DOI: 10.1093/nar/gkg549] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.0] [Indexed: 11/14/2022]
Abstract
Proximal promoters have a major impact on transcriptional regulation. Studies of the sequence-based nature of this regulation usually require collection of proximal promoter sequences for large sets of co-regulated genes. We report a newly implemented web service that facilitates extraction of user specified regions around the transcription start site of all annotated human, mouse or rat genes. The transcription start sites have been identified computationally by considering alignments of a large number of partial and full-length mRNA sequences to genomic DNA, with provision for alternative promoters. The service is publicly available at http://biowulf.bu.edu/zlab/PromoSer/.
Affiliation(s)
- Anason S Halees
- Bioinformatics Program, Boston University, 44 Cummington Street, Boston, MA 02215, USA
|
19
|
Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J 2003; 17:1228-37. [PMID: 12832287 DOI: 10.1096/fj.02-0955rev] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.0] [Indexed: 01/22/2023]
Abstract
Understanding how the regulation of gene networks is orchestrated is an important challenge for characterizing complex biological processes. Gene transcription is regulated in part by nuclear factors that recognize short DNA sequence motifs, called transcription factor binding sites, in most cases located upstream of the gene coding sequence in promoter and enhancer regions. Genes expressed in the same tissue under similar conditions often share a common organization of at least some of these regulatory binding elements. In this way the organization of promoter motifs represents a "footprint" of the transcriptional regulatory mechanisms at work in a specific biologic context and thus provides information about signal and tissue specific control of expression. Analysis of promoters for organizational features as demonstrated here provides a crucial link between the static nucleotide sequence of the genome and the dynamic aspects of gene regulation and expression.
Affiliation(s)
- Thomas Werner
- GSF-National Research Center for Environment and Health, Institute of Experimental Genetics, Neuherberg, Germany
|
20
|
Solovyev VV, Shahmuradov IA. PromH: Promoters identification using orthologous genomic sequences. Nucleic Acids Res 2003; 31:3540-5. [PMID: 12824362 PMCID: PMC168932 DOI: 10.1093/nar/gkg525] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.7] [Received: 02/15/2003] [Revised: 03/21/2003] [Accepted: 03/21/2003] [Indexed: 11/14/2022]
Abstract
Accurate prediction of promoters is fundamental for understanding gene expression patterns, cell specificity and development. In studies of conserved features of regulatory regions of orthologous genes, it was observed that major promoter functional components, such as transcription start points, TATA-boxes and regulatory motifs, are significantly more conserved than the sequences around them (70-100% compared with 30-50%). To improve promoter identification accuracy, we employed these findings in a new program, PromH, created by extending the TSSW program feature set. PromH uses linear discriminant functions that take into account conservation features and nucleotide sequences of promoter regions in pairs of orthologous genes. The program was tested on two sets of pairs of orthologous, mostly human and rodent, sequences with known transcription start sites (TSS), annotated to have TATA (21 genes, 11 orthologous pairs) and TATA-less (38 genes, 19 pairs) promoters, respectively. The program correctly predicted TSS for all 21 genes of the first set with a median deviation of 2 bp from true site location. For only two genes was there a significant (46 and 105 bp) discrepancy between predicted and annotated TSS positions. For 38 TATA-less promoters from the second set, TSS was predicted for 27 genes, in 14 cases within 10 bp of the annotated TSS, and in 21 cases within 100 bp. Despite more discrepancies between predicted and annotated TSS for genes from the second set, these results are consistent with observations of much higher occurrence of multiple TSS in TATA-less promoters. In any case, our results show that PromH identifies TSS positions significantly more accurately than any other published promoter prediction method. The PromH program is available at http://www.softberry.com/berry.phtml?topic=promh.
Affiliation(s)
- V V Solovyev
- Softberry Inc., 116 Radio Circle, Suite 400, Mount Kisco, NY 10549, USA.
|
21
|
Wasserman WW, Krivan W. In silico identification of metazoan transcriptional regulatory regions. THE SCIENCE OF NATURE - NATURWISSENSCHAFTEN 2003; 90:156-66. [PMID: 12712249 DOI: 10.1007/s00114-003-0409-4] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Indexed: 10/25/2022]
Abstract
Transcriptional regulation remains one of the most intriguing and challenging subjects in biomedical research. The catalysis of transcription is a clear example of multiple proteins interacting to orchestrate a biological process, offering a starting point for the study of biological systems. Transcriptional regulation is viewed as one of the principal mechanisms governing the spatial and temporal distribution of gene expression, thus the field of transcriptional regulation provides a natural stage for quantitative studies of multiple gene systems. Building on the body of focused experimental studies and new genomics-driven data, computational biologists are making significant strides in accelerating our understanding of the transcriptional regulatory process in metazoan cells. Recent advances in the computational analysis of the interplay between factors have been fueled by well-defined computational methods for the modeling of the binding of individual transcription factors. We present here an overview of advances in the analysis of regulatory systems and the fundamental methods that underlie the recent developments.
Affiliation(s)
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, University of British Columbia, 950 West 28th Avenue, Vancouver, BC, V5Z 4H4, Canada.
|
22
|
Bajic VB, Seah SH, Chong A, Krishnan SPT, Koh JLY, Brusic V. Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates. J Mol Graph Model 2003; 21:323-32. [PMID: 12543131 DOI: 10.1016/s1093-3263(02)00179-1] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.2] [Indexed: 11/28/2022]
Abstract
This paper introduces a new computer system for recognition of functional transcription start sites (TSSs) in RNA polymerase II promoter regions of vertebrates. This system allows scanning complete vertebrate genomes for promoters with a significantly reduced number of false positive predictions. It can be used in the context of gene finding through its recognition of the 5' end of genes. The implemented recognition model uses a composite-hierarchical approach, artificial intelligence, statistics, and signal processing techniques. It also exploits the separation of promoter sequences into those that are C+G-rich or C+G-poor. The system was evaluated on a large and diverse human sequence-set and exhibited several times higher accuracy than several publicly available TSS-finding programs. Results obtained using human chromosome 22 data showed even greater specificity than the evaluation set results. The system has been implemented in the Dragon Promoter Finder package, which can be accessed at http://sdmc.krdl.org.sg:8080/promoter/.
Affiliation(s)
- Vladimir B Bajic
- Computational Immunology Group, BIC-LIT, Laboratories for Information Technology, 21 Heng Mui Keng Terrace, 119613 Singapore, Singapore
|
23
|
Werner T. Promoter analysis. ERNST SCHERING RESEARCH FOUNDATION WORKSHOP 2002:65-82. [PMID: 12061007 DOI: 10.1007/978-3-662-04747-7_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Affiliation(s)
- T Werner
- Institute of Biomathematics and Biometry, GSF-Forschungszentrum für Umwelt und Gesundheit, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany.
|
24
|
Gilligan P, Brenner S, Venkatesh B. Fugu and human sequence comparison identifies novel human genes and conserved non-coding sequences. Gene 2002; 294:35-44. [PMID: 12234665 DOI: 10.1016/s0378-1119(02)00793-x] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.2] [Indexed: 11/30/2022]
Abstract
The compact genome of the pufferfish, Fugu rubripes, has been proposed as a 'reference' genome to aid in annotating and analysing the human genome. We have annotated and compared 85 kb of Fugu sequence containing 17 genes with its homologous loci in the human draft genome and identified three 'novel' human genes that were missed or incompletely predicted by the previous gene prediction methods. Two of the novel genes contain zinc finger domains and are designated ZNF366 and ZNF367. They map to human chromosomes 5q13.2 and 9q22.32, respectively. The third novel gene, designated C9orf21, maps to chromosome 9q22.32. This gene is unique to vertebrates, and the protein encoded by it does not contain any known domains. We could not find human homologs for two Fugu genes, a novel chemokine gene and a kinase gene. These genes are either specific to teleosts or lost in the human lineage. The Fugu-human comparison identified several conserved non-coding sequences in the promoter and intronic regions. These sequences, conserved during 450 million years of vertebrate evolution, are likely to be involved in gene regulation. The 85 kb Fugu locus is dispersed over four human loci, occupying about 1.5 Mb. Contiguity is conserved in the human genome between six out of 16 Fugu gene pairs. These contiguous chromosomal segments should share a common evolutionary history dating back to the common ancestor of mammals and teleosts. We propose contiguity as strong evidence to identify orthologous genes in distant organisms. This study confirms the utility of the Fugu as a supplementary tool to uncover and confirm novel genes and putative gene regulatory regions in the human genome.
Affiliation(s)
- Patrick Gilligan
- Institute of Molecular and Cell Biology, 30 Medical Drive, 117609, Singapore
|
25
|
Riechmann JL. Transcriptional regulation: a genomic overview. THE ARABIDOPSIS BOOK 2002; 1:e0085. [PMID: 22303220 PMCID: PMC3243377 DOI: 10.1199/tab.0085] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.8] [Indexed: 05/19/2023]
Abstract
The availability of the Arabidopsis thaliana genome sequence allows a comprehensive analysis of transcriptional regulation in plants using novel genomic approaches and methodologies. Such a genomic view of transcription first necessitates the compilation of lists of elements. Transcription factors are the most numerous of the different types of proteins involved in transcription in eukaryotes, and the Arabidopsis genome codes for more than 1,500 of them, or approximately 6% of its total number of genes. A genome-wide comparison of transcription factors across the three eukaryotic kingdoms reveals the evolutionary generation of diversity in the components of the regulatory machinery of transcription. However, as illustrated by Arabidopsis, transcription in plants follows similar basic principles and logic to those in animals and fungi. A global view and understanding of transcription at a cellular and organismal level requires the characterization of the Arabidopsis transcriptome and promoterome, as well as of the interactome, the localizome, and the phenome of the proteins involved in transcription.
Affiliation(s)
- José Luis Riechmann
- Mendel Biotechnology, 21375 Cabot Blvd., Hayward, CA 94545, USA
- California Institute of Technology, Division of Biology 156-29, Pasadena, CA 91125
|
26
|
Liu R, States DJ. Consensus promoter identification in the human genome utilizing expressed gene markers and gene modeling. Genome Res 2002; 12:462-9. [PMID: 11875035 PMCID: PMC155291 DOI: 10.1101/gr.198002] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.0] [Indexed: 11/25/2022]
Abstract
Deciphering the human genome includes locating the promoters that initiate transcription and identifying the exons of genes. Many promoter prediction programs have been proposed, but when they are applied to extended regions of the genome, most of their predictions are false-positives. The extensive collection of gene transcript sequences is an important new source of information, which has not been used previously in promoter predictions. Our approach is to enhance the specificity of predictions by restricting the genomic regions that are searched using gene transcript alignments as anchors in the genome for gene modeling. We developed a consensus promoter prediction method combining previously developed algorithms with the GENSCAN gene modeling program. Our method, CONPRO (CONsensus PROmoter), identifies promoters with very high confidence, and the predicted promoters are guaranteed to be associated with genes. On our test data set, the method correctly detects promoters for approximately half of all human genes (37%-71%), and most predictions are true promoters (85%-90%). Applying our method to the human genome and human genes from the Unigene data set, we find the promoters for 13,744 genes. Of these, 6440 are genes with a functionally cloned mRNA, and 7304 are novel genes for which only expressed sequence tags (ESTs) are available. Candidate promoters for many novel genes will be a useful resource in elucidating complex biological response mechanisms.
Affiliation(s)
- Rongxiang Liu
- Bioinformatics Program and the Department of Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
|
27
|
Down TA, Hubbard TJP. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res 2002; 12:458-61. [PMID: 11875034 PMCID: PMC155284 DOI: 10.1101/gr.216102] [Citation(s) in RCA: 217] [Impact Index Per Article: 9.4] [Indexed: 11/24/2022]
Abstract
Transcription, the process whereby RNA copies are made from sections of the DNA genome, is directed by promoter regions. These define the transcription start site, and also the set of cellular conditions under which the promoter is active. At least in more complex species, it appears to be common for genes to have several different transcription start sites, which may be active under different conditions. Eukaryotic promoters are complex and fairly diffuse structures, which have proven hard to detect in silico. We show that a novel hybrid machine-learning method is able to build useful models of promoters for >50% of human transcription start sites. We estimate specificity to be >70%, and demonstrate good positional accuracy. Based on the structure of our learned models, we conclude that a signal resembling the well known TATA box, together with flanking regions of C-G enrichment, are the most important sequence-based signals marking sites of transcriptional initiation at a large class of typical promoters.
Affiliation(s)
- Thomas A Down
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, United Kingdom.
|
28
|
Dunham I. Lessons from the sequence of human chromosome 22. ERNST SCHERING RESEARCH FOUNDATION WORKSHOP 2002:31-50. [PMID: 11859563 DOI: 10.1007/978-3-662-04667-8_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2023]
|
29
|
Ohler U, Liao GC, Niemann H, Rubin GM. Computational analysis of core promoters in the Drosophila genome. Genome Biol 2002; 3:RESEARCH0087. [PMID: 12537576 PMCID: PMC151189 DOI: 10.1186/gb-2002-3-12-research0087] [Citation(s) in RCA: 308] [Impact Index Per Article: 13.4] [Received: 10/07/2002] [Revised: 11/19/2002] [Accepted: 11/27/2002] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The core promoter, a region of about 100 base-pairs flanking the transcription start site (TSS), serves as the recognition site for the basal transcription apparatus. Drosophila TSSs have generally been mapped by individual experiments; the low number of accurately mapped TSSs has limited analysis of promoter sequence motifs and the training of computational prediction tools. RESULTS: We identified TSS candidates for about 2,000 Drosophila genes by aligning 5' expressed sequence tags (ESTs) from cap-trapped cDNA libraries to the genome, while applying stringent criteria concerning coverage and 5'-end distribution. Examination of the sequences flanking these TSSs revealed the presence of well-known core promoter motifs such as the TATA box, the initiator and the downstream promoter element (DPE). We also define, and assess the distribution of, several new motifs prevalent in core promoters, including what appears to be a variant DPE motif. Among the prevalent motifs is the DNA-replication-related element DRE, recently shown to be part of the recognition site for the TBP-related factor TRF2. Our TSS set was then used to retrain the computational promoter predictor McPromoter, allowing us to improve the recognition performance to over 50% sensitivity and 40% specificity. We compare these computational results to promoter prediction in vertebrates. CONCLUSIONS: There are relatively few recognizable binding sites for previously known general transcription factors in Drosophila core promoters. However, we identified several new motifs enriched in promoter regions. We were also able to significantly improve the performance of computational TSS prediction in Drosophila.
Affiliation(s)
- Uwe Ohler
- Department of Molecular and Cell Biology, University of California at Berkeley, Berkeley, CA 94720-3200, USA.
|
30
|
Abstract
Microarray technologies for measuring mRNA abundances in cells allow monitoring of gene expression levels for tens of thousands of genes in parallel. By measuring expression responses across hundreds of different conditions or timepoints, a relatively detailed gene expression map starts to emerge. Using cluster analysis techniques, it is possible to identify genes that are consistently coexpressed under several different conditions or treatments. These sets of coexpressed genes can then be compared to existing knowledge about biochemical or signalling pathways, and the function of unknown genes can be hypothesised by comparing them to other genes with characterised function, or from trends in expression profiles in general, such as why a cell needs to transcribe or silence the genes during a particular treatment. The regulation of genes at the DNA level is largely guided by particular sequence features, the transcription factor binding sites, and other signals captured in DNA. By analyzing the regulatory regions of the DNA of the genes consistently coexpressed, we can discover the potential signals hidden in DNA by computational analysis methods. The prerequisite for this kind of analysis is the existence of genomic DNA sequence, knowledge about gene locations, and experimental gene expression measurements for a variety of conditions. This article surveys some of the analysis methods and studies for such a computational discovery approach for the yeast Saccharomyces cerevisiae.
Affiliation(s)
- J Vilo
- European Bioinformatics Institute EBI, EMBL Outstation - Hinxton, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK.
|
31
|
Abstract
The availability of the complete genomic sequence of yeast now enables elucidation of molecular mechanisms governing gene expression patterns. New results from the yeast genome and recent advances in predicting and finding human promoters support the use of similar combinatorial approaches to study genome-wide transcriptional regulation in humans.
|
32
|
Holste D, Grosse I, Herzel H. Statistical analysis of the DNA sequence of human chromosome 22. PHYSICAL REVIEW E 2001; 64:041917. [PMID: 11690062 DOI: 10.1103/physreve.64.041917] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.0] [Received: 04/19/2001] [Indexed: 11/07/2022]
Abstract
We study statistical patterns in the DNA sequence of human chromosome 22, the first completely sequenced human chromosome. We find that (i) the 33.4 × 10^6 nucleotide long human chromosome exhibits long-range power-law correlations over more than four orders of magnitude, (ii) the entropies H(n) of the frequency distribution of oligonucleotides of length n (n-mers) grow sublinearly with increasing n, indicating the presence of higher-order correlations for all of the studied lengths 1 ≤ n ≤ 10, and (iii) the generalized entropies H(n)(q) of n-mers decrease monotonically with increasing q and the decay of H(n)(q) with q becomes steeper with increasing n ≤ 10, indicating that the frequency distribution of oligonucleotides becomes increasingly nonuniform as the length n increases. We investigate to what degree known biological features may explain the observed statistical patterns. We find that (iv) the presence of interspersed repeats may cause the sublinear increase of H(n) with n, and that (v) the presence of monomeric tandem repeats as well as the suppression of CG dinucleotides may cause the observed decay of H(n)(q) with q.
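As an illustration of the n-mer entropy mentioned in this abstract, the following is a minimal sketch of a Shannon entropy computed over the frequency distribution of overlapping n-mers. The function name, the use of overlapping windows, and the base-2 logarithm are assumptions made for this example; the paper's exact estimator and its generalized entropies are not reproduced here.

```python
import math
from collections import Counter


def nmer_entropy(sequence: str, n: int) -> float:
    """Shannon entropy (in bits) of the n-mer frequency distribution.

    Counts every overlapping window of length n in the sequence.
    """
    if n < 1 or n > len(sequence):
        raise ValueError("n must be between 1 and the sequence length")
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy sequence, far too short for meaningful statistics.
toy = "ACGTACGTTTGACA"
for n in (1, 2, 3):
    print(n, round(nmer_entropy(toy, n), 3))
```

For an uncorrelated sequence with uniform base usage this quantity would grow by 2 bits per added nucleotide, so slower growth with n is what signals correlations between neighbouring positions.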
Affiliation(s)
- D Holste
- Department of Theoretical Biophysics, Humboldt University Berlin, Invalidenstrasse 42, D-10115, Berlin, Germany
|
33
|
Mein C. 'ome on the range. Trends Biotechnol 2001; 19:240-1. [PMID: 11434348 DOI: 10.1016/s0167-7799(01)01644-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Affiliation(s)
- C Mein
- Genome Centre, Bart's and London Queen Mary's School of Medicine and Dentistry, UK.
|