201
|
Zhang C, Krainer AR, Zhang MQ. Evolutionary impact of limited splicing fidelity in mammalian genes. Trends Genet 2007; 23:484-8. [PMID: 17719121 DOI: 10.1016/j.tig.2007.08.001] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 10/23/2006] [Revised: 05/08/2007] [Accepted: 08/13/2007] [Indexed: 10/22/2022]
Abstract
The functional significance of most alternative splicing (AS) events, especially frame-shifting ones, has been controversial. Using human-mouse comparison, we demonstrate that frame-preserving AS events adapt and get fixed more rapidly than frame-shifting AS events; selection for smaller exon size is stronger in frame-preserving exons than in frame-shifting ones. These results suggest AS events introducing mild changes are generally favored during evolution and explain the excess of shorter, frame-preserving cassette exons in present mammalian genomes.
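The distinction between frame-preserving and frame-shifting cassette exons comes down to whether the exon length is a multiple of three. The following is a minimal sketch of that classification only; the function names are hypothetical and it does not reproduce the human-mouse comparison described in the abstract.

```python
def is_frame_preserving(exon_length: int) -> bool:
    """Return True when including or skipping the exon leaves the
    downstream reading frame unchanged (length divisible by 3)."""
    if exon_length <= 0:
        raise ValueError("exon length must be a positive integer")
    return exon_length % 3 == 0


def classify_cassette_exons(lengths):
    """Split exon lengths (in nucleotides) into the two classes."""
    result = {"frame_preserving": [], "frame_shifting": []}
    for length in lengths:
        key = "frame_preserving" if is_frame_preserving(length) else "frame_shifting"
        result[key].append(length)
    return result
```

For instance, a 96-nt cassette exon is frame-preserving, whereas a 100-nt exon shifts the frame of everything downstream.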
|
202
|
Abstract
Computational analysis of eukaryotic promoters is one of the most difficult problems in computational genomics and is essential for understanding gene expression profiles and reverse-engineering gene regulation network circuits. Here I give a basic introduction to the problem and a recent update on both experimental and computational approaches. More details may be found in the extended references. This review is based on a summer lecture given at the Max Planck Institute in Berlin in 2005.
|
203
|
Zhang C, Hastings ML, Krainer AR, Zhang MQ. Dual-specificity splice sites function alternatively as 5' and 3' splice sites. Proc Natl Acad Sci U S A 2007; 104:15028-33. [PMID: 17848517 PMCID: PMC1986607 DOI: 10.1073/pnas.0703773104] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Indexed: 12/11/2022] Open
Abstract
As a result of large-scale sequencing projects and recent splicing-microarray studies, estimates of mammalian genes expressing multiple transcripts continue to increase. This expansion of transcript information makes it possible to better characterize alternative splicing events and gain insights into splicing mechanisms and regulation. Here, we describe a class of splice sites that we call dual-specificity splice sites, which we identified through genome-wide, high-quality alignment of mRNA/EST and genome sequences and experimentally verified by RT-PCR. These splice sites can be alternatively recognized as either 5' or 3' splice sites, and the dual splicing is conceptually similar to a pair of mutually exclusive exons separated by a zero-length intron. The dual-splice-site sequences are essentially a composite of canonical 5' and 3' splice-site consensus sequences, with a CAG|GURAG core. The relative use of a dual site as a 5' or 3' splice site can be accurately predicted by assuming competition for specific binding between spliceosomal components involved in recognition of 5' and 3' splice sites, respectively. Dual-specificity splice sites exist in human and mouse, and possibly in other vertebrate species, although most sites are not conserved, suggesting that their origin is recent. We discuss the implications of this unusual splicing pattern for the diverse mechanisms of exon recognition and for gene evolution.
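The CAG|GURAG core quoted above can be located in a sequence with a simple pattern match, where R is the IUPAC code for a purine (A or G) and the bar marks the splice junction. This is only an illustrative sketch with hypothetical names; it finds core matches and says nothing about whether a site is actually used, which the abstract attributes to competition between spliceosomal components.

```python
import re

# CAG|GURAG with R expanded to [AG]; the lookahead allows overlapping matches.
_DUAL_CORE = re.compile(r"(?=CAGGU[AG]AG)")


def find_dual_site_cores(sequence: str):
    """Return 0-based junction positions (index of the first base after
    the bar in CAG|GURAG) for every core match in the sequence.
    DNA input is accepted: T is converted to U before matching."""
    rna = sequence.upper().replace("T", "U")
    return [match.start() + 3 for match in _DUAL_CORE.finditer(rna)]
```

A sequence such as `AAACAGGUAAGUUU` contains one core starting at index 3, so the reported junction position is 6.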
|
204
|
Yang Z, Jiang H, Zhao F, Shankar DB, Sakamoto KM, Zhang MQ, Lin S. A highly conserved regulatory element controls hematopoietic expression of GATA-2 in zebrafish. BMC Dev Biol 2007; 7:97. [PMID: 17708765 PMCID: PMC1988811 DOI: 10.1186/1471-213x-7-97] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 11/01/2006] [Accepted: 08/20/2007] [Indexed: 01/30/2023]
Abstract
Background GATA-2 is a transcription factor required for hematopoietic stem cell survival as well as for neuronal development in vertebrates. It has been shown that specific expression of GATA-2 in blood progenitor cells requires distal cis-acting regulatory elements. Identification and characterization of these elements should help elucidate the transcriptional regulatory mechanisms of GATA-2 expression in the hematopoietic lineage. Results By pair-wise alignments of the zebrafish genomic sequences flanking GATA-2 to orthologous regions of the fugu, mouse, rat and human genomes, we identified three highly conserved non-coding sequences in the genomic region flanking GATA-2, two upstream of GATA-2 and one downstream. Using both transposon and bacterial artificial chromosome mediated germline transgenic zebrafish analyses, one of the sequences was established as necessary and sufficient to direct hematopoietic GFP expression in a manner that recapitulates that of GATA-2. In addition, we demonstrated that this element has enhancer activity in mammalian myeloid leukemia cell lines, thus validating its functional conservation among vertebrate species. Further analysis of potential transcription factor binding sites suggested that integrity of the putative HOXA3 and LMO2 sites is required for regulating GATA-2/GFP hematopoietic expression. Conclusion Regulation of GATA-2 expression in hematopoietic cells is likely conserved among vertebrate animals. The integrated approach described here, drawing on embryological, transgenesis and computational methods, should be generally applicable to the analysis of tissue-specific gene regulation involving distal DNA cis-acting elements.
|
205
|
Das D, Zhang MQ. Predictive models of gene regulation: application of regression methods to microarray data. Methods Mol Biol 2007; 377:95-110. [PMID: 17634611 DOI: 10.1007/978-1-59745-390-5_5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 05/16/2023]
Abstract
Eukaryotic transcription is a complex process. A myriad of biochemical signals cause activators and repressors to bind specific cis-elements on the promoter DNA, which help to recruit the basal transcription machinery that ultimately initiates transcription. In this chapter, we discuss how regression techniques can be effectively used to infer the functional cis-regulatory elements and their cooperativity from microarray data. Examples from yeast cell cycle are drawn to demonstrate the power of these techniques. Periodic regulation of the cell cycle, connection with underlying energetics, and the inference of combinatorial logic are also discussed. An implementation based on regression splines is discussed in detail.
|
206
|
Zhao X, Xuan Z, Zhang MQ. Boosting with stumps for predicting transcription start sites. Genome Biol 2007; 8:R17. [PMID: 17274821 PMCID: PMC1852414 DOI: 10.1186/gb-2007-8-2-r17] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Received: 10/03/2006] [Revised: 12/01/2006] [Accepted: 02/02/2007] [Indexed: 12/05/2022] Open
Abstract
CoreBoost applies a boosting technique to select important features for predicting core promoters with diverse patterns. Promoter prediction is a difficult but important problem in gene finding, and it is critical for elucidating the regulation of gene expression. We introduce a new promoter prediction program, CoreBoost, which applies a boosting technique with stumps to select important small-scale as well as large-scale features. CoreBoost improves greatly on locating transcription start sites. We also demonstrate that by further utilizing some tissue-specific information, better accuracy can be achieved.
|
207
|
Mignone JL, Roig-Lopez JL, Fedtsova N, Schones DE, Manganas LN, Maletic-Savatic M, Keyes WM, Mills AA, Gleiberman A, Zhang MQ, Enikolopov G. Neural potential of a stem cell population in the hair follicle. Cell Cycle 2007; 6:2161-70. [PMID: 17873521 PMCID: PMC3789384 DOI: 10.4161/cc.6.17.4593] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.0] [Indexed: 02/08/2023] Open
Abstract
The bulge region of the hair follicle serves as a repository for epithelial stem cells that can regenerate the follicle in each hair growth cycle and contribute to epidermis regeneration upon injury. Here we describe a population of multipotential stem cells in the hair follicle bulge region; these cells can be identified by fluorescence in transgenic nestin-GFP mice. The morphological features of these cells suggest that they maintain close associations with each other and with the surrounding niche. Upon explantation, these cells can give rise to neurosphere-like structures in vitro. When these cells are permitted to differentiate, they produce several cell types, including cells with neuronal, astrocytic, oligodendrocytic, smooth muscle, adipocytic, and other phenotypes. Furthermore, upon implantation into the developing nervous system of chick, these cells generate neuronal cells in vivo. We used transcriptional profiling to assess the relationship between these cells and embryonic and postnatal neural stem cells and to compare them with other stem cell populations of the bulge. Our results show that nestin-expressing cells in the bulge region of the hair follicle have stem cell-like properties, are multipotent, and can effectively generate cells of neural lineage in vitro and in vivo.
|
208
|
Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, Zhang MQ, Lobanenkov VV, Ren B. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 2007; 128:1231-45. [PMID: 17382889 PMCID: PMC2572726 DOI: 10.1016/j.cell.2006.12.048] [Citation(s) in RCA: 786] [Impact Index Per Article: 46.2] [Received: 08/15/2006] [Revised: 11/23/2006] [Accepted: 12/28/2006] [Indexed: 12/31/2022]
Abstract
Insulator elements affect gene expression by preventing the spread of heterochromatin and restricting transcriptional enhancers from activation of unrelated promoters. In vertebrates, insulator function requires association with the CCCTC-binding factor (CTCF), a protein that recognizes long and diverse nucleotide sequences. While insulators are critical in gene regulation, only a few have been reported. Here, we describe 13,804 CTCF-binding sites in potential insulators of the human genome, discovered experimentally in primary human fibroblasts. Most of these sequences are located far from the transcriptional start sites, with their distribution strongly correlated with genes. The majority of them fit a highly conserved consensus motif that is suitable for predicting possible insulators driven by CTCF in other vertebrate genomes. In addition, CTCF localization is largely invariant across different cell types. Our results provide a resource for investigating insulator function and possible other general and evolutionarily conserved activities of CTCF sites.
|
209
|
Murchison EP, Stein P, Xuan Z, Pan H, Zhang MQ, Schultz RM, Hannon GJ. Critical roles for Dicer in the female germline. Genes Dev 2007; 21:682-93. [PMID: 17369401 PMCID: PMC1820942 DOI: 10.1101/gad.1521307] [Citation(s) in RCA: 372] [Impact Index Per Article: 21.9] [Indexed: 01/08/2023]
Abstract
Dicer is an essential component of RNA interference (RNAi) pathways, which have broad functions in gene regulation and genome organization. Probing the consequences of tissue-restricted Dicer loss in mice indicates a critical role for Dicer during meiosis in the female germline. Mouse oocytes lacking Dicer arrest in meiosis I with multiple disorganized spindles and severe chromosome congression defects. Oogenesis and early development are times of significant post-transcriptional regulation, with controlled mRNA storage, translation, and degradation. Our results suggest that Dicer is essential for turnover of a substantial subset of maternal transcripts that are normally lost during oocyte maturation. Furthermore, we find evidence that transposon-derived sequence elements may contribute to the metabolism of maternal transcripts through a Dicer-dependent pathway. Our studies identify Dicer as central to a regulatory network that controls oocyte gene expression programs and that promotes genomic integrity in a cell type notoriously susceptible to aneuploidy.
|
210
|
Schones DE, Smith AD, Zhang MQ. Statistical significance of cis-regulatory modules. BMC Bioinformatics 2007; 8:19. [PMID: 17241466 PMCID: PMC1796902 DOI: 10.1186/1471-2105-8-19] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Received: 10/06/2006] [Accepted: 01/22/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning. RESULTS We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate p-values for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cis-regulatory module sites. CONCLUSION The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software.
|
211
|
Abstract
MOTIVATION Many heuristic algorithms have been designed to approximate P-values of DNA motifs described by position weight matrices, for evaluating their statistical significance. They often significantly deviate from the true P-value by orders of magnitude. Exact P-value computation is needed for ranking the motifs. Furthermore, surprisingly, the complexity of the problem is unknown. RESULTS We show the problem to be NP-hard, and present MotifRank, software based on dynamic programming, to calculate exact P-values of motifs. We define the exact P-value on a general and more precise model. Asymptotically, MotifRank is faster than the best exact P-value computing algorithm, and is in fact practical. Our experiments clearly demonstrate that MotifRank significantly improves the accuracy of existing approximation algorithms. AVAILABILITY MotifRank is available from http://bio.dlg.cn. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
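To make the notion of an exact P-value for a position weight matrix concrete, the sketch below computes the full score distribution by dynamic programming over matrix columns. It is not MotifRank and does not use the more general model the abstract refers to; it is the textbook special case that assumes integer column scores and an independent, identically distributed background, and all names are hypothetical.

```python
from collections import defaultdict


def score_distribution(pwm, background):
    """Exact distribution of the total PWM score for a random site.

    pwm: list of columns, each a dict mapping base -> integer score.
    background: dict mapping base -> probability (assumed i.i.d.).
    Returns a dict mapping total score -> probability.
    """
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, col_score in column.items():
                nxt[score + col_score] += prob * background[base]
        dist = dict(nxt)
    return dist


def p_value(pwm, background, threshold):
    """Probability that a random site scores at least `threshold`."""
    dist = score_distribution(pwm, background)
    return sum(prob for score, prob in dist.items() if score >= threshold)
```

The table of reachable scores grows with the score range rather than with the number of possible sites, which is why this approach is practical only when scores are integers (or rounded to a fixed granularity) with a modest range.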
|
212
|
Smith AD, Sumazin P, Zhang MQ. Tissue-specific regulatory elements in mammalian promoters. Mol Syst Biol 2007; 3:73. [PMID: 17224917 PMCID: PMC1800356 DOI: 10.1038/msb4100114] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 06/14/2006] [Accepted: 11/10/2006] [Indexed: 12/18/2022] Open
Abstract
Transcription factor-binding sites and the cis-regulatory modules they compose are central determinants of gene expression. We previously showed that binding site motifs and modules in proximal promoters can be used to predict a significant portion of mammalian tissue-specific transcription. Here, we report on a systematic analysis of promoters controlling tissue-specific expression in heart, kidney, liver, pancreas, skeletal muscle, testis and CD4 T cells, for both human and mouse. We integrated multiple sources of expression data to compile sets of transcripts with strong evidence for tissue-specific regulation. The analysis of the promoters corresponding to these sets produced a catalog of predicted tissue-specific motifs and modules, and cis-regulatory elements. Predicted regulatory interactions are supported by statistical evidence, and provide a foundation for targeted experiments that will improve our understanding of tissue-specific regulatory networks. In a broader context, methods used to construct the catalog provide a model for the analysis of genomic regions that regulate differentially expressed genes.
|
213
|
|
214
|
Wang X, Bandyopadhyay S, Xuan Z, Zhao X, Zhang MQ, Zhang X. Prediction of transcription start sites based on feature selection using AMOSA. Comput Syst Bioinformatics Conf 2007; 6:183-193. [PMID: 17951823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
To understand the regulation of gene expression, the identification of transcription start sites (TSSs) is a primary and important step. With the aim of improving computational prediction accuracy, we focus on the most challenging task, i.e., identifying TSSs within 50 bp in non-CpG related promoter regions. Due to the diversity of non-CpG related promoters, a large number of features are extracted. Effective feature selection can minimize noise, improve prediction accuracy, and reveal biologically meaningful intrinsic properties. In this paper, a newly proposed multi-objective simulated annealing based optimization method, Archive Multi-Objective Simulated Annealing (AMOSA), is integrated with Linear Discriminant Analysis (LDA) to yield a combined feature selection and classification system. This system is found to be comparable to, and often better than, several existing methods in terms of different quantitative performance measures.
|
215
|
Kim YC, Jung YC, Xuan Z, Dong H, Zhang MQ, Wang SM. Pan-genome isolation of low abundance transcripts using SAGE tag. FEBS Lett 2006; 580:6721-9. [PMID: 17113583 PMCID: PMC1791009 DOI: 10.1016/j.febslet.2006.11.013] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/01/2006] [Revised: 10/31/2006] [Accepted: 11/03/2006] [Indexed: 11/24/2022]
Abstract
The SAGE (serial analysis of gene expression) method is sensitive in detecting low-abundance transcripts. More than a third of the human SAGE tags identified are novel, representing unknown low-abundance transcripts. Using the GLGI method (generation of longer 3' EST from SAGE tag for gene identification), we converted 1009 low-copy, human X chromosome-specific SAGE tags into 10210 3' ESTs. We identified 3418 unique 3' ESTs, 46% of which are novel and originated from low-abundance transcripts. However, nearly all 3' ESTs were mapped to various regions across the genome but not to the X chromosome. Detailed analysis indicates that those 3' ESTs were isolated by SAGE tag mis-priming to non-parent transcripts. Replacing SAGE tags with non-transcribed genomic DNA tags resulted in poor amplification, indicating that the sequence similarity between different transcripts contributed to the amplification. Our study shows the prevalence of novel low-abundance transcripts that can be isolated efficiently through SAGE tag mis-priming.
|
216
|
Martinez MJ, Smith AD, Li B, Zhang MQ, Harrod KS. Computational prediction of novel components of lung transcriptional networks. Bioinformatics 2006; 23:21-9. [PMID: 17050569 DOI: 10.1093/bioinformatics/btl531] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION Little is known regarding the transcriptional mechanisms involved in forming and maintaining epithelial cell lineages of the mammalian respiratory tract. RESULTS Herein, a motif discovery approach was used to identify novel transcriptional regulators in the lung using genes previously found to be regulated by Foxa2 or Wnt signaling pathways. A human-mouse comparison of both novel and known motifs was also performed. Some of the factors and families identified here were previously shown to be involved in epithelial cell differentiation (ETS family, HES-1 and MEIS-1) and ciliogenesis (RFX family), but have never been characterized in lung epithelia. Other unidentified over-represented motifs suggest the existence of novel mammalian lung transcription factors. Of the fraction of motifs examined, we describe 25 transcription factor family predictions for lung. Fifteen novel factors were shown here to be expressed in mouse lung, and/or human bronchial or distal lung epithelial tissues or lung epithelial cell lineages. AVAILABILITY DME: http://rulai.cshl.edu/dme. MATCOMPARE: http://rulai.cshl.edu/MatCompare. MOTIFCLASS is available from the authors.
|
217
|
Li J, Zhang MQ, Zhang X. A new method for detecting human recombination hotspots and its applications to the HapMap ENCODE data. Am J Hum Genet 2006; 79:628-39. [PMID: 16960799 PMCID: PMC1592557 DOI: 10.1086/508066] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Received: 05/09/2006] [Accepted: 07/25/2006] [Indexed: 11/03/2022] Open
Abstract
Computational detection of recombination hotspots from population polymorphism data is important both for understanding the nature of recombination and for applications such as association studies. We propose a new method for this task based on a multiple-hotspot model and an (approximate) log-likelihood ratio test. A truncated, weighted pairwise log-likelihood is introduced and applied to the calculation of the log-likelihood ratio, and a forward-selection procedure is adopted to search for the optimal hotspot predictions. The method shows a relatively high power with a low false-positive rate in detecting multiple hotspots in simulation data and has a performance comparable to the best results of leading computational methods in experimental data for which recombination hotspots have been characterized by sperm-typing experiments. The method can be applied to both phased and unphased data directly, with a very fast computational speed. We applied the method to the 10 500-kb regions of the HapMap ENCODE data and found 172 hotspots among the three populations, with average hotspot width of 2.4 kb. By comparisons with the simulation data, we found some evidence that hotspots are not all identical across populations. The correlations between detected hotspots and several genomic characteristics were examined. In particular, we observed that DNaseI-hypersensitive sites are enriched in hotspots, suggesting the existence of human beta hotspots similar to those found in yeast.
|
218
|
Fang F, Fan S, Zhang X, Zhang MQ. Predicting methylation status of CpG islands in the human brain. Bioinformatics 2006; 22:2204-9. [PMID: 16837523 DOI: 10.1093/bioinformatics/btl377] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.8] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Over 50% of human genes contain CpG islands in their 5'-regions. Methylation patterns of CpG islands are involved in tissue-specific gene expression and regulation. Mis-epigenetic silencing associated with aberrant CpG island methylation is one mechanism leading to the loss of tumor suppressor functions in cancer cells. Large-scale experimental detection of DNA methylation is still both labor-intensive and time-consuming. Therefore, it is necessary to develop in silico approaches for predicting methylation status of CpG islands. RESULTS Based on a recent genome-scale dataset of DNA methylation in human brain tissues, we developed a classifier called MethCGI for predicting methylation status of CpG islands using a support vector machine (SVM). Nucleotide sequence contents as well as transcription factor binding sites (TFBSs) are used as features for the classification. The method achieves specificity of 84.65% and sensitivity of 84.32% on the brain data, and can also correctly predict about two-thirds of the data from other tissues reported in the MethDB database. AVAILABILITY An online predictor based on MethCGI is available at http://166.111.201.7/MethCGI.html. CONTACT mzhang@cshl.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online and http://166.111.201.7/help.html.
|
219
|
Smith PJ, Zhang C, Wang J, Chew SL, Zhang MQ, Krainer AR. An increased specificity score matrix for the prediction of SF2/ASF-specific exonic splicing enhancers. Hum Mol Genet 2006; 15:2490-508. [PMID: 16825284 DOI: 10.1093/hmg/ddl171] [Citation(s) in RCA: 383] [Impact Index Per Article: 21.3] [Indexed: 12/12/2022] Open
Abstract
Numerous disease-associated point mutations exert their effects by disrupting the activity of exonic splicing enhancers (ESEs). We previously derived position weight matrices to predict putative ESEs specific for four human SR proteins. The score matrices are part of ESEfinder, an online resource to identify ESEs in query sequences. We have now carried out a refined functional SELEX screen for motifs that can act as ESEs in response to the human SR protein SF2/ASF. The test BRCA1 exon under selection was internal, rather than the 3'-terminal IGHM exon used in our earlier studies. A naturally occurring heptameric ESE in BRCA1 exon 18 was replaced with two libraries of random sequences, one seven nucleotides in length, the other 14. Following three rounds of selection for in vitro splicing via internal exon inclusion, new consensus motifs and score matrices were derived. Many winner sequences were demonstrated to be functional ESEs in S100-extract-complementation assays with recombinant SF2/ASF. Motif-score threshold values were derived from both experimental and statistical analyses. Motif scores were shown to correlate with levels of exon inclusion, both in vitro and in vivo. Our results confirm and extend our earlier data, as many of the same motifs are recognized as ESEs by both the original and our new score matrix, despite the different context used for selection. Finally, we have derived an increased specificity score matrix that incorporates information from both of our SF2/ASF-specific matrices and that accurately predicts the exon-skipping phenotypes of deleterious point mutations.
|
220
|
Das R, Dimitrova N, Xuan Z, Rollins RA, Haghighi F, Edwards JR, Ju J, Bestor TH, Zhang MQ. Computational prediction of methylation status in human genomic sequences. Proc Natl Acad Sci U S A 2006; 103:10713-6. [PMID: 16818882 PMCID: PMC1502297 DOI: 10.1073/pnas.0602949103] [Citation(s) in RCA: 134] [Impact Index Per Article: 7.4] [Indexed: 01/21/2023] Open
Abstract
Epigenetic effects in mammals depend largely on heritable genomic methylation patterns. We describe a computational pattern recognition method that is used to predict the methylation landscape of human brain DNA. This method can be applied both to CpG islands and to non-CpG island regions. It computes the methylation propensity for an 800-bp region centered on a CpG dinucleotide based on specific sequence features within the region. We tested several classifiers for classification performance, including K means clustering, linear discriminant analysis, logistic regression, and support vector machine. The best performing classifier used the support vector machine approach. Our program (called hdfinder) presently has a prediction accuracy of 86%, as validated with CpG regions for which methylation status has been experimentally determined. Using hdfinder, we have depicted the entire genomic methylation patterns for all 22 human autosomes.
|
221
|
Smith AD, Sumazin P, Das D, Zhang MQ. Mining ChIP-chip data for transcription factor and cofactor binding sites. Bioinformatics 2006; 21 Suppl 1:i403-12. [PMID: 15961485 DOI: 10.1093/bioinformatics/bti1043] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Identification of single motifs and motif pairs that can be used to predict transcription factor localization in ChIP-chip data, and gene expression in tissue-specific microarray data. RESULTS We describe methodology to identify de novo individual and interacting pairs of binding site motifs from ChIP-chip data, using an algorithm that integrates localization data directly into the motif discovery process. We combine matrix-enumeration based motif discovery with multivariate regression to evaluate candidate motifs and identify motif interactions. When applied to the HNF localization data in liver and pancreatic islets, our methods produce motifs that are either novel or improved known motifs. All motif pairs identified to predict localization are further evaluated according to how well they predict expression in liver and islets and according to how conserved are the relative positions of their occurrences. We find that interaction models of HNF1 and CDP motifs provide excellent prediction of both HNF1 localization and gene expression in liver. Our results demonstrate that ChIP-chip data can be used to identify interacting binding site motifs. AVAILABILITY Motif discovery programs and analysis tools are available on request from the authors.
|
222
|
Das D, Nahlé Z, Zhang MQ. Adaptively inferring human transcriptional subnetworks. Mol Syst Biol 2006; 2:2006.0029. [PMID: 16760900 PMCID: PMC1681499 DOI: 10.1038/msb4100067] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.6] [Received: 09/23/2005] [Accepted: 03/28/2006] [Indexed: 12/21/2022] Open
Abstract
Although the human genome has been sequenced, progress in understanding gene regulation in humans has been particularly slow. Many computational approaches developed for lower eukaryotes to identify cis-regulatory elements and their associated target genes often do not generalize to mammals, largely due to the degenerate and interactive nature of such elements. Motivated by the switch-like behavior of transcriptional responses, we present a systematic approach that allows adaptive determination of active transcriptional subnetworks (cis-motif combinations, the direct target genes and physiological processes regulated by the corresponding transcription factors) from microarray data in mammals, with accuracy similar to that achieved in lower eukaryotes. Our analysis uncovered several new subnetworks active in human liver and in cell-cycle regulation, with similar functional characteristics as the known ones. We present biochemical evidence for our predictions, and show that the recently discovered G2/M-specific E2F pathway is wider than previously thought; in particular, E2F directly activates certain mitotic genes involved in hepatocellular carcinomas. Additionally, we demonstrate that this method can predict subnetworks in a condition-specific manner, as well as regulatory crosstalk across multiple tissues. Our approach allows systematic understanding of how phenotypic complexity is regulated at the transcription level in mammals and offers marked advantage in systems where little or no prior knowledge of transcriptional regulation is available.
|
223
|
Suzuki H, Zuo Y, Wang J, Zhang MQ, Malhotra A, Mayeda A. Characterization of RNase R-digested cellular RNA source that consists of lariat and circular RNAs from pre-mRNA splicing. Nucleic Acids Res 2006; 34:e63. [PMID: 16682442 PMCID: PMC1458517 DOI: 10.1093/nar/gkl151] [Citation(s) in RCA: 474] [Impact Index Per Article: 26.3] [Indexed: 02/03/2023] Open
Abstract
Besides linear RNAs, pre-mRNA splicing generates three forms of RNAs: lariat introns, Y-structure introns from trans-splicing, and circular exons through exon skipping. To study the persistence of excised introns in total cellular RNA, we used three Escherichia coli 3' to 5' exoribonucleases. Ribonuclease R (RNase R) thoroughly degrades the abundant linear RNAs and the Y-structure RNA, while preserving the loop portion of a lariat RNA. Ribonuclease II (RNase II) and polynucleotide phosphorylase (PNPase) also preserve the lariat loop, but are less efficient in degrading linear RNAs. RNase R digestion of the total RNA from human skeletal muscle generates an RNA pool consisting of lariat and circular RNAs. RT-PCR across the branch sites confirmed lariat RNAs and circular RNAs in the pool generated by constitutive and alternative splicing of the dystrophin pre-mRNA. Our results indicate that RNase R treatment can be used to construct an intronic cDNA library, in which the majority of the intron lariats are represented. The highly specific activity of RNase R implies its ability to screen for rare intragenic trans-splicing in any target gene with a large background of cis-splicing. Further analysis of the intronic RNA pool from a specific tissue or cell will provide insights into the global profile of alternative splicing.
|
224
|
Zhang C, Xuan Z, Otto S, Hover JR, McCorkle SR, Mandel G, Zhang MQ. A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Res 2006; 34:2238-46. [PMID: 16670430 PMCID: PMC1456330 DOI: 10.1093/nar/gkl248] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.8] [Indexed: 12/03/2022] Open
Abstract
Transcription factor binding sites (TFBSs) are short DNA sequences interacting with transcription factors (TFs), which regulate gene expression. Due to the relatively short length of such binding sites, it is largely unclear how the specificity of protein–DNA interaction is achieved. Here, we have performed a genome-wide analysis of TFBS-like sequences for the transcriptional repressor, RE1 Silencing Transcription Factor (REST), as well as for several other representative mammalian TFs (c-myc, p53, HNF-1 and CREB). We find a nonrandom distribution of inexact sites for these TFs, referred to as highly-degenerate TFBSs, that are enriched around the cognate binding sites. Comparisons among human, mouse and rat orthologous promoters reveal that these highly-degenerate sites are conserved significantly more than expected by random chance, suggesting their positive selection during evolution. We propose that this arrangement provides a favorable genomic landscape for functional target site selection.
|
225
|
Zhang C, Li HR, Fan JB, Wang-Rodriguez J, Downs T, Fu XD, Zhang MQ. Profiling alternatively spliced mRNA isoforms for prostate cancer classification. BMC Bioinformatics 2006; 7:202. [PMID: 16608523 PMCID: PMC1458362 DOI: 10.1186/1471-2105-7-202] [Citation(s) in RCA: 70] [Impact Index Per Article: 3.9] [Received: 10/05/2005] [Accepted: 04/11/2006] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND Prostate cancer is one of the leading causes of cancer illness and death among men in the United States and worldwide. There is an urgent need to discover good biomarkers for early clinical diagnosis and treatment. Previously, we developed an exon-junction microarray-based assay and profiled 1532 mRNA splice isoforms from 364 potential prostate cancer related genes in 38 prostate tissues. Here, we investigate the advantage of using splice isoforms, which couple transcriptional and splicing regulation, for cancer classification. RESULTS As many as 464 splice isoforms from more than 200 genes are differentially regulated in tumors at a false discovery rate (FDR) of 0.05. Remarkably, about 30% of genes have isoforms that are called significant but do not exhibit differential expression at the overall mRNA level. A support vector machine (SVM) classifier trained on 128 signature isoforms can correctly predict 92% of the cases, which outperforms the classifier using overall mRNA abundance by about 5%. It is also observed that the classification performance can be improved using multivariate variable selection methods, which take correlation among variables into account. CONCLUSION These results demonstrate that profiling of splice isoforms is able to provide unique and important information which cannot be detected by conventional microarrays.
|
226
|
Smith AD, Sumazin P, Xuan Z, Zhang MQ. DNA motifs in human and mouse proximal promoters predict tissue-specific expression. Proc Natl Acad Sci U S A 2006; 103:6275-80. [PMID: 16606849 PMCID: PMC1458868 DOI: 10.1073/pnas.0508169103] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.2] [Indexed: 11/18/2022] Open
Abstract
Comprehensive identification of cis-regulatory elements is necessary for accurately reconstructing gene regulatory networks. We studied proximal promoters of human and mouse genes with differential expression across 56 terminally differentiated tissues. Using in silico techniques to discover, evaluate, and model interactions among sequence elements, we systematically identified regulatory modules that distinguish elevated from inhibited expression in the corresponding transcripts. We used these putative regulatory modules to construct a single predictive model for each of the 56 tissues. These predictors distinguish tissue-specific elevated from inhibited expression with statistical significance in 80% of the tissues (45 of 56). The predictors also reveal synergy between cis-regulatory modules and explain large-scale tissue-specific differential expression. For testis and liver, the predictors include computationally predicted motifs. For most other tissues, the predictors reveal synergy between experimentally verified motifs and indicate genes that are regulated by similar tissue-specific machinery. The identification in proximal promoters of cis-regulatory modules with tissue-specific activity lays the groundwork for complete characterization and deciphering of cis-regulatory DNA code in mammalian genomes.
|
227
|
Rollins RA, Haghighi F, Edwards JR, Das R, Zhang MQ, Ju J, Bestor TH. Large-scale structure of genomic methylation patterns. Genome Res 2006; 16:157-63. [PMID: 16365381 PMCID: PMC1361710 DOI: 10.1101/gr.4362006] [Citation(s) in RCA: 289] [Impact Index Per Article: 16.1] [Received: 06/29/2005] [Accepted: 09/19/2005] [Indexed: 11/24/2022]
Abstract
The mammalian genome depends on patterns of methylated cytosines for normal function, but the relationship between genomic methylation patterns and the underlying sequence is unclear. We have characterized the methylation landscape of the human genome by global analysis of patterns of CpG depletion and by direct sequencing of 3073 unmethylated domains and 2565 methylated domains from human brain DNA. The genome was found to consist of short (<4 kb) unmethylated domains embedded in a matrix of long methylated domains. Unmethylated domains were enriched in promoters, CpG islands, and first exons, while methylated domains comprised interspersed and tandem-repeated sequences, exons other than first exons, and non-annotated single-copy sequences that are depleted in the CpG dinucleotide. The enrichment of regulatory sequences in the relatively small unmethylated compartment suggests that cytosine methylation constrains the effective size of the genome through the selective exposure of regulatory sequences. This buffers regulatory networks against changes in total genome size and provides an explanation for the C value paradox, which concerns the wide variations in genome size that scale independently of gene number. This suggestion is compatible with the finding that cytosine methylation is universal among large-genome eukaryotes, while many eukaryotes with genome sizes <5 × 10^8 bp do not methylate their DNA.
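CpG depletion of the kind mentioned in this abstract is commonly quantified with an observed/expected CpG ratio. The following is a minimal sketch under that assumption; the abstract does not state which statistic the authors used, and the function name `cpg_obs_exp` is hypothetical.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Assumption: this is the widely used o/e definition, not necessarily
    the statistic used in the paper summarized above.
    """
    s = seq.upper()
    c = s.count("C")
    g = s.count("G")
    cg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both C and G; report 0
    return cg * len(s) / (c * g)


if __name__ == "__main__":
    # A CpG-rich toy sequence versus a CpG-poor one with the same base composition.
    print(cpg_obs_exp("CGCGCGCG"))
    print(cpg_obs_exp("CCCCGGGG"))
```

Values well below 1 indicate fewer CpG dinucleotides than the base composition alone would predict, which is the sense of "depleted in the CpG dinucleotide" above.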
|
228
|
Prasanth KV, Prasanth SG, Xuan Z, Hearn S, Freier SM, Bennett CF, Zhang MQ, Spector DL. Regulating gene expression through RNA nuclear retention. Cell 2005; 123:249-63. [PMID: 16239143 DOI: 10.1016/j.cell.2005.08.033] [Citation(s) in RCA: 539] [Impact Index Per Article: 28.4] [Received: 02/26/2005] [Revised: 06/08/2005] [Accepted: 08/09/2005] [Indexed: 01/18/2023]
Abstract
Multiple mechanisms have evolved to regulate the eukaryotic genome. We have identified CTN-RNA, a mouse tissue-specific approximately 8 kb nuclear-retained poly(A)+ RNA that regulates the level of its protein-coding partner. CTN-RNA is transcribed from the protein-coding mouse cationic amino acid transporter 2 (mCAT2) gene through alternative promoter and poly(A) site usage. CTN-RNA is diffusely distributed in nuclei and is also localized to paraspeckles. The 3'UTR of CTN-RNA contains elements for adenosine-to-inosine editing, involved in its nuclear retention. Interestingly, knockdown of CTN-RNA also downregulates mCAT2 mRNA. Under stress, CTN-RNA is posttranscriptionally cleaved to produce protein-coding mCAT2 mRNA. Our findings reveal a role of the cell nucleus in harboring RNA molecules that are not immediately needed to produce proteins but whose cytoplasmic presence is rapidly required upon physiologic stress. This mechanism of action highlights an important paradigm for the role of a nuclear-retained stable RNA transcript in regulating gene expression.
MESH Headings
- 3' Untranslated Regions/genetics
- Animals
- Base Sequence
- Cationic Amino Acid Transporter 2/genetics
- Cationic Amino Acid Transporter 2/metabolism
- Cell Fractionation
- Cell Line
- Cell Line, Tumor
- Cell Nucleus/metabolism
- Chromosomes
- Gene Expression Regulation
- Genes, Reporter
- Genome
- Green Fluorescent Proteins/metabolism
- In Situ Hybridization, Fluorescence
- Interferon-gamma/pharmacology
- Lipopolysaccharides/pharmacology
- Mice
- Models, Biological
- Molecular Sequence Data
- NIH 3T3 Cells
- Oligonucleotides, Antisense/pharmacology
- Poly A/genetics
- Precipitin Tests
- Promoter Regions, Genetic
- RNA/genetics
- RNA/metabolism
- RNA Editing
- RNA Processing, Post-Transcriptional
- RNA, Messenger/analysis
- RNA, Small Nuclear/metabolism
- Reverse Transcriptase Polymerase Chain Reaction
- Sequence Analysis, RNA
- Transcription, Genetic
|
229
|
Wang J, Smith PJ, Krainer AR, Zhang MQ. Distribution of SR protein exonic splicing enhancer motifs in human protein-coding genes. Nucleic Acids Res 2005; 33:5053-62. [PMID: 16147989 PMCID: PMC1201331 DOI: 10.1093/nar/gki810] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.1] [Indexed: 11/13/2022] Open
Abstract
Exonic splicing enhancers (ESEs) are pre-mRNA cis-acting elements required for splice-site recognition. We previously developed a web-based program called ESEfinder that scores any sequence for the presence of ESE motifs recognized by the human SR proteins SF2/ASF, SRp40, SRp55 and SC35. Using ESEfinder, we have undertaken a large-scale analysis of ESE motif distribution in human protein-coding genes. Significantly higher frequencies of ESE motifs were observed in constitutive internal protein-coding exons, compared with both their flanking intronic regions and with pseudo exons. Statistical analysis of ESE motif frequency distributions revealed a complex relationship between splice-site strength and increased or decreased frequencies of particular SR protein motifs. Comparison of constitutively and alternatively spliced exons demonstrated slightly weaker splice-site scores, as well as significantly fewer ESE motifs, in the alternatively spliced group. Our results underline the importance of ESE-mediated SR protein function in the process of exon definition, in the context of both constitutive splicing and regulated alternative splicing.
|
230
|
Xuan Z, Zhao F, Wang J, Chen G, Zhang MQ. Genome-wide promoter extraction and analysis in human, mouse, and rat. Genome Biol 2005; 6:R72. [PMID: 16086854 PMCID: PMC1273639 DOI: 10.1186/gb-2005-6-8-r72] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.8] [Received: 03/29/2005] [Revised: 05/23/2005] [Accepted: 07/11/2005] [Indexed: 01/27/2023] Open
Abstract
Large-scale and high-throughput genomics research needs reliable and comprehensive genome-wide promoter annotation resources. We have conducted a systematic investigation on how to improve mammalian promoter prediction by incorporating both transcript and conservation information. This enabled us to build a better multispecies promoter annotation pipeline and hence to create CSHLmpd (Cold Spring Harbor Laboratory Mammalian Promoter Database) for the biomedical research community, which can act as a starting reference system for more refined functional annotations.
|
231
|
Zhao F, Xuan Z, Liu L, Zhang MQ. TRED: a Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies. Nucleic Acids Res 2005; 33:D103-7. [PMID: 15608156 PMCID: PMC539958 DOI: 10.1093/nar/gki004] [Citation(s) in RCA: 148] [Impact Index Per Article: 7.8] [Indexed: 12/15/2022] Open
Abstract
In order to understand gene regulation, accurate and comprehensive knowledge of transcriptional regulatory elements is essential. Here, we report our efforts in building a mammalian Transcriptional Regulatory Element Database (TRED) with associated data analysis functions. It collects cis- and trans-regulatory elements and is dedicated to easy data access and analysis for both single-gene-based and genome-scale studies. Distinguishing features of TRED include: (i) relatively complete genome-wide promoter annotation for human, mouse and rat; (ii) availability of gene transcriptional regulation information including transcription factor binding sites and experimental evidence; (iii) data accuracy is ensured by hand curation; (iv) efficient user interface for easy and flexible data retrieval; and (v) implementation of on-the-fly sequence analysis tools. TRED can provide good training datasets for further genome-wide cis-regulatory element prediction and annotation, assist detailed functional studies and facilitate the deciphering of gene regulatory networks (http://rulai.cshl.edu/TRED).
|
232
|
Dike S, Balija VS, Nascimento LU, Xuan Z, Ou J, Zutavern T, Palmer LE, Hannon G, Zhang MQ, McCombie WR. The mouse genome: experimental examination of gene predictions and transcriptional start sites. Genome Res 2005; 14:2424-9. [PMID: 15574821 PMCID: PMC534666 DOI: 10.1101/gr.3158304] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/25/2022]
Abstract
The completion of the mouse and other mammalian genome sequences will provide necessary, but not sufficient, knowledge for an understanding of much of mouse biology at the molecular level. As a requisite next step in this process, the genes in mouse and their structure must be elucidated. In particular, knowledge of the transcriptional start site of these genes will be necessary for further study of their regulatory regions. To assess the current state of mouse genome annotation to support this activity, we identified several hundred gene predictions in mouse with varying levels of supporting evidence and tested them using RACE-PCR. Modifications were made to the procedure allowing pooling of RNA samples, resulting in a scaleable procedure. The results illustrate potential errors or omissions in the current 5' end annotations in 58% of the genes detected. In testing experimentally unsupported gene predictions, we were able to identify 58 that are not usually annotated as genes but produced spliced transcripts (approximately 25% success rate). In addition, in many genes we were able to detect novel exons not predicted by any gene prediction algorithms. In 19.8% of the genes detected in this study, multiple transcript species were observed. These data show an urgent need to provide direct experimental validation of gene annotations. Moreover, these results show that direct validation using RACE-PCR can be an important component of genome-wide validation. This approach can be a useful tool in the ongoing efforts to increase the quality of gene annotations, especially transcriptional start sites, in complex genomes.
|
233
|
Smith AD, Sumazin P, Zhang MQ. Identifying tissue-selective transcription factor binding sites in vertebrate promoters. Proc Natl Acad Sci U S A 2005; 102:1560-5. [PMID: 15668401 PMCID: PMC547828 DOI: 10.1073/pnas.0406123102] [Citation(s) in RCA: 101] [Impact Index Per Article: 5.3] [Indexed: 01/20/2023] Open
Abstract
We present a computational method aimed at systematically identifying tissue-selective transcription factor binding sites. Our method focuses on the differences between sets of promoters that are associated with differentially expressed genes, and it is effective at identifying the highly degenerate motifs that characterize vertebrate transcription factor binding sites. Results on simulated data indicate that our method detects motifs with greater accuracy than the leading methods, and its detection of strongly overrepresented motifs is nearly perfect. We present motifs identified by our method as the most overrepresented in promoters of liver- and muscle-selective genes, demonstrating that our method accurately identifies known transcription factor binding sites and previously uncharacterized motifs.
|
234
|
Abstract
Longevity regulatory genes include the Forkhead transcription factor FOXO, in addition to NAD-dependent histone deacetylase silent information regulator 2 (Sir2). The FOXO/DAF-16 family of transcription factors constitute an evolutionarily conserved subgroup within a larger family known as winged helix or Forkhead transcriptional regulators. Here we demonstrate how to identify FOXO target genes and their potential cis-regulatory binding sites in the promoters via bioinformatics approaches. These results provide new testable hypotheses for further experimental verifications.
|
235
|
Abstract
Cooperativity between transcription factors is critical to gene regulation. Current computational methods do not take adequate account of this salient aspect. To address this issue, we present a computational method based on multivariate adaptive regression splines to correlate the occurrences of transcription factor binding motifs in the promoter DNA and their interactions to the logarithm of the ratio of gene expression levels. This allows us to discover both the individual motifs and synergistic pairs of motifs that are most likely to be functional, and enumerate their relative contributions at any arbitrary time point for which mRNA expression data are available. We present results of simulations and focus specifically on the yeast cell-cycle data. Inclusion of synergistic interactions can increase the prediction accuracy over linear regression by as much as 1.5- to 3.5-fold. Significant motifs and combinations of motifs are appropriately predicted at each stage of the cell cycle. We believe our multivariate adaptive regression splines-based approach will become more significant when applied to higher eukaryotes, especially mammals, where cooperative control of gene regulation is absolutely essential.
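Multivariate adaptive regression splines build their models from hinge functions of the form max(0, x - t) and from products of such functions, which is how pairwise interactions enter. The sketch below illustrates only that building block for two motif counts; the knot values and the names `hinge` and `interaction_term` are illustrative assumptions, not the model fitted in the paper.

```python
def hinge(x: float, knot: float, sign: int = 1) -> float:
    """MARS-style hinge basis function: max(0, sign * (x - knot))."""
    return max(0.0, sign * (x - knot))


def interaction_term(count_a: float, count_b: float,
                     knot_a: float = 0.0, knot_b: float = 0.0) -> float:
    """Product of two hinge functions.

    It is non-zero only when both motif counts exceed their knots,
    which is one way a synergistic motif pair can be represented.
    Knots here are hypothetical defaults, not fitted values.
    """
    return hinge(count_a, knot_a) * hinge(count_b, knot_b)


if __name__ == "__main__":
    # Promoter with 2 copies of motif A and 3 copies of motif B.
    print(interaction_term(2, 3))
    # Promoter lacking motif A: the pair term vanishes.
    print(interaction_term(0, 3))
```

In a fitted model, terms like these would be weighted by regression coefficients and summed to predict the log expression ratio.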
|
236
|
Zhang MQ. Prediction, annotation, and analysis of human promoters. COLD SPRING HARBOR SYMPOSIA ON QUANTITATIVE BIOLOGY 2004; 68:217-25. [PMID: 15338621 DOI: 10.1101/sqb.2003.68.217] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
|
237
|
Zhang J, Li F, Li J, Zhang MQ, Zhang X. Evidence and characteristics of putative human α recombination hotspots. Hum Mol Genet 2004; 13:2823-8. [PMID: 15385449 DOI: 10.1093/hmg/ddh310] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022] Open
Abstract
Understanding recombination rate variation is very important for studying genome diversity and evolution, and for investigation of phenotypic association and genetic diseases. Recombination hotspots have been observed in many species and are well studied in yeast. A recent study demonstrated that recombination hotspots are also a ubiquitous feature of the human genome, but the nature of human hotspots remains largely unknown. We have developed and validated a novel computational method for testing the existence of hotspots as well as for localizing them with either unphased or phased genotyping data. To study the characteristics of hotspots within or close to genes, we scanned for unusually high levels of recombination using the European population samples in the SeattleSNPs database, and found evidence for the existence of human alpha hotspots similar to those of yeast. This type of hotspot, found at promoter regions, accounts for about half of the total detected and appears to depend on some specific transcription factor binding sites (such as CGCCCCCGC). These characteristics can explain the observed weak correlation between hotspots and GC-content, and their variation may contribute to the diversity of hotspot distribution among different individuals and species. These long-sought putative human alpha recombination hotspots deserve further experimental investigation.
|
238
|
Abstract
MOTIVATION Tissue-specific transcription factor binding sites give insight into tissue-specific transcription regulation. RESULTS We describe a word-counting-based tool for de novo tissue-specific transcription factor binding site discovery using expression information in addition to sequence information. We incorporate tissue-specific gene expression through gene classification to positive expression and repressed expression. We present a direct statistical approach to find overrepresented transcription factor binding sites in a foreground promoter sequence set against a background promoter sequence set. Our approach naturally extends to synergistic transcription factor binding site search. We find putative transcription factor binding sites that are overrepresented in the proximal promoters of liver-specific genes relative to proximal promoters of liver-independent genes. Our results indicate that binding sites for hepatocyte nuclear factors (especially HNF-1 and HNF-4) and CCAAT/enhancer-binding protein (C/EBPbeta) are the most overrepresented in proximal promoters of liver-specific genes. Our results suggest that HNF-4 has strong synergistic relationships with HNF-1, HNF-4 and HNF-3beta and with C/EBPbeta. AVAILABILITY Programs are available for use over the Web at http://rulai.cshl.edu/tools/dwe.
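The foreground-versus-background comparison described in this abstract can be illustrated with a small word-counting example. This is a sketch only: it assumes a hypergeometric tail test on the number of sequences containing a word, which may differ from the statistic in the authors' tool, and all function names and sequences are hypothetical.

```python
from math import comb


def sequences_with_word(seqs, word):
    """Number of sequences that contain the word at least once."""
    return sum(1 for s in seqs if word in s.upper())


def hypergeom_tail(k, n_fg, n_hits, n_total):
    """P(X >= k) for X ~ Hypergeometric(n_total, n_hits, n_fg)."""
    upper = min(n_hits, n_fg)
    favourable = sum(
        comb(n_hits, i) * comb(n_total - n_hits, n_fg - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_fg)


def word_enrichment(foreground, background, word):
    """Return (foreground hits, background hits, enrichment p-value)."""
    k = sequences_with_word(foreground, word)
    b = sequences_with_word(background, word)
    n_total = len(foreground) + len(background)
    return k, b, hypergeom_tail(k, len(foreground), k + b, n_total)


if __name__ == "__main__":
    # Toy promoter sets; real input would be proximal promoter sequences.
    fg = ["ACGTTAATG", "GGTTAATCC", "TTAATGGGA"]
    bg = ["ACGTACGTA", "GGGGCCCCA", "ATATATATA"]
    print(word_enrichment(fg, bg, "TTAAT"))
```

Scanning every word of a fixed length this way and ranking by p-value gives a simple de novo candidate list; a real analysis would also need multiple-testing correction.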
|
239
|
Schones DE, Sumazin P, Zhang MQ. Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics 2004; 21:307-13. [PMID: 15319260 DOI: 10.1093/bioinformatics/bth480] [Citation(s) in RCA: 86] [Impact Index Per Article: 4.3] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Transcription-factor binding sites (TFBS) in promoter sequences of higher eukaryotes are commonly modeled using position frequency matrices (PFM). The ability to compare PFMs representing binding sites is especially important for de novo sequence motif discovery, where it is desirable to compare putative matrices to one another and to known matrices. RESULTS We describe a PFM similarity quantification method based on product multinomial distributions, demonstrate its ability to identify PFM similarity and show that it has a better false positive to false negative ratio compared to existing methods. We grouped TFBS frequency matrices from two libraries into matrix families and identified the matrices that are common and unique to these libraries. We identified similarities and differences between the skeletal-muscle-specific and non-muscle-specific frequency matrices for the binding sites of Mef-2, Myf, Sp-1, SRF and TEF of Wasserman and Fickett. We further identified known frequency matrices and matrix families that were strongly similar to the matrices given by Wasserman and Fickett. We provide methodology and tools to compare and query libraries of frequency matrices for TFBSs. AVAILABILITY Software is available to use over the Web at http://rulai.cshl.edu/MatCompare SUPPLEMENTARY INFORMATION Database and clustering statistics, matrix families and representatives are available at http://rulai.cshl.edu/MatCompare/Supplementary.
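To make the idea of comparing position frequency matrices concrete, here is a deliberately simple stand-in: the mean column-wise Euclidean distance between normalized frequencies of two equal-width matrices. It is not the product-multinomial method described in the abstract, and the names `normalize` and `mean_column_distance` are hypothetical.

```python
from math import sqrt


def normalize(pfm):
    """Convert each column of counts (A, C, G, T) to frequencies."""
    return [[count / sum(col) for count in col] for col in pfm]


def mean_column_distance(pfm_a, pfm_b):
    """Mean Euclidean distance between corresponding columns.

    Assumption: both matrices are already aligned and have equal width;
    a real comparison would also have to try shifted alignments.
    """
    if len(pfm_a) != len(pfm_b):
        raise ValueError("matrices must have the same width")
    fa, fb = normalize(pfm_a), normalize(pfm_b)
    dists = [
        sqrt(sum((x - y) ** 2 for x, y in zip(col_a, col_b)))
        for col_a, col_b in zip(fa, fb)
    ]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    # Each inner list is one motif position with counts for A, C, G, T.
    m1 = [[10, 0, 0, 0], [0, 10, 0, 0]]
    m2 = [[5, 0, 0, 0], [0, 5, 0, 0]]    # same preferences, fewer sites
    m3 = [[0, 10, 0, 0], [0, 10, 0, 0]]  # differs at the first position
    print(mean_column_distance(m1, m2))
    print(mean_column_distance(m1, m3))
```

A distance of this kind ignores how many sites support each matrix, which is one reason a count-aware statistical comparison such as the one the abstract describes can be preferable.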
|
240
|
Kato M, Hata N, Banerjee N, Futcher B, Zhang MQ. Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol 2004; 5:R56. [PMID: 15287978 PMCID: PMC507881 DOI: 10.1186/gb-2004-5-8-r56] [Citation(s) in RCA: 133] [Impact Index Per Article: 6.7] [Received: 03/03/2004] [Revised: 04/26/2004] [Accepted: 06/28/2004] [Indexed: 02/01/2023] Open
Abstract
A novel method that integrates chromatin immunoprecipitation data with microarray expression data and combinatorial TF-motif analysis was used to systematically identify combinations of transcription factors and of motifs and to reconstruct a new combinatorial regulatory map of the yeast cell cycle. BACKGROUND Combinatorial interaction of transcription factors (TFs) is important for gene regulation. Although various genomic datasets are relevant to this issue, each dataset provides relatively weak evidence on its own. Developing methods that can integrate different sequence, expression and localization data has become important. RESULTS Here we use a novel method that integrates chromatin immunoprecipitation (ChIP) data with microarray expression data and with combinatorial TF-motif analysis. We systematically identify combinations of transcription factors and of motifs. The various combinations of TFs involved multiple binding mechanisms. We reconstruct a new combinatorial regulatory map of the yeast cell cycle in which cell-cycle regulation can be drawn as a chain of extended TF modules. We find that the pairwise combination of a TF for an early cell-cycle phase and a TF for a later phase is often used to control gene expression at intermediate times. Thus the number of distinct times of gene expression is greater than the number of transcription factors. We also see that some TF modules control branch points (cell-cycle entry and exit), and in the presence of appropriate signals they can allow progress along alternative pathways. CONCLUSIONS Combining different data sources can increase statistical power as demonstrated by detecting TF interactions and composite TF-binding motifs. The original picture of a chain of simple cell-cycle regulators can be extended to a chain of composite regulatory modules: different modules may share a common TF component in the same pathway or a TF component cross-talking to other pathways.
|
241
|
Banerjee N, Zhang MQ. Identifying cooperativity among transcription factors controlling the cell cycle in yeast. Nucleic Acids Res 2004; 31:7024-31. [PMID: 14627835 PMCID: PMC290262 DOI: 10.1093/nar/gkg894] [Citation(s) in RCA: 131] [Impact Index Per Article: 6.6] [Indexed: 01/27/2023] Open
Abstract
Transcription regulation in eukaryotes is known to occur through the coordinated action of multiple transcription factors (TFs). Recently, a few genome-wide transcription studies have begun to explore the combinatorial nature of TF interactions. We propose a novel approach that reveals how multiple TFs cooperate to regulate transcription in the yeast cell cycle. Our method integrates genome-wide gene expression data and chromatin immunoprecipitation (ChIP-chip) data to discover more biologically relevant synergistic interactions between different TFs and their target genes than previous studies. Given any pair of TFs A and B, we define a novel measure of cooperativity between the two TFs based on the expression patterns of sets of target genes of only A, only B, and both A and B. If the cooperativity measure is significant then there is reason to postulate that the presence of both TFs is needed to influence gene expression. Our results indicate that many cooperative TFs that were previously characterized experimentally indeed have high values of cooperativity measures in our analysis. In addition, we propose several novel, experimentally testable predictions of cooperative TFs that play a role in the cell cycle and other biological processes. Many of them hold interesting clues for cross talk between the cell cycle and other processes including metabolism, stress response and pseudohyphal differentiation. Finally, we have created a web tool where researchers can explore the exhaustive list of cooperative TFs and survey the graphical representation of the target genes' expression profiles. The interface includes a tool to dynamically draw a TF cooperativity network of 113 TFs with user-defined significance levels. This study is an example of how systematic combination of diverse data types along with new functional genomic approaches can provide a rigorous platform to map TF interactions more efficiently.
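The three target-gene sets named in this abstract (bound by only A, only B, and both A and B) are easy to construct from binding data. The sketch below builds them and compares sets using mean pairwise Pearson correlation of expression profiles as an illustrative coherence score; that score is an assumption for demonstration, since the abstract does not give the formula of the authors' cooperativity measure. All names and data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def partition_targets(targets_a, targets_b):
    """Split target genes into (only A, only B, both A and B)."""
    a, b = set(targets_a), set(targets_b)
    return a - b, b - a, a & b


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((p - mx) * (q - my) for p, q in zip(x, y))
    den = (sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y)) ** 0.5
    return num / den if den else 0.0


def coherence(genes, expression):
    """Mean pairwise correlation within a gene set (0.0 if fewer than 2 genes)."""
    pairs = list(combinations(sorted(genes), 2))
    if not pairs:
        return 0.0
    return mean(pearson(expression[g], expression[h]) for g, h in pairs)


if __name__ == "__main__":
    # Toy expression time courses keyed by hypothetical gene names.
    expression = {
        "g1": [1, 2, 3], "g2": [2, 4, 6],
        "g3": [3, 2, 1], "g4": [1, 2, 3.5],
    }
    only_a, only_b, both = partition_targets(["g1", "g2", "g3"], ["g1", "g2", "g4"])
    print(sorted(only_a), sorted(only_b), sorted(both))
    print(coherence(both, expression))
```

If the jointly bound set is markedly more coherent than either single-factor set, that is the kind of pattern a cooperativity measure is meant to capture; assessing significance would require a proper null model.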
|
242
|
Chen G, Hata N, Zhang MQ. Transcription factor binding element detection using functional clustering of mutant expression data. Nucleic Acids Res 2004; 32:2362-71. [PMID: 15115798 PMCID: PMC419446 DOI: 10.1093/nar/gkh557] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
As a powerful tool to reveal gene functions, gene mutation has been used extensively in molecular biology studies. With high throughput technologies, such as DNA microarray, genome-wide gene expression changes can be monitored in mutants. Here we present a simple approach to detect the transcription-factor-binding motif using microarray expression data from a mutant in which the relevant transcription factor is deleted. A core part of our approach is clustering of differentially expressed genes based on functional annotations, such as Gene Ontology (GO). We tested our method with eight microarray data sets from the Rosetta Compendium and were able to detect canonical binding motifs for at least four transcription factors. With the support of chromatin IP chip data, we also predict a possible variant of the Swi4 binding motif and recover a core motif for Arg80. Our approach should be readily applicable to microarray experiments using other types of molecular biology techniques, such as conditional knockout/overexpression or RNAi-mediated 'knockdown', to perturb the expression of a transcription factor. Functional clustering included in our approach may also provide new insights into the function of the relevant transcription factor.
|
243
|
Davuluri RV, Zhang MQ. Computer software to find genes in plant genomic DNA. Methods in Molecular Biology (Clifton, N.J.) 2004; 236:87-108. [PMID: 14501060 DOI: 10.1385/1-59259-413-1:87] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 04/27/2023]
Abstract
Gene finding is the most important phase of genome annotation. Eukaryotic genomes contain thousands of protein coding genes, and computational gene prediction would rapidly increase the pace of experimental confirmation of expressed genes at the bench. The purpose of this chapter is to discuss the use of different computer programs that identify protein-coding genes in large genomic sequences. We describe the most commonly used gene prediction programs that are available on the World Wide Web and demonstrate the use of some of these programs by an example. We provide a list of these programs along with their Web uniform resource locators (URLs) and suggest guidelines for successful gene finding.
|
244
|
Long F, Liu H, Hahn C, Sumazin P, Zhang MQ, Zilberstein A. Genome-wide prediction and analysis of function-specific transcription factor binding sites. In Silico Biol 2004; 4:395-410. [PMID: 15506990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2023]
Abstract
DNA-binding transcription factors play a central role in transcription regulation, and the annotation of transcription-factor binding sites in upstream regions of human genes is essential for building a genome-wide regulatory network. We describe methodology to accurately predict the transcription-factor binding sites in the proximal-promoter region of function-specific genes. In order to increase the accuracy of transcription factor binding-site prediction, we rely on recent genome sequence data, known transcription factor binding-site matrices, and Gene Ontology biological-function-based gene classification. Using TRANSFAC position-frequency matrices, we detected individual and cooperating transcription-factor binding sites in proximal promoters of ENSEMBL annotated human genes. We used the over representation of detected binding sites in the proximal promoters as compared to the second exons to control specificity. We confirmed the majority of transcription-factor binding sites predicted in proximal promoters of immune-response genes with evidence from existing literature. We validated the predicted cooperation between transcription factors NF-kappa B and IRF in the regulation of gene expression with microarray transcript profiling data and literature-derived protein-protein interaction network. We also identified over-represented individual and pairs of transcription-factor binding sites in the proximal promoters of each Gene Ontology biological-process gene group. Our tools and analysis provide a new resource for deciphering transcription regulation in different biological paradigms.
|
245
|
Cartegni L, Wang J, Zhu Z, Zhang MQ, Krainer AR. ESEfinder: A web resource to identify exonic splicing enhancers. Nucleic Acids Res 2003; 31:3568-71. [PMID: 12824367 PMCID: PMC169022 DOI: 10.1093/nar/gkg616] [Citation(s) in RCA: 1202] [Impact Index Per Article: 57.2] [Indexed: 11/13/2022] Open
Abstract
Point mutations frequently cause genetic diseases by disrupting the correct pattern of pre-mRNA splicing. The effect of a point mutation within a coding sequence is traditionally attributed to the deduced change in the corresponding amino acid. However, some point mutations can have much more severe effects on the structure of the encoded protein, for example when they inactivate an exonic splicing enhancer (ESE), thereby resulting in exon skipping. ESEs also appear to be especially important in exons that normally undergo alternative splicing. Different classes of ESE consensus motifs have been described, but they are not always easily identified. ESEfinder (http://exon.cshl.edu/ESE/) is a web-based resource that facilitates rapid analysis of exon sequences to identify putative ESEs responsive to the human SR proteins SF2/ASF, SC35, SRp40 and SRp55, and to predict whether exonic mutations disrupt such elements.
|
246
|
Li Z, Van Calcar S, Qu C, Cavenee WK, Zhang MQ, Ren B. A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc Natl Acad Sci U S A 2003; 100:8164-9. [PMID: 12808131 PMCID: PMC166200 DOI: 10.1073/pnas.1332764100] [Citation(s) in RCA: 399] [Impact Index Per Article: 19.0] [Indexed: 12/24/2022] Open
Abstract
Overexpression of c-Myc is one of the most common alterations in human cancers, yet it is not clear how this transcription factor acts to promote malignant transformation. To understand the molecular targets of c-Myc function, we have used an unbiased genome-wide location-analysis approach to examine the genomic binding sites of c-Myc in Burkitt's lymphoma cells. We find that c-Myc together with its heterodimeric partner, Max, occupy >15% of gene promoters tested in these cancer cells. The DNA binding of c-Myc and Max correlates extensively with gene expression throughout the genome, a hallmark attribute of general transcription factors. The c-Myc/Max heterodimer complexes also colocalize with transcription factor IID in these cells, further supporting a general role for overexpressed c-Myc in global gene regulation. In addition, transcription of a majority of c-Myc target genes exhibits changes correlated with levels of c-myc mRNA in a diverse set of tissues and cell lines, supporting the conclusion that c-Myc regulates them. Taken together, these results suggest a general role for overexpressed c-Myc in global transcriptional regulation in some cancer cells and point toward molecular mechanisms for c-Myc function in malignant transformation.
MESH Headings
- Basic Helix-Loop-Helix Leucine Zipper Transcription Factors
- Basic-Leucine Zipper Transcription Factors
- Burkitt Lymphoma/genetics
- Burkitt Lymphoma/metabolism
- Cell Transformation, Neoplastic/genetics
- DNA-Binding Proteins/chemistry
- DNA-Binding Proteins/physiology
- Dimerization
- Gene Expression Profiling
- Gene Expression Regulation, Neoplastic
- Genes, myc
- Humans
- Neoplasm Proteins/chemistry
- Neoplasm Proteins/physiology
- Oligonucleotide Array Sequence Analysis
- Promoter Regions, Genetic/genetics
- Protein Interaction Mapping
- Proto-Oncogene Proteins c-myc/chemistry
- Proto-Oncogene Proteins c-myc/physiology
- RNA, Messenger/biosynthesis
- RNA, Messenger/genetics
- RNA, Neoplasm/biosynthesis
- RNA, Neoplasm/genetics
- Transcription Factor TFIID/physiology
- Transcription Factors
- Transcription, Genetic
- Translocation, Genetic
- Tumor Cells, Cultured/metabolism
|
247
|
Xuan Z, Wang J, Zhang MQ. Computational comparison of two mouse draft genomes and the human golden path. Genome Biol 2003; 4:R1. [PMID: 12537546 PMCID: PMC151282 DOI: 10.1186/gb-2002-4-1-r1] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Received: 10/24/2002] [Revised: 11/27/2002] [Accepted: 11/28/2002] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The availability of both mouse and human draft genomes has marked the beginning of a new era of comparative mammalian genomics. The two available mouse genome assemblies, from the public mouse genome sequencing consortium and Celera Genomics, were obtained using different clone libraries and different assembly methods. RESULTS: We present here a critical comparison of the two latest mouse genome assemblies. The utility of the combined genomes is further demonstrated by comparing them with the human 'golden path' and through a subsequent analysis of a resulting conserved sequence element (CSE) database, which allows us to identify over 6,000 potential novel genes and to derive independent estimates of the number of human protein-coding genes. CONCLUSION: The Celera and public mouse assemblies differ in about 10% of the mouse genome. Each assembly has advantages over the other: Celera has higher accuracy in base-pairs and overall higher coverage of the genome; the public assembly, however, has higher sequence quality in some newly finished bacterial artificial chromosome clone (BAC) regions and the data are freely accessible. Perhaps most important, by combining both assemblies, we can get a better annotation of the human genome; in particular, we can obtain the most complete set of CSEs, one third of which are related to known genes and some others are related to other functional genomic regions. More than half the CSEs are of unknown function. From the CSEs, we estimate the total number of human protein-coding genes to be about 40,000. This searchable publicly available online CSEdb will expedite new discoveries through comparative genomics.
|
248
|
Davuluri RV, Grosse I, Zhang MQ. Erratum: Computational identification of promoters and first exons in the human genome. Nat Genet 2002. [DOI: 10.1038/ng1102-459a] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
|
249
|
Nahle Z, Polakoff J, Davuluri RV, McCurrach ME, Jacobson MD, Narita M, Zhang MQ, Lazebnik Y, Bar-Sagi D, Lowe SW. Direct coupling of the cell cycle and cell death machinery by E2F. Nat Cell Biol 2002; 4:859-64. [PMID: 12389032 DOI: 10.1038/ncb868] [Citation(s) in RCA: 317] [Impact Index Per Article: 14.4] [Received: 04/05/2002] [Revised: 07/12/2002] [Accepted: 08/19/2002] [Indexed: 12/15/2022]
Abstract
Unrestrained E2F activity forces S phase entry and promotes apoptosis through p53-dependent and -independent mechanisms. Here, we show that deregulation of E2F by adenovirus E1A, loss of Rb or enforced E2F-1 expression results in the accumulation of caspase proenzymes through a direct transcriptional mechanism. Increased caspase levels seem to potentiate cell death in the presence of p53-generated signals that trigger caspase activation. Our results demonstrate that mitogenic oncogenes engage a tumour suppressor network that functions at multiple levels to efficiently induce cell death. The data also underscore how cell cycle progression can be coupled to the apoptotic machinery.
|
250
|
Carmell MA, Xuan Z, Zhang MQ, Hannon GJ. The Argonaute family: tentacles that reach into RNAi, developmental control, stem cell maintenance, and tumorigenesis. Genes Dev 2002; 16:2733-42. [PMID: 12414724 DOI: 10.1101/gad.1026102] [Citation(s) in RCA: 582] [Impact Index Per Article: 26.5] [Indexed: 11/24/2022]
|