51
Localization of the cis-enhancer element for mouse type X collagen expression in hypertrophic chondrocytes in vivo. J Bone Miner Res 2009; 24:1022-32. [PMID: 19113928 PMCID: PMC2683646 DOI: 10.1359/jbmr.081249] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 01/19/2023]
Abstract
The type X collagen gene (Col10a1) is a specific molecular marker of hypertrophic chondrocytes during endochondral bone formation. Mutations in human COL10A1 and altered chondrocyte hypertrophy have been associated with multiple skeletal disorders. However, until recently, the cis-enhancer element that specifies Col10a1 expression in hypertrophic chondrocytes in vivo has remained unidentified. Previously, we and others have shown that the Col10a1 distal promoter (-4.4 to -3.8 kb) may harbor a critical enhancer that mediates its tissue specificity in transgenic mice studies. Here, we report further localization of the cis-enhancer element within this Col10a1 distal promoter by using a similar transgenic mouse approach. We identify a 150-bp Col10a1 promoter element (-4296 to -4147 bp) that is sufficient to direct its tissue-specific expression in vivo. In silico analysis identified several putative transcription factor binding sites including two potential activator protein-1 (AP-1) sites within its 5'- and 3'-ends (-4276 to -4243 and -4166 to -4152 bp), respectively. Interestingly, transgenic mice using a reporter construct deleted for these two AP-1 elements still showed tissue-specific reporter activity. EMSAs using oligonucleotide probes derived from this region and MCT cell nuclear extracts identified DNA/protein complexes that were enriched from cells stimulated to hypertrophy. Moreover, these elements mediated increased reporter activity on transfection into MCT cells. These data define a 90-bp cis-enhancer required for tissue-specific Col10a1 expression in vivo and putative DNA/protein complexes that contribute to the regulation of chondrocyte hypertrophy. This work will enable us to identify candidate transcription factors essential both for skeletal development and for the pathogenesis of skeletal disorders.
52
Temiz NA, Camacho CJ. Experimentally based contact energies decode interactions responsible for protein-DNA affinity and the role of molecular waters at the binding interface. Nucleic Acids Res 2009; 37:4076-88. [PMID: 19429892 PMCID: PMC2709573 DOI: 10.1093/nar/gkp289] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Indexed: 11/23/2022]
Abstract
A major obstacle towards understanding the molecular basis of transcriptional regulation is the lack of a recognition code for protein–DNA interactions. Using high-quality crystal structures and binding data on the promiscuous family of C2H2 zinc fingers (ZF), we decode 10 fundamental specific interactions responsible for protein–DNA recognition. The interactions include five hydrogen bond types, three atomic desolvation penalties, a favorable non-polar energy, and a novel water accessibility factor. We apply this code to three large datasets containing a total of 89 C2H2 transcription factor (TF) mutants on the three ZFs of EGR. Guided by molecular dynamics simulations of individual ZFs, we map the interactions into homology models that embody all feasible intra- and intermolecular bonds, selecting for each sequence the structure with the lowest free energy. These interactions reproduce the change in affinity of 35 mutants of finger I (R² = 0.998), 23 mutants of finger II (R² = 0.96) and 31 finger III human domains (R² = 0.94). Our findings reveal recognition rules that depend on DNA sequence/structure, molecular water at the interface and induced fit of the C2H2 TFs. Collectively, our method provides the first robust framework to decode the molecular basis of TFs binding to DNA.
Affiliation(s)
- N Alpay Temiz
- Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
53
Zhang S, Xu M, Li S, Su Z. Genome-wide de novo prediction of cis-regulatory binding sites in prokaryotes. Nucleic Acids Res 2009; 37:e72. [PMID: 19383880 PMCID: PMC2691844 DOI: 10.1093/nar/gkp248] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Indexed: 01/17/2023]
Abstract
Although cis-regulatory binding sites (CRBSs) are at least as important as the coding sequences in a genome, our general understanding of them in most sequenced genomes is very limited due to the lack of efficient and accurate experimental and computational methods for their characterization, which has largely hindered our understanding of many important biological processes. In this article, we describe a novel algorithm for genome-wide de novo prediction of CRBSs with high accuracy. We designed our algorithm to circumvent three identified difficulties for CRBS prediction using comparative genomics principles based on a new method for the selection of reference genomes, a new metric for measuring the similarity of CRBSs, and a new graph clustering procedure. When operon structures are correctly predicted, our algorithm can predict 81% of known individual binding sites belonging to 94% of known cis-regulatory motifs in the Escherichia coli K12 genome, while achieving high prediction specificity. Our algorithm has also achieved similar prediction accuracy in the Bacillus subtilis genome, suggesting that it is very robust, and thus can be applied to any other sequenced prokaryotic genome. When compared with the prior state-of-the-art algorithms, our algorithm outperforms them in both prediction sensitivity and specificity.
Affiliation(s)
- Shaoqiang Zhang
- Department of Bioinformatics and Genomics, Bioinformatics Research Center, the University of North Carolina at Charlotte, Charlotte, NC 28223, USA
54
Zhang CQ, Wang J, Gao X. [Computational identification of transcriptional regulatory elements in Arabidopsis TCH4 promoter]. YI CHUAN = HEREDITAS 2009; 30:620-6. [PMID: 18487153 DOI: 10.3724/sp.j.1005.2008.00620] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/25/2022]
Abstract
The Arabidopsis TCH4 gene plays an important role in biological processes related to plant secondary growth, pathogen resistance, and adaptation to environmental stresses. It is up-regulated by various hormonal, environmental, and mechanical stimuli. Here, we identified 9 transcriptional regulatory elements in the TCH4 promoter using a bioinformatics approach. Of these, 4 elements have been reported previously and 5 are newly identified in this study. All of the identified elements contain the sequences of known cis-elements. Notably, their distribution along the promoters of some co-expressed genes and along orthologous promoters is typically clustered and syntenic. Based on our predictions and the information on known cis-elements, we propose a model of the transcriptional regulation mechanism of the TCH4 gene in response to hormonal, mechanical, and environmental stimuli.
Affiliation(s)
- Chang-Qing Zhang
- State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University, Nanjing 210093, China.
55
Pape UJ, Klein H, Vingron M. Statistical detection of cooperative transcription factors with similarity adjustment. Bioinformatics 2009; 25:2103-9. [PMID: 19286833 PMCID: PMC2722994 DOI: 10.1093/bioinformatics/btp143] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/30/2022]
Abstract
Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences due to the high probability of overlapping occurrences. Therefore, it is important to consider similarity of DNA motifs in the statistical assessment. Results: Based on previous work, we propose to adjust the window size for co-occurrence detection. Using the derived approximation, one obtains different window sizes for different sets of DNA motifs depending on their similarities. This ensures that the probabilities of co-occurrence in random sequences are equal. Applying the approach to selected similar and dissimilar DNA motifs from human TFs shows the necessity of adjustment and confirms the accuracy of the approximation by comparison to simulated data. Furthermore, it becomes clear that approaches ignoring similarities strongly underestimate P-values for cooperativity of TFs with similar DNA motifs. In addition, the approach is extended to deal with overlapping windows. We derive Chen–Stein error bounds for the approximation. Comparing the error bounds for similar and dissimilar DNA motifs shows that the approximation for similar DNA motifs yields large bounds. Hence, one has to be careful using overlapping windows. Based on the error bounds, one can precompute the approximation errors and select an appropriate overlap scheme before running the analysis. Availability: Software to perform the calculation for pairs of position frequency matrices (PFMs) is available at http://mosta.molgen.mpg.de as well as C++ source code for downloading. Contact: utz.pape@molgen.mpg.de
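The window-based co-occurrence statistic that the abstract refers to can be illustrated with a small counting routine. This is only a sketch under assumptions that are not taken from the paper: hits are given as 0-based start positions, each hit is assigned to the non-overlapping window containing its start, and the function name `count_cooccurrence_windows` is hypothetical. It does not implement the similarity-adjusted window size or the Chen–Stein bounds.

```python
def count_cooccurrence_windows(hits_a, hits_b, seq_len, window):
    """Count non-overlapping windows holding a hit of motif A and a hit of motif B.

    Assumption (illustrative only): a hit belongs to the window that
    contains its start position; a trailing partial window is ignored.
    """
    n_windows = seq_len // window
    with_a = {pos // window for pos in hits_a if pos // window < n_windows}
    with_b = {pos // window for pos in hits_b if pos // window < n_windows}
    return len(with_a & with_b)


# Toy hit positions for two motifs on a 300-bp sequence.
hits_a = [3, 120, 250]
hits_b = [40, 260, 299]
wide = count_cooccurrence_windows(hits_a, hits_b, seq_len=300, window=100)
narrow = count_cooccurrence_windows(hits_a, hits_b, seq_len=300, window=30)
```

With the toy positions above, the count depends on the window size chosen, which is the quantity the abstract proposes to adjust per motif set.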
Affiliation(s)
- Utz J Pape
- Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73 and Mathematics and Computer Science, Free University of Berlin, Takustr. 9, 14195 Berlin, Germany.
56
Miklós I, Novák Á, Satija R, Lyngsø R, Hein J. Stochastic models of sequence evolution including insertion-deletion events. Stat Methods Med Res 2009; 18:453-85. [DOI: 10.1177/0962280208099500] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/15/2022]
Abstract
Comparison of sequences that have descended from a common ancestor based on an explicit stochastic model of substitutions, insertions and deletions has risen to prominence in the last decade. Making statements about the positions of insertions-deletions (abbr. indels) is central in sequence and genome analysis and is called alignment. This statistical approach is harder, conceptually and computationally, than competing approaches based on choosing an alignment according to some optimality criteria. But it has major practical advantages in terms of testing evolutionary hypotheses and parameter estimation. Basic dynamic approaches can allow the analysis of up to 4-5 sequences. MCMC techniques can bring this to about 10-15 sequences. Beyond this, different or heuristic approaches must be used. Besides the computational challenges, increasing realism in the underlying models is presently being addressed. A recent development that has been especially fruitful is combining statistical alignment with the problem of sequence annotation, making statements about the function of each nucleotide/amino acid. So far gene finding, protein secondary structure prediction and regulatory signal detection have been tackled within this framework. Much progress can be reported, but clearly major challenges remain if this approach is to be central in the analyses of large incoming sequence data sets.
Affiliation(s)
- István Miklós
- Bioinformatics Group, Alfréd Rényi Institute of Mathematics, Hungarian Academy of Sciences, 1053 Budapest, Reáltanoda u. 13-15, Hungary; Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK; Data Mining and Search Research Group, Computer and Automation Institute, Hungarian Academy of Sciences, 1111 Budapest, Lágymányosi u. 11., Hungary
- Ádám Novák
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
- Rahul Satija
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
- Rune Lyngsø
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
- Jotun Hein
- Bioinformatics Group, Department of Statistics, University of Oxford, 1 South Parks Road, OX1 3TG Oxford, UK
57
Kang K, Chung JH, Kim J. Evolutionary Conserved Motif Finder (ECMFinder) for genome-wide identification of clustered YY1- and CTCF-binding sites. Nucleic Acids Res 2009; 37:2003-13. [PMID: 19208640 PMCID: PMC2665242 DOI: 10.1093/nar/gkp077] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 02/07/2023]
Abstract
We have developed a new bioinformatics approach called ECMFinder (Evolutionary Conserved Motif Finder). This program searches for a given DNA motif within the entire genome of one species and uses the gene association information of a potential transcription factor-binding site (TFBS) to screen the homologous regions of a second and third species. If multiple species have this potential TFBS in homologous positions, this program recognizes the identified TFBS as an evolutionary conserved motif (ECM). This program outputs a list of ECMs, which can be uploaded as a Custom Track in the UCSC genome browser and can be visualized along with other available data. The feasibility of this approach was tested by searching the genomes of three mammals (human, mouse and cow) with the DNA-binding motifs of YY1 and CTCF. This program successfully identified many clustered YY1- and CTCF-binding sites that are conserved among these species but were previously undetected. In particular, this program identified CTCF-binding sites that are located close to the Dlk1, Magel2 and Cdkn1c imprinted genes. Individual ChIP experiments confirmed the in vivo binding of the YY1 and CTCF proteins to most of these newly discovered binding sites, demonstrating the feasibility and usefulness of ECMFinder.
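The first step described above, searching a sequence for a given DNA motif, can be sketched as a consensus scan on both strands. This is a minimal illustration and not ECMFinder itself: the function names are hypothetical, the consensus string in the usage lines is an arbitrary placeholder rather than a motif taken from the paper, and the cross-species screening of homologous regions is not shown.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string written in IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(sequence: str, consensus: str):
    """Return sorted (start, strand) hits of an IUPAC consensus on both strands.

    Starts are 0-based coordinates on the forward strand.
    """
    sequence = sequence.upper()
    consensus = consensus.upper()
    hits = set()
    for strand, motif in (("+", consensus), ("-", reverse_complement(consensus))):
        # A lookahead makes the regex report overlapping occurrences too.
        pattern = re.compile("(?=" + "".join(IUPAC[c] for c in motif) + ")")
        for match in pattern.finditer(sequence):
            hits.add((match.start(), strand))
    return sorted(hits)


# Placeholder consensus, chosen only for illustration.
forward_hits = find_motif("AAGCCATATTGG", "GCCATNTT")
reverse_hits = find_motif("TTAACATGGCTT", "GCCATNTT")
```

A conservation filter in the spirit of the abstract would then keep only hits whose homologous positions in the other species also match; that step needs alignment or gene-association data and is left out here.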
Affiliation(s)
- Keunsoo Kang
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA and Department of Biological Sciences, KAIST, Daejeon 305-701, South Korea
- Jae Hoon Chung
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA and Department of Biological Sciences, KAIST, Daejeon 305-701, South Korea
- Joomyeong Kim
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA and Department of Biological Sciences, KAIST, Daejeon 305-701, South Korea
- *To whom correspondence should be addressed. Tel: +1 225-578-7692; Fax: +1 225-578-2597;
58
Alon S, Eisenberg E, Jacob-Hirsch J, Rechavi G, Vatine G, Toyama R, Coon SL, Klein DC, Gothilf Y. A new cis-acting regulatory element driving gene expression in the zebrafish pineal gland. Bioinformatics 2009; 25:559-62. [PMID: 19147662 DOI: 10.1093/bioinformatics/btp031] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 12/26/2022]
Abstract
MOTIVATION The identification of functional cis-acting DNA regulatory elements is a crucial step towards understanding gene regulation. Ab initio motif detection algorithms have been extensively used in search of regulatory elements. Yet, their success in providing experimentally validated regulatory elements in vertebrates has been limited. RESULTS Here we report in silico identification and in vivo validation of regulatory elements that determine enhanced gene expression in the pineal gland of zebrafish. Microarray data enabled detection of genes that exhibit high expression in the pineal gland. The promoter regions of these genes were computationally analyzed in order to identify overrepresented motifs. The highest ranking motif identified is a CRX/OTX binding site, known to govern expression in the pineal gland and retina. The second highest ranking motif was not reported before; we experimentally validated its function in vivo by mutational analysis. The methodology presented here may be applicable as a general scheme for finding regulatory elements that contribute to tissue-specific gene expression.
Affiliation(s)
- Shahar Alon
- Department of Neurobiology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
59
Wichadakul D, McDermott J, Samudrala R. Prediction and integration of regulatory and protein-protein interactions. Methods Mol Biol 2009; 541:101-43. [PMID: 19381527 DOI: 10.1007/978-1-59745-243-4_6] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 01/18/2023]
Abstract
Knowledge of transcriptional regulatory interactions (TRIs) is essential for exploring functional genomics and systems biology in any organism. While several results from genome-wide analysis of transcriptional regulatory networks are available, they are limited to model organisms such as yeast (1) and worm (2). Beyond these networks, experiments on TRIs study only individual genes and proteins of specific interest. In this chapter, we present a method for the integration of various data sets to predict TRIs for 54 organisms in the Bioverse (3). We describe how to compile and handle various formats and identifiers of data sets from different sources and how to predict TRIs using a homology-based approach, utilizing the compiled data sets. Integrated data sets include experimentally verified TRIs, binding sites of transcription factors, promoter sequences, protein subcellular localization, and protein families. Predicted TRIs expand the networks of gene regulation for a large number of organisms. The integration of experimentally verified and predicted TRIs with other known protein-protein interactions (PPIs) gives insight into specific pathways, network motifs, and the topological dynamics of an integrated network with gene expression under different conditions, essential for exploring functional genomics and systems biology.
60
Yaragatti M, Sandler T, Ungar L. A predictive model for identifying mini-regulatory modules in the mouse genome. Bioinformatics 2008; 25:353-7. [DOI: 10.1093/bioinformatics/btn622] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 02/06/2023]
61
Chaivorapol C, Melton C, Wei G, Yeh RF, Ramalho-Santos M, Blelloch R, Li H. CompMoby: comparative MobyDick for detection of cis-regulatory motifs. BMC Bioinformatics 2008; 9:455. [PMID: 18950538 PMCID: PMC2605473 DOI: 10.1186/1471-2105-9-455] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 06/10/2008] [Accepted: 10/27/2008] [Indexed: 12/31/2022]
Abstract
BACKGROUND The regulation of gene expression is complex and occurs at many levels, including transcriptional and post-transcriptional, in metazoans. Transcriptional regulation is mainly determined by sequence elements within the promoter regions of genes while sequence elements within the 3' untranslated regions of mRNAs play important roles in post-transcriptional regulation such as mRNA stability and translation efficiency. Identifying cis-regulatory elements, or motifs, in multicellular eukaryotes is more difficult compared to unicellular eukaryotes due to the larger intergenic sequence space and the increased complexity in regulation. Experimental techniques for discovering functional elements are often time consuming and not easily applied on a genome level. Consequently, computational methods are advantageous for genome-wide cis-regulatory motif detection. To decrease the search space in metazoans, many algorithms use cross-species alignment, although studies have demonstrated that a large portion of the binding sites for the same trans-acting factor do not reside in alignable regions. Therefore, a computational algorithm should account for both conserved and nonconserved cis-regulatory elements in metazoans. RESULTS We present CompMoby (Comparative MobyDick), software developed to identify cis-regulatory binding sites at both the transcriptional and post-transcriptional levels in metazoans without prior knowledge of the trans-acting factors. The CompMoby algorithm was previously shown to identify cis-regulatory binding sites in upstream regions of genes co-regulated in embryonic stem cells. In this paper, we extend the software to identify putative cis-regulatory motifs in 3' UTR sequences and verify our results using experimentally validated data sets in mouse and human. We also detail the implementation of CompMoby into a user-friendly tool that includes a web interface to a streamlined analysis. 
Our software allows detection of motifs in the following three categories: one, those that are alignable and conserved; two, those that are conserved but not alignable; three, those that are species specific. One of the output files from CompMoby gives the user the option to decide what category of cis-regulatory element to experimentally pursue based on their biological problem. Using experimentally validated biological datasets, we demonstrate that CompMoby is successful in detecting cis-regulatory target sites of known and novel trans-acting factors at the transcriptional and post-transcriptional levels. CONCLUSION CompMoby is a powerful software tool for systematic de novo discovery of evolutionarily conserved and nonconserved cis-regulatory sequences involved in transcriptional or post-transcriptional regulation in metazoans. This software is freely available to users at http://genome.ucsf.edu/compmoby/.
Affiliation(s)
- Christina Chaivorapol
- Department of Biochemistry and Biophysics, California Institute for Quantitative Biomedical Research, Graduate Program in Biological and Medical Informatics, University of California, San Francisco, CA 94143-2540, USA.
62
da Fonseca PGS, Guimarães KS, Sagot MF. Efficient representation and P-value computation for high-order Markov motifs. Bioinformatics 2008; 24:i160-6. [PMID: 18689819 DOI: 10.1093/bioinformatics/btn282] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
MOTIVATION Position weight matrices (PWMs) have become a standard for representing biological sequence motifs. Their relative simplicity has favoured the development of efficient algorithms for diverse tasks such as motif identification, sequence scanning and statistical significance evaluation. Markov chain-based models generalize the PWM model by allowing for interposition dependencies to be considered, at the cost of substantial computational overhead, which may limit their application. RESULTS In this article, we consider two aspects regarding the use of higher order Markov models for biological sequence motifs, namely, the representation and the computation of P-values for motifs described by a set of occurrences. We propose an efficient representation based on the use of tries, from which empirical position-specific conditional base probabilities can be computed, and extend state-of-the-art PWM-based algorithms to allow for the computation of exact P-values for high-order Markov motif models. AVAILABILITY The software is available in the form of a Java object-oriented library from http://www.cin.ufpe.br/~paguso/kmarkov.
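The idea of reading empirical position-specific conditional base probabilities off a trie of motif occurrences can be sketched as follows. This is an illustrative simplification, not the authors' Java library: the class and function names are hypothetical, and the probability is conditioned on the entire prefix of the occurrence rather than on a fixed-order context of the last k bases. P-value computation is not shown.

```python
from collections import defaultdict


class TrieNode:
    """Trie node; `count` is the number of occurrences passing through it."""

    def __init__(self):
        self.count = 0
        self.children = defaultdict(TrieNode)


def build_trie(occurrences):
    """Insert equal-length motif occurrences into a counting trie."""
    root = TrieNode()
    for occurrence in occurrences:
        node = root
        node.count += 1
        for base in occurrence:
            node = node.children[base]
            node.count += 1
    return root


def conditional_prob(root, prefix, base):
    """Empirical P(next base == `base` | occurrence starts with `prefix`).

    The depth reached in the trie equals the motif position, so the
    estimate is position-specific by construction.
    """
    node = root
    for b in prefix:
        if b not in node.children:
            return 0.0  # prefix never observed
        node = node.children[b]
    child = node.children.get(base)  # .get() does not create a new node
    return child.count / node.count if child else 0.0


# Toy occurrence set, made up for illustration.
trie = build_trie(["ACG", "ACT", "AGT", "TCG"])
p_first_a = conditional_prob(trie, "", "A")
p_g_after_ac = conditional_prob(trie, "AC", "G")
```

Shared prefixes are stored once, which is why a trie keeps the representation compact when many occurrences overlap in their leading bases.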
Affiliation(s)
- Paulo G S da Fonseca
- Centro de Informática, Universidade Federal de Pernambuco, 50732-970 Recife, Brazil.
63
Abstract
Motivation: The computational identification of transcription factor binding sites is a major challenge in bioinformatics and an important complement to experimental approaches. Results: We describe a novel, exact discriminative seeding DNA motif discovery algorithm designed for fast and reliable prediction of cis-regulatory elements in eukaryotic promoters. The algorithm is tested on biological benchmark data and shown to perform as well as or better than other motif discovery tools. The algorithm is applied to the analysis of plant tissue-specific promoter sequences and successfully identifies key regulatory elements. Availability: The Seeder Perl distribution includes four modules. It is available for download on the Comprehensive Perl Archive Network (CPAN) at http://www.cpan.org. Contact: martina.stromvik@mcgill.ca Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- François Fauteux
- Department of Plant Science, McGill University, Ste-Anne-de-Bellevue, Quebec, Canada
64
Murray JI, Voelker RB, Henscheid KL, Warf MB, Berglund JA. Identification of motifs that function in the splicing of non-canonical introns. Genome Biol 2008; 9:R97. [PMID: 18549497 PMCID: PMC2481429 DOI: 10.1186/gb-2008-9-6-r97] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Received: 09/20/2007] [Revised: 12/27/2007] [Accepted: 06/12/2008] [Indexed: 01/22/2023]
Abstract
The enrichment of specific intronic splicing enhancers upstream of weak PY tracts suggests a novel mechanism for intron recognition that compensates for a weakened canonical pre-mRNA splicing motif. Background: While the current model of pre-mRNA splicing is based on the recognition of four canonical intronic motifs (5' splice site, branchpoint sequence, polypyrimidine (PY) tract and 3' splice site), it is becoming increasingly clear that splicing is regulated by both canonical and non-canonical splicing signals located in the RNA sequence of introns and exons that act to recruit the spliceosome and associated splicing factors. The diversity of human intronic sequences suggests the existence of novel recognition pathways for non-canonical introns. This study addresses the recognition and splicing of human introns that lack a canonical PY tract. The PY tract is a uridine-rich region at the 3' end of introns that acts as a binding site for U2AF65, a key factor in splicing machinery recruitment. Results: Human introns were classified computationally into low- and high-scoring PY tracts by scoring the likely U2AF65 binding site strength. Biochemical studies confirmed that low-scoring PY tracts are weak U2AF65 binding sites while high-scoring PY tracts are strong U2AF65 binding sites. A large population of human introns contains weak PY tracts. Computational analysis revealed many families of motifs, including C-rich and G-rich motifs, that are enriched upstream of weak PY tracts. In vivo splicing studies show that C-rich and G-rich motifs function as intronic splicing enhancers in a combinatorial manner to compensate for weak PY tracts. Conclusion: The enrichment of specific intronic splicing enhancers upstream of weak PY tracts suggests that a novel mechanism for intron recognition exists, which compensates for a weakened canonical pre-mRNA splicing motif.
Affiliation(s)
- Jill I Murray
- Department of Chemistry, Institute of Molecular Biology, University of Oregon, Eugene, Oregon, USA
65
Lähdesmäki H, Rust AG, Shmulevich I. Probabilistic inference of transcription factor binding from multiple data sources. PLoS One 2008; 3:e1820. [PMID: 18364997 PMCID: PMC2268002 DOI: 10.1371/journal.pone.0001820] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.4] [Received: 09/17/2007] [Accepted: 02/04/2008] [Indexed: 11/21/2022]
Abstract
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as, multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources. 
Test data set, a web tool, source codes and supplementary data are available at: http://www.probtf.org.
Affiliation(s)
- Harri Lähdesmäki
- Institute for Systems Biology, Seattle, Washington, United States of America
- Alistair G. Rust
- Institute for Systems Biology, Seattle, Washington, United States of America
- Ilya Shmulevich
- Institute for Systems Biology, Seattle, Washington, United States of America
66
Levitsky VG, Ignatieva EV, Ananko EA, Turnaev II, Merkulova TI, Kolchanov NA, Hodgman TC. Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions. BMC Bioinformatics 2007; 8:481. [PMID: 18093302 PMCID: PMC2265442 DOI: 10.1186/1471-2105-8-481] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Received: 07/22/2007] [Accepted: 12/19/2007] [Indexed: 12/22/2022]
Abstract
Background: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered. Results: To improve this situation, we have quantified the performance of several Position Weight Matrix (PWM) algorithms, using exhaustive approaches to find their optimal length and position. We applied these approaches to biomedically important TFBSs involved in the regulation of cell growth and proliferation as well as in inflammatory, immune, and antiviral responses (NF-κB, ISGF3, IRF1, STAT1), obesity and lipid metabolism (PPAR, SREBP, HNF4), and regulation of steroidogenic (SF-1) and cell cycle (E2F) gene expression. We have also gained extra specificity using a method, entitled SiteGA, which takes into account structural interactions within TFBS core and flanking regions, using a genetic algorithm (GA) with a discriminant function of locally positioned dinucleotide (LPD) frequencies. To ensure higher confidence in our approach, we applied resampling-jackknife and bootstrap tests for the comparison; optimized PWMs and SiteGA showed similar recognition performance. We then applied SiteGA and optimized PWMs (both separately and together) to sequences in the Eukaryotic Promoter Database (EPD). The resulting SiteGA recognition models can now be used to search sequences for binding sites using the web tool, SiteGA. Analysis of dependencies between close and distant LPDs revealed by SiteGA models has shown that the most significant correlations are between close LPDs and are generally located in the core (footprint) region. A greater number of less significant correlations are mainly between distant LPDs, which span both core and flanking regions. When SiteGA and optimized PWM models were applied together, false positives were substantially reduced, at least at higher stringencies. Conclusion: Based on this analysis, SiteGA adds substantial specificity even to optimized PWMs and may be considered for large-scale genome analysis. It adds to the range of techniques available for TFBS prediction, and EPD analysis has led to a list of genes which appear to be regulated by the above TFs.
Affiliation(s)
- Victor G Levitsky
- Institute of Cytology and Genetics SB RAS, Novosibirsk, 630090, Russia.
|
67
|
Bi C, Leeder JS, Vyhlidal CA. A comparative study on computational two-block motif detection: algorithms and applications. Mol Pharm 2007; 5:3-16. [PMID: 18076137 DOI: 10.1021/mp7001126] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
Since the completion of human genome sequencing, cataloging of all genomic functional elements has been one of the challenging problems in bioinformatics. Deciphering cis-regulatory elements in the human genome still remains elusive although much effort has been expended. This paper reviews a suite of methods for two-block motif discovery including mathematical modeling, de novo motif-finding based on multiple local alignment, and genomic sequence scanning method for putative sites. We formulate a general method to address this challenge and compare two major existing algorithms (i.e., greedy local search and Gibbs sampling) implemented to solve the popular two-block structured motif discovery issue. We demonstrate how to use this suite of methods and apply them to human nuclear receptor response elements (i.e., protein binding sites of several relevant nuclear receptors, HNF4alpha, CAR/RXR, and PXR/RXR).
Affiliation(s)
- Chengpeng Bi
- Bioinformatics and Intelligent Computing, Division of Clinical Pharmacology and Toxicology, Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Kansas City, Missouri 64108, USA.
|
68
|
Wang X, Gu J, Zhang MQ, Li Y. Identification of phylogenetically conserved microRNA cis-regulatory elements across 12 Drosophila species. Bioinformatics 2007; 24:165-71. [DOI: 10.1093/bioinformatics/btm572] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 01/21/2023] Open
|
69
|
Cogburn LA, Porter TE, Duclos MJ, Simon J, Burgess SC, Zhu JJ, Cheng HH, Dodgson JB, Burnside J. Functional genomics of the chicken--a model organism. Poult Sci 2007; 86:2059-94. [PMID: 17878436 DOI: 10.1093/ps/86.10.2059] [Citation(s) in RCA: 86] [Impact Index Per Article: 5.1] [Indexed: 11/13/2022] Open
Abstract
Since the sequencing of the genome and the development of high-throughput tools for the exploration of functional elements of the genome, the chicken has reached model organism status. Functional genomics focuses on understanding the function and regulation of genes and gene products on a global or genome-wide scale. Systems biology attempts to integrate functional information derived from multiple high-content data sets into a holistic view of all biological processes within a cell or organism. Generation of a large collection (approximately 600K) of chicken expressed sequence tags, representing most tissues and developmental stages, has enabled the construction of high-density microarrays for transcriptional profiling. Comprehensive analysis of this large expressed sequence tag collection and a set of approximately 20K full-length cDNA sequences indicates that the transcriptome of the chicken represents approximately 20,000 genes. Furthermore, comparative analyses of these sequences have facilitated functional annotation of the genome and the creation of several bioinformatic resources for the chicken. Recently, about 20 papers have been published on transcriptional profiling with DNA microarrays in chicken tissues under various conditions. Proteomics is another powerful high-throughput tool currently used for examining the dynamics of protein expression in chicken tissues and fluids. Computational analyses of the chicken genome are providing new insight into the evolution of gene families in birds and other organisms. Abundant functional genomic resources now support large-scale analyses in the chicken and will facilitate identification of transcriptional mechanisms, gene networks, and metabolic or regulatory pathways that will ultimately determine the phenotype of the bird. New technologies such as marker-assisted selection, transgenics, and RNA interference offer the opportunity to modify the phenotype of the chicken to fit defined production goals. This review focuses on functional genomics in the chicken and provides a road map for large-scale exploration of the chicken genome.
Affiliation(s)
- L A Cogburn
- Department of Animal and Food Sciences, University of Delaware, Newark 19717, USA.
|
70
|
Dinkel H, Sticht H. A computational strategy for the prediction of functional linear peptide motifs in proteins. Bioinformatics 2007; 23:3297-303. [PMID: 17977881 DOI: 10.1093/bioinformatics/btm524] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 11/15/2022]
Abstract
MOTIVATION: Short linear peptide motifs mediate protein-protein interaction and cell compartment targeting, and represent sites of post-translational modification. The identification of functional motifs by conventional sequence searches, however, is hampered by the short length of the motifs, which results in a large number of hits of which only a small portion is functional. RESULTS: We have developed a procedure for the identification of functional motifs, which scores pattern conservation in homologous sequences by explicitly taking into account the sequence similarity to the query sequence. To further improve this method, sequence filters have been optimized to mask those sequence regions containing few or no linear motifs. The performance of this approach was verified by measuring its ability to identify 576 experimentally validated motifs among a total of 15 563 instances in a set of 415 protein sequences. Compared to a random selection procedure, the joint application of sequence filters and the novel scoring scheme resulted in a 9-fold enrichment of validated functional motifs on the first rank. In addition, only half as many hits need to be investigated to recover 75% of the functional instances in our dataset. Therefore, this motif-scoring approach should help to guide experiments because it allows focusing on those short linear peptide motifs that have a high probability of being functional.
Affiliation(s)
- Holger Dinkel
- Abteilung für Bioinformatik, Institut für Biochemie, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
|
71
|
Grskovic M, Chaivorapol C, Gaspar-Maia A, Li H, Ramalho-Santos M. Systematic identification of cis-regulatory sequences active in mouse and human embryonic stem cells. PLoS Genet 2007; 3:e145. [PMID: 17784790 PMCID: PMC1959362 DOI: 10.1371/journal.pgen.0030145] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.4] [Received: 03/26/2007] [Accepted: 07/10/2007] [Indexed: 01/06/2023] Open
Abstract
Understanding the transcriptional regulation of pluripotent cells is of fundamental interest and will greatly inform efforts aimed at directing differentiation of embryonic stem (ES) cells or reprogramming somatic cells. We first analyzed the transcriptional profiles of mouse ES cells and primordial germ cells and identified genes upregulated in pluripotent cells both in vitro and in vivo. These genes are enriched for roles in transcription, chromatin remodeling, cell cycle, and DNA repair. We developed a novel computational algorithm, CompMoby, which combines analyses of sequences both aligned and non-aligned between different genomes with a probabilistic segmentation model to systematically predict short DNA motifs that regulate gene expression. CompMoby was used to identify conserved overrepresented motifs in genes upregulated in pluripotent cells. We show that the motifs are preferentially active in undifferentiated mouse ES and embryonic germ cells in a sequence-specific manner, and that they can act as enhancers in the context of an endogenous promoter. Importantly, the activity of the motifs is conserved in human ES cells. We further show that the transcription factor NF-Y specifically binds to one of the motifs, is differentially expressed during ES cell differentiation, and is required for ES cell proliferation. This study provides novel insights into the transcriptional regulatory networks of pluripotent cells. Our results suggest that this systematic approach can be broadly applied to understanding transcriptional networks in mammalian species.
Embryonic stem cells have two remarkable properties: they can proliferate very rapidly, and they can give rise to all of the body's cell types. Understanding how gene activity is regulated in embryonic stem cells will be an important step towards therapeutic applications. The activity of genes is regulated by proteins called transcription factors, which bind to stretches of DNA sequences that act as on or off switches. We identified genes that are active in mouse embryonic stem cells but not in differentiated cells. We reasoned that if these genes have similar patterns of activity, they may be regulated by the same transcription factors. We therefore developed a computational approach that takes information on gene activity and predicts DNA sequences that may act as switches. Using this approach, we discovered new DNA switches that regulate gene activity in mouse and human embryonic stem cells. Furthermore, we identified a transcription factor that binds to one of these DNA switches and is important for the rapid proliferation of embryonic stem cells. Our approach sheds light on the genetic regulation of embryonic stem cells and will be broadly applicable to questions of how gene activity is regulated in other cell types of interest.
Affiliation(s)
- Marica Grskovic
- Institute for Regeneration Medicine, University of California San Francisco, San Francisco, California, United States of America
- Diabetes Center, University of California San Francisco, San Francisco, California, United States of America
- Christina Chaivorapol
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biomedical Research, University of California San Francisco, San Francisco, California, United States of America
- Graduate Program in Biological and Medical Informatics; University of California San Francisco, San Francisco, California, United States of America
- Alexandre Gaspar-Maia
- Institute for Regeneration Medicine, University of California San Francisco, San Francisco, California, United States of America
- Diabetes Center, University of California San Francisco, San Francisco, California, United States of America
- Doctoral Program in Biomedicine and Experimental Biology, Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal
- Hao Li
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biomedical Research, University of California San Francisco, San Francisco, California, United States of America
- Graduate Program in Biological and Medical Informatics; University of California San Francisco, San Francisco, California, United States of America
- * To whom correspondence should be addressed. E-mail: (HL); (MRS)
- Miguel Ramalho-Santos
- Institute for Regeneration Medicine, University of California San Francisco, San Francisco, California, United States of America
- Diabetes Center, University of California San Francisco, San Francisco, California, United States of America
- * To whom correspondence should be addressed. E-mail: (HL); (MRS)
|
72
|
Goto N, Kurokawa K, Yasunaga T. Analysis of invariant sequences in 266 complete genomes. Gene 2007; 401:172-80. [PMID: 17728079 DOI: 10.1016/j.gene.2007.07.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 12/12/2006] [Revised: 07/13/2007] [Accepted: 07/16/2007] [Indexed: 11/29/2022]
Abstract
To date, the complete genome sequences of more than 250 organisms have been determined. This information can now be used to determine whether there exist any invariant sequences that are conserved among all organisms, from bacteria to plants, animals, and humans. The existence of invariant sequences would strongly suggest that these sequences have been inherited unchanged from the last common ancestor of all life, and that they have essential functions. We have developed a new software program to identify invariant sequences conserved among the currently sequenced genomes and applied this analysis to the complete genome sequences of 266 organisms. We have identified 3 invariant DNA sequences longer than or equal to 11 bp and 6 invariant amino acid sequences longer than or equal to 6 aa. The longest invariant DNA sequence, AAGTCGTAACAAGGT (15 bp), was found in the 16S/18S rRNA gene. Two 8 aa sequences, GHVDHGKT in IF2 and EF-Tu and DTPGHVDF in EF-G, were the longest invariant amino acid sequences detected. These sequences could be essential elements from the genome of the last common ancestor and may have remained unchanged throughout evolution.
MESH Headings
- Amino Acid Sequence/genetics
- Animals
- Archaeal Proteins/chemistry
- Archaeal Proteins/genetics
- Bacterial Proteins/chemistry
- Bacterial Proteins/genetics
- Base Sequence/genetics
- Conserved Sequence/genetics
- Fungal Proteins/chemistry
- Fungal Proteins/genetics
- Genome
- Genome, Archaeal
- Genome, Bacterial
- Genome, Fungal
- Genome, Human
- Genome, Plant
- Humans
- Protein Biosynthesis/genetics
- Protein Processing, Post-Translational/genetics
- RNA, Ribosomal, 16S/chemistry
- RNA, Ribosomal, 16S/genetics
- RNA, Ribosomal, 18S/chemistry
- RNA, Ribosomal, 18S/genetics
- RNA, Ribosomal, 23S/chemistry
- RNA, Ribosomal, 23S/genetics
- Sequence Analysis, DNA
- Sequence Analysis, Protein
- Software
- Transcription, Genetic
Affiliation(s)
- Naohisa Goto
- Department of Genome Informatics, Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, 3-1 Yamadaoka, Suita, Osaka 565-0871, Japan.
|
73
|
Doyon JB, Liu DR. Identification of eukaryotic promoter regulatory elements using nonhomologous random recombination. Nucleic Acids Res 2007; 35:5851-60. [PMID: 17720707 PMCID: PMC2034452 DOI: 10.1093/nar/gkm634] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022] Open
Abstract
Understanding the regulatory logic of a eukaryotic promoter requires the elucidation of the regulatory elements within that promoter. Current experimental or computational methods to discover regulatory motifs within a promoter can be labor intensive and may miss redundant, unprecedented or weakly activating elements. We have developed an unbiased combinatorial approach to rapidly identify new upstream activating sequences (UASs) in a promoter. This approach couples nonhomologous random recombination with an in vivo screen to efficiently identify UASs and does not rely on preconceived hypotheses about promoter regulation or on similarity to known activating sequences. We validated this method using the unfolded protein response (UPR) in yeast and were able to identify both known and potentially novel UASs involved in the UPR. One of the new UASs discovered using this approach implicates Crz1 as a possible activator of Hac1, a transcription factor involved in the UPR. This method has several advantages over existing methods for UAS discovery including its speed, potential generality, sensitivity and lack of false positives and negatives.
Affiliation(s)
- David R. Liu
- *To whom correspondence should be addressed. Tel: +1 617 496 1067; Fax: +1 617 496 5688
|
74
|
A novel ensemble learning method for de novo computational identification of DNA binding sites. BMC Bioinformatics 2007; 8:249. [PMID: 17626633 PMCID: PMC1950314 DOI: 10.1186/1471-2105-8-249] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.2] [Received: 03/06/2007] [Accepted: 07/12/2007] [Indexed: 12/02/2022] Open
Abstract
Background: Despite the diversity of motif representations and search algorithms, the de novo computational identification of transcription factor binding sites remains constrained by the limited accuracy of existing algorithms and the need for user-specified input parameters that describe the motif being sought. Results: We present a novel ensemble learning method, SCOPE, that is based on the assumption that transcription factor binding sites belong to one of three broad classes of motifs: non-degenerate, degenerate and gapped motifs. SCOPE employs a unified scoring metric to combine the results from three motif finding algorithms each aimed at the discovery of one of these classes of motifs. We found that SCOPE's performance on 78 experimentally characterized regulons from four species was a substantial and statistically significant improvement over that of its component algorithms. SCOPE outperformed a broad range of existing motif discovery algorithms on the same dataset by a statistically significant margin. Conclusion: SCOPE demonstrates that combining multiple, focused motif discovery algorithms can provide a significant gain in performance. By building on components that efficiently search for motifs without user-defined parameters, SCOPE requires as input only a set of upstream sequences and a species designation, making it a practical choice for non-expert users. A user-friendly web interface, Java source code and executables are available at .
|
75
|
Okumura T, Makiguchi H, Makita Y, Yamashita R, Nakai K. Melina II: a web tool for comparisons among several predictive algorithms to find potential motifs from promoter regions. Nucleic Acids Res 2007; 35:W227-31. [PMID: 17537821 PMCID: PMC1933176 DOI: 10.1093/nar/gkm362] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.0] [Indexed: 01/30/2023] Open
Abstract
We present the second version of Melina, a web-based tool for promoter analysis. Melina II shows potential DNA motifs in promoter regions with a combination of several available programs, Consensus, MEME, Gibbs sampler, MDscan and Weeder, as well as several parameter settings. It allows running a maximum of four programs simultaneously, and comparing their results with graphical representations. In addition, users can build a weight matrix from a predicted motif and apply it to upstream sequences of several typical genomes (human, mouse, S. cerevisiae, E. coli, B. subtilis or A. thaliana) or to public motif databases (JASPAR or DBTBS) in order to find similar motifs. Melina II is a client/server system developed by using Adobe (Macromedia) Flash and is accessible over the web at http://melina.hgc.jp.
Affiliation(s)
- Toshiyuki Okumura
- Mitsui Knowledge Industry Co. Ltd, RIKEN Genomic Sciences Center and Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
- Hiroki Makiguchi
- Mitsui Knowledge Industry Co. Ltd, RIKEN Genomic Sciences Center and Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
- Yuko Makita
- Mitsui Knowledge Industry Co. Ltd, RIKEN Genomic Sciences Center and Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
- Riu Yamashita
- Mitsui Knowledge Industry Co. Ltd, RIKEN Genomic Sciences Center and Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
- Kenta Nakai
- Mitsui Knowledge Industry Co. Ltd, RIKEN Genomic Sciences Center and Human Genome Center, Institute of Medical Science, University of Tokyo, Japan
- *To whom correspondence should be addressed. +81-3-5449-5131; +81-3-5449-5133
|
76
|
Carlson JM, Chakravarty A, DeZiel CE, Gross RH. SCOPE: a web server for practical de novo motif discovery. Nucleic Acids Res 2007; 35:W259-64. [PMID: 17485471 PMCID: PMC1933170 DOI: 10.1093/nar/gkm310] [Citation(s) in RCA: 85] [Impact Index Per Article: 5.0] [Indexed: 11/14/2022] Open
Abstract
SCOPE is a novel parameter-free method for the de novo identification of potential regulatory motifs in sets of coordinately regulated genes. The SCOPE algorithm combines the output of three component algorithms, each designed to identify a particular class of motifs. Using an ensemble learning approach, SCOPE identifies the best candidate motifs from its component algorithms. In tests on experimentally determined datasets, SCOPE identified motifs with a significantly higher level of accuracy than a number of other web-based motif finders run with their default parameters. Because SCOPE has no adjustable parameters, the web server has an intuitive interface, requiring only a set of gene names or FASTA sequences and a choice of species. The most significant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. Detailed motif information, including the sequence logo, PWM, consensus sequence and specific matching sites can be viewed through a single click on a motif. SCOPE's efficient, parameter-free search strategy has enabled the development of a web server that is readily accessible to the practising biologist while providing results that compare favorably with those of other motif finders. The SCOPE web server is at <http://genie.dartmouth.edu/scope>.
Affiliation(s)
- Jonathan M. Carlson
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- Arijit Chakravarty
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- Charles E. DeZiel
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- Robert H. Gross
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- *To whom correspondence should be addressed. +603 646 2059; +603 646 1347
|
77
|
Jegga AG, Chen J, Gowrisankar S, Deshmukh MA, Gudivada R, Kong S, Kaimal V, Aronow BJ. GenomeTrafac: a whole genome resource for the detection of transcription factor binding site clusters associated with conventional and microRNA encoding genes conserved between mouse and human gene orthologs. Nucleic Acids Res 2006; 35:D116-21. [PMID: 17178752 PMCID: PMC1781107 DOI: 10.1093/nar/gkl1011] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022] Open
Abstract
Transcriptional cis-regulatory control regions are frequently found within non-coding DNA segments conserved across multi-species gene orthologs. Adopting a systematic gene-centric pipeline approach, we report here the development of a web-accessible database resource, GenomeTraFac (http://genometrafac.cchmc.org), that allows genome-wide detection and characterization of compositionally similar cis-clusters that occur in gene orthologs between any two genomes, for both microRNA genes and conventional RNA-encoding genes. Each ortholog gene pair can be scanned to visualize overall conserved sequence regions and, within these, the relative density of conserved cis-element motif clusters, which forms graph peak structures. The results of these analyses can be mined en masse to identify the most frequently represented cis-motifs in a list of genes. The system also provides a method for rapid evaluation and visualization of gene model consistency between orthologs, and facilitates consideration of the potential impact of sequence variation in conserved non-coding regions on complex cis-element structures. Using the mouse and human genomes via the NCBI Reference Sequence database and the Sanger Institute miRBase, the system demonstrated the ability to identify validated transcription factor targets within promoter and distal genomic regulatory regions of both conventional and microRNA genes.
Affiliation(s)
- Anil G. Jegga
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Department of Pediatrics, College of Medicine, Cincinnati, OH 45229, USA
- Jing Chen
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45229, USA
- Sivakumar Gowrisankar
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45229, USA
- Mrunal A. Deshmukh
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- RangaChandra Gudivada
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45229, USA
- Sue Kong
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Vivek Kaimal
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45229, USA
- Bruce J. Aronow
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Department of Pediatrics, College of Medicine, Cincinnati, OH 45229, USA
- Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45229, USA
- To whom correspondence should be addressed at Cincinnati Children's Hospital Medical Center, 3333 Burnet Avenue–MLC 7024, Cincinnati, OH 45229-3039, USA. Tel: +1 513 636 4865; Fax: +1 513 636 2056
|
78
|
Brilli M, Fani R, Lió P. MotifScorer: using a compendium of microarrays to identify regulatory motifs. Bioinformatics 2006; 23:493-5. [PMID: 17138590 DOI: 10.1093/bioinformatics/btl607] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022] Open
Abstract
We describe MotifScorer, a program for systematic genome-wide identification of transcription sites. The program uses a compendium of gene expression microarrays and implements state-of-the-art partial least squares (PLS)-based regression and stepwise regression procedures. Candidate motifs from the upstream sequences of groups of co-regulated genes are identified and assigned a score using genomic background models and available motif finding tools. The use of a large library of expression data allows statistical comparative analysis of the specificity of motifs identified in different conditions. AVAILABILITY: MotifScorer (written in Java and Matlab), its manual, and example files are available from the authors.
Affiliation(s)
- Matteo Brilli
- Dipartimento di Biologia Animale e Genetica, via Romana 17, 50125 Firenze, Italy
|
79
|
GuhaThakurta D, Xie T, Anand M, Edwards SW, Li G, Wang SS, Schadt EE. Cis-regulatory variations: a study of SNPs around genes showing cis-linkage in segregating mouse populations. BMC Genomics 2006; 7:235. [PMID: 16978413 PMCID: PMC1618400 DOI: 10.1186/1471-2164-7-235] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Received: 06/07/2006] [Accepted: 09/15/2006] [Indexed: 11/10/2022] Open
Abstract
Background: Changes in gene expression are known to be responsible for phenotypic variation and susceptibility to diseases. Identification and annotation of the genomic sequence variants that cause gene expression changes is therefore likely to lead to a better understanding of the cause of disease at the molecular level. In this study we investigate the pattern of single nucleotide polymorphisms (SNPs) in genes for which the mRNA levels show cis-genetic linkage (gene expression quantitative trait loci mapping in cis, or cis-eQTLs) in segregating mouse populations. Such genes are expected to have polymorphisms near their physical location (cis-variations) that affect their mRNA levels by altering one or more of the cis-regulatory elements. This led us to characterize the SNPs in promoter (5 kb upstream) and non-coding gene regions (introns and 5 kb downstream) (cis-SNPs) and the effects they may have on putative transcription factor binding sites. Results: We demonstrate that the cis-eQTL genes (CEGs) have a significantly higher frequency of cis-SNPs compared to non-CEGs (when both sets are taken from the non-IBD regions, i.e. regions not identical by descent). Most CEGs having cis-SNPs do not contain these SNPs in the phylogenetically conserved regions. In those CEGs that contain cis-SNPs in the phylogenetically conserved regions, enrichment of cis-SNPs occurs both within and outside of the conserved sequences. A higher fraction of CEGs are also seen to harbor cis-SNPs that affect predicted transcription factor binding sites, a likely consequence of the higher cis-SNP density in these genes. Conclusion: The present study provides the first genome-wide investigation of the putative cis-regulatory variations in a large set of genes whose levels of expression give rise to cis-linkage in segregating mammalian populations. Our results provide insights into the challenges that exist in identifying polymorphisms regulating gene expression using bioinformatic sequence analysis approaches. The data provided herein should benefit future investigations in this area.
Affiliation(s)
- Debraj GuhaThakurta
- Genetics, Rosetta Inpharmatics LLC, a wholly owned subsidiary of Merck & Co., Inc., 401 Terry Avenue North, Seattle, WA 98109, USA
- Tao Xie
- Genetics, Rosetta Inpharmatics LLC, a wholly owned subsidiary of Merck & Co., Inc., 401 Terry Avenue North, Seattle, WA 98109, USA
- Manish Anand
- Genetics, Rosetta Inpharmatics LLC, a wholly owned subsidiary of Merck & Co., Inc., 401 Terry Avenue North, Seattle, WA 98109, USA
- Microsoft Corporation, One Microsoft Way, Redmond, WA 98052-6399, USA
- Stephen W Edwards
- Genetics, Rosetta Inpharmatics LLC, a wholly owned subsidiary of Merck & Co., Inc., 401 Terry Avenue North, Seattle, WA 98109, USA
- Guoya Li
- Informatics, Rosetta Inpharmatics LLC, a wholly owned subsidiary of Merck & Co., Inc., 401 Terry Avenue North, Seattle, WA 98109, USA
- Susanna S Wang
- Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, CA 90095-1679, USA
- Eric E Schadt
- Genetics, Rosetta Inpharmatics LLC, a wholly owned subsidiary of Merck & Co., Inc., 401 Terry Avenue North, Seattle, WA 98109, USA
|