1
Vidovic MMC, Kloft M, Müller KR, Görnitz N. ML2Motif-Reliable extraction of discriminative sequence motifs from learning machines. PLoS One 2017; 12:e0174392. [PMID: 28346487 PMCID: PMC5367830 DOI: 10.1371/journal.pone.0174392] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/24/2016] [Accepted: 03/08/2017] [Indexed: 01/30/2023] Open
Abstract
High prediction accuracies are not the only objective to consider when solving problems using machine learning. Instead, particular scientific applications require some explanation of the learned prediction function. For computational biology, positional oligomer importance matrices (POIMs) have been successfully applied to explain the decision of support vector machines (SVMs) using weighted-degree (WD) kernels. To extract relevant biological motifs from POIMs, the motifPOIM method has been devised and showed promising results on real-world data. Our contribution in this paper is twofold: as an extension to POIMs, we propose gPOIM, a general measure of feature importance for arbitrary learning machines and feature sets (including, but not limited to, SVMs and CNNs) and devise a sampling strategy for efficient computation. As a second contribution, we derive a convex formulation of motifPOIMs that leads to more reliable motif extraction from gPOIMs. Empirical evaluations confirm the usefulness of our approach on artificially generated data as well as on real-world datasets.
Affiliation(s)
- Marius Kloft
- Department of Computer Science, Humboldt University of Berlin, Berlin, Germany
- Klaus-Robert Müller
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
- Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
- Nico Görnitz
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
2
Woo YJ, Wang T, Guadalupe T, Nebel RA, Vino A, Del Bene VA, Molholm S, Ross LA, Zwiers MP, Fisher SE, Foxe JJ, Abrahams BS. A Common CYFIP1 Variant at the 15q11.2 Disease Locus Is Associated with Structural Variation at the Language-Related Left Supramarginal Gyrus. PLoS One 2016; 11:e0158036. [PMID: 27351196 PMCID: PMC4924813 DOI: 10.1371/journal.pone.0158036] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 12/10/2015] [Accepted: 06/09/2016] [Indexed: 01/03/2023] Open
Abstract
Copy number variants (CNVs) at the Breakpoint 1 to Breakpoint 2 region at 15q11.2 (BP1-2) are associated with language-related difficulties and increased risk for developmental disorders in which language is compromised. To explore the underlying mechanisms, we investigated relationships between single nucleotide polymorphisms (SNPs) across the region and quantitative measures of human brain structure obtained by magnetic resonance imaging of healthy subjects. We report an association between rs4778298, a common variant at CYFIP1, and inter-individual variation in surface area across the left supramarginal gyrus (lh.SMG), a cortical structure implicated in speech and language, in independent discovery (n = 100) and validation cohorts (n = 2621). In silico analyses determined that this same variant, and others nearby, is also associated with differences in levels of CYFIP1 mRNA in human brain. One of these nearby polymorphisms is predicted to disrupt a consensus binding site for FOXP2, a transcription factor implicated in speech and language. Consistent with a model where FOXP2 regulates CYFIP1 levels and in turn influences lh.SMG surface area, analysis of publicly available expression data identified a relationship between expression of FOXP2 and CYFIP1 mRNA in human brain. We propose that altered CYFIP1 dosage, through aberrant patterning of the lh.SMG, may contribute to language-related difficulties associated with BP1-2 CNVs. More generally, this approach may be useful in clarifying the contribution of individual genes at CNV risk loci.
Affiliation(s)
- Young Jae Woo
- Department of Genetics, Albert Einstein College of Medicine, Bronx, United States of America
- Tao Wang
- Department of Epidemiology & Population Health, Albert Einstein College of Medicine, Bronx, United States of America
- Tulio Guadalupe
- Language and Genetics Department, Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
- Rebecca A. Nebel
- Department of Genetics, Albert Einstein College of Medicine, Bronx, United States of America
- Arianna Vino
- Language and Genetics Department, Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
- Victor A. Del Bene
- The Sheryl and Daniel R. Tishman Cognitive Neurophysiology Laboratory, Children's Evaluation and Rehabilitation Center (CERC), Albert Einstein College of Medicine, Bronx, United States of America
- Department of Pediatrics, Albert Einstein College of Medicine, Bronx, United States of America
- Dominick P. Purpura Department of Neuroscience, Albert Einstein College of Medicine, Bronx, United States of America
- Sophie Molholm
- The Sheryl and Daniel R. Tishman Cognitive Neurophysiology Laboratory, Children's Evaluation and Rehabilitation Center (CERC), Albert Einstein College of Medicine, Bronx, United States of America
- Department of Pediatrics, Albert Einstein College of Medicine, Bronx, United States of America
- Dominick P. Purpura Department of Neuroscience, Albert Einstein College of Medicine, Bronx, United States of America
- Lars A. Ross
- The Sheryl and Daniel R. Tishman Cognitive Neurophysiology Laboratory, Children's Evaluation and Rehabilitation Center (CERC), Albert Einstein College of Medicine, Bronx, United States of America
- Department of Pediatrics, Albert Einstein College of Medicine, Bronx, United States of America
- Dominick P. Purpura Department of Neuroscience, Albert Einstein College of Medicine, Bronx, United States of America
- Marcel P. Zwiers
- Language and Genetics Department, Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
- Simon E. Fisher
- Language and Genetics Department, Max Planck Institute for Psycholinguistics, Nijmegen, the Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- John J. Foxe
- The Sheryl and Daniel R. Tishman Cognitive Neurophysiology Laboratory, Children's Evaluation and Rehabilitation Center (CERC), Albert Einstein College of Medicine, Bronx, United States of America
- Department of Pediatrics, Albert Einstein College of Medicine, Bronx, United States of America
- The Cognitive Neurophysiology Laboratory, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, United States of America
- Dominick P. Purpura Department of Neuroscience, Albert Einstein College of Medicine, Bronx, United States of America
- Brett S. Abrahams
- Department of Genetics, Albert Einstein College of Medicine, Bronx, United States of America
- Dominick P. Purpura Department of Neuroscience, Albert Einstein College of Medicine, Bronx, United States of America
3
Tan G, Lenhard B. TFBSTools: an R/bioconductor package for transcription factor binding site analysis. Bioinformatics 2016; 32:1555-6. [PMID: 26794315 PMCID: PMC4866524 DOI: 10.1093/bioinformatics/btw024] [Citation(s) in RCA: 199] [Impact Index Per Article: 24.9] [Received: 11/10/2015] [Accepted: 01/13/2016] [Indexed: 11/13/2022] Open
Abstract
Summary: The ability to efficiently investigate transcription factor binding sites (TFBSs) genome-wide is central to computational studies of gene regulation. TFBSTools is an R/Bioconductor package for the analysis and manipulation of TFBSs and their associated transcription factor profile matrices. TFBSTools provides a toolkit for handling TFBS profile matrices, scanning sequences and alignments including whole genomes, and querying the JASPAR database. The functionality of the package can be easily extended to include advanced statistical analysis, data visualization and data integration. Availability and implementation: The package is implemented in R and available under GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/TFBSTools/). Contact: ge.tan09@imperial.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
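As a rough illustration of what scanning a sequence with a TFBS profile matrix involves, the following Python sketch converts a position frequency matrix into log-odds scores and scores every window of a sequence. It is not TFBSTools code (the package itself is written in R); the matrix values, the pseudocount, the uniform background, and all function names are hypothetical choices made for this example.

```python
import math

# Hypothetical position frequency matrix: counts per base at each motif position.
PFM = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 1],
    "T": [1, 1, 0, 7],
}

def pfm_to_pwm(pfm, pseudocount=0.5, background=0.25):
    """Convert counts to log2-odds scores against a uniform background (an assumption)."""
    length = len(next(iter(pfm.values())))
    pwm = []
    for i in range(length):
        total = sum(pfm[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((pfm[b][i] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm

def scan(sequence, pwm):
    """Return (start, score) for every window of motif length; sequence must contain only A, C, G, T."""
    k = len(pwm)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

def best_hit(sequence, pwm):
    """Return the highest-scoring (start, score) pair."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])

pwm = pfm_to_pwm(PFM)
print(best_hit("TTACGTAA", pwm))
```

In practice a score threshold would usually be applied to the list returned by `scan` rather than keeping only the single best window.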
Affiliation(s)
- Ge Tan
- Computational Regulatory Genomics, MRC Clinical Sciences Centre, Imperial College London, London W12 0NN, UK
- Boris Lenhard
- Computational Regulatory Genomics, MRC Clinical Sciences Centre, Imperial College London, London W12 0NN, UK
4
Vidovic MMC, Görnitz N, Müller KR, Rätsch G, Kloft M. SVM2Motif--Reconstructing Overlapping DNA Sequence Motifs by Mimicking an SVM Predictor. PLoS One 2015; 10:e0144782. [PMID: 26690911 PMCID: PMC4686957 DOI: 10.1371/journal.pone.0144782] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/28/2015] [Accepted: 11/22/2015] [Indexed: 12/02/2022] Open
Abstract
Identifying discriminative motifs underlying the functionality and evolution of organisms is a major challenge in computational biology. Machine learning approaches such as support vector machines (SVMs) achieve state-of-the-art performance in genomic discrimination tasks, but, due to their black-box character, the motifs underlying their decision functions are largely unknown. As a remedy, positional oligomer importance matrices (POIMs) allow us to visualize the significance of position-specific subsequences. Although POIMs are a major step towards the explanation of trained SVM models, they suffer from the fact that their size grows exponentially in the length of the motif, which renders their manual inspection feasible only for comparably small motif sizes, typically k ≤ 5. In this work, we extend the work on positional oligomer importance matrices by presenting a new machine-learning methodology, entitled motifPOIM, to extract the truly relevant motifs, regardless of their length and complexity, underlying the predictions of a trained SVM model. Our framework thereby considers the motifs as free parameters in a probabilistic model, a task which can be phrased as a non-convex optimization problem. The exponential dependence of the POIM size on the oligomer length poses a major numerical challenge, which we address by an efficient optimization framework that allows us to find possibly overlapping motifs consisting of up to hundreds of nucleotides. We demonstrate the efficacy of our approach on a synthetic data set as well as a real-world human splice site data set.
Affiliation(s)
- Nico Görnitz
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
- Klaus-Robert Müller
- Machine Learning Group, Technical University of Berlin, Berlin, Germany
- Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136–713, Korea
- Gunnar Rätsch
- Memorial Sloan-Kettering Cancer Center, New York City, New York, United States of America
- Marius Kloft
- Department of Computer Science, Humboldt University of Berlin, Berlin, Germany
5
oPOSSUM-3: advanced analysis of regulatory motif over-representation across genes or ChIP-Seq datasets. G3-GENES GENOMES GENETICS 2012; 2:987-1002. [PMID: 22973536 PMCID: PMC3429929 DOI: 10.1534/g3.112.003202] [Citation(s) in RCA: 230] [Impact Index Per Article: 19.2] [Received: 05/27/2012] [Accepted: 06/11/2012] [Indexed: 01/12/2023]
Abstract
oPOSSUM-3 is a web-accessible software system for identification of over-represented transcription factor binding sites (TFBS) and TFBS families in either DNA sequences of co-expressed genes or sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of co-regulated genes and published ChIP-Seq data demonstrates the capacity for oPOSSUM-3 to identify mediating transcription factors (TF) for co-regulated genes or co-recovered sequences. oPOSSUM-3 is available at http://opossum.cisreg.ca.
6
Kwon AT, Chou AY, Arenillas DJ, Wasserman WW. Validation of skeletal muscle cis-regulatory module predictions reveals nucleotide composition bias in functional enhancers. PLoS Comput Biol 2011; 7:e1002256. [PMID: 22144875 PMCID: PMC3228787 DOI: 10.1371/journal.pcbi.1002256] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 01/31/2011] [Accepted: 09/16/2011] [Indexed: 11/19/2022] Open
Abstract
We performed a genome-wide scan for muscle-specific cis-regulatory modules (CRMs) using three computational prediction programs. Based on the predictions, 339 candidate CRMs were tested in cell culture with NIH3T3 fibroblasts and C2C12 myoblasts for capacity to direct selective reporter gene expression to differentiated C2C12 myotubes. A subset of 19 CRMs validated as functional in the assay. The rate of predictive success reveals striking limitations of computational regulatory sequence analysis methods for CRM discovery. Motif-based methods performed no better than predictions based only on sequence conservation. Analysis of the properties of the functional sequences relative to inactive sequences indicates that nucleotide sequence composition can be an important characteristic to incorporate in future methods for improved predictive specificity. Muscle-related TFBSs predicted within the functional sequences display greater sequence conservation than non-TFBS flanking regions. Comparison with recent MyoD and histone modification ChIP-Seq data supports the validity of the functional regions.
For efficient identification of genomic sequences responsible for regulating gene expression, a number of computer programs have been developed for automatic annotation of these regulatory regions. We searched for potential regulatory regions responsible for controlling the expression of skeletal muscle-specific genes using these programs, and validated the predictions in a popular cell culture model for muscle. We were able to identify 19 previously uncharacterized regulatory regions for muscle genes. The accuracy of the predictions made by these programs leaves much to be desired, leading us to conclude that other signals in addition to the sequence information will be required to achieve sufficient predictive power for genome annotation. Genomic regions with confirmed regulatory function were compared against non-functional sequences, revealing sequence conservation, composition and chromatin modification properties as important signals in determining regulatory region functionality.
Affiliation(s)
- Andrew T. Kwon
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
- Alice Yi Chou
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
- David J. Arenillas
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
- Wyeth W. Wasserman
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program, and Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
7
Meng G, Mosig A, Vingron M. A computational evaluation of over-representation of regulatory motifs in the promoter regions of differentially expressed genes. BMC Bioinformatics 2010; 11:267. [PMID: 20487530 PMCID: PMC3098066 DOI: 10.1186/1471-2105-11-267] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 07/03/2009] [Accepted: 05/20/2010] [Indexed: 12/28/2022] Open
Abstract
Background: Observed co-expression of a group of genes is frequently attributed to co-regulation by shared transcription factors. This assumption has led to the hypothesis that promoters of co-expressed genes should share common regulatory motifs, which forms the basis for numerous computational tools that search for these motifs. While frequently explored for yeast, the validity of the underlying hypothesis has not been assessed systematically in mammals. This demonstrates the need for a systematic and quantitative evaluation of the degree to which co-expressed genes share over-represented motifs in mammals. Results: We identified 33 experiments for human and mouse in the ArrayExpress Database where transcription factors were manipulated and which exhibited a significant number of differentially expressed genes. We checked for over-representation of transcription factor binding sites in up- or down-regulated genes using the over-representation analysis tool oPOSSUM. In 25 out of 33 experiments, this procedure identified the binding matrices of the affected transcription factors. We also carried out de novo prediction of regulatory motifs shared by differentially expressed genes. Again, the detected motifs shared significant similarity with the matrices of the affected transcription factors. Conclusions: Our results support the claim that functional regulatory motifs are over-represented in sets of differentially expressed genes and that they can be detected with computational methods.
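The abstract does not state which statistic was used to call a motif over-represented. One common choice, sketched here in Python purely as an illustration, is a one-sided hypergeometric test: given a background of N genes of which K carry a predicted binding site, it gives the probability of seeing at least k site-carrying genes in a set of n differentially expressed genes by chance. The function name and the example counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the motif. Illustrative only; not taken from the paper."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical toy numbers: 10 background genes, 5 with the motif,
# and both of the 2 differentially expressed genes carry it.
p_value = hypergeom_upper_tail(k=2, n=2, K=5, N=10)
print(p_value)
```

A small p-value would suggest that the motif occurs in the gene set more often than expected from the background alone; with many motifs tested, a multiple-testing correction would normally follow.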
Affiliation(s)
- Guofeng Meng
- CAS-MPG Partner Institute and Key Laboratory for Computational Biology, Shanghai Institutes for Biological Sciences, 320 Yue Yang Road, 200031, Shanghai, China.
8
Luo JW, Wang T. Motif discovery using an immune genetic algorithm. J Theor Biol 2010; 264:319-25. [DOI: 10.1016/j.jtbi.2010.02.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 09/29/2009] [Revised: 02/03/2010] [Accepted: 02/06/2010] [Indexed: 10/19/2022]
9
Interferon-mediated enhancement of in vitro replication of porcine circovirus type 2 is influenced by an interferon-stimulated response element in the PCV2 genome. Virus Res 2009; 145:236-43. [DOI: 10.1016/j.virusres.2009.07.009] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 04/13/2009] [Revised: 07/13/2009] [Accepted: 07/13/2009] [Indexed: 01/14/2023]
10
Marks VD, Ho Sui SJ, Erasmus D, van der Merwe GK, Brumm J, Wasserman WW, Bryan J, van Vuuren HJJ. Dynamics of the yeast transcriptome during wine fermentation reveals a novel fermentation stress response. FEMS Yeast Res 2008; 8:35-52. [PMID: 18215224 DOI: 10.1111/j.1567-1364.2007.00338.x] [Citation(s) in RCA: 142] [Impact Index Per Article: 8.9] [Indexed: 01/24/2023] Open
Abstract
In this study, genome-wide expression analyses were used to study the response of Saccharomyces cerevisiae to stress throughout a 15-day wine fermentation. Forty per cent of the yeast genome significantly changed expression levels to mediate long-term adaptation to fermenting grape must. Among the genes that changed expression levels, a group of 223 genes was identified, which was designated as fermentation stress response (FSR) genes that were dramatically induced at various points during fermentation. FSR genes sustain high levels of induction up to the final time point and exhibited changes in expression levels ranging from four- to 80-fold. The FSR is novel; 62% of the genes involved have not been implicated in global stress responses and 28% of the FSR genes have no functional annotation. Genes involved in respiratory metabolism and gluconeogenesis were expressed during fermentation despite the presence of high concentrations of glucose. Ethanol, rather than nutrient depletion, seems to be responsible for entry of yeast cells into the stationary phase.
Affiliation(s)
- Virginia D Marks
- Wine Research Centre, University of British Columbia, Vancouver, Canada
11
Lones M, Tyrrell A. Regulatory motif discovery using a population clustering evolutionary algorithm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:403-414. [PMID: 17666760 DOI: 10.1109/tcbb.2007.1044] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 05/16/2023]
Abstract
This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within local regions of the population, promoting overall solution diversity and encouraging discovery of multiple solutions. Experiments using synthetic data sets have demonstrated the algorithm's capacity to find position frequency matrix models of known regulatory motifs in relatively long promoter sequences. These experiments have also shown the algorithm's ability to maintain diversity during search and discover multiple motifs within a single population. The utility of the algorithm for discovering motifs in real biological data is demonstrated by its ability to find meaningful motifs within muscle-specific regulatory sequences.
12
Ho Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW. oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucleic Acids Res 2007; 35:W245-52. [PMID: 17576675 PMCID: PMC1933229 DOI: 10.1093/nar/gkm427] [Citation(s) in RCA: 132] [Impact Index Per Article: 7.8] [Indexed: 11/14/2022] Open
Abstract
The identification of over-represented transcription factor binding sites from sets of co-expressed genes provides insights into the mechanisms of regulation for diverse biological contexts. oPOSSUM, an internet-based system for such studies of regulation, has been improved and expanded in this new release. New features include a worm-specific version for investigating binding sites conserved between Caenorhabditis elegans and C. briggsae, as well as a yeast-specific version for the analysis of co-expressed sets of Saccharomyces cerevisiae genes. The human and mouse applications feature improvements in ortholog mapping, sequence alignments and the delineation of multiple alternative promoters. oPOSSUM2, introduced for the analysis of over-represented combinations of motifs in human and mouse genes, has been integrated with the original oPOSSUM system. Analysis using user-defined background gene sets is now supported. The transcription factor binding site models have been updated to include new profiles from the JASPAR database. oPOSSUM is available at http://www.cisreg.ca/oPOSSUM/
Affiliation(s)
- Shannan J. Ho Sui
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program and Department of Medical Genetics, University of British Columbia, Vancouver BC, Canada
- Debra L. Fulton
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program and Department of Medical Genetics, University of British Columbia, Vancouver BC, Canada
- David J. Arenillas
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program and Department of Medical Genetics, University of British Columbia, Vancouver BC, Canada
- Andrew T. Kwon
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program and Department of Medical Genetics, University of British Columbia, Vancouver BC, Canada
- Wyeth W. Wasserman
- Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Genetics Graduate Program and Department of Medical Genetics, University of British Columbia, Vancouver BC, Canada
- *To whom correspondence should be addressed. +1 604 875 3812, +1 604 875 3819
13
Kankainen M, Löytynoja A. MATLIGN: a motif clustering, comparison and matching tool. BMC Bioinformatics 2007; 8:189. [PMID: 17559640 PMCID: PMC1925120 DOI: 10.1186/1471-2105-8-189] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 01/23/2007] [Accepted: 06/08/2007] [Indexed: 11/21/2022] Open
Abstract
Background: Sequence motifs representing transcription factor binding sites (TFBS) are commonly encoded as position frequency matrices (PFM) or degenerate consensus sequences (CS). These formats are used to represent the characterised TFBS profiles stored in transcription factor databases, as well as to represent the potential motifs predicted using computational methods. To fill the gap between the known and predicted motifs, methods are needed for the post-processing of prediction results, i.e. for matching, comparison and clustering of pre-selected motifs. The computational identification of over-represented motifs in sets of DNA sequences is, in particular, a task where post-processing can dramatically simplify the analysis. Efficient post-processing, for example, reduces the redundancy of the motifs predicted and enables them to be annotated. Results: In order to facilitate the post-processing of motifs, in both PFM and CS formats, we have developed a tool called Matlign. The tool aligns and evaluates the similarity of motifs using a combination of scoring functions, and visualises the results using hierarchical clustering. By limiting the number of distinct gaps created (though, not their length), the alignment algorithm also correctly aligns motifs with an internal spacer. The method selects the best non-redundant motif set, with repetitive motifs merged together, by cutting the hierarchical tree using silhouette values. Our analyses show that Matlign can reliably discover the most similar analogue from a collection of characterised regulatory elements such that the method is also useful for the annotation of motif predictions by PFM library searches. Conclusion: Matlign is a user-friendly tool for post-processing large collections of DNA sequence motifs. Starting from a large number of potential regulatory motifs, Matlign provides a researcher with a non-redundant set of motifs, which can then be further associated to known regulatory elements.
A web-server is available at .
Affiliation(s)
- Matti Kankainen
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
14
Wang J. A new framework for identifying combinatorial regulation of transcription factors: a case study of the yeast cell cycle. J Biomed Inform 2007; 40:707-25. [PMID: 17418646 DOI: 10.1016/j.jbi.2007.02.003] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Received: 08/25/2006] [Revised: 12/23/2006] [Accepted: 02/27/2007] [Indexed: 01/24/2023]
Abstract
By integrating heterogeneous functional genomic datasets, we have developed a new framework for detecting combinatorial control of gene expression, which includes estimating transcription factor activities using a singular value decomposition method and reducing high-dimensional input gene space by considering genomic properties of gene clusters. The prediction of cooperative gene regulation is accomplished by either Gaussian Graphical Models or Pairwise Mixed Graphical Models. The proposed framework was tested on yeast cell cycle datasets: (1) 54 known yeast cell cycle genes with 9 cell cycle regulators and (2) 676 putative yeast cell cycle genes with 9 cell cycle regulators. The new framework gave promising results on inferring TF-TF and TF-gene interactions. It also revealed several interesting mechanisms such as negatively correlated protein-protein interactions and low affinity protein-DNA interactions that may be important during the yeast cell cycle. The new framework may easily be extended to study other higher eukaryotes.
Affiliation(s)
- Junbai Wang
- Department of Biological Sciences, Columbia University, 1212, Amsterdam Avenue, MC 2442, New York, NY 10027, USA.
15
Analyzing the dose-dependence of the Saccharomyces cerevisiae global transcriptional response to methyl methanesulfonate and ionizing radiation. BMC Genomics 2006; 7:305. [PMID: 17140446 PMCID: PMC1698923 DOI: 10.1186/1471-2164-7-305] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.8] [Received: 04/14/2006] [Accepted: 12/01/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: One of the most crucial tasks for a cell to ensure its long term survival is preserving the integrity of its genetic heritage via maintenance of DNA structure and sequence. While the DNA damage response in the yeast Saccharomyces cerevisiae, a model eukaryotic organism, has been extensively studied, much remains to be elucidated about how the organism senses and responds to different types and doses of DNA damage. We have measured the global transcriptional response of S. cerevisiae to multiple doses of two representative DNA damaging agents, methyl methanesulfonate (MMS) and gamma radiation. RESULTS: Hierarchical clustering of genes with a statistically significant change in transcription illustrated the differences in the cellular responses to MMS and gamma radiation. Overall, MMS produced a larger transcriptional response than gamma radiation, and many of the genes modulated in response to MMS are involved in protein and translational regulation. Several clusters of coregulated genes whose responses varied with DNA damaging agent dose were identified. Perhaps the most interesting cluster contained four genes exhibiting biphasic induction in response to MMS dose. All of the genes (DUN1, RNR2, RNR4, and HUG1) are involved in the Mec1p kinase pathway known to respond to MMS, presumably due to stalled DNA replication forks. The biphasic responses of these genes suggest that the pathway is induced at lower levels as MMS dose increases. The genes in this cluster with a threefold or greater transcriptional response to gamma radiation all showed an increased induction with increasing gamma radiation dosage. CONCLUSION: Analyzing genome-wide transcriptional changes to multiple doses of external stresses enabled the identification of cellular responses that are modulated by magnitude of the stress, providing insights into how a cell deals with genotoxicity.
16
Chua G, Robinson MD, Morris Q, Hughes TR. Transcriptional networks: reverse-engineering gene regulation on a global scale. Curr Opin Microbiol 2005; 7:638-46. [PMID: 15556037 DOI: 10.1016/j.mib.2004.10.009] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.4] [Indexed: 12/16/2022]
Abstract
A major objective in post-genome research is to fully understand the transcriptional control of each gene and the targets of each transcription factor. In yeast, large-scale experimental and computational approaches have been applied to identify co-regulated genes, cis regulatory elements, and transcription factor DNA binding sites in vivo. Methods for modeling and predicting system behavior, and for reconciling discrepancies among data types, are being explored. The results indicate that a complete and comprehensive yeast transcriptional network will ultimately be achieved.
Affiliation(s)
- Gordon Chua
- Banting and Best Department of Medical Research, University of Toronto, 112 College Street, Room 307, Toronto, Ontario M5G 1L6, Canada
17
Alkema WBL, Lenhard B, Wasserman WW. Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus. Genome Res 2004; 14:1362-73. [PMID: 15231752 PMCID: PMC442153 DOI: 10.1101/gr.2242604] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.7] [Indexed: 11/24/2022]
Abstract
A transcriptional regulatory network encompasses sets of genes (regulons) whose expression states are directly altered in response to an activating signal, mediated by trans-acting regulatory proteins and cis-acting regulatory sequences. Enumeration of these network components is an essential step toward the creation of a framework for systems-based analysis of biological processes. Profile-based methods for the detection of cis-regulatory elements are often applied to predict regulon members, but they suffer from poor specificity. In this report we describe Regulogger, a novel computational method that uses comparative genomics to eliminate spurious members of predicted gene regulons. Regulogger produces regulogs, sets of coregulated genes for which the regulatory sequence has been conserved across multiple organisms. The quantitative method assigns a confidence score to each predicted regulog member on the basis of the degree of conservation of protein sequence and regulatory mechanisms. When applied to a reference collection of regulons from Escherichia coli, Regulogger increased the specificity of predictions up to 25-fold over methods that use cis-element detection in isolation. The enhanced specificity was observed across a wide range of biologically meaningful parameter combinations, indicating a robust and broad utility for the method. The power of computational pattern discovery methods coupled with Regulogger to unravel transcriptional networks was demonstrated in an analysis of the genome of Staphylococcus aureus. A total of 125 regulogs were found in this organism, including both well-defined functional groups and a subset with unknown functions.
Affiliation(s)
- Wynand B L Alkema
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
18
Höglund A, Kohlbacher O. From sequence to structure and back again: approaches for predicting protein-DNA binding. Proteome Sci 2004; 2:3. [PMID: 15202939 PMCID: PMC441406 DOI: 10.1186/1477-5956-2-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 04/15/2004] [Accepted: 06/17/2004] [Indexed: 12/12/2022] Open
Abstract
Gene regulation in higher organisms is achieved by a complex network of transcription factors (TFs). Modulating gene expression and exploring gene function are major aims in molecular biology. Furthermore, the identification of putative target genes for a certain TF serves as a powerful tool for specific targeting of rational drugs. Detecting the short and variable transcription factor binding sites (TFBSs) in genomic DNA is an intriguing challenge for computational and structural biologists. Fast and reliable computational methods for predicting TFBSs on a whole-genome scale offer several advantages compared to the current experimental methods that are rather laborious and slow. Two main approaches are being explored, advanced sequence-based algorithms and structure-based methods. The aim of this review is to outline the computational and experimental methods currently being applied in the field of protein-DNA interactions. With a focus on the former, the current state of the art in modeling these interactions is discussed. Surveying sequence and structure-based methods for predicting TFBSs, we conclude that in order to achieve a sound and specific method applicable on genomic sequences it is desirable and important to bring these two approaches together.
Affiliation(s)
- Annette Höglund
- Department for Simulation of Biological Systems, Eberhard Karls University Tübingen, Sand 14, D-72076 Tübingen, Germany
- Oliver Kohlbacher
- Department for Simulation of Biological Systems, Eberhard Karls University Tübingen, Sand 14, D-72076 Tübingen, Germany
19
Sandelin A, Wasserman WW. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol 2004; 338:207-15. [PMID: 15066426 DOI: 10.1016/j.jmb.2004.02.048] [Citation(s) in RCA: 120] [Impact Index Per Article: 6.0] [Received: 10/07/2003] [Revised: 02/12/2004] [Accepted: 02/13/2004] [Indexed: 01/28/2023]
Abstract
Diverse computational and experimental efforts are required to elucidate the control circuitry regulating the transcription of human genes. The fusion of gene-specific promoter analyses with large microarray studies and bioinformatics advances has produced optimism that significant progress can be made in unravelling this complex network. Within bioinformatics, past emphasis for improved pattern discovery has been placed upon "phylogenetic footprinting", the identification of sequences conserved over moderate periods of evolution (e.g. human and mouse comparisons). We introduce a new direction in bioinformatics based on the constraints imposed by the structures of DNA-binding proteins. For most structurally related families of transcription factors, there are clear similarities in the sequences of the sites to which they bind. On the basis of this observation, we construct familial binding profiles for well-characterized transcription factor families. The profiles are shown to classify correctly the structural class of mediating transcription factors for novel motifs in 88% of cases. By incorporating the familial profiles into pattern discovery procedures, we demonstrate that functional binding sites can be found in genomic sequences of dramatically greater length than is possible otherwise. Thus, incorporating familial models can overcome the signal-to-noise challenge that has hindered the transition from microarray data to regulatory control sequences for human genes. Biochemically motivated constraints upon sequence diversity of binding sites will complement the genetically motivated constraints imposed in "phylogenetic footprinting" algorithms.
Affiliation(s)
- Albin Sandelin
- Center for Genomics and Bioinformatics, Karolinska Institutet, Stockholm, Sweden
20
Castrillo JI, Oliver SG. Yeast as a Touchstone in Post-genomic Research: Strategies for Integrative Analysis in Functional Genomics. BMB Rep 2004; 37:93-106. [PMID: 14761307 DOI: 10.5483/bmbrep.2004.37.1.093] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.3] [Indexed: 01/06/2023] Open
Abstract
The new complexity arising from the genome sequencing projects requires new comprehensive post-genomic strategies: advanced studies in regulatory mechanisms, application of new high-throughput technologies at a genome-wide scale, at the different levels of cellular complexity (genome, transcriptome, proteome and metabolome), efficient analysis of the results, and application of new bioinformatic methods in an integrative or systems biology perspective. This can be accomplished in studies with model organisms under controlled conditions. In this review a perspective of the favourable characteristics of yeast as a touchstone model in post-genomic research is presented. The state of the art, latest advances in the field and bottlenecks, new strategies, new regulatory mechanisms, applications (patents) and high-throughput technologies, most of them being developed and validated in yeast, are presented. The optimal characteristics of yeast as a well-defined system for comprehensive studies under controlled conditions make it a perfect model to be used in integrative, "systems biology" studies to get new insights into the mechanisms of regulation (regulatory networks) responsible for specific phenotypes under particular environmental conditions, to be applied to more complex organisms (e.g. plants, human).
Affiliation(s)
- Juan I Castrillo
- School of Biological Sciences, University of Manchester, 2205 Stopford Building, Oxford Road, Manchester M13 9PT, UK.
21
Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004; 32:D91-4. [PMID: 14681366 PMCID: PMC308747 DOI: 10.1093/nar/gkh012] [Citation(s) in RCA: 1205] [Impact Index Per Article: 60.3] [Indexed: 02/06/2023] Open
Abstract
The analysis of regulatory regions in genome sequences is strongly based on the detection of potential transcription factor binding sites. The preferred models for representation of transcription factor binding specificity have been termed position-specific scoring matrices. JASPAR is an open-access database of annotated, high-quality, matrix-based transcription factor binding site profiles for multicellular eukaryotes. The profiles were derived exclusively from sets of nucleotide sequences experimentally demonstrated to bind transcription factors. The database is complemented by a web interface for browsing, searching and subset selection, an online sequence analysis utility and a suite of programming tools for genome-wide and comparative genomic analysis of regulatory regions. JASPAR is available at http://jaspar.cgb.ki.se.
Affiliation(s)
- Albin Sandelin
- Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, S-17177 Stockholm, Sweden