1
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet 2022; 54:613-624. [PMID: 35551305 DOI: 10.1038/s41588-022-01048-5] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 09/20/2021] [Accepted: 03/08/2022] [Indexed: 02/06/2023]
Abstract
Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally nonequivalent instances of the same TF motif that are determined by motif-flanking sequence and intermotif distances. We validated these rules experimentally and demonstrated that they can be generalized to humans by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
2
Atak ZK, Taskiran II, Demeulemeester J, Flerin C, Mauduit D, Minnoye L, Hulselmans G, Christiaens V, Ghanem GE, Wouters J, Aerts S. Interpretation of allele-specific chromatin accessibility using cell state-aware deep learning. Genome Res 2021; 31:1082-1096. [PMID: 33832990 PMCID: PMC8168584 DOI: 10.1101/gr.260851.120] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 01/30/2020] [Accepted: 04/05/2021] [Indexed: 12/26/2022]
Abstract
Genomic sequence variation within enhancers and promoters can have a significant impact on the cellular state and phenotype. However, sifting through the millions of candidate variants in a personal genome or a cancer genome, to identify those that impact cis-regulatory function, remains a major challenge. Interpretation of noncoding genome variation benefits from explainable artificial intelligence to predict and interpret the impact of a mutation on gene regulation. Here we generate phased whole genomes with matched chromatin accessibility, histone modifications, and gene expression for 10 melanoma cell lines. We find that training a specialized deep learning model, called DeepMEL2, on melanoma chromatin accessibility data can capture the various regulatory programs of the melanocytic and mesenchymal-like melanoma cell states. This model outperforms motif-based variant scoring, as well as more generic deep learning models. We detect hundreds to thousands of allele-specific chromatin accessibility variants (ASCAVs) in each melanoma genome, of which 15%-20% can be explained by gains or losses of transcription factor binding sites. A considerable fraction of ASCAVs are caused by changes in AP-1 binding, as confirmed by matched ChIP-seq data to identify allele-specific binding of JUN and FOSL1. Finally, by augmenting the DeepMEL2 model with ChIP-seq data for GABPA, the TERT promoter mutation, as well as additional ETS motif gains, can be identified with high confidence. In conclusion, we present a new integrative genomics approach and a deep learning model to identify and interpret functional enhancer mutations with allelic imbalance of chromatin accessibility and gene expression.
Affiliation(s)
- Zeynep Kalender Atak
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Ibrahim Ihsan Taskiran
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Jonas Demeulemeester
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
  - Cancer Genomics Laboratory, The Francis Crick Institute, London NW1 1AT, United Kingdom
- Christopher Flerin
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- David Mauduit
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Liesbeth Minnoye
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Gert Hulselmans
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Valerie Christiaens
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Ghanem-Elias Ghanem
  - Institut Jules Bordet, Université Libre de Bruxelles, 1000 Brussels, Belgium
- Jasper Wouters
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Stein Aerts
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
3
Cheng Z, Vermeulen M, Rollins-Green M, DeVeale B, Babak T. Cis-regulatory mutations with driver hallmarks in major cancers. iScience 2021; 24:102144. [PMID: 33665563 PMCID: PMC7903341 DOI: 10.1016/j.isci.2021.102144] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/11/2020] [Revised: 09/02/2020] [Accepted: 01/25/2021] [Indexed: 12/05/2022] Open
Abstract
Despite the recent availability of complete genome sequences of tumors from thousands of patients, isolating disease-causing (driver) non-coding mutations from the plethora of somatic variants remains challenging, and only a handful of validated examples exist. By integrating whole-genome sequencing, genetic data, and allele-specific gene expression from TCGA, we identified 320 somatic non-coding mutations that affect gene expression in cis (FDR<0.25). These mutations cluster into 47 cis-regulatory elements that modulate expression of their subject genes through diverse molecular mechanisms. We further show that these mutations have hallmark features of non-coding drivers; namely, that they preferentially disrupt transcription factor binding motifs, are associated with a selective advantage, increased oncogene expression and decreased tumor suppressor expression.
Enrichment of functional non-coding somatic mutations predicts drivers. Elevated variant allele frequencies are consistent with roles in tumorigenesis. Putative non-coding drivers disrupt transcription factor binding motifs. Predicted drivers associate with increased oncogene and decreased TSG expression.
Affiliation(s)
- Zhongshan Cheng
  - Department of Biology, Queen's University, Kingston, ON K7L 3N6, Canada
- Michael Vermeulen
  - Department of Biology, Queen's University, Kingston, ON K7L 3N6, Canada
- Brian DeVeale
  - The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, Center for Reproductive Sciences, University of California, San Francisco, San Francisco, CA 94143, USA
- Tomas Babak
  - Department of Biology, Queen's University, Kingston, ON K7L 3N6, Canada
4
Ilan Y, Spigelman Z. Establishing patient-tailored variability-based paradigms for anti-cancer therapy: Using the inherent trajectories which underlie cancer for overcoming drug resistance. Cancer Treat Res Commun 2020; 25:100240. [PMID: 33246316 DOI: 10.1016/j.ctarc.2020.100240] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/17/2020] [Revised: 10/30/2020] [Accepted: 11/16/2020] [Indexed: 06/11/2023]
Abstract
Drug resistance is a major obstacle to successful therapy of many malignancies and contributes to the loss of response to chemotherapy and immunotherapy. Tumor-related compensatory adaptation mechanisms contribute to the development of drug resistance. Variability is inherent to biological systems, and altered patterns of variability are associated with disease conditions. The marked intra- and inter-patient tumor heterogeneity, and the diverse mechanisms contributing to drug resistance in different subjects, which may change over time even in the same patient, necessitate the development of personalized dynamic approaches for overcoming drug resistance. Altered dosing regimens, the potential role of chronotherapy, and drug holidays are effective in cancer therapy and immunotherapy. In the present review, we describe the difficulty of overcoming drug resistance in a dynamic system and present the use of the inherent trajectories which underlie cancer development for building therapeutic regimens that can overcome resistance. The establishment of a platform wherein patient-tailored variability signatures are used for overcoming resistance and for ensuring long-term, sustainable, improved responses is presented.
Affiliation(s)
- Yaron Ilan
  - Department of Medicine, Hebrew University-Hadassah Medical Center, Jerusalem, Israel
- Zachary Spigelman
  - Department of Hematology and Oncology, Lahey Hospital and Beth Israel Medical Center, MA, USA
5
Liang R, Xie J, Zhang C, Zhang M, Huang H, Huo H, Cao X, Niu B. Identifying Cancer Targets Based on Machine Learning Methods via Chou's 5-steps Rule and General Pseudo Components. Curr Top Med Chem 2019; 19:2301-2317. [PMID: 31622219 DOI: 10.2174/1568026619666191016155543] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/14/2019] [Revised: 07/19/2019] [Accepted: 08/26/2019] [Indexed: 01/09/2023]
Abstract
In recent years, the successful implementation of the Human Genome Project has made people realize that genetic, environmental and lifestyle factors should be considered together when studying cancer, owing to the complexity and varied forms of the disease. The increasing availability and growth rate of 'big data' derived from various omics open a new window for the study and therapy of cancer. In this paper, we introduce the application of machine learning methods to handling cancer big data, including the use of artificial neural networks, support vector machines, ensemble learning and naïve Bayes classifiers.
Affiliation(s)
- Ruirui Liang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Jiayang Xie
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Chi Zhang
  - Foshan Huaxia Eye Hospital, Huaxia Eye Hospital Group, Foshan 528000, China
- Mengying Zhang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Hai Huang
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
- Haizhong Huo
  - Department of General Surgery, Shanghai Ninth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200011, China
- Xin Cao
  - Zhongshan Hospital, Institute of Clinical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
- Bing Niu
  - School of Life Sciences, Shanghai University, Shanghai, 200444, China
6
Xie X, Hanson C, Sinha S. Mechanistic interpretation of non-coding variants for discovering transcriptional regulators of drug response. BMC Biol 2019; 17:62. [PMID: 31362726 PMCID: PMC6664756 DOI: 10.1186/s12915-019-0679-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/14/2019] [Accepted: 07/09/2019] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Identification of functional non-coding variants and their mechanistic interpretation is a major challenge of modern genomics, especially for precision medicine. Transcription factor (TF) binding profiles and epigenomic landscapes in reference samples allow functional annotation of the genome, but do not provide ready answers regarding the effects of non-coding variants on phenotypes. A promising computational approach is to build models that predict TF-DNA binding from sequence, and use such models to score a variant's impact on TF binding strength. Here, we asked if this mechanistic approach to variant interpretation can be combined with information on genotype-phenotype associations to discover transcription factors regulating phenotypic variation among individuals.
RESULTS: We developed a statistical approach that integrates phenotype, genotype, gene expression, TF ChIP-seq, and Hi-C chromatin interaction data to answer this question. Using drug sensitivity of lymphoblastoid cell lines as the phenotype of interest, we tested if non-coding variants statistically linked to the phenotype are enriched for strong predicted impact on DNA binding strength of a TF and thus identified TFs regulating individual differences in the phenotype. Our approach relies on a new method for predicting variant impact on TF-DNA binding that uses a combination of biophysical modeling and machine learning. We report statistical and literature-based support for many of the TFs discovered here as regulators of drug response variation. We show that the use of mechanistically driven variant impact predictors can identify TF-drug associations that would otherwise be missed. We examined in depth one reported association, that of the transcription factor ELF1 with the drug doxorubicin, and identified several genes that may mediate this regulatory relationship.
CONCLUSION: Our work represents initial steps in utilizing predictions of variant impact on TF binding sites for discovery of regulatory mechanisms underlying phenotypic variation. Future advances on this topic will be greatly beneficial to the reconstruction of phenotype-associated gene regulatory networks.
Affiliation(s)
- Xiaoman Xie
  - Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Casey Hanson
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
- Saurabh Sinha
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
  - Institute of Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
7
An information theoretic treatment of sequence-to-expression modeling. PLoS Comput Biol 2018; 14:e1006459. [PMID: 30256780 PMCID: PMC6175532 DOI: 10.1371/journal.pcbi.1006459] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/01/2018] [Revised: 10/08/2018] [Accepted: 08/24/2018] [Indexed: 11/23/2022] Open
Abstract
Studying a gene’s regulatory mechanisms is a tedious process that involves identification of candidate regulators by transcription factor (TF) knockout or over-expression experiments, delineation of enhancers by reporter assays, and demonstration of direct TF influence by site mutagenesis, among other approaches. Such experiments are often chosen based on the biologist’s intuition, from several testable hypotheses. We pursue the goal of making this process systematic by using ideas from information theory to reason about experiments in gene regulation, in the hope of ultimately enabling rigorous experiment design strategies. For this, we make use of a state-of-the-art mathematical model of gene expression, which provides a way to formalize our current knowledge of cis- as well as trans- regulatory mechanisms of a gene. Ambiguities in such knowledge can be expressed as uncertainties in the model, which we capture formally by building an ensemble of plausible models that fit the existing data and defining a probability distribution over the ensemble. We then characterize the impact of a new experiment on our understanding of the gene’s regulation based on how the ensemble of plausible models and its probability distribution changes when challenged with results from that experiment. This allows us to assess the ‘value’ of the experiment retroactively as the reduction in entropy of the distribution (information gain) resulting from the experiment’s results. We fully formalize this novel approach to reasoning about gene regulation experiments and use it to evaluate a variety of perturbation experiments on two developmental genes of D. melanogaster. We also provide objective and ‘biologist-friendly’ descriptions of the information gained from each such experiment. The rigorously defined information theoretic approaches presented here can be used in the future to formulate systematic strategies for experiment design pertaining to studies of gene regulatory mechanisms. 
In-depth studies of gene regulatory mechanisms employ a variety of experimental approaches such as identifying a gene’s enhancer(s) and testing its variants through reporter assays, followed by transcription factor mis-expression or knockouts, site mutagenesis, etc. The biologist is often faced with the challenging problem of selecting the ideal next experiment to perform so that its results provide novel mechanistic insights, and has to rely on their intuition about what is currently known on the topic and which experiments may add to that knowledge. We seek to make this intuition-based process more systematic, by borrowing ideas from the mature statistical field of experiment design. Towards this goal, we use the language of mathematical models to formally describe what is known about a gene’s regulatory mechanisms, and how an experiment’s results enhance that knowledge. We use information theoretic ideas to assign a ‘value’ to an experiment as well as explain objectively what is learned from that experiment. We demonstrate use of this novel approach on two extensively studied developmental genes in fruitfly. We expect our work to lead to systematic strategies for selecting the most informative experiments in a study of gene regulation.
8
Jacobs J, Atkins M, Davie K, Imrichova H, Romanelli L, Christiaens V, Hulselmans G, Potier D, Wouters J, Taskiran II, Paciello G, González-Blas CB, Koldere D, Aibar S, Halder G, Aerts S. The transcription factor Grainy head primes epithelial enhancers for spatiotemporal activation by displacing nucleosomes. Nat Genet 2018; 50:1011-1020. [PMID: 29867222 PMCID: PMC6031307 DOI: 10.1038/s41588-018-0140-x] [Citation(s) in RCA: 89] [Impact Index Per Article: 14.8] [Received: 07/26/2017] [Accepted: 04/06/2018] [Indexed: 12/21/2022]
Abstract
Transcriptional enhancers function as docking platforms for combinations of transcription factors (TFs) to control gene expression. How enhancer sequences determine nucleosome occupancy, TF recruitment and transcriptional activation in vivo remains unclear. Using ATAC-seq across a panel of Drosophila inbred strains, we found that SNPs affecting binding sites of the TF Grainy head (Grh) causally determine the accessibility of epithelial enhancers. We show that deletion and ectopic expression of Grh cause loss and gain of DNA accessibility, respectively. However, although Grh binding is necessary for enhancer accessibility, it is insufficient to activate enhancers. Finally, we show that human Grh homologs (GRHL1, GRHL2 and GRHL3) function similarly. We conclude that Grh binding is necessary and sufficient for the opening of epithelial enhancers but not for their activation. Our data support a model positing that complex spatiotemporal expression patterns are controlled by regulatory hierarchies in which pioneer factors, such as Grh, establish tissue-specific accessible chromatin landscapes upon which other factors can act.
Affiliation(s)
- Jelle Jacobs
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Mardelle Atkins
  - VIB Center for Cancer Biology, Leuven, Belgium
  - KU Leuven, Department of Oncology, Leuven, Belgium
- Kristofer Davie
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Hana Imrichova
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Lucia Romanelli
  - VIB Center for Cancer Biology, Leuven, Belgium
  - KU Leuven, Department of Oncology, Leuven, Belgium
- Valerie Christiaens
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Gert Hulselmans
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Delphine Potier
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Jasper Wouters
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Giulia Paciello
  - Politecnico di Torino, Automatics and Informatics, Turin, Italy
- Carmen B González-Blas
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Duygu Koldere
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Sara Aibar
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
- Georg Halder
  - VIB Center for Cancer Biology, Leuven, Belgium
  - KU Leuven, Department of Oncology, Leuven, Belgium
- Stein Aerts
  - VIB Center for Brain and Disease Research, Laboratory of Computational Biology, Leuven, Belgium
  - KU Leuven, Department of Human Genetics, Leuven, Belgium
9
Schwessinger R, Suciu MC, McGowan SJ, Telenius J, Taylor S, Higgs DR, Hughes JR. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res 2017; 27:1730-1742. [PMID: 28904015 PMCID: PMC5630036 DOI: 10.1101/gr.220202.117] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 01/04/2017] [Accepted: 08/07/2017] [Indexed: 12/22/2022]
Abstract
In the era of genome-wide association studies (GWAS) and personalized medicine, predicting the impact of single nucleotide polymorphisms (SNPs) in regulatory elements is an important goal. Current approaches to determine the potential of regulatory SNPs depend on inadequate knowledge of cell-specific DNA binding motifs. Here, we present Sasquatch, a new computational approach that uses DNase footprint data to estimate and visualize the effects of noncoding variants on transcription factor binding. Sasquatch performs a comprehensive k-mer-based analysis of DNase footprints to determine any k-mer's potential for protein binding in a specific cell type and how this may be changed by sequence variants. Therefore, Sasquatch uses an unbiased approach, independent of known transcription factor binding sites and motifs. Sasquatch only requires a single DNase-seq data set per cell type, from any genotype, and produces consistent predictions from data generated by different experimental procedures and at different sequence depths. Here we demonstrate the effectiveness of Sasquatch using previously validated functional SNPs and benchmark its performance against existing approaches. Sasquatch is available as a versatile webtool incorporating publicly available data, including the human ENCODE collection. Thus, Sasquatch provides a powerful tool and repository for prioritizing likely regulatory SNPs in the noncoding genome.
Affiliation(s)
- Ron Schwessinger
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Maria C Suciu
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Simon J McGowan
  - Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Jelena Telenius
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Stephen Taylor
  - Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Doug R Higgs
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
- Jim R Hughes
  - MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
10
Kalender Atak Z, Imrichova H, Svetlichnyy D, Hulselmans G, Christiaens V, Reumers J, Ceulemans H, Aerts S. Identification of cis-regulatory mutations generating de novo edges in personalized cancer gene regulatory networks. Genome Med 2017; 9:80. [PMID: 28854983 PMCID: PMC5575942 DOI: 10.1186/s13073-017-0464-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 04/24/2017] [Accepted: 08/02/2017] [Indexed: 01/05/2023] Open
Abstract
The identification of functional non-coding mutations is a key challenge in the field of genomics. Here we introduce μ-cisTarget to filter, annotate and prioritize cis-regulatory mutations based on their putative effect on the underlying "personal" gene regulatory network. We validated μ-cisTarget by re-analyzing the TAL1 and LMO1 enhancer mutations in T-ALL, and the TERT promoter mutation in melanoma. Next, we re-sequenced the full genomes of ten cancer cell lines and used matched transcriptome data and motif discovery to identify master regulators with de novo binding sites that result in the up-regulation of nearby oncogenic drivers. μ-cisTarget is available from http://mucistarget.aertslab.org.
Affiliation(s)
- Zeynep Kalender Atak
  - Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Hana Imrichova
  - Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Dmitry Svetlichnyy
  - Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Gert Hulselmans
  - Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Valerie Christiaens
  - Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Joke Reumers
  - Discovery Sciences, Janssen Research & Development, Turnhoutseweg 30, 2340, Beerse, Belgium
- Hugo Ceulemans
  - Discovery Sciences, Janssen Research & Development, Turnhoutseweg 30, 2340, Beerse, Belgium
- Stein Aerts
  - Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
11
Sharma AK, Jaiswal SK, Chaudhary N, Sharma VK. A novel approach for the prediction of species-specific biotransformation of xenobiotic/drug molecules by the human gut microbiota. Sci Rep 2017; 7:9751. [PMID: 28852076 PMCID: PMC5575299 DOI: 10.1038/s41598-017-10203-6] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 04/05/2017] [Accepted: 08/07/2017] [Indexed: 11/09/2022] Open
Abstract
The human gut microbiota is constituted of a diverse group of microbial species harbouring an enormous metabolic potential, which can alter the metabolism of orally administered drugs leading to individual/population-specific differences in drug responses. Considering the large heterogeneous pool of human gut bacteria and their metabolic enzymes, investigation of species-specific contribution to xenobiotic/drug metabolism by experimental studies is a challenging task. Therefore, we have developed a novel computational approach to predict the metabolic enzymes and gut bacterial species, which can potentially carry out the biotransformation of a xenobiotic/drug molecule. A substrate database was constructed for metabolic enzymes from 491 available human gut bacteria. The structural properties (fingerprints) from these substrates were extracted and used for the development of random forest models, which displayed average accuracies of up to 98.61% and 93.25% on cross-validation and blind set, respectively. After the prediction of EC subclass, the specific metabolic enzyme (EC) is identified using a molecular similarity search. The performance was further evaluated on an independent set of FDA-approved drugs and other clinically important molecules. To our knowledge, this is the only available approach implemented as 'DrugBug' tool for the prediction of xenobiotic/drug metabolism by metabolic enzymes of human gut microbiota.
Affiliation(s)
- Ashok K Sharma
  - Metagenomics and Systems Biology Laboratory, Indian Institute of Science Education and Research, Bhopal, Madhya Pradesh, India
- Shubham K Jaiswal
  - Metagenomics and Systems Biology Laboratory, Indian Institute of Science Education and Research, Bhopal, Madhya Pradesh, India
- Nikhil Chaudhary
  - Metagenomics and Systems Biology Laboratory, Indian Institute of Science Education and Research, Bhopal, Madhya Pradesh, India
- Vineet K Sharma
  - Metagenomics and Systems Biology Laboratory, Indian Institute of Science Education and Research, Bhopal, Madhya Pradesh, India
12
Lowdon RF, Wang T. Epigenomic annotation of noncoding mutations identifies mutated pathways in primary liver cancer. PLoS One 2017; 12:e0174032. [PMID: 28333948 PMCID: PMC5363827 DOI: 10.1371/journal.pone.0174032] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/18/2016] [Accepted: 03/02/2017] [Indexed: 11/19/2022] Open
Abstract
Evidence that noncoding mutation can result in cancer driver events is mounting. However, it is more difficult to assign molecular biological consequences to noncoding mutations than to coding mutations, and a typical cancer genome contains many more noncoding mutations than protein-coding mutations. Accordingly, parsing functional noncoding mutation signal from noise remains an important challenge. Here we use an empirical approach to identify putatively functional noncoding somatic single nucleotide variants (SNVs) from liver cancer genomes. Annotation of candidate variants by publicly available epigenome datasets finds that 40.5% of SNVs fall in regulatory elements. When assigned to specific regulatory elements, we find that the distribution of regulatory element mutation mirrors that of nonsynonymous coding mutation, where few regulatory elements are recurrently mutated in a patient population but many are singly mutated. We find potential gain-of-binding site events among candidate SNVs, suggesting a mechanism of action for these variants. When aggregating noncoding somatic mutation in promoters, we find that genes in the ERBB signaling and MAPK signaling pathways are significantly enriched for promoter mutations. Altogether, our results suggest that functional somatic SNVs in cancer are sporadic, but occasionally occur in regulatory elements and may affect phenotype by creating binding sites for transcriptional regulators. Accordingly, we propose that noncoding mutation should be formally accounted for when determining gene- and pathway-mutation burden in cancer.
Affiliation(s)
- Rebecca F. Lowdon: Center for Genome Sciences and Systems Biology, Department of Genetics, Washington University in St. Louis, Saint Louis, Missouri, United States of America
- Ting Wang: Center for Genome Sciences and Systems Biology, Department of Genetics, Washington University in St. Louis, Saint Louis, Missouri, United States of America
13
Pan X, Shen HB. RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinformatics 2017; 18:136. [PMID: 28245811 PMCID: PMC5331642 DOI: 10.1186/s12859-017-1561-8] [Citation(s) in RCA: 110] [Impact Index Per Article: 15.7] [Received: 11/15/2016] [Accepted: 02/23/2017] [Indexed: 01/08/2023] Open
Abstract
Background: RNAs play key roles in cells through their interactions with RNA-binding proteins (RBPs), and their binding motifs are crucial for understanding the post-transcriptional regulation of RNAs. How RBPs correctly recognize their target RNAs and why they bind specific positions is still far from clear. Machine learning-based algorithms are widely acknowledged to be capable of speeding up this process. Although many automatic tools have been developed to predict RNA-protein binding sites from rapidly growing multi-resource data (e.g. sequence and structure), the domain-specific features and formats of these data pose significant computational challenges. One current difficulty is that the knowledge shared across sources lies at a higher abstraction level than the observed data, so direct integration of observed data across domains is inefficient. The other difficulty is how to interpret the prediction results. Existing approaches tend to stop after outputting potential discrete binding sites on the sequences, but how to assemble these into meaningful binding motifs is a topic worthy of further investigation.
Results: In view of these challenges, we propose a deep learning-based framework (iDeep) that uses a novel hybrid of a convolutional neural network and a deep belief network to predict RBP interaction sites and motifs on RNAs. This protocol transforms the original observed data into a high-level abstraction feature space using multiple layers of learning blocks, where the shared representations across different domains are integrated. To validate iDeep, we performed experiments on 31 large-scale CLIP-seq datasets. The results show that integrating multiple sources of data improves the average AUC by 8% compared to the best single-source-based predictor, and that, through cross-domain knowledge integration at an abstraction level, iDeep outperforms the state-of-the-art predictors by 6%. Besides the overall enhanced prediction performance, the convolutional neural network module embedded in iDeep is also able to automatically capture interpretable binding motifs for RBPs. Large-scale experiments demonstrate that these mined binding motifs agree well with experimentally verified results, suggesting that iDeep is a promising approach for real-world applications.
Conclusion: The iDeep framework not only achieves better performance than the state-of-the-art predictors but also readily captures interpretable binding motifs. iDeep is available at http://www.csbio.sjtu.edu.cn/bioinf/iDeep
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1561-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xiaoyong Pan: Department of Veterinary Clinical and Animal Sciences, University of Copenhagen, Copenhagen, Denmark
- Hong-Bin Shen: Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
14
Wouters J, Kalender Atak Z, Aerts S. Decoding transcriptional states in cancer. Curr Opin Genet Dev 2017; 43:82-92. [PMID: 28129557 DOI: 10.1016/j.gde.2017.01.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 09/09/2016] [Revised: 01/05/2017] [Accepted: 01/09/2017] [Indexed: 12/27/2022]
Abstract
Gene regulatory networks determine cellular identity. In cancer, aberrations of gene networks are caused by driver mutations that often affect transcription factors and chromatin modifiers. Nevertheless, gene transcription in cancer follows the same cis-regulatory rules as normal cells, and cancer cells have served as convenient model systems to study transcriptional regulation. Tumours often show regulatory heterogeneity, with subpopulations of cells in different transcriptional states, which has important therapeutic implications. Here, we review recent experimental and computational techniques to reverse engineer cancer gene networks using transcriptome and epigenome data. New algorithms, data integration strategies, and increasing amounts of single cell genomics data provide exciting opportunities to model dynamic regulatory states at unprecedented resolution.
Affiliation(s)
- Jasper Wouters: Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium; Department of Human Genetics, KU Leuven (University of Leuven), Leuven, Belgium
- Zeynep Kalender Atak: Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium; Department of Human Genetics, KU Leuven (University of Leuven), Leuven, Belgium
- Stein Aerts: Laboratory of Computational Biology, VIB Center for Brain & Disease Research, Leuven, Belgium; Department of Human Genetics, KU Leuven (University of Leuven), Leuven, Belgium
15
Yang W, Bang H, Jang K, Sung MK, Choi JK. Predicting the recurrence of noncoding regulatory mutations in cancer. BMC Bioinformatics 2016; 17:492. [PMID: 27912731 PMCID: PMC5135808 DOI: 10.1186/s12859-016-1385-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 06/24/2016] [Accepted: 11/26/2016] [Indexed: 11/25/2022] Open
Abstract
Background: One of the greatest challenges in cancer genomics is to distinguish driver mutations from passenger mutations. Whereas recurrence is a hallmark of driver mutations, it is difficult to observe recurring noncoding mutations owing to the limited number of whole-genome sequenced samples. Hence, a method is needed to predict potentially recurrent mutations.
Results: In this work, we developed a random forest classifier that predicts regulatory mutations that may recur, based on the features of the mutations repeatedly appearing in a given cohort. With breast cancer as a model, we profiled 35 quantitative features describing genetic and epigenetic signals at the mutation site, transcription factors whose binding motif was disrupted by the mutation, and genes targeted by long-range chromatin interactions. A true set of mutations for machine learning was generated by interrogating publicly available pan-cancer genomes based on our statistical model of mutation recurrence. The performance of our random forest classifier was evaluated by cross-validation, and the variable importance of each feature in the classification of mutations was investigated. Our statistical recurrence model for the random forest classifier showed an area under the curve (AUC) of ~0.78 in predicting recurrent mutations. Chromatin accessibility at the mutation sites, the distance from the mutations to known cancer risk loci, and the role of the target genes in the regulatory or protein interaction network were among the most important variables.
Conclusions: Our methods make it possible to characterize recurrent regulatory mutations using a limited number of whole-genome samples and, based on that characterization, to predict potential driver mutations whose recurrence is not found in the given samples but is likely to be observed with additional samples.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1385-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Woojin Yang: Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea
- Hyoeun Bang: Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea
- Kiwon Jang: Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea
- Min Kyung Sung: Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea
- Jung Kyoon Choi: Department of Bio and Brain Engineering, KAIST, Daejeon, Republic of Korea
16
Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, Beer MA. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics 2016; 32:2205-7. [PMID: 27153639 DOI: 10.1093/bioinformatics/btw203] [Citation(s) in RCA: 101] [Impact Index Per Article: 12.6] [Received: 12/10/2015] [Accepted: 04/10/2016] [Indexed: 11/12/2022] Open
Abstract
We present a new R package for training gapped-kmer SVM classifiers for DNA and protein sequences. We describe an improved algorithm for kernel matrix calculation that speeds up run time by about 2- to 5-fold over our original gkmSVM algorithm. This package supports several sequence kernels, including gkmSVM, kmer-SVM, mismatch kernel and wildcard kernel. Availability and implementation: The gkmSVM package is freely available through the Comprehensive R Archive Network (CRAN) for Linux, Mac OS and Windows platforms. The C++ implementation is available at www.beerlab.org/gkmsvm. Contact: mghandi@gmail.com or mbeer@jhu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
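To make the idea of a gapped k-mer feature concrete, the following minimal Python sketch counts gapped k-mers by naive enumeration and compares two sequences with an unnormalized inner product. It is an illustration only: the package described above is written in R and C++ and uses an improved kernel matrix algorithm, and the function names, the `-` gap symbol, and the default `l` and `k` values here are assumptions, not the gkmSVM API.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq: str, l: int = 4, k: int = 2) -> Counter:
    """Count gapped k-mers: every length-l window with k kept positions, the rest gapped."""
    counts: Counter = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        # Each choice of k informative positions out of l yields one feature.
        for keep in combinations(range(l), k):
            feature = "".join(window[i] if i in keep else "-" for i in range(l))
            counts[feature] += 1
    return counts


def gkm_similarity(a: str, b: str, l: int = 4, k: int = 2) -> int:
    """Unnormalized inner product of the two gapped k-mer count vectors."""
    counts_a = gapped_kmer_counts(a, l, k)
    counts_b = gapped_kmer_counts(b, l, k)
    return sum(counts_a[f] * counts_b[f] for f in counts_a.keys() & counts_b.keys())


if __name__ == "__main__":
    print(sorted(gapped_kmer_counts("ACGT")))
    print(gkm_similarity("ACGT", "ACGA"))
```

Explicit enumeration like this grows quickly with `l` and `k`, which is one reason a dedicated kernel algorithm matters for real data sets.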
Affiliation(s)
- Mahmoud Ghandi: The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Morteza Mohammad-Noori: School of Mathematics, Statistics, and Computer Science, College of Science, University of Tehran, Tehran, Iran
- Narges Ghareghani: Department of Engineering Science, College of Engineering, University of Tehran, and Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Dongwon Lee: McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- Levi Garraway: The Broad Institute of MIT and Harvard, Cambridge, MA, USA; Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Michael A Beer: McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA