1
Sundaram L, Kumar A, Zatzman M, Salcedo A, Ravindra N, Shams S, Louie BH, Bagdatli ST, Myers MA, Sarmashghi S, Choi HY, Choi WY, Yost KE, Zhao Y, Granja JM, Hinoue T, Hayes DN, Cherniack A, Felau I, Choudhry H, Zenklusen JC, Farh KKH, McPherson A, Curtis C, Laird PW, Demchok JA, Yang L, Tarnuzzer R, Caesar-Johnson SJ, Wang Z, Doane AS, Khurana E, Castro MAA, Lazar AJ, Broom BM, Weinstein JN, Akbani R, Kumar SV, Raphael BJ, Wong CK, Stuart JM, Safavi R, Benz CC, Johnson BK, Kyi C, Shen H, Corces MR, Chang HY, Greenleaf WJ. Single-cell chromatin accessibility reveals malignant regulatory programs in primary human cancers. Science 2024; 385:eadk9217. [PMID: 39236169 DOI: 10.1126/science.adk9217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 07/03/2024] [Indexed: 09/07/2024]
Abstract
To identify cancer-associated gene regulatory changes, we generated single-cell chromatin accessibility landscapes across eight tumor types as part of The Cancer Genome Atlas. Tumor chromatin accessibility is strongly influenced by copy number alterations that can be used to identify subclones, yet underlying cis-regulatory landscapes retain cancer type-specific features. Using organ-matched healthy tissues, we identified the "nearest healthy" cell types in diverse cancers, demonstrating that the chromatin signature of basal-like-subtype breast cancer is most similar to secretory-type luminal epithelial cells. Neural network models trained to learn regulatory programs in cancer revealed enrichment of model-prioritized somatic noncoding mutations near cancer-associated genes, suggesting that dispersed, nonrecurrent, noncoding mutations in cancer are functional. Overall, these data and interpretable gene regulatory models for cancer and healthy tissue provide a framework for understanding cancer-specific gene regulation.
Affiliation(s)
- Laksshman Sundaram
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Illumina AI laboratory, Illumina Inc, Foster City, CA, USA
  - NVIDIA Bio Research, NVIDIA, Santa Clara, CA, USA
- Arvind Kumar
  - Illumina AI laboratory, Illumina Inc, Foster City, CA, USA
- Matthew Zatzman
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Neal Ravindra
  - Illumina AI laboratory, Illumina Inc, Foster City, CA, USA
- Shadi Shams
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Bryan H Louie
  - Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- S Tansu Bagdatli
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Matthew A Myers
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Hyo Young Choi
  - Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, TN, USA
  - Department of Medicine, University of Tennessee Health Science Center, Memphis, TN, USA
- Won-Young Choi
  - UTHSC Center for Cancer Research, University of Tennessee Health Science Center, Memphis, TN, USA
- Kathryn E Yost
  - Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Yanding Zhao
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Jeffrey M Granja
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Toshinori Hinoue
  - Center for Epigenetics, Van Andel Institute, Grand Rapids, MI 49503, USA
- D Neil Hayes
  - Department of Preventive Medicine, University of Tennessee Health Science Center, Memphis, TN, USA
  - Department of Medicine, University of Tennessee Health Science Center, Memphis, TN, USA
  - UTHSC Center for Cancer Research, University of Tennessee Health Science Center, Memphis, TN, USA
- Ina Felau
  - National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Hani Choudhry
  - Department of Biochemistry, Faculty of Science, Cancer and Mutagenesis Unit, King Fahd Center for Medical Research, King Abdulaziz University, Jeddah, Saudi Arabia
- Jean C Zenklusen
  - National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Andrew McPherson
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Christina Curtis
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
  - Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Peter W Laird
  - Center for Epigenetics, Van Andel Institute, Grand Rapids, MI 49503, USA
- John A Demchok
  - Center for Cancer Genomics, National Cancer Institute, Bethesda, MD 20892, USA
- Liming Yang
  - Center for Cancer Genomics, National Cancer Institute, Bethesda, MD 20892, USA
- Roy Tarnuzzer
  - Center for Cancer Genomics, National Cancer Institute, Bethesda, MD 20892, USA
- Zhining Wang
  - Center for Biomedical Informatics and Information Technology, National Cancer Institute, NIH, 9609 Medical Center Drive, Rockville, MD 20850, USA
- Ashley S Doane
  - Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065, USA
- Ekta Khurana
  - Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY 10065, USA
  - Sandra and Edward Meyer Cancer Center, Weill Cornell Medicine, New York, NY 10065, USA
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Mauro A A Castro
  - Bioinformatics and Systems Biology Laboratory, Federal University of Paraná, Curitiba 81520-260, Brazil
- Alexander J Lazar
  - Departments of Pathology & Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Bradley M Broom
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- John N Weinstein
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
  - Department of Systems Biology, University of Texas MD Anderson Cancer Center, Houston, TX 77030
- Rehan Akbani
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Shwetha V Kumar
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Benjamin J Raphael
  - Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540
- Christopher K Wong
  - Biomolecular Engineering Department, School of Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Joshua M Stuart
  - Biomolecular Engineering Department, School of Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Rojin Safavi
  - Biomolecular Engineering Department, School of Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
- Benjamin K Johnson
  - Center for Epigenetics, Van Andel Institute, Grand Rapids, MI 49503, USA
- Cindy Kyi
  - Center for Cancer Genomics, National Cancer Institute, Bethesda, MD 20892, USA
- Hui Shen
  - Center for Epigenetics, Van Andel Institute, Grand Rapids, MI 49503, USA
- M Ryan Corces
  - Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
  - Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA
  - Gladstone Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA
  - Department of Neurology, University of California San Francisco, San Francisco, CA, USA
- Howard Y Chang
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Dermatology, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
  - Howard Hughes Medical Institute, Stanford University, School of Medicine, Stanford, CA, USA
- William J Greenleaf
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
  - Department of Applied Physics, Stanford University, Stanford, CA, USA
2
Raditsa V, Tsukanov A, Bogomolov A, Levitsky V. Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data. NAR Genom Bioinform 2024; 6:lqae090. [PMID: 39071850 PMCID: PMC11282361 DOI: 10.1093/nargab/lqae090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2024] [Revised: 06/03/2024] [Accepted: 07/19/2024] [Indexed: 07/30/2024] Open
Abstract
Efficient de novo motif discovery from the results of genome-wide mapping of transcription factor binding sites (ChIP-seq) is dependent on the choice of background nucleotide sequences. The foreground sequences (ChIP-seq peaks) represent not only specific motifs of target transcription factors, but also the motifs overrepresented throughout the genome, such as simple sequence repeats. We performed a massive comparison of the 'synthetic' and 'genomic' approaches to generate background sequences for de novo motif discovery. The 'synthetic' approach shuffled nucleotides in peaks, while the 'genomic' approach selected sequences from the reference genome randomly or only from gene promoters according to the fraction of A/T nucleotides in each sequence. We compiled the benchmark collections of ChIP-seq datasets for mouse, human and Arabidopsis, and performed de novo motif discovery. We showed that the genomic approach provides both more robust detection of the known motifs of target transcription factors and more stringent exclusion of the simple sequence repeats as possible non-specific motifs. The advantage of the genomic approach over the synthetic approach was greater in plants compared to mammals. We developed the AntiNoise web service (https://denovosea.icgbio.ru/antinoise/) that implements a genomic approach to extract genomic background sequences for twelve eukaryotic genomes.
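The two background strategies described in this abstract can be illustrated with a minimal Python sketch. It is not the AntiNoise implementation: the function names, the toy sequences, and the `tolerance` parameter are assumptions made for illustration, and the candidate list stands in for sequences that would be drawn from a reference genome or from promoters.

```python
import random

def at_fraction(seq):
    """Fraction of A/T nucleotides in a sequence."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)

def synthetic_background(peak, rng):
    """'Synthetic' approach: shuffle the nucleotides of the peak itself."""
    bases = list(peak)
    rng.shuffle(bases)
    return "".join(bases)

def genomic_background(peak, candidates, tolerance=0.05):
    """'Genomic' approach: keep candidate genomic sequences whose A/T
    fraction lies within `tolerance` of the peak's (tolerance is assumed)."""
    target = at_fraction(peak)
    return [c for c in candidates if abs(at_fraction(c) - target) <= tolerance]

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
peak = "ATATGCATAT"     # toy peak with A/T fraction 0.8
candidates = ["GCGCGCGCGC", "ATATATGCAT", "TTTTAAAAGC", "ACGTACGTAC"]

shuffled = synthetic_background(peak, rng)      # same composition, new order
matched = genomic_background(peak, candidates)  # A/T-matched genomic picks
```

Shuffling preserves the nucleotide composition of the peak but produces sequences that never occurred in a genome, whereas the A/T-matched selection keeps real genomic context, including genome-wide overrepresented patterns such as simple sequence repeats, which is the property the abstract credits for the stricter exclusion of non-specific motifs.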
Affiliation(s)
- Vladimir V Raditsa
  - Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton V Tsukanov
  - Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton G Bogomolov
  - Department of Cell Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Victor G Levitsky
  - Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
  - Department of Natural Science, Novosibirsk State University, Novosibirsk 630090, Russia
3
Ramamurthy E, Agarwal S, Toong N, Sestili H, Kaplow IM, Chen Z, Phan B, Pfenning AR. Regression convolutional neural network models implicate peripheral immune regulatory variants in the predisposition to Alzheimer's disease. PLoS Comput Biol 2024; 20:e1012356. [PMID: 39186798 PMCID: PMC11389932 DOI: 10.1371/journal.pcbi.1012356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2023] [Revised: 09/11/2024] [Accepted: 07/23/2024] [Indexed: 08/28/2024] Open
Abstract
Alzheimer's disease (AD) involves aggregation of amyloid β and tau, neuron loss, cognitive decline, and neuroinflammatory responses. Both resident microglia and peripheral immune cells have been associated with the immune component of AD. However, the relative contribution of resident and peripheral immune cell types to AD predisposition has not been thoroughly explored due to their similarity in gene expression and function. To study the effects of AD-associated variants on cis-regulatory elements, we train convolutional neural network (CNN) regression models that link genome sequence to cell type-specific levels of open chromatin, a proxy for regulatory element activity. We then use in silico mutagenesis of regulatory sequences to predict the relative impact of candidate variants across these cell types. We develop and apply criteria for evaluating our models and refine our models using massively parallel reporter assay (MPRA) data. Our models identify multiple AD-associated variants with a greater predicted impact in peripheral cells relative to microglia or neurons. Our results support their use as models to study the effects of AD-associated variants and even suggest that peripheral immune cells themselves may mediate a component of AD predisposition. We make our library of CNN models and predictions available as a resource for the community to study immune and neurological disorders.
Affiliation(s)
- Easwaran Ramamurthy
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Snigdha Agarwal
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Noelle Toong
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Heather Sestili
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Irene M Kaplow
  - Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Ziheng Chen
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- BaDoi Phan
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Andreas R Pfenning
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
4
Romero R, Menichelli C, Vroland C, Marin JM, Lèbre S, Lecellier CH, Bréhélin L. TFscope: systematic analysis of the sequence features involved in the binding preferences of transcription factors. Genome Biol 2024; 25:187. [PMID: 38987807 DOI: 10.1186/s13059-024-03321-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 06/24/2024] [Indexed: 07/12/2024] Open
Abstract
Characterizing the binding preferences of transcription factors (TFs) in different cell types and conditions is key to understanding how they orchestrate gene expression. Here, we develop TFscope, a machine learning approach that identifies sequence features explaining the binding differences observed between two ChIP-seq experiments targeting either the same TF in two conditions or two TFs with similar motifs (paralogous TFs). TFscope systematically investigates differences in the core motif, nucleotide environment and co-factor motifs, and provides the contribution of each key feature in the two experiments. TFscope was applied to >305 ChIP-seq pairs, and several examples are discussed.
Affiliation(s)
- Raphaël Romero
  - LIRMM, Univ Montpellier, CNRS, Montpellier, France
  - IMAG, Univ Montpellier, CNRS, Montpellier, France
- Christophe Vroland
  - LIRMM, Univ Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- Sophie Lèbre
  - IMAG, Univ Montpellier, CNRS, Montpellier, France
  - AMIS, Université Paul-Valéry-Montpellier 3, Montpellier, France
- Charles-Henri Lecellier
  - LIRMM, Univ Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
5
Ang DA, Carter JM, Deka K, Tan JHL, Zhou J, Chen Q, Chng WJ, Harmston N, Li Y. Aberrant non-canonical NF-κB signalling reprograms the epigenome landscape to drive oncogenic transcriptomes in multiple myeloma. Nat Commun 2024; 15:2513. [PMID: 38514625 PMCID: PMC10957915 DOI: 10.1038/s41467-024-46728-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 03/07/2024] [Indexed: 03/23/2024] Open
Abstract
In multiple myeloma, abnormal plasma cells establish oncogenic niches within the bone marrow by engaging the NF-κB pathway to nurture their survival while they accumulate pro-proliferative mutations. Under these conditions, many cases eventually develop genetic abnormalities endowing them with constitutive NF-κB activation. Here, we find that sustained NF-κB/p52 levels resulting from such mutations favours the recruitment of enhancers beyond the normal B-cell repertoire. Furthermore, through targeted disruption of p52, we characterise how such enhancers are complicit in the formation of super-enhancers and the establishment of cis-regulatory interactions with myeloma dependencies during constitutive activation of p52. Finally, we functionally validate the pathological impact of these cis-regulatory modules on cell and tumour phenotypes using in vitro and in vivo models, confirming RGS1 as a p52-dependent myeloma driver. We conclude that the divergent epigenomic reprogramming enforced by aberrant non-canonical NF-κB signalling potentiates transcriptional programs beneficial for multiple myeloma progression.
Affiliation(s)
- Daniel A Ang
  - School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore, 637551, Singapore
- Jean-Michel Carter
  - School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore, 637551, Singapore
- Kamalakshi Deka
  - School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore, 637551, Singapore
- Joel H L Tan
  - Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, Proteos, Singapore, 138673, Singapore
- Jianbiao Zhou
  - Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Drive, Centre for Translational Medicine, Singapore, 117599, Republic of Singapore
  - Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117597, Republic of Singapore
  - NUS Centre for Cancer Research, 14 Medical Drive, Centre for Translational Medicine, Singapore, 117599, Singapore
- Qingfeng Chen
  - Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, Proteos, Singapore, 138673, Singapore
- Wee Joo Chng
  - Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Drive, Centre for Translational Medicine, Singapore, 117599, Republic of Singapore
  - Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117597, Republic of Singapore
  - NUS Centre for Cancer Research, 14 Medical Drive, Centre for Translational Medicine, Singapore, 117599, Singapore
  - Department of Hematology-Oncology, National University Cancer Institute of Singapore (NCIS), The National University Health System (NUHS), 1E, Kent Ridge Road, Singapore, 119228, Republic of Singapore
- Nathan Harmston
  - Division of Science, Yale-NUS College, Singapore, 138527, Singapore
  - Program in Cancer and Stem Cell Biology, Duke-NUS Medical School, Singapore, 169857, Singapore
  - Molecular Biosciences Division, Cardiff School of Biosciences, Cardiff University, Cardiff, CF10 3AX, UK
- Yinghui Li
  - School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, Singapore, 637551, Singapore
  - Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, Proteos, Singapore, 138673, Singapore
6
Vishnevsky OV, Bocharnikov AV, Ignatieva EV. Peak Scores Significantly Depend on the Relationships between Contextual Signals in ChIP-Seq Peaks. Int J Mol Sci 2024; 25:1011. [PMID: 38256085 PMCID: PMC10816497 DOI: 10.3390/ijms25021011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 12/13/2023] [Accepted: 01/09/2024] [Indexed: 01/24/2024] Open
Abstract
Chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq) is a central genome-wide method for in vivo analyses of DNA-protein interactions in various cellular conditions. Numerous studies have demonstrated the complex contextual organization of ChIP-seq peak sequences and the presence of binding sites for transcription factors in them. We assessed the dependence of the ChIP-seq peak score on the presence of different contextual signals in the peak sequences by analyzing these sequences from several ChIP-seq experiments using our fully enumerative GPU-based de novo motif discovery method, Argo_CUDA. Analysis revealed sets of significant IUPAC motifs corresponding to the binding sites of the target and partner transcription factors. For these ChIP-seq experiments, multiple regression models were constructed, demonstrating a significant dependence of the peak scores on the presence in the peak sequences of not only highly significant target motifs but also less significant motifs corresponding to the binding sites of the partner transcription factors. A significant correlation was shown between the presence of the target motifs FOXA2 and the partner motifs HNF4G, which found experimental confirmation in the scientific literature, demonstrating the important contribution of the partner transcription factors to the binding of the target transcription factor to DNA and, consequently, their important contribution to the peak score.
Affiliation(s)
- Oleg V. Vishnevsky
  - Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
  - Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
- Andrey V. Bocharnikov
  - Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
- Elena V. Ignatieva
  - Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
  - Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
7
Kaplow IM, Lawler AJ, Schäffer DE, Srinivasan C, Sestili HH, Wirthlin ME, Phan BN, Prasad K, Brown AR, Zhang X, Foley K, Genereux DP, Karlsson EK, Lindblad-Toh K, Meyer WK, Pfenning AR. Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science 2023; 380:eabm7993. [PMID: 37104615 PMCID: PMC10322212 DOI: 10.1126/science.abm7993] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 10/12/2021] [Accepted: 02/23/2023] [Indexed: 04/29/2023]
Abstract
Protein-coding differences between species often fail to explain phenotypic diversity, suggesting the involvement of genomic elements that regulate gene expression such as enhancers. Identifying associations between enhancers and phenotypes is challenging because enhancer activity can be tissue-dependent and functionally conserved despite low sequence conservation. We developed the Tissue-Aware Conservation Inference Toolkit (TACIT) to associate candidate enhancers with species' phenotypes using predictions from machine learning models trained on specific tissues. Applying TACIT to associate motor cortex and parvalbumin-positive interneuron enhancers with neurological phenotypes revealed dozens of enhancer-phenotype associations, including brain size-associated enhancers that interact with genes implicated in microcephaly or macrocephaly. TACIT provides a foundation for identifying enhancers associated with the evolution of any convergently evolved phenotype in any large group of species with aligned genomes.
Affiliation(s)
- Irene M. Kaplow
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Alyssa J. Lawler
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Daniel E. Schäffer
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Chaitanya Srinivasan
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Heather H. Sestili
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Morgan E. Wirthlin
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- BaDoi N. Phan
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Kavya Prasad
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Ashley R. Brown
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Xiaomeng Zhang
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Kathleen Foley
  - Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA
- Diane P. Genereux
  - Broad Institute, Cambridge, MA, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Elinor K. Karlsson
  - Broad Institute, Cambridge, MA, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Kerstin Lindblad-Toh
  - Broad Institute, Cambridge, MA, USA
  - Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Wynn K. Meyer
  - Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA
- Andreas R. Pfenning
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
8
Saha S, Spinelli L, Castro Mondragon JA, Kervadec A, Lynott M, Kremmer L, Roder L, Krifa S, Torres M, Brun C, Vogler G, Bodmer R, Colas AR, Ocorr K, Perrin L. Genetic architecture of natural variation of cardiac performance from flies to humans. eLife 2022; 11:82459. [DOI: 10.7554/elife.82459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2022] [Accepted: 10/25/2022] [Indexed: 11/17/2022] Open
Abstract
Deciphering the genetic architecture of human cardiac disorders is of fundamental importance but their underlying complexity is a major hurdle. We investigated the natural variation of cardiac performance in the sequenced inbred lines of the Drosophila Genetic Reference Panel (DGRP). Genome-wide association studies (GWAS) identified genetic networks associated with natural variation of cardiac traits, which were used to gain insights into the molecular and cellular processes affected. Non-coding variants that we identified were used to map potential regulatory non-coding regions, which in turn were employed to predict transcription factor (TF) binding sites. Cognate TFs, many of which themselves bear polymorphisms associated with variations of cardiac performance, were also validated by heart-specific knockdown. Additionally, we showed that the natural variations associated with variability in cardiac performance affect a set of genes overlapping those associated with average traits but through different variants in the same genes. Furthermore, we showed that phenotypic variability was also associated with natural variation of gene regulatory networks. More importantly, we documented correlations between genes associated with cardiac phenotypes in both flies and humans, which supports a conserved genetic architecture regulating adult cardiac function from arthropods to mammals. Specifically, roles for PAX9 and EGR2 in the regulation of the cardiac rhythm were established in both models, illustrating that the characteristics of natural variations in cardiac function identified in Drosophila can accelerate discovery in humans.
Affiliation(s)
- Saswati Saha
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
| | - Lionel Spinelli
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
| | | | - Anaïs Kervadec
- Development, Aging and Regeneration Program, Sanford Burnham Prebys Medical Discovery Institute
| | - Michaela Lynott
- Development, Aging and Regeneration Program, Sanford Burnham Prebys Medical Discovery Institute
| | - Laurent Kremmer
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
| | - Laurence Roder
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
| | - Sallouha Krifa
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
| | - Magali Torres
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
| | - Christine Brun
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
- CNRS
| | - Georg Vogler
- Development, Aging and Regeneration Program, Sanford Burnham Prebys Medical Discovery Institute
| | - Rolf Bodmer
- Development, Aging and Regeneration Program, Sanford Burnham Prebys Medical Discovery Institute
| | - Alexandre R Colas
- Development, Aging and Regeneration Program, Sanford Burnham Prebys Medical Discovery Institute
| | - Karen Ocorr
- Development, Aging and Regeneration Program, Sanford Burnham Prebys Medical Discovery Institute
| | - Laurent Perrin
- Aix-Marseille University, INSERM, TAGC, Turing Center for Living systems
- CNRS
|
9
|
Tsukanov AV, Mironova VV, Levitsky VG. Motif models proposing independent and interdependent impacts of nucleotides are related to high and low affinity transcription factor binding sites in Arabidopsis. Frontiers in Plant Science 2022; 13:938545. [PMID: 35968123 PMCID: PMC9373801 DOI: 10.3389/fpls.2022.938545] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/07/2022] [Accepted: 07/05/2022] [Indexed: 05/15/2023]
Abstract
Position weight matrix (PWM) is the traditional motif model representing transcription factor (TF) binding sites. It assumes that positions contribute independently to TF binding affinity, although this hypothesis does not fit the data perfectly. This explains why PWM hits are missing in a substantial fraction of ChIP-seq peaks. To study various modes of the direct binding of plant TFs, we compiled a benchmark collection of 111 ChIP-seq datasets for Arabidopsis thaliana and applied the traditional PWM and two alternative motif models, BaMM and SiteGA, which assume dependencies between positions. Varying the stringency of the recognition thresholds suggested that the hits of the PWM, BaMM, and SiteGA models are associated with sites of high/medium, any, and low affinity, respectively. At the medium recognition threshold, about 60% of ChIP-seq peaks contain PWM hits consisting of conserved core consensuses, while BaMM and SiteGA provide hits for an additional 15% of peaks in which a weaker core consensus is compensated through intra-motif dependencies. The presence/absence of these dependencies in the motifs of alternative/traditional models was confirmed by the dependency logo DepLogo, which visualizes the position-wise partitioning of the alignments of predicted sites. We exemplify the detailed analysis of ChIP-seq profiles for the plant TFs CCA1, MYC2, and SEP3. Gene ontology (GO) enrichment analysis revealed that, among the three motif models, SiteGA had the highest portion of genes with significantly enriched GO terms among all predicted genes. We showed that both alternative motif models extend the sites predicted by the traditional PWM more for the TFs MYC2/SEP3, which have condition/tissue-specific functions, than for the TF CCA1, which has housekeeping functions.
Overall, the combined application of standard and alternative motif models is beneficial to detect various modes of the direct TF-DNA interactions in the maximal portion of ChIP-seq loci.
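The independence assumption behind the PWM model can be illustrated with a minimal Python sketch. It is not the authors' pipeline and does not reproduce BaMM or SiteGA; the function names (`build_pwm`, `score`, `scan`), the uniform 0.25 background, the pseudocount, and the toy aligned sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for i in range(length):
        # Count each base at position i, starting from the pseudocount.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def score(pwm, window):
    """Sum per-position log-odds: every position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (position, score) for every window scoring at or above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score(pwm, sequence[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits


# Hypothetical aligned sites, used only to exercise the functions.
toy_sites = ["CACGTG", "CACGTG", "CACGTT", "CATGTG"]
toy_pwm = build_pwm(toy_sites)
toy_hits = scan(toy_pwm, "TTCACGTGAA", threshold=5.0)
```

Because the score is a plain sum over positions, a weak match at one position cannot be compensated by a favorable combination of bases elsewhere, which is the limitation that dependency-aware models are meant to address. Lowering `threshold` corresponds to a less stringent recognition threshold.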
Affiliation(s)
- Anton V. Tsukanov
- Department of Systems Biology, Institute of Cytology and Genetics, Novosibirsk, Russia
| | - Victoria V. Mironova
- Department of Systems Biology, Institute of Cytology and Genetics, Novosibirsk, Russia
- Department of Plant Systems Physiology, Radboud Institute for Biological and Environmental Sciences (RIBES), Radboud University, Nijmegen, Netherlands
| | - Victor G. Levitsky
- Department of Systems Biology, Institute of Cytology and Genetics, Novosibirsk, Russia
- Department of Natural Science, Novosibirsk State University, Novosibirsk, Russia
- *Correspondence: Victor G. Levitsky
|
10
|
Srinivasan C, Phan BN, Lawler AJ, Ramamurthy E, Kleyman M, Brown AR, Kaplow IM, Wirthlin ME, Pfenning AR. Addiction-Associated Genetic Variants Implicate Brain Cell Type- and Region-Specific Cis-Regulatory Elements in Addiction Neurobiology. J Neurosci 2021; 41:9008-9030. [PMID: 34462306 PMCID: PMC8549541 DOI: 10.1523/jneurosci.2534-20.2021] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 09/28/2020] [Revised: 06/18/2021] [Accepted: 07/10/2021] [Indexed: 12/14/2022]
Abstract
Recent large genome-wide association studies have identified multiple confident risk loci linked to addiction-associated behavioral traits. Most genetic variants linked to addiction-associated traits lie in noncoding regions of the genome, likely disrupting cis-regulatory element (CRE) function. CREs tend to be highly cell type-specific and may contribute to the functional development of the neural circuits underlying addiction. Yet, a systematic approach for predicting the impact of risk variants on the CREs of specific cell populations is lacking. To dissect the cell types and brain regions underlying addiction-associated traits, we applied stratified linkage disequilibrium score regression to compare genome-wide association studies to genomic regions collected from human and mouse assays for open chromatin, which is associated with CRE activity. We found enrichment of addiction-associated variants in putative CREs marked by open chromatin in neuronal (NeuN+) nuclei collected from multiple prefrontal cortical areas and striatal regions known to play major roles in reward and addiction. To further dissect the cell type-specific basis of addiction-associated traits, we also identified enrichments in human orthologs of open chromatin regions of female and male mouse neuronal subtypes: cortical excitatory, D1, D2, and PV. Last, we developed machine learning models to predict mouse cell type-specific open chromatin, enabling us to further categorize human NeuN+ open chromatin regions into cortical excitatory or striatal D1 and D2 neurons and predict the functional impact of addiction-associated genetic variants. 
Our results suggest that different neuronal subtypes within the reward system play distinct roles in the variety of traits that contribute to addiction. SIGNIFICANCE STATEMENT: We combine statistical genetic and machine learning techniques to find that the predisposition for nicotine, alcohol, and cannabis use behaviors can be partially explained by genetic variants in conserved regulatory elements within specific brain regions and neuronal subtypes of the reward system. Our computational framework can flexibly integrate open chromatin data across species to screen for putative causal variants in a cell type- and tissue-specific manner for numerous complex traits.
Affiliation(s)
- Chaitanya Srinivasan
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - BaDoi N Phan
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Medical Scientist Training Program, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15213
| | - Alyssa J Lawler
- Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Easwaran Ramamurthy
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Michael Kleyman
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Ashley R Brown
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Irene M Kaplow
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Morgan E Wirthlin
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Andreas R Pfenning
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
|
11
|
Novakovsky G, Saraswat M, Fornes O, Mostafavi S, Wasserman WW. Biologically relevant transfer learning improves transcription factor binding prediction. Genome Biol 2021; 22:280. [PMID: 34579793 PMCID: PMC8474956 DOI: 10.1186/s13059-021-02499-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/21/2020] [Accepted: 09/15/2021] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Deep learning has proven to be a powerful technique for transcription factor (TF) binding prediction but requires large training datasets. Transfer learning can reduce the amount of data required for deep learning, while improving overall model performance, compared to training a separate model for each new task. RESULTS: We assess a transfer learning strategy for TF binding prediction consisting of a pre-training step, wherein we train a multi-task model with multiple TFs, and a fine-tuning step, wherein we initialize single-task models for individual TFs with the weights learned by the multi-task model, after which the single-task models are trained at a lower learning rate. We corroborate that transfer learning improves model performance, especially if in the pre-training step the multi-task model is trained with biologically relevant TFs. We show the effectiveness of transfer learning for TFs with ~500 ChIP-seq peak regions. Using model interpretation techniques, we demonstrate that the features learned in the pre-training step are refined in the fine-tuning step to resemble the binding motif of the target TF (i.e., the recipient of transfer learning in the fine-tuning step). Moreover, pre-training with biologically relevant TFs allows single-task models in the fine-tuning step to learn useful features other than the motif of the target TF. CONCLUSIONS: Our results confirm that transfer learning is a powerful technique for TF binding prediction.
Affiliation(s)
- Gherman Novakovsky
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
| | - Manu Saraswat
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
| | - Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
| | - Sara Mostafavi
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
- Department of Statistics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Canadian Institute for Advanced Research, CIFAR AI Chair, and Child and Brain Development, Toronto, ON, M5G 1M1, Canada
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, Vancouver, BC, V5Z 4H4, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
|
12
|
Khan A, Riudavets Puig R, Boddie P, Mathelier A. BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences. Bioinformatics 2021; 37:1607-1609. [PMID: 33135764 PMCID: PMC8275979 DOI: 10.1093/bioinformatics/btaa928] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/17/2020] [Revised: 10/11/2020] [Accepted: 10/19/2020] [Indexed: 12/20/2022]
Abstract
Motivation: Accurate motif enrichment analyses depend on the choice of background DNA sequences used, which should ideally match the sequence composition of the foreground sequences. It is important to avoid false positive enrichment due to sequence biases in the genome, such as GC-bias. Therefore, relying on an appropriate set of background sequences is crucial for enrichment analysis. Results: We developed BiasAway, a command line tool and its dedicated easy-to-use web server to generate synthetic sequences matching any k-mer nucleotide composition or select genomic DNA sequences matching the mononucleotide composition of the foreground sequences through four different models. For genomic sequences, we provide precomputed partitions of genomes from nine species with five different bin sizes to generate appropriate genomic background sequences. Availability and implementation: BiasAway source code is freely available from Bitbucket (https://bitbucket.org/CBGR/biasaway) and can be easily installed using bioconda or pip. The web server is available at https://biasaway.uio.no and detailed documentation is available at https://biasaway.readthedocs.io. Supplementary information: Supplementary data are available at Bioinformatics online.
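The idea of a composition-matched synthetic background can be sketched in a few lines of Python. This is not BiasAway's code and does not use its command-line interface. It only shows the simplest case (k = 1), where shuffling each foreground sequence preserves its mononucleotide composition exactly. The function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def mononucleotide_shuffle(sequence, rng):
    """Permute the bases of one sequence; base counts are preserved exactly."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


def make_background(foreground, seed=0):
    """Return one shuffled background sequence per foreground sequence."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [mononucleotide_shuffle(seq, rng) for seq in foreground]


# Hypothetical foreground sequences, used only to exercise the functions.
toy_foreground = ["ACGTACGTGG", "GGGCCCAATT"]
toy_background = make_background(toy_foreground, seed=1)

# Each background sequence has the same base counts as its foreground partner.
composition_matches = all(
    Counter(bg) == Counter(fg) for bg, fg in zip(toy_background, toy_foreground)
)
```

A plain shuffle destroys higher-order structure such as dinucleotide frequencies, so matching k-mer composition for k > 1 needs a different procedure (for example, a k-mer-preserving shuffle), which this sketch does not attempt.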
Affiliation(s)
- Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349 Oslo, Norway; Stanford University School of Medicine, Stanford Cancer Institute, Stanford, CA 94304, USA
| | - Rafael Riudavets Puig
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349 Oslo, Norway
| | - Paul Boddie
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349 Oslo, Norway
| | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349 Oslo, Norway; Department of Medical Genetics, Oslo University Hospital, 0424 Oslo, Norway
|
13
|
Puig RR, Boddie P, Khan A, Castro-Mondragon JA, Mathelier A. UniBind: maps of high-confidence direct TF-DNA interactions across nine species. BMC Genomics 2021; 22:482. [PMID: 34174819 PMCID: PMC8236138 DOI: 10.1186/s12864-021-07760-6] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 04/29/2021] [Accepted: 05/27/2021] [Indexed: 12/17/2022]
Abstract
BACKGROUND: Transcription factors (TFs) bind specifically to TF binding sites (TFBSs) at cis-regulatory regions to control transcription. It is critical to locate these TF-DNA interactions to understand transcriptional regulation. Efforts to predict bona fide TFBSs benefit from the availability of experimental data mapping DNA binding regions of TFs (chromatin immunoprecipitation followed by sequencing, ChIP-seq). RESULTS: In this study, we processed ~10,000 public ChIP-seq datasets from nine species to provide high-quality TFBS predictions. After quality control, this culminated in the prediction of ~56 million TFBSs with experimental and computational support for direct TF-DNA interactions for 644 TFs in >1000 cell lines and tissues. These TFBSs were used to predict >197,000 cis-regulatory modules representing clusters of binding events in the corresponding genomes. The high quality of the TFBSs was reinforced by their evolutionary conservation, enrichment at active cis-regulatory regions, and capacity to predict combinatorial binding of TFs. Further, we confirmed that the cell type and tissue specificity of enhancer activity was correlated with the number of TFs with binding sites predicted in these regions. All the data are provided to the community through the UniBind database, which can be accessed through its web interface (https://unibind.uio.no/), a dedicated RESTful API, and as genomic tracks. Finally, we provide an enrichment tool, available as a web service and an R package, for users to find TFs with enriched TFBSs in a set of provided genomic regions. CONCLUSIONS: UniBind is the first resource of its kind, providing the largest collection of high-confidence direct TF-DNA interactions in nine species.
Affiliation(s)
- Rafael Riudavets Puig
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway
| | - Paul Boddie
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway
| | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | | | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway.
- Department of Medical Genetics, Oslo University Hospital, Oslo, 0424, Norway.
|
14
|
Fagny M, Kuijjer ML, Stam M, Joets J, Turc O, Rozière J, Pateyron S, Venon A, Vitte C. Identification of Key Tissue-Specific, Biological Processes by Integrating Enhancer Information in Maize Gene Regulatory Networks. Front Genet 2021; 11:606285. [PMID: 33505431 PMCID: PMC7834273 DOI: 10.3389/fgene.2020.606285] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/17/2020] [Accepted: 12/03/2020] [Indexed: 12/27/2022]
Abstract
Enhancers are key players in the spatio-temporal coordination of gene expression during numerous crucial processes, including tissue differentiation across development. Characterizing the transcription factors (TFs) and genes they connect, and the molecular functions underpinned is important to better characterize developmental processes. In plants, the recent molecular characterization of enhancers revealed their capacity to activate the expression of several target genes. Nevertheless, identifying these target genes at a genome-wide level is challenging, particularly for large-genome species, where enhancers and target genes can be hundreds of kilobases away. Therefore, the contribution of enhancers to plant regulatory networks remains poorly understood. Here, we investigate the enhancer-driven regulatory network of two maize tissues at different stages: leaves at seedling stage (V2-IST) and husks (bracts) at flowering. Using systems biology, we integrate genomic, epigenomic, and transcriptomic data to model the regulatory relationships between TFs and their potential target genes, and identify regulatory modules specific to husk and V2-IST. We show that leaves at the V2-IST stage are characterized by the response to hormones and macromolecules biogenesis and assembly, which are regulated by the BBR/BPC and AP2/ERF TF families, respectively. In contrast, husks are characterized by cell wall modification and response to abiotic stresses, which are, respectively, orchestrated by the C2C2/DOF and AP2/EREB families. Analysis of the corresponding enhancer sequences reveals that two different transposable element families (TIR transposon Mutator and MITE Pif/Harbinger) have shaped part of the regulatory network in each tissue, and that MITEs have provided potential new TF binding sites involved in husk tissue-specificity.
Affiliation(s)
- Maud Fagny
- Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE – Le Moulon, Gif-sur-Yvette, France
| | - Marieke Lydia Kuijjer
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, Oslo, Norway
- Department of Pathology, Leiden University Medical Center, Leiden, Netherlands
| | - Maike Stam
- Plant Development and (Epi) Genetics, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, Netherlands
| | - Johann Joets
- Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE – Le Moulon, Gif-sur-Yvette, France
| | - Olivier Turc
- LEPSE, Univ Montpellier, INRAE, Institut Agro, Montpellier, France
| | - Julien Rozière
- Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE – Le Moulon, Gif-sur-Yvette, France
- Université Paris-Saclay, CNRS, INRAE, Univ Evry, Institute of Plant Sciences Paris-Saclay (IPS2), Orsay, France
- Université de Paris, CNRS, INRAE, Institute of Plant Sciences Paris-Saclay (IPS2), Orsay, France
| | - Stéphanie Pateyron
- Université Paris-Saclay, CNRS, INRAE, Univ Evry, Institute of Plant Sciences Paris-Saclay (IPS2), Orsay, France
- Université de Paris, CNRS, INRAE, Institute of Plant Sciences Paris-Saclay (IPS2), Orsay, France
| | - Anthony Venon
- Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE – Le Moulon, Gif-sur-Yvette, France
| | - Clémentine Vitte
- Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE – Le Moulon, Gif-sur-Yvette, France
|
15
|
Delos Santos NP, Texari L, Benner C. MEIRLOP: improving score-based motif enrichment by incorporating sequence bias covariates. BMC Bioinformatics 2020; 21:410. [PMID: 32938397 PMCID: PMC7493370 DOI: 10.1186/s12859-020-03739-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/25/2020] [Accepted: 09/04/2020] [Indexed: 12/23/2022]
Abstract
BACKGROUND: Motif enrichment analysis (MEA) identifies over-represented transcription factor (TF) binding motifs in the DNA sequence of regulatory regions, enabling researchers to infer which transcription factors can regulate transcriptional response to a stimulus, or identify sequence features found near a target protein in a ChIP-seq experiment. Score-based MEA determines motifs enriched in regions exhibiting extreme differences in regulatory activity, but existing methods do not control for biases in GC content or dinucleotide composition. This lack of control for sequence bias, such as that often found in CpG islands, can obscure the enrichment of biologically relevant motifs. RESULTS: We developed Motif Enrichment In Ranked Lists of Peaks (MEIRLOP), a novel MEA method that determines enrichment of TF binding motifs in a list of scored regulatory regions, while controlling for sequence bias. In this study, we compare MEIRLOP against other MEA methods in identifying binding motifs found enriched in differentially active regulatory regions after interferon-beta stimulus, finding that using logistic regression and covariates improves the ability to call enrichment of ISGF3 binding motifs from differential acetylation ChIP-seq data compared to other methods. Our method achieves similar or better performance compared to other methods when quantifying the enrichment of TF binding motifs from ENCODE TF ChIP-seq datasets. We also demonstrate how MEIRLOP is broadly applicable to the analysis of numerous types of NGS assays and experimental designs. CONCLUSIONS: Our results demonstrate the importance of controlling for sequence bias when accurately identifying enriched DNA sequence motifs using score-based MEA. MEIRLOP is available for download from https://github.com/npdeloss/meirlop under the MIT license.
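The phrase "logistic regression and covariates" can be made concrete with a small Python sketch. It is one plausible reading of the abstract, not the MEIRLOP implementation: motif presence in a region is the outcome, the region's score is the predictor of interest, and GC content enters as a covariate. The toy data, the function name `fit_logistic`, and the learning-rate and step settings are assumptions for illustration.

```python
import math


def fit_logistic(features, labels, learning_rate=0.1, steps=2000):
    """Fit logistic regression by batch gradient ascent on the mean log-likelihood.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    n_features = len(features[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for row, label in zip(features, labels):
            z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = label - predicted
            gradient[0] += error
            for j, x in enumerate(row):
                gradient[j + 1] += error * x
        weights = [
            w + learning_rate * g / len(features) for w, g in zip(weights, gradient)
        ]
    return weights


# Hypothetical regions: (activity score, GC fraction, motif present as 0/1).
REGIONS = [
    (-2.0, 0.4, 0), (-1.0, 0.6, 0), (-1.0, 0.4, 1), (0.0, 0.6, 0),
    (0.0, 0.4, 1), (1.0, 0.6, 1), (1.0, 0.4, 0), (2.0, 0.6, 1),
]

X = [(score, gc) for score, gc, _ in REGIONS]
y = [present for _, _, present in REGIONS]
intercept, score_coef, gc_coef = fit_logistic(X, y)
```

Under this reading, a positive `score_coef` means the motif is more likely to occur in higher-scoring regions after accounting for GC content. A real analysis would also need a significance test for that coefficient and more covariates (for example, dinucleotide composition), which this sketch leaves out.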
Affiliation(s)
- Nathaniel P Delos Santos
- Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093-0640, USA
| | - Lorane Texari
- Department of Medicine, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093-0640, USA
| | - Christopher Benner
- Department of Medicine, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093-0640, USA.
|
16
|
Partridge EC, Chhetri SB, Prokop JW, Ramaker RC, Jansen CS, Goh ST, Mackiewicz M, Newberry KM, Brandsmeier LA, Meadows SK, Messer CL, Hardigan AA, Coppola CJ, Dean EC, Jiang S, Savic D, Mortazavi A, Wold BJ, Myers RM, Mendenhall EM. Occupancy maps of 208 chromatin-associated proteins in one human cell type. Nature 2020; 583:720-728. [PMID: 32728244 PMCID: PMC7398277 DOI: 10.1038/s41586-020-2023-4] [Citation(s) in RCA: 73] [Impact Index Per Article: 18.3] [Received: 10/04/2017] [Accepted: 01/09/2020] [Indexed: 01/02/2023]
Abstract
Transcription factors are DNA-binding proteins that have key roles in gene regulation. Genome-wide occupancy maps of transcriptional regulators are important for understanding gene regulation and its effects on diverse biological processes. However, only a minority of the more than 1,600 transcription factors encoded in the human genome has been assayed. Here we present, as part of the ENCODE (Encyclopedia of DNA Elements) project, data and analyses from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP–seq) experiments using the human HepG2 cell line for 208 chromatin-associated proteins (CAPs). These comprise 171 transcription factors and 37 transcriptional cofactors and chromatin regulator proteins, and represent nearly one-quarter of CAPs expressed in HepG2 cells. The binding profiles of these CAPs form major groups associated predominantly with promoters or enhancers, or with both. We confirm and expand the current catalogue of DNA sequence motifs for transcription factors, and describe motifs that correspond to other transcription factors that are co-enriched with the primary ChIP target. For example, FOX family motifs are enriched in ChIP–seq peaks of 37 other CAPs. We show that motif content and occupancy patterns can distinguish between promoters and enhancers. This catalogue reveals high-occupancy target regions at which many CAPs associate, although each contains motifs for only a minority of the numerous associated transcription factors. These analyses provide a more complete overview of the gene regulatory networks that define this cell type, and demonstrate the usefulness of the large-scale production efforts of the ENCODE Consortium. ChIP–seq and CETCh–seq data are used to analyse binding maps for 208 transcription factors and other chromatin-associated proteins in a single human cell type, providing a comprehensive catalogue of the transcription factor landscape and gene regulatory networks in these cells.
Affiliation(s)
| | - Surya B Chhetri
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Biological Sciences, The University of Alabama in Huntsville, Huntsville, AL, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Jeremy W Prokop
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA
| | - Ryne C Ramaker
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Camden S Jansen
- Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA, USA
| | - Say-Tar Goh
- Division of Biology, California Institute of Technology, Pasadena, CA, USA
| | - Mark Mackiewicz
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | | | | | - Sarah K Meadows
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - C Luke Messer
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - Andrew A Hardigan
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Candice J Coppola
- Department of Biological Sciences, The University of Alabama in Huntsville, Huntsville, AL, USA
| | - Emma C Dean
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Pathology, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Shan Jiang
- Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA, USA
| | - Daniel Savic
- Pharmaceutical Sciences Department, St Jude Children's Research Hospital, Memphis, TN, USA
| | - Ali Mortazavi
- Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA, USA
| | - Barbara J Wold
- Division of Biology, California Institute of Technology, Pasadena, CA, USA
| | - Richard M Myers
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA.
| | - Eric M Mendenhall
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Biological Sciences, The University of Alabama in Huntsville, Huntsville, AL, USA
|
17
|
Ibarra IL, Hollmann NM, Klaus B, Augsten S, Velten B, Hennig J, Zaugg JB. Mechanistic insights into transcription factor cooperativity and its impact on protein-phenotype interactions. Nat Commun 2020; 11:124. [PMID: 31913281 PMCID: PMC6949242 DOI: 10.1038/s41467-019-13888-7] [Citation(s) in RCA: 47] [Impact Index Per Article: 11.8] [Received: 05/16/2019] [Accepted: 11/28/2019] [Indexed: 11/25/2022]
Abstract
Recent high-throughput transcription factor (TF) binding assays revealed that TF cooperativity is a widespread phenomenon. However, a global mechanistic and functional understanding of TF cooperativity is still lacking. To address this, here we introduce a statistical learning framework that provides structural insight into TF cooperativity and its functional consequences based on next generation sequencing data. We identify DNA shape as a driver for cooperativity, with a particularly strong effect for Forkhead-Ets pairs. Follow-up experiments reveal a local shape preference at the Ets-DNA-Forkhead interface and decreased cooperativity upon loss of the interaction. Additionally, we discover many functional associations for cooperatively bound TFs. Examination of the link between FOXO1:ETV6 and lymphomas reveals that their joint expression levels improve patient clinical outcome stratification. Altogether, our results demonstrate that inter-family cooperative TF binding is driven by position-specific DNA readout mechanisms, which provides an additional regulatory layer for downstream biological functions. Although transcription factor (TF) cooperativity is widespread, a global mechanistic understanding of the role of TF cooperativity is still lacking. Here the authors introduce a statistical learning framework that provides structural insight into TF cooperativity and its functional consequences based on next generation sequencing data and provide mechanistic insights into TF cooperativity and its impact on protein-phenotype interactions.
Affiliation(s)
- Ignacio L Ibarra
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany; Faculty of Biosciences, Collaboration for Joint PhD Degree between EMBL and Heidelberg University, Heidelberg, Germany
| | - Nele M Hollmann
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany; Faculty of Biosciences, Collaboration for Joint PhD Degree between EMBL and Heidelberg University, Heidelberg, Germany
| | - Bernd Klaus
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Sandra Augsten
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Britta Velten
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Janosch Hennig
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Judith B Zaugg
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.
| |
Collapse
|
18
|
Villanueva-Cañas JL, Horvath V, Aguilera L, González J. Diverse families of transposable elements affect the transcriptional regulation of stress-response genes in Drosophila melanogaster. Nucleic Acids Res 2019; 47:6842-6857. [PMID: 31175824 PMCID: PMC6649756 DOI: 10.1093/nar/gkz490] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 12/19/2018] [Revised: 05/20/2019] [Accepted: 05/22/2019] [Indexed: 12/25/2022] Open
Abstract
Although transposable elements are an important source of regulatory variation, their genome-wide contribution to the transcriptional regulation of stress-response genes has not been studied yet. Stress is a major aspect of natural selection in the wild, leading to changes in the transcriptional regulation of a variety of genes that are often triggered by one or a few transcription factors. In this work, we take advantage of the wealth of information available for Drosophila melanogaster and humans to analyze the role of transposable elements in six stress regulatory networks: immune, hypoxia, oxidative, xenobiotic, heat shock, and heavy metal. We found that transposable elements were enriched for caudal, dorsal, HSF, and tango binding sites in D. melanogaster and for NFE2L2 binding sites in humans. Taking into account the D. melanogaster population frequencies of transposable elements with predicted binding motifs and/or binding sites, we showed that those containing three or more binding motifs/sites are more likely to be functional. For a representative subset of these TEs, we performed in vivo transgenic reporter assays in different stress conditions. Overall, our results showed that TEs are relevant contributors to the transcriptional regulation of stress-response genes.
Affiliation(s)
- Vivien Horvath
  - Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, 08003 Barcelona, Spain
- Laura Aguilera
  - Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, 08003 Barcelona, Spain
- Josefa González
  - Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, 08003 Barcelona, Spain
|
19
|
Müller AU, Imkamp F, Weber-Ban E. The Mycobacterial LexA/RecA-Independent DNA Damage Response Is Controlled by PafBC and the Pup-Proteasome System. Cell Rep 2018; 23:3551-3564. [PMID: 29924998 DOI: 10.1016/j.celrep.2018.05.073] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 03/09/2018] [Revised: 04/16/2018] [Accepted: 05/22/2018] [Indexed: 12/11/2022] Open
Abstract
Mycobacteria exhibit two DNA damage response pathways: the LexA/RecA-dependent SOS response and a LexA/RecA-independent pathway. Using a combination of transcriptomics and genome-wide binding site analysis, we demonstrate that PafBC (proteasome accessory factor B and C), encoded in the Pup-proteasome system (PPS) gene locus, is the transcriptional regulator of the predominant LexA/RecA-independent pathway. Comparison of the resulting PafBC regulon with the DNA damage response of Mycobacterium smegmatis reveals that the majority of induced DNA repair genes are upregulated by PafBC. We further demonstrate that RecA, a member of the PafBC regulon and principal regulator of the SOS response, is degraded by the PPS when DNA damage stress has been overcome. Our results suggest a model for the regulation of the mycobacterial DNA damage response that employs the concerted action of PafBC as master transcriptional activator and the PPS for removal of DNA repair proteins to maintain a temporally controlled stress response.
Affiliation(s)
- Andreas U Müller
  - ETH Zurich, Institute of Molecular Biology and Biophysics, 8093 Zurich, Switzerland
- Frank Imkamp
  - University of Zurich, Institute of Medical Microbiology, 8006 Zurich, Switzerland
- Eilika Weber-Ban
  - ETH Zurich, Institute of Molecular Biology and Biophysics, 8093 Zurich, Switzerland
|
20
|
Berest I, Arnold C, Reyes-Palomares A, Palla G, Rasmussen KD, Giles H, Bruch PM, Huber W, Dietrich S, Helin K, Zaugg JB. Quantification of Differential Transcription Factor Activity and Multiomics-Based Classification into Activators and Repressors: diffTF. Cell Rep 2019; 29:3147-3159.e12. [DOI: 10.1016/j.celrep.2019.10.106] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 12/01/2018] [Revised: 09/20/2019] [Accepted: 10/28/2019] [Indexed: 12/26/2022] Open
|
21
|
Gheorghe M, Sandve GK, Khan A, Chèneby J, Ballester B, Mathelier A. A map of direct TF-DNA interactions in the human genome. Nucleic Acids Res 2019; 47:e21. [PMID: 30517703 PMCID: PMC6393237 DOI: 10.1093/nar/gky1210] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 08/18/2018] [Revised: 10/31/2018] [Accepted: 11/20/2018] [Indexed: 12/11/2022] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the most popular assay to identify genomic regions, called ChIP-seq peaks, that are bound in vivo by transcription factors (TFs). These regions are derived from direct TF-DNA interactions, indirect binding of the TF to the DNA (through a co-binding partner), nonspecific binding to the DNA, and noise/bias/artifacts. Delineating the bona fide direct TF-DNA interactions within the ChIP-seq peaks remains challenging. We developed a dedicated software, ChIP-eat, that combines computational TF binding models and ChIP-seq peaks to automatically predict direct TF-DNA interactions. Our work culminated with predicted interactions covering >4% of the human genome, obtained by uniformly processing 1983 ChIP-seq peak data sets from the ReMap database for 232 unique TFs. The predictions were a posteriori assessed using protein binding microarray and ChIP-exo data, and were predominantly found in high quality ChIP-seq peaks. The set of predicted direct TF-DNA interactions suggested that high-occupancy target regions are likely not derived from direct binding of the TFs to the DNA. Our predictions derived co-binding TFs supported by protein-protein interaction data and defined cis-regulatory modules enriched for disease- and trait-associated SNPs. We provide this collection of direct TF-DNA interactions and cis-regulatory modules through the UniBind web-interface (http://unibind.uio.no).
Affiliation(s)
- Marius Gheorghe
  - Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
- Aziz Khan
  - Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
- Jeanne Chèneby
  - Aix Marseille Université, INSERM, TAGC, Marseille, France
- Anthony Mathelier
  - Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Radiumhospitalet, Oslo, Norway
|
22
|
Youn A, Marquez EJ, Lawlor N, Stitzel ML, Ucar D. BiFET: sequencing Bias-free transcription factor Footprint Enrichment Test. Nucleic Acids Res 2019; 47:e11. [PMID: 30428075 PMCID: PMC6344870 DOI: 10.1093/nar/gky1117] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/19/2018] [Accepted: 10/23/2018] [Indexed: 01/15/2023] Open
Abstract
Transcription factor (TF) footprinting uncovers putative protein–DNA binding via combined analyses of chromatin accessibility patterns and their underlying TF sequence motifs. TF footprints are frequently used to identify TFs that regulate activities of cell/condition-specific genomic regions (target loci) in comparison to control regions (background loci) using standard enrichment tests. However, there is a strong association between the chromatin accessibility level and the GC content of a locus and the number and types of TF footprints that can be detected at this site. Traditional enrichment tests (e.g. hypergeometric) do not account for this bias and inflate false positive associations. Therefore, we developed a novel post-processing method, Bias-free Footprint Enrichment Test (BiFET), that corrects for the biases arising from the differences in chromatin accessibility levels and GC contents between target and background loci in footprint enrichment analyses. We applied BiFET on TF footprint calls obtained from EndoC-βH1 ATAC-seq samples using three different algorithms (CENTIPEDE, HINT-BC and PIQ) and showed BiFET’s ability to increase power and reduce false positive rate when compared to hypergeometric test. Furthermore, we used BiFET to study TF footprints from human PBMC and pancreatic islet ATAC-seq samples to show its utility to identify putative TFs associated with cell-type-specific loci.
Affiliation(s)
- Ahrim Youn
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Eladio J Marquez
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Nathan Lawlor
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Michael L Stitzel
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT 06030, USA
  - Department of Genetics & Genome Sciences, University of Connecticut Health Center, Farmington, CT 06030, USA
- Duygu Ucar
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - Institute for Systems Genomics, University of Connecticut Health Center, Farmington, CT 06030, USA
  - Department of Genetics & Genome Sciences, University of Connecticut Health Center, Farmington, CT 06030, USA
|
23
|
Lecellier CH, Wasserman WW, Mathelier A. Human Enhancers Harboring Specific Sequence Composition, Activity, and Genome Organization Are Linked to the Immune Response. Genetics 2018; 209:1055-1071. [PMID: 29871881 PMCID: PMC6063234 DOI: 10.1534/genetics.118.301116] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 12/18/2018] [Accepted: 06/01/2018] [Indexed: 12/15/2022] Open
Abstract
The FANTOM5 consortium recently characterized 65,423 human enhancers from 1829 cell and tissue samples using the Cap Analysis of Gene Expression technology. We showed that the guanine and cytosine content at enhancer regions distinguishes two classes of enhancers harboring distinct DNA structural properties at flanking regions. A functional analysis of their predicted gene targets highlighted one class of enhancers as significantly enriched for associations with immune response genes. Moreover, these enhancers were specifically enriched for regulatory motifs recognized by transcription factors involved in immune response. We observed that enhancers enriched for links to immune response genes were more cell-type specific, preferentially activated upon bacterial infection, and with specific response activity. Looking at chromatin capture data, we found that the two classes of enhancers were lying in distinct topologically associating domains and chromatin loops. Our results suggest that specific nucleotide compositions encode for classes of enhancers that are functionally distinct and specifically organized in the human genome.
Affiliation(s)
- Charles-Henri Lecellier
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, Centre National de la Recherche Scientifique (CNRS), 34293 Montpellier cedex5, France
  - Institut de Biologie Computationnelle, 34095 Montpellier, France
  - Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, Canada
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, Canada
- Anthony Mathelier
  - Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, Canada
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, Faculty of Medicine, University of Oslo, 0349 Oslo, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0372 Oslo, Norway
|
24
|
Wang M, Tai C, E W, Wei L. DeFine: deep convolutional neural networks accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of functional non-coding variants. Nucleic Acids Res 2018; 46:e69. [PMID: 29617928 PMCID: PMC6009584 DOI: 10.1093/nar/gky215] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 07/03/2017] [Revised: 03/12/2018] [Accepted: 03/14/2018] [Indexed: 01/19/2023] Open
Abstract
The complex system of gene expression is regulated by the cell type-specific binding of transcription factors (TFs) to regulatory elements. Identifying variants that disrupt TF binding and lead to human diseases remains a great challenge. To address this, we implement sequence-based deep learning models that accurately predict the TF binding intensities to given DNA sequences. In addition to accurately classifying TF-DNA binding or unbinding, our models are capable of accurately predicting real-valued TF binding intensities by leveraging large-scale TF ChIP-seq data. The changes in the TF binding intensities between the altered sequence and the reference sequence reflect the degree of functional impact for the variant. This enables us to develop the tool DeFine (Deep learning based Functional impact of non-coding variants evaluator, http://define.cbi.pku.edu.cn) with improved performance for assessing the functional impact of non-coding variants including SNPs and indels. DeFine accurately identifies the causal functional non-coding variants from disease-associated variants in GWAS. DeFine is an effective and easy-to-use tool that facilitates systematic prioritization of functional non-coding variants.
Affiliation(s)
- Meng Wang
  - Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, 100871, P.R. China
- Cheng Tai
  - Center for Data Science, Peking University, Beijing, 100871, P.R. China
  - Beijing Institute of Big Data Research, Beijing, 100871, P.R. China
- Weinan E
  - Center for Data Science, Peking University, Beijing, 100871, P.R. China
  - Beijing Institute of Big Data Research, Beijing, 100871, P.R. China
  - Department of Mathematics and PACM, Princeton University, Princeton, NJ, 08544, USA
- Liping Wei
  - Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking University, Beijing, 100871, P.R. China
|
25
|
Wyler E, Menegatti J, Franke V, Kocks C, Boltengagen A, Hennig T, Theil K, Rutkowski A, Ferrai C, Baer L, Kermas L, Friedel C, Rajewsky N, Akalin A, Dölken L, Grässer F, Landthaler M. Widespread activation of antisense transcription of the host genome during herpes simplex virus 1 infection. Genome Biol 2017; 18:209. [PMID: 29089033 PMCID: PMC5663069 DOI: 10.1186/s13059-017-1329-5] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 06/07/2017] [Accepted: 09/29/2017] [Indexed: 12/19/2022] Open
Abstract
Background: Herpesviruses can infect a wide range of animal species. Herpes simplex virus 1 (HSV-1) is one of the eight herpesviruses that can infect humans and is prevalent worldwide. Herpesviruses have evolved multiple ways to adapt the infected cells to their needs, but knowledge about these transcriptional and post-transcriptional modifications is sparse. Results: Here, we show that HSV-1 induces the expression of about 1000 antisense transcripts from the human host cell genome. A subset of these is also activated by the closely related varicella zoster virus. Antisense transcripts originate either at gene promoters or within the gene body, and they show different susceptibility to the inhibition of early and immediate early viral gene expression. Overexpression of the major viral transcription factor ICP4 is sufficient to turn on a subset of antisense transcripts. Histone marks around transcription start sites of HSV-1-induced and constitutively transcribed antisense transcripts are highly similar, indicating that the genetic loci are already poised to transcribe these novel RNAs. Furthermore, an antisense transcript overlapping with the BBC3 gene (also known as PUMA) transcriptionally silences this potent inducer of apoptosis in cis. Conclusions: We show for the first time that a virus induces widespread antisense transcription of the host cell genome. We provide evidence that HSV-1 uses this to downregulate a strong inducer of apoptosis. Our findings open new perspectives on global and specific alterations of host cell transcription by viruses.
Affiliation(s)
- Emanuel Wyler
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Jennifer Menegatti
  - Institute of Virology, Saarland University Medical School, Kirrbergerstrasse, Haus 47, 66421, Homburg/Saar, Germany
- Vedran Franke
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Christine Kocks
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Anastasiya Boltengagen
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Thomas Hennig
  - Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Versbacherstr. 7, 97078, Würzburg, Germany
- Kathrin Theil
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Andrzej Rutkowski
  - Department of Medicine, University of Cambridge, Addenbrookes Hospital, Box 157, Hills Rd, Cambridge, CB2 0QQ, UK
  - Present address: AstraZeneca, Darwin Building, 310 Cambridge Science Park, Cambridge, CB4 0WG, UK
- Carmelo Ferrai
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Laura Baer
  - Institute of Virology, Saarland University Medical School, Kirrbergerstrasse, Haus 47, 66421, Homburg/Saar, Germany
- Lisa Kermas
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Caroline Friedel
  - Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstraße 17, 80333, München, Germany
- Nikolaus Rajewsky
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Altuna Akalin
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
- Lars Dölken
  - Institut für Virologie und Immunbiologie, Julius-Maximilians-Universität Würzburg, Versbacherstr. 7, 97078, Würzburg, Germany
- Friedrich Grässer
  - Institute of Virology, Saarland University Medical School, Kirrbergerstrasse, Haus 47, 66421, Homburg/Saar, Germany
- Markus Landthaler
  - Berlin Institute for Medical Systems Biology, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, Robert-Rössle-Strasse 10, 13125, Berlin, Germany
  - IRI Life Sciences, Institute für Biologie, Humboldt Universität zu Berlin, Philippstraße 13, 10115, Berlin, Germany
|
26
|
Mariani L, Weinand K, Vedenko A, Barrera LA, Bulyk ML. Identification of Human Lineage-Specific Transcriptional Coregulators Enabled by a Glossary of Binding Modules and Tunable Genomic Backgrounds. Cell Syst 2017; 5:187-201.e7. [PMID: 28957653 PMCID: PMC5657590 DOI: 10.1016/j.cels.2017.06.015] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 02/09/2017] [Revised: 06/03/2017] [Accepted: 06/29/2017] [Indexed: 01/08/2023]
Abstract
Transcription factors (TFs) control cellular processes by binding specific DNA motifs to modulate gene expression. Motif enrichment analysis of regulatory regions can identify direct and indirect TF binding sites. Here, we created a glossary of 108 non-redundant TF-8mer "modules" of shared specificity for 671 metazoan TFs from publicly available and new universal protein binding microarray data. Analysis of 239 ENCODE TF chromatin immunoprecipitation sequencing datasets and associated RNA sequencing profiles suggests the 8mer modules are more precise than position weight matrices in identifying indirect binding motifs and their associated tethering TFs. We also developed GENRE (genomically equivalent negative regions), a tunable tool for construction of matched genomic background sequences for analysis of regulatory regions. GENRE outperformed four state-of-the-art approaches to background sequence construction. We used our TF-8mer glossary and GENRE in the analysis of the indirect binding motifs for the co-occurrence of tethering factors, suggesting novel TF-TF interactions. We anticipate that these tools will aid in elucidating tissue-specific gene-regulatory programs.
Affiliation(s)
- Luca Mariani
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Kathryn Weinand
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Anastasia Vedenko
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Luis A Barrera
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
  - Harvard-MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115, USA
  - Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138, USA
- Martha L Bulyk
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
  - Harvard-MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115, USA
  - Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138, USA
  - Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
|
27
|
Jayaram N, Usvyat D, Martin ACR. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics 2016; 17:547. [PMID: 27806697 PMCID: PMC6889335 DOI: 10.1186/s12859-016-1298-9] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 06/19/2016] [Accepted: 10/20/2016] [Indexed: 12/21/2022] Open
Abstract
Background: Binding of transcription factors to transcription factor binding sites (TFBSs) is key to the mediation of transcriptional regulation. Information on experimentally validated functional TFBSs is limited and consequently there is a need for accurate prediction of TFBSs for gene annotation and in applications such as evaluating the effects of single nucleotide variations in causing disease. TFBSs are generally recognized by scanning a position weight matrix (PWM) against DNA using one of a number of available computer programs. Thus we set out to evaluate the best tools that can be used locally (and are therefore suitable for large-scale analyses) for creating PWMs from high-throughput ChIP-Seq data and for scanning them against DNA. Results: We evaluated a set of de novo motif discovery tools that could be downloaded and installed locally using ENCODE-ChIP-Seq data and showed that rGADEM was the best-performing tool. TFBS prediction tools used to scan PWMs against DNA fall into two classes: those that predict individual TFBSs and those that identify clusters. Our evaluation showed that FIMO and MCAST performed best respectively. Conclusions: Selection of the best-performing tools for generating PWMs from ChIP-Seq data and for scanning PWMs against DNA has the potential to improve prediction of precise transcription factor binding sites within regions identified by ChIP-Seq experiments for gene finding, understanding regulation and in evaluating the effects of single nucleotide variations in causing disease.
Affiliation(s)
- Narayan Jayaram
  - Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Daniel Usvyat
  - Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Andrew C R Martin
  - Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
|
28
|
Mathelier A, Xin B, Chiu TP, Yang L, Rohs R, Wasserman WW. DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Syst 2016; 3:278-286.e4. [PMID: 27546793 PMCID: PMC5042832 DOI: 10.1016/j.cels.2016.07.001] [Citation(s) in RCA: 85] [Impact Index Per Article: 10.6] [Received: 09/12/2015] [Revised: 03/04/2016] [Accepted: 06/30/2016] [Indexed: 01/09/2023]
Abstract
Interactions of transcription factors (TFs) with DNA comprise a complex interplay between base-specific amino acid contacts and readout of DNA structure. Recent studies have highlighted the complementarity of DNA sequence and shape in modeling TF binding in vitro. Here, we have provided a comprehensive evaluation of in vivo datasets to assess the predictive power obtained by augmenting various DNA sequence-based models of TF binding sites (TFBSs) with DNA shape features (helix twist, minor groove width, propeller twist, and roll). Results from 400 human ChIP-seq datasets for 76 TFs show that combining DNA shape features with position-specific scoring matrix (PSSM) scores improves TFBS predictions. Improvement has also been observed using TF flexible models and a machine-learning approach using a binary encoding of nucleotides in lieu of PSSMs. Incorporating DNA shape information is most beneficial for E2F and MADS-domain TF families. Our findings indicate that incorporating DNA sequence and shape information benefits the modeling of TF binding under complex in vivo conditions.
Affiliation(s)
- Anthony Mathelier
  - Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, 980 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo and Oslo University Hospital, 0318 Oslo, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0372 Oslo, Norway
- Beibei Xin
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Tsu-Pei Chiu
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Lin Yang
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Remo Rohs
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, 980 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada
|
29
|
Shi W, Fornes O, Mathelier A, Wasserman WW. Evaluating the impact of single nucleotide variants on transcription factor binding. Nucleic Acids Res 2016; 44:10106-10116. [PMID: 27492288 PMCID: PMC5137422 DOI: 10.1093/nar/gkw691] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 01/25/2016] [Revised: 07/25/2016] [Accepted: 07/26/2016] [Indexed: 12/21/2022] Open
Abstract
Diseases and phenotypes caused by disrupted transcription factor (TF) binding are being identified, but progress is hampered by our limited capacity to predict such functional alterations. Improving predictions may be dependent on expanding the set of bona fide TF binding alterations. Allele-specific binding (ASB) events, where TFs preferentially bind to one of the two alleles at heterozygous sites, reveal the impact of sequence variations in altered TF binding. Here, we present the largest ASB compilation to our knowledge, 10 765 ASB events retrieved from 45 ENCODE ChIP-Seq data sets. Our analysis showed that ASB events were frequently associated with motif alterations of the ChIP'ed TF and potential partner TFs, allelic difference of DNase I hypersensitivity and allelic difference of histone modifications. For TF dimers bound symmetrically to DNA, ASB data revealed that central positions of the TF binding motifs were disproportionately important for binding. Lastly, the impact of variation on TF binding was predicted by a classification model incorporating all the investigated features of ASB events. Classification models using only DNase I hypersensitivity and sequence data exhibited predictive accuracy approaching the models with substantially more features. Taken together, the combination of ASB data and the classification model represents an important step toward elucidating regulatory variants across the human genome.
Affiliation(s)
- Wenqiang Shi
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, Child & Family Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada; Bioinformatics Graduate Program, University of British Columbia, 2329 W Mall, Vancouver, BC V6T 1Z4, Canada.
- Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, Child & Family Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada.
- Anthony Mathelier
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, Child & Family Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada; Centre for Molecular Medicine Norway (NCMM), Nordic EMBL partnership, University of Oslo and Oslo University Hospital, Norway.
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, Child & Family Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada.
|
30
|
Differences in the Early Development of Human and Mouse Embryonic Stem Cells. PLoS One 2015; 10:e0140803. [PMID: 26473594 PMCID: PMC4608779 DOI: 10.1371/journal.pone.0140803] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 01/09/2015] [Accepted: 09/30/2015] [Indexed: 01/22/2023]
Abstract
We performed a systematic analysis of gene expression features in early (10–21 days) development of human vs mouse embryonic cells (hESCs vs mESCs). Many developmental features were found to be conserved, and a majority of differentially regulated genes show similar expression changes in both organisms. The similarity is especially evident when gene expression profiles are clustered together and properties of the clustered groups of genes are compared. The first 10 days of mESC development match the features of hESC development within 21 days, in accordance with the differences in population doubling time between human and mouse ESCs. At the same time, several important differences are seen. There is a clear difference in the initial expression change of transcription factors and stimulus-responsive genes, which may be caused by the difference in experimental procedures. However, we also found that some biological processes develop differently; this can clearly be shown, for example, for neuron and sensory organ development. Some groups of genes show peaks in expression levels during development, and these peaks cannot be claimed to happen at the same time points in the two organisms, as well as for the same groups of (orthologous) genes. We also detected a larger number of upregulated genes during development of mESCs as compared to hESCs. The differences were quantified by comparing promoters of related genes. Most gene groups behave similarly and have similar transcription factor (TF) binding sites on their promoters. A few groups of genes have similar promoters but are expressed differently in the two species. Interestingly, there are groups of genes that are expressed similarly although they have different promoters, which can be shown by comparing their TF binding sites. Namely, a large group of similarly expressed cell cycle-related genes is found to have discrepant TF binding properties in mouse vs human.
|
31
|
Dabrowski M, Dojer N, Krystkowiak I, Kaminska B, Wilczynski B. Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data. BMC Bioinformatics 2015; 16:140. [PMID: 25927199 PMCID: PMC4436866 DOI: 10.1186/s12859-015-0573-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 08/26/2014] [Accepted: 04/14/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND: For many years now, binding preferences of transcription factors have been described by so-called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them is left with many competing methods of identifying potential binding sites in a genome of interest, and there is little published information regarding the optimality of different choices. Thanks to the availability of a large number of different motif models as well as a number of experimental datasets describing actual binding of TFs in hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter.
RESULTS: We focus on the task of identifying potential transcription factor binding sites in the human genome. Firstly, we provide a comprehensive comparison of the coverage and quality of models available in different databases, showing that the public databases have comparable TF coverage and better motif performance than commercial databases. Secondly, we compare different motif scanners, showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Thirdly, we calculate for each motif a detection threshold optimizing the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and therefore most uniformly accurate.
CONCLUSION: We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/.
Affiliation(s)
- Michal Dabrowski
- Laboratory of Bioinformatics, Nencki Institute of Experimental Biology, Pasteura 3, Warszawa, 02-093, Poland.
- Norbert Dojer
- Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland.
- Izabella Krystkowiak
- Laboratory of Molecular Neurobiology, Nencki Institute of Experimental Biology, Pasteura 3, Warszawa, 02-093, Poland.
- Bozena Kaminska
- Laboratory of Molecular Neurobiology, Nencki Institute of Experimental Biology, Pasteura 3, Warszawa, 02-093, Poland.
- Bartek Wilczynski
- Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland.
|
32
|
Mathelier A, Shi W, Wasserman WW. Identification of altered cis-regulatory elements in human disease. Trends Genet 2015; 31:67-76. [DOI: 10.1016/j.tig.2014.12.003] [Citation(s) in RCA: 82] [Impact Index Per Article: 9.1] [Received: 11/07/2014] [Revised: 12/19/2014] [Accepted: 12/19/2014] [Indexed: 02/01/2023]
|
33
|
Worsley Hunt R, Wasserman WW. Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biol 2014; 15:412. [PMID: 25070602 PMCID: PMC4165360 DOI: 10.1186/s13059-014-0412-4] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 04/25/2014] [Accepted: 07/29/2014] [Indexed: 12/15/2022]
Abstract
BACKGROUND: The global effort to annotate the non-coding portion of the human genome relies heavily on chromatin immunoprecipitation data generated with high-throughput DNA sequencing (ChIP-seq). ChIP-seq is generally successful in detailing the segments of the genome bound by the immunoprecipitated transcription factor (TF); however, almost all datasets contain genomic regions devoid of the canonical motif for the TF. It remains to be determined whether these regions are related to the immunoprecipitated TF or whether, despite the use of controls, there is a portion of peaks that can be attributed to other causes.
RESULTS: Analyses across hundreds of ChIP-seq datasets generated for sequence-specific DNA binding TFs reveal a small set of TF binding profiles for which predicted TF binding site motifs are repeatedly observed to be significantly enriched. Grouping related binding profiles, the set includes CTCF-like, ETS-like, JUN-like, and THAP11 profiles. These frequently enriched profiles are termed ‘zingers’ to highlight their unanticipated enrichment in datasets for which they were not the targeted TF, and their potential impact on the interpretation and analysis of TF ChIP-seq data. Peaks with zinger motifs and lacking the ChIPped TF’s motif are observed to compose up to 45% of a ChIP-seq dataset. There is substantial overlap of zinger-motif-containing regions between diverse TF datasets, suggesting a mechanism that is not TF-specific for the recovery of these regions.
CONCLUSIONS: Based on the proximity of zinger regions to cohesin-bound segments, a loading station model is proposed. Further study of zingers will advance understanding of gene regulation.
|