1
Raditsa V, Tsukanov A, Bogomolov A, Levitsky V. Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data. NAR Genom Bioinform 2024; 6:lqae090. [PMID: 39071850 PMCID: PMC11282361 DOI: 10.1093/nargab/lqae090] [Received: 03/06/2024] [Revised: 06/03/2024] [Accepted: 07/19/2024] [Indexed: 07/30/2024] Open
Abstract
Efficient de novo motif discovery from the results of genome-wide mapping of transcription factor binding sites (ChIP-seq) depends on the choice of background nucleotide sequences. The foreground sequences (ChIP-seq peaks) contain not only specific motifs of target transcription factors, but also motifs overrepresented throughout the genome, such as simple sequence repeats. We performed a large-scale comparison of the 'synthetic' and 'genomic' approaches to generating background sequences for de novo motif discovery. The 'synthetic' approach shuffled nucleotides in peaks, while the 'genomic' approach selected sequences from the reference genome, either randomly or only from gene promoters, according to the fraction of A/T nucleotides in each sequence. We compiled benchmark collections of ChIP-seq datasets for mouse, human and Arabidopsis, and performed de novo motif discovery. We showed that the genomic approach provides both more robust detection of the known motifs of target transcription factors and more stringent exclusion of simple sequence repeats as possible non-specific motifs. The advantage of the genomic approach over the synthetic approach was greater in plants than in mammals. We developed the AntiNoise web service (https://denovosea.icgbio.ru/antinoise/), which implements the genomic approach to extract background sequences for twelve eukaryotic genomes.
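The two background strategies contrasted in this abstract can be illustrated with a minimal Python sketch. This is not the AntiNoise implementation: the function names, the `tolerance` and `max_tries` parameters, and the representation of the genome as a single string are all assumptions made for illustration only.

```python
import random


def at_fraction(seq):
    """Return the fraction of A/T nucleotides in a sequence."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def synthetic_background(peak, rng):
    """'Synthetic' approach: shuffle the nucleotides of one peak.

    Mononucleotide composition is preserved; any motif is destroyed.
    """
    bases = list(peak)
    rng.shuffle(bases)
    return "".join(bases)


def genomic_background(peak, genome, rng, tolerance=0.05, max_tries=10000):
    """'Genomic' approach: draw a random genome window of the same length
    whose A/T fraction is within `tolerance` of the peak's A/T fraction.

    `tolerance` and `max_tries` are hypothetical parameters of this sketch.
    Returns None if no matching window is found.
    """
    target = at_fraction(peak)
    length = len(peak)
    for _ in range(max_tries):
        start = rng.randrange(len(genome) - length + 1)
        window = genome[start:start + length]
        if abs(at_fraction(window) - target) <= tolerance:
            return window
    return None
```

Restricting the `genome` argument to promoter sequences would correspond to the promoter-only variant mentioned in the abstract. A shuffled background cannot contain genome-wide overrepresented elements such as simple sequence repeats, whereas sampled genomic windows can, which is consistent with the stricter exclusion of such repeats reported for the genomic approach.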
Affiliation(s)
- Vladimir V Raditsa
- Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton V Tsukanov
- Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Anton G Bogomolov
- Department of Cell Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Victor G Levitsky
- Department of System Biology, Institute of Cytology and Genetics, Novosibirsk 630090, Russia
- Department of Natural Science, Novosibirsk State University, Novosibirsk 630090, Russia
2
Xu C, Kleinschmidt H, Yang J, Leith EM, Johnson J, Tan S, Mahony S, Bai L. Systematic dissection of sequence features affecting binding specificity of a pioneer factor reveals binding synergy between FOXA1 and AP-1. Mol Cell 2024; 84:2838-2855.e10. [PMID: 39019045 PMCID: PMC11334613 DOI: 10.1016/j.molcel.2024.06.022] [Received: 01/09/2024] [Revised: 04/23/2024] [Accepted: 06/21/2024] [Indexed: 07/19/2024]
Abstract
Despite the unique ability of pioneer factors (PFs) to target nucleosomal sites in closed chromatin, they only bind a small fraction of their genomic motifs. The underlying mechanism of this selectivity is not well understood. Here, we design a high-throughput assay called chromatin immunoprecipitation with integrated synthetic oligonucleotides (ChIP-ISO) to systematically dissect sequence features affecting the binding specificity of a classic PF, FOXA1, in human A549 cells. Combining ChIP-ISO with in vitro and neural network analyses, we find that (1) FOXA1 binding is strongly affected by co-binding transcription factors (TFs) AP-1 and CEBPB; (2) FOXA1 and AP-1 show binding cooperativity in vitro; (3) FOXA1's binding is determined more by local sequences than chromatin context, including eu-/heterochromatin; and (4) AP-1 is partially responsible for differential binding of FOXA1 in different cell types. Our study presents a framework for elucidating genetic rules underlying PF binding specificity and reveals a mechanism for context-specific regulation of its binding.
Affiliation(s)
- Cheng Xu
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Holly Kleinschmidt
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Jianyu Yang
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Erik M Leith
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Jenna Johnson
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Song Tan
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Shaun Mahony
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Lu Bai
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA; Department of Physics, The Pennsylvania State University, University Park, PA 16802, USA.
3
Elkayam S, Tziony I, Orenstein Y. DeepCRISTL: deep transfer learning to predict CRISPR/Cas9 on-target editing efficiency in specific cellular contexts. Bioinformatics 2024; 40:btae481. [PMID: 39073893 PMCID: PMC11319645 DOI: 10.1093/bioinformatics/btae481] [Received: 11/14/2023] [Revised: 05/28/2024] [Accepted: 07/27/2024] [Indexed: 07/31/2024] Open
Abstract
MOTIVATION: CRISPR/Cas9 technology has been revolutionizing the field of gene editing. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs and so computational methods were developed to predict editing efficiency for any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models to predict editing efficiency. However, these high-throughput datasets have a low correlation with functional and endogenous datasets, which are too small to train accurate machine-learning models on.
RESULTS: We developed DeepCRISTL, a deep-learning model to predict the editing efficiency in a specific cellular context. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA editing efficiency and then fine-tunes the model on functional or endogenous data to fit a specific cellular context. We tested two state-of-the-art models trained on high-throughput datasets for editing efficiency prediction, our newly improved DeepHF and CRISPRon, combined with various transfer-learning approaches. The combination of CRISPRon and fine-tuning all model weights was the overall best performer. DeepCRISTL outperformed state-of-the-art methods in predicting editing efficiency in a specific cellular context on functional and endogenous datasets. Using saliency maps, we identified and compared the important features learned by DeepCRISTL across cellular contexts. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging transfer learning to utilize both high-throughput datasets and smaller and more biologically relevant datasets.
AVAILABILITY AND IMPLEMENTATION: DeepCRISTL is available via https://github.com/OrensteinLab/DeepCRISTL.
Affiliation(s)
- Shai Elkayam
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
- Ido Tziony
- Department of Computer Science, Bar-Ilan University, Ramat Gan 5290002, Israel
- Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat Gan 5290002, Israel
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan 5290002, Israel
4
Yang Y, Pe’er D. REUNION: transcription factor binding prediction and regulatory association inference from single-cell multi-omics data. Bioinformatics 2024; 40:i567-i575. [PMID: 38940155 PMCID: PMC11211829 DOI: 10.1093/bioinformatics/btae234] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION: Profiling of gene expression and chromatin accessibility by single-cell multi-omics approaches can help to systematically decipher how transcription factors (TFs) regulate target gene expression via cis-region interactions. However, integrating information from different modalities to discover regulatory associations is challenging, in part because motif scanning approaches miss many likely TF binding sites.
RESULTS: We develop REUNION, a framework for predicting genome-wide TF binding and cis-region-TF-gene "triplet" regulatory associations using single-cell multi-omics data. The first component of REUNION, Unify, utilizes information theory-inspired complementary score functions that incorporate TF expression, chromatin accessibility, and target gene expression to identify regulatory associations. The second component, Rediscover, takes Unify estimates as input for pseudo semi-supervised learning to predict TF binding in accessible genomic regions that may or may not include detected TF motifs. Rediscover leverages latent chromatin accessibility and sequence feature spaces of the genomic regions, without requiring chromatin immunoprecipitation data for model training. Applied to peripheral blood mononuclear cell data, REUNION outperforms alternative methods in TF binding prediction on average performance. In particular, it recovers missing region-TF associations from regions lacking detected motifs, which circumvents the reliance on motif scanning and facilitates discovery of novel associations involving potential co-binding transcriptional regulators. Newly identified region-TF associations, even in regions lacking a detected motif, improve the prediction of target gene expression in regulatory triplets, and are thus likely to genuinely participate in the regulation.
AVAILABILITY AND IMPLEMENTATION: All source code is available at https://github.com/yangymargaret/REUNION.
Affiliation(s)
- Yang Yang
- Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, United States
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, United States
- Dana Pe’er
- Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, United States
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, United States
5
Ehle C, Iyer-Bierhoff A, Wu Y, Xing S, Kiehntopf M, Mosig AS, Godmann M, Heinzel T. Downregulation of HNF4A enables transcriptomic reprogramming during the hepatic acute-phase response. Commun Biol 2024; 7:589. [PMID: 38755249 PMCID: PMC11099168 DOI: 10.1038/s42003-024-06288-1] [Received: 09/15/2023] [Accepted: 05/03/2024] [Indexed: 05/18/2024] Open
Abstract
The hepatic acute-phase response is characterized by a massive upregulation of serum proteins, such as haptoglobin and serum amyloid A, at the expense of liver homeostatic functions. Although the transcription factor hepatocyte nuclear factor 4 alpha (HNF4A) has a well-established role in safeguarding liver function and its cistrome spans around 50% of liver-specific genes, its role in the acute-phase response has received little attention so far. We demonstrate that HNF4A binds to and represses acute-phase genes under basal conditions. The reprogramming of hepatic transcription during inflammation necessitates loss of HNF4A function to allow expression of acute-phase genes while liver homeostatic genes are repressed. In a pre-clinical liver organoid model overexpression of HNF4A maintained liver functionality in spite of inflammation-induced cell damage. Conversely, HNF4A overexpression potently impaired the acute-phase response by retaining chromatin at regulatory regions of acute-phase genes inaccessible to transcription. Taken together, our data extend the understanding of dual HNF4A action as transcriptional activator and repressor, establishing HNF4A as gatekeeper for the hepatic acute-phase response.
Affiliation(s)
- Charlotte Ehle
- Institute of Biochemistry and Biophysics, Center for Molecular Biomedicine, Friedrich Schiller University Jena, 07745, Jena, Germany
- Aishwarya Iyer-Bierhoff
- Institute of Biochemistry and Biophysics, Center for Molecular Biomedicine, Friedrich Schiller University Jena, 07745, Jena, Germany
- Yunchen Wu
- Institute of Biochemistry and Biophysics, Center for Molecular Biomedicine, Friedrich Schiller University Jena, 07745, Jena, Germany
- Marshall Laboratory of Biomedical Engineering, Department of Pathogen Biology, Shenzhen University Medical School, Shenzhen University, Shenzhen, Guangdong, 518060, China
- Shaojun Xing
- Marshall Laboratory of Biomedical Engineering, Department of Pathogen Biology, Shenzhen University Medical School, Shenzhen University, Shenzhen, Guangdong, 518060, China
- Michael Kiehntopf
- Department of Clinical Chemistry and Laboratory Diagnostics, Jena University Hospital, 07747, Jena, Germany
- Alexander S Mosig
- Institute of Biochemistry II, Center for Sepsis Control and Care, Jena University Hospital, 07747, Jena, Germany
- Maren Godmann
- Institute of Biochemistry and Biophysics, Center for Molecular Biomedicine, Friedrich Schiller University Jena, 07745, Jena, Germany
- Thorsten Heinzel
- Institute of Biochemistry and Biophysics, Center for Molecular Biomedicine, Friedrich Schiller University Jena, 07745, Jena, Germany.
6
Saotome M, Poduval D, Grimm SA, Nagornyuk A, Gunarathna S, Shimbo T, Wade P, Takaku M. Genomic transcription factor binding site selection is edited by the chromatin remodeling factor CHD4. Nucleic Acids Res 2024; 52:3607-3622. [PMID: 38281186 PMCID: PMC11039999 DOI: 10.1093/nar/gkae025] [Received: 05/27/2023] [Revised: 12/19/2023] [Accepted: 01/04/2024] [Indexed: 01/30/2024] Open
Abstract
Biologically precise enhancer licensing by lineage-determining transcription factors enables activation of transcripts appropriate to biological demand and prevents deleterious gene activation. This essential process is challenged by the millions of matches to most transcription factor binding motifs present in many eukaryotic genomes, leading to questions about how transcription factors achieve the exquisite specificity required. The importance of chromatin remodeling factors to enhancer activation is highlighted by their frequent mutation in developmental disorders and in cancer. Here, we determine the roles of CHD4 in enhancer licensing and maintenance in breast cancer cells and during cellular reprogramming. In unchallenged basal breast cancer cells, CHD4 modulates chromatin accessibility. Its depletion leads to redistribution of transcription factors to previously unoccupied sites. During cellular reprogramming induced by the pioneer factor GATA3, CHD4 activity is necessary to prevent inappropriate chromatin opening. Mechanistically, CHD4 promotes nucleosome positioning over GATA3 binding motifs to compete with transcription factor-DNA interaction. We propose that CHD4 acts as a chromatin proof-reading enzyme that prevents unnecessary gene expression by editing chromatin binding activities of transcription factors.
Affiliation(s)
- Mika Saotome
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND 58202, USA
- Deepak B Poduval
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND 58202, USA
- Sara A Grimm
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
- Aerica Nagornyuk
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND 58202, USA
- Sakuntha Gunarathna
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND 58202, USA
- Takashi Shimbo
- Epigenetics and Stem Cell Biology Laboratory, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
- Paul A Wade
- Epigenetics and Stem Cell Biology Laboratory, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
- Motoki Takaku
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND 58202, USA
7
Yang Z, Li X, Sheng L, Zhu M, Lan X, Gu F. Multiomics-integrated deep language model enables in silico genome-wide detection of transcription factor binding site in unexplored biosamples. Bioinformatics 2024; 40:btae013. [PMID: 38216534 PMCID: PMC10812877 DOI: 10.1093/bioinformatics/btae013] [Received: 07/24/2023] [Revised: 12/07/2023] [Accepted: 01/11/2024] [Indexed: 01/14/2024] Open
Abstract
MOTIVATION: Transcription factor binding sites (TFBS) are regulatory elements that have a significant impact on transcription regulation and cell fate determination. Canonical motifs, biological experiments, and computational methods have made it possible to discover TFBS. However, most existing in silico TFBS prediction models are solely DNA-based and are trained and used within the same biosample, so they fail to infer TFBS in experimentally unexplored biosamples.
RESULTS: Here, we propose TFBS prediction by modified TransFormer (TFTF), a multimodal deep language architecture that integrates multiomics information in epigenetic studies. In comparison to existing computational techniques, TFTF has state-of-the-art accuracy, and is also the first approach to accurately perform genome-wide detection of cell-type- and species-specific TFBS in experimentally unexplored biosamples. Compared to peak-calling methods, TFTF consistently discovers true TFBS in a threshold-tuning-free way, with higher recall rates. The underlying mechanism of TFTF reveals greater attention to the targeted TF's motif region in TFBS, and general attention to the entire peak region in non-TFBS. TFTF can benefit from the integration of broader and more diverse data for improvement and can be applied to multiple epigenetic scenarios.
AVAILABILITY AND IMPLEMENTATION: We provide a web server (https://tftf.ibreed.cn/) for users to utilize the TFTF model. Users can train the TFTF model and discover TFBS with their own data.
Affiliation(s)
- Zikun Yang
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
- Xin Li
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
- Lele Sheng
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
- Ming Zhu
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing 100084, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Xun Lan
- Department of Basic Medical Science, School of Medicine, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing 100084, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Fei Gu
- Damo Academy, Alibaba Group, Hangzhou 310023, China
- Hupan Lab, Hangzhou 310023, China
8
Neikes HK, Kliza KW, Gräwe C, Wester RA, Jansen PWTC, Lamers LA, Baltissen MP, van Heeringen SJ, Logie C, Teichmann SA, Lindeboom RGH, Vermeulen M. Quantification of absolute transcription factor binding affinities in the native chromatin context using BANC-seq. Nat Biotechnol 2023; 41:1801-1809. [PMID: 36973556 DOI: 10.1038/s41587-023-01715-w] [Received: 03/09/2022] [Accepted: 02/16/2023] [Indexed: 03/29/2023]
Abstract
Transcription factor binding across the genome is regulated by DNA sequence and chromatin features. However, it is not yet possible to quantify the impact of chromatin context on transcription factor binding affinities. Here, we report a method called binding affinities to native chromatin by sequencing (BANC-seq) to determine absolute apparent binding affinities of transcription factors to native DNA across the genome. In BANC-seq, a concentration range of a tagged transcription factor is added to isolated nuclei. Concentration-dependent binding is then measured per sample to quantify apparent binding affinities across the genome. BANC-seq adds a quantitative dimension to transcription factor biology, which enables stratification of genomic targets based on transcription factor concentration and prediction of transcription factor binding sites under non-physiological conditions, such as disease-associated overexpression of (onco)genes. Notably, whereas consensus DNA binding motifs for transcription factors are important to establish high-affinity binding sites, these motifs are not always strictly required to generate nanomolar-affinity interactions in the genome.
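How an apparent affinity might be read off a concentration series can be sketched as follows. The abstract does not state which binding model BANC-seq fits, so this example assumes a simple one-site saturation curve, signal = Bmax * c / (Kd + c), and uses a grid search over candidate Kd values; the function name, the model, and the grid are illustrative assumptions, not the published method.

```python
def fit_apparent_kd(concentrations, signals, kd_grid):
    """Fit signal = bmax * c / (kd + c) by grid search over kd.

    For each candidate kd, the best bmax has a closed-form
    least-squares solution; the pair with the lowest squared
    error is returned as (kd, bmax).
    """
    best = None
    for kd in kd_grid:
        # Fractional occupancy predicted at each concentration.
        occupancy = [c / (kd + c) for c in concentrations]
        denom = sum(o * o for o in occupancy)
        if denom == 0:
            continue
        bmax = sum(o * s for o, s in zip(occupancy, signals)) / denom
        sse = sum((s - bmax * o) ** 2 for o, s in zip(occupancy, signals))
        if best is None or sse < best[0]:
            best = (sse, kd, bmax)
    if best is None:
        raise ValueError("no candidate kd produced a valid fit")
    return best[1], best[2]


# Made-up, noise-free example data for one genomic site (arbitrary units).
concs = [1, 3, 10, 30, 100, 300, 1000]
observed = [2.0 * c / (50 + c) for c in concs]
kd, bmax = fit_apparent_kd(concs, observed, kd_grid=range(5, 505, 5))
```

Applying such a fit independently to every bound site would give one apparent Kd per site, which is the kind of per-site quantity that allows genomic targets to be stratified by transcription factor concentration.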
Affiliation(s)
- Hannah K Neikes
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands
- Katarzyna W Kliza
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands
- Cathrin Gräwe
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands
- Roelof A Wester
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands
- Pascal W T C Jansen
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands
- Lieke A Lamers
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands
- Marijke P Baltissen
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands
- Simon J van Heeringen
- Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Radboud University Nijmegen, Nijmegen, the Netherlands
- Colin Logie
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Radboud University Nijmegen, Nijmegen, the Netherlands
- Rik G H Lindeboom
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, UK.
- The Netherlands Cancer Institute, Amsterdam, the Netherlands.
- Michiel Vermeulen
- Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, Oncode Institute, Radboud University Nijmegen, Nijmegen, the Netherlands.
- The Netherlands Cancer Institute, Amsterdam, the Netherlands.
9
Filipovic D, Qi W, Kana O, Marri D, LeCluyse EL, Andersen ME, Cuddapah S, Bhattacharya S. Interpretable predictive models of genome-wide aryl hydrocarbon receptor-DNA binding reveal tissue-specific binding determinants. Toxicol Sci 2023; 196:170-186. [PMID: 37707797 PMCID: PMC10682972 DOI: 10.1093/toxsci/kfad094] [Indexed: 09/15/2023] Open
Abstract
The aryl hydrocarbon receptor (AhR) is an inducible transcription factor whose ligands include the potent environmental contaminant 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD). Ligand-activated AhR binds to DNA at dioxin response elements (DREs) containing the core motif 5'-GCGTG-3'. However, AhR binding is highly tissue specific. Most DREs in accessible chromatin are not bound by TCDD-activated AhR, and DREs accessible in multiple tissues can be bound in some and unbound in others. As such, AhR functions similarly to many nuclear receptors. Given that AhR possesses a strong core motif, it is suited for a motif-centered analysis of its binding. We developed interpretable machine learning models predicting the AhR binding status of DREs in MCF-7, GM17212, and HepG2 cells, as well as primary human hepatocytes. Cross-tissue models predicting transcription factor (TF)-DNA binding generally perform poorly. However, reasons for the low performance remain unexplored. By interpreting the results of individual within-tissue models and by examining the features leading to low cross-tissue performance, we identified sequence and chromatin context patterns correlated with AhR binding. We conclude that AhR binding is driven by a complex interplay of tissue-agnostic DRE flanking DNA sequence and tissue-specific local chromatin context. Additionally, we demonstrate that interpretable machine learning models can provide novel and experimentally testable mechanistic insights into DNA binding by inducible TFs.
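A motif-centered analysis of this kind starts by locating every occurrence of the DRE core 5'-GCGTG-3' on either strand. The sketch below shows one simple way to do that in Python; the function names and the (position, strand) output format are assumptions for illustration and are not taken from the paper.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_dre_cores(seq, core="GCGTG"):
    """List (0-based start, strand) for each DRE core match.

    A '+' hit matches the core on the given strand; a '-' hit matches
    its reverse complement, i.e. the core lies on the opposite strand.
    """
    seq = seq.upper()
    rc_core = reverse_complement(core)
    hits = []
    for i in range(len(seq) - len(core) + 1):
        kmer = seq[i:i + len(core)]
        if kmer == core:
            hits.append((i, "+"))
        elif kmer == rc_core:
            hits.append((i, "-"))
    return hits


hits = find_dre_cores("TTGCGTGAACACGCTT")
# → [(2, '+'), (9, '-')]
```

Each hit could then be annotated with flanking sequence and local chromatin features, which are the two classes of determinants the abstract identifies as driving AhR binding.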
Affiliation(s)
- David Filipovic
- Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Wenjie Qi
- Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Omar Kana
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Department of Pharmacology & Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
- Institute for Integrative Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
- Daniel Marri
- Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Edward L LeCluyse
- LifeSciences Division, LifeNet Health, Research Triangle Park, North Carolina 27709, USA
- Suresh Cuddapah
- Division of Environmental Medicine, Department of Medicine, New York University School of Medicine, New York, New York 10010, USA
- Sudin Bhattacharya
- Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Department of Pharmacology & Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
- Institute for Integrative Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
- Center for Research on Ingredient Safety, Michigan State University, East Lansing, Michigan 48824, USA
10
Xu C, Kleinschmidt H, Yang J, Leith E, Johnson J, Tan S, Mahony S, Bai L. Systematic Dissection of Sequence Features Affecting the Binding Specificity of a Pioneer Factor Reveals Binding Synergy Between FOXA1 and AP-1. bioRxiv 2023:2023.11.08.566246. [PMID: 37986839 PMCID: PMC10659273 DOI: 10.1101/2023.11.08.566246] [Indexed: 11/22/2023]
Abstract
Despite the unique ability of pioneer transcription factors (PFs) to target nucleosomal sites in closed chromatin, they only bind a small fraction of their genomic motifs. The underlying mechanism of this selectivity is not well understood. Here, we design a high-throughput assay called ChIP-ISO to systematically dissect sequence features affecting the binding specificity of a classic PF, FOXA1. Combining ChIP-ISO with in vitro and neural network analyses, we find that 1) FOXA1 binding is strongly affected by co-binding TFs AP-1 and CEBPB, 2) FOXA1 and AP-1 show binding cooperativity in vitro, 3) FOXA1's binding is determined more by local sequences than chromatin context, including eu-/heterochromatin, and 4) AP-1 is partially responsible for differential binding of FOXA1 in different cell types. Our study presents a framework for elucidating genetic rules underlying PF binding specificity and reveals a mechanism for context-specific regulation of its binding.
Affiliation(s)
- Cheng Xu
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Holly Kleinschmidt
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Jianyu Yang
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
| | - Erik Leith
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
| | - Jenna Johnson
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
| | - Song Tan
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
| | - Shaun Mahony
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
| | - Lu Bai
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA 16802, USA
- Department of Physics, The Pennsylvania State University, University Park, PA 16802, USA
| |
Collapse
|
11
|
Grau J, Schmidt F, Schulz MH. Widespread effects of DNA methylation and intra-motif dependencies revealed by novel transcription factor binding models. Nucleic Acids Res 2023; 51:e95. [PMID: 37650641 PMCID: PMC10570048 DOI: 10.1093/nar/gkad693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/20/2023] [Accepted: 08/10/2023] [Indexed: 09/01/2023] Open
Abstract
Several studies suggested that transcription factor (TF) binding to DNA may be impaired or enhanced by DNA methylation. We present MeDeMo, a toolbox for TF motif analysis that combines information about DNA methylation with models capturing intra-motif dependencies. In a large-scale study using ChIP-seq data for 335 TFs, we identify novel TFs that show a binding behaviour associated with DNA methylation. Overall, we find that the presence of CpG methylation decreases the likelihood of binding for the majority of methylation-associated TFs. For a considerable subset of TFs, we show that intra-motif dependencies are pivotal for accurately modelling the impact of DNA methylation on TF binding. We illustrate that the novel methylation-aware TF binding models make it possible to predict differential ChIP-seq peaks and improve the genome-wide analysis of TF binding. Our work indicates that simplistic models that neglect the effect of DNA methylation on DNA binding may lead to systematic underperformance for methylation-associated TFs.
Affiliation(s)
- Jan Grau
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle 06120, Germany
- Florian Schmidt
  - Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
  - Systems Biology and Data Analytics, Genome Institute of Singapore, Singapore 13862, Singapore
  - ImmunoScape Pte Ltd, Singapore 228208, Singapore
- Marcel H Schulz
  - Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
  - German Center for Cardiovascular Research, Partner site Rhein-Main, 60590 Frankfurt am Main, Germany
  - Cardio-Pulmonary Institute, Goethe University, Frankfurt am Main, Germany
|
12
|
Walker M, Li Y, Morales-Hernandez A, Qi Q, Parupalli C, Brown S, Christian C, Clements WK, Cheng Y, McKinney-Freeman S. An NFIX-mediated regulatory network governs the balance of hematopoietic stem and progenitor cells during hematopoiesis. Blood Adv 2023; 7:4677-4689. [PMID: 36478187 PMCID: PMC10468369 DOI: 10.1182/bloodadvances.2022007811] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/08/2022] [Revised: 10/07/2022] [Accepted: 11/09/2022] [Indexed: 12/12/2022] Open
Abstract
The transcription factor (TF) nuclear factor I-X (NFIX) is a positive regulator of hematopoietic stem and progenitor cell (HSPC) transplantation. Nfix-deficient HSPCs exhibit a severe loss of repopulating activity, increased apoptosis, and a loss of colony-forming potential. However, the underlying mechanism remains elusive. Here, we performed cellular indexing of transcriptomes and epitopes by high-throughput sequencing (CITE-seq) on Nfix-deficient HSPCs and observed a loss of long-term hematopoietic stem cells and an accumulation of megakaryocyte and myelo-erythroid progenitors. The genome-wide binding profile of NFIX in primitive murine hematopoietic cells revealed its colocalization with other hematopoietic TFs, such as PU.1. We confirmed the physical interaction between NFIX and PU.1 and demonstrated that the 2 TFs co-occupy super-enhancers and regulate genes implicated in cellular respiration and hematopoietic differentiation. In addition, we provide evidence suggesting that the absence of NFIX negatively affects PU.1 binding at some genomic loci. Our data support a model in which NFIX collaborates with PU.1 at super-enhancers to promote the differentiation and homeostatic balance of hematopoietic progenitors.
Affiliation(s)
- Megan Walker
  - Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN
- Yichao Li
  - Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN
- Qian Qi
  - Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN
- Scott Brown
  - Department of Immunology, St. Jude Children’s Research Hospital, Memphis, TN
- Claiborne Christian
  - Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN
- Wilson K. Clements
  - Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN
- Yong Cheng
  - Department of Hematology, St. Jude Children’s Research Hospital, Memphis, TN
|
13
|
Villaman C, Pollastri G, Saez M, Martin AJ. Benefiting from the intrinsic role of epigenetics to predict patterns of CTCF binding. Comput Struct Biotechnol J 2023; 21:3024-3031. [PMID: 37266407 PMCID: PMC10229758 DOI: 10.1016/j.csbj.2023.05.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Revised: 05/11/2023] [Accepted: 05/11/2023] [Indexed: 06/03/2023] Open
Abstract
Motivation: One of the most relevant mechanisms involved in the determination of chromatin structure is the formation of structural loops that are also related to the conservation of chromatin states. Many of these loops are stabilized by CCCTC-binding factor (CTCF) proteins at their base. Despite the relevance of chromatin structure and the key role of CTCF, the role of the epigenetic factors that are involved in the regulation of CTCF binding, and thus in the formation of structural loops in the chromatin, is not thoroughly understood. Results: Here we describe a CTCF binding predictor based on Random Forest that employs different epigenetic data and genomic features. Importantly, given the ability of Random Forests to determine the relevance of features for the prediction, our approach also shows how the different types of descriptors impact the binding of CTCF, confirming previous knowledge on the relevance of chromatin accessibility and DNA methylation, but demonstrating the effect of epigenetic modifications on the activity of CTCF. We compared our approach against other predictors and found improved performance in terms of areas under PR and ROC curves (PRAUC-ROCAUC), outperforming current state-of-the-art methods.
Affiliation(s)
- Camilo Villaman
  - Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
  - Laboratorio de Redes Biológicas, Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Fundación Ciencia & Vida, Escuela de Ingeniería, Facultad de Ingeniería, Arquitectura y Diseño, Universidad San Sebastián, Santiago, Chile
- Mauricio Saez
  - Centro de Oncología de Precisión, Facultad de Medicina y Ciencias de la Salud, Universidad Mayor, Santiago, Chile
  - Laboratorio de Investigación en Salud de Precisión, Departamento de Procesos Diagnósticos y Evaluación, Facultad de Ciencias de la Salud, Universidad Católica de Temuco, Chile
- Alberto J.M. Martin
  - Laboratorio de Redes Biológicas, Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Fundación Ciencia & Vida, Escuela de Ingeniería, Facultad de Ingeniería, Arquitectura y Diseño, Universidad San Sebastián, Santiago, Chile
|
14
|
Computational approaches to understand transcription regulation in development. Biochem Soc Trans 2023; 51:1-12. [PMID: 36695505 PMCID: PMC9988001 DOI: 10.1042/bst20210145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 01/07/2023] [Accepted: 01/13/2023] [Indexed: 01/26/2023]
Abstract
Gene regulatory networks (GRNs) serve as useful abstractions to understand transcriptional dynamics in developmental systems. Computational prediction of GRNs has been successfully applied to genome-wide gene expression measurements with the advent of microarrays and RNA-sequencing. However, these inferred networks are inaccurate and mostly based on correlative rather than causative interactions. In this review, we highlight three approaches that significantly impact GRN inference: (1) moving from one genome-wide functional modality, gene expression, to multi-omics, (2) single cell sequencing, to measure cell type-specific signals and predict context-specific GRNs, and (3) neural networks as flexible models. Together, these experimental and computational developments have the potential to significantly impact the quality of inferred GRNs. Ultimately, accurately modeling the regulatory interactions between transcription factors and their target genes will be essential to understand the role of transcription factors in driving developmental gene expression programs and to derive testable hypotheses for validation.
|
15
|
Cazares TA, Rizvi FW, Iyer B, Chen X, Kotliar M, Bejjani AT, Wayman JA, Donmez O, Wronowski B, Parameswaran S, Kottyan LC, Barski A, Weirauch MT, Prasath VBS, Miraldi ER. maxATAC: Genome-scale transcription-factor binding prediction from ATAC-seq with deep neural networks. PLoS Comput Biol 2023; 19:e1010863. [PMID: 36719906 PMCID: PMC9917285 DOI: 10.1371/journal.pcbi.1010863] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 07/06/2022] [Revised: 02/10/2023] [Accepted: 01/10/2023] [Indexed: 02/01/2023] Open
Abstract
Transcription factors read the genome, fundamentally connecting DNA sequence to gene expression across diverse cell types. Determining how, where, and when TFs bind chromatin will advance our understanding of gene regulatory networks and cellular behavior. The 2017 ENCODE-DREAM in vivo Transcription-Factor Binding Site (TFBS) Prediction Challenge highlighted the value of chromatin accessibility data to TFBS prediction, establishing state-of-the-art methods for TFBS prediction from DNase-seq. However, the more recent Assay-for-Transposase-Accessible-Chromatin (ATAC)-seq has surpassed DNase-seq as the most widely-used chromatin accessibility profiling method. Furthermore, ATAC-seq is the only such technique available at single-cell resolution from standard commercial platforms. While ATAC-seq datasets grow exponentially, suboptimal motif scanning is unfortunately the most common method for TFBS prediction from ATAC-seq. To enable community access to state-of-the-art TFBS prediction from ATAC-seq, we (1) curated an extensive benchmark dataset (127 TFs) for ATAC-seq model training and (2) built "maxATAC", a suite of user-friendly, deep neural network models for genome-wide TFBS prediction from ATAC-seq in any cell type. With models available for 127 human TFs, maxATAC is the largest collection of high-performance TFBS prediction models for ATAC-seq. maxATAC performance extends to primary cells and single-cell ATAC-seq, enabling improved TFBS prediction in vivo. We demonstrate maxATAC's capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci.
Affiliation(s)
- Tareian A. Cazares
  - Immunology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
- Faiz W. Rizvi
  - Systems Biology and Physiology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
- Balaji Iyer
  - Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio, United States of America
- Xiaoting Chen
  - The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Michael Kotliar
  - Division of Allergy and Immunology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Anthony T. Bejjani
  - Molecular and Developmental Biology Graduate Program, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
- Joseph A. Wayman
  - Division of Immunobiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Omer Donmez
  - The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Benjamin Wronowski
  - Division of Allergy and Immunology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Sreeja Parameswaran
  - The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Leah C. Kottyan
  - The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
  - Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Artem Barski
  - Division of Allergy and Immunology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
  - Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- Matthew T. Weirauch
  - Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - The Center for Autoimmune Genetics and Etiology (CAGE), Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
  - Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Division of Developmental Biology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
- V. B. Surya Prasath
  - Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio, United States of America
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
- Emily R. Miraldi
  - Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Department of Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, Ohio, United States of America
  - Division of Immunobiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio, United States of America
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States of America
|
16
|
Zhang Q, Teng P, Wang S, He Y, Cui Z, Guo Z, Liu Y, Yuan C, Liu Q, Huang DS. Computational prediction and characterization of cell-type-specific and shared binding sites. Bioinformatics 2022; 39:6885447. [PMID: 36484687 PMCID: PMC9825777 DOI: 10.1093/bioinformatics/btac798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 11/24/2022] [Accepted: 12/08/2022] [Indexed: 12/13/2022] Open
Abstract
Motivation: Cell-type-specific gene expression is maintained in large part by transcription factors (TFs) selectively binding to distinct sets of sites in different cell types. Recent studies have provided evidence that such cell-type-specific binding is determined by TFs' intrinsic sequence preferences, cooperative interactions with co-factors, cell-type-specific chromatin landscapes and 3D chromatin interactions. However, computational prediction and characterization of cell-type-specific and shared binding sites is rarely studied. Results: In this article, we propose two computational approaches for predicting and characterizing cell-type-specific and shared binding sites by integrating multiple types of features, one based on XGBoost and the other on a convolutional neural network (CNN). To validate the performance of our proposed approaches, ChIP-seq datasets of 10 binding factors were collected from the GM12878 (lymphoblastoid) and K562 (erythroleukemic) human hematopoietic cell lines, each of which was further categorized into cell-type-specific (GM12878- and K562-specific) and shared binding sites. Then, multiple types of features for these binding sites were integrated to train the XGBoost- and CNN-based models. Experimental results show that our proposed approaches significantly outperform other competing methods on three classification tasks. Moreover, we identified independent feature contributions for cell-type-specific and shared sites through SHAP values and explored the ability of the CNN-based model to predict cell-type-specific and shared binding sites by excluding or including DNase signals. Furthermore, we investigated the generalization ability of our proposed approaches to different binding factors in the same cellular environment. Availability and implementation: The source code is available at: https://github.com/turningpoint1988/CSSBS. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qinhu Zhang
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Pengrui Teng
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Siguo Wang
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Ying He
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Zhen Cui
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Zhenghao Guo
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Yixin Liu
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
- Changan Yuan
  - Big Data and Intelligent Computing Research Center, Guangxi Academy of Science, Nanning 530007, China
- Qi Liu
  - To whom correspondence should be addressed.
|
17
|
Yan W, Li Z, Pian C, Wu Y. PlantBind: an attention-based multi-label neural network for predicting plant transcription factor binding sites. Brief Bioinform 2022; 23:6713513. [PMID: 36155619 DOI: 10.1093/bib/bbac425] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/18/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 12/14/2022] Open
Abstract
Identification of transcription factor binding sites (TFBSs) is essential to understanding gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.
Affiliation(s)
- Zutan Li
  - Nanjing Agricultural University
- Cong Pian
  - College of Sciences at Nanjing Agricultural University
- Yufeng Wu
  - State Key Laboratory for Crop Genetics and Germplasm Enhancement, Bioinformatics Center, College of Agriculture, Academy for Advanced Interdisciplinary Studies at Nanjing Agricultural University
|
18
|
Rivière Q, Corso M, Ciortan M, Noël G, Verbruggen N, Defrance M. Exploiting Genomic Features to Improve the Prediction of Transcription Factor-Binding Sites in Plants. Plant Cell Physiol 2022; 63:1457-1473. [PMID: 35799371 DOI: 10.1093/pcp/pcac095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2021] [Revised: 06/07/2022] [Accepted: 07/06/2022] [Indexed: 06/15/2023]
Abstract
The identification of transcription factor (TF) target genes is central in biology. A popular approach is based on the location by pattern matching of potential cis-regulatory elements (CREs). During the last few years, tools integrating next-generation sequencing data have been developed to improve the performance of pattern matching. However, such tools have not yet been comprehensively evaluated in plants. Hence, we developed a new streamlined method aiming at predicting CREs and target genes of plant TFs in specific organs or conditions. Our approach implements a supervised machine learning strategy, which allows decision rule models to be learnt using TF ChIP-chip/seq experimental data. Different layers of genomic features were integrated in predictive models: the position on the gene, the DNA sequence conservation, the chromatin state and various CRE footprints. Among the tested features, the chromatin features were crucial for improving the accuracy of the method. Furthermore, we evaluated the transferability of predictive models across TFs, organs and species. Finally, we validated our method by correctly inferring the target genes of key TFs controlling metabolite biosynthesis at the organ level in Arabidopsis. We developed a tool, Wimtrap, to reproduce our approach in plant species and conditions/organs for which ChIP-chip/seq data are available. Wimtrap is a user-friendly R package that supports an R Shiny web interface and is provided with pre-built models that can be used to quickly get predictions of CREs and TF gene targets in different organs or conditions in Arabidopsis thaliana, Solanum lycopersicum, Oryza sativa and Zea mays.
Affiliation(s)
- Quentin Rivière
  - Brussels Bioengineering School, Laboratory of Plant Physiology and molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Massimiliano Corso
  - Brussels Bioengineering School, Laboratory of Plant Physiology and molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
  - INRAE, AgroParisTech, Institut Jean-Pierre Bourgin (IJPB), Université Paris-Saclay, Versailles 78000, France
- Madalina Ciortan
  - Interuniversity Institute of Bioinformatics in Brussels, Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium
- Grégoire Noël
  - Functional and Evolutionary Entomology, Gembloux Agro-Bio Tech, University of Liège, Passage des Déportés 2, Gembloux 5030, Belgium
- Nathalie Verbruggen
  - Brussels Bioengineering School, Laboratory of Plant Physiology and molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Matthieu Defrance
  - Interuniversity Institute of Bioinformatics in Brussels, Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium
|
19
|
McAfee JC, Bell JL, Krupa O, Matoba N, Stein JL, Won H. Focus on your locus with a massively parallel reporter assay. J Neurodev Disord 2022; 14:50. [PMID: 36085003 PMCID: PMC9463819 DOI: 10.1186/s11689-022-09461-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/19/2021] [Accepted: 09/01/2022] [Indexed: 01/01/2023] Open
Abstract
A growing number of variants associated with risk for neurodevelopmental disorders have been identified by genome-wide association and whole genome sequencing studies. As common risk variants often fall within large haplotype blocks covering long stretches of the noncoding genome, the causal variants within an associated locus are often unknown. Similarly, the effect of rare noncoding risk variants identified by whole genome sequencing on molecular traits is seldom known without functional assays. A massively parallel reporter assay (MPRA) is an assay that can functionally validate thousands of regulatory elements simultaneously using high-throughput sequencing and barcode technology. MPRA has been adapted to various experimental designs that measure gene regulatory effects of genetic variants within cis- and trans-regulatory elements as well as posttranscriptional processes. This review discusses different MPRA designs that have been or could be used in the future to experimentally validate genetic variants associated with neurodevelopmental disorders. Though MPRA has limitations, such as not modeling genomic context, this assay can help narrow down the underlying genetic causes of neurodevelopmental disorders by screening thousands of sequences in one experiment. We conclude by describing future directions of this technique such as applications of MPRA for gene-by-environment interactions and pharmacogenetics.
Affiliation(s)
- Jessica C. McAfee
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - UNC Neuroscience Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Jessica L. Bell
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - UNC Neuroscience Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Oleh Krupa
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - UNC Neuroscience Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Nana Matoba
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - UNC Neuroscience Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Jason L. Stein
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - UNC Neuroscience Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Hyejung Won
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - UNC Neuroscience Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
|
20
|
Lal A. Deciphering the regulatory syntax of genomic DNA with deep learning. J Biosci 2022. [DOI: 10.1007/s12038-022-00291-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
|
21
|
Yang MG, Ling E, Cowley CJ, Greenberg ME, Vierbuchen T. Characterization of sequence determinants of enhancer function using natural genetic variation. eLife 2022; 11:76500. [PMID: 36043696 PMCID: PMC9662815 DOI: 10.7554/elife.76500] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/18/2021] [Accepted: 08/30/2022] [Indexed: 02/04/2023] Open
Abstract
Sequence variation in enhancers that control cell-type-specific gene transcription contributes significantly to phenotypic variation within human populations. However, it remains difficult to predict precisely the effect of any given sequence variant on enhancer function due to the complexity of DNA sequence motifs that determine transcription factor (TF) binding to enhancers in their native genomic context. Using F1-hybrid cells derived from crosses between distantly related inbred strains of mice, we identified thousands of enhancers with allele-specific TF binding and/or activity. We find that genetic variants located within the central region of enhancers are most likely to alter TF binding and enhancer activity. We observe that the AP-1 family of TFs (Fos/Jun) are frequently required for binding of TEAD TFs and for enhancer function. However, many sequence variants outside of core motifs for AP-1 and TEAD also impact enhancer function, including sequences flanking core TF motifs and AP-1 half sites. Taken together, these data represent one of the most comprehensive assessments of allele-specific TF binding and enhancer function to date and reveal how sequence changes at enhancers alter their function across evolutionary timescales.
Affiliation(s)
- Marty G Yang
  - Department of Neurobiology, Harvard Medical School, Boston, United States
  - Program in Neuroscience, Harvard Medical School, Boston, United States
- Emi Ling
  - Department of Neurobiology, Harvard Medical School, Boston, United States
- Thomas Vierbuchen
  - Developmental Biology Program, Sloan Kettering Institute for Cancer Research, New York, United States
  - Center for Stem Cell Biology, Sloan Kettering Institute for Cancer Research, New York, United States
|
22
|
Yi R, Cho K, Bonneau R. NetTIME: a Multitask and Base-pair Resolution Framework for Improved Transcription Factor Binding Site Prediction. Bioinformatics 2022; 38:4762-4770. [PMID: 35997560 PMCID: PMC9563695 DOI: 10.1093/bioinformatics/btac569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Revised: 08/16/2022] [Accepted: 08/20/2022] [Indexed: 12/05/2022] Open
Abstract
Motivation: Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly more accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here, we propose NetTIME, a multitask learning framework for predicting cell-type-specific TF binding sites with base-pair resolution. Results: We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method’s predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings. Availability and implementation: NetTIME is freely available at https://github.com/ryi06/NetTIME and the code is also archived at https://doi.org/10.5281/zenodo.6994897. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ren Yi
  - Department of Computer Science, New York University, New York, NY, 10011, USA
- Kyunghyun Cho
  - Department of Computer Science, New York University, New York, NY, 10011, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
- Richard Bonneau
  - Department of Computer Science, New York University, New York, NY, 10011, USA
  - Center for Data Science, New York University, New York, NY, 10011, USA
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
|
23
|
Ng JWK, Ong EHQ, Tucker-Kellogg L, Tucker-Kellogg G. Deep learning for de-convolution of Smad2 versus Smad3 binding sites. BMC Genomics 2022; 23:525. [PMID: 35858839 PMCID: PMC9297549 DOI: 10.1186/s12864-022-08565-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 04/19/2022] [Indexed: 11/10/2022] Open
Abstract
Background: The transforming growth factor beta-1 (TGF β-1) cytokine exerts both pro-tumor and anti-tumor effects in carcinogenesis. An increasing body of literature suggests that TGF β-1 signaling outcome is partially dependent on the regulatory targets of downstream receptor-regulated Smad (R-Smad) proteins Smad2 and Smad3. However, the lack of Smad-specific antibodies for ChIP-seq hinders convenient identification of Smad-specific binding sites. Results: In this study, we use localization and affinity purification (LAP) tags to identify Smad-specific binding sites in a cancer cell line. Using ChIP-seq data obtained from LAP-tagged Smad proteins, we develop a convolutional neural network with long-short term memory (CNN-LSTM) as a deep learning approach to classify a pool of Smad-bound sites as being Smad2- or Smad3-bound. Our data showed that this approach is able to accurately classify Smad2- versus Smad3-bound sites. We use our model to dissect the role of each R-Smad in the progression of breast cancer using a previously published dataset. Conclusions: Our results suggest that deep learning approaches can be used to dissect binding site specificity of closely related transcription factors. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-022-08565-x.
Affiliation(s)
- Jeremy W K Ng
  - Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Esther H Q Ong
  - Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Lisa Tucker-Kellogg
  - Cancer and Stem Cell Biology, and Centre for Computational Biology, Duke-NUS Medical School, Singapore, Singapore
- Greg Tucker-Kellogg
  - Department of Biological Sciences, National University of Singapore, Singapore, Singapore
  - Computational Biology Programme, Faculty of Science, National University of Singapore, Singapore, Singapore
|
24
|
Hernandez-Corchado A, Najafabadi HS. Toward a base-resolution panorama of the in vivo impact of cytosine methylation on transcription factor binding. Genome Biol 2022; 23:151. [PMID: 35799193 PMCID: PMC9264634 DOI: 10.1186/s13059-022-02713-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/21/2021] [Accepted: 06/19/2022] [Indexed: 11/10/2022] Open
Abstract
Background: While methylation of CpG dinucleotides is traditionally considered antagonistic to the DNA-binding activity of most transcription factors (TFs), recent in vitro studies have revealed a more complex picture, suggesting that over a third of TFs may preferentially bind to methylated sequences. Expanding these in vitro observations to in vivo TF binding preferences is challenging since the effect of methylation of individual CpG sites cannot be easily isolated from the confounding effects of DNA accessibility and regional DNA methylation. Thus, in vivo methylation preferences of most TFs remain uncharacterized. Results: We introduce joint accessibility-methylation-sequence (JAMS) models, which connect the strength of the binding signal observed in ChIP-seq to the DNA accessibility of the binding site, regional methylation level, DNA sequence, and base-resolution cytosine methylation. We show that JAMS models quantitatively explain TF occupancy, recapitulate cell type-specific TF binding, and have high positive predictive value for identification of TFs affected by intra-motif methylation. Analysis of 2209 ChIP-seq experiments results in high-confidence JAMS models for 260 TFs, revealing a negative association between in vivo TF occupancy and intra-motif methylation for 45% of studied TFs, as well as 16 TFs that are predicted to bind to methylated sites, including 11 novel methyl-binding TFs mostly from the multi-zinc finger family. Conclusions: Our study substantially expands the repertoire of in vivo methyl-binding TFs, but also suggests that most TFs that prefer methylated CpGs in vitro present themselves as methylation agnostic in vivo, potentially due to the balancing effect of competition with other methyl-binding proteins. Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02713-y.
Affiliation(s)
- Aldo Hernandez-Corchado
  - Department of Human Genetics, McGill University, Montreal, QC, H3A 0C7, Canada
  - McGill Genome Centre, Montreal, QC, H3A 0G1, Canada
- Hamed S Najafabadi
  - Department of Human Genetics, McGill University, Montreal, QC, H3A 0C7, Canada
  - McGill Genome Centre, Montreal, QC, H3A 0G1, Canada
|
25
|
Karimzadeh M, Hoffman MM. Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome. Genome Biol 2022; 23:126. [PMID: 35681170 PMCID: PMC9185870 DOI: 10.1186/s13059-022-02690-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 02/04/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022] Open
Abstract
Existing methods for computational prediction of transcription factor (TF) binding sites evaluate genomic regions with similarity to known TF sequence preferences. Most TF binding sites, however, do not resemble known TF sequence motifs, and many TFs are not sequence-specific. We developed Virtual ChIP-seq, which predicts binding of individual TFs in new cell types, integrating learned associations with gene expression and binding, TF binding sites from other cell types, and chromatin accessibility data in the new cell type. This approach outperforms methods that predict TF binding solely based on sequence preference, predicting binding for 36 TFs (MCC>0.3).
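The threshold quoted in this abstract (MCC>0.3) refers to the Matthews correlation coefficient, a single score for binary classification that stays informative when bound sites are much rarer than unbound ones. The sketch below is only an illustration of that metric: the function name `matthews_corrcoef_binary` and the toy labels are assumptions made here and are not taken from Virtual ChIP-seq.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bound, predicted bound
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # unbound, predicted unbound
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unbound, predicted bound
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bound, predicted unbound
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example: four genomic bins, two truly bound, one of them recovered.
labels = [1, 1, 0, 0]
predictions = [1, 0, 0, 0]
score = matthews_corrcoef_binary(labels, predictions)
print(f"MCC = {score:.3f}; passes 0.3 threshold: {score > 0.3}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), so a cutoff of 0.3 selects TFs whose predictions are clearly better than chance.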
Affiliation(s)
- Mehran Karimzadeh
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, Toronto, ON, Canada
  - Vector Institute, Toronto, ON, Canada
- Michael M Hoffman
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, Toronto, ON, Canada
  - Vector Institute, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
|
26
|
Luo K, Zhong J, Safi A, Hong LK, Tewari AK, Song L, Reddy TE, Ma L, Crawford GE, Hartemink AJ. Profiling the quantitative occupancy of myriad transcription factors across conditions by modeling chromatin accessibility data. Genome Res 2022; 32:1183-1198. [PMID: 35609992 PMCID: PMC9248881 DOI: 10.1101/gr.272203.120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2020] [Accepted: 05/06/2022] [Indexed: 11/24/2022]
Abstract
Over a thousand different transcription factors (TFs) bind with varying occupancy across the human genome. Chromatin immunoprecipitation (ChIP) can assay occupancy genome-wide, but only one TF at a time, limiting our ability to comprehensively observe the TF occupancy landscape, let alone quantify how it changes across conditions. We developed TF occupancy profiler (TOP), a Bayesian hierarchical regression framework, to profile genome-wide quantitative occupancy of numerous TFs using data from a single chromatin accessibility experiment (DNase- or ATAC-seq). TOP is supervised, and its hierarchical structure allows it to predict the occupancy of any sequence-specific TF, even those never assayed with ChIP. We used TOP to profile the quantitative occupancy of hundreds of sequence-specific TFs at sites throughout the genome and examined how their occupancies changed in multiple contexts: in approximately 200 human cell types, through 12 h of exposure to different hormones, and across the genetic backgrounds of 70 individuals. TOP enables cost-effective exploration of quantitative changes in the landscape of TF binding.
Affiliation(s)
- Kaixuan Luo
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
  - Department of Human Genetics, The University of Chicago, Chicago, Illinois 60637, USA
- Jianling Zhong
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Alexias Safi
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Linda K Hong
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Alok K Tewari
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
- Lingyun Song
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Timothy E Reddy
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Biostatistics and Bioinformatics, Durham, North Carolina 27710, USA
  - Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina 27710, USA
  - Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA
- Li Ma
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA
- Gregory E Crawford
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
- Alexander J Hartemink
  - Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
  - Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
  - Department of Biology, Duke University, Durham, North Carolina 27708, USA
|
27
|
Jing F, Zhang SW, Zhang S. Prediction of the transcription factor binding sites with meta-learning. Methods 2022; 203:207-213. [DOI: 10.1016/j.ymeth.2022.04.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/11/2021] [Revised: 04/01/2022] [Accepted: 04/17/2022] [Indexed: 11/26/2022] Open
|
28
|
Sapoval N, Aghazadeh A, Nute MG, Antunes DA, Balaji A, Baraniuk R, Barberan CJ, Dannenfelser R, Dun C, Edrisi M, Elworth RAL, Kille B, Kyrillidis A, Nakhleh L, Wolfe CR, Yan Z, Yao V, Treangen TJ. Current progress and open challenges for applying deep learning across the biosciences. Nat Commun 2022; 13:1728. [PMID: 35365602 PMCID: PMC8976012 DOI: 10.1038/s41467-022-29268-7] [Citation(s) in RCA: 76] [Impact Index Per Article: 38.0] [Received: 08/25/2021] [Accepted: 03/09/2022] [Indexed: 11/19/2022] Open
Abstract
Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.
Affiliation(s)
- Nicolae Sapoval
  - Department of Computer Science, Rice University, Houston, TX, USA
- Amirali Aghazadeh
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Michael G Nute
  - Department of Computer Science, Rice University, Houston, TX, USA
- Dinler A Antunes
  - Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
- Advait Balaji
  - Department of Computer Science, Rice University, Houston, TX, USA
- Richard Baraniuk
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
- C J Barberan
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA
- Chen Dun
  - Department of Computer Science, Rice University, Houston, TX, USA
- R A Leo Elworth
  - Department of Computer Science, Rice University, Houston, TX, USA
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX, USA
- Luay Nakhleh
  - Department of Computer Science, Rice University, Houston, TX, USA
- Cameron R Wolfe
  - Department of Computer Science, Rice University, Houston, TX, USA
- Zhi Yan
  - Department of Computer Science, Rice University, Houston, TX, USA
- Vicky Yao
  - Department of Computer Science, Rice University, Houston, TX, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
  - Department of Bioengineering, Rice University, Houston, TX, USA
|
29
|
Heller IS, Guenther CA, Meireles AM, Talbot WS, Kingsley DM. Characterization of mouse Bmp5 regulatory injury element in zebrafish wound models. Bone 2022; 155:116263. [PMID: 34826632 PMCID: PMC9007314 DOI: 10.1016/j.bone.2021.116263] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/10/2021] [Revised: 11/17/2021] [Accepted: 11/18/2021] [Indexed: 11/21/2022]
Abstract
Many key signaling molecules used to build tissues during embryonic development are re-activated at injury sites to stimulate tissue regeneration and repair. Bone morphogenetic proteins provide a classic example, but the mechanisms that lead to reactivation of BMPs following injury are still unknown. Previous studies have mapped a large "injury response element" (IRE) in the mouse Bmp5 gene that drives gene expression following bone fractures and other types of injury. Here we show that the large mouse IRE region is also activated in both zebrafish tail resection and mechanosensory hair cell injury models. Using the ability to test multiple constructs and image temporal and spatial dynamics following injury responses, we have narrowed the original size of the mouse IRE region by over 100 fold and identified a small 142 bp minimal enhancer that is rapidly induced in both mesenchymal and epithelial tissues after injury. These studies identify a small sequence that responds to evolutionarily conserved local signals in wounded tissues and suggest candidate pathways that contribute to BMP reactivation after injury.
Affiliation(s)
- Ian S Heller
  - Department of Developmental Biology, Stanford University School of Medicine, United States of America
- Catherine A Guenther
  - Department of Developmental Biology, Stanford University School of Medicine, United States of America
  - Howard Hughes Medical Institute, Stanford University School of Medicine, United States of America
- Ana M Meireles
  - Department of Developmental Biology, Stanford University School of Medicine, United States of America
- William S Talbot
  - Department of Developmental Biology, Stanford University School of Medicine, United States of America
- David M Kingsley
  - Department of Developmental Biology, Stanford University School of Medicine, United States of America
  - Howard Hughes Medical Institute, Stanford University School of Medicine, United States of America
|
30
|
Erkes A, Mücke S, Reschke M, Boch J, Grau J. Epigenetic features improve TALE target prediction. BMC Genomics 2021; 22:914. [PMID: 34965853 PMCID: PMC8717664 DOI: 10.1186/s12864-021-08210-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2021] [Accepted: 11/25/2021] [Indexed: 11/20/2022] Open
Abstract
Background: The yield of many crop plants can be substantially reduced by plant-pathogenic Xanthomonas bacteria. The infection strategy of many Xanthomonas strains is based on transcription activator-like effectors (TALEs), which are secreted into the host cells and act as transcriptional activators of plant genes that are beneficial for the bacteria. The modular DNA binding domain of TALEs contains tandem repeats, each comprising two hyper-variable amino acids. These repeat-variable diresidues (RVDs) bind to their target box and determine the specificity of a TALE. All available tools for the prediction of TALE targets within the host plant suffer from many false positives. In this paper we propose a strategy to improve prediction accuracy by considering the epigenetic state of the host plant genome in the region of the target box. Results: To this end, we extend our previously published tool PrediTALE by considering two epigenetic features: (i) chromatin accessibility of potentially bound regions and (ii) DNA methylation of cytosines within target boxes. Here, we determine the epigenetic features from publicly available DNase-seq, ATAC-seq, and WGBS data in rice. We benchmark the utility of both epigenetic features separately and in combination, deriving ground truth from RNA-seq data of infection studies in rice. We find an improvement for each individual epigenetic feature, but especially for the combination of both. Having established an advantage in TALE target prediction when considering epigenetic features, we use these data for promoterome and genome-wide scans by our new tool EpiTALE, leading to several novel putative virulence targets. Conclusions: Our results suggest that it would be worthwhile to collect condition-specific chromatin accessibility data and methylation information when studying putative virulence targets of Xanthomonas TALEs. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-08210-z.
Affiliation(s)
- Annett Erkes
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
- Stefanie Mücke
  - Institute of Plant Genetics, Leibniz Universität Hannover, Hannover, Germany
- Maik Reschke
  - Institute of Plant Genetics, Leibniz Universität Hannover, Hannover, Germany
- Jens Boch
  - Institute of Plant Genetics, Leibniz Universität Hannover, Hannover, Germany
- Jan Grau
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
|
31
|
Constructing gene regulatory networks using epigenetic data. NPJ Syst Biol Appl 2021; 7:45. [PMID: 34887443 PMCID: PMC8660777 DOI: 10.1038/s41540-021-00208-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/30/2020] [Accepted: 11/01/2021] [Indexed: 12/24/2022] Open
Abstract
The biological processes that drive cellular function can be represented by a complex network of interactions between regulators (transcription factors) and their targets (genes). A cell's epigenetic state plays an important role in mediating these interactions, primarily by influencing chromatin accessibility. However, how to effectively use epigenetic data when constructing a gene regulatory network remains an open question. Almost all existing network reconstruction approaches focus on estimating transcription factor to gene connections using transcriptomic data. In contrast, computational approaches for analyzing epigenetic data generally focus on improving transcription factor binding site predictions rather than deducing regulatory network relationships. We bridged this gap by developing SPIDER, a network reconstruction approach that incorporates epigenetic data into a message-passing framework to estimate gene regulatory networks. We validated SPIDER's predictions using ChIP-seq data from ENCODE and found that SPIDER networks are both highly accurate and include cell-line-specific regulatory interactions. Notably, SPIDER can recover ChIP-seq verified transcription factor binding events in the regulatory regions of genes that do not have a corresponding sequence motif. The networks estimated by SPIDER have the potential to identify novel hypotheses that will allow us to better characterize cell-type and phenotype specific regulatory mechanisms.
|
32
|
Morrow A, Hughes J, Singh J, Joseph A, Yosef N. Epitome: predicting epigenetic events in novel cell types with multi-cell deep ensemble learning. Nucleic Acids Res 2021; 49:e110. [PMID: 34379786 PMCID: PMC8565335 DOI: 10.1093/nar/gkab676] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/11/2021] [Revised: 07/19/2021] [Accepted: 07/25/2021] [Indexed: 01/04/2023] Open
Abstract
The accumulation of large epigenomics data consortiums provides us with the opportunity to extrapolate existing knowledge to new cell types and conditions. We propose Epitome, a deep neural network that learns similarities of chromatin accessibility between well characterized reference cell types and a query cellular context, and copies over signal of transcription factor binding and modification of histones from reference cell types when chromatin profiles are similar to the query. Epitome achieves state-of-the-art accuracy when predicting transcription factor binding sites on novel cellular contexts and can further improve predictions as more epigenetic signals are collected from both reference cell types and the query cellular context of interest.
Affiliation(s)
- Alyssa Kramer Morrow
  - Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- John Weston Hughes
  - Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
  - Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305, USA
- Jahnavi Singh
  - Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Anthony Douglas Joseph
  - Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
  - Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
  - Unite Genomics, Inc., 1301 Marina Village Pkwy, Suite 320, Alameda, CA 94501, USA
- Nir Yosef
  - Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
  - Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
  - Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard University, Boston, MA, 02139, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA
|
33
|
Wang H, Huang B, Wang J. Predict long-range enhancer regulation based on protein-protein interactions between transcription factors. Nucleic Acids Res 2021; 49:10347-10368. [PMID: 34570239 PMCID: PMC8501976 DOI: 10.1093/nar/gkab841] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/20/2021] [Revised: 08/10/2021] [Accepted: 09/10/2021] [Indexed: 12/18/2022] Open
Abstract
Long-range regulation by distal enhancers plays critical roles in cell-type specific transcriptional programs. Computational predictions of genome-wide enhancer-promoter interactions are still challenging due to limited accuracy and the lack of knowledge on the molecular mechanisms. Based on recent biological investigations, the protein-protein interactions (PPIs) between transcription factors (TFs) have been found to participate in the regulation of chromatin loops. Therefore, we developed a novel predictive model for cell-type specific enhancer-promoter interactions by leveraging the information of TF PPI signatures. Evaluated by a series of rigorous performance comparisons, the new model achieves superior performance over other methods. The model also identifies specific TF PPIs that may mediate long-range regulatory interactions, revealing new mechanistic understandings of enhancer regulation. The prioritized TF PPIs are associated with genes in distinct biological pathways, and the predicted enhancer-promoter interactions are strongly enriched with cis-eQTLs. Most interestingly, the model discovers enhancer-mediated trans-regulatory links between TFs and genes, which are significantly enriched with trans-eQTLs. The new predictive model, along with the genome-wide analyses, provides a platform to systematically delineate the complex interplay among TFs, enhancers and genes in long-range regulation. The novel predictions also lead to mechanistic interpretations of eQTLs to decode the genetic associations with gene expression.
Affiliation(s)
- Hao Wang
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, 428 S. Shaw Ln., East Lansing, MI 48824, USA
- Binbin Huang
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, 428 S. Shaw Ln., East Lansing, MI 48824, USA
- Jianrong Wang
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, 428 S. Shaw Ln., East Lansing, MI 48824, USA
|
34
|
Xu Q, Georgiou G, Frölich S, van der Sande M, Veenstra G, Zhou H, van Heeringen S. ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination. Nucleic Acids Res 2021; 49:7966-7985. [PMID: 34244796 PMCID: PMC8373078 DOI: 10.1093/nar/gkab598] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 11/04/2020] [Revised: 06/02/2021] [Accepted: 06/28/2021] [Indexed: 12/21/2022] Open
Abstract
Proper cell fate determination is largely orchestrated by complex gene regulatory networks centered around transcription factors. However, experimental elucidation of key transcription factors that drive cellular identity is currently often intractable. Here, we present ANANSE (ANalysis Algorithm for Networks Specified by Enhancers), a network-based method that exploits enhancer-encoded regulatory information to identify the key transcription factors in cell fate determination. As cell type-specific transcription factors predominantly bind to enhancers, we use regulatory networks based on enhancer properties to prioritize transcription factors. First, we predict genome-wide binding profiles of transcription factors in various cell types using enhancer activity and transcription factor binding motifs. Subsequently, applying these inferred binding profiles, we construct cell type-specific gene regulatory networks, and then predict key transcription factors controlling cell fate transitions using differential networks between cell types. This method outperforms existing approaches in correctly predicting major transcription factors previously identified to be sufficient for trans-differentiation. Finally, we apply ANANSE to define an atlas of key transcription factors in 18 normal human tissues. In conclusion, we present a ready-to-implement computational tool for efficient prediction of transcription factors in cell fate determination and to study transcription factor-mediated regulatory mechanisms. ANANSE is freely available at https://github.com/vanheeringen-lab/ANANSE.
Affiliation(s)
- Quan Xu
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
- Georgios Georgiou
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
- Siebren Frölich
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
- Maarten van der Sande
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
- Gert Jan C Veenstra
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
- Huiqing Zhou
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
  - Radboud University Medical Center, Department of Human Genetics, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
- Simon J van Heeringen
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands

35
Shu H, Zhou J, Lian Q, Li H, Zhao D, Zeng J, Ma J. Modeling gene regulatory networks using neural network architectures. Nat Comput Sci 2021; 1:491-501. [PMID: 38217125 DOI: 10.1038/s43588-021-00099-8] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 03/25/2021] [Accepted: 06/15/2021] [Indexed: 01/15/2024]
Abstract
Gene regulatory networks (GRNs) encode the complex molecular interactions that govern cell identity. Here we propose DeepSEM, a deep generative model that can jointly infer GRNs and biologically meaningful representation of single-cell RNA sequencing (scRNA-seq) data. In particular, we developed a neural network version of the structural equation model (SEM) to explicitly model the regulatory relationships among genes. Benchmark results show that DeepSEM achieves comparable or better performance on a variety of single-cell computational tasks, such as GRN inference, scRNA-seq data visualization, clustering and simulation, compared with the state-of-the-art methods. In addition, the gene regulations predicted by DeepSEM on cell-type marker genes in the mouse cortex can be validated by epigenetic data, which further demonstrates the accuracy and efficiency of our method. DeepSEM can provide a useful and powerful tool to analyze scRNA-seq data and infer a GRN.
Affiliation(s)
- Hantao Shu
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jingtian Zhou
  - Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA, USA
- Qiuyu Lian
  - UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China
  - Department of Automation, Shanghai Jiao Tong University, Shanghai, China
- Han Li
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Dan Zhao
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jianyang Zeng
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jianzhu Ma
  - Institute for Artificial Intelligence, Peking University, Beijing, China

36
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Filip Buric
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Mariia Kokina
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Victor Garcia
  - School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Aleksej Zelezniak
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Science for Life Laboratory, Stockholm, Sweden

37
Schreiber J, Singh R. Machine learning for profile prediction in genomics. Curr Opin Chem Biol 2021; 65:35-41. [PMID: 34107341 DOI: 10.1016/j.cbpa.2021.04.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Revised: 04/21/2021] [Accepted: 04/24/2021] [Indexed: 02/08/2023]
Abstract
A recent deluge of publicly available multi-omics data has fueled the development of machine learning methods aimed at investigating important questions in genomics. Although the motivations for these methods vary, a task that is commonly adopted is that of profile prediction, where predictions are made for one or more forms of biochemical activity along the genome, for example, histone modification, chromatin accessibility, or protein binding. In this review, we give an overview of the research works performing profile prediction, define two broad categories of profile prediction tasks, and discuss the types of scientific questions that can be answered in each.
Affiliation(s)
- Ritambhara Singh
  - Department of Computer Science, Center for Computational Molecular Biology, Brown University, United States

38
Meyer P, Saez-Rodriguez J. Advances in systems biology modeling: 10 years of crowdsourcing DREAM challenges. Cell Syst 2021; 12:636-653. [PMID: 34139170 DOI: 10.1016/j.cels.2021.05.015] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/17/2020] [Revised: 03/29/2021] [Accepted: 05/18/2021] [Indexed: 02/07/2023]
Abstract
Computational and mathematical models are key to obtaining a system-level understanding of biological processes, but their limitations have to be clearly defined to allow their proper application and interpretation. Crowdsourced benchmarks in the form of challenges provide an unbiased assessment of methods, and for the past decade, the Dialogue for Reverse Engineering Assessment and Methods (DREAM) has organized more than 15 systems biology challenges. Their scope is broad: from transcription factor binding to dynamical network models, from signaling networks to gene regulation, from whole-cell models to cell-lineage reconstruction, and from single-cell positioning in a tissue to drug combinations and cell survival. To celebrate the 5-year anniversary of Cell Systems, we review the genesis of these systems biology challenges and discuss how interlocking the forward- and reverse-modeling paradigms helps push the frontier of systems biology. This approach will persist for systems-level approaches in biology and medicine.
Affiliation(s)
- Pablo Meyer
  - IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
- Julio Saez-Rodriguez
  - Institute for Computational Biomedicine, Heidelberg University Hospital and Heidelberg University, Faculty of Medicine, Bioquant, Heidelberg 69120, Germany

39
Patel N, Bush WS. Modeling transcriptional regulation using gene regulatory networks based on multi-omics data sources. BMC Bioinformatics 2021; 22:200. [PMID: 33874910 PMCID: PMC8056605 DOI: 10.1186/s12859-021-04126-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/20/2020] [Accepted: 04/09/2021] [Indexed: 11/17/2022] Open
Abstract
Background: Transcriptional regulation is complex, requiring multiple cis (local) and trans acting mechanisms working in concert to drive gene expression, with disruption of these processes linked to multiple diseases. Previous computational attempts to understand the influence of regulatory mechanisms on gene expression have used prediction models containing input features derived from cis regulatory factors. However, local chromatin looping and trans-acting mechanisms are known to also influence transcriptional regulation, and their inclusion may improve model accuracy and interpretation. In this study, we create a general model of transcription factor influence on gene expression by incorporating both cis and trans gene regulatory features.
Results: We describe a computational framework to model gene expression for GM12878 and K562 cell lines. This framework weights the impact of transcription factor-based regulatory data using multi-omics gene regulatory networks to account for both cis and trans acting mechanisms, and measures of the local chromatin context. These prediction models perform significantly better compared to models containing cis-regulatory features alone. Models that additionally integrate long distance chromatin interactions (or chromatin looping) between distal transcription factor binding regions and gene promoters also show improved accuracy. As a demonstration of their utility, effect estimates from these models were used to weight cis-regulatory rare variants for sequence kernel association test analyses of gene expression.
Conclusions: Our models generate refined effect estimates for the influence of individual transcription factors on gene expression, allowing characterization of their roles across the genome. This work also provides a framework for integrating multiple data types into a single model of transcriptional regulation.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04126-3.
Affiliation(s)
- Neel Patel
  - Department of Nutrition, Case Western Reserve University, Cleveland, OH, USA
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
- William S Bush
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA

40
Li H, Guan Y. Fast decoding cell type-specific transcription factor binding landscape at single-nucleotide resolution. Genome Res 2021; 31:721-731. [PMID: 33741685 PMCID: PMC8015851 DOI: 10.1101/gr.269613.120] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/30/2020] [Accepted: 02/17/2021] [Indexed: 01/22/2023]
Abstract
Decoding the cell type-specific transcription factor (TF) binding landscape at single-nucleotide resolution is crucial for understanding the regulatory mechanisms underlying many fundamental biological processes and human diseases. However, limits on time and resources restrict the high-resolution experimental measurements of TF binding profiles of all possible TF-cell type combinations. Previous computational approaches either cannot distinguish the cell context-dependent TF binding profiles across diverse cell types or can only provide a relatively low-resolution prediction. Here we present a novel deep learning approach, Leopard, for predicting TF binding sites at single-nucleotide resolution, achieving the average area under receiver operating characteristic curve (AUROC) of 0.982 and the average area under precision recall curve (AUPRC) of 0.208. Our method substantially outperformed the state-of-the-art methods Anchor and FactorNet, improving the predictive AUPRC by 19% and 27%, respectively, when evaluated at 200-bp resolution. Meanwhile, by leveraging a many-to-many neural network architecture, Leopard features a hundredfold to thousandfold speedup compared with current many-to-one machine learning methods.
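The gap between the two metrics reported above (AUROC 0.982 versus AUPRC 0.208) can arise when bound positions are a very small fraction of all positions. The following is a minimal sketch in plain Python, using made-up toy scores that are not taken from the paper, of how a ranking can reach a high AUROC while average precision (one common estimate of AUPRC) stays low. The function names and the data are illustrative assumptions, not Leopard's evaluation code.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy data: 998 unbound positions with evenly spread scores,
# plus 2 bound positions that score high but not at the very top.
neg_scores = [k / 1000 for k in range(998)]
pos_scores = [0.9905, 0.9705]
labels = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUROC = {roc:.3f}, average precision = {ap:.3f}")
```

With only 2 positives among 1,000 positions, a handful of higher-scoring negatives barely changes AUROC but sharply lowers precision, which is why both metrics are usually reported together for imbalanced binding-site prediction.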
Affiliation(s)
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA

41
Nakato R, Sakata T. Methods for ChIP-seq analysis: A practical workflow and advanced applications. Methods 2021; 187:44-53. [PMID: 32240773 DOI: 10.1016/j.ymeth.2020.03.005] [Citation(s) in RCA: 90] [Impact Index Per Article: 30.0] [Received: 02/18/2020] [Revised: 03/17/2020] [Accepted: 03/18/2020] [Indexed: 12/13/2022] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a central method in epigenomic research. Genome-wide analysis of histone modifications, such as enhancer analysis and genome-wide chromatin state annotation, enables systematic analysis of how the epigenomic landscape contributes to cell identity, development, lineage specification, and disease. In this review, we first present a typical ChIP-seq analysis workflow, from quality assessment to chromatin-state annotation. We focus on practical, rather than theoretical, approaches for biological studies. Next, we outline various advanced ChIP-seq applications and introduce several state-of-the-art methods, including prediction of gene expression level and chromatin loops from epigenome data and data imputation. Finally, we discuss recently developed single-cell ChIP-seq analysis methodologies that elucidate the cellular diversity within complex tissues and cancers.
Affiliation(s)
- Ryuichiro Nakato
  - Laboratory of Computational Genomics, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan
- Toyonori Sakata
  - Laboratory of Genome Structure and Function, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan

42
Integrative analysis identifies bHLH transcription factors as contributors to Parkinson's disease risk mechanisms. Sci Rep 2021; 11:3502. [PMID: 33568722 PMCID: PMC7875985 DOI: 10.1038/s41598-021-83087-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2019] [Accepted: 01/26/2021] [Indexed: 11/08/2022] Open
Abstract
Genome-wide association studies (GWAS) have identified multiple genetic risk signals for Parkinson’s disease (PD); however, translation into underlying biological mechanisms remains scarce. Genomic functional annotations of neurons provide new resources that may be integrated into analyses of GWAS findings. Altered transcription factor binding plays an important role in human diseases. Insight into transcriptional networks involved in PD risk mechanisms may thus improve our understanding of pathogenesis. We analysed overlap between genome-wide association signals in PD and open chromatin in neurons across multiple brain regions, finding a significant enrichment in the superior temporal cortex. The involvement of transcriptional networks was explored in neurons of the superior temporal cortex based on the location of candidate transcription factor motifs identified by two de novo motif discovery methods. Analyses were performed in parallel, both finding that PD risk variants significantly overlap with open chromatin regions harboring motifs of basic Helix-Loop-Helix (bHLH) transcription factors. Our findings show that cortical neurons are likely mediators of genetic risk for PD. The concentration of PD risk variants at sites of open chromatin targeted by members of the bHLH transcription factor family points to an involvement of these transcriptional networks in PD risk mechanisms.

43
Xu L, Zu T, Li T, Li M, Mi J, Bai F, Liu G, Wen J, Li H, Brakebusch C, Wang X, Wu X. ATF3 downmodulates its new targets IFI6 and IFI27 to suppress the growth and migration of tongue squamous cell carcinoma cells. PLoS Genet 2021; 17:e1009283. [PMID: 33539340 PMCID: PMC7888615 DOI: 10.1371/journal.pgen.1009283] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 04/15/2020] [Revised: 02/17/2021] [Accepted: 11/18/2020] [Indexed: 01/16/2023] Open
Abstract
Activating transcription factor 3 (ATF3) is a key transcription factor involved in regulating cellular stress responses, with different expression levels and functions in different tissues. ATF3 has also been shown to play crucial roles in regulating tumor development and progression; however, its potential role in oral squamous cell carcinomas has not been fully explored. In this study, we examined biopsies of tongue squamous cell carcinomas (TSCCs) and found that the nuclear expression level of ATF3 correlated negatively with the differentiation status of TSCCs, which was validated by analysis of the ATGC database. By using gain- or loss-of-function analyses of ATF3 in four different TSCC cell lines, we demonstrated that ATF3 negatively regulates the growth and migration of human TSCC cells in vitro. RNA-seq analysis identified two new downstream targets of ATF3, interferon alpha inducible proteins 6 (IFI6) and 27 (IFI27), which were upregulated in ATF3-deleted cells and were downregulated in ATF3-overexpressing cells. Chromatin immunoprecipitation assays showed that ATF3 binds the promoter regions of the IFI6 and IFI27 genes. Both IFI6 and IFI27 were highly expressed in TSCC biopsies and knockdown of either IFI6 or IFI27 in TSCC cells blocked the cell growth and migration induced by the deletion of ATF3. Conversely, overexpression of either IFI6 or IFI27 counteracted the inhibition of TSCC cell growth and migration induced by the overexpression of ATF3. Finally, an in vivo study in mice confirmed those in vitro findings. Our study suggests that ATF3 plays an anti-tumor function in TSCCs through the negative regulation of its downstream targets, IFI6 and IFI27.
Activating transcription factor 3 (ATF3), a stress response gene, has been shown to play either tumor promoting or tumor suppressing functions depending on the type of tumor cell and the stromal context. Here we discovered that ATF3 plays an anti-tumor role in tongue squamous cell carcinoma (TSCC) cells through the transcriptional suppression of its new downstream targets interferon alpha inducible proteins 6 (IFI6) and 27 (IFI27). This finding contributes to understanding how ATF3, a transcriptional repressor, can target specific downstream genes in different tumor cells to play anti-tumor or pro-tumor functions. A thorough understanding of ATF3 functions and its downstream signaling pathways provides a potential approach to develop new therapeutics for the treatment of tumors such as TSCCs.
Affiliation(s)
- Lin Xu
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China
  - Department of Oral and Maxillofacial Surgery, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration & Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Shandong, China
  - Department of Orthodontics, Liaocheng People’s Hospital, Liaocheng, Shandong, China
  - Precision Biomedical Key Laboratory, Liaocheng People’s Hospital, Liaocheng, Shandong, China
- Tingjian Zu
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China
  - School of Stomatology, Shandong First Medical University & Shandong Academy of Medical Sciences, Tai’an, Shandong, China
- Tao Li
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China
  - Department of Oral and Maxillofacial Surgery, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration & Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Shandong, China
- Min Li
  - Precision Biomedical Key Laboratory, Liaocheng People’s Hospital, Liaocheng, Shandong, China
- Jun Mi
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China
- Fuxiang Bai
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China
- Guanyi Liu
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China
- Jie Wen
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China
- Hui Li
  - Department of Hematology, Southwest Hospital, Third Military Medical University, Chongqing, China
- Cord Brakebusch
  - Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Ole Maaløes Vej 5, Copenhagen, Denmark
- Xuxia Wang
  - Department of Oral and Maxillofacial Surgery, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration & Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Shandong, China
- Xunwei Wu
  - Department of Tissue Engineering and Regeneration, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong University & Shandong Key Laboratory of Oral Tissue Regeneration and Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Jinan, Shandong, China

44
Chen C, Hou J, Shi X, Yang H, Birchler JA, Cheng J. DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. BMC Bioinformatics 2021; 22:38. [PMID: 33522898 PMCID: PMC7852092 DOI: 10.1186/s12859-020-03952-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/09/2020] [Accepted: 12/29/2020] [Indexed: 12/21/2022] Open
Abstract
Background: Due to the complexity of biological systems, the prediction of potential DNA binding sites for transcription factors remains a difficult problem in computational biology. Genomic DNA sequences and experimental results from parallel sequencing provide information about the affinity and accessibility of the genome and are commonly used features in binding site prediction. The attention mechanism in deep learning has shown its capability to learn long-range dependencies from sequential data, such as sentences and voices. Until now, no study has applied this approach to binding site inference from massively parallel sequencing data. The successful applications of the attention mechanism in similar input contexts motivate us to build and test new methods that can accurately determine the binding sites of transcription factors.
Results: In this study, we propose a novel tool (named DeepGRN) for transcription factor binding site prediction based on the combination of two components: a single attention module and a pairwise attention module. The performance of our methods is evaluated on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge datasets. The results show that DeepGRN achieves higher unified scores in 6 of 13 targets than any of the top four methods in the DREAM challenge. We also demonstrate that the attention weights learned by the model are correlated with potential informative inputs, such as DNase-Seq coverage and motifs, which provide possible explanations for the predictive improvements in DeepGRN.
Conclusions: DeepGRN can automatically and effectively predict transcription factor binding sites from DNA sequences and DNase-Seq coverage. Furthermore, the visualization techniques we developed for the attention modules help to interpret how critical patterns from different types of input features are recognized by our model.
Affiliation(s)
- Chen Chen
  - Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA
- Jie Hou
  - Department of Computer Science, Saint Louis University, St. Louis, MO, 63103, USA
- Xiaowen Shi
  - Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
- Hua Yang
  - Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
- James A Birchler
  - Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
- Jianlin Cheng
  - Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA

45
Zhou M, Li H, Wang X, Guan Y. Evidence of widespread, independent sequence signature for transcription factor cobinding. Genome Res 2021; 31:265-278. [PMID: 33303494 PMCID: PMC7849410 DOI: 10.1101/gr.267310.120] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/12/2020] [Accepted: 12/03/2020] [Indexed: 01/03/2023]
Abstract
Transcription factors (TFs) are the vocabulary that genomes use to regulate gene expression and phenotypes. The interactions among TFs enrich this vocabulary and orchestrate diverse biological processes. Although simple models identify open chromatin and the presence of TF motifs as the two major contributors to TF binding patterns, it remains elusive what contributes to the in vivo TF cobinding landscape. In this study, we developed a machine learning algorithm to explore the contributors of the cobinding patterns. The algorithm substantially outperforms the state-of-the-field models for TF cobinding prediction. Game theory-based feature importance analysis reveals that, for most of the TF pairs we studied, independent motif sequences contribute one or more of the two TFs under investigation to their cobinding patterns. Such independent motif sequences include, but are not limited to, transcription initiation-related proteins and known TF complexes. We found the motif sequence signatures and the TFs are rarely mutual, corroborating a hierarchical and directional organization of the regulatory network and refuting the possibility of artifacts caused by shared sequence similarity with the TFs under investigation. We modeled such regulatory language with directed graphs, which reveal shared, global factors that are related to many binding and cobinding patterns.
Collapse
Affiliation(s)
- Manqi Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Hongyang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Xueqing Wang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
46
Srivastava D, Aydin B, Mazzoni EO, Mahony S. An interpretable bimodal neural network characterizes the sequence and preexisting chromatin predictors of induced transcription factor binding. Genome Biol 2021; 22:20. [PMID: 33413545 PMCID: PMC7788824 DOI: 10.1186/s13059-020-02218-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/18/2019] [Accepted: 12/03/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Transcription factor (TF) binding specificity is determined via a complex interplay between the transcription factor's DNA binding preference and cell type-specific chromatin environments. The chromatin features that correlate with transcription factor binding in a given cell type have been well characterized. For instance, the binding sites for a majority of transcription factors display concurrent chromatin accessibility. However, concurrent chromatin features reflect the binding activities of the transcription factor itself and thus provide limited insight into how genome-wide TF-DNA binding patterns became established in the first place. To understand the determinants of transcription factor binding specificity, we therefore need to examine how newly activated transcription factors interact with sequence and preexisting chromatin landscapes. RESULTS Here, we investigate the sequence and preexisting chromatin predictors of TF-DNA binding by examining the genome-wide occupancy of transcription factors that have been induced in well-characterized chromatin environments. We develop Bichrom, a bimodal neural network that jointly models sequence and preexisting chromatin data to interpret the genome-wide binding patterns of induced transcription factors. We find that the preexisting chromatin landscape is a differential global predictor of TF-DNA binding; incorporating preexisting chromatin features improves our ability to explain the binding specificity of some transcription factors substantially, but not others. Furthermore, by analyzing site-level predictors, we show that transcription factor binding in previously inaccessible chromatin tends to correspond to the presence of more favorable cognate DNA sequences. CONCLUSIONS Bichrom thus provides a framework for modeling, interpreting, and visualizing the joint sequence and chromatin landscapes that determine TF-DNA binding dynamics.
Affiliation(s)
- Divyanshi Srivastava
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, USA
- Begüm Aydin
- Department of Biology, New York University, New York, NY, USA
- Shaun Mahony
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, USA.
47
Endometriosis Is Associated with a Significant Increase in hTERC and Altered Telomere/Telomerase Associated Genes in the Eutopic Endometrium, an Ex-Vivo and In Silico Study. Biomedicines 2020; 8:biomedicines8120588. [PMID: 33317189 PMCID: PMC7764055 DOI: 10.3390/biomedicines8120588] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/30/2020] [Revised: 12/02/2020] [Accepted: 12/03/2020] [Indexed: 12/13/2022] Open
Abstract
Telomeres protect chromosomal ends and they are maintained by the specialised enzyme, telomerase. Endometriosis is a common gynaecological disease and high telomerase activity and higher hTERT levels associated with longer endometrial telomere lengths are characteristics of eutopic secretory endometrial aberrations of women with endometriosis. Our ex-vivo study examined the levels of hTERC and DKC1 RNA and dyskerin protein levels in the endometrium from healthy women and those with endometriosis (n = 117). The in silico study examined endometriosis-specific telomere- and telomerase-associated gene (TTAG) transcriptional aberrations of secretory phase eutopic endometrium utilising publicly available microarray datasets. Eutopic secretory endometrial hTERC levels were significantly increased in women with endometriosis compared to healthy endometrium, yet dyskerin mRNA and protein levels were unperturbed. Our in silico study identified 10 TTAGs (CDKN2A, PML, ZNHIT2, UBE3A, MCCC2, HSPC159, FGFR2, PIK3C2A, RALGAPA1, and HNRNPA2B1) to be altered in mid-secretory endometrium of women with endometriosis. High levels of hTERC and the identified other TTAGs might be part of the established alteration in the eutopic endometrial telomerase biology in women with endometriosis in the secretory phase of the endometrium and our data informs future research to unravel the fundamental involvement of telomerase in the pathogenesis of endometriosis.
48
López-Rivera F, Foster Rhoades OK, Vincent BJ, Pym ECG, Bragdon MDJ, Estrada J, DePace AH, Wunderlich Z. A Mutation in the Drosophila melanogaster eve Stripe 2 Minimal Enhancer Is Buffered by Flanking Sequences. G3 (BETHESDA, MD.) 2020; 10:4473-4482. [PMID: 33037064 PMCID: PMC7718739 DOI: 10.1534/g3.120.401777] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/01/2020] [Accepted: 10/01/2020] [Indexed: 01/18/2023]
Abstract
Enhancers are DNA sequences composed of transcription factor binding sites that drive complex patterns of gene expression in space and time. Until recently, studying enhancers in their genomic context was technically challenging. Therefore, minimal enhancers, the shortest pieces of DNA that can drive an expression pattern that resembles a gene's endogenous pattern, are often used to study features of enhancer function. However, evidence suggests that some enhancers require sequences outside the minimal enhancer to maintain function under environmental perturbations. We hypothesized that these additional sequences also prevent misexpression caused by a transcription factor binding site mutation within a minimal enhancer. Using the Drosophila melanogaster even-skipped stripe 2 enhancer as a case study, we tested the effect of a Giant binding site mutation (gt-2) on the expression patterns driven by minimal and extended enhancer reporter constructs. We found that, in contrast to the misexpression caused by the gt-2 binding site deletion in the minimal enhancer, the same gt-2 binding site deletion in the extended enhancer did not have an effect on expression. The buffering of expression levels, but not expression pattern, is partially explained by an additional Giant binding site outside the minimal enhancer. Deleting the gt-2 binding site in the endogenous locus had no significant effect on stripe 2 expression. Our results indicate that rules derived from mutating enhancer reporter constructs may not represent what occurs in the endogenous context.
Affiliation(s)
- Francheska López-Rivera
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- GSAS Research Scholar Initiative, Harvard University, Cambridge, MA 02138
- Ben J Vincent
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- Edward C G Pym
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- Javier Estrada
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- Angela H DePace
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115
- Zeba Wunderlich
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115
49
Martin PC, Zabet NR. Dissecting the binding mechanisms of transcription factors to DNA using a statistical thermodynamics framework. Comput Struct Biotechnol J 2020; 18:3590-3605. [PMID: 33304457 PMCID: PMC7708957 DOI: 10.1016/j.csbj.2020.11.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/18/2020] [Revised: 11/02/2020] [Accepted: 11/04/2020] [Indexed: 01/22/2023] Open
Abstract
Transcription factors (TFs) bind to DNA and control the activity of target genes. Here, we present ChIPanalyser, a user-friendly, versatile and powerful R/Bioconductor package for predicting and modelling the binding of TFs to DNA. ChIPanalyser performs similarly to state-of-the-art tools, but is an explainable model and provides biological insights into the binding mechanisms of TFs. We focused on investigating the binding mechanisms of three TFs that are known architectural proteins, namely CTCF, BEAF-32 and su(Hw), in three Drosophila cell lines (BG3, Kc167 and S2). While CTCF preferentially binds only to a subset of high affinity sites located mainly in open chromatin, BEAF-32 binds to most of its high affinity binding sites available in open chromatin. In contrast, su(Hw) binds to both open and partially closed chromatin. Most importantly, differences in TF binding profiles between cell lines for these TFs are mainly driven by differences in DNA accessibility and not by differences in TF concentrations between cell lines. Finally, we investigated the binding of Hox TFs in Drosophila and found that Ubx binds only in open chromatin, while Abd-B and Dfd are capable of binding in both open and partially closed chromatin. Overall, our results show that TFs display different binding mechanisms and that our model is able to recapitulate their specific binding behaviour.
Affiliation(s)
- Patrick C.N. Martin
- School of Life Sciences, University of Essex, Colchester CO4 3SQ, UK
- Biotech Research and Innovation Centre (BRIC), University of Copenhagen, DK-2200 Copenhagen, Denmark
- Nicolae Radu Zabet
- School of Life Sciences, University of Essex, Colchester CO4 3SQ, UK
- Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London E1 2AT, UK
50
Nameki R, Chang H, Reddy J, Corona RI, Lawrenson K. Transcription factors in epithelial ovarian cancer: histotype-specific drivers and novel therapeutic targets. Pharmacol Ther 2020; 220:107722. [PMID: 33137377 DOI: 10.1016/j.pharmthera.2020.107722] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/24/2020] [Accepted: 10/26/2020] [Indexed: 02/06/2023]
Abstract
Transcription factors (TFs) are major contributors to cancer risk and somatic development. In preclinical and clinical studies, direct or indirect inhibition of TF-mediated oncogenic gene expression profiles has proven to be effective in many tumor types, highlighting this group of proteins as valuable therapeutic targets. In spite of this, our understanding of TFs in epithelial ovarian cancer (EOC) is relatively limited. EOC is a heterogeneous disease composed of five major histologic subtypes: high-grade serous, low-grade serous, endometrioid, clear cell and mucinous. Each histology is associated with unique clinical etiologies, sensitivity to therapies, and molecular signatures, including diverse transcriptional regulatory programs. While some TFs are shared across EOC subtypes, a set of TFs is expressed in a histotype-specific manner and likely explains part of the histologic diversity of EOC subtypes. Targeting TFs presents unique opportunities for the development of novel precision medicine strategies for ovarian cancer. This article reviews the critical TFs in EOC subtypes and highlights the potential of exploiting TFs as biomarkers and therapeutic targets.
Affiliation(s)
- Robbin Nameki
- Women's Cancer Research Program at the Samuel Oschin Comprehensive Cancer Center, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Heidi Chang
- Women's Cancer Research Program at the Samuel Oschin Comprehensive Cancer Center, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Jessica Reddy
- Women's Cancer Research Program at the Samuel Oschin Comprehensive Cancer Center, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Rosario I Corona
- Women's Cancer Research Program at the Samuel Oschin Comprehensive Cancer Center, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Kate Lawrenson
- Women's Cancer Research Program at the Samuel Oschin Comprehensive Cancer Center, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Center for Bioinformatics and Functional Genomics, Cedars-Sinai Medical Center, Los Angeles, CA, USA.
|