1
Baumgarten N, Rumpf L, Kessler T, Schulz MH. A statistical approach for identifying single nucleotide variants that affect transcription factor binding. iScience 2024; 27:109765. [PMID: 38736546 PMCID: PMC11088338 DOI: 10.1016/j.isci.2024.109765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 01/30/2024] [Accepted: 04/15/2024] [Indexed: 05/14/2024] Open
Abstract
Non-coding variants located within regulatory elements may alter gene expression by modifying transcription factor (TF) binding sites, thereby leading to functional consequences. Different TF models are being used to assess the effect of DNA sequence variants, such as single nucleotide variants (SNVs). Existing methods are often slow and do not assess the statistical significance of results. We investigated the distribution of absolute maximal differential TF binding scores for general computational models that affect TF binding. We find that a modified Laplace distribution can adequately approximate the empirical distributions. A benchmark on in vitro and in vivo datasets showed that our approach improves upon an existing method in terms of performance and speed. Applications on eQTLs and on a genome-wide association study illustrate the usefulness of our statistics by highlighting cell type-specific regulators and target genes. An implementation of our approach is freely available on GitHub and as a Bioconda package.
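As a rough illustration of the kind of statistic this abstract describes, the following Python sketch scores a reference and an alternative allele with a toy log-odds matrix, takes the maximal absolute score difference over all windows overlapping the SNV, and converts it to a tail probability under a plain Laplace distribution. The matrix values, the function names, the scale parameter, the forward-strand-only scan, and the use of an unmodified Laplace distribution are all assumptions made for illustration; this is not the authors' implementation or their modified distribution.

```python
import math

# Toy log-odds matrix (hypothetical values), one dict per motif position.
PWM = [
    {"A": 1.2, "C": -0.8, "G": -1.0, "T": -0.5},
    {"A": -1.0, "C": 1.5, "G": -0.7, "T": -0.9},
    {"A": -0.6, "C": -1.1, "G": 1.4, "T": -0.3},
]


def window_score(seq, start):
    """Sum the log-odds of the bases in the window beginning at `start`."""
    return sum(PWM[i][seq[start + i]] for i in range(len(PWM)))


def max_abs_diff(ref_seq, alt_seq, snv_pos):
    """Maximal |score(ref) - score(alt)| over windows that overlap the SNV."""
    m = len(PWM)
    best = 0.0
    for start in range(snv_pos - m + 1, snv_pos + 1):
        # Skip windows that would run past either end of the sequence.
        if start < 0 or start + m > len(ref_seq):
            continue
        diff = abs(window_score(ref_seq, start) - window_score(alt_seq, start))
        best = max(best, diff)
    return best


def laplace_tail(d, scale):
    """P(|X| >= d) for X ~ Laplace(0, scale), assuming d >= 0."""
    return math.exp(-d / scale)


ref = "TTACGTT"
alt = "TTAGGTT"  # C -> G substitution at index 3
d = max_abs_diff(ref, alt, snv_pos=3)
p = laplace_tail(d, scale=1.0)  # scale chosen arbitrarily for the example
print(f"max |diff| = {d:.2f}, tail probability = {p:.4f}")
```

In practice the scale would have to be estimated per TF model, and both strands would normally be scanned; the sketch leaves both out to stay short.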
Affiliation(s)
- Nina Baumgarten
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
- Laura Rumpf
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
- Thorsten Kessler
- German Heart Centre Munich, Department of Cardiology, School of Medicine and Health, Technical University of Munich, 80636 Munich, Germany
- German Centre for Cardiovascular Research, Partner Site Munich Heart Alliance, 80636 Munich, Germany
- Marcel H. Schulz
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
2
Lavezzo GM, Lauretto MDS, Andrioli LPM, Machado-Lima A. Position Weight Matrix or Acyclic Probabilistic Finite Automaton: Which model to use? A decision rule inferred for the prediction of transcription factor binding sites. Genet Mol Biol 2024; 46:e20230048. [PMID: 38285430 PMCID: PMC10945726 DOI: 10.1590/1678-4685-gmb-2023-0048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 10/18/2023] [Indexed: 01/30/2024] Open
Abstract
Prediction of transcription factor binding sites (TFBS) is an example of application of Bioinformatics where DNA molecules are represented as sequences of A, C, G and T symbols. The most used model in this problem is Position Weight Matrix (PWM). Notwithstanding the advantage of being simple, PWMs cannot capture dependency between nucleotide positions, which may affect prediction performance. Acyclic Probabilistic Finite Automata (APFA) is an alternative model able to accommodate position dependencies. However, APFA is a more complex model, which means more parameters have to be learned. In this paper, we propose an innovative method to identify when position dependencies influence preference for PWMs or APFAs. This implied using position dependency features extracted from 1106 sets of TFBS to infer a decision tree able to predict which is the best model - PWM or APFA - for a given set of TFBSs. According to our results, as few as three pinpointed features are able to choose the best model, providing a balance of performance (average precision) and model simplicity.
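The limitation noted in this abstract, that a PWM treats positions independently, can be seen in a small Python sketch. The two-base "binding sites" and the function names below are invented for illustration, and the joint-frequency model is only a stand-in for a dependency-aware model; it is not an APFA.

```python
from collections import Counter

# Hypothetical aligned sites in which the two positions are perfectly coupled.
sites = ["AA", "TT", "AA", "TT"]


def pwm_prob(seq, sites):
    """Probability of `seq` under a position-independent (PWM-style) model."""
    n = len(sites)
    prob = 1.0
    for i, base in enumerate(seq):
        # Per-position base frequency, multiplied across positions.
        prob *= sum(1 for s in sites if s[i] == base) / n
    return prob


def joint_prob(seq, sites):
    """Probability of `seq` as its empirical frequency over whole sites."""
    return Counter(sites)[seq] / len(sites)


for candidate in ("AA", "AT"):
    print(candidate, pwm_prob(candidate, sites), joint_prob(candidate, sites))
```

The independent model assigns "AT" the same probability as "AA" even though "AT" never occurs among the sites, which is the kind of error a dependency-aware model avoids at the cost of more parameters.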
Affiliation(s)
- Guilherme Miura Lavezzo
- Universidade de São Paulo, Instituto de Matemática e Estatística, Programa Interunidades de Pós-Graduação em Bioinformática, São Paulo, SP, Brazil
- Ariane Machado-Lima
- Universidade de São Paulo, Escola de Artes, Ciências e Humanidades, São Paulo, SP, Brazil
3
Vishnevsky OV, Bocharnikov AV, Ignatieva EV. Peak Scores Significantly Depend on the Relationships between Contextual Signals in ChIP-Seq Peaks. Int J Mol Sci 2024; 25:1011. [PMID: 38256085 PMCID: PMC10816497 DOI: 10.3390/ijms25021011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 12/13/2023] [Accepted: 01/09/2024] [Indexed: 01/24/2024] Open
Abstract
Chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq) is a central genome-wide method for in vivo analyses of DNA-protein interactions in various cellular conditions. Numerous studies have demonstrated the complex contextual organization of ChIP-seq peak sequences and the presence of binding sites for transcription factors in them. We assessed the dependence of the ChIP-seq peak score on the presence of different contextual signals in the peak sequences by analyzing these sequences from several ChIP-seq experiments using our fully enumerative GPU-based de novo motif discovery method, Argo_CUDA. Analysis revealed sets of significant IUPAC motifs corresponding to the binding sites of the target and partner transcription factors. For these ChIP-seq experiments, multiple regression models were constructed, demonstrating a significant dependence of the peak scores on the presence in the peak sequences of not only highly significant target motifs but also less significant motifs corresponding to the binding sites of the partner transcription factors. A significant correlation was shown between the presence of the target motifs FOXA2 and the partner motifs HNF4G, which found experimental confirmation in the scientific literature, demonstrating the important contribution of the partner transcription factors to the binding of the target transcription factor to DNA and, consequently, their important contribution to the peak score.
Affiliation(s)
- Oleg V. Vishnevsky
- Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
- Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
- Andrey V. Bocharnikov
- Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
- Elena V. Ignatieva
- Institute of Cytology and Genetics, 630090 Novosibirsk, Russia
- Department of Natural Science, Novosibirsk State University, 630090 Novosibirsk, Russia
4
Augustijn HE, Roseboom AM, Medema MH, van Wezel GP. Harnessing regulatory networks in Actinobacteria for natural product discovery. J Ind Microbiol Biotechnol 2024; 51:kuae011. [PMID: 38569653 PMCID: PMC10996143 DOI: 10.1093/jimb/kuae011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Accepted: 04/02/2024] [Indexed: 04/05/2024]
Abstract
Microbes typically live in complex habitats where they need to rapidly adapt to continuously changing growth conditions. To do so, they produce an astonishing array of natural products with diverse structures and functions. Actinobacteria stand out for their prolific production of bioactive molecules, including antibiotics, anticancer agents, antifungals, and immunosuppressants. Attention has been directed especially towards the identification of the compounds they produce and the mining of the large diversity of biosynthetic gene clusters (BGCs) in their genomes. However, the current return on investment in random screening for bioactive compounds is low, while it is hard to predict which of the millions of BGCs should be prioritized. Moreover, many of the BGCs for yet undiscovered natural products are silent or cryptic under laboratory growth conditions. To identify ways to prioritize and activate these BGCs, knowledge regarding the way their expression is controlled is crucial. Intricate regulatory networks control global gene expression in Actinobacteria, governed by a staggering number of up to 1000 transcription factors per strain. This review highlights recent advances in experimental and computational methods for characterizing and predicting transcription factor binding sites and their applications to guide natural product discovery. We propose that regulation-guided genome mining approaches will open new avenues toward eliciting the expression of BGCs, as well as prioritizing subsets of BGCs for expression using synthetic biology approaches. ONE-SENTENCE SUMMARY This review provides insights into advances in experimental and computational methods aimed at predicting transcription factor binding sites and their applications to guide natural product discovery.
Affiliation(s)
- Hannah E Augustijn
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Anna M Roseboom
- Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Marnix H Medema
- Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
- Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Gilles P van Wezel
- Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Netherlands Institute for Ecology (NIOO-KNAW), Wageningen, The Netherlands
5
Vorontsov IE, Eliseeva IA, Zinkevich A, Nikonov M, Abramov S, Boytsov A, Kamenets V, Kasianova A, Kolmykov S, Yevshin I, Favorov A, Medvedeva YA, Jolma A, Kolpakov F, Makeev V, Kulakovskiy I. HOCOMOCO in 2024: a rebuild of the curated collection of binding models for human and mouse transcription factors. Nucleic Acids Res 2024; 52:D154-D163. [PMID: 37971293 PMCID: PMC10767914 DOI: 10.1093/nar/gkad1077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 10/17/2023] [Accepted: 10/26/2023] [Indexed: 11/19/2023] Open
Abstract
We present a major update of the HOCOMOCO collection that provides DNA binding specificity patterns of 949 human transcription factors and 720 mouse orthologs. To make this release, we performed motif discovery in peak sets that originated from 14 183 ChIP-Seq experiments and reads from 2554 HT-SELEX experiments yielding more than 400 thousand candidate motifs. The candidate motifs were annotated according to their similarity to known motifs and the hierarchy of DNA-binding domains of the respective transcription factors. Next, the motifs underwent human expert curation to stratify distinct motif subtypes and remove non-informative patterns and common artifacts. Finally, the curated subset of 100 thousand motifs was supplied to the automated benchmarking to select the best-performing motifs for each transcription factor. The resulting HOCOMOCO v12 core collection contains 1443 verified position weight matrices, including distinct subtypes of DNA binding motifs for particular transcription factors. In addition to the core collection, HOCOMOCO v12 provides motif sets optimized for the recognition of binding sites in vivo and in vitro, and for annotation of regulatory sequence variants. HOCOMOCO is available at https://hocomoco12.autosome.org and https://hocomoco.autosome.org.
Affiliation(s)
- Ilya E Vorontsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Irina A Eliseeva
- Institute of Protein Research, Russian Academy of Sciences, 142290 Pushchino, Russia
- Arsenii Zinkevich
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991 Moscow, Russia
- Mikhail Nikonov
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991 Moscow, Russia
- Sergey Abramov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Altius Institute for Biomedical Sciences, 98121 Seattle, WA, USA
- Alexandr Boytsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Altius Institute for Biomedical Sciences, 98121 Seattle, WA, USA
- Vasily Kamenets
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Moscow Institute of Physics and Technology, 141700 Dolgoprudny, Russia
- Institute of Biochemistry and Genetics of the Ufa Federal Research Centre of the Russian Academy of Sciences, 450054 Ufa, Russia
- Alexandra Kasianova
- Skolkovo Institute of Science and Technology, 121205 Moscow, Russia
- Institute for Information Transmission Problems of the Russian Academy of Sciences, 127051 Moscow, Russia
- Semyon Kolmykov
- Department of Computational Biology, Sirius University of Science and Technology, 354340 Sirius, Krasnodar region, Russia
- Alexander Favorov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Yulia A Medvedeva
- Research Center of Biotechnology RAS, Russian Academy of Sciences, 119071 Moscow, Russia
- Arttu Jolma
- Donnelly Centre, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Fedor Kolpakov
- Department of Computational Biology, Sirius University of Science and Technology, 354340 Sirius, Krasnodar region, Russia
- Bioinformatics Laboratory, Federal Research Center for Information and Computational Technologies, 630090 Novosibirsk, Russia
- Vsevolod J Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Moscow Institute of Physics and Technology, 141700 Dolgoprudny, Russia
- Institute of Biochemistry and Genetics of the Ufa Federal Research Centre of the Russian Academy of Sciences, 450054 Ufa, Russia
- Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, 142290 Pushchino, Russia
- Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, 420008 Kazan, Russia
6
Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R, Castro-Mondragon J, Ferenc K, Kumar V, Lemma RB, Lucas J, Chèneby J, Baranasic D, Khan A, Fornes O, Gundersen S, Johansen M, Hovig E, Lenhard B, Sandelin A, Wasserman W, Parcy F, Mathelier A. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res 2024; 52:D174-D182. [PMID: 37962376 PMCID: PMC10767809 DOI: 10.1093/nar/gkad1059] [Citation(s) in RCA: 65] [Impact Index Per Article: 65.0] [Received: 09/15/2023] [Revised: 10/20/2023] [Accepted: 10/31/2023] [Indexed: 11/15/2023] Open
Abstract
JASPAR (https://jaspar.elixir.no/) is a widely-used open-access database presenting manually curated high-quality and non-redundant DNA-binding profiles for transcription factors (TFs) across taxa. In this 10th release and 20th-anniversary update, the CORE collection has expanded with 329 new profiles. We updated three existing profiles and provided orthogonal support for 72 profiles from the previous release's UNVALIDATED collection. Altogether, the JASPAR 2024 update provides a 20% increase in CORE profiles from the previous release. A trimming algorithm enhanced profiles by removing low information content flanking base pairs, which were likely uninformative (within the capacity of the PFM models) for TFBS predictions and modelling TF-DNA interactions. This release includes enhanced metadata, featuring a refined classification for plant TFs' structural DNA-binding domains. The new JASPAR collections prompt updates to the genomic tracks of predicted TF binding sites (TFBSs) in 8 organisms, with human and mouse tracks available as native tracks in the UCSC Genome browser. All data are available through the JASPAR web interface and programmatically through its API and the updated Bioconductor and pyJASPAR packages. Finally, a new TFBS extraction tool enables users to retrieve predicted JASPAR TFBSs intersecting their genomic regions of interest.
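The trimming of low-information flanks mentioned in this abstract can be pictured with a short Python sketch. It computes the per-position information content of a position frequency matrix (PFM) against a uniform background and drops leading and trailing columns below a cutoff. The 0.5-bit threshold, the function names, and the toy counts are assumptions for illustration only; this is not JASPAR's actual trimming algorithm.

```python
import math


def information_content(column):
    """Information content in bits of one PFM column (counts for A, C, G, T),
    relative to a uniform background: 2 + sum(p * log2(p))."""
    total = sum(column)
    ic = 2.0
    for count in column:
        p = count / total
        if p > 0:  # treat 0 * log2(0) as 0
            ic += p * math.log2(p)
    return ic


def trim_flanks(pfm, min_ic=0.5):
    """Remove leading and trailing columns whose information content is
    below `min_ic`; internal low-information columns are kept."""
    keep = [i for i, col in enumerate(pfm) if information_content(col) >= min_ic]
    if not keep:
        return []
    return pfm[keep[0]: keep[-1] + 1]


# Toy PFM, one row per motif position, counts ordered A, C, G, T.
pfm = [
    [25, 25, 25, 25],  # uninformative flank
    [90, 5, 3, 2],
    [30, 20, 30, 20],  # weak internal position, retained
    [0, 100, 0, 0],
    [26, 24, 25, 25],  # uninformative flank
]
trimmed = trim_flanks(pfm)
print(len(pfm), "->", len(trimmed), "positions")
```

Only the flanks are removed here, so a weak position inside the motif does not split the profile.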
Affiliation(s)
- Ieva Rauluseviciute
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Rafael Riudavets-Puig
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Romain Blanc-Mathieu
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des martyrs, F-38054, Grenoble, France
- Jaime A Castro-Mondragon
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Katalin Ferenc
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Vipin Kumar
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Roza Berhanu Lemma
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Jérémy Lucas
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des martyrs, F-38054, Grenoble, France
- Jeanne Chèneby
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Damir Baranasic
- MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK
- Division of Electronics, Ruđer Bošković Institute, Bijenička cesta, 10000 Zagreb, Croatia
- Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
- Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
- Sveinung Gundersen
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Morten Johansen
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Eivind Hovig
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Department of Tumor Biology, Institute for Cancer Research, Oslo University Hospital, 0424 Oslo, Norway
- Boris Lenhard
- MRC London Institute of Medical Sciences, Du Cane Road, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK
- Albin Sandelin
- Department of Biology and Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaløes Vej 5, DK2200 Copenhagen N, Denmark
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
- François Parcy
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des martyrs, F-38054, Grenoble, France
- Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Center for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Department of Medical Genetics, Institute of Clinical Medicine, University of Oslo and Oslo University Hospital, Oslo, Norway
7
Proft S, Leiz J, Heinemann U, Seelow D, Schmidt-Ott KM, Rutkiewicz M. Discovery of a non-canonical GRHL1 binding site using deep convolutional and recurrent neural networks. BMC Genomics 2023; 24:736. [PMID: 38049725 PMCID: PMC10696883 DOI: 10.1186/s12864-023-09830-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 11/22/2023] [Indexed: 12/06/2023] Open
Abstract
BACKGROUND Transcription factors regulate gene expression by binding to transcription factor binding sites (TFBSs). Most models for predicting TFBSs are based on position weight matrices (PWMs), which require a specific motif to be present in the DNA sequence and do not consider interdependencies of nucleotides. Novel approaches such as Transcription Factor Flexible Models or recurrent neural networks consequently provide higher accuracies. However, it is unclear whether such approaches can uncover novel non-canonical, hitherto unexpected TFBSs relevant to human transcriptional regulation. RESULTS In this study, we trained a convolutional recurrent neural network with HT-SELEX data for GRHL1 binding and applied it to a set of GRHL1 binding sites obtained from ChIP-Seq experiments from human cells. We identified 46 non-canonical GRHL1 binding sites, which were not found by a conventional PWM approach. Unexpectedly, some of the newly predicted binding sequences lacked the CNNG core motif, so far considered obligatory for GRHL1 binding. Using isothermal titration calorimetry, we experimentally confirmed binding between the GRHL1-DNA binding domain and predicted GRHL1 binding sites, including a non-canonical GRHL1 binding site. Mutagenesis of individual nucleotides revealed a correlation between predicted binding strength and experimentally validated binding affinity across representative sequences. This correlation was neither observed with a PWM-based nor another deep learning approach. CONCLUSIONS Our results show that convolutional recurrent neural networks may uncover unanticipated binding sites and facilitate quantitative transcription factor binding predictions.
Affiliation(s)
- Sebastian Proft
- Exploratory Diagnostic Sciences, Berlin Institute of Health, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
- Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353, Berlin, Germany
- Janna Leiz
- Department of Nephrology and Hypertension, Hannover Medical School, 30625, Hannover, Germany
- Department of Nephrology and Intensive Care Medicine, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 12203, Berlin, Germany
- Molecular and Translational Kidney Research, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Udo Heinemann
- Macromolecular Structure and Interaction, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Dominik Seelow
- Exploratory Diagnostic Sciences, Berlin Institute of Health, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
- Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353, Berlin, Germany
- Kai M Schmidt-Ott
- Department of Nephrology and Hypertension, Hannover Medical School, 30625, Hannover, Germany
- Department of Nephrology and Intensive Care Medicine, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 12203, Berlin, Germany
- Molecular and Translational Kidney Research, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Maria Rutkiewicz
- Macromolecular Structure and Interaction, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Department of Structural Biology of Eukaryotes, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznań, 61-704, Poland
8
Nikumbh S, Lenhard B. Identifying promoter sequence architectures via a chunking-based algorithm using non-negative matrix factorisation. PLoS Comput Biol 2023; 19:e1011491. [PMID: 37983292 PMCID: PMC10695386 DOI: 10.1371/journal.pcbi.1011491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Revised: 12/04/2023] [Accepted: 09/05/2023] [Indexed: 11/22/2023] Open
Abstract
Core promoters are stretches of DNA at the beginning of genes that contain information that facilitates the binding of transcription initiation complexes. Different functional subsets of genes have core promoters with distinct architectures and characteristic motifs. Some of these motifs inform the selection of transcription start sites (TSS). By discovering motifs with fixed distances from known TSS positions, we could in principle classify promoters into different functional groups. Due to the variability and overlap of architectures, promoter classification is a difficult task that requires new approaches. In this study, we present a new method based on non-negative matrix factorisation (NMF) and the associated software called seqArchR that clusters promoter sequences based on their motifs at near-fixed distances from a reference point, such as TSS. When combined with experimental data from CAGE, seqArchR can efficiently identify TSS-directing motifs, including known ones like TATA, DPE, and nucleosome positioning signal, as well as novel lineage-specific motifs and the function of genes associated with them. By using seqArchR on developmental time courses, we reveal how relative use of promoter architectures changes over time with stage-specific expression. seqArchR is a powerful tool for initial genome-wide classification and functional characterisation of promoters. Its use cases are more general: it can also be used to discover any motifs at near-fixed distances from a reference point, even if they are present in only a small subset of sequences.
Affiliation(s)
- Sarvesh Nikumbh
- Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London, United Kingdom
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, London, United Kingdom
- Boris Lenhard
- Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London, United Kingdom
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, London, United Kingdom
9
Grau J, Schmidt F, Schulz MH. Widespread effects of DNA methylation and intra-motif dependencies revealed by novel transcription factor binding models. Nucleic Acids Res 2023; 51:e95. [PMID: 37650641 PMCID: PMC10570048 DOI: 10.1093/nar/gkad693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/20/2023] [Accepted: 08/10/2023] [Indexed: 09/01/2023] Open
Abstract
Several studies have suggested that transcription factor (TF) binding to DNA may be impaired or enhanced by DNA methylation. We present MeDeMo, a toolbox for TF motif analysis that combines information about DNA methylation with models capturing intra-motif dependencies. In a large-scale study using ChIP-seq data for 335 TFs, we identify novel TFs that show a binding behaviour associated with DNA methylation. Overall, we find that the presence of CpG methylation decreases the likelihood of binding for the majority of methylation-associated TFs. For a considerable subset of TFs, we show that intra-motif dependencies are pivotal for accurately modelling the impact of DNA methylation on TF binding. We illustrate that the novel methylation-aware TF binding models allow the prediction of differential ChIP-seq peaks and improve the genome-wide analysis of TF binding. Our work indicates that simplistic models that neglect the effect of DNA methylation on DNA binding may lead to systematic underperformance for methylation-associated TFs.
Affiliation(s)
- Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle 06120, Germany
- Florian Schmidt
- Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
- Systems Biology and Data Analytics, Genome Institute of Singapore, Singapore 13862, Singapore
- ImmunoScape Pte Ltd, Singapore 228208, Singapore
- Marcel H Schulz
- Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
- German Center for Cardiovascular Research, Partner site Rhein-Main, 60590 Frankfurt am Main, Germany
- Cardio-Pulmonary Institute, Goethe University, Frankfurt am Main, Germany
10
Karlebach G, Steinhaus R, Danis D, Devoucoux M, Anczuków O, Sheynkman G, Seelow D, Robinson PN. Alternative splicing is coupled to gene expression in a subset of variably expressed genes. bioRxiv 2023:2023.06.13.544742. [PMID: 37398049 PMCID: PMC10312658 DOI: 10.1101/2023.06.13.544742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Numerous factors regulate alternative splicing of human genes at a co-transcriptional level. However, how alternative splicing depends on the regulation of gene expression is poorly understood. We leveraged data from the Genotype-Tissue Expression (GTEx) project to show a significant association of gene expression and splicing for 6874 (4.9%) of 141,043 exons in 1106 (13.3%) of 8314 genes with substantially variable expression in ten GTEx tissues. About half of these exons demonstrate higher inclusion with higher gene expression, and half demonstrate higher exclusion, with the observed direction of coupling being highly consistent across different tissues and in external datasets. The exons differ with respect to sequence characteristics, enriched sequence motifs, RNA polymerase II binding, and inferred transcription rate of downstream introns. The exons were enriched for hundreds of isoform-specific Gene Ontology annotations, suggesting that the coupling of expression and alternative splicing described here may provide an important gene regulatory mechanism that might be used in a variety of biological contexts. In particular, higher inclusion exons could play an important role during cell division.
Affiliation(s)
- Guy Karlebach
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Robin Steinhaus
- Exploratory Diagnostic Sciences, Berlin Institute of Health, 10117 Berlin, Germany
- Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 10117 Berlin, Germany
- Daniel Danis
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Maeva Devoucoux
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Olga Anczuków
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Department of Genetics and Genome Sciences, UConn Health, Farmington, CT 06032, USA
- Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, USA
- Gloria Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia School of Medicine, Charlottesville, VA 22903, USA
- Dominik Seelow
- Exploratory Diagnostic Sciences, Berlin Institute of Health, 10117 Berlin, Germany
- Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 10117 Berlin, Germany
- Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Department of Molecular Physiology and Biological Physics, University of Virginia School of Medicine, Charlottesville, VA 22903, USA
11
Martin-Geary AC, Blakes AJM, Dawes R, Findlay SD, Lord J, Walker S, Talbot-Martin J, Wieder N, D’Souza EN, Fernandes M, Hilton S, Lahiri N, Campbell C, Jenkinson S, DeGoede CGEL, Anderson ER, Burge CB, Sanders SJ, Ellingford J, Baralle D, Banka S, Whiffin N. Systematic identification of disease-causing promoter and untranslated region variants in 8,040 undiagnosed individuals with rare disease. medRxiv 2023:2023.09.12.23295416. [PMID: 37745552 PMCID: PMC10516070 DOI: 10.1101/2023.09.12.23295416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Background: Both promoters and untranslated regions (UTRs) have critical regulatory roles, yet variants in these regions are largely excluded from clinical genetic testing due to difficulty in interpreting pathogenicity. The extent to which these regions may harbour diagnoses for individuals with rare disease is currently unknown. Methods: We present a framework for the identification and annotation of potentially deleterious proximal promoter and UTR variants in known dominant disease genes. We use this framework to annotate de novo variants (DNVs) in 8,040 undiagnosed individuals in the Genomics England 100,000 genomes project, which were subject to strict region-based filtering, clinical review, and validation studies where possible. In addition, we performed region and variant annotation-based burden testing in 7,862 unrelated probands against matched unaffected controls. Results: We prioritised eleven DNVs and identified an additional variant overlapping one of the eleven. Ten of these twelve variants (82%) are in genes that are a strong match to the individual's phenotype and six had not previously been identified. Through burden testing, we did not observe a significant enrichment of potentially deleterious promoter and/or UTR variants in individuals with rare disease collectively across any of our region or variant annotations. Conclusions: Overall, we demonstrate the value of screening promoters and UTRs to uncover additional diagnoses for previously undiagnosed individuals with rare disease and provide a framework for doing so without dramatically increasing interpretation burden.
Affiliation(s)
- Alexandra C Martin-Geary
- Big Data Institute, University of Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, UK
- Alexander J M Blakes
- Manchester Centre for Genomic Medicine, Division of Evolution and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
- Ruebena Dawes
- Big Data Institute, University of Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, UK
- Scott D Findlay
- Department of Biology, Massachusetts Institute of Technology, Cambridge, USA
- Nechama Wieder
- Big Data Institute, University of Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, UK
- Elston N D’Souza
- Big Data Institute, University of Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, UK
- Maria Fernandes
- Big Data Institute, University of Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, UK
- Sarah Hilton
- Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Health Innovation Manchester, Manchester M13 9WL, UK
- Nayana Lahiri
- St George’s, University of London & St George’s University Hospitals NHS Foundation Trust, Institute of Molecular and Clinical Sciences, London, SW17 0QT, UK
- Christopher Campbell
- Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Health Innovation Manchester, Manchester M13 9WL, UK
- Sarah Jenkinson
- Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Health Innovation Manchester, Manchester M13 9WL, UK
- Christian G E L DeGoede
- Department of Paediatric Neurology, Clinical Research Facility, Lancashire Teaching Hospitals NHS Trust
- Manchester Metropolitan University
- Emily R Anderson
- Liverpool Centre for Genomic Medicine, Liverpool Women’s Hospital, Liverpool, UK
- Stephan J Sanders
- Institute of Developmental and Regenerative Medicine, Department of Paediatrics, University of Oxford, Oxford, OX3 7TY, UK
- Department of Psychiatry and Behavioral Sciences, UCSF Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- New York Genome Center, New York, NY, USA
- Jamie Ellingford
- Manchester Centre for Genomic Medicine, Division of Evolution and Genomic Sciences, School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
- Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Health Innovation Manchester, Manchester M13 9WL, UK
- Diana Baralle
- School of Human Development and Health, Faculty of Medicine, University of Southampton, Southampton, United Kingdom
- Siddharth Banka
- Manchester Centre for Genomic Medicine, Manchester University NHS Foundation Trust, Health Innovation Manchester, Manchester M13 9WL, UK
- Nicola Whiffin
- Big Data Institute, University of Oxford, UK
- Wellcome Centre for Human Genetics, University of Oxford, UK
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
12
Monti R, Ohler U. Toward Identification of Functional Sequences and Variants in Noncoding DNA. Annu Rev Biomed Data Sci 2023; 6:191-210. [PMID: 37262323 DOI: 10.1146/annurev-biodatasci-122120-110102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Understanding the noncoding part of the genome, which encodes gene regulation, is necessary to identify genetic mechanisms of disease and translate findings from genome-wide association studies into actionable results for treatments and personalized care. Here we provide an overview of the computational analysis of noncoding regions, starting from gene-regulatory mechanisms and their representation in data. Deep learning methods, when applied to these data, highlight important regulatory sequence elements and predict the functional effects of genetic variants. These and other algorithms are used to predict damaging sequence variants. Finally, we introduce rare-variant association tests that incorporate functional annotations and predictions in order to increase interpretability and statistical power.
Affiliation(s)
- Remo Monti
- Max Delbrück Center for Molecular Medicine (MDC), Helmholtz Association of German Research Centers, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany;
- Digital Health-Machine Learning, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Uwe Ohler
- Max Delbrück Center for Molecular Medicine (MDC), Helmholtz Association of German Research Centers, Berlin Institute for Medical Systems Biology (BIMSB), Berlin, Germany;
13
Dresch JM, Conrad RD, Klonaros D, Drewell RA. Investigating the sequence landscape in the Drosophila initiator core promoter element using an enhanced MARZ algorithm. PeerJ 2023; 11:e15597. [PMID: 37366427 PMCID: PMC10290830 DOI: 10.7717/peerj.15597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 05/29/2023] [Indexed: 06/28/2023] Open
Abstract
The core promoter elements are important DNA sequences for the regulation of RNA polymerase II transcription in eukaryotic cells. Despite the broad evolutionary conservation of these elements, there is extensive variation in the nucleotide composition of the actual sequences. In this study, we aim to improve our understanding of the complexity of this sequence variation in the TATA box and initiator core promoter elements in Drosophila melanogaster. Using computational approaches, including an enhanced version of our previously developed MARZ algorithm that utilizes gapped nucleotide matrices, several sequence landscape features are uncovered, including an interdependency between the nucleotides in position 2 and 5 in the initiator. Incorporating this information in an expanded MARZ algorithm improves predictive performance for the identification of the initiator element. Overall our results demonstrate the need to carefully consider detailed sequence composition features in core promoter elements in order to make more robust and accurate bioinformatic predictions.
14
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
- Computer Science Department, University of Verona, Verona, Italy
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
- Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
15
Alexandari AM, Horton CA, Shrikumar A, Shah N, Li E, Weilert M, Pufall MA, Zeitlinger J, Fordyce PM, Kundaje A. De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding. bioRxiv 2023:2023.05.11.540401. [PMID: 37214836 PMCID: PMC10197627 DOI: 10.1101/2023.05.11.540401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Transcription factors (TF) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de-novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. 
This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.
Affiliation(s)
- Amr M. Alexandari
- Department of Computer Science, Stanford University, Stanford, CA 94305
- Avanti Shrikumar
- Department of Earth System Science, Stanford University, Stanford, CA 94305
- Nilay Shah
- Stowers Institute for Medical Research, Kansas City, MO, USA
- Eileen Li
- Department of Genetics, Stanford University, Stanford, CA 94305
- Melanie Weilert
- Stowers Institute for Medical Research, Kansas City, MO, USA
- Miles A. Pufall
- Department of Biochemistry, Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242, USA
- Julia Zeitlinger
- Stowers Institute for Medical Research, Kansas City, MO, USA
- The University of Kansas Medical Center, Kansas City, KS, USA
- Polly M. Fordyce
- Department of Genetics, Stanford University, Stanford, CA 94305
- Department of Bioengineering, Stanford University, Stanford, CA 94305
- ChEM-H Institute, Stanford University, Stanford, CA 94305
- Chan Zuckerberg Biohub, San Francisco, CA 94110
- Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA 94305
- Department of Genetics, Stanford University, Stanford, CA 94305
16
Li M, Yao T, Lin W, Hinckley WE, Galli M, Muchero W, Gallavotti A, Chen JG, Huang SSC. Double DAP-seq uncovered synergistic DNA binding of interacting bZIP transcription factors. Nat Commun 2023; 14:2600. [PMID: 37147307 PMCID: PMC10163045 DOI: 10.1038/s41467-023-38096-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Accepted: 04/15/2023] [Indexed: 05/07/2023] Open
Abstract
Many eukaryotic transcription factors (TF) form homodimer or heterodimer complexes to regulate gene expression. Dimerization of BASIC LEUCINE ZIPPER (bZIP) TFs is critical for their functions, but the molecular mechanism underlying the DNA binding and functional specificity of homo- versus heterodimers remains elusive. To address this gap, we present the double DNA Affinity Purification-sequencing (dDAP-seq) technique that maps heterodimer binding sites on endogenous genomic DNA. Using dDAP-seq we profile twenty pairs of C/S1 bZIP heterodimers and S1 homodimers in Arabidopsis and show that heterodimerization significantly expands the DNA binding preferences of these TFs. Analysis of dDAP-seq binding sites reveals the function of bZIP9 in abscisic acid response and the role of bZIP53 heterodimer-specific binding in seed maturation. The C/S1 heterodimers show distinct preferences for the ACGT elements recognized by plant bZIPs and motifs resembling the yeast GCN4 cis-elements. This study demonstrates the potential of dDAP-seq in deciphering the DNA binding specificities of interacting TFs that are key for combinatorial gene regulation.
Affiliation(s)
- Miaomiao Li
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, 10003, USA
- Tao Yao
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Wanru Lin
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, 10003, USA
- Will E Hinckley
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, 10003, USA
- Mary Galli
- Waksman Institute of Microbiology, Rutgers University, Piscataway, NJ, 08854-8020, USA
- Wellington Muchero
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Andrea Gallavotti
- Waksman Institute of Microbiology, Rutgers University, Piscataway, NJ, 08854-8020, USA
- Jin-Gui Chen
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
- Shao-Shan Carol Huang
- Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, 10003, USA.
17
Chen Y, Lin YCD, Luo Y, Cai X, Qiu P, Cui S, Wang Z, Huang HY, Huang HD. Quantitative model for genome-wide cyclic AMP receptor protein binding site identification and characteristic analysis. Brief Bioinform 2023; 24:7145906. [PMID: 37114659 DOI: 10.1093/bib/bbad138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 03/10/2023] [Accepted: 03/16/2023] [Indexed: 04/29/2023] Open
Abstract
Cyclic AMP receptor proteins (CRPs) are important transcription regulators in many species. The prediction of CRP-binding sites was mainly based on position-weighted matrixes (PWMs). Traditional prediction methods only considered known binding motifs, and their ability to discover inflexible binding patterns was limited. Thus, a novel CRP-binding site prediction model called CRPBSFinder was developed in this research, which combined the hidden Markov model, knowledge-based PWMs and structure-based binding affinity matrixes. We trained this model using validated CRP-binding data from Escherichia coli and evaluated it with computational and experimental methods. The result shows that the model not only can provide higher prediction performance than a classic method but also quantitatively indicates the binding affinity of transcription factor binding sites by prediction scores. The prediction result included not only most known regulated genes but also 1089 novel CRP-regulated genes. The major regulatory roles of CRPs were divided into four classes: carbohydrate metabolism, organic acid metabolism, nitrogen compound metabolism and cellular transport. Several novel functions were also discovered, including heterocycle metabolic and response to stimulus. Based on the functional similarity of homologous CRPs, we applied the model to 35 other species. The prediction tool and the prediction results are online and are available at: https://awi.cuhk.edu.cn/~CRPBSFinder.
Affiliation(s)
- Yigang Chen
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Yang-Chi-Dung Lin
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Yijun Luo
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Xiaoxuan Cai
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Peng Qiu
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Shidong Cui
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Zhe Wang
- School of Humanities and Social Science, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Hsi-Yuan Huang
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Hsien-Da Huang
- School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
- Warshel Institute for Computational Biology, School of Medicine, Life and Health Sciences, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen, Guangdong Province 518172, China
18
Quan L, Chu X, Sun X, Wu T, Lyu Q. How Deepbics Quantifies Intensities of Transcription Factor-DNA Binding and Facilitates Prediction of Single Nucleotide Variant Pathogenicity With a Deep Learning Model Trained On ChIP-Seq Data Sets. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1594-1599. [PMID: 35471887 DOI: 10.1109/tcbb.2022.3170343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The binding of DNA sequences to cell type-specific transcription factors is essential for regulating gene expression in all organisms. Many variants occurring in these binding regions play crucial roles in human disease by disrupting the cis-regulation of gene expression. We first implemented a sequence-based deep learning model called deepBICS to quantify the intensity of transcription factors-DNA binding. The experimental results not only showed the superiority of deepBICS on ChIP-seq data sets but also suggested deepBICS as a language model could help the classification of disease-related and neutral variants. We then built a language model-based method called deepBICS4SNV to predict the pathogenicity of single nucleotide variants. The good performance of deepBICS4SNV on 2 tests related to Mendelian disorders and viral diseases shows the sequence contextual information derived from language models can improve prediction accuracy and generalization capability.
19
Yan W, Li Z, Pian C, Wu Y. PlantBind: an attention-based multi-label neural network for predicting plant transcription factor binding sites. Brief Bioinform 2022; 23:6713513. [PMID: 36155619 DOI: 10.1093/bib/bbac425] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/18/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 12/14/2022] Open
Abstract
Identification of transcription factor binding sites (TFBSs) is essential to understanding of gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.
Affiliation(s)
- Zutan Li
- Nanjing Agricultural University
- Cong Pian
- College of Sciences at Nanjing Agricultural University
- Yufeng Wu
- State Key Laboratory for Crop Genetics and Germplasm Enhancement, Bioinformatics Center, College of Agriculture, Academy for Advanced Interdisciplinary Studies at Nanjing Agricultural University
20
Motif and conserved module analysis in DNA (promoters, enhancers) and RNA (lncRNA, mRNA) using AlModules. Sci Rep 2022; 12:17588. [PMID: 36266399 PMCID: PMC9584888 DOI: 10.1038/s41598-022-21732-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 09/30/2022] [Indexed: 01/13/2023] Open
Abstract
Nucleic acid motifs consist of conserved and variable nucleotide regions. For functional action, several motifs are combined to modules. The tool AIModules allows identification of such motifs including combinations of them and conservation in several nucleic acid stretches. AIModules recognizes conserved motifs and combinations of motifs (modules) allowing a number of interesting biological applications such as analysis of promoter and transcription factor binding sites (TFBS), identification of conserved modules shared between several gene families, e.g. promoter regions, but also analysis of shared and conserved other DNA motifs such as enhancers and silencers, in mRNA (motifs or regulatory elements e.g. for polyadenylation) and lncRNAs. The tool AIModules presented here is an integrated solution for motif analysis, offered as a Web service as well as downloadable software. Several nucleotide sequences are queried for TFBSs using predefined matrices from the JASPAR DB or by using one's own matrices for diverse types of DNA or RNA motif discovery. Furthermore, AIModules can find TFBSs common to two or more sequences. Demanding high or low conservation, AIModules outperforms other solutions in speed and finds more modules (specific combinations of TFBS) than alternative available software. The application also searches RNA motifs such as polyadenylation site or RNA-protein binding motifs as well as DNA motifs such as enhancers as well as user-specified motif combinations ( https://bioinfo-wuerz.de/aimodules/ ; alternative entry pages: https://aimodules.heinzelab.de or https://www.biozentrum.uni-wuerzburg.de/bioinfo/computing/aimodules ). The application is free and open source whether used online, on-site, or locally.
21
Steinhaus R, Robinson PN, Seelow D. FABIAN-variant: predicting the effects of DNA variants on transcription factor binding. Nucleic Acids Res 2022; 50:W322-W329. [PMID: 35639768 PMCID: PMC9252790 DOI: 10.1093/nar/gkac393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 04/22/2022] [Accepted: 05/06/2022] [Indexed: 12/03/2022] Open
Abstract
While great advances in predicting the effects of coding variants have been made, the assessment of non-coding variants remains challenging. This is especially problematic for variants within promoter regions which can lead to over-expression of a gene or reduce or even abolish its expression. The binding of transcription factors to the DNA can be predicted using position weight matrices (PWMs). More recently, transcription factor flexible models (TFFMs) have been introduced and shown to be more accurate than PWMs. TFFMs are based on hidden Markov models and can account for complex positional dependencies. Our new web-based application FABIAN-variant uses 1224 TFFMs and 3790 PWMs to predict whether and to which degree DNA variants affect the binding of 1387 different human transcription factors. For each variant and transcription factor, the software combines the results of different models for a final prediction of the resulting binding-affinity change. The software is written in C++ for speed but variants can be entered through a web interface. Alternatively, a VCF file can be uploaded to assess variants identified by high-throughput sequencing. The search can be restricted to variants in the vicinity of candidate genes. FABIAN-variant is available freely at https://www.genecascade.org/fabian/.
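As a rough illustration of the PWM-based scoring this abstract mentions, the Python sketch below compares the best log-odds score of the reference and alternative alleles over all motif windows that overlap a variant. It is not FABIAN-variant's implementation (which, per the abstract, is written in C++ and combines TFFMs with PWMs). The count matrix, uniform background, pseudocount, and all function names are hypothetical choices made only for this example.

```python
import math

BASES = "ACGT"
BACKGROUND = 0.25   # assumed uniform background frequency
PSEUDOCOUNT = 1.0   # assumed pseudocount to avoid log(0)

# Hypothetical 4-position count matrix (rows: motif positions; columns: A, C, G, T).
COUNTS = [
    [8, 1, 1, 0],
    [0, 9, 0, 1],
    [1, 0, 9, 0],
    [0, 1, 0, 9],
]


def log_odds_pwm(counts):
    """Convert a count matrix into a log2-odds position weight matrix."""
    pwm = []
    for row in counts:
        total = sum(row) + len(BASES) * PSEUDOCOUNT
        pwm.append(
            [math.log2(((c + PSEUDOCOUNT) / total) / BACKGROUND) for c in row]
        )
    return pwm


def best_overlapping_score(pwm, seq, pos):
    """Best PWM score among windows that cover index `pos` (0-based).

    Assumes len(seq) >= len(pwm) and that seq contains only A, C, G, T.
    """
    width = len(pwm)
    first = max(0, pos - width + 1)
    last = min(pos, len(seq) - width)
    return max(
        sum(pwm[i][BASES.index(seq[start + i])] for i in range(width))
        for start in range(first, last + 1)
    )


def variant_effect(pwm, ref_seq, pos, alt_base):
    """Score difference (alt minus ref); negative values suggest weaker binding."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return best_overlapping_score(pwm, alt_seq, pos) - best_overlapping_score(
        pwm, ref_seq, pos
    )


if __name__ == "__main__":
    pwm = log_odds_pwm(COUNTS)
    # C>T substitution inside the ACGT consensus of the toy motif.
    print(variant_effect(pwm, "GGACGTGG", 3, "T"))
```

A raw score difference of this kind carries no significance estimate, and only the forward strand is scanned here; a real tool would typically also score the reverse complement.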
Affiliation(s)
- Robin Steinhaus
- Exploratory Diagnostic Sciences, Berlin Institute of Health, 10117 Berlin, Germany; Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 Berlin, Germany
| | - Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06030, USA; Institute for Systems Genomics, University of Connecticut, Farmington, CT 06030, USA
| | - Dominik Seelow
- Exploratory Diagnostic Sciences, Berlin Institute of Health, 10117 Berlin, Germany; Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 Berlin, Germany
| |
|
22
|
Hossain MA, Al Amin M, Hasan MI, Sohel M, Ahammed MA, Mahmud SH, Rahman MR, Rahman MH. Bioinformatics and system biology approaches to identify molecular pathogenesis of polycystic ovarian syndrome, type 2 diabetes, obesity, and cardiovascular disease that are linked to the progression of female infertility. INFORMATICS IN MEDICINE UNLOCKED 2022. [DOI: 10.1016/j.imu.2022.100960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
|
23
|
Quan L, Sun X, Wu J, Mei J, Huang L, He R, Nie L, Chen Y, Lyu Q. Learning Useful Representations of DNA Sequences From ChIP-Seq Datasets for Exploring Transcription Factor Binding Specificities. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:998-1008. [PMID: 32976105 DOI: 10.1109/tcbb.2020.3026787] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/11/2023]
Abstract
Deep learning has been successfully applied to surprisingly different domains. Researchers and practitioners are employing trained deep learning models to enrich our knowledge. Transcription factors (TFs) are essential for regulating gene expression in all organisms by binding to specific DNA sequences. Here, we designed a deep learning model named SemanticCS (Semantic ChIP-seq) to predict TF binding specificities. We trained our learning model on an ensemble of ChIP-seq datasets (Multi-TF-cell) to learn useful intermediate features across multiple TFs and cells. To interpret these feature vectors, visualization analysis was used. Our results indicate that these learned representations can be used to train shallow machines for other tasks. Using diverse experimental data and evaluation metrics, we show that SemanticCS outperforms other popular methods. In addition, from experimental data, SemanticCS can help to identify the substitutions that cause regulatory abnormalities and to evaluate the effect of substitutions on the binding affinity for the RXR transcription factor. The online server for SemanticCS is freely available at http://qianglab.scst.suda.edu.cn/semanticCS/.
|
24
|
Berger CA, Ward CP, Karchner SI, Nelson RK, Reddy CM, Hahn ME, Tarrant AM. Nematostella vectensis exhibits an enhanced molecular stress response upon co-exposure to highly weathered oil and surface UV radiation. MARINE ENVIRONMENTAL RESEARCH 2022; 175:105569. [PMID: 35248985 DOI: 10.1016/j.marenvres.2022.105569] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/06/2021] [Revised: 01/24/2022] [Accepted: 01/26/2022] [Indexed: 06/14/2023]
Abstract
Crude oil released into the environment undergoes weathering processes that gradually change its composition and toxicity. Co-exposure to petroleum mixtures and other stressors, including ultraviolet (UV) radiation, may lead to synergistic effects and increased toxicity. Laboratory studies should consider these factors when testing the effects of oil exposure on aquatic organisms. Here, we study transcriptomic responses of the estuarine sea anemone Nematostella vectensis to naturally weathered oil, with or without co-exposure to environmental levels of UV radiation. We find that co-exposure greatly enhances the response. We use bioinformatic analyses to identify molecular pathways implicated in this response, which suggest phototoxicity and oxidative damage as mechanisms for the enhanced stress response. Nematostella's stress response shares similarities with the vertebrate oxidative stress response, implying deep conservation of certain stress pathways in animals. We show that exposure to weathered oil along with surface-level UV exposure has substantial physiological consequences in a model cnidarian.
Affiliation(s)
- Cory A Berger
- Biology Department, Woods Hole Oceanographic Institution, Woods Hole, MA, 02543, United States; MIT-WHOI Joint Program in Oceanography/Applied Ocean Science & Engineering, Cambridge and Woods Hole, MA, USA.
| | - Collin P Ward
- Department of Marine Chemistry & Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, MA, 02543, United States
| | - Sibel I Karchner
- Biology Department, Woods Hole Oceanographic Institution, Woods Hole, MA, 02543, United States
| | - Robert K Nelson
- Department of Marine Chemistry & Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, MA, 02543, United States
| | - Christopher M Reddy
- Department of Marine Chemistry & Geochemistry, Woods Hole Oceanographic Institution, Woods Hole, MA, 02543, United States
| | - Mark E Hahn
- Biology Department, Woods Hole Oceanographic Institution, Woods Hole, MA, 02543, United States
| | - Ann M Tarrant
- Biology Department, Woods Hole Oceanographic Institution, Woods Hole, MA, 02543, United States.
| |
|
25
|
Castro-Mondragon JA, Riudavets-Puig R, Rauluseviciute I, Berhanu Lemma R, Turchi L, Blanc-Mathieu R, Lucas J, Boddie P, Khan A, Manosalva Pérez N, Fornes O, Leung T, Aguirre A, Hammal F, Schmelter D, Baranasic D, Ballester B, Sandelin A, Lenhard B, Vandepoele K, Wasserman WW, Parcy F, Mathelier A. JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res 2022; 50:D165-D173. [PMID: 34850907 PMCID: PMC8728201 DOI: 10.1093/nar/gkab1113] [Citation(s) in RCA: 877] [Impact Index Per Article: 438.5] [Received: 09/15/2021] [Revised: 10/20/2021] [Accepted: 10/22/2021] [Indexed: 12/18/2022] Open
Abstract
JASPAR (http://jaspar.genereg.net/) is an open-access database containing manually curated, non-redundant transcription factor (TF) binding profiles for TFs across six taxonomic groups. In this 9th release, we expanded the CORE collection with 341 new profiles (148 for plants, 101 for vertebrates, 85 for urochordates, and 7 for insects), which corresponds to a 19% expansion over the previous release. We added 298 new profiles to the Unvalidated collection when no orthogonal evidence was found in the literature. All the profiles were clustered to provide familial binding profiles for each taxonomic group. Moreover, we revised the structural classification of DNA binding domains to consider plant-specific TFs. This release introduces word clouds to represent the scientific knowledge associated with each TF. We updated the genome tracks of TFBSs predicted with JASPAR profiles in eight organisms; the human and mouse TFBS predictions can be visualized as native tracks in the UCSC Genome Browser. Finally, we provide a new tool to perform JASPAR TFBS enrichment analysis in user-provided genomic regions. All the data is accessible through the JASPAR website, its associated RESTful API, the R/Bioconductor data package, and a new Python package, pyJASPAR, that facilitates serverless access to the data.
Affiliation(s)
- Jaime A Castro-Mondragon
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Rafael Riudavets-Puig
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Ieva Rauluseviciute
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Roza Berhanu Lemma
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Laura Turchi
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des Martyrs, F-38054 Grenoble, France
| | - Romain Blanc-Mathieu
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des Martyrs, F-38054 Grenoble, France
| | - Jeremy Lucas
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des Martyrs, F-38054 Grenoble, France
| | - Paul Boddie
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Aziz Khan
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Nicolás Manosalva Pérez
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Ghent, Belgium
- VIB Center for Plant Systems Biology, Technologiepark 71, 9052 Ghent, Belgium
| | - Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Tiffany Y Leung
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Alejandro Aguirre
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | | | - Daniel Schmelter
- UCSC Genome Browser, University of California Santa Cruz, Santa Cruz, CA 95060, USA
| | - Damir Baranasic
- MRC London Institute of Medical Sciences, Du Cane Road, London, W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK
| | | | - Albin Sandelin
- The Bioinformatics Centre, Department of Biology & Biotech Research and Innovation Centre, University of Copenhagen, Ole Maaloes Vej 5, DK2200 Copenhagen N, Denmark
| | - Boris Lenhard
- MRC London Institute of Medical Sciences, Du Cane Road, London, W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK
| | - Klaas Vandepoele
- Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, 9052 Ghent, Belgium
- VIB Center for Plant Systems Biology, Technologiepark 71, 9052 Ghent, Belgium
- Bioinformatics Institute Ghent, Ghent University, Technologiepark 71, 9052 Ghent, Belgium
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - François Parcy
- Laboratoire Physiologie Cellulaire et Végétale, Univ. Grenoble Alpes, CNRS, CEA, INRAE, IRIG-DBSCI-LPCV, 17 avenue des Martyrs, F-38054 Grenoble, France
| | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Department of Medical Genetics, Institute of Clinical Medicine, University of Oslo and Oslo University Hospital, Oslo, Norway
| |
|
26
|
Tsukanov AV, Mironova VV, Levitsky VG. Motif models proposing independent and interdependent impacts of nucleotides are related to high and low affinity transcription factor binding sites in Arabidopsis. FRONTIERS IN PLANT SCIENCE 2022; 13:938545. [PMID: 35968123 PMCID: PMC9373801 DOI: 10.3389/fpls.2022.938545] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/07/2022] [Accepted: 07/05/2022] [Indexed: 05/15/2023]
Abstract
The position weight matrix (PWM) is the traditional motif model representing transcription factor (TF) binding sites. It assumes that the positions contribute independently to TF binding affinity, although this hypothesis does not fit the data perfectly. This explains why PWM hits are missing in a substantial fraction of ChIP-seq peaks. To study various modes of the direct binding of plant TFs, we compiled a benchmark collection of 111 ChIP-seq datasets for Arabidopsis thaliana, and applied the traditional PWM and two alternative motif models, BaMM and SiteGA, which allow for dependencies between positions. Varying the stringency of the recognition thresholds for the models suggested that the hits of PWM, BaMM, and SiteGA models are associated with sites of high/medium, any, and low affinity, respectively. At the medium recognition threshold, about 60% of ChIP-seq peaks contain PWM hits consisting of conserved core consensuses, while BaMM and SiteGA provide hits for an additional 15% of peaks in which a weaker core consensus is compensated through intra-motif dependencies. The presence/absence of these dependencies in the motifs of alternative/traditional models was confirmed by the dependency logo DepLogo visualizing the position-wise partitioning of the alignments of predicted sites. We exemplify the detailed analysis of ChIP-seq profiles for plant TFs CCA1, MYC2, and SEP3. Gene ontology (GO) enrichment analysis revealed that among the three motif models, SiteGA had the highest portion of genes with significantly enriched GO terms among all predicted genes. We showed that both alternative motif models extend the sites predicted by the traditional PWM to a greater degree for TFs MYC2/SEP3, which have condition/tissue-specific functions, than for TF CCA1, which has housekeeping functions.
Overall, the combined application of standard and alternative motif models is beneficial to detect various modes of the direct TF-DNA interactions in the maximal portion of ChIP-seq loci.
Affiliation(s)
- Anton V. Tsukanov
- Department of Systems Biology, Institute of Cytology and Genetics, Novosibirsk, Russia
| | - Victoria V. Mironova
- Department of Systems Biology, Institute of Cytology and Genetics, Novosibirsk, Russia
- Department of Plant Systems Physiology, Radboud Institute for Biological and Environmental Sciences (RIBES), Radboud University, Nijmegen, Netherlands
| | - Victor G. Levitsky
- Department of Systems Biology, Institute of Cytology and Genetics, Novosibirsk, Russia
- Department of Natural Science, Novosibirsk State University, Novosibirsk, Russia
- *Correspondence: Victor G. Levitsky
| |
|
27
|
Wang S, He Y, Chen Z, Zhang Q. FCNGRU: Locating Transcription Factor Binding Sites by combing Fully Convolutional Neural Network with Gated Recurrent Unit. IEEE J Biomed Health Inform 2021; 26:1883-1890. [PMID: 34613923 DOI: 10.1109/jbhi.2021.3117616] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
Deciphering the relationship between transcription factors (TFs) and DNA sequences is very helpful for computational inference of gene regulation and a comprehensive understanding of gene regulation mechanisms. Transcription factor binding sites (TFBSs) are specific short DNA sequences that play a pivotal role in controlling gene expression through interaction with TF proteins. Although many computational and deep learning methods have recently been proposed to predict the sequence specificity of TF-DNA binding, there is still a lack of effective methods to directly locate TFBSs. To address this problem, we propose FCNGRU, which combines a fully convolutional neural network (FCN) with the gated recurrent unit (GRU) to directly locate TFBSs. Furthermore, we present a two-task framework (FCNGRU-double): one is a classification task at nucleotide level which predicts the probability of each nucleotide and locates TFBSs, and the other is a regression task at sequence level which predicts the intensity of each sequence. A series of experiments are conducted on 45 in-vitro datasets collected from the UniPROBE database derived from universal protein binding microarrays (uPBMs). Compared with competing methods, FCNGRU-double achieves much better results on these datasets. Moreover, FCNGRU-double has an advantage over a single-task framework, FCNGRU-single, which only contains the branch of locating TFBSs. In addition, we combine with in vivo datasets to make a further analysis and discussion. The source code is available at https://github.com/wangguoguoa/FCNGRU.
|
28
|
Jin Y, Jiang J, Wang R, Qin ZS. Systematic Evaluation of DNA Sequence Variations on in vivo Transcription Factor Binding Affinity. Front Genet 2021; 12:667866. [PMID: 34567058 PMCID: PMC8458901 DOI: 10.3389/fgene.2021.667866] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/15/2021] [Accepted: 08/02/2021] [Indexed: 02/01/2023] Open
Abstract
The majority of the single nucleotide variants (SNVs) identified by genome-wide association studies (GWAS) fall outside of the protein-coding regions. Elucidating the functional implications of these variants has been a major challenge. A possible mechanism for functional non-coding variants is that they disrupt the canonical transcription factor (TF) binding sites and thereby affect the in vivo binding of the TF. However, their impact varies since many positions within a TF binding motif are not well conserved. Therefore, simply annotating all variants located in putative TF binding sites may overestimate the functional impact of these SNVs. We conducted a comprehensive survey to study the effect of SNVs on the TF binding affinity. A sequence-based machine learning method was used to estimate the change in binding affinity for each SNV located inside a putative motif site. From the results obtained on 18 TF binding motifs, we found that there is a substantial variation in terms of an SNV’s impact on TF binding affinity. We found that only about 20% of SNVs located inside putative TF binding sites are likely to have a significant impact on TF-DNA binding.
Affiliation(s)
- Yutong Jin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| | - Jiahui Jiang
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| | - Ruixuan Wang
- College of Environmental Sciences and Engineering, Peking University, Beijing, China
| | - Zhaohui S Qin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| |
|
29
|
Tsukanov AV, Levitsky VG, Merkulova TI. Application of alternative de novo motif recognition models for analysis of structural heterogeneity of transcription factor binding sites: a case study of FOXA2 binding sites. Vavilovskii Zhurnal Genet Selektsii 2021; 25:7. [PMID: 34547062 PMCID: PMC8408018 DOI: 10.18699/vj21.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/10/2020] [Revised: 01/10/2021] [Accepted: 01/12/2021] [Indexed: 11/24/2022] Open
Abstract
The most popular model for the search of ChIP-seq data for transcription factor binding sites (TFBS) is the positional weight matrix (PWM). However, this model does not take into account dependencies between nucleotide occurrences in different site positions. Two recently proposed models, BaMM and InMoDe, can take such dependencies into account. However, application of these models was usually limited only to comparing their recognition accuracies with that of PWMs, while no analysis of the co-prediction and relative positioning of hits of different models in peaks has yet been performed. To close this gap, we propose the pipeline called MultiDeNA. This pipeline includes stages of model training, assessing their recognition accuracy, scanning ChIP-seq peaks and their classification based on scan results. We applied our pipeline to 22 ChIP-seq datasets of TF FOXA2 and considered PWM, dinucleotide PWM (diPWM), BaMM and InMoDe models. The combination of these four models allowed a significant increase in the fraction of recognized peaks compared to that for the sole PWM model: the increase was 26.3%. The BaMM model provided the main contribution to the recognition of sites. Although the major fraction of predicted peaks contained TFBS of different models with coinciding positions, the medians of the fraction of peaks containing the predictions of sole models were 1.08, 0.49, 4.15 and 1.73% for PWM, diPWM, BaMM and InMoDe, respectively. Thus, FOXA2 BSs were not fully described by only a sole model, which indicates their heterogeneity. We assume that the BaMM model is the most successful in describing the structure of the FOXA2 BS in ChIP-seq datasets under study.
Affiliation(s)
- A V Tsukanov
- Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia
| | - V G Levitsky
- Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia; Novosibirsk State University, Novosibirsk, Russia
| | - T I Merkulova
- Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia; Novosibirsk State University, Novosibirsk, Russia
| |
|
30
|
The intervening domain is required for DNA-binding and functional identity of plant MADS transcription factors. Nat Commun 2021; 12:4760. [PMID: 34362909 PMCID: PMC8346517 DOI: 10.1038/s41467-021-24978-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/31/2021] [Accepted: 07/14/2021] [Indexed: 02/06/2023] Open
Abstract
The MADS transcription factors (TF) are an ancient eukaryotic protein family. In plants, the family is divided into two main lineages. Here, we demonstrate that DNA binding in both lineages absolutely requires a short amino acid sequence C-terminal to the MADS domain (M domain) called the Intervening domain (I domain) that was previously defined only in type II lineage MADS. Structural elucidation of the MI domains from the floral regulator, SEPALLATA3 (SEP3), shows a conserved fold with the I domain acting to stabilise the M domain. Using the floral organ identity MADS TFs, SEP3, APETALA1 (AP1) and AGAMOUS (AG), domain swapping demonstrates that the I domain alters genome-wide DNA-binding specificity and dimerisation specificity. Introducing AG carrying the I domain of AP1 in the Arabidopsis ap1 mutant resulted in strong complementation and restoration of first and second whorl organs. Taken together, these data demonstrate that the I domain acts as an integral part of the DNA-binding domain and significantly contributes to the functional identity of the MADS TF. MADS transcription factors regulate multiple aspects of plant development. Here the authors show that the intervening I domain is conserved in both type I and type II plant MADS lineages and contributes to the functional identity of the protein by influencing both DNA binding activity and dimerisation specificity.
|
31
|
Puig RR, Boddie P, Khan A, Castro-Mondragon JA, Mathelier A. UniBind: maps of high-confidence direct TF-DNA interactions across nine species. BMC Genomics 2021; 22:482. [PMID: 34174819 PMCID: PMC8236138 DOI: 10.1186/s12864-021-07760-6] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 04/29/2021] [Accepted: 05/27/2021] [Indexed: 12/17/2022] Open
Abstract
BACKGROUND Transcription factors (TFs) bind specifically to TF binding sites (TFBSs) at cis-regulatory regions to control transcription. It is critical to locate these TF-DNA interactions to understand transcriptional regulation. Efforts to predict bona fide TFBSs benefit from the availability of experimental data mapping DNA binding regions of TFs (chromatin immunoprecipitation followed by sequencing - ChIP-seq). RESULTS In this study, we processed ~ 10,000 public ChIP-seq datasets from nine species to provide high-quality TFBS predictions. After quality control, it culminated with the prediction of ~ 56 million TFBSs with experimental and computational support for direct TF-DNA interactions for 644 TFs in > 1000 cell lines and tissues. These TFBSs were used to predict > 197,000 cis-regulatory modules representing clusters of binding events in the corresponding genomes. The high-quality of the TFBSs was reinforced by their evolutionary conservation, enrichment at active cis-regulatory regions, and capacity to predict combinatorial binding of TFs. Further, we confirmed that the cell type and tissue specificity of enhancer activity was correlated with the number of TFs with binding sites predicted in these regions. All the data is provided to the community through the UniBind database that can be accessed through its web-interface ( https://unibind.uio.no/ ), a dedicated RESTful API, and as genomic tracks. Finally, we provide an enrichment tool, available as a web-service and an R package, for users to find TFs with enriched TFBSs in a set of provided genomic regions. CONCLUSIONS UniBind is the first resource of its kind, providing the largest collection of high-confidence direct TF-DNA interactions in nine species.
Affiliation(s)
- Rafael Riudavets Puig
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway
| | - Paul Boddie
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway
| | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA, 94305, USA
| | | | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0349, Oslo, Norway.
- Department of Medical Genetics, Oslo University Hospital, Oslo, 0424, Norway.
| |
|
32
|
Chakraborty D, Zhu H, Jüngel A, Summa L, Li YN, Matei AE, Zhou X, Huang J, Trinh-Minh T, Chen CW, Lafyatis R, Dees C, Bergmann C, Soare A, Luo H, Ramming A, Schett G, Distler O, Distler JHW. Fibroblast growth factor receptor 3 activates a network of profibrotic signaling pathways to promote fibrosis in systemic sclerosis. Sci Transl Med 2021; 12:12/563/eaaz5506. [PMID: 32998972 DOI: 10.1126/scitranslmed.aaz5506] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 09/20/2019] [Accepted: 09/08/2020] [Indexed: 12/11/2022]
Abstract
Aberrant activation of fibroblasts with progressive deposition of extracellular matrix is a key feature of systemic sclerosis (SSc), a prototypical idiopathic fibrotic disease. Here, we demonstrate that the profibrotic cytokine transforming growth factor β selectively up-regulates fibroblast growth factor receptor 3 (FGFR3) and its ligand FGF9 to promote fibroblast activation and tissue fibrosis, leading to a prominent FGFR3 signature in the SSc skin. Transcriptome profiling, in silico analysis and functional experiments revealed that FGFR3 induces multiple profibrotic pathways including endothelin, interleukin-4, and connective tissue growth factor signaling mediated by transcription factor CREB (cAMP response element-binding protein). Inhibition of FGFR3 signaling by fibroblast-specific knockout of FGFR3 or FGF9 or pharmacological inhibition of FGFR3 blocked fibroblast activation and attenuated experimental skin fibrosis in mice. These findings characterize FGFR3 as an upstream regulator of a network of profibrotic mediators in SSc and as a potential target for the treatment of fibrosis.
Affiliation(s)
- Debomita Chakraborty
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Honglin Zhu
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany; Department of Rheumatology, Xiangya Hospital, Central South University, Changsha, Hunan 410008, P.R. China
| | - Astrid Jüngel
- Center of Experimental Rheumatology and Zurich Center of Integrative Human Physiology, University Hospital Zurich, 8091 Zürich, Switzerland
| | - Lena Summa
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Yi-Nan Li
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Alexandru-Emil Matei
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Xiang Zhou
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Jingang Huang
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Thuong Trinh-Minh
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Chih-Wei Chen
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Robert Lafyatis
- Department of Medicine, University of Pittsburgh, PA 15261, USA
| | - Clara Dees
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Christina Bergmann
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Alina Soare
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Hui Luo
- Department of Rheumatology, Xiangya Hospital, Central South University, Changsha, Hunan 410008, P.R. China
| | - Andreas Ramming
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Georg Schett
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Oliver Distler
- Center of Experimental Rheumatology and Zurich Center of Integrative Human Physiology, University Hospital Zurich, 8091 Zürich, Switzerland
| | - Jörg H W Distler
- Department of Internal Medicine 3 - Rheumatology and Immunology, Friedrich-Alexander University (FAU) Erlangen-Nürnberg and Universitätsklinikum Erlangen, 91054 Erlangen, Germany.
|
33
|
Ge W, Meier M, Roth C, Söding J. Bayesian Markov models improve the prediction of binding motifs beyond first order. NAR Genom Bioinform 2021; 3:lqab026. [PMID: 33928244 PMCID: PMC8057495 DOI: 10.1093/nargab/lqab026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/02/2020] [Revised: 03/11/2021] [Accepted: 03/30/2021] [Indexed: 12/13/2022] Open
Abstract
Transcription factors (TFs) regulate gene expression by binding to specific DNA motifs. Accurate models for predicting binding affinities are crucial for quantitatively understanding of transcriptional regulation. Motifs are commonly described by position weight matrices, which assume that each position contributes independently to the binding energy. Models that can learn dependencies between positions, for instance, induced by DNA structure preferences, have yielded markedly improved predictions for most TFs on in vivo data. However, they are more prone to overfit the data and to learn patterns merely correlated with rather than directly involved in TF binding. We present an improved, faster version of our Bayesian Markov model software, BaMMmotif2. We tested it with state-of-the-art motif discovery tools on a large collection of ChIP-seq and HT-SELEX datasets. BaMMmotif2 models of fifth-order achieved a median false-discovery-rate-averaged recall 13.6% and 12.2% higher than the next best tool on 427 ChIP-seq datasets and 164 HT-SELEX datasets, respectively, while being 8 to 1000 times faster. BaMMmotif2 models showed no signs of overtraining in cross-cell line and cross-platform tests, with similar improvements on the next-best tool. These results demonstrate that dependencies beyond first order clearly improve binding models for most TFs.
Affiliation(s)
- Wanwan Ge
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Markus Meier
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Christian Roth
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
|
34
|
Katsushima K, Lee B, Kunhiraman H, Zhong C, Murad R, Yin J, Liu B, Garancher A, Gonzalez-Gomez I, Monforte HL, Stapleton S, Vibhakar R, Bettegowda C, Wechsler-Reya RJ, Jallo G, Raabe E, Eberhart CG, Perera RJ. The long noncoding RNA lnc-HLX-2-7 is oncogenic in Group 3 medulloblastomas. Neuro Oncol 2021; 23:572-585. [PMID: 33844835 PMCID: PMC8041340 DOI: 10.1093/neuonc/noaa235] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Medulloblastoma (MB) is an aggressive brain tumor that predominantly affects children. Recent high-throughput sequencing studies suggest that the noncoding RNA genome, in particular long noncoding RNAs (lncRNAs), contributes to MB subgrouping. Here we report the identification of a novel lncRNA, lnc-HLX-2-7, as a potential molecular marker and therapeutic target in Group 3 MBs. METHODS Publicly available RNA sequencing (RNA-seq) data from 175 MB patients were interrogated to identify lncRNAs that differentiate between MB subgroups. After characterizing a subset of differentially expressed lncRNAs in vitro and in vivo, lnc-HLX-2-7 was deleted by CRISPR/Cas9 in the MB cell line. Intracranial injected tumors were further characterized by bulk and single-cell RNA-seq. RESULTS Lnc-HLX-2-7 is highly upregulated in Group 3 MB cell lines, patient-derived xenografts, and primary MBs compared with other MB subgroups as assessed by quantitative real-time, RNA-seq, and RNA fluorescence in situ hybridization. Depletion of lnc-HLX-2-7 significantly reduced cell proliferation and 3D colony formation and induced apoptosis. Lnc-HLX-2-7-deleted cells injected into mouse cerebellums produced smaller tumors than those derived from parental cells. Pathway analysis revealed that lnc-HLX-2-7 modulated oxidative phosphorylation, mitochondrial dysfunction, and sirtuin signaling pathways. The MYC oncogene regulated lnc-HLX-2-7, and the small-molecule bromodomain and extraterminal domain family‒bromodomain 4 inhibitor Jun Qi 1 (JQ1) reduced lnc-HLX-2-7 expression. CONCLUSIONS Lnc-HLX-2-7 is oncogenic in MB and represents a promising novel molecular marker and a potential therapeutic target in Group 3 MBs.
Affiliation(s)
- Keisuke Katsushima
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, Maryland
- Johns Hopkins All Children’s Hospital, St. Petersburg, Florida
| | - Bongyong Lee
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, Maryland
- Johns Hopkins All Children’s Hospital, St. Petersburg, Florida
| | - Haritha Kunhiraman
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, Maryland
- Johns Hopkins All Children’s Hospital, St. Petersburg, Florida
| | - Cuncong Zhong
- University of Kansas, Department of Electrical Engineering and Computer Science, Lawrence, Kansas
| | - Rabi Murad
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, California
| | - Jun Yin
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, California
| | - Ben Liu
- University of Kansas, Department of Electrical Engineering and Computer Science, Lawrence, Kansas
| | - Rajeev Vibhakar
- University of Colorado School of Medicine Center for Cancer and Blood Disorders, Children’s Hospital Colorado, Aurora, Colorado
| | - Chetan Bettegowda
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, Maryland
| | - George Jallo
- Johns Hopkins All Children’s Hospital, St. Petersburg, Florida
| | - Eric Raabe
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, Maryland
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland
| | - Charles G Eberhart
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, Maryland
- Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland
| | - Ranjan J Perera
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, School of Medicine, Johns Hopkins University, Baltimore, Maryland
- Johns Hopkins All Children’s Hospital, St. Petersburg, Florida
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, California
|
35
|
Chen C, Hou J, Shi X, Yang H, Birchler JA, Cheng J. DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. BMC Bioinformatics 2021; 22:38. [PMID: 33522898 PMCID: PMC7852092 DOI: 10.1186/s12859-020-03952-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/09/2020] [Accepted: 12/29/2020] [Indexed: 12/21/2022] Open
Abstract
Background Due to the complexity of the biological systems, the prediction of the potential DNA binding sites for transcription factors remains a difficult problem in computational biology. Genomic DNA sequences and experimental results from parallel sequencing provide available information about the affinity and accessibility of genome and are commonly used features in binding sites prediction. The attention mechanism in deep learning has shown its capability to learn long-range dependencies from sequential data, such as sentences and voices. Until now, no study has applied this approach in binding site inference from massively parallel sequencing data. The successful applications of attention mechanism in similar input contexts motivate us to build and test new methods that can accurately determine the binding sites of transcription factors. Results In this study, we propose a novel tool (named DeepGRN) for transcription factors binding site prediction based on the combination of two components: single attention module and pairwise attention module. The performance of our methods is evaluated on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge datasets. The results show that DeepGRN achieves higher unified scores in 6 of 13 targets than any of the top four methods in the DREAM challenge. We also demonstrate that the attention weights learned by the model are correlated with potential informative inputs, such as DNase-Seq coverage and motifs, which provide possible explanations for the predictive improvements in DeepGRN. Conclusions DeepGRN can automatically and effectively predict transcription factor binding sites from DNA sequences and DNase-Seq coverage. Furthermore, the visualization techniques we developed for the attention modules help to interpret how critical patterns from different types of input features are recognized by our model.
Affiliation(s)
- Chen Chen
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, MO, 63103, USA
| | - Xiaowen Shi
- Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - Hua Yang
- Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - James A Birchler
- Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - Jianlin Cheng
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA.
|
36
|
Lai X, Stigliani A, Lucas J, Hugouvieux V, Parcy F, Zubieta C. Genome-wide binding of SEPALLATA3 and AGAMOUS complexes determined by sequential DNA-affinity purification sequencing. Nucleic Acids Res 2020; 48:9637-9648. [PMID: 32890394 PMCID: PMC7515736 DOI: 10.1093/nar/gkaa729] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 05/27/2020] [Revised: 08/17/2020] [Accepted: 08/24/2020] [Indexed: 01/18/2023] Open
Abstract
The MADS transcription factors (TF), SEPALLATA3 (SEP3) and AGAMOUS (AG) are required for floral organ identity and floral meristem determinacy. While dimerization is obligatory for DNA binding, SEP3 and SEP3–AG also form tetrameric complexes. How homo and hetero-dimerization and tetramerization of MADS TFs affect genome-wide DNA-binding and gene regulation is not known. Using sequential DNA affinity purification sequencing (seq-DAP-seq), we determined genome-wide binding of SEP3 homomeric and SEP3–AG heteromeric complexes, including SEP3Δtet-AG, a complex with a SEP3 splice variant, SEP3Δtet, which is largely dimeric and SEP3–AG tetramer. SEP3 and SEP3–AG share numerous bound regions, however each complex bound unique sites, demonstrating that protein identity plays a role in DNA-binding. SEP3–AG and SEP3Δtet-AG share a similar genome-wide binding pattern; however the tetrameric form could access new sites and demonstrated a global increase in DNA-binding affinity. Tetramerization exhibited significant cooperative binding with preferential distances between two sites, allowing efficient binding to regions that are poorly recognized by dimeric SEP3Δtet-AG. By intersecting seq-DAP-seq with ChIP-seq and expression data, we identified unique target genes bound either in SEP3–AG seq-DAP-seq or in SEP3/AG ChIP-seq. Seq-DAP-seq is a versatile genome-wide technique and complements in vivo methods to identify putative direct regulatory targets.
Affiliation(s)
- Xuelei Lai
- Laboratoire de Physiologie Cellulaire et Végétale, Université Grenoble-Alpes, CNRS, CEA, INRAE, IRIG-DBSCI, 38000 Grenoble, France
| | - Arnaud Stigliani
- Laboratoire de Physiologie Cellulaire et Végétale, Université Grenoble-Alpes, CNRS, CEA, INRAE, IRIG-DBSCI, 38000 Grenoble, France; Biotech Research and Innovation Centre, University of Copenhagen, Copenhagen, DK-2200, Denmark; Department of Biology, University of Copenhagen, Copenhagen, DK-2200, Denmark
| | - Jérémy Lucas
- Laboratoire de Physiologie Cellulaire et Végétale, Université Grenoble-Alpes, CNRS, CEA, INRAE, IRIG-DBSCI, 38000 Grenoble, France
| | - Véronique Hugouvieux
- Laboratoire de Physiologie Cellulaire et Végétale, Université Grenoble-Alpes, CNRS, CEA, INRAE, IRIG-DBSCI, 38000 Grenoble, France
| | - François Parcy
- Laboratoire de Physiologie Cellulaire et Végétale, Université Grenoble-Alpes, CNRS, CEA, INRAE, IRIG-DBSCI, 38000 Grenoble, France
| | - Chloe Zubieta
- Laboratoire de Physiologie Cellulaire et Végétale, Université Grenoble-Alpes, CNRS, CEA, INRAE, IRIG-DBSCI, 38000 Grenoble, France
|
37
|
Chen L, Capra JA. Learning and interpreting the gene regulatory grammar in a deep learning framework. PLoS Comput Biol 2020; 16:e1008334. [PMID: 33137083 PMCID: PMC7660921 DOI: 10.1371/journal.pcbi.1008334] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/16/2019] [Revised: 11/12/2020] [Accepted: 09/12/2020] [Indexed: 12/12/2022] Open
Abstract
Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. 
Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depends on the nature of the prediction task.
Affiliation(s)
- Ling Chen
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
| | - John A. Capra
- Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
- Vanderbilt Genetics Institute and Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States of America
|
38
|
Gupta D, Ranjan R. In silico characterization of synthetic promoters designed from mirabilis mosaic virus and rice tungro bacilliform virus. Virusdisease 2020; 31:369-373. [PMID: 32904869 DOI: 10.1007/s13337-020-00617-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2020] [Accepted: 07/25/2020] [Indexed: 11/30/2022] Open
Abstract
CaMV35S is the most extensively used promoter for ectopic gene expression in plant system. However, multiple use of this promoter possesses several limitation i.e. homologous based gene silencing and differential suitability in monocot and dicot plants. The strength of a promoter is defined by the presence of cis-acting elements and trans acting nucleic binding factors, thus its strength can be regulated by changing the architecture of these regulatory elements. In the present study, eight hybrid promoters were designed from two parareteroviruses, rice tungro bacilliform viruses (RTBV) and mirabilis mosaic virus (MMV). The eight hybrid promoters, along with parental promoters were characterized for the presence of functional cis-elements and transcription factor binding sites (TFBS), which were predicted using bioinformatics tools such as PLACE and Matinspector. Presence of mirabilis mosaic virus modules for specific functions and over-represented modules was determined using Model inspector. A broad range of cis-elements (85), TFBS (1471) was obtained. Presence of Dehydration responsive element binding factors, Apetala 2 (AP2), WRKY, DNA binding with one finger DOF (DOFF) motifs had shown the functional relevance of these designed promoters with abiotic stress inducibility. In addition to these stress regulating TFBS, the presence of some enhancer like motifs such as P$OCSE, P$TERE, P$TODS, P$ASRC had shown the functional relevance of these promoters as a strong candidate for enhanced expression of ectopic gene.
Affiliation(s)
- Dipinte Gupta
- Plant Biotechnology Lab, Department of Botany, Faculty of Science, Dayalbagh Educational Institute, Dayalbagh, Agra, 282005 India
| | - Rajiv Ranjan
- Plant Biotechnology Lab, Department of Botany, Faculty of Science, Dayalbagh Educational Institute, Dayalbagh, Agra, 282005 India
|
39
|
Zhou J, Lu Q, Xu R, Gui L, Wang H. Prediction of TF-Binding Site by Inclusion of Higher Order Position Dependencies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1383-1393. [PMID: 30629513 DOI: 10.1109/tcbb.2019.2892124] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
Most proposed methods for TF-binding site (TFBS) predictions only use low order dependencies for predictions due to the lack of efficient methods to extract higher order dependencies. In this work, we first propose a novel method to extract higher order dependencies by applying CNN on histone modification features. We then propose a novel TFBS prediction method, referred to as CNN_TF, by incorporating low order and higher order dependencies. CNN_TF is first evaluated on 13 TFs in the mES cell. Results show that using higher order dependencies outperforms low order dependencies significantly on 11 TFs. This indicates that higher order dependencies are indeed more effective for TFBS predictions than low order dependencies. Further experiments show that using both low order dependencies and higher order dependencies improves performance significantly on 12 TFs, indicating the two dependency types are complementary. To evaluate the influence of cell-types on prediction performances, CNN_TF was applied to five TFs in five cell-types of humans. Even though low order dependencies and higher order dependencies show different contributions in different cell-types, they are always complementary in predictions. When comparing to several state-of-the-art methods, CNN_TF outperforms them by at least 5.3 percent in AUPR.
|
40
|
Zhou J, Lu Q, Gui L, Xu R, Long Y, Wang H. MTTFsite: cross-cell type TF binding site prediction by using multi-task learning. Bioinformatics 2020; 35:5067-5077. [PMID: 31161194 PMCID: PMC6954652 DOI: 10.1093/bioinformatics/btz451] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/24/2019] [Revised: 05/19/2019] [Accepted: 05/30/2019] [Indexed: 12/30/2022] Open
Abstract
Motivation The prediction of transcription factor binding sites (TFBSs) is crucial for gene expression analysis. Supervised learning approaches for TFBS predictions require large amounts of labeled data. However, many TFs of certain cell types either do not have sufficient labeled data or do not have any labeled data. Results In this paper, a multi-task learning framework (called MTTFsite) is proposed to address the lack of labeled data problem by leveraging on labeled data available in cross-cell types. The proposed MTTFsite contains a shared CNN to learn common features for all cell types and a private CNN for each cell type to learn private features. The common features are aimed to help predicting TFBSs for all cell types especially those cell types that lack labeled data. MTTFsite is evaluated on 241 cell type TF pairs and compared with a baseline method without using any multi-task learning model and a fully shared multi-task model that uses only a shared CNN and do not use private CNNs. For cell types with insufficient labeled data, results show that MTTFsite performs better than the baseline method and the fully shared model on more than 89% pairs. For cell types without any labeled data, MTTFsite outperforms the baseline method and the fully shared model by more than 80 and 93% pairs, respectively. A novel gene expression prediction method (called TFChrome) using both MTTFsite and histone modification features is also presented. Results show that TFBSs predicted by MTTFsite alone can achieve good performance. When MTTFsite is combined with histone modification features, a significant 5.7% performance improvement is obtained. Availability and implementation The resource and executable code are freely available at http://hlt.hitsz.edu.cn/MTTFsite/ and http://www.hitsz-hlt.com:8080/MTTFsite/. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China; Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
| | - Qin Lu
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
| | - Lin Gui
- Department of Computer Science, University of Warwick, Coventry CV4 4AL, UK
| | - Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
| | - Yunfei Long
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
| | - Hongpeng Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
|
41
|
Srivastava D, Mahony S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochimica et Biophysica Acta. Gene Regulatory Mechanisms 2020; 1863:194443. [PMID: 31639474 PMCID: PMC7166147 DOI: 10.1016/j.bbagrm.2019.194443] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/30/2019] [Revised: 09/21/2019] [Accepted: 10/06/2019] [Indexed: 12/14/2022]
Abstract
Transcription factors (TFs) selectively bind distinct sets of sites in different cell types. Such cell type-specific binding specificity is expected to result from interplay between the TF's intrinsic sequence preferences, cooperative interactions with other regulatory proteins, and cell type-specific chromatin landscapes. Cell type-specific TF binding events are highly correlated with patterns of chromatin accessibility and active histone modifications in the same cell type. However, since concurrent chromatin may itself be a consequence of TF binding, chromatin landscapes measured prior to TF activation provide more useful insights into how cell type-specific TF binding events became established in the first place. Here, we review the various sequence and chromatin determinants of cell type-specific TF binding specificity. We identify the current challenges and opportunities associated with computational approaches to characterizing, imputing, and predicting cell type-specific TF binding patterns. We further focus on studies that characterize TF binding in dynamic regulatory settings, and we discuss how these studies are leading to a more complex and nuanced understanding of dynamic protein-DNA binding activities. We propose that TF binding activities at individual sites can be viewed along a two-dimensional continuum of local sequence and chromatin context. Under this view, cell type-specific TF binding activities may result from either strongly favorable sequence features or strongly favorable chromatin context.
Affiliation(s)
- Divyanshi Srivastava
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
| | - Shaun Mahony
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America.
|
42
|
Moradifard S, Saghiri R, Ehsani P, Mirkhani F, Ebrahimi‐Rad M. A preliminary computational outputs versus experimental results: Application of sTRAP, a biophysical tool for the analysis of SNPs of transcription factor-binding sites. Mol Genet Genomic Med 2020; 8:e1219. [PMID: 32155318 PMCID: PMC7216802 DOI: 10.1002/mgg3.1219] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/10/2019] [Accepted: 02/25/2020] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND In the human genome, the transcription factors (TFs) and transcription factor-binding sites (TFBSs) network has a great regulatory function in the biological pathways. Such crosstalk might be affected by the single-nucleotide polymorphisms (SNPs), which could create or disrupt a TFBS, leading to either a disease or a phenotypic defect. Many computational resources have been introduced to predict the TFs binding variations due to SNPs inside TFBSs, sTRAP being one of them. METHODS A literature review was performed and the experimental data for 18 TFBSs located in 12 genes was provided. The sequences of TFBS motifs were extracted using two different strategies; in the size similar with synthetic target sites used in the experimental techniques, and with 60 bp upstream and downstream of the SNPs. The sTRAP (http://trap.molgen.mpg.de/cgi-bin/trap_two_seq_form.cgi) was applied to compute the binding affinity scores of their cognate TFs in the context of reference and mutant sequences of TFBSs. The alternative bioinformatics model used in this study was regulatory analysis of variation in enhancers (RAVEN; http://www.cisreg.ca/cgi-bin/RAVEN/a). The bioinformatics outputs of our study were compared with experimental data, electrophoretic mobility shift assay (EMSA). RESULTS In 6 out of 18 TFBSs in the following genes COL1A1, Hb ḉᴪ, TF, FIX, MBL2, NOS2A, the outputs of sTRAP were inconsistent with the results of EMSA. Furthermore, no p value of the difference between the two scores of binding affinity under the wild and mutant conditions of TFBSs was presented. Nor, were any criteria for preference or selection of any of the measurements of different matrices used for the same analysis. CONCLUSION Our preliminary study indicated some paradoxical results between sTRAP and experimental data. 
However, to link the data of sTRAP to the biological functions, its optimization via experimental procedures with the integration of expanded data and applying several other bioinformatics tools might be required.
Affiliation(s)
- Reza Saghiri
- Biochemistry Department, Pasteur Institute of Iran, Tehran, Iran
| | - Parastoo Ehsani
- Molecular Biology Department, Pasteur Institute of Iran, Tehran, Iran
|
43
|
Fornes O, Castro-Mondragon JA, Khan A, van der Lee R, Zhang X, Richmond PA, Modi BP, Correard S, Gheorghe M, Baranašić D, Santana-Garcia W, Tan G, Chèneby J, Ballester B, Parcy F, Sandelin A, Lenhard B, Wasserman WW, Mathelier A. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 2020; 48:D87-D92. [PMID: 31701148 PMCID: PMC7145627 DOI: 10.1093/nar/gkz1001] [Citation(s) in RCA: 758] [Impact Index Per Article: 189.5] [Received: 09/15/2019] [Revised: 10/15/2019] [Accepted: 10/16/2019] [Indexed: 02/07/2023] Open
Abstract
JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups. In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs (169 for vertebrates, 42 for plants, 17 for nematodes, 10 for insects, and 7 for fungi), and 156 PFMs were updated (125 for vertebrates, 28 for plants and 3 for insects). These new profiles represent an 18% expansion compared to the previous release. JASPAR 2020 comes with a novel collection of unvalidated TF-binding profiles for which our curators did not find orthogonal supporting evidence in the literature. This collection has a dedicated web form to engage the community in the curation of unvalidated TF-binding profiles. Moreover, we created a Q&A forum to ease the communication between the user community and JASPAR curators. Finally, we updated the genomic tracks, inference tool, and TF-binding profile similarity clusters. All the data is available through the JASPAR website, its associated RESTful API, and through the JASPAR2020 R/Bioconductor package.
Affiliation(s)
- Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Jaime A Castro-Mondragon
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Robin van der Lee
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Xi Zhang
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Phillip A Richmond
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Bhavi P Modi
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Solenne Correard
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Marius Gheorghe
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Damir Baranašić
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
- Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
| | - Walter Santana-Garcia
- Institut de Biologie de l’ENS (IBENS), Département de biologie, École normale supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Ge Tan
- Functional Genomics Centre Zurich, ETH Zurich, Zurich, Switzerland
| | - François Parcy
- CNRS, Univ. Grenoble Alpes, CEA, INRA, IRIG-LPCV, 38000 Grenoble, France
| | - Albin Sandelin
- The Bioinformatics Centre, Department of Biology and Biotech Research & Innovation Centre, University of Copenhagen, DK2200 Copenhagen N, Denmark
| | - Boris Lenhard
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
- Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
- Sars International Centre for Marine Molecular Biology, University of Bergen, N-5008 Bergen, Norway
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0310 Oslo, Norway
|
44
|
Villanueva-Cañas JL, Horvath V, Aguilera L, González J. Diverse families of transposable elements affect the transcriptional regulation of stress-response genes in Drosophila melanogaster. Nucleic Acids Res 2019; 47:6842-6857. [PMID: 31175824 PMCID: PMC6649756 DOI: 10.1093/nar/gkz490] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2018] [Revised: 05/20/2019] [Accepted: 05/22/2019] [Indexed: 12/25/2022] Open
Abstract
Although transposable elements are an important source of regulatory variation, their genome-wide contribution to the transcriptional regulation of stress-response genes has not been studied yet. Stress is a major aspect of natural selection in the wild, leading to changes in the transcriptional regulation of a variety of genes that are often triggered by one or a few transcription factors. In this work, we take advantage of the wealth of information available for Drosophila melanogaster and humans to analyze the role of transposable elements in six stress regulatory networks: immune, hypoxia, oxidative, xenobiotic, heat shock, and heavy metal. We found that transposable elements were enriched for caudal, dorsal, HSF, and tango binding sites in D. melanogaster and for NFE2L2 binding sites in humans. Taking into account the D. melanogaster population frequencies of transposable elements with predicted binding motifs and/or binding sites, we showed that those containing three or more binding motifs/sites are more likely to be functional. For a representative subset of these TEs, we performed in vivo transgenic reporter assays in different stress conditions. Overall, our results showed that TEs are relevant contributors to the transcriptional regulation of stress-response genes.
Affiliation(s)
- Vivien Horvath
- Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, 08003 Barcelona, Spain
| | - Laura Aguilera
- Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, 08003 Barcelona, Spain
| | - Josefa González
- Institute of Evolutionary Biology, CSIC-Universitat Pompeu Fabra, 08003 Barcelona, Spain
|
45
|
Campbell MC, Ashong B, Teng S, Harvey J, Cross CN. Multiple selective sweeps of ancient polymorphisms in and around LTα located in the MHC class III region on chromosome 6. BMC Evol Biol 2019; 19:218. [PMID: 31791241 PMCID: PMC6889576 DOI: 10.1186/s12862-019-1516-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2019] [Accepted: 09/20/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Lymphotoxin-α (LTα), located in the Major Histocompatibility Complex (MHC) class III region on chromosome 6, encodes a cytotoxic protein that mediates a variety of antiviral responses among other biological functions. Furthermore, several genotypes at this gene have been implicated in the onset of a number of complex diseases, including myocardial infarction, autoimmunity, and various types of cancer. However, little is known about levels of nucleotide variation and linkage disequilibrium (LD) in and near LTα, which could also influence phenotypic variance. To address this gap in knowledge, we examined sequence variation across ~10 kilobases (kb), encompassing LTα and the upstream region, in 2039 individuals from the 1000 Genomes Project originating from 21 global populations.
RESULTS: Here, we observed striking patterns of diversity, including an excess of intermediate-frequency alleles, the maintenance of multiple common haplotypes and a deep coalescence time for variation (dating >1.0 million years ago), in global populations. While these results are generally consistent with a model of balancing selection, we also uncovered a signature of positive selection in the form of long-range LD on chromosomes with derived alleles primarily in Eurasian populations. To reconcile these findings, which appear to support different models of selection, we argue that selective sweeps (particularly, soft sweeps) of multiple derived alleles in and/or near LTα occurred in non-Africans after their ancestors left Africa. Furthermore, these targets of selection were predicted to alter transcription factor binding site affinity and protein stability, suggesting they play a role in gene function. Additionally, our data also showed that a subset of these functional adaptive variants are present in archaic hominin genomes.
CONCLUSIONS: Overall, this study identified candidate functional alleles in a biologically-relevant genomic region, and offers new insights into the evolutionary origins of these loci in modern human populations.
Affiliation(s)
- Michael C. Campbell
- Department of Biology, College of Arts and Sciences, Howard University, Washington, DC 20059 USA
| | - Bryan Ashong
- Department of Biology, College of Arts and Sciences, Howard University, Washington, DC 20059 USA
| | - Shaolei Teng
- Department of Biology, College of Arts and Sciences, Howard University, Washington, DC 20059 USA
| | - Jayla Harvey
- Department of Biology, College of Arts and Sciences, Howard University, Washington, DC 20059 USA
| | - Christopher N. Cross
- Department of Anatomy, College of Medicine, Howard University, Washington, DC 20059 USA
|
46
|
Grau J, Nettling M, Keilwagen J. DepLogo: visualizing sequence dependencies in R. Bioinformatics 2019; 35:4812-4814. [PMID: 31225867 DOI: 10.1093/bioinformatics/btz507] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2019] [Revised: 05/27/2019] [Accepted: 06/13/2019] [Indexed: 11/13/2022] Open
Abstract
SUMMARY: Statistical dependencies are present in a variety of sequence data, but are not discernible from traditional sequence logos. Here, we present the R package DepLogo for visualizing inter-position dependencies in aligned sequence data as dependency logos. Dependency logos make dependency structures, which correspond to regular co-occurrences of symbols at dependent positions, visually perceptible. To this end, sequences are partitioned based on their symbols at highly dependent positions as measured by mutual information, and each partition obtains its own visual representation. We illustrate the utility of the DepLogo package in several use cases generating dependency logos from DNA, RNA and protein sequences.
AVAILABILITY AND IMPLEMENTATION: The DepLogo R package is available from CRAN and its source code is available at https://github.com/Jstacs/DepLogo.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Martin Nettling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Quedlinburg, Germany
|
47
|
Gearing LJ, Cumming HE, Chapman R, Finkel AM, Woodhouse IB, Luu K, Gould JA, Forster SC, Hertzog PJ. CiiiDER: A tool for predicting and analysing transcription factor binding sites. PLoS One 2019; 14:e0215495. [PMID: 31483836 PMCID: PMC6726224 DOI: 10.1371/journal.pone.0215495] [Citation(s) in RCA: 123] [Impact Index Per Article: 24.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2019] [Accepted: 08/05/2019] [Indexed: 12/30/2022] Open
Abstract
The availability of large amounts of high-throughput genomic, transcriptomic and epigenomic data has provided opportunity to understand regulation of the cellular transcriptome with an unprecedented level of detail. As a result, research has advanced from identifying gene expression patterns associated with particular conditions to elucidating signalling pathways that regulate expression. There are over 1,000 transcription factors (TFs) in vertebrates that play a role in this regulation. Determining which of these are likely to be controlling a set of genes can be assisted by computational prediction, utilising experimentally verified binding site motifs. Here we present CiiiDER, an integrated computational toolkit for transcription factor binding analysis, written in the Java programming language, to make it independent of computer operating system. It is operated through an intuitive graphical user interface with interactive, high-quality visual outputs, making it accessible to all researchers. CiiiDER predicts transcription factor binding sites (TFBSs) across regulatory regions of interest, such as promoters and enhancers derived from any species. It can perform an enrichment analysis to identify TFs that are significantly over- or under-represented in comparison to a bespoke background set and thereby elucidate pathways regulating sets of genes of pathophysiological importance.
Affiliation(s)
- Linden J. Gearing
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Helen E. Cumming
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Ross Chapman
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Alexander M. Finkel
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Isaac B. Woodhouse
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Kevin Luu
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Jodee A. Gould
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Samuel C. Forster
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
| | - Paul J. Hertzog
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Victoria, Australia
- Department of Molecular Translational Science, Monash University, Clayton, Victoria, Australia
|
48
|
Gheorghe M, Sandve GK, Khan A, Chèneby J, Ballester B, Mathelier A. A map of direct TF-DNA interactions in the human genome. Nucleic Acids Res 2019; 47:e21. [PMID: 30517703 PMCID: PMC6393237 DOI: 10.1093/nar/gky1210] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2018] [Revised: 10/31/2018] [Accepted: 11/20/2018] [Indexed: 12/11/2022] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the most popular assay to identify genomic regions, called ChIP-seq peaks, that are bound in vivo by transcription factors (TFs). These regions are derived from direct TF-DNA interactions, indirect binding of the TF to the DNA (through a co-binding partner), nonspecific binding to the DNA, and noise/bias/artifacts. Delineating the bona fide direct TF-DNA interactions within the ChIP-seq peaks remains challenging. We developed a dedicated software, ChIP-eat, that combines computational TF binding models and ChIP-seq peaks to automatically predict direct TF-DNA interactions. Our work culminated with predicted interactions covering >4% of the human genome, obtained by uniformly processing 1983 ChIP-seq peak data sets from the ReMap database for 232 unique TFs. The predictions were a posteriori assessed using protein binding microarray and ChIP-exo data, and were predominantly found in high quality ChIP-seq peaks. The set of predicted direct TF-DNA interactions suggested that high-occupancy target regions are likely not derived from direct binding of the TFs to the DNA. Our predictions derived co-binding TFs supported by protein-protein interaction data and defined cis-regulatory modules enriched for disease- and trait-associated SNPs. We provide this collection of direct TF-DNA interactions and cis-regulatory modules through the UniBind web-interface (http://unibind.uio.no).
Affiliation(s)
- Marius Gheorghe
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
| | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
| | - Jeanne Chèneby
- Aix Marseille Université, INSERM, TAGC, Marseille, France
| | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
- Department of Cancer Genetics, Institute for Cancer Research, Radiumhospitalet, Oslo, Norway
|
49
|
Wong KC, Lin J, Li X, Lin Q, Liang C, Song YQ. Heterodimeric DNA motif synthesis and validations. Nucleic Acids Res 2019; 47:1628-1636. [PMID: 30590725 PMCID: PMC6393289 DOI: 10.1093/nar/gky1297] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2018] [Revised: 12/04/2018] [Accepted: 12/19/2018] [Indexed: 02/06/2023] Open
Abstract
Bound by transcription factors, DNA motifs (i.e. transcription factor binding sites) are prevalent and important for gene regulation in different tissues at different developmental stages of eukaryotes. Although considerable efforts have been made to elucidate monomeric DNA motif patterns, our knowledge of heterodimeric DNA motifs is still far from complete. Therefore, we propose to develop a computational approach to synthesize a heterodimeric DNA motif from two monomeric DNA motifs. The approach is sequentially divided into two components (Phases A and B). In Phase A, we propose to develop the inference models on how two DNA monomeric motifs can be oriented and overlapped with each other at nucleotide level. In Phase B, given the two monomeric DNA motifs oriented, we further propose to develop DNA-binding family-specific input-output hidden Markov models (IOHMMs) to synthesize a heterodimeric DNA motif. To validate the approach, we execute and cross-validate it with the experimentally verified 618 heterodimeric DNA motifs across 49 DNA-binding family combinations. We observe that our approach can even "rescue" the existing heterodimeric DNA motif pattern (i.e. HOXB2_EOMES) previously published in Nature. Lastly, we apply the proposed approach to infer previously uncharacterized heterodimeric motifs. Their motif instances are supported by DNase accessibility, gene ontology, protein-protein interactions, in vivo ChIP-seq peaks, and even structural data from PDB. A public web-server is built for open accessibility and scientific impact. Its address is listed as follows: http://motif.cs.cityu.edu.hk/custom/MotifKirin.
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Jiecong Lin
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Xiangtao Li
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Qiuzhen Lin
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
| | - Cheng Liang
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
| | - You-Qiang Song
- School of Biomedical Sciences, University of Hong Kong, Pokfulam, Hong Kong SAR
|
50
|
Kiesel A, Roth C, Ge W, Wess M, Meier M, Söding J. The BaMM web server for de-novo motif discovery and regulatory sequence analysis. Nucleic Acids Res 2018; 46:W215-W220. [PMID: 29846656 PMCID: PMC6030882 DOI: 10.1093/nar/gky431] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2018] [Accepted: 05/09/2018] [Indexed: 12/25/2022] Open
Abstract
The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.
Affiliation(s)
- Anja Kiesel
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Christian Roth
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Wanwan Ge
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Maximilian Wess
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Markus Meier
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
|