1
Copley RR, Buttin J, Arguel MJ, Williaume G, Lebrigand K, Barbry P, Hudson C, Yasuo H. Early transcriptional similarities between two distinct neural lineages during ascidian embryogenesis. Dev Biol 2024; 514:1-11. [PMID: 38878991 DOI: 10.1016/j.ydbio.2024.06.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Revised: 05/31/2024] [Accepted: 06/12/2024] [Indexed: 06/20/2024]
Abstract
In chordates, the central nervous system arises from precursors that have distinct developmental and transcriptional trajectories. Anterior nervous systems are ontogenically associated with ectodermal lineages while posterior nervous systems are associated with mesoderm. Taking advantage of the well-documented cell lineage of ascidian embryos, we asked to what extent the transcriptional states of the different neural lineages become similar during the course of progressive lineage restriction. We performed single-cell RNA sequencing (scRNA-seq) analyses on hand-dissected neural precursor cells of the two distinct lineages, together with those of their sister cell lineages, with a high temporal resolution covering five successive cell cycles from the 16-cell to neural plate stages. A transcription factor binding site enrichment analysis of neural specific genes at the neural plate stage revealed limited evidence for shared transcriptional control between the two neural lineages, consistent with their different ontogenies. Nevertheless, PCA analysis and hierarchical clustering showed that, by neural plate stages, the two neural lineages cluster together. Consistent with this, we identified a set of genes enriched in both neural lineages at the neural plate stage, including miR-124, Celf3.a, Zic.r-b, and Ets1/2. Altogether, the current study has revealed genome-wide transcriptional dynamics of neural progenitor cells of two distinct developmental origins. Our scRNA-seq dataset is unique and provides a valuable resource for future analyses, enabling a precise temporal resolution of cell types not previously described from dissociated embryos.
Affiliation(s)
- Richard R Copley
- Laboratoire de Biologie du Développement de Villefranche-sur-mer, Institut de la Mer de Villefranche-sur-mer, Sorbonne Université, CNRS UMR7009, 06230, Villefranche-sur-mer, France.
- Julia Buttin
- Laboratoire de Biologie du Développement de Villefranche-sur-mer, Institut de la Mer de Villefranche-sur-mer, Sorbonne Université, CNRS UMR7009, 06230, Villefranche-sur-mer, France
- Marie-Jeanne Arguel
- Institut de Pharmacologie Moléculaire et Cellulaire, Université Côte d'Azur, CNRS UMR 7275, 06560, Sophia Antipolis, France
- Géraldine Williaume
- Laboratoire de Biologie du Développement de Villefranche-sur-mer, Institut de la Mer de Villefranche-sur-mer, Sorbonne Université, CNRS UMR7009, 06230, Villefranche-sur-mer, France
- Kevin Lebrigand
- Institut de Pharmacologie Moléculaire et Cellulaire, Université Côte d'Azur, CNRS UMR 7275, 06560, Sophia Antipolis, France
- Pascal Barbry
- Institut de Pharmacologie Moléculaire et Cellulaire, Université Côte d'Azur, CNRS UMR 7275, 06560, Sophia Antipolis, France
- Clare Hudson
- Laboratoire de Biologie du Développement de Villefranche-sur-mer, Institut de la Mer de Villefranche-sur-mer, Sorbonne Université, CNRS UMR7009, 06230, Villefranche-sur-mer, France
- Hitoyoshi Yasuo
- Laboratoire de Biologie du Développement de Villefranche-sur-mer, Institut de la Mer de Villefranche-sur-mer, Sorbonne Université, CNRS UMR7009, 06230, Villefranche-sur-mer, France.
2
Fu X, Mo S, Buendia A, Laurent A, Shao A, del Mar Alvarez-Torres M, Yu T, Tan J, Su J, Sagatelian R, Ferrando AA, Ciccia A, Lan Y, Owens DM, Palomero T, Xing EP, Rabadan R. GET: a foundation model of transcription across human cell types. bioRxiv 2024:2023.09.24.559168. [PMID: 39005360 PMCID: PMC11244937 DOI: 10.1101/2023.09.24.559168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Transcriptional regulation, involving the complex interplay between regulatory sequences and proteins, directs all biological processes. Computational models of transcription lack generalizability to accurately extrapolate in unseen cell types and conditions. Here, we introduce GET, an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types. Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types. GET showcases remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovering universal and cell type specific transcription factor interaction networks. We evaluated its performance on prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors. Specifically, we show GET outperforms current models in predicting lentivirus-based massive parallel reporter assay readout with reduced input data. In fetal erythroblasts, we identify distal (>1Mbp) regulatory regions that were missed by previous models. In B cells, we identified a lymphocyte-specific transcription factor-transcription factor interaction that explains the functional significance of a leukemia-risk predisposing germline mutation. In sum, we provide a generalizable and accurate model for transcription together with catalogs of gene regulation and transcription factor interactions, all with cell type specificity.
Affiliation(s)
- Xi Fu
- Department of Systems Biology, Columbia University, New York, NY, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Shentong Mo
- Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, USA
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
- Alejandro Buendia
- Department of Systems Biology, Columbia University, New York, NY, USA
- Anouchka Laurent
- Institute for Cancer Genetics, Columbia University, New York, NY, USA
- Anqi Shao
- Department of Dermatology, Columbia University, New York, NY, USA
- Tianji Yu
- Department of Systems Biology, Columbia University, New York, NY, USA
- Jimin Tan
- Regeneron Genetics Center, Regeneron, Tarrytown, NY, USA
- Jiayu Su
- Department of Systems Biology, Columbia University, New York, NY, USA
- Adolfo A. Ferrando
- Department of Dermatology, Columbia University, New York, NY, USA
- Regeneron Genetics Center, Regeneron, Tarrytown, NY, USA
- Alberto Ciccia
- Department of Genetics and Development, Columbia University, New York, NY, USA
- Yanyan Lan
- Institute for AI Industry Research, Tsinghua University, Beijing, China
- David M. Owens
- Institute for Cancer Genetics, Columbia University, New York, NY, USA
- Department of Pathology & Cell Biology, Columbia University, New York, NY, USA
- Teresa Palomero
- Institute for Cancer Genetics, Columbia University, New York, NY, USA
- Department of Pathology & Cell Biology, Columbia University, New York, NY, USA
- Eric P. Xing
- Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, USA
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
- Raul Rabadan
- Department of Systems Biology, Columbia University, New York, NY, USA
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
3
Huo Q, Song R, Ma Z. Recent advances in exploring transcriptional regulatory landscape of crops. Front Plant Sci 2024; 15:1421503. [PMID: 38903438 PMCID: PMC11188431 DOI: 10.3389/fpls.2024.1421503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Accepted: 05/23/2024] [Indexed: 06/22/2024]
Abstract
Crop breeding entails developing and selecting plant varieties with improved agronomic traits. Modern molecular techniques, such as genome editing, enable more efficient manipulation of plant phenotype by altering the expression of particular regulatory or functional genes. Hence, it is essential to thoroughly comprehend the transcriptional regulatory mechanisms that underpin these traits. In the multi-omics era, a large amount of omics data has been generated for diverse crop species, including genomics, epigenomics, transcriptomics, proteomics, and single-cell omics. The abundant data resources and the emergence of advanced computational tools offer unprecedented opportunities for obtaining a holistic view and profound understanding of the regulatory processes linked to desirable traits. This review focuses on integrated network approaches that utilize multi-omics data to investigate gene expression regulation. Various types of regulatory networks and their inference methods are discussed, focusing on recent advancements in crop plants. The integration of multi-omics data has been proven to be crucial for the construction of high-confidence regulatory networks. With the refinement of these methodologies, they will significantly enhance crop breeding efforts and contribute to global food security.
Affiliation(s)
- Zeyang Ma
- State Key Laboratory of Maize Bio-breeding, Frontiers Science Center for Molecular Design Breeding, Joint International Research Laboratory of Crop Molecular Breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing, China
4
Tambe A, MacCarthy T, Pavri R. Interpretable deep learning reveals the role of an E-box motif in suppressing somatic hypermutation of AGCT motifs within human immunoglobulin variable regions. Front Immunol 2024; 15:1407470. [PMID: 38863710 PMCID: PMC11165027 DOI: 10.3389/fimmu.2024.1407470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Accepted: 05/08/2024] [Indexed: 06/13/2024] Open
Abstract
Introduction: Somatic hypermutation (SHM) of immunoglobulin variable (V) regions by activation induced deaminase (AID) is essential for robust, long-term humoral immunity against pathogen and vaccine antigens. AID mutates cytosines preferentially within WRCH motifs (where W=A or T, R=A or G and H=A, C or T). However, it has been consistently observed that the mutability of WRCH motifs varies substantially, with large variations in mutation frequency even between multiple occurrences of the same motif within a single V region. This has led to the notion that the immediate sequence context of WRCH motifs contributes to mutability. Recent studies have highlighted the potential role of local DNA sequence features in promoting mutagenesis of AGCT, a commonly mutated WRCH motif. Intriguingly, AGCT motifs closer to 5' ends of V regions, within the framework 1 (FW1) sub-region, mutate less frequently, suggesting an SHM-suppressing sequence context. Methods: Here, we systematically examined the basis of AGCT positional biases in human SHM datasets with DeepSHM, a machine-learning model designed to predict SHM patterns. This was combined with integrated gradients, an interpretability method, to interrogate the basis of DeepSHM predictions. Results: DeepSHM predicted the observed positional differences in mutation frequencies at AGCT motifs with high accuracy. For the conserved, lowly mutating AGCT motifs in FW1, integrated gradients predicted a large negative contribution of 5'C and 3'G flanking residues, suggesting that a CAGCTG context in this location was suppressive for SHM. CAGCTG is the recognition motif for E-box transcription factors, including E2A, which has been implicated in SHM. Indeed, we found a strong, inverse relationship between E-box motif fidelity and mutation frequency. Moreover, E2A was found to associate with the V region locale in two human B cell lines. Finally, analysis of human SHM datasets revealed that naturally occurring mutations in the 3'G flanking residues, which effectively ablate the E-box motif, were associated with a significantly increased rate of AGCT mutation. Discussion: Our results suggest an antagonistic relationship between mutation frequency and the binding of E-box factors like E2A at specific AGCT motif contexts and, therefore, highlight a new, suppressive mechanism regulating local SHM patterns in human V regions.
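The motif definitions in this abstract can be made concrete with a short sketch. The Python below locates WRCH motifs (W = A or T, R = A or G, H = A, C or T) and reports whether each AGCT occurrence sits in the CAGCTG E-box context. It is only an illustration of the sequence patterns; it is not DeepSHM or integrated gradients, and the function names `find_wrch` and `agct_contexts` are hypothetical.

```python
import re

# IUPAC codes as defined in the abstract: W = A or T, R = A or G, H = A, C or T.
# A lookahead is used so that overlapping occurrences are all reported.
WRCH = re.compile(r"(?=([AT][AG]C[ACT]))")

def find_wrch(seq):
    """Return (start, motif) for every WRCH occurrence, overlaps included."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in WRCH.finditer(seq)]

def agct_contexts(seq):
    """Return (start, in_ebox) for every AGCT.

    in_ebox is True when the motif is flanked by a 5' C and a 3' G,
    i.e. it is embedded in the E-box sequence CAGCTG.
    """
    seq = seq.upper()
    hits = []
    start = seq.find("AGCT")
    while start != -1:
        in_ebox = start >= 1 and seq[start - 1:start + 5] == "CAGCTG"
        hits.append((start, in_ebox))
        start = seq.find("AGCT", start + 1)
    return hits
```

For example, `agct_contexts("GGCAGCTGTTAGCTA")` distinguishes the first AGCT (inside CAGCTG) from the second (flanked by T and A).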
Affiliation(s)
- Abhik Tambe
- Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY, United States
- Thomas MacCarthy
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States
- Rushad Pavri
- Research Institute of Molecular Pathology (IMP), Vienna, Austria
- Peter Gorer Department of Immunobiology, School of Immunology & Microbial Sciences, King’s College London, London, United Kingdom
5
Huizing GJ, Deutschmann IM, Peyré G, Cantini L. Paired single-cell multi-omics data integration with Mowgli. Nat Commun 2023; 14:7711. [PMID: 38001063 PMCID: PMC10673889 DOI: 10.1038/s41467-023-43019-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Accepted: 10/30/2023] [Indexed: 11/26/2023] Open
Abstract
The profiling of multiple molecular layers from the same set of cells has recently become possible. There is thus a growing need for multi-view learning methods able to jointly analyze these data. We here present Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a novel method for the integration of paired multi-omics data with any type and number of omics. Of note, Mowgli combines integrative Nonnegative Matrix Factorization and Optimal Transport, enhancing at the same time the clustering performance and interpretability of integrative Nonnegative Matrix Factorization. We apply Mowgli to multiple paired single-cell multi-omics data profiled with 10X Multiome, CITE-seq, and TEA-seq. Our in-depth benchmark demonstrates that Mowgli's performance is competitive with the state-of-the-art in cell clustering and superior to the state-of-the-art once considering biological interpretability. Mowgli is implemented as a Python package seamlessly integrated within the scverse ecosystem and it is available at http://github.com/cantinilab/mowgli.
Affiliation(s)
- Geert-Jan Huizing
- Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, F-75015, Paris, France.
- Institut de Biologie de l'Ecole Normale Supérieure, CNRS, INSERM, Ecole Normale Supérieure, Université PSL, 75005, Paris, France.
- Ina Maria Deutschmann
- Institut de Biologie de l'Ecole Normale Supérieure, CNRS, INSERM, Ecole Normale Supérieure, Université PSL, 75005, Paris, France
- Gabriel Peyré
- CNRS and DMA de l'Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure, Université PSL, 75005, Paris, France
- Laura Cantini
- Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, F-75015, Paris, France.
- Institut de Biologie de l'Ecole Normale Supérieure, CNRS, INSERM, Ecole Normale Supérieure, Université PSL, 75005, Paris, France.
6
Badia-I-Mompel P, Wessels L, Müller-Dott S, Trimbour R, Ramirez Flores RO, Argelaguet R, Saez-Rodriguez J. Gene regulatory network inference in the era of single-cell multi-omics. Nat Rev Genet 2023; 24:739-754. [PMID: 37365273 DOI: 10.1038/s41576-023-00618-5] [Citation(s) in RCA: 74] [Impact Index Per Article: 74.0] [Accepted: 05/12/2023] [Indexed: 06/28/2023]
Abstract
The interplay between chromatin, transcription factors and genes generates complex regulatory circuits that can be represented as gene regulatory networks (GRNs). The study of GRNs is useful to understand how cellular identity is established, maintained and disrupted in disease. GRNs can be inferred from experimental data - historically, bulk omics data - and/or from the literature. The advent of single-cell multi-omics technologies has led to the development of novel computational methods that leverage genomic, transcriptomic and chromatin accessibility information to infer GRNs at an unprecedented resolution. Here, we review the key principles of inferring GRNs that encompass transcription factor-gene interactions from transcriptomics and chromatin accessibility data. We focus on the comparison and classification of methods that use single-cell multimodal data. We highlight challenges in GRN inference, in particular with respect to benchmarking, and potential further developments using additional data modalities.
Affiliation(s)
- Pau Badia-I-Mompel
- Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Lorna Wessels
- Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Department of Vascular Biology and Tumor Angiogenesis, European Center for Angioscience, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Sophia Müller-Dott
- Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Rémi Trimbour
- Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics Group, Paris, France
- Ricardo O Ramirez Flores
- Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
- Julio Saez-Rodriguez
- Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany.
7
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
- Computer Science Department, University of Verona, Verona, Italy
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
- Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
- Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
8
Mimoso CA, Adelman K. U1 snRNP increases RNA Pol II elongation rate to enable synthesis of long genes. Mol Cell 2023; 83:1264-1279.e10. [PMID: 36965480 PMCID: PMC10135401 DOI: 10.1016/j.molcel.2023.03.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 11/28/2022] [Revised: 02/06/2023] [Accepted: 02/28/2023] [Indexed: 03/27/2023]
Abstract
The expansion of introns within mammalian genomes poses a challenge for the production of full-length messenger RNAs (mRNAs), with increasing evidence that these long AT-rich sequences present obstacles to transcription. Here, we investigate RNA polymerase II (RNAPII) elongation at high resolution in mammalian cells and demonstrate that RNAPII transcribes faster across introns. Moreover, we find that this acceleration requires the association of U1 snRNP (U1) with the elongation complex at 5' splice sites. The role of U1 to stimulate elongation rate through introns reduces the frequency of both premature termination and transcriptional arrest, thereby dramatically increasing RNA production. We further show that changes in RNAPII elongation rate due to AT content and U1 binding explain previous reports of pausing or termination at splice junctions and the edge of CpG islands. We propose that U1-mediated acceleration of elongation has evolved to mitigate the risks that long AT-rich introns pose to transcript completion.
Affiliation(s)
- Claudia A Mimoso
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Karen Adelman
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA; Ludwig Center at Harvard, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
9
Whole-genome analysis of noncoding genetic variations identifies multiscale regulatory element perturbations associated with Hirschsprung disease. Genome Res 2020; 30:1618-1632. [PMID: 32948616 PMCID: PMC7605255 DOI: 10.1101/gr.264473.120] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 04/08/2020] [Accepted: 09/14/2020] [Indexed: 12/16/2022]
Abstract
It is widely recognized that noncoding genetic variants play important roles in many human diseases, but there are multiple challenges that hinder the identification of functional disease-associated noncoding variants. The number of noncoding variants can be many times that of coding variants; many of them are not functional but in linkage disequilibrium with the functional ones; different variants can have epistatic effects; different variants can affect the same genes or pathways in different individuals; and some variants are related to each other not by affecting the same gene but by affecting the binding of the same upstream regulator. To overcome these difficulties, we propose a novel analysis framework that considers convergent impacts of different genetic variants on protein binding, which provides multiscale information about disease-associated perturbations of regulatory elements, genes, and pathways. Applying it to our whole-genome sequencing data of 918 short-segment Hirschsprung disease patients and matched controls, we identify various novel genes not detected by standard single-variant and region-based tests, functionally centering on neural crest migration and development. Our framework also identifies upstream regulators whose binding is influenced by the noncoding variants. Using human neural crest cells, we confirm cell stage-specific regulatory roles of three top novel regulatory elements on our list, respectively in the RET, RASGEF1A, and PIK3C2B loci. In the PIK3C2B regulatory element, we further show that a noncoding variant found only in the patients affects the binding of the gliogenesis regulator NFIA, with a corresponding up-regulation of multiple genes in the same topologically associating domain.
10
Jolma A, Zhang J, Mondragón E, Morgunova E, Kivioja T, Laverty KU, Yin Y, Zhu F, Bourenkov G, Morris Q, Hughes TR, Maher LJ, Taipale J. Binding specificities of human RNA-binding proteins toward structured and linear RNA sequences. Genome Res 2020; 30:962-973. [PMID: 32703884 PMCID: PMC7397871 DOI: 10.1101/gr.258848.119] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 11/07/2019] [Accepted: 06/23/2020] [Indexed: 01/09/2023]
Abstract
RNA-binding proteins (RBPs) regulate RNA metabolism at multiple levels by affecting splicing of nascent transcripts, RNA folding, base modification, transport, localization, translation, and stability. Despite their central role in RNA function, the RNA-binding specificities of most RBPs remain unknown or incompletely defined. To address this, we have assembled a genome-scale collection of RBPs and their RNA-binding domains (RBDs) and assessed their specificities using high-throughput RNA-SELEX (HTR-SELEX). Approximately 70% of RBPs for which we obtained a motif bound to short linear sequences, whereas ∼30% preferred structured motifs folding into stem-loops. We also found that many RBPs can bind to multiple distinctly different motifs. Analysis of the matches of the motifs in human genomic sequences suggested novel roles for many RBPs. We found that three cytoplasmic proteins (ZC3H12A, ZC3H12B, and ZC3H12C) bound to motifs resembling the splice donor sequence, suggesting that these proteins are involved in degradation of cytoplasmic viral and/or unspliced transcripts. Structural analysis revealed that the RNA motif was not bound by the conventional C3H1 RNA-binding domain of ZC3H12B. Instead, the RNA motif was bound by ZC3H12B's PilT N terminus (PIN) RNase domain, revealing a potential mechanism by which unconventional RBDs containing active sites or molecule-binding pockets could interact with short, structured RNA molecules. Our collection containing 145 high-resolution binding specificity models for 86 RBPs is the largest systematic resource for the analysis of human RBPs and will greatly facilitate future analysis of the various biological roles of this important class of proteins.
Affiliation(s)
- Arttu Jolma
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77, Solna, Sweden
- Jilin Zhang
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77, Solna, Sweden
- Estefania Mondragón
- Department of Biochemistry and Molecular Biology, Mayo Clinic Graduate School of Biomedical Sciences, Mayo Clinic College of Medicine and Science, Rochester, Minnesota 55905, USA
- Ekaterina Morgunova
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77, Solna, Sweden
- Teemu Kivioja
- Genome-Scale Biology Program, University of Helsinki, FI-00014, Helsinki, Finland
- Kaitlin U Laverty
- Department of Molecular Genetics, University of Toronto, M5S 1A8, Toronto, Canada
- Yimeng Yin
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77, Solna, Sweden
- Fangjie Zhu
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77, Solna, Sweden
- Gleb Bourenkov
- European Molecular Biology Laboratory (EMBL), Hamburg Unit c/o DESY, D-22603 Hamburg, Germany
- Quaid Morris
- Department of Molecular Genetics, University of Toronto, M5S 1A8, Toronto, Canada
- Donnelly Centre, University of Toronto, M5S 3E1, Toronto, Canada
- Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, M5S 3G4, Toronto, Canada
- Department of Computer Science, University of Toronto, M5S 2E4, Toronto, Canada
- Memorial Sloan Kettering Cancer Center, New York, New York 10065, USA
- Timothy R Hughes
- Department of Molecular Genetics, University of Toronto, M5S 1A8, Toronto, Canada
- Donnelly Centre, University of Toronto, M5S 3E1, Toronto, Canada
- Louis James Maher
- Department of Biochemistry and Molecular Biology, Mayo Clinic Graduate School of Biomedical Sciences, Mayo Clinic College of Medicine and Science, Rochester, Minnesota 55905, USA
- Jussi Taipale
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, SE-171 77, Solna, Sweden
- Genome-Scale Biology Program, University of Helsinki, FI-00014, Helsinki, Finland
- Department of Biochemistry, University of Cambridge, CB2 1QW, Cambridge, United Kingdom
11
Toivonen J, Das PK, Taipale J, Ukkonen E. MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. Bioinformatics 2020; 36:2690-2696. [PMID: 31999322 PMCID: PMC7203737 DOI: 10.1093/bioinformatics/btaa045] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/21/2019] [Revised: 12/23/2019] [Accepted: 01/23/2020] [Indexed: 12/21/2022] Open
Abstract
MOTIVATION: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. RESULTS: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. AVAILABILITY AND IMPLEMENTATION: Software implementation is available from https://github.com/jttoivon/moder2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
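To illustrate the difference between a PPM and a first-order (adjacent dinucleotide) model described in this abstract, the sketch below fits both from a set of aligned, equal-length sites and compares their log-likelihoods. It is a minimal illustration only: it does not implement MODER2's expectation maximization, mixture model, or dimer handling, and the function names and the pseudocount of 1.0 are assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def fit_ppm(sites, pseudo=1.0):
    """PPM: one independent base distribution per aligned column."""
    total = len(sites) + pseudo * len(ALPHABET)
    ppm = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        ppm.append({b: (counts[b] + pseudo) / total for b in ALPHABET})
    return ppm

def fit_adm(sites, pseudo=1.0):
    """First-order model: P(base at i | base at i-1) for each position i >= 1."""
    adm = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((s[i - 1], s[i]) for s in sites)
        prev_counts = Counter(s[i - 1] for s in sites)
        adm.append({
            (a, b): (pair_counts[(a, b)] + pseudo)
                    / (prev_counts[a] + pseudo * len(ALPHABET))
            for a in ALPHABET for b in ALPHABET
        })
    return adm

def loglik_ppm(seq, ppm):
    """Log-likelihood under column independence."""
    return sum(math.log(col[b]) for col, b in zip(ppm, seq))

def loglik_adm(seq, ppm, adm):
    """First base uses the PPM column; later bases condition on the previous base."""
    ll = math.log(ppm[0][seq[0]])
    for i in range(1, len(seq)):
        ll += math.log(adm[i - 1][(seq[i - 1], seq[i])])
    return ll
```

With training sites in which positions 2 and 3 co-vary (for instance half `ACGT` and half `ATAT`), the PPM scores the unseen mixture `ACAT` exactly as high as `ACGT`, whereas the first-order model penalizes it, which is the kind of adjacent-nucleotide dependency the abstract refers to.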
Affiliation(s)
- Jarkko Toivonen
- Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
- Pratyush K Das
- Applied Tumor Genomics, Research Programs Unit, University of Helsinki, Helsinki FI-00014, Finland
- Jussi Taipale
- Department of Biochemistry, University of Cambridge, CB2 1GA Cambridge, UK
- Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, SE 141 83 Stockholm, Sweden
- Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
- Genome-Scale Biology Program, University of Helsinki, Helsinki FI-00014, Finland
- Esko Ukkonen
- Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
|
12
|
Fostier J. BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs. BMC Bioinformatics 2020; 21:81. [PMID: 32164557 PMCID: PMC7068855 DOI: 10.1186/s12859-020-3348-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
BACKGROUND The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed. RESULTS We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10^-4 using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min. CONCLUSIONS BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
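The idea of expressing PWM matching as a matrix-matrix product can be sketched as follows. This is an illustration under stated assumptions and not BLAMM's implementation: the abstract does not describe its data layout, so the one-hot window encoding, the function names, the toy PWMs, and the restriction to PWMs of equal width are all assumptions made here. A plain Python product stands in for the optimized BLAS call, and only one strand is scanned.

```python
BASES = "ACGT"


def one_hot_windows(seq, width):
    """One row per sequence window, each a flattened one-hot vector (4 * width)."""
    rows = []
    for start in range(len(seq) - width + 1):
        row = [0.0] * (4 * width)
        for offset, base in enumerate(seq[start:start + width]):
            row[4 * offset + BASES.index(base)] = 1.0
        rows.append(row)
    return rows


def flatten_pwm(pwm):
    """Flatten a PWM (list of per-position score dicts) in the same order."""
    return [position[base] for position in pwm for base in BASES]


def matmul(a, b):
    """Plain matrix product; an optimized BLAS GEMM would replace this step."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def scan(seq, pwms, threshold):
    """Return (window start, PWM index, score) for every score >= threshold."""
    width = len(pwms[0])  # assumption: all PWMs share one width
    windows = one_hot_windows(seq, width)                 # n_windows x 4w
    pattern = [list(r) for r in zip(*map(flatten_pwm, pwms))]  # 4w x n_pwms
    scores = matmul(windows, pattern)                     # n_windows x n_pwms
    return [(start, k, score)
            for start, row in enumerate(scores)
            for k, score in enumerate(row)
            if score >= threshold]


# Hypothetical toy PWMs with made-up scores: one favors ACG, the other TTT.
def toy_pwm(consensus):
    return [{b: (2 if b == c else -1) for b in BASES} for c in consensus]


hits = scan("TTACGTTT", [toy_pwm("ACG"), toy_pwm("TTT")], threshold=5)
print(hits)
```

Because every window is scored against every PWM in a single product, the amount of work does not depend on the score threshold, which is in line with the abstract's remark that the runtime is independent of the selected p-value.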
Affiliation(s)
- Jan Fostier
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, Ghent (Zwijnaarde), B-9052, Belgium
|
13
|
Yang EW, Bahn JH, Hsiao EYH, Tan BX, Sun Y, Fu T, Zhou B, Van Nostrand EL, Pratt GA, Freese P, Wei X, Quinones-Valdez G, Urban AE, Graveley BR, Burge CB, Yeo GW, Xiao X. Allele-specific binding of RNA-binding proteins reveals functional genetic variants in the RNA. Nat Commun 2019; 10:1338. [PMID: 30902979 PMCID: PMC6430814 DOI: 10.1038/s41467-019-09292-w] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 08/28/2018] [Accepted: 03/05/2019] [Indexed: 12/31/2022]
Abstract
Allele-specific protein-RNA binding is an essential aspect that may reveal functional genetic variants (GVs) mediating post-transcriptional regulation. Recently, genome-wide detection of in vivo binding of RNA-binding proteins has been greatly facilitated by the enhanced crosslinking and immunoprecipitation (eCLIP) method. We developed a new computational approach, called BEAPR, to identify allele-specific binding (ASB) events in eCLIP-Seq data. BEAPR takes into account crosslinking-induced sequence propensity and variations between replicated experiments. Using simulated and actual data, we show that BEAPR largely outperforms often-used count analysis methods. Importantly, BEAPR overcomes the inherent overdispersion problem of these methods. Complemented by experimental validations, we demonstrate that the application of BEAPR to ENCODE eCLIP-Seq data of 154 proteins helps to predict functional GVs that alter splicing or mRNA abundance. Moreover, many GVs with ASB patterns have known disease relevance. Overall, BEAPR is an effective method that helps to address the outstanding challenge of functional interpretation of GVs.
Affiliation(s)
- Ei-Wen Yang
- Department of Integrative Biology and Physiology, UCLA, Los Angeles, CA, 90095, USA
- Jae Hoon Bahn
- Department of Integrative Biology and Physiology, UCLA, Los Angeles, CA, 90095, USA
- Esther Yun-Hua Hsiao
- Department of Integrative Biology and Physiology, UCLA, Los Angeles, CA, 90095, USA
- Department of Bioengineering, UCLA, Los Angeles, CA, 90095, USA
- Boon Xin Tan
- Department of Integrative Biology and Physiology, UCLA, Los Angeles, CA, 90095, USA
- Yiwei Sun
- Department of Integrative Biology and Physiology, UCLA, Los Angeles, CA, 90095, USA
- Ting Fu
- Department of Integrative Biology and Physiology, UCLA, Los Angeles, CA, 90095, USA
- Molecular, Cellular and Integrative Physiology Interdepartmental Program, UCLA, Los Angeles, CA, 90095, USA
- Bo Zhou
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Palo Alto, CA, 94305, USA
- Eric L Van Nostrand
- Department of Cellular and Molecular Medicine, UCSD, La Jolla, CA, 92093, USA
- Institute for Genomic Medicine, UCSD, La Jolla, CA, 92093, USA
- Gabriel A Pratt
- Department of Cellular and Molecular Medicine, UCSD, La Jolla, CA, 92093, USA
- Institute for Genomic Medicine, UCSD, La Jolla, CA, 92093, USA
- Peter Freese
- Department of Biology, MIT, Cambridge, MA, 02139, USA
- Xintao Wei
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, UConn Health, Farmington, CT, 06030, USA
- Alexander E Urban
- Department of Psychiatry and Behavioral Sciences, Department of Genetics, Stanford University School of Medicine, Palo Alto, CA, 94305, USA
- Brenton R Graveley
- Department of Genetics and Genome Sciences, Institute for Systems Genomics, UConn Health, Farmington, CT, 06030, USA
- Gene W Yeo
- Department of Cellular and Molecular Medicine, UCSD, La Jolla, CA, 92093, USA
- Institute for Genomic Medicine, UCSD, La Jolla, CA, 92093, USA
- Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117593, Singapore
- Molecular Engineering Laboratory, A*STAR, Singapore, 138673, Singapore
- Xinshu Xiao
- Department of Integrative Biology and Physiology, UCLA, Los Angeles, CA, 90095, USA
- Department of Bioengineering, UCLA, Los Angeles, CA, 90095, USA
- Molecular, Cellular and Integrative Physiology Interdepartmental Program, UCLA, Los Angeles, CA, 90095, USA
- Molecular Biology Institute, UCLA, Los Angeles, CA, 90095, USA
|
14
|
Kang R, Zhang Y, Huang Q, Meng J, Ding R, Chang Y, Xiong L, Guo Z. EnhancerDB: a resource of transcriptional regulation in the context of enhancers. Database: The Journal of Biological Databases and Curation 2019; 2019:5298788. [PMID: 30689845 PMCID: PMC6344666 DOI: 10.1093/database/bay141] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 09/14/2018] [Accepted: 12/09/2018] [Indexed: 01/24/2023]
Abstract
Enhancers can act as cis-regulatory elements to control transcriptional regulation by recruiting DNA-binding transcription factors (TFs) in a tissue-specific manner. Recent studies show that enhancers regulate not only protein-coding genes but also microRNAs (miRNAs), and mutations within the TF binding sites (TFBSs) located on enhancers will cause a variety of diseases such as cancer. However, a comprehensive resource to integrate these regulation elements for revealing transcriptional regulations in the context of enhancers is not currently available. Here, we introduce EnhancerDB, a web-accessible database to provide a resource to browse and search regulatory relationships identified in this study, including 131 054 581 TF–enhancer, 17 059 enhancer–miRNAs, 318 993 enhancer–genes, 4 639 558 TF–miRNAs, 1 059 695 TF–genes, 11 439 394 enhancer–single-nucleotide polymorphisms (SNPs) and 23 334 genes associated with expression quantitative trait loci (eQTL) SNP and expression profile of TF/gene/miRNA across multiple human tissues/cell lines. We also developed a tool that further allows users to define tissue-specific enhancers by setting the threshold score of tissue specificity of enhancers. In addition, links to external resources are also available at EnhancerDB.
Affiliation(s)
- Ran Kang
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
- Yiming Zhang
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
- Qingqing Huang
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
- Junhua Meng
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
- Ruofan Ding
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
- Yunjian Chang
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
- Lili Xiong
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
- Zhiyun Guo
- School of Life Sciences and Engineering, Southwest Jiaotong University, Chengdu City, Sichuan Province, P.R. China
|
15
|
Panzade G, Gangwar I, Awasthi S, Sharma N, Shankar R. Plant Regulomics Portal (PRP): a comprehensive integrated regulatory information and analysis portal for plant genomes. Database: The Journal of Biological Databases and Curation 2019; 2019:5650983. [PMID: 31796964 PMCID: PMC6891001 DOI: 10.1093/database/baz130] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/17/2019] [Revised: 10/16/2019] [Accepted: 10/17/2019] [Indexed: 12/20/2022]
Abstract
Gene regulation is a highly complex and networked phenomenon where multiple tiers of control determine the cell state in a spatio-temporal manner. Among these, the transcription factors, DNA and histone modifications, and post-transcriptional control by small RNAs like miRNAs serve as major regulators. An understanding of the integrative and spatio-temporal impact of these regulatory factors can provide better insights into the state of a ‘cell system’. Yet, there are limited resources available to this effect. Therefore, we hereby report an integrative information portal (Plant Regulomics Portal; PRP) for plants for the first time. The portal has been developed by integrating a huge amount of curated data from published sources, RNA-, methylome- and sRNA/miRNA sequencing, histone modifications and repeats, gene ontology, digital gene expression and characterized pathways. The key features of the portal include a regulatory search engine for fetching numerous analytical outputs and tracks of the abovementioned regulators and also a genome browser for integrated visualization of the search results. It also has numerous analytical features for analyses of transcription factors (TFs) and sRNA/miRNA, spot-specific methylation, gene expression and interactions and details of pathways for any given genomic element. It can also provide information on potential RdDM regulation, while facilitating enrichment analysis, generation of visually rich plots and downloading of data in a selective manner. Visualization of intricate biological networks is an important feature which utilizes the Neo4j Graph database, making analysis of relationships and long-range system viewing possible. To date, PRP hosts 571 GB of processed data for four plant species, namely Arabidopsis thaliana, Oryza sativa subsp. japonica, Zea mays and Glycine max. Database URL: https://scbb.ihbt.res.in/PRP
Affiliation(s)
- Ganesh Panzade
- Studio of Computational Biology & Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Kangra, Himachal Pradesh 176061, India
- Academy of Scientific & Innovative Research (AcSIR), CSIR-HRDC Campus, Postal Staff College Area, Sector 19, Kamla Nehru Nagar, Ghaziabad, Uttar Pradesh 201002, India
- Division of Biology, Kansas State University, Zinovyeva Lab, 28 Ackert Hall, Manhattan, KS, USA, 66506
- Indu Gangwar
- Studio of Computational Biology & Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Kangra, Himachal Pradesh 176061, India
- Academy of Scientific & Innovative Research (AcSIR), CSIR-HRDC Campus, Postal Staff College Area, Sector 19, Kamla Nehru Nagar, Ghaziabad, Uttar Pradesh 201002, India
- Supriya Awasthi
- Studio of Computational Biology & Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Kangra, Himachal Pradesh 176061, India
- Nitesh Sharma
- Studio of Computational Biology & Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Kangra, Himachal Pradesh 176061, India
- Academy of Scientific & Innovative Research (AcSIR), CSIR-HRDC Campus, Postal Staff College Area, Sector 19, Kamla Nehru Nagar, Ghaziabad, Uttar Pradesh 201002, India
- Ravi Shankar
- Studio of Computational Biology & Bioinformatics, Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Kangra, Himachal Pradesh 176061, India
- Academy of Scientific & Innovative Research (AcSIR), CSIR-HRDC Campus, Postal Staff College Area, Sector 19, Kamla Nehru Nagar, Ghaziabad, Uttar Pradesh 201002, India
|