51
Mahendrawada L, Warfield L, Donczew R, Hahn S. Surprising connections between DNA binding and function for the near-complete set of yeast transcription factors. bioRxiv: the preprint server for biology 2023:2023.07.25.550593. [PMID: 37546716 PMCID: PMC10402042 DOI: 10.1101/2023.07.25.550593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
DNA sequence-specific transcription factors (TFs) modulate transcription and chromatin architecture, acting from regulatory sites in enhancers and promoters of eukaryotic genes. How TFs locate their DNA targets and how multiple TFs cooperate to regulate individual genes is still unclear. Most yeast TFs are thought to regulate transcription via binding to upstream activating sequences, situated within a few hundred base pairs upstream of the regulated gene. While this model has been validated for individual TFs and specific genes, it has not been tested in a systematic way with the large set of yeast TFs. Here, we have integrated information on the binding and expression targets for the near-complete set of yeast TFs. While we found many instances of functional TF binding sites in upstream regulatory regions, we found many more instances that do not fit this model. In many cases, rapid TF depletion affects gene expression where there is no detectable binding of that TF to the upstream region of the affected gene. In addition, for most TFs, only a small fraction of bound TFs regulates the nearby gene, showing that TF binding does not automatically correspond to regulation of the linked gene. Finally, we found that only a small percentage of TFs are exclusively strong activators or repressors, with most TFs having dual function. Overall, our comprehensive mapping of TF binding and regulatory targets has both confirmed known TF relationships and revealed surprising properties of TF function.
52
Penzar D, Nogina D, Noskova E, Zinkevich A, Meshcheryakov G, Lando A, Rafi AM, de Boer C, Kulakovskiy IV. LegNet: a best-in-class deep learning model for short DNA regulatory regions. Bioinformatics 2023; 39:btad457. [PMID: 37490428 PMCID: PMC10400376 DOI: 10.1093/bioinformatics/btad457] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/22/2022] [Revised: 05/28/2023] [Accepted: 07/24/2023] [Indexed: 07/27/2023] [Open Access]
Abstract
Motivation: The increasing volume of data from high-throughput experiments including parallel reporter assays facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar.
Results: Here, we introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. Using published data, here, we demonstrate that LegNet outperforms existing models and accurately predicts gene expression per se as well as the effects of single-nucleotide variants. Furthermore, we show how LegNet can be used in a diffusion network manner for the rational design of promoter sequences yielding the desired expression level.
Availability and implementation: https://github.com/autosome-ru/LegNet. The GitHub repository includes Jupyter Notebook tutorials and Python scripts under the MIT license to reproduce the results presented in the study.
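The "soft classification" framing mentioned in this abstract can be illustrated with a minimal sketch. This is not the LegNet implementation: the function names, the number of bins, and the choice to split a target between its two nearest integer bins are all assumptions made for illustration only. The idea shown is that a scalar expression value becomes a probability vector over bins, the model is trained with cross-entropy against that vector, and a scalar prediction is recovered as the expected bin index.

```python
import math


def soft_label(y, n_bins):
    """Hypothetical encoding: split scalar y between its two nearest integer bins."""
    y = min(max(float(y), 0.0), n_bins - 1.0)  # clamp into the bin range
    lo = int(math.floor(y))
    hi = min(lo + 1, n_bins - 1)
    w_hi = y - lo                              # weight given to the upper bin
    probs = [0.0] * n_bins
    probs[lo] += 1.0 - w_hi
    probs[hi] += w_hi
    return probs


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, logits):
    """Cross-entropy between a soft target distribution and model logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def expected_bin(probs):
    """Decode a distribution over bins back to a scalar prediction."""
    return sum(i * p for i, p in enumerate(probs))


target = soft_label(2.25, n_bins=5)
print(target)                # probability mass on bins 2 and 3
print(expected_bin(target))  # decodes back to the original scalar
```

Under this encoding the decoded expectation of a target equals the clamped input value, so the regression target is preserved while the loss is a classification loss.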
Affiliation(s)
- Dmitry Penzar
  - Vavilov Institute of General Genetics, Moscow 119991, Russia
  - Institute of Protein Research, Pushchino 142290, Russia
  - Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
- Daria Nogina
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia
- Elizaveta Noskova
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia
- Arsenii Zinkevich
  - Vavilov Institute of General Genetics, Moscow 119991, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia
- Abdul Muntakim Rafi
  - School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Carl de Boer
  - School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Ivan V Kulakovskiy
  - Vavilov Institute of General Genetics, Moscow 119991, Russia
  - Institute of Protein Research, Pushchino 142290, Russia
  - Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan 420008, Russia
53
Oliveros W, Delfosse K, Lato DF, Kiriakopulos K, Mokhtaridoost M, Said A, McMurray BJ, Browning JW, Mattioli K, Meng G, Ellis J, Mital S, Melé M, Maass PG. Systematic characterization of regulatory variants of blood pressure genes. Cell Genomics 2023; 3:100330. [PMID: 37492106 PMCID: PMC10363820 DOI: 10.1016/j.xgen.2023.100330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 03/29/2023] [Accepted: 04/28/2023] [Indexed: 07/27/2023]
Abstract
High blood pressure (BP) is the major risk factor for cardiovascular disease. Genome-wide association studies have identified genetic variants for BP, but functional insights into causality and related molecular mechanisms lag behind. We functionally characterize 4,608 genetic variants in linkage with 135 BP loci in vascular smooth muscle cells and cardiomyocytes by massively parallel reporter assays. High densities of regulatory variants at BP loci (i.e., ULK4, MAP4, CFDP1, PDE5A) indicate that multiple variants drive genetic association. Regulatory variants are enriched in repeats, alter cardiovascular-related transcription factor motifs, and spatially converge with genes controlling specific cardiovascular pathways. Using heuristic scoring, we define likely causal variants, and CRISPR prime editing finally determines causal variants for KCNK9, SFXN2, and PCGF6, which are candidates for developing high BP. Our systems-level approach provides a catalog of functionally relevant variants and their genomic architecture in two trait-relevant cell lines for a better understanding of BP gene regulation.
Affiliation(s)
- Winona Oliveros
  - Life Sciences Department, Barcelona Supercomputing Center, 08034 Barcelona, Catalonia, Spain
- Kate Delfosse
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Daniella F. Lato
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Katerina Kiriakopulos
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Milad Mokhtaridoost
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Abdelrahman Said
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Brandon J. McMurray
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Jared W.L. Browning
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Kaia Mattioli
  - Division of Genetics, Department of Medicine, Brigham & Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA
- Guoliang Meng
  - Developmental and Stem Cell Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- James Ellis
  - Developmental and Stem Cell Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Seema Mital
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
  - Ted Rogers Centre for Heart Research, Toronto, ON M5G 1X8, Canada
  - Department of Pediatrics, The Hospital for Sick Children, University of Toronto, Toronto, ON M5G 0A4, Canada
- Marta Melé
  - Life Sciences Department, Barcelona Supercomputing Center, 08034 Barcelona, Catalonia, Spain
- Philipp G. Maass
  - Genetics & Genome Biology Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
54
Zhang Z, Feng F, Qiu Y, Liu J. A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome. Nucleic Acids Res 2023; 51:5931-5947. [PMID: 37224527 PMCID: PMC10325920 DOI: 10.1093/nar/gkad436] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/20/2022] [Revised: 03/31/2023] [Accepted: 05/09/2023] [Indexed: 05/26/2023] [Open Access]
Abstract
Many deep learning approaches have been proposed to predict epigenetic profiles, chromatin organization, and transcription activity. While these approaches achieve satisfactory performance in predicting one modality from another, the learned representations are not generalizable across predictive tasks or across cell types. In this paper, we propose a deep learning approach named EPCOT which employs a pre-training and fine-tuning framework, and is able to accurately and comprehensively predict multiple modalities including epigenome, chromatin organization, transcriptome, and enhancer activity for new cell types, by only requiring cell-type specific chromatin accessibility profiles. Many of these predicted modalities, such as Micro-C and ChIA-PET, are quite expensive to get in practice, and the in silico prediction from EPCOT should be quite helpful. Furthermore, this pre-training and fine-tuning framework allows EPCOT to identify generic representations generalizable across different predictive tasks. Interpreting EPCOT models also provides biological insights including mapping between different genomic modalities, identifying TF sequence binding patterns, and analyzing cell-type specific TF impacts on enhancer activity.
Affiliation(s)
- Zhenhao Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Fan Feng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Yiyang Qiu
  - Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Jie Liu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
  - Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
55
Catta-Preta R, Lindtner S, Ypsilanti A, Price J, Abnousi A, Su-Feher L, Wang Y, Juric I, Jones IR, Akiyama JA, Hu M, Shen Y, Visel A, Pennacchio LA, Dickel D, Rubenstein JLR, Nord AS. Combinatorial transcription factor binding encodes cis-regulatory wiring of forebrain GABAergic neurogenesis. bioRxiv: the preprint server for biology 2023:2023.06.28.546894. [PMID: 37425940 PMCID: PMC10327028 DOI: 10.1101/2023.06.28.546894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
Transcription factors (TFs) bind combinatorially to genomic cis-regulatory elements (cREs), orchestrating transcription programs. While studies of chromatin state and chromosomal interactions have revealed dynamic neurodevelopmental cRE landscapes, parallel understanding of the underlying TF binding lags. To elucidate the combinatorial TF-cRE interactions driving mouse basal ganglia development, we integrated ChIP-seq for twelve TFs, H3K4me3-associated enhancer-promoter interactions, chromatin and transcriptional state, and transgenic enhancer assays. We identified TF-cREs modules with distinct chromatin features and enhancer activity that have complementary roles driving GABAergic neurogenesis and suppressing other developmental fates. While the majority of distal cREs were bound by one or two TFs, a small proportion were extensively bound, and these enhancers also exhibited exceptional evolutionary conservation, motif density, and complex chromosomal interactions. Our results provide new insights into how modules of combinatorial TF-cRE interactions activate and repress developmental expression programs and demonstrate the value of TF binding data in modeling gene regulatory wiring.
Affiliation(s)
- Rinaldo Catta-Preta
  - Department of Neurobiology, Physiology and Behavior, and Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, CA 95618, USA
  - Current Address: Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Susan Lindtner
  - Nina Ireland Laboratory of Developmental Neurobiology, Department of Psychiatry and Behavioral Sciences, UCSF Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94143, USA
- Athena Ypsilanti
  - Nina Ireland Laboratory of Developmental Neurobiology, Department of Psychiatry and Behavioral Sciences, UCSF Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94143, USA
- James Price
  - Nina Ireland Laboratory of Developmental Neurobiology, Department of Psychiatry and Behavioral Sciences, UCSF Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94143, USA
- Armen Abnousi
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH 44106, USA
  - Current Address: NovaSignal, Los Angeles, CA 90064, USA
- Linda Su-Feher
  - Department of Neurobiology, Physiology and Behavior, and Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, CA 95618, USA
- Yurong Wang
  - Department of Neurobiology, Physiology and Behavior, and Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, CA 95618, USA
- Ivan Juric
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH 44106, USA
- Ian R Jones
  - Institute for Human Genetics, Department of Neurology, University of California, San Francisco, San Francisco, CA 94143, USA
  - Department of Neurology, University of California, San Francisco, CA 94143, USA
- Jennifer A Akiyama
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Ming Hu
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH 44106, USA
- Yin Shen
  - Institute for Human Genetics, Department of Neurology, University of California, San Francisco, San Francisco, CA 94143, USA
  - Department of Neurology, University of California, San Francisco, CA 94143, USA
- Axel Visel
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  - U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
  - School of Natural Sciences, University of California, Merced, Merced, CA 95343, USA
- Len A Pennacchio
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
  - U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
  - Comparative Biochemistry Program, University of California, Berkeley, Berkeley, CA 94720, USA
- Diane Dickel
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- John L R Rubenstein
  - Nina Ireland Laboratory of Developmental Neurobiology, Department of Psychiatry and Behavioral Sciences, UCSF Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94143, USA
- Alex S Nord
  - Department of Neurobiology, Physiology and Behavior, and Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, CA 95618, USA
56
Mach P, Giorgetti L. Integrative approaches to study enhancer-promoter communication. Curr Opin Genet Dev 2023; 80:102052. [PMID: 37257410 PMCID: PMC10293802 DOI: 10.1016/j.gde.2023.102052] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/24/2023] [Revised: 04/21/2023] [Accepted: 04/22/2023] [Indexed: 06/02/2023]
Abstract
The spatiotemporal control of gene expression in complex multicellular organisms relies on noncoding regulatory sequences such as enhancers, which activate transcription of target genes often over large genomic distances. Despite the advances in the identification and characterization of enhancers, the principles and mechanisms by which enhancers select and control their target genes remain largely unknown. Here, we review recent interdisciplinary and quantitative approaches based on emerging techniques that aim to address open questions in the field, notably how regulatory information is encoded in the DNA sequence, how this information is transferred from enhancers to promoters, and how these processes are regulated in time.
Affiliation(s)
- Pia Mach
  - Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland; University of Basel, Basel, Switzerland. https://twitter.com/@MachPia
- Luca Giorgetti
  - Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland.
57
Kim S, Morgunova E, Naqvi S, Bader M, Koska M, Popov A, Luong C, Pogson A, Claes P, Taipale J, Wysocka J. DNA-guided transcription factor cooperativity shapes face and limb mesenchyme. bioRxiv: the preprint server for biology 2023:2023.05.29.541540. [PMID: 37398193 PMCID: PMC10312427 DOI: 10.1101/2023.05.29.541540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Transcription factors (TFs) can define distinct cellular identities despite nearly identical DNA-binding specificities. One mechanism for achieving regulatory specificity is DNA-guided TF cooperativity. Although in vitro studies suggest it may be common, examples of such cooperativity remain scarce in cellular contexts. Here, we demonstrate how 'Coordinator', a long DNA motif comprised of common motifs bound by many basic helix-loop-helix (bHLH) and homeodomain (HD) TFs, uniquely defines regulatory regions of embryonic face and limb mesenchyme. Coordinator guides cooperative and selective binding between the bHLH family mesenchymal regulator TWIST1 and a collective of HD factors associated with regional identities in the face and limb. TWIST1 is required for HD binding and open chromatin at Coordinator sites, while HD factors stabilize TWIST1 occupancy at Coordinator and titrate it away from HD-independent sites. This cooperativity results in shared regulation of genes involved in cell-type and positional identities, and ultimately shapes facial morphology and evolution.
Affiliation(s)
- Seungsoo Kim
  - Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305
  - Department of Developmental Biology, Stanford University, Stanford, CA 94305
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, CA 94305
  - Howard Hughes Medical Institute, Stanford, CA 94305
- Ekaterina Morgunova
  - Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden
- Sahin Naqvi
  - Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305
  - Department of Developmental Biology, Stanford University, Stanford, CA 94305
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, CA 94305
  - Department of Genetics, Stanford University, Stanford, CA 94305
- Maram Bader
  - Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305
  - Department of Developmental Biology, Stanford University, Stanford, CA 94305
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, CA 94305
- Mervenaz Koska
  - Department of Developmental Biology, Stanford University, Stanford, CA 94305
- Christy Luong
  - Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305
- Angela Pogson
  - Department of Developmental Biology, Stanford University, Stanford, CA 94305
- Peter Claes
  - Department of Electrical Engineering, ESAT/PSI, KU Leuven, Leuven, Belgium
  - Medical Imaging Research Center, UZ Leuven, Leuven, Belgium
  - Department of Human Genetics, KU Leuven, Leuven, Belgium
- Jussi Taipale
  - Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden
  - Department of Biochemistry, University of Cambridge, Cambridge, United Kingdom
  - Applied Tumor Genomics Program, University of Helsinki, Helsinki, Finland
- Joanna Wysocka
  - Department of Chemical and Systems Biology, Stanford University, Stanford, CA 94305
  - Department of Developmental Biology, Stanford University, Stanford, CA 94305
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University, Stanford, CA 94305
  - Howard Hughes Medical Institute, Stanford, CA 94305
58
Ziyani C, Delaneau O, Ribeiro DM. Multimodal single cell analysis infers widespread enhancer co-activity in a lymphoblastoid cell line. Commun Biol 2023; 6:563. [PMID: 37237005 PMCID: PMC10219981 DOI: 10.1038/s42003-023-04954-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Accepted: 05/18/2023] [Indexed: 05/28/2023] [Open Access]
Abstract
Non-coding regulatory elements such as enhancers are key in controlling the cell-type specificity and spatio-temporal expression of genes. To drive stable and precise gene transcription robust to genetic variation and environmental stress, genes are often targeted by multiple enhancers with redundant action. However, it is unknown whether enhancers targeting the same gene display simultaneous activity or whether some enhancer combinations are more often co-active than others. Here, we take advantage of recent developments in single cell technology that permit assessing chromatin status (scATAC-seq) and gene expression (scRNA-seq) in the same single cells to correlate gene expression to the activity of multiple enhancers. Measuring activity patterns across 24,844 human lymphoblastoid single cells, we find that the majority of enhancers associated with the same gene display significant correlation in their chromatin profiles. For 6944 expressed genes associated with enhancers, we predict 89,885 significant enhancer-enhancer associations between nearby enhancers. We find that associated enhancers share similar transcription factor binding profiles and that gene essentiality is linked with higher enhancer co-activity. We provide a set of predicted enhancer-enhancer associations based on correlation derived from a single cell line, which can be further investigated for functional relevance.
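The kind of enhancer co-activity analysis this abstract describes can be illustrated with a minimal sketch, assuming each enhancer is summarized as a binary accessibility vector across the same single cells and that co-activity is scored by Pearson correlation. The enhancer names, the toy data, the threshold `min_r`, and the helper functions are hypothetical; the study's actual statistical procedure and significance testing are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no co-activity signal
    return cov / math.sqrt(vx * vy)


def coactive_pairs(accessibility, min_r):
    """Return (enhancer_a, enhancer_b, r) for pairs with correlation >= min_r.

    `accessibility` maps an enhancer name to its per-cell accessibility
    (1 = open, 0 = closed), with cells in the same order for every enhancer.
    """
    names = sorted(accessibility)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(accessibility[a], accessibility[b])
            if r >= min_r:
                pairs.append((a, b, r))
    return pairs


# Toy data: three hypothetical enhancers of one gene measured in six cells.
cells = {
    "enhA": [1, 1, 0, 0, 1, 0],
    "enhB": [1, 1, 0, 0, 1, 0],  # open in the same cells as enhA
    "enhC": [0, 0, 1, 1, 0, 1],  # open in the complementary cells
}
print(coactive_pairs(cells, min_r=0.5))
```

In a real analysis the correlation threshold would be replaced by a significance test with multiple-testing correction, and pairs would be restricted to enhancers linked to the same gene.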
Affiliation(s)
- Chaymae Ziyani
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Olivier Delaneau
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Diogo M Ribeiro
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland.
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
59
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] [Open Access]
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Affiliation(s)
- Gabrielle D Smith
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Wan Hern Ching
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- Paola Cornejo-Páramo
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Emily S Wong
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia.
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia.
60
Zahm AM, Owens WS, Himes SR, Rondem KE, Fallon BS, Gormick AN, Bloom JS, Kosuri S, Chan H, English JG. Discovery and Validation of Context-Dependent Synthetic Mammalian Promoters. bioRxiv: the preprint server for biology 2023:2023.05.11.539703. [PMID: 37214829 PMCID: PMC10197685 DOI: 10.1101/2023.05.11.539703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Cellular transcription enables cells to adapt to various stimuli and maintain homeostasis. Transcription factors bind to transcription response elements (TREs) in gene promoters, initiating transcription. Synthetic promoters, derived from natural TREs, can be engineered to control exogenous gene expression using endogenous transcription machinery. This technology has found extensive use in biological research for applications including reporter gene assays, biomarker development, and programming synthetic circuits in living cells. However, a reliable and precise method for selecting minimally-sized synthetic promoters with desired background, amplitude, and stimulation response profiles has been elusive. In this study, we introduce a massively parallel reporter assay library containing 6184 synthetic promoters, each less than 250 bp in length. This comprehensive library allows for rapid identification of promoters with optimal transcriptional output parameters across multiple cell lines and stimuli. We showcase this library's utility to identify promoters activated in unique cell types, and in response to metabolites, mitogens, cellular toxins, and agonism of both aminergic and non-aminergic GPCRs. We further show these promoters can be used in luciferase reporter assays, eliciting 50-100 fold dynamic ranges in response to stimuli. Our platform is effective, easily implemented, and provides a solution for selecting short-length promoters with precise performance for a multitude of applications.
Affiliation(s)
- Adam M. Zahm
  - Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
- Samuel R. Himes
  - Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
- Kathleen E. Rondem
  - Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
- Braden S. Fallon
  - Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
- Alexa N. Gormick
  - Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
- Justin G. English
  - Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
61
Nikolados EM, Oyarzún DA. Deep learning for optimization of protein expression. Curr Opin Biotechnol 2023; 81:102941. [PMID: 37087839 DOI: 10.1016/j.copbio.2023.102941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 02/02/2023] [Accepted: 03/17/2023] [Indexed: 04/25/2023]
Abstract
Recent progress in high-throughput DNA synthesis and sequencing has enabled the development of massively parallel reporter assays for strain characterization. These datasets map a large number of DNA sequences to protein expression levels, sparking increased interest in data-driven methods for sequence-to-expression modeling. Here, we highlight advances in deep learning models of protein expression and their potential for optimizing strains engineered to produce recombinant proteins. We review recent works that built highly accurate models and discuss challenges that hinder adoption by end users. There is a need to better align this technology with the constraints encountered in strain engineering, particularly the cost of acquiring large amounts of data and the requirement for interpretable models that generalize beyond the training data. Overcoming these barriers will help to incentivize academic and industrial laboratories to tap into a new era of data-centric strain engineering.
Affiliation(s)
- Diego A Oyarzún
  - School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JH, UK; School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK; The Alan Turing Institute, London NW1 2DB, UK.
62
Brosh R, Coelho C, Ribeiro-Dos-Santos AM, Ellis G, Hogan MS, Ashe HJ, Somogyi N, Ordoñez R, Luther RD, Huang E, Boeke JD, Maurano MT. Synthetic regulatory genomics uncovers enhancer context dependence at the Sox2 locus. Mol Cell 2023; 83:1140-1152.e7. [PMID: 36931273 PMCID: PMC10081970 DOI: 10.1016/j.molcel.2023.02.027] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 11/03/2022] [Revised: 01/20/2023] [Accepted: 02/23/2023] [Indexed: 03/18/2023]
Abstract
Sox2 expression in mouse embryonic stem cells (mESCs) depends on a distal cluster of DNase I hypersensitive sites (DHSs), but their individual contributions and degree of interdependence remain a mystery. We analyzed the endogenous Sox2 locus using Big-IN to scarlessly integrate large DNA payloads incorporating deletions, rearrangements, and inversions affecting single or multiple DHSs, as well as surgical alterations to transcription factor (TF) recognition sequences. Multiple mESC clones were derived for each payload, sequence-verified, and analyzed for Sox2 expression. We found that two DHSs comprising a handful of key TF recognition sequences were each sufficient for long-range activation of Sox2 expression. By contrast, three nearby DHSs were entirely context dependent, showing no activity alone but dramatically augmenting the activity of the autonomous DHSs. Our results highlight the role of context in modulating genomic regulatory element function, and our synthetic regulatory genomics approach provides a roadmap for the dissection of other genomic loci.
Affiliation(s)
- Ran Brosh
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Camila Coelho
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Gwen Ellis
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Megan S Hogan
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Hannah J Ashe
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Nicolette Somogyi
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Raquel Ordoñez
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Raven D Luther
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Emily Huang
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
- Jef D Boeke
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
  - Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY 10016, USA
  - Department of Biomedical Engineering, NYU Tandon School of Engineering, Brooklyn, NY 11201, USA
- Matthew T Maurano
  - Institute for Systems Genetics, NYU School of Medicine, New York, NY 10016, USA
  - Department of Pathology, NYU School of Medicine, New York, NY 10016, USA

63
Das M, Hossain A, Banerjee D, Praul CA, Girirajan S. Challenges and considerations for reproducibility of STARR-seq assays. Genome Res 2023; 33:479-495. [PMID: 37130797 PMCID: PMC10234304 DOI: 10.1101/gr.277204.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2022] [Accepted: 03/15/2023] [Indexed: 05/04/2023]
Abstract
High-throughput methods such as RNA-seq, ChIP-seq, and ATAC-seq have well-established guidelines, commercial kits, and analysis pipelines that enable consistency and wider adoption for understanding genome function and regulation. STARR-seq, a popular assay for directly quantifying the activities of thousands of enhancer sequences simultaneously, has seen limited standardization across studies. The assay is long, with more than 250 steps, and frequent customization of the protocol and variations in bioinformatics methods raise concerns for reproducibility of STARR-seq studies. Here, we assess each step of the protocol and analysis pipelines from published sources and in-house assays, and identify critical steps and quality control (QC) checkpoints necessary for reproducibility of the assay. We also provide guidelines for experimental design, protocol scaling, customization, and analysis pipelines for better adoption of the assay. These resources will allow better optimization of STARR-seq for specific research needs, enable comparisons and integration across studies, and improve the reproducibility of results.
Affiliation(s)
- Maitreya Das
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Molecular and Cellular Integrative Biosciences Graduate Program, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Ayaan Hossain
  - Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Deepro Banerjee
  - Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Craig Alan Praul
  - Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Santhosh Girirajan
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Molecular and Cellular Integrative Biosciences Graduate Program, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Bioinformatics and Genomics Graduate Program, Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Department of Anthropology, Pennsylvania State University, University Park, Pennsylvania 16802, USA

64
Avalos D, Rey G, Ribeiro DM, Ramisch A, Dermitzakis ET, Delaneau O. Genetic variation in cis-regulatory domains suggests cell type-specific regulatory mechanisms in immunity. Commun Biol 2023; 6:335. [PMID: 36977773 PMCID: PMC10050075 DOI: 10.1038/s42003-023-04688-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2022] [Accepted: 03/09/2023] [Indexed: 03/30/2023] Open
Abstract
Studying the interplay between genetic variation, epigenetic changes, and regulation of gene expression is crucial to understand the modification of cellular states in various conditions, including immune diseases. In this study, we characterize the cell-specificity in three key cells of the human immune system by building cis maps of regulatory regions with coordinated activity (CRDs) from ChIP-seq peaks and methylation data. We find that only 33% of CRD-gene associations are shared between cell types, revealing how similarly located regulatory regions provide cell-specific modulation of gene activity. We emphasize important biological mechanisms, as most of our associations are enriched in cell-specific transcription factor binding sites, blood-traits, and immune disease-associated loci. Notably, we show that CRD-QTLs aid in interpreting GWAS findings and help prioritize variants for testing functional hypotheses within human complex diseases. Additionally, we map trans CRD regulatory associations, and among 207 trans-eQTLs discovered, 46 overlap with the QTLGen Consortium meta-analysis in whole blood, showing that mapping functional regulatory units using population genomics allows discovering important mechanisms in the regulation of gene expression in immune cells. Finally, we constitute a comprehensive resource describing multi-omics changes to gain a greater understanding of cell-type specific regulatory mechanisms of immunity.
Affiliation(s)
- Diana Avalos
  - Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland
  - Swiss Institute of Bioinformatics (SIB), University of Geneva, Geneva, Switzerland
  - Institute of Genetics and Genomics in Geneva, University of Geneva, Geneva, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Guillaume Rey
  - Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland
  - Swiss Institute of Bioinformatics (SIB), University of Geneva, Geneva, Switzerland
  - Institute of Genetics and Genomics in Geneva, University of Geneva, Geneva, Switzerland
- Diogo M Ribeiro
  - Swiss Institute of Bioinformatics (SIB), University of Geneva, Geneva, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Anna Ramisch
  - Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland
  - Swiss Institute of Bioinformatics (SIB), University of Geneva, Geneva, Switzerland
  - Institute of Genetics and Genomics in Geneva, University of Geneva, Geneva, Switzerland
- Emmanouil T Dermitzakis
  - Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland
  - Swiss Institute of Bioinformatics (SIB), University of Geneva, Geneva, Switzerland
  - Institute of Genetics and Genomics in Geneva, University of Geneva, Geneva, Switzerland
- Olivier Delaneau
  - Swiss Institute of Bioinformatics (SIB), University of Geneva, Geneva, Switzerland
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland

65
Agarwal V, Inoue F, Schubach M, Martin BK, Dash PM, Zhang Z, Sohota A, Noble WS, Yardimci GG, Kircher M, Shendure J, Ahituv N. Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.05.531189. [PMID: 36945371 PMCID: PMC10028905 DOI: 10.1101/2023.03.05.531189] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Indexed: 03/11/2023]
Abstract
The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a nearly comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type-specific 'on switches' providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines, and showcases how large-scale functional measurements can be used to dissect regulatory grammar.
Affiliation(s)
- Vikram Agarwal
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA 02451, USA
- Fumitaka Inoue
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Max Schubach
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
- Beth K. Martin
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Pyaree Mohan Dash
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
- Zicong Zhang
  - Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
- Ajuni Sohota
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Galip Gürkan Yardimci
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Knight Cancer Institute, Oregon Health and Science University, Portland, OR, USA
  - Cancer Early Detection Advanced Research Center, Oregon Health and Science University, Portland, OR, USA
- Martin Kircher
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
  - Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany
- Jay Shendure
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, Seattle, WA 98195, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA 98195, USA
  - Allen Center for Cell Lineage Tracing, University of Washington, Seattle, WA 98195, USA
- Nadav Ahituv
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA

66
Gallego Romero I, Lea AJ. Leveraging massively parallel reporter assays for evolutionary questions. Genome Biol 2023; 24:26. [PMID: 36788564 PMCID: PMC9926830 DOI: 10.1186/s13059-023-02856-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 05/08/2022] [Accepted: 01/17/2023] [Indexed: 02/16/2023] Open
Abstract
A long-standing goal of evolutionary biology is to decode how gene regulation contributes to organismal diversity. Doing so is challenging because it is hard to predict function from non-coding sequence and to perform molecular research with non-model taxa. Massively parallel reporter assays (MPRAs) enable the testing of thousands to millions of sequences for regulatory activity simultaneously. Here, we discuss the execution, advantages, and limitations of MPRAs, with a focus on evolutionary questions. We propose solutions for extending MPRAs to rare taxa and those with limited genomic resources, and we underscore MPRA's broad potential for driving genome-scale, functional studies across organisms.
Affiliation(s)
- Irene Gallego Romero
  - Melbourne Integrative Genomics, University of Melbourne, Royal Parade, Parkville, Victoria, 3010, Australia
  - School of BioSciences, The University of Melbourne, Royal Parade, Parkville, 3010, Australia
  - The Centre for Stem Cell Systems, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, 30 Royal Parade, Parkville, Victoria, 3010, Australia
  - Center for Genomics, Evolution and Medicine, Institute of Genomics, University of Tartu, Riia 23b, 51010, Tartu, Estonia
- Amanda J. Lea
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN 37240, USA
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN 37240, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37240, USA
  - Child and Brain Development Program, Canadian Institute for Advanced Study, Toronto, Canada

67
Kim S, Wysocka J. Deciphering the multi-scale, quantitative cis-regulatory code. Mol Cell 2023; 83:373-392. [PMID: 36693380 PMCID: PMC9898153 DOI: 10.1016/j.molcel.2022.12.032] [Citation(s) in RCA: 69] [Impact Index Per Article: 69.0] [Received: 10/27/2022] [Revised: 12/29/2022] [Accepted: 12/30/2022] [Indexed: 01/24/2023]
Abstract
Uncovering the cis-regulatory code that governs when and how much each gene is transcribed in a given genome and cellular state remains a central goal of biology. Here, we discuss major layers of regulation that influence how transcriptional outputs are encoded by DNA sequence and cellular context. We first discuss how transcription factors bind specific DNA sequences in a dosage-dependent and cooperative manner and then proceed to the cofactors that facilitate transcription factor function and mediate the activity of modular cis-regulatory elements such as enhancers, silencers, and promoters. We then consider the complex and poorly understood interplay of these diverse elements within regulatory landscapes and its relationships with chromatin states and nuclear organization. We propose that a mechanistically informed, quantitative model of transcriptional regulation that integrates these multiple regulatory layers will be the key to ultimately cracking the cis-regulatory code.
Affiliation(s)
- Seungsoo Kim
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Chemical and Systems Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Joanna Wysocka
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Chemical and Systems Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
  - Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA

68
Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet 2023; 24:125-137. [PMID: 36192604 DOI: 10.1038/s41576-022-00532-2] [Citation(s) in RCA: 63] [Impact Index Per Article: 63.0] [Accepted: 08/31/2022] [Indexed: 01/24/2023]
Abstract
Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.
Affiliation(s)
- Gherman Novakovsky
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada
- Nick Dexter
  - Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Maxwell W Libbrecht
  - School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada
- Sara Mostafavi
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  - Canadian Institute for Advanced Research, Toronto, Ontario, Canada

69
Pihlajamaa P, Kauko O, Sahu B, Kivioja T, Taipale J. A competitive precision CRISPR method to identify the fitness effects of transcription factor binding sites. Nat Biotechnol 2023; 41:197-203. [PMID: 36163549 PMCID: PMC9931575 DOI: 10.1038/s41587-022-01444-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/16/2021] [Accepted: 07/20/2022] [Indexed: 12/26/2022]
Abstract
Here we describe a competitive genome editing method that measures the effect of mutations on molecular functions, based on precision CRISPR editing using template libraries with either the original or altered sequence, and a sequence tag, enabling direct comparison between original and mutated cells. Using the example of the MYC oncogene, we identify important transcriptional targets and show that E-box mutations at MYC target gene promoters reduce cellular fitness.
Affiliation(s)
- Päivi Pihlajamaa
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
- Otto Kauko
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
- Biswajyoti Sahu
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Medicum, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Teemu Kivioja
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
- Jussi Taipale
  - Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
  - Department of Medical Biochemistry and Biophysics, Karolinska Institute, Stockholm, Sweden

70
Shin B, Rothenberg EV. Multi-modular structure of the gene regulatory network for specification and commitment of murine T cells. Front Immunol 2023; 14:1108368. [PMID: 36817475 PMCID: PMC9928580 DOI: 10.3389/fimmu.2023.1108368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2022] [Accepted: 01/11/2023] [Indexed: 02/04/2023] Open
Abstract
T cells develop from multipotent progenitors by a gradual process dependent on intrathymic Notch signaling and coupled with extensive proliferation. The stages leading them to T-cell lineage commitment are well characterized by single-cell and bulk RNA analyses of sorted populations and by direct measurements of precursor-product relationships. This process depends not only on Notch signaling but also on multiple transcription factors, some associated with stemness and multipotency, some with alternative lineages, and others associated with T-cell fate. These factors interact in opposing or semi-independent T cell gene regulatory network (GRN) subcircuits that are increasingly well defined. A newly comprehensive picture of this network has emerged. Importantly, because key factors in the GRN can bind to markedly different genomic sites at one stage than they do at other stages, the genes they significantly regulate are also stage-specific. Global transcriptome analyses of perturbations have revealed an underlying modular structure to the T-cell commitment GRN, separating decisions to lose "stem-ness" from decisions to block alternative fates. Finally, the updated network sheds light on the intimate relationship between the T-cell program, which depends on the thymus, and the innate lymphoid cell (ILC) program, which does not.
Affiliation(s)
- Boyoung Shin
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, United States
- Ellen V. Rothenberg
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, United States

71
Chen Z, King WC, Hwang A, Gerstein M, Zhang J. DeepVelo: Single-cell transcriptomic deep velocity field learning with neural ordinary differential equations. SCIENCE ADVANCES 2022; 8:eabq3745. [PMID: 36449617 PMCID: PMC9710871 DOI: 10.1126/sciadv.abq3745] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Indexed: 05/14/2023]
Abstract
Recent advances in single-cell sequencing technologies have provided unprecedented opportunities to measure the gene expression profile and RNA velocity of individual cells. However, modeling transcriptional dynamics is computationally challenging because of the high-dimensional, sparse nature of the single-cell gene expression measurements and the nonlinear regulatory relationships. Here, we present DeepVelo, a neural network-based ordinary differential equation that can model complex transcriptome dynamics by describing continuous-time gene expression changes within individual cells. We apply DeepVelo to public datasets from different sequencing platforms to (i) formulate transcriptome dynamics on different time scales, (ii) measure the instability of cell states, and (iii) identify developmental driver genes via perturbation analysis. Benchmarking against the state-of-the-art methods shows that DeepVelo can learn a more accurate representation of the velocity field. Furthermore, our perturbation studies reveal that single-cell dynamical systems could exhibit chaotic properties. In summary, DeepVelo allows data-driven discoveries of differential equations that delineate single-cell transcriptome dynamics.
Affiliation(s)
- Zhanlin Chen
  - Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA
- William C. King
  - Healthcare and Life Sciences, Microsoft, Redmond, WA 98052, USA
- Aheyon Hwang
  - Mathematical, Computational, and Systems Biology, University of California, Irvine, Irvine, CA 92697, USA
- Mark Gerstein
  - Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Department of Computer Science, Yale University, New Haven, CT 06520, USA
  - Corresponding author (M.G., J.Z.)
- Jing Zhang
  - Department of Computer Science, University of California, Irvine, Irvine, CA 92697, USA
  - Corresponding author (M.G., J.Z.)

72
Current challenges in understanding the role of enhancers in disease. Nat Struct Mol Biol 2022; 29:1148-1158. [PMID: 36482255 DOI: 10.1038/s41594-022-00896-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 09/30/2022] [Accepted: 11/04/2022] [Indexed: 12/13/2022]
Abstract
Enhancers play a central role in the spatiotemporal control of gene expression and tend to work in a cell-type-specific manner. In addition, they are suggested to be major contributors to phenotypic variation, evolution and disease. There is growing evidence that enhancer dysfunction due to genetic, structural or epigenetic mechanisms contributes to a broad range of human diseases referred to as enhanceropathies. Such mechanisms often underlie the susceptibility to common diseases, but can also play a direct causal role in cancer or Mendelian diseases. Despite the recent gain of insights into enhancer biology and function, we still have a limited ability to predict how enhancer dysfunction impacts gene expression. Here we discuss the major challenges that need to be overcome when studying the role of enhancers in disease etiology and highlight opportunities and directions for future studies, aiming to disentangle the molecular basis of enhanceropathies.

73
Zhang T, Tang Q, Nie F, Zhao Q, Chen W. DeepLncPro: an interpretable convolutional neural network model for identifying long non-coding RNA promoters. Brief Bioinform 2022; 23:6754194. [PMID: 36209437 DOI: 10.1093/bib/bbac447] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 07/19/2022] [Revised: 09/14/2022] [Accepted: 09/17/2022] [Indexed: 12/14/2022] Open
Abstract
Long non-coding RNA (lncRNA) plays important roles in a series of biological processes. The transcription of lncRNA is regulated by its promoter. Hence, accurate identification of lncRNA promoters will help to elucidate their regulatory mechanisms. Since experimental techniques remain time-consuming for genome-wide promoter identification, developing computational tools to identify promoters is necessary. However, only a few computational methods have been proposed for lncRNA promoter prediction, and their performance still has room for improvement. In the present work, a convolutional neural network-based model, called DeepLncPro, was proposed to identify lncRNA promoters in human and mouse. Comparative results demonstrated that DeepLncPro was superior to both state-of-the-art machine learning methods and existing models for identifying lncRNA promoters. Furthermore, DeepLncPro can extract and analyze transcription factor binding motifs from lncRNAs, which makes it an interpretable model. These results indicate that DeepLncPro can serve as a powerful tool for identifying lncRNA promoters. An open-source tool for DeepLncPro is provided at https://github.com/zhangtian-yang/DeepLncPro.
Affiliation(s)
- Tianyang Zhang
  - School of Life Sciences, North China University of Science and Technology
- Qiang Tang
  - School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine
- Fulei Nie
  - School of Life Sciences, North China University of Science and Technology
- Qi Zhao
  - School of Computer Science and Software Engineering, University of Science and Technology Liaoning
- Wei Chen
  - Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine

74
Bergman DT, Jones TR, Liu V, Ray J, Jagoda E, Siraj L, Kang HY, Nasser J, Kane M, Rios A, Nguyen TH, Grossman SR, Fulco CP, Lander ES, Engreitz JM. Compatibility rules of human enhancer and promoter sequences. Nature 2022; 607:176-184. [PMID: 35594906 PMCID: PMC9262863 DOI: 10.1038/s41586-022-04877-w] [Citation(s) in RCA: 78] [Impact Index Per Article: 39.0] [Received: 09/28/2021] [Accepted: 05/17/2022] [Indexed: 01/03/2023]
Abstract
Gene regulation in the human genome is controlled by distal enhancers that activate specific nearby promoters [1]. A proposed model for this specificity is that promoters have sequence-encoded preferences for certain enhancers, for example, mediated by interacting sets of transcription factors or cofactors [2]. This 'biochemical compatibility' model has been supported by observations at individual human promoters and by genome-wide measurements in Drosophila [3-9]. However, the degree to which human enhancers and promoters are intrinsically compatible has not yet been systematically measured, and how their activities combine to control RNA expression remains unclear. Here we designed a high-throughput reporter assay called enhancer × promoter self-transcribing active regulatory region sequencing (ExP STARR-seq) and applied it to examine the combinatorial compatibilities of 1,000 enhancer and 1,000 promoter sequences in human K562 cells. We identify simple rules for enhancer-promoter compatibility, whereby most enhancers activate all promoters by similar amounts, and intrinsic enhancer and promoter activities multiplicatively combine to determine RNA output (R² = 0.82). In addition, two classes of enhancers and promoters show subtle preferential effects. Promoters of housekeeping genes contain built-in activating motifs for factors such as GABPA and YY1, which decrease the responsiveness of promoters to distal enhancers. Promoters of variably expressed genes lack these motifs and show stronger responsiveness to enhancers. Together, this systematic assessment of enhancer-promoter compatibility suggests a multiplicative model tuned by enhancer and promoter class to control gene transcription in the human genome.
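The multiplicative rule described in this abstract can be illustrated with a small sketch. This is not the authors' analysis code: the function name `fit_multiplicative`, the toy activity values, and the fitting procedure (row and column means of log-transformed output) are all assumptions chosen for illustration. If output equals enhancer activity times promoter activity, then log output is a sum of an enhancer term and a promoter term, and the share of variance explained by that additive fit is an R² of the kind the abstract reports.

```python
import math


def fit_multiplicative(expr):
    """Fit log(expr[i][j]) ~ grand + enh[i] + pro[j] and return (enh, pro, r2).

    expr[i][j] is a hypothetical reporter output for enhancer i paired with
    promoter j; all values must be positive. For a complete matrix, the
    least-squares additive fit in log space is given by row and column means.
    """
    logs = [[math.log(v) for v in row] for row in expr]
    n_enh, n_pro = len(logs), len(logs[0])
    grand = sum(map(sum, logs)) / (n_enh * n_pro)
    # Enhancer and promoter effects, centred on the grand mean.
    enh = [sum(row) / n_pro - grand for row in logs]
    pro = [sum(logs[i][j] for i in range(n_enh)) / n_enh - grand
           for j in range(n_pro)]
    ss_res = sum((logs[i][j] - (grand + enh[i] + pro[j])) ** 2
                 for i in range(n_enh) for j in range(n_pro))
    ss_tot = sum((logs[i][j] - grand) ** 2
                 for i in range(n_enh) for j in range(n_pro))
    return enh, pro, 1.0 - ss_res / ss_tot


# Toy data (invented values): output is exactly enhancer x promoter.
enhancers = [1.0, 2.0, 4.0]
promoters = [0.5, 1.0, 3.0]
exact = [[e * p for p in promoters] for e in enhancers]
enh, pro, r2_exact = fit_multiplicative(exact)

# Make one enhancer-promoter pair deviate from the multiplicative rule.
perturbed = [row[:] for row in exact]
perturbed[0][2] *= 3.0
_, _, r2_perturbed = fit_multiplicative(perturbed)

print(f"R^2 (exact multiplicative data): {r2_exact:.3f}")
print(f"R^2 (one pair-specific deviation): {r2_perturbed:.3f}")
```

In this sketch, pair-specific preferences of the kind the abstract calls "subtle preferential effects" appear as residuals from the additive log-space fit, which lowers R² below 1.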
Affiliation(s)
- Drew T Bergman
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Geisel School of Medicine at Dartmouth, Hanover, NH, USA
- Vincent Liu
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Judhajeet Ray
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Evelyn Jagoda
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Layla Siraj
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Biophysics Graduate Program, Harvard University, Cambridge, MA, USA
- Helen Y Kang
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - BASE Initiative, Betty Irene Moore Children's Heart Center, Lucile Packard Children's Hospital, Stanford University School of Medicine, Stanford, CA, USA
- Joseph Nasser
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Michael Kane
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Antonio Rios
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Tung H Nguyen
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Charles P Fulco
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Bristol Myers Squibb, Cambridge, MA, USA
- Eric S Lander
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Biology, MIT, Cambridge, MA, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Jesse M Engreitz
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - BASE Initiative, Betty Irene Moore Children's Heart Center, Lucile Packard Children's Hospital, Stanford University School of Medicine, Stanford, CA, USA
|
75
|
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat Genet 2022; 54:613-624. [PMID: 35551305 DOI: 10.1038/s41588-022-01048-5] [Citation(s) in RCA: 69] [Impact Index Per Article: 34.5] [Received: 09/20/2021] [Accepted: 03/08/2022] [Indexed: 02/06/2023]
Abstract
Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood, and de novo enhancer design has been challenging. Here, we built a deep-learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally nonequivalent instances of the same TF motif that are determined by motif-flanking sequence and intermotif distances. We validated these rules experimentally and demonstrated that they can be generalized to humans by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
|
76
|
Martinez-Ara M, Comoglio F, van Arensbergen J, van Steensel B. Systematic analysis of intrinsic enhancer-promoter compatibility in the mouse genome. Mol Cell 2022; 82:2519-2531.e6. [PMID: 35594855 PMCID: PMC9278412 DOI: 10.1016/j.molcel.2022.04.009] [Citation(s) in RCA: 50] [Impact Index Per Article: 25.0] [Received: 10/29/2021] [Revised: 02/17/2022] [Accepted: 04/05/2022] [Indexed: 12/12/2022]
Affiliation(s)
- Miguel Martinez-Ara
  - Division of Gene Regulation and Oncode Institute, Netherlands Cancer Institute, 1066 CX Amsterdam, the Netherlands
- Federico Comoglio
  - Division of Gene Regulation and Oncode Institute, Netherlands Cancer Institute, 1066 CX Amsterdam, the Netherlands
- Joris van Arensbergen
  - Division of Gene Regulation and Oncode Institute, Netherlands Cancer Institute, 1066 CX Amsterdam, the Netherlands
- Bas van Steensel
  - Division of Gene Regulation and Oncode Institute, Netherlands Cancer Institute, 1066 CX Amsterdam, the Netherlands
|