1
|
Xi X, Li J, Jia J, Meng Q, Li C, Wang X, Wei L, Zhang X. A mechanism-informed deep neural network enables prioritization of regulators that drive cell state transitions. Nat Commun 2025; 16:1284. [PMID: 39900922 PMCID: PMC11790924 DOI: 10.1038/s41467-025-56475-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2024] [Accepted: 01/15/2025] [Indexed: 02/05/2025] Open
Abstract
Cells are regulated at multiple levels, from regulations of individual genes to interactions across multiple genes. Some recent neural network models can connect molecular changes to cellular phenotypes, but their design lacks modeling of regulatory mechanisms, limiting the decoding of regulations behind key cellular events, such as cell state transitions. Here, we present regX, a deep neural network incorporating both gene-level regulation and gene-gene interaction mechanisms, which enables prioritizing potential driver regulators of cell state transitions and providing mechanistic interpretations. Applied to single-cell multi-omics data on type 2 diabetes and hair follicle development, regX reliably prioritizes key transcription factors and candidate cis-regulatory elements that drive cell state transitions. Some regulators reveal potential new therapeutic targets, drug repurposing possibilities, and putative causal single nucleotide polymorphisms. This method to analyze single-cell multi-omics data demonstrates how the interpretable design of neural networks can better decode biological systems.
Collapse
Affiliation(s)
- Xi Xi
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China
| | - Jiaqi Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China
| | - Jinmeng Jia
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China
| | - Qiuchen Meng
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China
| | - Chen Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China
| | - Xiaowo Wang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China
| | - Lei Wei
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China
| | - Xuegong Zhang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST / Department of Automation, Tsinghua University, Beijing, China.
- School of Life Sciences, Tsinghua University, Beijing, China.
| |
Collapse
|
2
|
Mekkaoui F, Drewell RA, Dresch JM, Spratt DE. Experimental approaches to investigate biophysical interactions between homeodomain transcription factors and DNA. BIOCHIMICA ET BIOPHYSICA ACTA. GENE REGULATORY MECHANISMS 2024; 1868:195074. [PMID: 39644990 DOI: 10.1016/j.bbagrm.2024.195074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/03/2024] [Revised: 11/26/2024] [Accepted: 12/01/2024] [Indexed: 12/09/2024]
Abstract
Homeodomain transcription factors (TFs) bind to specific DNA sequences to regulate the expression of target genes. Structural work has provided insight into molecular identities and aided in unraveling structural features of these TFs. However, the detailed affinity and specificity by which these TFs bind to DNA sequences is still largely unknown. Qualitative methods, such as DNA footprinting, Electrophoretic Mobility Shift Assays (EMSAs), Systematic Evolution of Ligands by Exponential Enrichment (SELEX), Bacterial One Hybrid (B1H) systems, Surface Plasmon Resonance (SPR), and Protein Binding Microarrays (PBMs) have been widely used to investigate the biochemical characteristics of TF-DNA binding events. In addition to these qualitative methods, bioinformatic approaches have also assisted in TF binding site discovery. Here we discuss the advantages and limitations of these different approaches, as well as the benefits of utilizing more quantitative approaches, such as Mechanically Induced Trapping of Molecular Interactions (MITOMI), Microscale Thermophoresis (MST) and Isothermal Titration Calorimetry (ITC), in determining the biophysical basis of binding specificity of TF-DNA complexes and improving upon existing computational approaches aimed at affinity predictions.
Collapse
Affiliation(s)
- Fadwa Mekkaoui
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, MA 01610, United States of America
| | - Robert A Drewell
- Biology Department, Clark University, 950 Main Street, Worcester, MA 01610, United States of America
| | - Jacqueline M Dresch
- Biology Department, Clark University, 950 Main Street, Worcester, MA 01610, United States of America
| | - Donald E Spratt
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, MA 01610, United States of America.
| |
Collapse
|
3
|
Tahir M, Norouzi M, Khan SS, Davie JR, Yamanaka S, Ashraf A. Artificial intelligence and deep learning algorithms for epigenetic sequence analysis: A review for epigeneticists and AI experts. Comput Biol Med 2024; 183:109302. [PMID: 39500240 DOI: 10.1016/j.compbiomed.2024.109302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Revised: 09/22/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024]
Abstract
Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer-promoter interaction, and chromatin states. The purpose of this review is twofold as it is addressed to both AI experts and epigeneticists. For AI researchers, we provided a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, given each of the above problems we provide a list of candidate AI solutions in the literature. We have also identified several gaps in the literature, research challenges, and recommendations to address these challenges.
Collapse
Affiliation(s)
- Muhammad Tahir
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Mahboobeh Norouzi
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Shehroz S Khan
- College of Engineering and Technology, American University of the Middle East, Kuwait
| | - James R Davie
- Department of Biochemistry and Medical Genetics, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
| | - Soichiro Yamanaka
- Graduate School of Science, Department of Biophysics and Biochemistry, University of Tokyo, Japan
| | - Ahmed Ashraf
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada.
| |
Collapse
|
4
|
Yin Q, Chen L. CellTICS: an explainable neural network for cell-type identification and interpretation based on single-cell RNA-seq data. Brief Bioinform 2023; 25:bbad449. [PMID: 38061196 PMCID: PMC10703497 DOI: 10.1093/bib/bbad449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2023] [Revised: 10/30/2023] [Accepted: 11/14/2023] [Indexed: 12/18/2023] Open
Abstract
Identifying cell types is crucial for understanding the functional units of an organism. Machine learning has shown promising performance in identifying cell types, but many existing methods lack biological significance due to poor interpretability. However, it is of the utmost importance to understand what makes cells share the same function and form a specific cell type, motivating us to propose a biologically interpretable method. CellTICS prioritizes marker genes with cell-type-specific expression, using a hierarchy of biological pathways for neural network construction, and applying a multi-predictive-layer strategy to predict cell and sub-cell types. CellTICS usually outperforms existing methods in prediction accuracy. Moreover, CellTICS can reveal pathways that define a cell type or a cell type under specific physiological conditions, such as disease or aging. The nonlinear nature of neural networks enables us to identify many novel pathways. Interestingly, some of the pathways identified by CellTICS exhibit differential expression "variability" rather than differential expression across cell types, indicating that expression stochasticity within a pathway could be an important feature characteristic of a cell type. Overall, CellTICS provides a biologically interpretable method for identifying and characterizing cell types, shedding light on the underlying pathways that define cellular heterogeneity and its role in organismal function. CellTICS is available at https://github.com/qyyin0516/CellTICS.
Collapse
Affiliation(s)
- Qingyang Yin
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
| | - Liang Chen
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
| |
Collapse
|
5
|
Balcı AT, Ebeid MM, Benos PV, Kostka D, Chikina M. An intrinsically interpretable neural network architecture for sequence-to-function learning. Bioinformatics 2023; 39:i413-i422. [PMID: 37387140 PMCID: PMC10311317 DOI: 10.1093/bioinformatics/btad271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one can often not explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called totally interpretable sequence-to-function model (tiSFM). tiSFM improves upon the performance of standard multilayer convolutional models while using fewer parameters. Additionally, while tiSFM is itself technically a multilayer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs. RESULTS We analyze published open chromatin measurements across hematopoietic lineage cell-types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context-specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on a complex task of predicting the change in epigenetic state as a function of developmental transition. AVAILABILITY AND IMPLEMENTATION The source code, including scripts for the analysis of key findings, can be found at https://github.com/boooooogey/ATAConv, implemented in Python.
Collapse
Affiliation(s)
- Ali Tuğrul Balcı
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| | - Mark Maher Ebeid
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| | - Panayiotis V Benos
- Department of Epidemiology, University of Florida, Gainesville, FL 32610, United States
| | - Dennis Kostka
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
- Department of Developmental Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| | - Maria Chikina
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| |
Collapse
|
6
|
Gautam P, Sinha SK. Theoretical investigation of functional responses of bio-molecular assembly networks. SOFT MATTER 2023; 19:3803-3817. [PMID: 37191191 DOI: 10.1039/d2sm01530g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
Cooperative protein-protein and protein-DNA interactions form programmable complex assemblies, often performing non-linear gene regulatory operations involved in signal transductions and cell fate determination. The apparent structure of those complex assemblies is very similar, but their functional response strongly depends on the topology of the protein-DNA interaction networks. Here, we demonstrate how the coordinated self-assembly creates gene regulatory network motifs that corroborate the existence of a precise functional response at the molecular level using thermodynamic and dynamic analyses. Our theoretical and Monte Carlo simulations show that a complex network of interactions can form a decision-making loop, such as feedback and feed-forward circuits, only by a few molecular mechanisms. We characterize each possible network of interactions by systematic variations of free energy parameters associated with the binding among biomolecules and DNA looping. We also find that the higher-order networks exhibit alternative steady states from the stochastic dynamics of each network. We capture this signature by calculating stochastic potentials and attributing their multi-stability features. We validate our findings against the Gal promoter system in yeast cells. Overall, we show that the network topology is vital in phenotype diversity in regulatory circuits.
Collapse
Affiliation(s)
- Pankaj Gautam
- Theoretical and Computational Biophysical Chemistry Group, Department of Chemistry, Indian Institute of Technology Ropar, Rupnagar, Punjab 140001, India.
| | - Sudipta Kumar Sinha
- Theoretical and Computational Biophysical Chemistry Group, Department of Chemistry, Indian Institute of Technology Ropar, Rupnagar, Punjab 140001, India.
| |
Collapse
|
7
|
Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat Rev Genet 2023; 24:125-137. [PMID: 36192604 DOI: 10.1038/s41576-022-00532-2] [Citation(s) in RCA: 96] [Impact Index Per Article: 48.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/31/2022] [Indexed: 01/24/2023]
Abstract
Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.
Collapse
Affiliation(s)
- Gherman Novakovsky
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada.,Bioinformatics Graduate Program, University of British Columbia, Vancouver, British Columbia, Canada
| | - Nick Dexter
- Department of Mathematics, Simon Fraser University, Burnaby, British Columbia, Canada.,School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Maxwell W Libbrecht
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada.
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, British Columbia, Canada.
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA. .,Canadian Institute for Advanced Research, Toronto, Ontario, Canada.
| |
Collapse
|
8
|
Jiang Y, Li X, Luo H, Yin S, Kaynak O. Quo vadis artificial intelligence? DISCOVER ARTIFICIAL INTELLIGENCE 2022. [DOI: 10.1007/s44163-022-00022-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
AbstractThe study of artificial intelligence (AI) has been a continuous endeavor of scientists and engineers for over 65 years. The simple contention is that human-created machines can do more than just labor-intensive work; they can develop human-like intelligence. Being aware or not, AI has penetrated into our daily lives, playing novel roles in industry, healthcare, transportation, education, and many more areas that are close to the general public. AI is believed to be one of the major drives to change socio-economical lives. In another aspect, AI contributes to the advancement of state-of-the-art technologies in many fields of study, as helpful tools for groundbreaking research. However, the prosperity of AI as we witness today was not established smoothly. During the past decades, AI has struggled through historical stages with several winters. Therefore, at this juncture, to enlighten future development, it is time to discuss the past, present, and have an outlook on AI. In this article, we will discuss from a historical perspective how challenges were faced on the path of revolution of both the AI tools and the AI systems. Especially, in addition to the technical development of AI in the short to mid-term, thoughts and insights are also presented regarding the symbiotic relationship of AI and humans in the long run.
Collapse
|
9
|
Frank SA. Optimization of Transcription Factor Genetic Circuits. BIOLOGY 2022; 11:1294. [PMID: 36138773 PMCID: PMC9495410 DOI: 10.3390/biology11091294] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/01/2022] [Revised: 08/27/2022] [Accepted: 08/30/2022] [Indexed: 04/13/2023]
Abstract
Transcription factors (TFs) affect the production of mRNAs. In essence, the TFs form a large computational network that controls many aspects of cellular function. This article introduces a computational method to optimize TF networks. The method extends recent advances in artificial neural network optimization. In a simple example, computational optimization discovers a four-dimensional TF network that maintains a circadian rhythm over many days, successfully buffering strong stochastic perturbations in molecular dynamics and entraining to an external day-night signal that randomly turns on and off at intervals of several days. This work highlights the similar challenges in understanding how computational TF and neural networks gain information and improve performance.
Collapse
Affiliation(s)
- Steven A Frank
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA
| |
Collapse
|
10
|
Nilsson A, Peters JM, Meimetis N, Bryson B, Lauffenburger DA. Artificial neural networks enable genome-scale simulations of intracellular signaling. Nat Commun 2022; 13:3069. [PMID: 35654811 PMCID: PMC9163072 DOI: 10.1038/s41467-022-30684-y] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2021] [Accepted: 05/11/2022] [Indexed: 12/14/2022] Open
Abstract
Mammalian cells adapt their functional state in response to external signals in form of ligands that bind receptors on the cell-surface. Mechanistically, this involves signal-processing through a complex network of molecular interactions that govern transcription factor activity patterns. Computer simulations of the information flow through this network could help predict cellular responses in health and disease. Here we develop a recurrent neural network framework constrained by prior knowledge of the signaling network with ligand-concentrations as input and transcription factor-activity as output. Applied to synthetic data, it predicts unseen test-data (Pearson correlation r = 0.98) and the effects of gene knockouts (r = 0.8). We stimulate macrophages with 59 different ligands, with and without the addition of lipopolysaccharide, and collect transcriptomics data. The framework predicts this data under cross-validation (r = 0.8) and knockout simulations suggest a role for RIPK1 in modulating the lipopolysaccharide response. This work demonstrates the feasibility of genome-scale simulations of intracellular signaling.
Collapse
Affiliation(s)
- Avlant Nilsson
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, SE 41296, Sweden
- Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, 02139, USA
| | - Joshua M Peters
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, 02139, USA
| | - Nikolaos Meimetis
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Bryan Bryson
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, 02139, USA
| | - Douglas A Lauffenburger
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
- Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, 02139, USA.
| |
Collapse
|
11
|
Duk MA, Gursky VV, Samsonova MG, Surkova SY. Application of Domain- and Genotype-Specific Models to Infer Post-Transcriptional Regulation of Segmentation Gene Expression in Drosophila. Life (Basel) 2021; 11:life11111232. [PMID: 34833107 PMCID: PMC8618293 DOI: 10.3390/life11111232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2021] [Revised: 11/05/2021] [Accepted: 11/10/2021] [Indexed: 11/16/2022] Open
Abstract
Unlike transcriptional regulation, the post-transcriptional mechanisms underlying zygotic segmentation gene expression in early Drosophila embryo have been insufficiently investigated. Condition-specific post-transcriptional regulation plays an important role in the development of many organisms. Our recent study revealed the domain- and genotype-specific differences between mRNA and the protein expression of Drosophila hb, gt, and eve genes in cleavage cycle 14A. Here, we use this dataset and the dynamic mathematical model to recapitulate protein expression from the corresponding mRNA patterns. The condition-specific nonuniformity in parameter values is further interpreted in terms of possible post-transcriptional modifications. For hb expression in wild-type embryos, our results predict the position-specific differences in protein production. The protein synthesis rate parameter is significantly higher in hb anterior domain compared to the posterior domain. The parameter sets describing Gt protein dynamics in wild-type embryos and Kr mutants are genotype-specific. The spatial discrepancy between gt mRNA and protein posterior expression in Kr mutants is well reproduced by the whole axis model, thus rejecting the involvement of post-transcriptional mechanisms. Our models fail to describe the full dynamics of eve expression, presumably due to its complex shape and the variable time delays between mRNA and protein patterns, which likely require a more complex model. Overall, our modeling approach enables the prediction of regulatory scenarios underlying the condition-specific differences between mRNA and protein expression in early embryo.
Collapse
Affiliation(s)
- Maria A. Duk
- Mathematical Biology and Bioinformatics Laboratory, Peter the Great Saint Petersburg Polytechnic University, 195251 St. Petersburg, Russia; (M.A.D.); (M.G.S.)
- Theoretical Department, Ioffe Institute, 194021 St. Petersburg, Russia;
| | - Vitaly V. Gursky
- Theoretical Department, Ioffe Institute, 194021 St. Petersburg, Russia;
| | - Maria G. Samsonova
- Mathematical Biology and Bioinformatics Laboratory, Peter the Great Saint Petersburg Polytechnic University, 195251 St. Petersburg, Russia; (M.A.D.); (M.G.S.)
| | - Svetlana Yu. Surkova
- Mathematical Biology and Bioinformatics Laboratory, Peter the Great Saint Petersburg Polytechnic University, 195251 St. Petersburg, Russia; (M.A.D.); (M.G.S.)
- Correspondence:
| |
Collapse
|
12
|
Dibaeinia P, Sinha S. Deciphering enhancer sequence using thermodynamics-based models and convolutional neural networks. Nucleic Acids Res 2021; 49:10309-10327. [PMID: 34508359 PMCID: PMC8501998 DOI: 10.1093/nar/gkab765] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2021] [Revised: 08/18/2021] [Accepted: 08/25/2021] [Indexed: 11/18/2022] Open
Abstract
Deciphering the sequence-function relationship encoded in enhancers holds the key to interpreting non-coding variants and understanding mechanisms of transcriptomic variation. Several quantitative models exist for predicting enhancer function and underlying mechanisms; however, there has been no systematic comparison of these models characterizing their relative strengths and shortcomings. Here, we interrogated a rich data set of neuroectodermal enhancers in Drosophila, representing cis- and trans- sources of expression variation, with a suite of biophysical and machine learning models. We performed rigorous comparisons of thermodynamics-based models implementing different mechanisms of activation, repression and cooperativity. Moreover, we developed a convolutional neural network (CNN) model, called CoNSEPT, that learns enhancer 'grammar' in an unbiased manner. CoNSEPT is the first general-purpose CNN tool for predicting enhancer function in varying conditions, such as different cell types and experimental conditions, and we show that such complex models can suggest interpretable mechanisms. We found model-based evidence for mechanisms previously established for the studied system, including cooperative activation and short-range repression. The data also favored one hypothesized activation mechanism over another and suggested an intriguing role for a direct, distance-independent repression mechanism. Our modeling shows that while fundamentally different models can yield similar fits to data, they vary in their utility for mechanistic inference. CoNSEPT is freely available at: https://github.com/PayamDiba/CoNSEPT.
Collapse
Affiliation(s)
- Payam Dibaeinia
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
13
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Collapse
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Mariia Kokina
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Victor Garcia
- School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Science for Life Laboratory, Stockholm, Sweden
| |
Collapse
|
14
|
Matthews ML, Marshall-Colón A. Multiscale plant modeling: from genome to phenome and beyond. Emerg Top Life Sci 2021; 5:231-237. [PMID: 33543231 PMCID: PMC8166335 DOI: 10.1042/etls20200276] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2020] [Revised: 01/18/2021] [Accepted: 01/20/2021] [Indexed: 01/08/2023]
Abstract
Plants are complex organisms that adapt to changes in their environment using an array of regulatory mechanisms that span across multiple levels of biological organization. Due to this complexity, it is difficult to predict emergent properties using conventional approaches that focus on single levels of biology such as the genome, transcriptome, or metabolome. Mathematical models of biological systems have emerged as useful tools for exploring pathways and identifying gaps in our current knowledge of biological processes. Identification of emergent properties, however, requires their vertical integration across biological scales through multiscale modeling. Multiscale models that capture and predict these emergent properties will allow us to predict how plants will respond to a changing climate and explore strategies for plant engineering. In this review, we (1) summarize the recent developments in plant multiscale modeling; (2) examine multiscale models of microbial systems that offer insight to potential future directions for the modeling of plant systems; (3) discuss computational tools and resources for developing multiscale models; and (4) examine future directions of the field.
Collapse
Affiliation(s)
- Megan L Matthews
- Department of Civil and Environmental Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
- Institute for Sustainability, Energy, and Environment, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
| | - Amy Marshall-Colón
- Institute for Sustainability, Energy, and Environment, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
- Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
| |
Collapse
|
15
|
Fortelny N, Bock C. Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data. Genome Biol 2020; 21:190. [PMID: 32746932 PMCID: PMC7397672 DOI: 10.1186/s13059-020-02100-5] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2020] [Accepted: 07/10/2020] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Deep learning has emerged as a versatile approach for predicting complex biological phenomena. However, its utility for biological discovery has so far been limited, given that generic deep neural networks provide little insight into the biological mechanisms that underlie a successful prediction. Here we demonstrate deep learning on biological networks, where every node has a molecular equivalent, such as a protein or gene, and every edge has a mechanistic interpretation, such as a regulatory interaction along a signaling pathway. RESULTS With knowledge-primed neural networks (KPNNs), we exploit the ability of deep learning algorithms to assign meaningful weights in multi-layered networks, resulting in a widely applicable approach for interpretable deep learning. We present a learning method that enhances the interpretability of trained KPNNs by stabilizing node weights in the presence of redundancy, enhancing the quantitative interpretability of node weights, and controlling for uneven connectivity in biological networks. We validate KPNNs on simulated data with known ground truth and demonstrate their practical use and utility in five biological applications with single-cell RNA-seq data for cancer and immune cells. CONCLUSIONS We introduce KPNNs as a method that combines the predictive power of deep learning with the interpretability of biological networks. While demonstrated here on single-cell sequencing data, this method is broadly relevant to other research areas where prior domain knowledge can be represented as networks.
Collapse
Affiliation(s)
- Nikolaus Fortelny
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
| | - Christoph Bock
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria.
- Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria.
| |
Collapse
|
16
|
Koo PK, Ploenzke M. Deep learning for inferring transcription factor binding sites. CURRENT OPINION IN SYSTEMS BIOLOGY 2020; 19:16-23. [PMID: 32905524 PMCID: PMC7469942 DOI: 10.1016/j.coisb.2020.04.001] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
Deep learning is a powerful tool for predicting transcription factor binding sites from DNA sequence. Despite their high predictive accuracy, there are no guarantees that a high-performing deep learning model will learn causal sequence-function relationships. Thus a move beyond performance comparisons on benchmark datasets is needed. Interpreting model predictions is a powerful approach to identify which features drive performance gains and ideally provide insight into the underlying biological mechanisms. Here we highlight timely advances in deep learning for genomics, with a focus on inferring transcription factors binding sites. We describe recent applications, model architectures, and advances in local and global model interpretability methods, then conclude with a discussion on future research directions.
Collapse
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Matt Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
| |
Collapse
|