1. Kabir A, Bhattarai M, Peterson S, Najman-Licht Y, Rasmussen K, Shehu A, Bishop A, Alexandrov B, Usheva A. DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Nucleic Acids Res 2024; 52:e91. [PMID: 39271116 PMCID: PMC11514457 DOI: 10.1093/nar/gkae783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2024] [Revised: 08/21/2024] [Accepted: 08/29/2024] [Indexed: 09/15/2024] Open
Abstract
It was previously shown that DNA breathing, thermodynamic stability, transcriptional activity, and transcription factor (TF) binding are functionally correlated. To ascertain the precise relationship between TF binding and DNA breathing, we developed the multi-modal deep learning model EPBDxDNABERT-2, which is based on the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA dynamics model. To train EPBDxDNABERT-2, we used chromatin immunoprecipitation sequencing (ChIP-Seq) data comprising 690 ChIP-seq experimental results encompassing 161 distinct TFs and 91 human cell types. EPBDxDNABERT-2 significantly improves the prediction of over 660 TF-DNA, with an increase in the area under the receiver operating characteristic (AUROC) metric of up to 9.6% when compared to the baseline model that does not leverage DNA biophysical properties. We expanded our analysis to an in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (HT-SELEX) dataset of 215 TFs from 27 families, comparing EPBD with established frameworks. The integration of the DNA breathing features with the DNABERT-2 foundational model greatly enhanced TF-binding predictions. Notably, EPBDxDNABERT-2, trained on large-scale multi-species genomes with a cross-attention mechanism, improved predictive power, shedding light on the mechanisms underlying disease-related non-coding variants discovered in genome-wide association studies.
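The AUROC metric used for the comparison above can be illustrated with a minimal Python sketch. It computes AUROC through the Mann-Whitney rank statistic (the fraction of positive/negative pairs ranked correctly, with ties counted as half). The labels and scores are invented toy values, and the names `baseline_scores` and `augmented_scores` are hypothetical; nothing here reproduces the paper's models or results.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): 1 = bound site, 0 = unbound site.
labels = [1, 1, 0, 1, 0, 0]
baseline_scores = [0.9, 0.4, 0.5, 0.7, 0.3, 0.6]    # hypothetical sequence-only model
augmented_scores = [0.9, 0.65, 0.5, 0.7, 0.3, 0.6]  # hypothetical model with extra features

print(auroc(labels, baseline_scores))
print(auroc(labels, augmented_scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a sort-based rank computation would be the usual choice for genome-scale data.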
Affiliation(s)
- Anowarul Kabir
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
  - Department of Computer Science, George Mason University, 4400 University Dr, 22030 VA, USA
- Manish Bhattarai
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
- Selma Peterson
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
- Kim Ø Rasmussen
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
- Amarda Shehu
  - Department of Computer Science, George Mason University, 4400 University Dr, 22030 VA, USA
- Alan R Bishop
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
- Boian Alexandrov
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
- Anny Usheva
  - Department of Surgery, Brown University, 69 Brown St Box 1822, 02912 RI, USA
2. Kabir A, Bhattarai M, Rasmussen KØ, Shehu A, Usheva A, Bishop AR, Alexandrov B. Examining DNA breathing with pyDNA-EPBD. Bioinformatics 2023; 39:btad699. [PMID: 37991847 PMCID: PMC10681863 DOI: 10.1093/bioinformatics/btad699] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/08/2023] [Revised: 10/23/2023] [Accepted: 11/21/2023] [Indexed: 11/24/2023] Open
Abstract
MOTIVATION: The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to the inherent thermal DNA motion. This dynamics results in transient openings in the double helix and is referred to as "DNA breathing" or "DNA bubbles." The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factor binding. However, the modeling and computer simulation of these phenomena have remained a challenge due to the complex interplay of numerous factors, such as temperature, salt content, DNA sequence, hydrogen bonding, base stacking, and others. RESULTS: We present pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail. The pyDNA-EPBD generates genomic scale profiles of average base-pair openings, base flipping probability, DNA bubble probability, and calculations of the characteristically dynamic length indicating the number of base pairs statistically significantly affected by a single point mutation using the Markov Chain Monte Carlo algorithm. AVAILABILITY AND IMPLEMENTATION: pyDNA-EPBD is supported across most operating systems and is freely available at https://github.com/lanl/pyDNA_EPBD. Extensive documentation can be found at https://lanl.github.io/pyDNA_EPBD/.
Affiliation(s)
- Anowarul Kabir
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87544, United States
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
- Manish Bhattarai
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87544, United States
- Kim Ø Rasmussen
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87544, United States
- Amarda Shehu
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
- Anny Usheva
  - Department of Surgery, Brown University, Providence, RI 02912, United States
- Alan R Bishop
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87544, United States
- Boian Alexandrov
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87544, United States
3. Kabir A, Bhattarai M, Rasmussen KØ, Shehu A, Usheva A, Bishop AR, Alexandrov BS. Examining DNA Breathing with pyDNA-EPBD. bioRxiv 2023:2023.09.09.557010. [PMID: 37745370 PMCID: PMC10515784 DOI: 10.1101/2023.09.09.557010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Motivation: The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to the inherent thermal DNA motion. This dynamics results in transient openings in the double helix and is referred to as "DNA breathing" or "DNA bubbles." The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factor binding. However, the modeling and computer simulation of these phenomena have remained a challenge due to the complex interplay of numerous factors, such as temperature, salt content, DNA sequence, hydrogen bonding, base stacking, and others. Results: We present pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail. The pyDNA-EPBD generates genomic scale profiles of average base-pair openings, base flipping probability, DNA bubble probability, and calculations of the characteristically dynamic length indicating the number of base pairs statistically significantly affected by a single point mutation using the Markov Chain Monte Carlo (MCMC) algorithm.
Affiliation(s)
- Anowarul Kabir
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, 87102
  - George Mason University, 4400 University Dr, Fairfax, VA 22030
- Manish Bhattarai
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, 87102
- Kim Ø. Rasmussen
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, 87102
- Amarda Shehu
  - George Mason University, 4400 University Dr, Fairfax, VA 22030
- Anny Usheva
  - Brown University, 69 Brown St Box 1822, Providence, RI 02912
- Alan R Bishop
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, 87102
- Boian S Alexandrov
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, 87102
4. Monteoliva D, Diambra L. Information propagation in a noisy gene cascade. Phys Rev E 2018; 96:012403. [PMID: 29347170 DOI: 10.1103/physreve.96.012403] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/17/2017] [Indexed: 11/07/2022]
Abstract
We use information theory to study the information transmission through a simple gene cascade where the product of an unregulated gene regulates the expression activity of a cooperative genetic switch. While the input signal is provided by the upstream gene with two states, we consider that the expression of the downstream gene is controlled by a cis-regulatory system with three binding sites for the regulator product, which can bind cooperatively. By computing exactly the associated probability distributions, we estimate information transmission through the mutual information measure. We found that the mutual information associated with a unimodal input signal is lower than that associated with bimodal inputs. We also observe that mutual information presents a maximum in the cooperativity intensity, and the position of this maximum depends on the kinetic rates of the promoter. Furthermore, we found that the bursting dynamics of the input signal can enhance the information transmission capacity.
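The mutual information measure referred to above has a standard definition for discrete variables, I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x) p(y))]. The Python sketch below evaluates it for a joint probability table; the function name and the two toy tables are assumptions made for illustration and are not taken from the paper, whose distributions come from an exact treatment of the gene cascade.

```python
import math


def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal over inputs X
    py = [sum(col) for col in zip(*joint)]      # marginal over outputs Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:                         # 0 * log(0) is taken as 0
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


# Toy tables (assumed): rows = upstream gene state, columns = downstream output level.
independent = [[0.25, 0.25],
               [0.25, 0.25]]   # output carries no information about the input
perfect = [[0.5, 0.0],
           [0.0, 0.5]]         # output fully determines the input

print(mutual_information(independent))
print(mutual_information(perfect))
```

The two tables bracket the possible range for a two-state input: a noisy cascade falls somewhere between no transmitted information and one full bit.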
Affiliation(s)
- D Monteoliva
  - Departamento de Física, Facultad de Ciencias Exactas, Universidad Nacional de La Plata, Argentina
- L Diambra
  - Centro Regional de Estudios Genómicos, Universidad Nacional de La Plata, CONICET, Argentina
5. Brautigam CA, Zhao H, Vargas C, Keller S, Schuck P. Integration and global analysis of isothermal titration calorimetry data for studying macromolecular interactions. Nat Protoc 2016; 11:882-94. [PMID: 27055097 PMCID: PMC7466939 DOI: 10.1038/nprot.2016.044] [Citation(s) in RCA: 191] [Impact Index Per Article: 21.2] [Indexed: 11/09/2022]
Abstract
Isothermal titration calorimetry (ITC) is a powerful and widely used method to measure the energetics of macromolecular interactions by recording a thermogram of differential heating power during a titration. However, traditional ITC analysis is limited by stochastic thermogram noise and by the limited information content of a single titration experiment. Here we present a protocol for bias-free thermogram integration based on automated shape analysis of the injection peaks, followed by combination of isotherms from different calorimetric titration experiments into a global analysis, statistical analysis of binding parameters and graphical presentation of the results. This is performed using the integrated public-domain software packages NITPIC, SEDPHAT and GUSSI. The recently developed low-noise thermogram integration approach and global analysis allow for more precise parameter estimates and more reliable quantification of multisite and multicomponent cooperative and competitive interactions. Titration experiments typically take 1-2.5 h each, and global analysis usually takes 10-20 min.
Affiliation(s)
- Chad A. Brautigam
  - Department of Biophysics, The University of Texas Southwestern Medical Center, Dallas, Texas, U.S.A
- Huaying Zhao
  - Dynamics of Macromolecular Assembly Section, Laboratory of Cellular Imaging and Macromolecular Biophysics, National Institute of Biomedical Imaging and Bioengineering, National Institutes of Health, Bethesda, U.S.A
- Carolyn Vargas
  - Molecular Biophysics, University of Kaiserslautern, Germany
- Sandro Keller
  - Molecular Biophysics, University of Kaiserslautern, Germany
- Peter Schuck
  - Dynamics of Macromolecular Assembly Section, Laboratory of Cellular Imaging and Macromolecular Biophysics, National Institute of Biomedical Imaging and Bioengineering, National Institutes of Health, Bethesda, U.S.A
6. Zhao H, Piszczek G, Schuck P. SEDPHAT--a platform for global ITC analysis and global multi-method analysis of molecular interactions. Methods 2015; 76:137-148. [PMID: 25477226 PMCID: PMC4380758 DOI: 10.1016/j.ymeth.2014.11.012] [Citation(s) in RCA: 245] [Impact Index Per Article: 24.5] [Received: 10/28/2014] [Revised: 11/19/2014] [Accepted: 11/20/2014] [Indexed: 01/02/2023] Open
Abstract
Isothermal titration calorimetry experiments can provide significantly more detailed information about molecular interactions when combined in global analysis. For example, global analysis can improve the precision of binding affinity and enthalpy, and of possible linkage parameters, even for simple bimolecular interactions, and greatly facilitate the study of multi-site and multi-component systems with competition or cooperativity. A pre-requisite for global analysis is the departure from the traditional binding model, including an 'n'-value describing unphysical, non-integral numbers of sites. Instead, concentration correction factors can be introduced to account for either errors in the concentration determination or for the presence of inactive fractions of material. SEDPHAT is a computer program that embeds these ideas and provides a graphical user interface for the seamless combination of biophysical experiments to be globally modeled with a large number of different binding models. It offers statistical tools for the rigorous determination of parameter errors, correlations, as well as advanced statistical functions for global ITC (gITC) and global multi-method analysis (GMMA). SEDPHAT will also take full advantage of error bars of individual titration data points determined with the unbiased integration software NITPIC. The present communication reviews principles and strategies of global analysis for ITC and its extension to GMMA in SEDPHAT. We will also introduce a new graphical tool for aiding experimental design by surveying the concentration space and generating simulated data sets, which can be subsequently statistically examined for their information content. This procedure can replace the 'c'-value as an experimental design parameter, which ceases to be helpful for multi-site systems and in the context of gITC.
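The idea of replacing a non-integral 'n'-value with a concentration correction factor can be illustrated with a minimal Python sketch of simple 1:1 binding (A + B ⇌ AB). The complex concentration follows from the standard quadratic solution of the mass-action law, and the nominal concentration of A is scaled by an active fraction before solving. The function and the parameter `a_active_fraction` are hypothetical names chosen for this example; this is not SEDPHAT's implementation or interface.

```python
import math


def complex_concentration(a_total, b_total, kd, a_active_fraction=1.0):
    """Equilibrium [AB] for 1:1 binding, with a correction for inactive A.

    All concentrations and kd share the same units. a_active_fraction is an
    assumed correction factor: the fraction of nominal A that can bind.
    """
    a = a_total * a_active_fraction             # binding-competent A
    s = a + b_total + kd
    # Physically meaningful root of [AB]^2 - s*[AB] + a*b_total = 0.
    return (s - math.sqrt(s * s - 4.0 * a * b_total)) / 2.0


# Toy values (assumed): 1 uM of each partner, Kd = 0.5 uM.
fully_active = complex_concentration(1.0, 1.0, 0.5)
half_active = complex_concentration(1.0, 1.0, 0.5, a_active_fraction=0.5)

print(fully_active)
print(half_active)
```

Keeping the stoichiometry fixed at one site and moving the uncertainty into the concentration is what allows several titrations to share a single binding model in a global fit.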
Affiliation(s)
- Huaying Zhao
  - Dynamics of Macromolecular Assembly Section, Laboratory of Cellular Imaging and Macromolecular Biophysics, National Institute of Biomedical Imaging and Bioengineering, National Institutes of Health, Bethesda, MD 20892, USA
- Grzegorz Piszczek
  - Biochemistry and Biophysics Center, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Peter Schuck
  - Dynamics of Macromolecular Assembly Section, Laboratory of Cellular Imaging and Macromolecular Biophysics, National Institute of Biomedical Imaging and Bioengineering, National Institutes of Health, Bethesda, MD 20892, USA
7. Cayrou C, Coulombe P, Puy A, Rialle S, Kaplan N, Segal E, Méchali M. New insights into replication origin characteristics in metazoans. Cell Cycle 2012; 11:658-67. [PMID: 22373526 DOI: 10.4161/cc.11.4.19097] [Citation(s) in RCA: 131] [Impact Index Per Article: 10.1] [Indexed: 11/19/2022] Open
Abstract
We recently reported the identification and characterization of DNA replication origins (Oris) in metazoan cell lines. Here, we describe additional bioinformatic analyses showing that the previously identified GC-rich sequence elements form origin G-rich repeated elements (OGREs) that are present in 67% to 90% of the DNA replication origins from Drosophila to human cells, respectively. Our analyses also show that initiation of DNA synthesis takes place precisely at 160 bp (Drosophila) and 280 bp (mouse) from the OGRE. We also found that in most CpG islands, an OGRE is positioned in opposite orientation on each of the two DNA strands and detected two sites of initiation of DNA synthesis upstream or downstream of each OGRE. Conversely, Oris not associated with CpG islands have a single initiation site. OGRE density along chromosomes correlated with previously published replication timing data. Ori sequences centered on the OGRE are also predicted to have high intrinsic nucleosome occupancy. Finally, OGREs predict G-quadruplex structures at Oris that might be structural elements controlling the choice or activation of replication origins.
8. Cooperative binding of transcription factors promotes bimodal gene expression response. PLoS One 2012; 7:e44812. [PMID: 22984566 PMCID: PMC3440358 DOI: 10.1371/journal.pone.0044812] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 11/29/2011] [Accepted: 08/13/2012] [Indexed: 12/14/2022] Open
Abstract
In the present work we extend and analyze the scope of our recently proposed stochastic model for transcriptional regulation, which considers an arbitrarily complex cis-regulatory system using only elementary reactions. Previously, we determined the role of cooperativity on the intrinsic fluctuations of gene expression for activating transcriptional switches, by means of master equation formalism and computer simulation. This model allowed us to distinguish between two cooperative binding mechanisms and, even though the mean expression levels were not affected differently by the acting mechanism, we showed that the associated fluctuations were different. In the present generalized model we include other regulatory functions in addition to those associated with an activator switch. Namely, we introduce repressive regulatory functions and two theoretical mechanisms that account for the biphasic response that some cis-regulatory systems show to the transcription factor concentration. We have also extended our previous master equation formalism in order to include protein production by stochastic translation of mRNA. Furthermore, we examine the graded/binary scenarios in the context of the interaction energy between transcription factors. In this sense, this is the first report to show that the cooperative binding of transcription factors to DNA promotes the "all-or-none" phenomenon observed in eukaryotic systems. In addition, we confirm that gene expression fluctuation levels associated with one of the two cooperative binding mechanisms never exceed the fluctuation levels of the other.
9. Ghai R, Falconer RJ, Collins BM. Applications of isothermal titration calorimetry in pure and applied research--survey of the literature from 2010. J Mol Recognit 2012; 25:32-52. [PMID: 22213449 DOI: 10.1002/jmr.1167] [Citation(s) in RCA: 123] [Impact Index Per Article: 9.5] [Indexed: 01/04/2023]
Abstract
Isothermal titration calorimetry (ITC) is a biophysical technique for measuring the formation and dissociation of molecular complexes and has become an invaluable tool in many branches of science from cell biology to food chemistry. By measuring the heat absorbed or released during bond formation, ITC provides accurate, rapid, and label-free measurement of the thermodynamics of molecular interactions. In this review, we survey the recent literature reporting the use of ITC and highlight a number of interesting studies that provide a flavour of the diverse systems to which ITC can be applied. These include measurements of protein-protein and protein-membrane interactions required for macromolecular assembly, analysis of enzyme kinetics, experimental validation of molecular dynamics simulations, and even manufacturing applications such as food science. Some highlights include studies of the biological complex formed by Staphylococcus aureus enterotoxin C3 and the murine T-cell receptor, the mechanism of membrane association of the Parkinson's disease-associated protein α-synuclein, and the role of non-specific tannin-protein interactions in the quality of different beverages. Recent developments in automation are overcoming limitations on throughput imposed by previous manual procedures and promise to greatly extend the usefulness of ITC in the future. We also attempt to impart some practical advice for getting the most out of ITC data for those researchers less familiar with the method.
Affiliation(s)
- Rajesh Ghai
  - Institute for Molecular Bioscience (IMB), University of Queensland, St. Lucia, Queensland, 4072, Australia