1
Zhu YH, Liu Z, Liu Y, Ji Z, Yu DJ. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction. Brief Bioinform 2024; 25:bbae040. [PMID: 38349057 PMCID: PMC10939370 DOI: 10.1093/bib/bbae040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2023] [Revised: 01/02/2024] [Accepted: 01/22/2024] [Indexed: 02/15/2024] Open
Abstract
Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.
Affiliation(s)
- Yi-Heng Zhu
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
- Zi Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yan Liu
  - School of Information Engineering, Yangzhou University, Yangzhou 225000, China
- Zhiwei Ji
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2
Noncanonical DNA Cleavage by BamHI Endonuclease in Laterally Confined DNA Monolayers Is a Step Function of DNA Density and Sequence. Molecules 2022; 27:molecules27165262. [PMID: 36014501 PMCID: PMC9416302 DOI: 10.3390/molecules27165262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2022] [Revised: 08/04/2022] [Accepted: 08/15/2022] [Indexed: 11/17/2022] Open
Abstract
Cleavage of DNA at noncanonical recognition sequences by restriction endonucleases (star activity) in bulk solution can be promoted by global experimental parameters, including enzyme or substrate concentration, temperature, pH, or buffer composition. To study the effect of nanoscale confinement on the noncanonical behaviour of BamHI, which cleaves a single unique sequence of 6 bp, we used AFM nanografting to generate laterally confined DNA monolayers (LCDM) at different densities, either in the form of small patches, several microns in width, or complete monolayers of thiol-modified DNA on a gold surface. We focused on two 44-bp DNAs, each containing a noncanonical BamHI site differing by 2 bp from the cognate recognition sequence. Topographic AFM imaging was used to monitor end-point reactions by measuring the decrease in the LCDM height with respect to the surrounding reference surface. At low DNA densities, BamHI efficiently cleaves only its cognate sequence while at intermediate DNA densities, noncanonical sequence cleavage occurs, and can be controlled in a stepwise (on/off) fashion by varying the DNA density and restriction site sequence. This study shows that endonuclease action on noncanonical sites in confined nanoarchitectures can be modulated by varying local physical parameters, independent of global chemical parameters.
3
Identifying essential proteins from protein-protein interaction networks based on influence maximization. BMC Bioinformatics 2022; 23:339. [PMID: 35974329 PMCID: PMC9380286 DOI: 10.1186/s12859-022-04874-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Accepted: 08/03/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Essential proteins are indispensable to the development and survival of cells. Identifying essential proteins not only helps in understanding the minimal requirements for cell survival, but also has practical significance in disease diagnosis, drug design and medical treatment. With the rapid accumulation of protein-protein interaction (PPI) data, computationally identifying essential proteins from protein-protein interaction networks (PINs) has become increasingly popular. To date, a number of approaches for essential protein identification based on PINs have been developed. RESULTS: In this paper, we propose a new and effective approach called iMEPP to identify essential proteins from PINs by fusing multiple types of biological data and applying the influence maximization mechanism to the PINs. Concretely, we first integrate PPI data, gene expression data and Gene Ontology to construct weighted PINs, to alleviate the impact of the high false-positive rate in the raw PPI data. Then, we define the influence scores of nodes in PINs with both orthology data and PIN topological information. Finally, we develop an influence discount algorithm to identify essential proteins based on the influence maximization mechanism. CONCLUSIONS: We applied our method to identifying essential proteins from the Saccharomyces cerevisiae PIN. Experiments show that our iMEPP method outperforms the existing methods, which validates its effectiveness and advantage.
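The abstract does not spell out the influence discount algorithm, so the following is only a minimal sketch of the general discount idea used in influence maximization, not the iMEPP implementation. It assumes that a node's initial influence score is its weighted degree (iMEPP also draws on orthology data) and that selecting a node lowers each neighbor's score by the weight of the shared edge. The function name `discount_ranking` and the toy network are hypothetical.

```python
from collections import defaultdict


def discount_ranking(edges, k):
    """Greedily pick k nodes from an undirected weighted network.

    edges: iterable of (u, v, weight) tuples.
    After a node is picked, each unpicked neighbor's score is reduced by
    the weight of the edge they share (the assumed "discount" step).
    """
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = w
        neighbors[v][u] = w

    # Assumed initial influence score: weighted degree of each node.
    score = {node: sum(nbrs.values()) for node, nbrs in neighbors.items()}

    selected = []
    for _ in range(min(k, len(score))):
        best = max(score, key=score.get)  # highest remaining score
        selected.append(best)
        del score[best]
        for nbr, w in neighbors[best].items():
            if nbr in score:
                score[nbr] -= w  # discount influence already covered by `best`
    return selected


# Hypothetical toy network with edge weights in [0, 1].
toy_edges = [("A", "B", 0.9), ("A", "C", 0.8), ("B", "C", 0.1), ("D", "E", 0.7)]
top_two = discount_ranking(toy_edges, 2)
```

In the toy network, B has the second-highest initial score, but most of it comes from its edge to A. Once A is picked, the discount removes that contribution and D is chosen instead, which is the behavior that separates a discount strategy from a plain degree ranking.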
4
Malik FK, Guo JT. Insights into protein-DNA interactions from hydrogen bond energy-based comparative protein-ligand analyses. Proteins 2022; 90:1303-1314. [PMID: 35122321 PMCID: PMC9018545 DOI: 10.1002/prot.26313] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 10/22/2021] [Revised: 01/17/2022] [Accepted: 01/31/2022] [Indexed: 01/18/2023]
Abstract
Hydrogen bonds play important roles in protein folding and protein-ligand interactions, particularly in specific protein-DNA recognition. However, the distributions of hydrogen bonds, especially of hydrogen bond energy (HBE), in different types of protein-ligand complexes are unknown. Here we performed a comparative analysis of hydrogen bonds among three non-redundant datasets of protein-protein, protein-peptide, and protein-DNA complexes. Besides comparing the number of hydrogen bonds in terms of types and locations, we investigated the distributions of HBE. Our results indicate that while there is no significant difference in hydrogen bonds within protein chains among the three types of complexes, interfacial hydrogen bonds are significantly more prevalent in protein-DNA complexes. More importantly, the interfacial hydrogen bonds in protein-DNA complexes displayed a unique energy distribution of strong and weak hydrogen bonds, whereas the majority of the interfacial hydrogen bonds in protein-protein and protein-peptide complexes are of predominantly high strength with low energy. Moreover, there is a significant difference in the energy distributions of minor groove hydrogen bonds between protein-DNA complexes with different binding specificity. Highly specific protein-DNA complexes contain more strong hydrogen bonds in the minor groove than multi-specific complexes, suggesting an important role of the minor groove in specific protein-DNA recognition. These results can help better understand protein-DNA interactions and have important implications for improving quality assessments of protein-DNA complex models.
Affiliation(s)
- Fareeha K Malik
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, North Carolina, USA
  - Research Center of Modeling and Simulation, National University of Science and Technology, Islamabad, Pakistan
- Jun-Tao Guo
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, North Carolina, USA
5
Xu W, Gao Y, Wang Y, Guan J. Protein-protein interaction prediction based on ordinal regression and recurrent convolutional neural networks. BMC Bioinformatics 2021; 22:485. [PMID: 34625020 PMCID: PMC8501564 DOI: 10.1186/s12859-021-04369-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/08/2021] [Accepted: 09/02/2021] [Indexed: 11/10/2022] Open
Abstract
Background: Protein-protein interactions (PPIs) are essential to most biological processes. The prediction of PPIs is beneficial to the understanding of protein functions and is thus helpful for pathological analysis, disease diagnosis and drug design. As the amount of protein data is growing fast in the post-genomic era, high-throughput experimental methods are expensive and time-consuming for the prediction of PPIs. Thus, computational methods have attracted researchers' attention in recent years. A large number of computational methods have been proposed based on different protein sequence encoders. Results: Notably, the confidence score of a protein sequence pair can be regarded as a kind of measurement of PPIs. The higher the confidence score for a protein pair is, the more likely the pair interacts. Thus, in this paper, a deep learning framework, called the ordinal regression and recurrent convolutional neural network (OR-RCNN) method, is introduced to predict PPIs from the perspective of the confidence score. It mainly contains two parts: the encoder part for the protein sequence pair and the prediction part for PPIs by confidence score. In the first part, two recurrent convolutional neural networks (RCNNs) with shared parameters are applied to construct two protein sequence embedding vectors, which can automatically extract robust local features and sequential information from the protein pairs. The two embedding vectors are then encoded into one novel embedding vector by element-wise multiplication. By taking the ordinal information behind the confidence score into consideration, ordinal regression is used to construct multiple sub-classifiers in the second part. The results of the multiple sub-classifiers are aggregated to obtain the final confidence score. Following that, the existence of PPIs is determined by the confidence score. We set a threshold θ, and say the interaction exists between the protein pair if its confidence score is bigger than θ. Conclusions: We applied our method to predict PPIs on the S. cerevisiae and Homo sapiens data sets. Through experimental verification, our method outperforms state-of-the-art PPI prediction models.
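The pair encoding and thresholding steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the OR-RCNN implementation: the abstract does not say how the sub-classifier outputs are aggregated, so the sketch assumes a simple mean of K probabilities, where the k-th value estimates the probability that the confidence score exceeds the k-th ordinal cut point. The function names and the default value of the threshold θ (`theta=0.5`) are hypothetical.

```python
def combine_embeddings(vec_a, vec_b):
    """Fuse two protein embedding vectors by element-wise multiplication."""
    if len(vec_a) != len(vec_b):
        raise ValueError("embedding vectors must have the same length")
    return [a * b for a, b in zip(vec_a, vec_b)]


def aggregate_confidence(sub_probs):
    """Aggregate K ordinal sub-classifier outputs into one confidence score.

    sub_probs[k] is assumed to estimate P(score > k-th cut point);
    the mean is an assumed aggregation rule, chosen for simplicity.
    """
    return sum(sub_probs) / len(sub_probs)


def interacts(sub_probs, theta=0.5):
    """Declare an interaction when the aggregated score is bigger than theta."""
    return aggregate_confidence(sub_probs) > theta


# Hypothetical values for one protein pair.
pair_vector = combine_embeddings([1.0, 2.0, 0.5], [2.0, 0.5, 4.0])
pair_score = aggregate_confidence([0.9, 0.8, 0.7, 0.2])
pair_call = interacts([0.9, 0.8, 0.7, 0.2], theta=0.5)
```

Element-wise multiplication is symmetric in its two arguments, so the fused vector does not depend on the order in which the two proteins are presented.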
Affiliation(s)
- Weixia Xu
  - School of Information Management, Shanghai Lixin University of Accounting and Finance, No. 995 Shangchuan Road, Shanghai, 201209, China
- Yangyun Gao
  - Shanghai Key Laboratory of Intelligent Information Processing, and School of Computer Science, Fudan University, No. 220 Handan Road, Shanghai, 200433, China
- Yang Wang
  - Department of Computer Science and Technology, Tongji University, No. 4800 Caoan Road, Shanghai, 201804, China
- Jihong Guan
  - Department of Computer Science and Technology, Tongji University, No. 4800 Caoan Road, Shanghai, 201804, China
6
Suvorova IA, Gelfand MS. Comparative Analysis of the IclR-Family of Bacterial Transcription Factors and Their DNA-Binding Motifs: Structure, Positioning, Co-Evolution, Regulon Content. Front Microbiol 2021; 12:675815. [PMID: 34177859 PMCID: PMC8222616 DOI: 10.3389/fmicb.2021.675815] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/04/2021] [Accepted: 05/14/2021] [Indexed: 11/13/2022] Open
Abstract
The IclR-family is a large group of transcription factors (TFs) regulating various biological processes in diverse bacteria. Using comparative genomics techniques, we have identified binding motifs of IclR-family TFs, reconstructed regulons and analyzed their content, finding co-occurrences between the regulated COGs (clusters of orthologous genes), useful for future functional characterizations of TFs and their regulated genes. We describe two main types of IclR-family motifs, similar in sequence but different in the arrangement of the half-sites (boxes), with GKTYCRYW(3-4)RYGRAMC and TGRAACAN(1-2)TGTTYCA consensuses, and also predict that TFs in 32 orthologous groups have binding sites comprised of three boxes with alternating direction, which implies two possible alternative modes of dimerization of TFs. We identified trends in site positioning relative to the translational gene start, and show that TFs in 94 orthologous groups bind tandem sites with 18-22 nucleotides between their centers. We predict protein-DNA contacts via the correlation analysis of nucleotides in binding sites and amino acids of the DNA-binding domain of TFs, and show that the majority of interacting positions and predicted contacts are similar for both types of motifs and conform well both to available experimental data and to general protein-DNA interaction trends.
Affiliation(s)
- Inna A Suvorova
  - Institute for Information Transmission Problems of Russian Academy of Sciences (The Kharkevich Institute), Moscow, Russia
- Mikhail S Gelfand
  - Institute for Information Transmission Problems of Russian Academy of Sciences (The Kharkevich Institute), Moscow, Russia
  - Skolkovo Institute of Science and Technology, Moscow, Russia
7
Jiang Y, Liu HF, Liu R. Systematic comparison and prediction of the effects of missense mutations on protein-DNA and protein-RNA interactions. PLoS Comput Biol 2021; 17:e1008951. [PMID: 33872313 PMCID: PMC8084330 DOI: 10.1371/journal.pcbi.1008951] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/19/2021] [Revised: 04/29/2021] [Accepted: 04/08/2021] [Indexed: 12/30/2022] Open
Abstract
The binding affinities of protein-nucleic acid interactions could be altered due to missense mutations occurring in DNA- or RNA-binding proteins, therefore resulting in various diseases. Unfortunately, a systematic comparison and prediction of the effects of mutations on protein-DNA and protein-RNA interactions (these two mutation classes are termed MPDs and MPRs, respectively) is still lacking. Here, we demonstrated that these two classes of mutations could generate similar or different tendencies for binding free energy changes in terms of the properties of mutated residues. We then developed regression algorithms separately for MPDs and MPRs by introducing novel geometric partition-based energy features and interface-based structural features. Through feature selection and ensemble learning, similar computational frameworks that integrated energy- and nonenergy-based models were established to estimate the binding affinity changes resulting from MPDs and MPRs, but the selected features for the final models were different and therefore reflected the specificity of these two mutation classes. Furthermore, the proposed methodology was extended to the identification of mutations that significantly decreased the binding affinities. Extensive validations indicated that our algorithm generally performed better than the state-of-the-art methods on both the regression and classification tasks. The webserver and software are freely available at http://liulab.hzau.edu.cn/PEMPNI and https://github.com/hzau-liulab/PEMPNI. Protein-nucleic acid interactions play important roles in various cellular processes. Missense mutations occurring in DNA- or RNA-binding proteins (termed MPDs and MPRs, respectively) could change the binding affinities of these interactions. 
Previous studies have compared protein-DNA and protein-RNA interactions from multifaceted viewpoints, but less attention has been given to the similarities and specific differences between the effects of MPDs and MPRs and between the methodologies for predicting the affinity changes induced by the two mutation classes. Therefore, we systematically compared their impacts and demonstrated that MPDs and MPRs could have specific preferences for binding affinity changes. These observations motivated us to construct regression models separately for MPDs and MPRs by introducing novel energy and nonenergy descriptors. Although similar frameworks were developed to estimate these two categories of mutation effects, different descriptors were selected in the regression models and further revealed the specificity of mutation classes. The interplay between the energy and nonenergy modules effectively improved prediction performance. Our algorithm can also be adopted to disentangle mutations significantly decreasing binding affinities from other mutations.
Affiliation(s)
- Yao Jiang
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
- Hui-Fang Liu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
- Rong Liu
  - Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, P. R. China
8
Long P, Zhang L, Huang B, Chen Q, Liu H. Integrating genome sequence and structural data for statistical learning to predict transcription factor binding sites. Nucleic Acids Res 2021; 48:12604-12617. [PMID: 33264415 PMCID: PMC7736823 DOI: 10.1093/nar/gkaa1134] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/22/2020] [Revised: 09/18/2020] [Accepted: 11/10/2020] [Indexed: 01/11/2023] Open
Abstract
We report an approach to predict the DNA specificity of the tetracycline repressor (TetR) family transcription regulators (TFRs). First, a genome sequence-based method was streamlined, with quantitative P-values defined to filter out reliable predictions. Then, a framework was introduced to incorporate structural data and to train a statistical energy function to score the pairing between a TFR and a TFR binding site (TFBS) based on sequences. When the predictions were benchmarked against experiments, the TFBSs for 29 out of 30 TFRs were correctly predicted by either the genome sequence-based or the statistical energy-based method. Using P-values or Z-scores as indicators, we estimate that 59.6% of TFRs are covered with relatively reliable predictions by at least one of the two methods, while only 28.7% are covered by the genome sequence-based method alone. Our approach predicts a large number of new TFBSs which cannot be correctly retrieved from public databases such as FootprintDB. High-throughput experimental assays suggest that the statistical energy can model the TFBSs of a significant number of TFRs reliably. Thus the energy function may be applied to explore for new TFBSs in respective genomes. It is possible to extend our approach to other transcription factor families with sufficient structural information.
Affiliation(s)
- Pengpeng Long
  - School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
- Lu Zhang
  - School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
- Bin Huang
  - School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
- Quan Chen
  - School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
  - Hefei National Laboratory for Physical Sciences at the Microscale, Hefei, Anhui 230026, China
- Haiyan Liu
  - School of Life Sciences, University of Science and Technology of China, Hefei, Anhui 230026, China
  - Hefei National Laboratory for Physical Sciences at the Microscale, Hefei, Anhui 230026, China
  - School of Data Science, University of Science and Technology of China, Hefei, Anhui 230026, China
9
Yadav D, Kaur S, Banerjee D, Bhattacharyya R. Metformin and Rifampicin combination augments active to latent tuberculosis conversion: A computational study. Biotechnol Appl Biochem 2020; 68:1307-1312. [PMID: 33059386 DOI: 10.1002/bab.2052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2020] [Accepted: 10/07/2020] [Indexed: 11/10/2022]
Abstract
Tuberculosis, a global threat, is a highly infectious disease intensified by the emergence of drug-resistant strains. In the tuberculosis disease spectrum, a typical situation is a dormant or latent phase in which a person exposed to Mycobacterium tuberculosis carries a reservoir of the disease that may or may not result in an active state. The existence of the dormant state is retarding the eradication of tuberculosis. Transcription of several genes helps M. tuberculosis to survive in nonreplicative mode. The DosR transcription factor is the hallmark of this genesis. Diabetes mellitus is a predisposing factor leading to the development of tuberculosis and latent tuberculosis. High plasma insulin concentrations in the prediabetic state can increase the tuberculosis bacterium. On the other hand, the antidiabetic drug metformin is known to reduce active tuberculosis disease when provided in combination with antitubercular therapy. However, its effect on latent tuberculosis is still unknown. In the present work, using tools of computational biology, we have tried to find the consequence of adding metformin in combination with rifampicin, a well-known antitubercular drug, on the molecular mechanisms of latent tuberculosis. We have investigated whether metformin and rifampicin interact with the DosR machinery. Our results indicate that if the metformin-bound DosR-DNA complex binds with rifampicin, it will result in the conversion of active tuberculosis to latent tuberculosis.
Affiliation(s)
- Deepak Yadav
  - Department of Experimental Medicine and Biotechnology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Sumanpreet Kaur
  - Department of Experimental Medicine and Biotechnology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Dibyajyoti Banerjee
  - Department of Experimental Medicine and Biotechnology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
- Rajasri Bhattacharyya
  - Department of Experimental Medicine and Biotechnology, Postgraduate Institute of Medical Education and Research, Chandigarh, India
10
Lin M, Guo JT. New insights into protein-DNA binding specificity from hydrogen bond based comparative study. Nucleic Acids Res 2020; 47:11103-11113. [PMID: 31665426 PMCID: PMC6868434 DOI: 10.1093/nar/gkz963] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 05/08/2019] [Revised: 10/06/2019] [Accepted: 10/10/2019] [Indexed: 12/25/2022] Open
Abstract
Knowledge of protein-DNA binding specificity has important implications in understanding DNA metabolism, transcriptional regulation and developing therapeutic drugs. Previous studies demonstrated hydrogen bonds between amino acid side chains and DNA bases play major roles in specific protein-DNA interactions. In this paper, we investigated the roles of individual DNA strands and protein secondary structure types in specific protein-DNA recognition based on side chain-base hydrogen bonds. By comparing the contribution of each DNA strand to the overall binding specificity between DNA-binding proteins with different degrees of binding specificity, we found that highly specific DNA-binding proteins show balanced hydrogen bonding with each of the two DNA strands while multi-specific DNA binding proteins are generally biased towards one strand. Protein-base pair hydrogen bonds, in which both bases of a base pair are involved in forming hydrogen bonds with amino acid side chains, are more prevalent in the highly specific protein-DNA complexes than those in the multi-specific group. Amino acids involved in side chain-base hydrogen bonds favor strand and coil secondary structure types in highly specific DNA-binding proteins while multi-specific DNA-binding proteins prefer helices.
Affiliation(s)
- Maoxuan Lin
  - Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Jun-Tao Guo
  - Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
11
Blanco JD, Radusky L, Climente-González H, Serrano L. FoldX accurate structural protein-DNA binding prediction using PADA1 (Protein Assisted DNA Assembly 1). Nucleic Acids Res 2019; 46:3852-3863. [PMID: 29608705 PMCID: PMC5934639 DOI: 10.1093/nar/gky228] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 01/18/2018] [Accepted: 03/20/2018] [Indexed: 12/20/2022] Open
Abstract
The speed at which new genomes are being sequenced highlights the need for genome-wide methods capable of predicting protein–DNA interactions. Here, we present PADA1, a generic algorithm that accurately models structural complexes and predicts the DNA-binding regions of resolved protein structures. PADA1 relies on a library of protein and double-stranded DNA fragment pairs obtained from a training set of 2103 DNA–protein complexes. It includes a fast statistical force field, computed from atom-atom distances, to evaluate and filter the 3D docking models. Using published benchmark validation sets and 212 DNA–protein structures published after 2016, we predicted the DNA-binding regions with an RMSD of <1.8 Å per residue in >95% of the cases. We show that the quality of the docked templates is compatible with the FoldX protein design tool suite, which identifies the crystallized DNA molecule sequence as the most energetically favorable in 80% of the cases. We highlighted the biological potential of PADA1 by reconstituting DNA and protein conformational changes upon protein mutagenesis of a meganuclease and its variants, and by predicting DNA-binding regions and nucleotide sequences in proteins crystallized without DNA. These results open up new perspectives for the engineering of DNA–protein interfaces.
Affiliation(s)
- Javier Delgado Blanco
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Leandro Radusky
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Héctor Climente-González
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
- Luis Serrano
  - Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Pg. Lluis Companys 23, 08010 Barcelona, Spain
12
Poddar S, Chakravarty D, Chakrabarti P. Structural changes in DNA-binding proteins on complexation. Nucleic Acids Res 2019. [PMID: 29534202 PMCID: PMC6283420 DOI: 10.1093/nar/gky170] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/24/2022] Open
Abstract
Characterization and prediction of the DNA-binding regions in proteins are essential for our understanding of how proteins recognize/bind DNA. We analyze the unbound (U) and the bound (B) forms of proteins from the protein–DNA docking benchmark that contains 66 binary protein–DNA complexes along with their unbound counterparts. Proteins binding DNA undergo greater structural changes on complexation (in particular, those in the enzyme category) than those involved in protein–protein interactions (PPI). While interface atoms involved in PPI exhibit an increase in their solvent-accessible surface area (ASA) in the bound form in the majority of the cases compared to the unbound interface, protein–DNA interactions indicate increase and decrease in equal measure. In 25% of structures, the U form has missing residues which are located in the interface in the B form. The missing atoms contribute more toward the buried surface area compared to other interface atoms. Lys, Gly and Arg are prominent in disordered segments that get ordered in the interface on complexation. In going from U to B, there may be an increase in coil and helical content at the expense of turns and strands. Consideration of flexibility cannot distinguish the interface residues from the surface residues in the U form.
Affiliation(s)
- Sayan Poddar
  - Department of Biochemistry, Bose Institute, P1/12 CIT Scheme VIIM, Kolkata 700054, India
- Devlina Chakravarty
  - Bioinformatics Centre, Bose Institute, P1/12CIT Scheme VIIM, Kolkata 700054, India
- Pinak Chakrabarti
  - Department of Biochemistry, Bose Institute, P1/12 CIT Scheme VIIM, Kolkata 700054, India
  - Bioinformatics Centre, Bose Institute, P1/12CIT Scheme VIIM, Kolkata 700054, India
13
Zhu YH, Hu J, Song XN, Yu DJ. DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines. J Chem Inf Model 2019; 59:3057-3071. [PMID: 30943723 DOI: 10.1021/acs.jcim.8b00749] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Indexed: 12/16/2022]
Abstract
Accurate identification of protein-DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. 
The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based protein-DNA binding site predictors.
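The abstract reports both an overall accuracy and a Matthews correlation coefficient (MCC), and on imbalanced data the two can diverge sharply. The sketch below computes both from a confusion matrix to make that concrete. The residue counts are hypothetical values chosen for illustration, not results from the paper.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all residues that are labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced set: 1000 residues, only 60 of them DNA-binding.
predictor = dict(tp=25, fn=35, tn=900, fp=40)
# A trivial model that labels every residue as non-binding.
all_negative = dict(tp=0, fn=60, tn=940, fp=0)

for name, cm in (("predictor", predictor), ("all-negative", all_negative)):
    print(f"{name}: accuracy={accuracy(**cm):.3f}, MCC={mcc(**cm):.3f}")
```

In this toy setting the all-negative model has the higher accuracy but an MCC of zero, which is why MCC is commonly reported alongside accuracy for binding-site prediction.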
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, P. R. China
- Xiao-Ning Song
- School of Internet of Things, Jiangnan University, 1800 Lihu Road, Wuxi 214122, P. R. China
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing 210094, P. R. China
|
14
|
Emamjomeh A, Choobineh D, Hajieghrari B, MahdiNezhad N, Khodavirdipour A. DNA-protein interaction: identification, prediction and data analysis. Mol Biol Rep 2019; 46:3571-3596. [PMID: 30915687 DOI: 10.1007/s11033-019-04763-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/22/2018] [Accepted: 03/14/2019] [Indexed: 12/30/2022]
Abstract
Life in living organisms depends on specific and purposeful interactions between molecules. Such interactions make the various processes inside the cells and bodies of living organisms possible. Among all types of molecular interactions, DNA-protein interactions are of considerable importance. With the development of numerous experimental techniques, diverse methods are now available for recognizing and investigating such interactions. While the traditional experimental techniques for identifying DNA-protein complexes are time-consuming and unsuitable for genome-scale studies, current high-throughput approaches are more efficient at determining such interactions at a large scale, but they are too costly to be practical for daily applications. Hence, the availability of abundant information on different biological sequences, a clearer picture of the conditions under which such interactions form, and developments in computer science, mathematics, and statistics have motivated scientists to develop bioinformatics tools for predicting the interaction site(s). There has been much progress in this field. In this review, the factors and conditions governing the interaction and the laboratory techniques for examining such interactions are addressed. In addition, the bioinformatics tools developed for this purpose are introduced and compared and, in the end, several suggestions are offered for improving the precision of such tools.
Affiliation(s)
- Abbasali Emamjomeh
- Laboratory of Computational Biotechnology and Bioinformatics (CBB), Department of Plant Breeding and Biotechnology (PBB), University of Zabol, Zabol, 98615-538, Iran
- Darush Choobineh
- Agricultural Biotechnology, Department of Plant Breeding and Biotechnology (PBB), Faculty of Agriculture, University of Zabol, Zabol, Iran
- Behzad Hajieghrari
- Department of Agricultural Biotechnology, College of Agriculture, Jahrom University, Jahrom, 74135-111, Iran
- Nafiseh MahdiNezhad
- Laboratory of Computational Biotechnology and Bioinformatics (CBB), Department of Plant Breeding and Biotechnology (PBB), University of Zabol, Zabol, 98615-538, Iran
- Amir Khodavirdipour
- Division of Human Genetics, Department of Anatomy, St. John's hospital, Bangalore, India
|
15
|
Peng Y, Sun L, Jia Z, Li L, Alexov E. Predicting protein-DNA binding free energy change upon missense mutations using modified MM/PBSA approach: SAMPDI webserver. Bioinformatics 2018; 34:779-786. [PMID: 29091991 DOI: 10.1093/bioinformatics/btx698] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 07/14/2017] [Accepted: 10/27/2017] [Indexed: 12/28/2022] Open
Abstract
Motivation: Protein-DNA interactions are essential for regulating many cellular processes, such as transcription, replication, recombination and translation. Amino acid mutations occurring in DNA-binding proteins have profound effects on protein-DNA binding and are linked with many diseases. Hence, accurate and fast predictions of the effects of mutations on protein-DNA binding affinity are essential for understanding disease-causing mechanisms and guiding plausible treatments. Results: Here we report a new method, Single Amino acid Mutation binding free energy change of Protein-DNA Interaction (SAMPDI). The method utilizes a modified Molecular Mechanics Poisson-Boltzmann Surface Area (MM/PBSA) approach along with an additional set of knowledge-based terms derived from investigations of the physicochemical properties of protein-DNA complexes. The method is benchmarked against experimentally determined binding free energy changes caused by 105 mutations in 13 proteins (compiled from the ProNIT database and data from recent references), and results in a correlation coefficient of 0.72. Availability and implementation: http://compbio.clemson.edu/SAMPDI. Contact: ealexov@clemson.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yunhui Peng
- Department of Physics and Astronomy, Clemson University, Clemson SC 29634, USA
- Lexuan Sun
- Department of Physics and Astronomy, Clemson University, Clemson SC 29634, USA
- Zhe Jia
- Department of Physics and Astronomy, Clemson University, Clemson SC 29634, USA
- Lin Li
- Department of Physics and Astronomy, Clemson University, Clemson SC 29634, USA
- Emil Alexov
- Department of Physics and Astronomy, Clemson University, Clemson SC 29634, USA
|
16
|
Connolly M, Arra A, Zvoda V, Steinbach PJ, Rice PA, Ansari A. Static Kinks or Flexible Hinges: Multiple Conformations of Bent DNA Bound to Integration Host Factor Revealed by Fluorescence Lifetime Measurements. J Phys Chem B 2018; 122:11519-11534. [DOI: 10.1021/acs.jpcb.8b07405] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 12/21/2022]
Affiliation(s)
- Mitchell Connolly
- Department of Physics, University of Illinois at Chicago, Chicago, Illinois 60607, United States
- Aline Arra
- Department of Physics, University of Illinois at Chicago, Chicago, Illinois 60607, United States
- Viktoriya Zvoda
- Department of Physics, University of Illinois at Chicago, Chicago, Illinois 60607, United States
- Peter J. Steinbach
- Center for Molecular Modeling, Center for Information Technology, National Institutes of Health, Bethesda, Maryland 20892, United States
- Phoebe A. Rice
- Department of Biochemistry & Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
- Anjum Ansari
- Department of Physics, University of Illinois at Chicago, Chicago, Illinois 60607, United States
- Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois 60607, United States
|
17
|
Gapsys V, de Groot BL. Alchemical Free Energy Calculations for Nucleotide Mutations in Protein–DNA Complexes. J Chem Theory Comput 2017; 13:6275-6289. [DOI: 10.1021/acs.jctc.7b00849] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Indexed: 11/30/2022]
Affiliation(s)
- Vytautas Gapsys
- Computational Biomolecular Dynamics Group, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Bert L. de Groot
- Computational Biomolecular Dynamics Group, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
|
18
|
Zamanighomi M, Lin Z, Wang Y, Jiang R, Wong WH. Predicting transcription factor binding motifs from DNA-binding domains, chromatin accessibility and gene expression data. Nucleic Acids Res 2017; 45:5666-5677. [PMID: 28472398 PMCID: PMC5449588 DOI: 10.1093/nar/gkx358] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 08/25/2016] [Accepted: 04/20/2017] [Indexed: 01/08/2023] Open
Abstract
Transcription factors (TFs) play crucial roles in regulating gene expression through interactions with specific DNA sequences. Recently, the sequence motifs of almost 400 human TFs have been identified using high-throughput SELEX sequencing. However, there remain a large number of TFs (∼800) with no high-throughput-derived binding motifs. Computational methods capable of associating known motifs to such TFs will avoid tremendous experimental efforts and enable deeper understanding of transcriptional regulatory functions. We present a method to associate known motifs to TFs (MATLAB code is available in Supplementary Materials). Our method is based on a probabilistic framework that not only exploits DNA-binding domains and specificities, but also integrates open chromatin, gene expression and genomic data to accurately infer monomeric and homodimeric binding motifs. Our analysis resulted in the assignment of motifs to 200 TFs with no SELEX-derived motifs, roughly a 50% increase compared to the existing coverage.
Affiliation(s)
- Mahdi Zamanighomi
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Zhixiang Lin
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Yong Wang
- Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100190, China
- Rui Jiang
- MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic & Systems Biology, TNLIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Wing Hung Wong
- Department of Statistics, Stanford University, Stanford, CA 94305, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
|
19
|
Smolinska K, Pacholczyk M. EMQIT: a machine learning approach for energy based PWM matrix quality improvement. Biol Direct 2017; 12:17. [PMID: 28764727 PMCID: PMC5539975 DOI: 10.1186/s13062-017-0189-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/05/2017] [Accepted: 07/17/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Transcription factor binding affinities to DNA play a key role in gene regulation. Learning the specificity of the mechanisms by which TFs bind to DNA is important both to experimentalists and theoreticians. With the development of high-throughput methods such as ChIP-seq, the need to provide unbiased models of binding events has become apparent. We present EMQIT, a modification of the approach introduced by Alamanova et al. and later implemented as the 3DTF server. We observed that tuning of Boltzmann factor weights, used for conversion of calculated energies to nucleotide probabilities, has a significant impact on the quality of the associated PWM matrix. Results: Consequently, we proposed to use receiver operating characteristic curves and 10-fold cross-validation to learn the best weights using experimentally verified data from the TRANSFAC database. We applied our method to data available for various TFs. We verified the efficiency of detecting TF binding sites by the 3DTF matrices improved with our technique using experimental data from the TRANSFAC database. The comparison showed a significant similarity and comparable performance between the improved and the experimental matrices (TRANSFAC). Improved 3DTF matrices achieved significantly higher AUC values than the original 3DTF matrices (at least by 0.1) and, at the same time, detected notably more experimentally verified TFBSs. Conclusions: The resulting new improved PWM matrices for analyzed factors show similarity to TRANSFAC matrices. Matrices had comparable predictive capabilities. Moreover, improved PWMs achieve better results than matrices downloaded from the 3DTF server. The presented approach is general and applicable to any energy-based matrices. EMQIT is available online at http://biosolvers.polsl.pl:3838/emqit. Reviewers: This article was reviewed by Oliviero Carugo, Marek Kimmel and István Simon.
Electronic supplementary material: The online version of this article (doi:10.1186/s13062-017-0189-y) contains supplementary material, which is available to authorized users.
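The conversion of calculated energies to nucleotide probabilities mentioned in the abstract can be illustrated with a standard Boltzmann weighting. The sketch below is not EMQIT's implementation: the function name, the single scalar weight `beta`, and the energy values are assumptions made for illustration, and the abstract does not specify how the weights are parameterized.

```python
import math

def energies_to_probabilities(energies, beta=1.0):
    """Convert per-nucleotide energies at one motif position into probabilities.

    Uses p(b) = exp(-beta * E_b) / sum over b' of exp(-beta * E_b').
    `beta` stands in for the tunable Boltzmann factor weight (an assumption).
    """
    e_min = min(energies.values())  # shift energies for numerical stability
    weights = {b: math.exp(-beta * (e - e_min)) for b, e in energies.items()}
    z = sum(weights.values())
    return {b: w / z for b, w in weights.items()}

# Hypothetical binding energies for one PWM column (lower is more favorable).
column = {"A": -2.0, "C": 0.5, "G": 0.0, "T": 1.0}

flat = energies_to_probabilities(column, beta=0.0)   # beta = 0 ignores the energies
soft = energies_to_probabilities(column, beta=0.5)
sharp = energies_to_probabilities(column, beta=2.0)
print(flat["A"], soft["A"], sharp["A"])
```

Larger values of `beta` concentrate probability on the lowest-energy nucleotide, so the choice of weight changes the resulting PWM column, consistent with the sensitivity the abstract describes.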
Affiliation(s)
- Karolina Smolinska
- Institute of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
- Marcin Pacholczyk
- Institute of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100, Gliwice, Poland
|
20
|
Farrel A, Murphy J, Guo JT. Structure-based prediction of transcription factor binding specificity using an integrative energy function. Bioinformatics 2017; 32:i306-i313. [PMID: 27307632 PMCID: PMC4908348 DOI: 10.1093/bioinformatics/btw264] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 12/11/2022] Open
Abstract
Transcription factors (TFs) regulate gene expression through binding to specific target DNA sites. Accurate annotation of transcription factor binding sites (TFBSs) at genome scale represents an essential step toward our understanding of gene regulation networks. In this article, we present a structure-based method for computational prediction of TFBSs using a novel, integrative energy (IE) function. The new energy function combines a multibody (MB) knowledge-based potential and two atomic energy terms (hydrogen bond and π interaction) that might not be accurately captured by the knowledge-based potential owing to the mean force nature and low count problem. We applied the new energy function to the TFBS prediction using a non-redundant dataset that consists of TFs from 12 different families. Our results show that the new IE function improves the prediction accuracy over the knowledge-based, statistical potentials, especially for homeodomain TFs, the second largest TF family in mammals. Contact: jguo4@uncc.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alvin Farrel
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Jonathan Murphy
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Jun-Tao Guo
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
|
21
|
Omidi S, Zavolan M, Pachkov M, Breda J, Berger S, van Nimwegen E. Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors. PLoS Comput Biol 2017; 13:e1005176. [PMID: 28753602 PMCID: PMC5550003 DOI: 10.1371/journal.pcbi.1005176] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/04/2016] [Revised: 08/09/2017] [Accepted: 06/02/2017] [Indexed: 11/17/2022] Open
Abstract
Gene regulatory networks are ultimately encoded by the sequence-specific binding of transcription factors (TFs) to short DNA segments. Although it is customary to represent the binding specificity of a TF by a position-specific weight matrix (PSWM), which assumes each position within a site contributes independently to the overall binding affinity, evidence has been accumulating that there can be significant dependencies between positions. Unfortunately, methodological challenges have so far hindered the development of a practical and generally-accepted extension of the PSWM model. On the one hand, simple models that only consider dependencies between nearest-neighbor positions are easy to use in practice, but fail to account for the distal dependencies that are observed in the data. On the other hand, models that allow for arbitrary dependencies are prone to overfitting, requiring regularization schemes that are difficult to use in practice for non-experts. Here we present a new regulatory motif model, called dinucleotide weight tensor (DWT), that incorporates arbitrary pairwise dependencies between positions in binding sites, rigorously from first principles, and free from tunable parameters. We demonstrate the power of the method on a large set of ChIP-seq data-sets, showing that DWTs outperform both PSWMs and motif models that only incorporate nearest-neighbor dependencies. We also demonstrate that DWTs outperform two previously proposed methods. Finally, we show that DWTs inferred from ChIP-seq data also outperform PSWMs on HT-SELEX data for the same TF, suggesting that DWTs capture inherent biophysical properties of the interactions between the DNA binding domains of TFs and their binding sites. We make a suite of DWT tools available at dwt.unibas.ch, that allow users to automatically perform ‘motif finding’, i.e. the inference of DWT motifs from a set of sequences, binding site prediction with DWTs, and visualization of DWT ‘dilogo’ motifs.
Gene regulatory networks are ultimately encoded in constellations of short binding sites in the DNA and RNA that are recognized by regulatory factors such as transcription factors (TFs). For several decades, computational analysis of regulatory networks has relied on a model of TF sequence-specificity, the position-specific weight-matrix (PSWM), that assumes different positions in a binding site contribute independently to the total binding energy of the TF. However, in recent years evidence has been accumulating that, at least for some TFs, this assumption does not hold. Here we present a new model for the sequence-specificity of TFs, the dinucleotide weight tensor (DWT), that takes arbitrary dependencies between positions in binding sites into account and show that it consistently outperforms PSWMs on high-throughput datasets on TF binding. Moreover, in contrast to previous approaches, DWTs are directly derived from first principles within a Bayesian framework, and contain no tunable parameters. This allows them to be easily applied in practice and we make a suite of tools available for computational analysis with DWTs.
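The independence assumption behind the PSWM, which the DWT model relaxes, can be made concrete with a small example. The sketch below is not the DWT method. It fits an ordinary PSWM to a hypothetical training set in which two positions are perfectly correlated and shows that the matrix cannot tell a site it has seen from one it has never seen. The function names, the pseudocount, and the toy sites are all assumptions made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def fit_pswm(sites, pseudocount=0.5):
    """Estimate per-position base frequencies, treating positions independently."""
    total = len(sites) + pseudocount * len(BASES)
    pswm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pswm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pswm

def pswm_log_score(site, pswm, background=0.25):
    """Sum of per-position log-odds scores: each position contributes on its own."""
    return sum(math.log(col[base] / background) for base, col in zip(site, pswm))

# Hypothetical sites in which the two positions always carry the same base.
training_sites = ["AA", "TT", "AA", "TT"]
pswm = fit_pswm(training_sites)

seen = pswm_log_score("AA", pswm)    # pattern present in the training set
unseen = pswm_log_score("AT", pswm)  # pattern that never occurs in the training set
print(seen, unseen)
```

Both sites receive the same score because each column only records that A and T are equally frequent. Capturing the correlation requires a model with pairwise terms, which is the gap the abstract says DWTs address.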
Affiliation(s)
- Saeed Omidi
- Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Mihaela Zavolan
- Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Mikhail Pachkov
- Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Jeremie Breda
- Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Severin Berger
- Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
- Erik van Nimwegen
- Biozentrum, University of Basel, Basel, Switzerland; Swiss Institute of Bioinformatics, Basel, Switzerland
|
22
|
Liu S, Zibetti C, Wan J, Wang G, Blackshaw S, Qian J. Assessing the model transferability for prediction of transcription factor binding sites based on chromatin accessibility. BMC Bioinformatics 2017; 18:355. [PMID: 28750606 PMCID: PMC5530957 DOI: 10.1186/s12859-017-1769-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 05/03/2017] [Accepted: 07/19/2017] [Indexed: 12/04/2022] Open
Abstract
Background: Computational prediction of transcription factor (TF) binding sites in different cell types is challenging. Recent technology development allows us to determine the genome-wide chromatin accessibility in various cellular and developmental contexts. The chromatin accessibility profiles provide useful information in prediction of TF binding events in various physiological conditions. Furthermore, ChIP-Seq analysis was used to determine genome-wide binding sites for a range of different TFs in multiple cell types. Integration of these two types of genomic information can improve the prediction of TF binding events. Results: We assessed to what extent a model built upon other TFs and/or other cell types could be used to predict the binding sites of TFs of interest. A random forest model was built using a set of cell type-independent features such as specific sequences recognized by the TFs and evolutionary conservation, as well as cell type-specific features derived from chromatin accessibility data. Our analysis suggested that the models learned from other TFs and/or cell lines performed almost as well as the model learned from the target TF in the cell type of interest. Interestingly, models based on multiple TFs performed better than single-TF models. Finally, we proposed a universal model, BPAC, which was generated using ChIP-Seq data from multiple TFs in various cell types. Conclusion: Integrating chromatin accessibility information with sequence information improves prediction of TF binding. The prediction of TF binding is transferable across TFs and/or cell lines, suggesting there is a set of universal “rules”. A computational tool was developed to predict TF binding sites based on the universal “rules”. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1769-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sheng Liu
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Cristina Zibetti
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Jun Wan
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Guohua Wang
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Seth Blackshaw
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA; Solomon H. Snyder Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA; Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA; Centre for Human Systems Biology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA; Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
- Jiang Qian
- Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, 21287, MD, USA
|
23
|
Farrel A, Guo JT. An efficient algorithm for improving structure-based prediction of transcription factor binding sites. BMC Bioinformatics 2017; 18:342. [PMID: 28715997 PMCID: PMC5514533 DOI: 10.1186/s12859-017-1755-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 03/11/2017] [Accepted: 07/12/2017] [Indexed: 01/07/2023] Open
Abstract
Background: Gene expression is regulated by transcription factors binding to specific target DNA sites. Understanding how and where transcription factors bind at genome scale represents an essential step toward our understanding of gene regulation networks. Previously we developed a structure-based method for prediction of transcription factor binding sites using an integrative energy function that combines a knowledge-based multibody potential and two atomic energy terms. While the method performs well, it is not computationally efficient due to the exponential increase in the number of binding sequences to be evaluated for longer binding sites. In this paper, we present an efficient pentamer algorithm by splitting DNA binding sequences into overlapping fragments along with a simplified integrative energy function for transcription factor binding site prediction. Results: A DNA binding sequence is split into overlapping pentamers (5 base pairs) for calculating transcription factor-pentamer interaction energy. To combine the results from overlapping pentamer scores, we developed two methods, Kmer-Sum and PWM (Position Weight Matrix) stacking, for full-length binding motif prediction. Our results show that both Kmer-Sum and PWM stacking in the new pentamer approach along with a simplified integrative energy function improved transcription factor binding site prediction accuracy and dramatically reduced computation time, especially for longer binding sites. Conclusion: Our new fragment-based pentamer algorithm and simplified energy function improve both efficiency and accuracy. To our knowledge, this is the first fragment-based method for structure-based transcription factor binding site prediction. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1755-0) contains supplementary material, which is available to authorized users.
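The fragment idea in this abstract can be sketched in a few lines: split a binding sequence into overlapping pentamers, score each fragment, and combine the fragment scores. The code below is one plausible reading only. The abstract does not define Kmer-Sum precisely, so the plain summation here is an assumption, and `toy_energy` is a hypothetical placeholder for the structure-based interaction energy.

```python
def overlapping_kmers(seq, k=5):
    """Return every length-k window of seq, advancing one base at a time."""
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_sum(seq, fragment_energy, k=5):
    """Score a full-length site by summing the scores of its overlapping fragments."""
    return sum(fragment_energy(fragment) for fragment in overlapping_kmers(seq, k))

def toy_energy(fragment):
    """Hypothetical stand-in for a TF-pentamer interaction energy (not from the paper)."""
    return -float(fragment.count("G") + fragment.count("C"))

site = "ACGTGCATTA"  # hypothetical 10 bp binding site
pentamers = overlapping_kmers(site)
score = kmer_sum(site, toy_energy)

# Work comparison for a site of length L with 4 possible bases per position.
L = len(site)
full_length_candidates = 4 ** L                 # every possible sequence of length L
fragment_evaluations = (L - 5 + 1) * 4 ** 5     # every pentamer at every window
print(len(pentamers), score, full_length_candidates, fragment_evaluations)
```

For a 10 bp site, enumerating every full-length sequence needs about a million energy evaluations, while scoring every pentamer at every window needs a few thousand. That gap is the efficiency argument the abstract makes for longer binding sites.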
Affiliation(s)
- Alvin Farrel
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA
- Jun-Tao Guo
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC, 28223, USA
|
24
|
Grisewood MJ, Hernández-Lozada NJ, Thoden JB, Gifford NP, Mendez-Perez D, Schoenberger HA, Allan MF, Floy ME, Lai RY, Holden HM, Pfleger BF, Maranas CD. Computational Redesign of Acyl-ACP Thioesterase with Improved Selectivity toward Medium-Chain-Length Fatty Acids. ACS Catal 2017; 7:3837-3849. [PMID: 29375928 DOI: 10.1021/acscatal.7b00408] [Citation(s) in RCA: 62] [Impact Index Per Article: 8.9] [Indexed: 12/21/2022]
Abstract
Enzyme and metabolic engineering offer the potential to develop biocatalysts for converting natural resources into a wide range of chemicals. To broaden the scope of potential products beyond natural metabolites, methods of engineering enzymes to accept alternative substrates and/or perform novel chemistries must be developed. DNA synthesis can create large libraries of enzyme-coding sequences, but most biochemistries lack a simple assay to screen for promising enzyme variants. Our solution to this challenge is structure-guided mutagenesis in which optimization algorithms select the best sequences from libraries based on specified criteria (i.e. binding selectivity). Here, we demonstrate this approach by identifying medium-chain (C6-C12) acyl-ACP thioesterases through structure-guided mutagenesis. Medium-chain fatty acids, products of thioesterase-catalyzed hydrolysis, are limited in natural abundance compared to long-chain fatty acids; the limited supply leads to high costs of C6-C10 oleochemicals such as fatty alcohols, amines, and esters. Here, we applied computational tools to tune substrate binding to the highly-active 'TesA thioesterase in Escherichia coli. We used the IPRO algorithm to design thioesterase variants with enhanced C12- or C8-specificity while maintaining high activity. After four rounds of structure-guided mutagenesis, we identified three thioesterases with enhanced production of dodecanoic acid (C12) and twenty-seven thioesterases with enhanced production of octanoic acid (C8). The top variants reached up to 49% C12 and 50% C8 while exceeding native levels of total free fatty acids. A comparably sized library created by random mutagenesis failed to identify promising mutants. The chain length-preference of 'TesA and the best mutant were confirmed in vitro using acyl-CoA substrates. Molecular dynamics simulations, confirmed by resolved crystal structures, of 'TesA variants suggest that hydrophobic forces govern 'TesA substrate specificity. 
We expect that the design rules we uncovered and the thioesterase variants identified will be useful to metabolic engineering projects aimed at sustainable production of medium-chain oleochemicals.
Affiliation(s)
- Matthew J. Grisewood
- Department of Chemical Engineering, Pennsylvania State University, 158 Fenske Laboratory, University Park, Pennsylvania 16802, United States
- Néstor J. Hernández-Lozada
- Department of Chemical and Biological Engineering, University of Wisconsin−Madison, 1415 Engineering Drive, Madison, Wisconsin 53706, United States
- James B. Thoden
- Department of Biochemistry, University of Wisconsin−Madison, 440 Henry Mall, Madison, Wisconsin 53706, United States
- Nathanael P. Gifford
- Department of Chemical Engineering, Pennsylvania State University, 158 Fenske Laboratory, University Park, Pennsylvania 16802, United States
- Daniel Mendez-Perez
- Department of Chemical and Biological Engineering, University of Wisconsin−Madison, 1415 Engineering Drive, Madison, Wisconsin 53706, United States
- Haley A. Schoenberger
- Department of Chemical and Biological Engineering, University of Wisconsin−Madison, 1415 Engineering Drive, Madison, Wisconsin 53706, United States
- Matthew F. Allan
- Department of Chemical Engineering, Pennsylvania State University, 158 Fenske Laboratory, University Park, Pennsylvania 16802, United States
- Martha E. Floy
- Department of Chemical and Biological Engineering, University of Wisconsin−Madison, 1415 Engineering Drive, Madison, Wisconsin 53706, United States
- Rung-Yi Lai
- Department of Chemical and Biological Engineering, University of Wisconsin−Madison, 1415 Engineering Drive, Madison, Wisconsin 53706, United States
- Hazel M. Holden
- Department of Biochemistry, University of Wisconsin−Madison, 440 Henry Mall, Madison, Wisconsin 53706, United States
- Brian F. Pfleger
- Department of Chemical and Biological Engineering, University of Wisconsin−Madison, 1415 Engineering Drive, Madison, Wisconsin 53706, United States
- Costas D. Maranas
- Department of Chemical Engineering, Pennsylvania State University, 158 Fenske Laboratory, University Park, Pennsylvania 16802, United States
|
25
|
Paul T, Bera SC, Mishra PP. Direct observation of breathing dynamics at the mismatch induced DNA bubble with nanometre accuracy: a smFRET study. Nanoscale 2017; 9:5835-5842. [PMID: 28332666 DOI: 10.1039/c6nr09348e] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 06/06/2023]
Abstract
The detailed conformational dynamics of the melted region in double-stranded DNA has been studied using a combination of ensemble and single-molecule FRET techniques. We monitored the millisecond time scale fluctuation kinetics of the two strands at the bubble region, which varies with the size of the bubble. As the individual strands at the melting bubble behave as single-stranded DNA, and hence fluctuate dynamically to attain energetically favored configurations, the rates of these fluctuations increase with increasing bubble size. In the different short DNAs under investigation, the two strands never cross each other to form a knot, irrespective of the number of base pair mismatches present. Rather, they prefer to stay apart from each other as the size of the bubble increases, and follow exactly the opposite trend for bubbles of smaller size. The range within which the bubble strands fluctuate is monitored with nanometre accuracy from the single-molecule FRET measurements. We also speculate on the shape of the bubble, which plays a crucial role in determining the activity of the DNA. These results should be useful in quantifying the chemical processes within DNA as well as in developing a deeper understanding of the activity of DNA due to induced mismatches.
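As background on how single-molecule FRET measurements translate into nanometre distances, the standard Förster relation E = 1 / (1 + (r/R0)^6) can be inverted to give the donor-acceptor separation r. The sketch below is illustrative only: the Förster radius of 5.0 nm, the efficiency values, and the helper name `fret_distance` are assumptions, not values or code from this study.

```python
def fret_distance(efficiency: float, r0_nm: float) -> float:
    """Invert E = 1 / (1 + (r/R0)^6) to get the donor-acceptor distance r in nm."""
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return r0_nm * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


R0_NM = 5.0  # hypothetical Förster radius; depends on the actual dye pair

for e in (0.2, 0.5, 0.8):  # illustrative FRET efficiencies
    print(f"E = {e:.1f} -> r = {fret_distance(e, R0_NM):.2f} nm")
```

Because of the sixth-power dependence, the conversion is most sensitive when r is close to R0, which is why dye pairs are usually chosen so that the expected separations fall near the Förster radius.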
Affiliation(s)
- Tapas Paul
  - Chemical Sciences Division, Saha Institute of Nuclear Physics, 1/AF Bidhannagar, Kolkata 700064, India
|
26
|
Sneha P, Thirumal Kumar D, George Priya Doss C, Siva R, Zayed H. Determining the role of missense mutations in the POU domain of HNF1A that reduce the DNA-binding affinity: A computational approach. PLoS One 2017; 12:e0174953. [PMID: 28410371 PMCID: PMC5391926 DOI: 10.1371/journal.pone.0174953] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 11/21/2016] [Accepted: 03/18/2017] [Indexed: 12/21/2022] Open
Abstract
Maturity-onset diabetes of the young type 3 (MODY3) is a non-ketotic form of diabetes associated with poor insulin secretion. Over the past years, several studies have reported the association of missense mutations in the Hepatocyte Nuclear Factor 1 Alpha (HNF1A) with MODY3. Missense mutations in the POU homeodomain (POUH) of HNF1A hinder binding to the DNA, thereby leading to a dysfunctional protein. Missense mutations of HNF1A were retrieved from public databases and subjected to a three-step computational mutational analysis to identify the underlying mechanism. First, the pathogenicity and stability of the mutations were analyzed to determine whether they alter protein structure and function. Second, the sequence conservation and DNA-binding sites of the mutant positions were assessed, as the HNF1A protein is a transcription factor. Finally, the biochemical properties of the biological system were validated using molecular dynamics simulations in the GROMACS 4.6.3 package. Two arginine residues (131 and 203) in the HNF1A protein are highly conserved and contribute to the function of the protein. Furthermore, the R131W, R131Q, and R203C mutations were predicted to be highly deleterious by in silico tools and showed lower binding affinity with DNA than the native protein in the molecular docking analysis. Triplicate runs of molecular dynamics (MD) simulations (50 ns) revealed smaller changes in patterns of deviation, fluctuation, and compactness in complexes containing the R131Q and R131W mutations than in the complex containing the R203C mutation. We observed a reduction in the number of intermolecular hydrogen bonds, compactness, and electrostatic potential, as well as the loss of salt bridges, in the R203C mutant complex. Substitution of arginine with cysteine at position 203 decreases the affinity of the protein for DNA, thereby destabilizing the protein. Based on our current findings, the MD approach is an important tool for elucidating the impact and affinity of mutations in DNA-protein interactions and understanding their function.
Affiliation(s)
- Sneha P.
  - School of BioSciences and Technology, Vellore Institute of Technology, Vellore, Tamil Nadu, India
- Thirumal Kumar D.
  - School of BioSciences and Technology, Vellore Institute of Technology, Vellore, Tamil Nadu, India
- George Priya Doss C.
  - School of BioSciences and Technology, Vellore Institute of Technology, Vellore, Tamil Nadu, India
- Siva R.
  - School of BioSciences and Technology, Vellore Institute of Technology, Vellore, Tamil Nadu, India
- Hatem Zayed
  - Department of Biomedical Sciences, College of Health Sciences, Qatar University, Doha, Qatar
|
27
|
Andrews CT, Campbell BA, Elcock AH. Direct Comparison of Amino Acid and Salt Interactions with Double-Stranded and Single-Stranded DNA from Explicit-Solvent Molecular Dynamics Simulations. J Chem Theory Comput 2017; 13:1794-1811. [PMID: 28288277 DOI: 10.1021/acs.jctc.6b00883] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 01/04/2023]
Abstract
Given the ubiquitous nature of protein-DNA interactions, it is important to understand the interaction thermodynamics of individual amino acid side chains for DNA. One way to assess these preferences is to perform molecular dynamics (MD) simulations. Here we report MD simulations of 20 amino acid side chain analogs interacting simultaneously with both a 70-base-pair double-stranded DNA and with a 70-nucleotide single-stranded DNA. The relative preferences of the amino acid side chains for dsDNA and ssDNA match well with values deduced from crystallographic analyses of protein-DNA complexes. The estimated apparent free energies of interaction for ssDNA, on the other hand, correlate well with previous simulation values reported for interactions with isolated nucleobases, and with experimental values reported for interactions with guanosine. Comparisons of the interactions with dsDNA and ssDNA indicate that, with the exception of the positively charged side chains, all types of amino acid side chain interact more favorably with ssDNA, with intercalation of aromatic and aliphatic side chains being especially notable. Analysis of the data on a base-by-base basis indicates that positively charged side chains, as well as sodium ions, preferentially bind to cytosine in ssDNA, and that negatively charged side chains, and chloride ions, preferentially bind to guanine in ssDNA. These latter observations provide a novel explanation for the lower salt dependence of DNA duplex stability in GC-rich sequences relative to AT-rich sequences.
Affiliation(s)
- Casey T Andrews
  - Department of Biochemistry, University of Iowa, Iowa City, Iowa 52242, United States
- Brady A Campbell
  - Department of Biochemistry, University of Iowa, Iowa City, Iowa 52242, United States
- Adrian H Elcock
  - Department of Biochemistry, University of Iowa, Iowa City, Iowa 52242, United States
|
28
|
Lee ESA, Sze-To HYA, Wong MH, Leung KS, Lau TCK, Wong AKC. Discovering Protein-DNA Binding Cores by Aligned Pattern Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:254-263. [PMID: 26336137 DOI: 10.1109/tcbb.2015.2474376] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and gene regulation. Because experiments are expensive, discovering binding cores and their variations directly from sequence data is a promising alternative. Although existing computational methods have produced satisfactory results, they yield one-to-one mappings with no site-specific information on residue/nucleotide variations, even though such variations in binding cores may affect binding specificity. This study presents a new representation for modeling binding cores by incorporating variations and an algorithm to discover them from only sequence data. Our algorithm takes protein and DNA sequences from TRANSFAC (a Protein-DNA Binding Database) as input; discovers from both sets of sequences conserved regions in Aligned Pattern Clusters (APCs); associates them as Protein-DNA Co-Occurring APCs; ranks the Protein-DNA Co-Occurring APCs according to their co-occurrence; and, among the top ones, finds three-dimensional structures to support each binding core candidate. If successful, candidates are verified as binding cores. Otherwise, homology modeling is applied to their close matches in PDB to attain new chemically feasible binding cores. Our algorithm obtains binding cores with higher precision and much faster runtime (≥1,600x) than its contemporaries, discovering candidates that do not co-occur as one-to-one associated patterns in the raw data. Availability: http://www.pami.uwaterloo.ca/~ealee/files/tcbbPnDna2015/Release.zip
|
29
|
Chai H, Zhang J, Yang G, Ma Z. An evolution-based DNA-binding residue predictor using a dynamic query-driven learning scheme. Molecular BioSystems 2016; 12:3643-3650. [PMID: 27730230 DOI: 10.1039/c6mb00626d] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/23/2022]
Abstract
DNA-binding proteins play a pivotal role in various biological activities. Identification of DNA-binding residues (DBRs) is of great importance for understanding the mechanisms of gene regulation and chromatin remodeling. Most traditional computational methods construct their predictors on static non-redundant datasets and exclude many homologous DNA-binding proteins to guarantee the generalization capability of their models. However, those ignored samples may provide useful clues for studying protein-DNA interactions and have not received enough attention. In view of this, we propose a novel method, namely DQPred-DBR, to fill this gap in DBR prediction. First, a large-scale extensible sample pool was compiled. Second, evolution-based features in the form of a relative position-specific score matrix and covariant evolutionary conservation descriptors were used to encode the feature space. Third, a dynamic query-driven learning scheme was designed to make more use of proteins with known structure and function. In comparison with a traditional static model, the introduction of dynamic models clearly improved the prediction performance. Experimental results on the benchmark and independent datasets showed that DQPred-DBR has promising generalization capability, produces decent predictions, and outperforms many state-of-the-art methods. For the convenience of academic use, our proposed method was also implemented as a web server.
Affiliation(s)
- H Chai
  - School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China
- J Zhang
  - School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China
- G Yang
  - School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China
  - Office of Informatization Management and Planning, Northeast Normal University, Changchun, 130117, P. R. China
- Z Ma
  - School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, P. R. China
|
30
|
Chandrasekaran A, Chan J, Lim C, Yang LW. Protein Dynamics and Contact Topology Reveal Protein–DNA Binding Orientation. J Chem Theory Comput 2016; 12:5269-5277. [DOI: 10.1021/acs.jctc.6b00688] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/02/2023]
Affiliation(s)
- Lee-Wei Yang
  - Physics Division, National Center for Theoretical Sciences, Hsinchu 30013, Taiwan
|
31
|
Korostelev YD, Zharov IA, Mironov AA, Rakhmaininova AB, Gelfand MS. Identification of Position-Specific Correlations between DNA-Binding Domains and Their Binding Sites. Application to the MerR Family of Transcription Factors. PLoS One 2016; 11:e0162681. [PMID: 27690309 PMCID: PMC5045206 DOI: 10.1371/journal.pone.0162681] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/11/2015] [Accepted: 08/26/2016] [Indexed: 11/25/2022] Open
Abstract
The large and increasing volume of genomic data analyzed by comparative methods provides information about transcription factors and their binding sites that, in turn, enables statistical analysis of correlations between factors and sites, uncovering mechanisms and evolution of specific protein-DNA recognition. Here we present an online tool, Prot-DNA-Korr, designed to identify and analyze crucial protein-DNA pairs of positions in a family of transcription factors. Correlations are identified by analysis of mutual information between columns of protein and DNA alignments. The algorithm reduces the effects of common phylogenetic history and of abundance of closely related proteins and binding sites. We apply it to five closely related subfamilies of the MerR family of bacterial transcription factors that regulate heavy metal resistance systems. We validate the approach using known 3D structures of MerR-family proteins in complexes with their cognate DNA binding sites and demonstrate that a significant fraction of correlated positions indeed form specific side-chain-to-base contacts. The joint distribution of amino acids and nucleotides hence may be used to predict changes of specificity for point mutations in transcription factors.
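The core quantity named in this abstract, mutual information between a protein alignment column and a DNA alignment column, can be sketched as follows. This is a minimal illustration under stated assumptions: the toy columns and the function name `mutual_information` are hypothetical, and the sketch omits the corrections for shared phylogenetic history and for over-represented close relatives that the abstract says the tool applies.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equally long alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must pair up one-to-one")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


# Hypothetical paired columns: one residue position across 8 factors,
# and one base position across their 8 cognate sites.
protein_col = list("RRKKRRKK")
coupled_dna_col = list("GGAAGGAA")    # base tracks the residue exactly
uncoupled_dna_col = list("GAGAGAGA")  # base is independent of the residue

print(mutual_information(protein_col, coupled_dna_col))    # high: positions co-vary
print(mutual_information(protein_col, uncoupled_dna_col))  # zero: no co-variation
```

Raw mutual information of this kind is inflated by small sample sizes and by closely related sequences, which is presumably why the published method adds the corrections mentioned above.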
Affiliation(s)
- Yuriy D. Korostelev
  - A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, 19-1 Bolshoy Karetny pereulok, Moscow, Russia, 127994
  - Department of Bioengineering and Bioinformatics, Moscow State University, 1-73 Vorobievy Gory, Moscow, Russia, 119991
- Ilya A. Zharov
  - A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, 19-1 Bolshoy Karetny pereulok, Moscow, Russia, 127994
- Andrey A. Mironov
  - A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, 19-1 Bolshoy Karetny pereulok, Moscow, Russia, 127994
  - Department of Bioengineering and Bioinformatics, Moscow State University, 1-73 Vorobievy Gory, Moscow, Russia, 119991
- Alexandra B. Rakhmaininova
  - A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, 19-1 Bolshoy Karetny pereulok, Moscow, Russia, 127994
- Mikhail S. Gelfand
  - A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, 19-1 Bolshoy Karetny pereulok, Moscow, Russia, 127994
  - Department of Bioengineering and Bioinformatics, Moscow State University, 1-73 Vorobievy Gory, Moscow, Russia, 119991
|
32
|
Xiao X, Agris PF, Hall CK. Designing peptide sequences in flexible chain conformations to bind RNA: a search algorithm combining Monte Carlo, self-consistent mean field and concerted rotation techniques. J Chem Theory Comput 2016; 11:740-52. [PMID: 26579605 DOI: 10.1021/ct5008247] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 01/20/2023]
Abstract
A search algorithm combining Monte Carlo, self-consistent mean field, and concerted rotation techniques was developed to discover peptide sequences that are reasonable HIV drug candidates due to their exceptional binding to human tRNAUUU(Lys3), the primer of HIV replication. The search algorithm allows for iteration between sequence mutations and conformation changes during sequence evolution. Searches conducted for different classes of peptides identified several potential peptide candidates. Analysis of the energy revealed that the asparagine and cysteine at residues 11 and 12 play important roles in "recognizing" tRNA(Lys3) via van der Waals interactions, contributing to binding specificity. Arginines preferentially attract the phosphate linkage via charge-charge interaction, contributing to binding affinity. Evaluation of the RNA/peptide complex's structure revealed that adding conformation changes to the search algorithm yields peptides with better binding affinity and specificity to tRNA(Lys3) than a previous mutation-only algorithm.
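To illustrate only the Monte Carlo ingredient of such a search, the sketch below runs a Metropolis walk over peptide sequences with single-residue mutations. It is not the authors' algorithm: the self-consistent mean field and concerted rotation moves are left out, and `toy_energy` is a hypothetical stand-in score that simply rewards Arg/Lys, chosen because the abstract notes that arginines contribute to affinity through charge-charge interactions. The temperature, step count, and function names are likewise assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq):
    """Hypothetical score: -1 per Arg/Lys. Lower is better. Not a physical model."""
    return -sum(1.0 for aa in seq if aa in "RK")


def metropolis_search(seq, energy_fn, steps=2000, kT=0.6, seed=0):
    """Metropolis Monte Carlo over sequences using single-residue mutation moves."""
    rng = random.Random(seed)
    current = list(seq)
    e_cur = energy_fn(current)
    best, e_best = current[:], e_cur
    for _ in range(steps):
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)  # propose a mutation
        e_new = energy_fn(current)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / kT):
            e_cur = e_new
            if e_cur < e_best:
                best, e_best = current[:], e_cur
        else:
            current[pos] = old  # reject: restore the previous residue
    return "".join(best), e_best


start = "A" * 12
best_seq, best_e = metropolis_search(start, toy_energy)
print(start, toy_energy(start))
print(best_seq, best_e)
```

In the published approach, sequence moves of this kind alternate with conformational moves, which the abstract reports gives better affinity and specificity than a mutation-only search.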
Affiliation(s)
- Xingqing Xiao
  - Chemical and Biomolecular Engineering Department, North Carolina State University, Raleigh, North Carolina 27695-7905, United States
- Paul F Agris
  - The RNA Institute, University at Albany, State University of New York, Albany, New York 12222, United States
- Carol K Hall
  - Chemical and Biomolecular Engineering Department, North Carolina State University, Raleigh, North Carolina 27695-7905, United States
|
33
|
Dresch JM, Zellers RG, Bork DK, Drewell RA. Nucleotide Interdependency in Transcription Factor Binding Sites in the Drosophila Genome. Gene Regulation and Systems Biology 2016; 10:21-33. [PMID: 27330274 PMCID: PMC4907338 DOI: 10.4137/grsb.s38462] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/05/2016] [Revised: 04/17/2016] [Accepted: 04/28/2016] [Indexed: 01/14/2023]
Abstract
A long-standing objective in modern biology is to characterize the molecular components that drive the development of an organism. At the heart of eukaryotic development lies gene regulation. On the molecular level, much of the research in this field has focused on the binding of transcription factors (TFs) to regulatory regions in the genome known as cis-regulatory modules (CRMs). However, relatively little is known about the sequence-specific binding preferences of many TFs, especially with respect to the possible interdependencies between the nucleotides that make up binding sites. A particular limitation of many existing algorithms that aim to predict binding site sequences is that they do not allow for dependencies between nonadjacent nucleotides. In this study, we use a recently developed computational algorithm, MARZ, to compare binding site sequences using 32 distinct models in a systematic and unbiased approach to explore nucleotide dependencies within binding sites for 15 distinct TFs known to be critical to Drosophila development. Our results indicate that many of these proteins have varying levels of nucleotide interdependencies within their DNA recognition sequences, and that, in some cases, models that account for these dependencies greatly outperform traditional models that are used to predict binding sites. We also directly compare the ability of different models to identify the known KRUPPEL TF binding sites in CRMs and demonstrate that a more complex model that accounts for nucleotide interdependencies performs better when compared with simple models. This ability to identify TFs with critical nucleotide interdependencies in their binding sites will lead to a deeper understanding of how these molecular characteristics contribute to the architecture of CRMs and the precise regulation of transcription during organismal development.
Affiliation(s)
- Jacqueline M. Dresch
  - Department of Mathematics and Computer Science, Clark University, Worcester, MA, USA
- Rowan G. Zellers
  - Computer Science Department, Harvey Mudd College, Claremont, CA, USA
  - Mathematics Department, Harvey Mudd College, Claremont, CA, USA
- Daniel K. Bork
  - Computer Science Department, Harvey Mudd College, Claremont, CA, USA
  - Mathematics Department, Harvey Mudd College, Claremont, CA, USA
|
34
|
Pettie KP, Dresch JM, Drewell RA. Spatial distribution of predicted transcription factor binding sites in Drosophila ChIP peaks. Mech Dev 2016; 141:51-61. [PMID: 27264535 DOI: 10.1016/j.mod.2016.06.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2015] [Revised: 04/24/2016] [Accepted: 06/01/2016] [Indexed: 11/19/2022]
Abstract
In the development of the Drosophila embryo, gene expression is directed by the sequence-specific interactions of a large network of protein transcription factors (TFs) and DNA cis-regulatory binding sites. Once the identity of the typically 8-10 bp binding sites for any given TF has been determined by one of several experimental procedures, the sequences can be represented in a position weight matrix (PWM) and used to predict the location of additional TF binding sites elsewhere in the genome. Often, alignments of large (>200 bp) genomic fragments that have been experimentally determined to bind the TF of interest in Chromatin Immunoprecipitation (ChIP) studies are trimmed under the assumption that the majority of the binding sites are located near the center of all the aligned fragments. In this study, ChIP/chip datasets are analyzed using the corresponding PWMs for the well-studied TFs CAUDAL, HUNCHBACK, KNIRPS and KRUPPEL to determine the distribution of predicted binding sites. All four TFs are critical regulators of gene expression along the anterior-posterior axis in early Drosophila development. For all four TFs, the ChIP peaks contain multiple binding sites that are broadly distributed across the genomic region represented by the peak, regardless of the prediction stringency criteria used. This result suggests that ChIP peak trimming may exclude functional binding sites from subsequent analyses.
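The PWM-based prediction step described above can be sketched as a log-odds matrix built from aligned sites and slid along a sequence. The binding sites, the query sequence, the pseudocount, the uniform background, and the score threshold below are all illustrative assumptions, not data or parameters from the study; `build_pwm` and `scan` are hypothetical helper names.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [s[i] for s in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical aligned binding sites and a short stand-in for a ChIP peak region.
sites = ["TTTATG", "TTTATG", "TTTACG", "TTTATG"]
peak = "GGCTTTATGCCAGCATTTACGGA"

for start, window, score in scan(peak, build_pwm(sites), threshold=5.0):
    print(start, window, round(score, 2))
```

Recording the start position of every hit, rather than only the best one, is what allows the spatial distribution of predicted sites across a peak to be examined; the threshold plays the role of the prediction stringency mentioned in the abstract.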
Affiliation(s)
- Kade P Pettie
  - Department of Biology, Amherst College, Amherst, MA 01002, United States
- Jacqueline M Dresch
  - Department of Mathematics and Computer Science, Clark University, 950 Main Street, Worcester, MA 01610, United States
- Robert A Drewell
  - Biology Department, Clark University, 950 Main Street, Worcester, MA 01610, United States
|
35
|
Hamed MY, Arya G. Zinc finger protein binding to DNA: an energy perspective using molecular dynamics simulation and free energy calculations on mutants of both zinc finger domains and their specific DNA bases. J Biomol Struct Dyn 2016. [PMID: 26196228 DOI: 10.1080/07391102.2015.1068224] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 10/23/2022]
Abstract
Energy calculations based on MM-GBSA were employed to study various zinc finger protein (ZF) motifs binding to DNA. Mutants of both the ZF domains and the DNA bases bound to their specific amino acids were studied. Calculated energies gave evidence for a relationship between binding energy and affinity of ZF motifs to their sites on DNA. ΔG values were -15.82(12), -3.66(12), and -12.14(11.6) kcal/mol for finger one, finger two, and finger three, respectively. Mutations in the DNA bases reduced the magnitude of the negative binding energies (maximum ΔΔG = 42 kcal/mol for F1 when GCG was mutated to GGG, and ΔΔG = 22 kcal/mol for F2); the loss in total binding energy originated in the loss of electrostatic energy upon mutation (r = .98). Mutations of key amino acids in the ZF motif at positions -1, 2, 3, and 6 showed reduced binding energies to DNA, with correlation coefficients of .99 between total free energy and electrostatic energy and .93 between total free energy and van der Waals energy. The results agree with experimentally found selectivity, which showed that arginine in position -1 is specific to G, while aspartic acid (D) in position 2 plays a complicated role in binding. There is a correlation between the MD-calculated free energies of binding and those obtained experimentally in other reports for prepared ZF motifs bound to triplet bases. Our results may help in the design of ZF motifs based on the established recognition codes, using binding energies and the contributions of individual energy terms to the total energy.
Affiliation(s)
- Mazen Y Hamed
  - Department of Chemistry, Birzeit University, P. O. Box 14, Birzeit, Palestine
- Gaurav Arya
  - Department of Nanoengineering, University of California, San Diego, 9500 Gilman Dr., MC-0448, La Jolla, CA 92093-0448, USA
|
36
|
Qin W, Zhao G, Carson M, Jia C, Lu H. Knowledge-based three-body potential for transcription factor binding site prediction. IET Syst Biol 2016; 10:23-9. [PMID: 26816396 DOI: 10.1049/iet-syb.2014.0066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022] Open
Abstract
A structure-based statistical potential is developed for transcription factor binding site (TFBS) prediction. Besides the direct contact between amino acids from TFs and DNA bases, the authors also considered the influence of the neighbouring base. This three-body potential showed better discriminative power than the two-body potential. They validated the performance of the potential in TFBS identification, binding energy prediction and binding mutation prediction.
Affiliation(s)
- Wenyi Qin
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Guijun Zhao
  - Key Laboratory of Molecular Embryology, Ministry of Health & Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, People's Republic of China
- Matthew Carson
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Caiyan Jia
  - School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, People's Republic of China
- Hui Lu
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
|
37
|
FootprintDB: Analysis of Plant Cis-Regulatory Elements, Transcription Factors, and Binding Interfaces. Methods Mol Biol 2016; 1482:259-77. [PMID: 27557773 DOI: 10.1007/978-1-4939-6396-6_17] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 12/18/2022]
Abstract
FootprintDB is a database and search engine that compiles regulatory sequences from open access libraries of curated DNA cis-elements and motifs, and their associated transcription factors (TFs). It systematically annotates the binding interfaces of the TFs by exploiting protein-DNA complexes deposited in the Protein Data Bank. Each entry in footprintDB is thus a DNA motif linked to the protein sequence of the TF(s) known to recognize it, and in most cases, the set of predicted interface residues involved in specific recognition. This chapter explains step-by-step how to search for DNA motifs and protein sequences in footprintDB and how to focus the search to a particular organism. Two real-world examples are shown where this software was used to analyze transcriptional regulation in plants. Results are described with the aim of guiding users on their interpretation, and special attention is given to the choices users might face when performing similar analyses.
|
38
|
AlQuraishi M, Tang S, Xia X. An affinity-structure database of helix-turn-helix: DNA complexes with a universal coordinate system. BMC Bioinformatics 2015; 16:390. [PMID: 26586237 PMCID: PMC4653904 DOI: 10.1186/s12859-015-0819-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/04/2015] [Accepted: 11/11/2015] [Indexed: 11/28/2022] Open
Abstract
Background: Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions.
Description: We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis.
Conclusions: This database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.
Affiliation(s)
- Mohammed AlQuraishi
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
  - HMS Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA
- Shengdong Tang
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
  - HMS Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA
- Xide Xia
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
  - HMS Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA
|
39
|
Bazzoli A, Kelow SP, Karanicolas J. Enhancements to the Rosetta Energy Function Enable Improved Identification of Small Molecules that Inhibit Protein-Protein Interactions. PLoS One 2015; 10:e0140359. [PMID: 26484863 PMCID: PMC4617380 DOI: 10.1371/journal.pone.0140359] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 02/18/2015] [Accepted: 09/24/2015] [Indexed: 11/25/2022] Open
Abstract
Protein-protein interactions are among today’s most exciting and promising targets for therapeutic intervention. To date, identifying small molecules that selectively disrupt these interactions has proven particularly challenging for virtual screening tools, since these have typically been optimized to perform well on more “traditional” drug discovery targets. Here, we test the performance of the Rosetta energy function for identifying compounds that inhibit protein interactions, when these active compounds have been hidden amongst pools of “decoys.” Through this virtual screening benchmark, we gauge the effect of two recent enhancements to the functional form of the Rosetta energy function: the new “Talaris” update and the “pwSHO” solvation model. Finally, we conclude by developing and validating a new weight set that maximizes Rosetta’s ability to pick out the active compounds in this test set. Looking collectively over the course of these enhancements, we find a marked improvement in Rosetta’s ability to identify small-molecule inhibitors of protein-protein interactions.
Affiliation(s)
- Andrea Bazzoli
  - Center for Computational Biology, University of Kansas, 2030 Becker Dr., Lawrence, Kansas, 66045–7534, United States of America
- Simon P. Kelow
  - Center for Computational Biology, University of Kansas, 2030 Becker Dr., Lawrence, Kansas, 66045–7534, United States of America
- John Karanicolas
  - Center for Computational Biology, University of Kansas, 2030 Becker Dr., Lawrence, Kansas, 66045–7534, United States of America
  - Department of Molecular Biosciences, University of Kansas, 2030 Becker Dr., Lawrence, Kansas, 66045–7534, United States of America
  - * E-mail:
|
40
|
Suvorova IA, Korostelev YD, Gelfand MS. GntR Family of Bacterial Transcription Factors and Their DNA Binding Motifs: Structure, Positioning and Co-Evolution. PLoS One 2015; 10:e0132618. [PMID: 26151451 PMCID: PMC4494728 DOI: 10.1371/journal.pone.0132618] [Citation(s) in RCA: 64] [Impact Index Per Article: 7.1] [Received: 03/31/2015] [Accepted: 06/16/2015] [Indexed: 12/03/2022] Open
Abstract
The GntR family of transcription factors (TFs) is a large group of proteins present in diverse bacteria and regulating various biological processes. Here we use the comparative genomics approach to reconstruct regulons and identify binding motifs of regulators from three subfamilies of the GntR family, FadR, HutC, and YtrA. Using these data, we attempt to predict DNA-protein contacts by analyzing correlations between binding motifs in DNA and amino acid sequences of TFs. We identify pairs of positions with high correlation between amino acids and nucleotides for FadR, HutC, and YtrA subfamilies and show that the most predicted DNA-protein interactions are quite similar in all subfamilies and conform well to the experimentally identified contacts formed by FadR from E. coli and AraR from B. subtilis. The most frequent predicted contacts in the analyzed subfamilies are Arg-G, Asn-A, Asp-C. We also analyze the divergon structure and preferred site positions relative to regulated genes in the FadR and HutC subfamilies. A single site in a divergon usually regulates both operons and is approximately in the middle of the intergenic area. Double sites are either involved in the co-operative regulation of both operons and then are in the center of the intergenic area, or each site in the pair independently regulates its own operon and tends to be near it. We also identify additional candidate TF-binding boxes near palindromic binding sites of TFs from the FadR, HutC, and YtrA subfamilies, which may play a role in the binding of additional TF-subunits.
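The abstract above predicts contacts from correlations between amino acid positions in the TFs and nucleotide positions in their binding motifs. One common way to score such a pair of alignment columns is mutual information. The sketch below assumes that measure, since the abstract does not name the statistic actually used, and the function name and toy columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(residues, bases):
    """Mutual information (bits) between two paired columns.

    residues[i] is the amino acid at one protein position in TF i,
    bases[i] is the nucleotide at one motif position for the same TF.
    """
    n = len(residues)
    if n == 0 or n != len(bases):
        raise ValueError("columns must be non-empty and of equal length")

    count_x = Counter(residues)
    count_y = Counter(bases)
    count_xy = Counter(zip(residues, bases))

    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

# Toy columns: Arg always pairs with G and Asn with A (perfect covariation).
coupled = mutual_information("RRRRNNNN", "GGGGAAAA")
# Toy columns with no association between residue and base.
independent = mutual_information("RRNN", "GAGA")
```

Pairs of positions with high scores are candidates for direct contacts, although with few sequences the raw statistic is biased upward, so real analyses usually add a significance test or correction.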
Affiliation(s)
- Inna A. Suvorova
  - Research and Training Center on Bioinformatics, Institute for Information Transmission Problems RAS (The Kharkevich Institute), Moscow, Russia
  - * E-mail:
- Yuri D. Korostelev
  - Research and Training Center on Bioinformatics, Institute for Information Transmission Problems RAS (The Kharkevich Institute), Moscow, Russia
- Mikhail S. Gelfand
  - Research and Training Center on Bioinformatics, Institute for Information Transmission Problems RAS (The Kharkevich Institute), Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Moscow State University, Moscow, Russia
|
41
|
A Biophysical Approach to Predicting Protein-DNA Binding Energetics. Genetics 2015; 200:1349-61. [PMID: 26081193 DOI: 10.1534/genetics.115.178384] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/16/2014] [Accepted: 06/10/2015] [Indexed: 11/18/2022] Open
Abstract
Sequence-specific interactions between proteins and DNA play a central role in DNA replication, repair, recombination, and control of gene expression. These interactions can be studied in vitro using microfluidics, protein-binding microarrays (PBMs), and other high-throughput techniques. Here we develop a biophysical approach to predicting protein-DNA binding specificities from high-throughput in vitro data. Our algorithm, called BindSter, can model alternative DNA-binding modes and multiple protein species competing for access to DNA, while rigorously taking into account all sterically allowed configurations of DNA-bound factors. BindSter can be used with a hierarchy of protein-DNA interaction models of increasing complexity, including contributions of mononucleotides, dinucleotides, and longer words to the total protein-DNA binding energy. We observe that the quality of BindSter predictions does not change significantly as some of the energy parameters vary over a sizable range. To take this degeneracy into account, we have developed a graphical representation of parameter uncertainties called IntervalLogo. We find that our simplest model, in which each nucleotide in the binding site is treated independently, performs better than previous biophysical approaches. The extensions of this model, in which contributions of longer words are also considered, result in further improvements, underscoring the importance of higher-order effects in protein-DNA energetics. In contrast, we find little evidence of multiple binding modes for the transcription factors (TFs) and experimental conditions in our data set. Furthermore, there is limited consistency in predictions for the same TF based on microfluidics and PBM data.
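The simplest model mentioned in the abstract above treats each nucleotide in the binding site independently, so the total binding energy is a sum of per-position terms. The sketch below shows that additive form together with a standard two-state occupancy expression. It is a generic illustration and not BindSter itself: the energy values, the chemical potential `mu`, and the function names are assumptions.

```python
import math

def binding_energy(site, energy_matrix):
    """Additive (mononucleotide) energy: sum of independent per-position terms."""
    if len(site) != len(energy_matrix):
        raise ValueError("site length must match the energy matrix width")
    return sum(energy_matrix[i][base] for i, base in enumerate(site))

def occupancy(energy, mu=0.0, kT=1.0):
    """Two-state binding probability, 1 / (1 + exp((E - mu) / kT))."""
    return 1.0 / (1.0 + math.exp((energy - mu) / kT))

# Hypothetical energies in units of kT; 0.0 marks the preferred base.
example_energies = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.5},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 1.5},
]

best_energy = binding_energy("ACG", example_energies)    # consensus site
worse_energy = binding_energy("GAT", example_energies)   # three mismatches
```

Extending the model to dinucleotides or longer words, as the abstract describes, would add energy terms indexed by adjacent bases rather than by single positions.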
|
42
|
An overview of the prediction of protein DNA-binding sites. Int J Mol Sci 2015; 16:5194-215. [PMID: 25756377 PMCID: PMC4394471 DOI: 10.3390/ijms16035194] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Received: 12/31/2014] [Revised: 02/21/2015] [Accepted: 02/27/2015] [Indexed: 02/06/2023] Open
Abstract
Interactions between proteins and DNA play an important role in many essential biological processes such as DNA replication, transcription, splicing, and repair. The identification of amino acid residues involved in DNA-binding sites is critical for understanding the mechanism of these biological activities. In the last decade, numerous computational approaches have been developed to predict protein DNA-binding sites based on protein sequence and/or structural information, which play an important role in complementing experimental strategies. At this time, approaches can be divided into three categories: sequence-based DNA-binding site prediction, structure-based DNA-binding site prediction, and homology modeling and threading. In this article, we review existing research on computational methods to predict protein DNA-binding sites, which includes data sets, various residue sequence/structural features, machine learning methods for comparison and selection, evaluation methods, performance comparison of different tools, and future directions in protein DNA-binding site prediction. In particular, we detail the meta-analysis of protein DNA-binding sites. We also propose specific implications that are likely to result in novel prediction methods, increased performance, or practical applications.
|
43
|
Pujato M, Kieken F, Skiles AA, Tapinos N, Fiser A. Prediction of DNA binding motifs from 3D models of transcription factors; identifying TLX3 regulated genes. Nucleic Acids Res 2014; 42:13500-12. [PMID: 25428367 PMCID: PMC4267649 DOI: 10.1093/nar/gku1228] [Citation(s) in RCA: 63] [Impact Index Per Article: 6.3] [Indexed: 12/14/2022] Open
Abstract
Proper cell functioning depends on the precise spatio-temporal expression of its genetic material. Gene expression is controlled to a great extent by sequence-specific transcription factors (TFs). Our current knowledge of where and how TFs bind and associate to regulate gene expression is incomplete. A structure-based computational algorithm (TF2DNA) is developed to identify binding specificities of TFs. The method constructs homology models of TFs bound to DNA and assesses the relative binding affinity for all possible DNA sequences using a knowledge-based potential, after optimization in a molecular mechanics force field. TF2DNA predictions were benchmarked against experimentally determined binding motifs. Success rates range from 45% to 81% and primarily depend on the sequence identity of aligned target sequences and template structures. TF2DNA was used to predict 1321 motifs for 1825 putative human TF proteins, facilitating the reconstruction of most of the human gene regulatory network. As an illustration, the predicted DNA binding site for the poorly characterized T-cell leukemia homeobox 3 (TLX3) TF was confirmed with gel shift assay experiments. TLX3 motif searches in human promoter regions identified a group of genes enriched in functions relating to hematopoiesis, tissue morphology, endocrine system and connective tissue development and function.
Affiliation(s)
- Mario Pujato
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
  - Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
- Fabien Kieken
  - Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
  - Macromolecular Therapeutics Development, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
- Amanda A Skiles
  - Molecular Neuroscience Laboratory, Geisinger Clinic, 100 North Academy Avenue, Danville, PA 17822, USA
- Nikos Tapinos
  - Molecular Neuroscience Laboratory, Geisinger Clinic, 100 North Academy Avenue, Danville, PA 17822, USA
- Andras Fiser
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
  - Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
|
44
|
Joyce AP, Zhang C, Bradley P, Havranek JJ. Structure-based modeling of protein:DNA specificity. Brief Funct Genomics 2014; 14:39-49. [PMID: 25414269 DOI: 10.1093/bfgp/elu044] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Protein:DNA interactions are essential to a range of processes that maintain and express the information encoded in the genome. Structural modeling is an approach that aims to understand these interactions at the physicochemical level. It has been proposed that structural modeling can lead to deeper understanding of the mechanisms of protein:DNA interactions, and that progress in this field can not only help to rationalize the observed specificities of DNA-binding proteins but also allow researchers to engineer novel DNA site specificities. In this review we discuss recent developments in the structural description of protein:DNA interactions and specificity, as well as the challenges facing the field in the future.
|
45
|
Ashworth J, Plaisier CL, Lo FY, Reiss DJ, Baliga NS. Inference of expanded Lrp-like feast/famine transcription factor targets in a non-model organism using protein structure-based prediction. PLoS One 2014; 9:e107863. [PMID: 25255272 PMCID: PMC4177876 DOI: 10.1371/journal.pone.0107863] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/15/2014] [Accepted: 08/16/2014] [Indexed: 11/18/2022] Open
Abstract
Widespread microbial genome sequencing presents an opportunity to understand the gene regulatory networks of non-model organisms. This requires knowledge of the binding sites for transcription factors whose DNA-binding properties are unknown or difficult to infer. We adapted a protein structure-based method to predict the specificities and putative regulons of homologous transcription factors across diverse species. As a proof-of-concept we predicted the specificities and transcriptional target genes of divergent archaeal feast/famine regulatory proteins, several of which are encoded in the genome of Halobacterium salinarum. This was validated by comparison to experimentally determined specificities for transcription factors in distantly related extremophiles, chromatin immunoprecipitation experiments, and cis-regulatory sequence conservation across eighteen related species of halobacteria. Through this analysis we were able to infer that Halobacterium salinarum employs a divergent local trans-regulatory strategy to regulate genes (carA and carB) involved in arginine and pyrimidine metabolism, whereas Escherichia coli employs an operon. The prediction of gene regulatory binding sites using structure-based methods is useful for the inference of gene regulatory relationships in new species that are otherwise difficult to infer.
Affiliation(s)
- Justin Ashworth
  - Institute for Systems Biology, Seattle, Washington, United States of America
  - * E-mail: (JA); (NB)
- Fang Yin Lo
  - Institute for Systems Biology, Seattle, Washington, United States of America
- David J. Reiss
  - Institute for Systems Biology, Seattle, Washington, United States of America
- Nitin S. Baliga
  - Institute for Systems Biology, Seattle, Washington, United States of America
  - Department of Microbiology, University of Washington, Seattle, Washington, United States of America
  - * E-mail: (JA); (NB)
|
46
|
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 337] [Impact Index Per Article: 33.7] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Affiliation(s)
- Matthew Slattery
  - Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA
- Tianyin Zhou
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Lin Yang
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Ana Carolina Dantas Machado
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Raluca Gordân
  - Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA
- Remo Rohs
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
|
47
|
Abstract
Building protein tools that can selectively bind or cleave specific DNA sequences requires efficient technologies for modifying protein-DNA interactions. Computational design is one method for accomplishing this goal. In this chapter, we present the current state of protein-DNA interface design with the Rosetta macromolecular modeling program. The LAGLIDADG endonuclease family of DNA-cleaving enzymes, under study as potential gene therapy reagents, has been the main testing ground for these in silico protocols. At this time, the computational methods are most useful for designing endonuclease variants that can accommodate small numbers of target site substitutions. Attempts to engineer for more extensive interface changes will likely benefit from an approach that uses the computational design results in conjunction with a high-throughput directed evolution or screening procedure. The family of enzymes presents an engineering challenge because their interfaces are highly integrated and there is significant coordination between the binding and catalysis events. Future developments in the computational algorithms depend on experimental feedback to improve understanding and modeling of these complex enzymatic features. This chapter presents both the basic method of design that has been successfully used to modulate specificity and more advanced procedures that incorporate DNA flexibility and other properties that are likely necessary for reliable modeling of more extensive target site changes.
Affiliation(s)
- Summer Thyme
  - Department of Biological Sciences, University of Washington, Seattle, WA, USA
|
48
|
Eichner J, Topf F, Dräger A, Wrzodek C, Wanke D, Zell A. TFpredict and SABINE: sequence-based prediction of structural and functional characteristics of transcription factors. PLoS One 2013; 8:e82238. [PMID: 24349230 PMCID: PMC3861411 DOI: 10.1371/journal.pone.0082238] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 06/13/2013] [Accepted: 10/21/2013] [Indexed: 11/18/2022] Open
Abstract
Among the key mechanisms of transcriptional control are the specific connections between transcription factors (TFs) and cis-regulatory elements in gene promoters. The elucidation of these specific protein-DNA interactions is crucial to gain insights into the complex regulatory mechanisms and networks underlying the adaptation of organisms to dynamically changing environmental conditions. As experimental techniques for determining TF binding sites are expensive and mostly performed for selected TFs only, accurate computational approaches are needed to analyze transcriptional regulation in eukaryotes on a genome-wide level. We implemented a four-step classification workflow which for a given protein sequence (1) discriminates TFs from other proteins, (2) determines the structural superclass of TFs, (3) identifies the DNA-binding domains of TFs and (4) predicts their cis-acting DNA motif. While existing tools were extended and adapted for performing the latter two prediction steps, the first two steps are based on a novel numeric sequence representation which allows for combining existing knowledge from a BLAST scan with robust machine learning-based classification. By evaluation on a set of experimentally confirmed TFs and non-TFs, we demonstrate that our new protein sequence representation facilitates more reliable identification and structural classification of TFs than previously proposed sequence-derived features. The algorithms underlying our proposed methodology are implemented in the two complementary tools TFpredict and SABINE. The online and stand-alone versions of TFpredict and SABINE are freely available to academics at http://www.cogsys.cs.uni-tuebingen.de/software/TFpredict/ and http://www.cogsys.cs.uni-tuebingen.de/software/SABINE/.
Affiliation(s)
- Johannes Eichner
  - Center of Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tübingen, Germany
  - * E-mail:
- Florian Topf
  - Center of Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tübingen, Germany
- Andreas Dräger
  - Center of Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tübingen, Germany
  - University of California San Diego, La Jolla, California, United States of America
- Clemens Wrzodek
  - Center of Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tübingen, Germany
- Dierk Wanke
  - Center for Plant Physiology Tuebingen (ZMBP), University of Tuebingen, Tübingen, Germany
- Andreas Zell
  - Center of Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tübingen, Germany
|
49
|
Zeigler RD, Cohen BA. Discrimination between thermodynamic models of cis-regulation using transcription factor occupancy data. Nucleic Acids Res 2013; 42:2224-34. [PMID: 24288374 PMCID: PMC3936720 DOI: 10.1093/nar/gkt1230] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022] Open
Abstract
Many studies have identified binding preferences for transcription factors (TFs), but few have yielded predictive models of how combinations of transcription factor binding sites generate specific levels of gene expression. Synthetic promoters have emerged as powerful tools for generating quantitative data to parameterize models of combinatorial cis-regulation. We sought to improve the accuracy of such models by quantifying the occupancy of TFs on synthetic promoters in vivo and incorporating these data into statistical thermodynamic models of cis-regulation. Using chromatin immunoprecipitation-seq, we measured the occupancy of Gcn4 and Cbf1 in synthetic promoter libraries composed of binding sites for Gcn4, Cbf1, Met31/Met32 and Nrg1. We measured the occupancy of these two TFs and the expression levels of all promoters in two growth conditions. Models parameterized using only expression data predicted expression but failed to identify several interactions between TFs. In contrast, models parameterized with occupancy and expression data predicted expression data, and also revealed Gcn4 self-cooperativity and a negative interaction between Gcn4 and Nrg1. Occupancy data also allowed us to distinguish between competing regulatory mechanisms for the factor Gcn4. Our framework for combining occupancy and expression data produces predictive models that better reflect the mechanisms underlying combinatorial cis-regulation of gene expression.
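Statistical thermodynamic models of the kind named in the abstract above enumerate the binding states of a promoter and weight each by a Boltzmann factor. The sketch below works through the smallest case, two sites with an interaction term, to show how cooperativity changes occupancy. It is a textbook-style illustration, not the authors' model; the weights `q_a` and `q_b`, the interaction factor `omega`, and the function name are assumptions.

```python
def site_occupancies(q_a, q_b, omega=1.0):
    """Occupancy of two binding sites from a four-state partition function.

    q_a, q_b: statistical weights (concentration times affinity) of each
              TF bound alone.
    omega:    interaction factor; > 1 is cooperative, < 1 is antagonistic,
              1 means the sites are independent.
    """
    # States: empty, A only, B only, A and B together.
    both = q_a * q_b * omega
    z = 1.0 + q_a + q_b + both
    occ_a = (q_a + both) / z
    occ_b = (q_b + both) / z
    return occ_a, occ_b

independent_a, _ = site_occupancies(1.0, 1.0, omega=1.0)
cooperative_a, _ = site_occupancies(1.0, 1.0, omega=10.0)
antagonistic_a, _ = site_occupancies(1.0, 1.0, omega=0.1)
```

Because `omega` shifts occupancy even when the individual weights are fixed, measured occupancy data can constrain interaction terms that expression data alone may leave undetermined, which is the kind of distinction the abstract reports.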
Affiliation(s)
- Robert D Zeigler
  - Department of Genetics, Center for Genome Sciences and Systems Biology, Washington University School of Medicine in St. Louis, MO 63108, USA
|
50
|
Stringham JL, Brown AS, Drewell RA, Dresch JM. Flanking sequence context-dependent transcription factor binding in early Drosophila development. BMC Bioinformatics 2013; 14:298. [PMID: 24093548 PMCID: PMC3851692 DOI: 10.1186/1471-2105-14-298] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 05/29/2013] [Accepted: 09/24/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Gene expression in the Drosophila embryo is controlled by functional interactions between a large network of protein transcription factors (TFs) and specific sequences in DNA cis-regulatory modules (CRMs). The binding site sequences for any TF can be experimentally determined and represented in a position weight matrix (PWM). PWMs can then be used to predict the location of TF binding sites in other regions of the genome, although there are limitations to this approach as currently implemented.
RESULTS: In this proof-of-principle study, we analyze 127 CRMs and focus on four TFs that control transcription of target genes along the anterior-posterior axis of the embryo early in development. For all four of these TFs, there is some degree of conserved flanking sequence that extends beyond the predicted binding regions. A potential role for these conserved flanking sequences may be to enhance the specificity of TF binding, as the abundance of these sequences is greatly diminished when we examine only predicted high-affinity binding sites.
CONCLUSIONS: Expanding PWMs to include sequence context-dependence will increase the information content in PWMs and facilitate a more efficient functional identification and dissection of CRMs.
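The abstract above relies on PWMs to predict where a TF binds in other genomic regions. A minimal sketch of that scanning step follows: slide a window across the sequence, sum the per-position scores, and keep windows at or above a cutoff. The matrix values, the threshold, the example sequence, and the function name are invented for illustration and are not taken from the study.

```python
def scan_sequence(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold.

    pwm is a list of dicts mapping base -> log-odds score, one per position.
    Assumes the sequence contains only A, C, G, T (other symbols raise KeyError).
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical 3-bp matrix whose best-scoring site is "ACG".
toy_pwm = [
    {"A": 1.2, "C": -1.0, "G": -1.0, "T": -0.5},
    {"A": -1.0, "C": 1.5, "G": -0.8, "T": -1.0},
    {"A": -0.7, "C": -1.0, "G": 1.4, "T": -1.0},
]

toy_hits = scan_sequence("TTACGAACGT", toy_pwm, threshold=3.0)
hit_positions = [start for start, _, _ in toy_hits]
```

A scan like this scores only the core window, so it ignores the flanking sequence context that the abstract argues carries additional information. Widening the matrix to cover flanking positions would be one way to capture it.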
Affiliation(s)
- Jessica L Stringham
  - Mathematics Department, Harvey Mudd College, 301 Platt Boulevard, Claremont, CA 91711, USA
|