1. Khamis AM, Motwalli O, Oliva R, Jankovic BR, Medvedeva YA, Ashoor H, Essack M, Gao X, Bajic VB. A novel method for improved accuracy of transcription factor binding site prediction. Nucleic Acids Res 2018; 46:e72. [PMID: 29617876 PMCID: PMC6037060 DOI: 10.1093/nar/gky237] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 06/19/2017] [Revised: 03/01/2018] [Accepted: 03/20/2018] [Indexed: 12/12/2022] Open
Abstract
Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to a large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs, while at the same time significantly improves prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and in a stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
Affiliation(s)
- Abdullah M Khamis
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Olaa Motwalli
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Romina Oliva
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
  - Department of Sciences and Technologies, University ‘Parthenope’ of Naples, Centro Direzionale Isola C4 80143, Naples, Italy
- Boris R Jankovic
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Yulia A Medvedeva
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
  - Institute of Bioengineering, Research Centre of Biotechnology, Russian Academy of Science, 117312 Moscow, Russia
  - Department of Computational Biology, Vavilov Institute of General Genetics, Russian Academy of Science, 119991 Moscow, Russia
  - Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Moscow Region, Russia
- Haitham Ashoor
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Magbubah Essack
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Xin Gao
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Vladimir B Bajic
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
2. Kulandaisamy A, Srivastava A, Nagarajan R, Gromiha MM. Dissecting and analyzing key residues in protein-DNA complexes. J Mol Recognit 2017; 31. [DOI: 10.1002/jmr.2692] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/13/2017] [Revised: 11/06/2017] [Accepted: 11/06/2017] [Indexed: 02/03/2023]
Affiliation(s)
- A. Kulandaisamy
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences; Indian Institute of Technology Madras; Chennai 600 036 Tamilnadu India
- Ambuj Srivastava
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences; Indian Institute of Technology Madras; Chennai 600 036 Tamilnadu India
- R. Nagarajan
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences; Indian Institute of Technology Madras; Chennai 600 036 Tamilnadu India
- M. Michael Gromiha
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of BioSciences; Indian Institute of Technology Madras; Chennai 600 036 Tamilnadu India
3. Gapsys V, de Groot BL. Alchemical Free Energy Calculations for Nucleotide Mutations in Protein–DNA Complexes. J Chem Theory Comput 2017; 13:6275-6289. [DOI: 10.1021/acs.jctc.7b00849] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Indexed: 11/30/2022]
Affiliation(s)
- Vytautas Gapsys
  - Computational Biomolecular Dynamics Group, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Bert L. de Groot
  - Computational Biomolecular Dynamics Group, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
4. Brimmo AT, Qasaimeh MA. Microfluidic Probes and Quadrupoles: A new era of open microfluidics. IEEE Nanotechnology Magazine 2017. [DOI: 10.1109/mnano.2016.2633678] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 11/09/2022]
5. Qin W, Zhao G, Carson M, Jia C, Lu H. Knowledge-based three-body potential for transcription factor binding site prediction. IET Syst Biol 2016; 10:23-9. [PMID: 26816396 DOI: 10.1049/iet-syb.2014.0066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022] Open
Abstract
A structure-based statistical potential is developed for transcription factor binding site (TFBS) prediction. Besides the direct contact between amino acids from TFs and DNA bases, the authors also considered the influence of the neighbouring base. This three-body potential showed better discriminate powers than the two-body potential. They validate the performance of the potential in TFBS identification, binding energy prediction and binding mutation prediction.
Affiliation(s)
- Wenyi Qin
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Guijun Zhao
  - Key Laboratory of Molecular Embryology, Ministry of Health & Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, People's Republic of China
- Matthew Carson
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Caiyan Jia
  - School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, People's Republic of China
- Hui Lu
  - Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
6. Joyce AP, Zhang C, Bradley P, Havranek JJ. Structure-based modeling of protein:DNA specificity. Brief Funct Genomics 2014; 14:39-49. [PMID: 25414269 DOI: 10.1093/bfgp/elu044] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Protein:DNA interactions are essential to a range of processes that maintain and express the information encoded in the genome. Structural modeling is an approach that aims to understand these interactions at the physicochemical level. It has been proposed that structural modeling can lead to deeper understanding of the mechanisms of protein:DNA interactions, and that progress in this field can not only help to rationalize the observed specificities of DNA-binding proteins but also to allow researchers to engineer novel DNA site specificities. In this review we discuss recent developments in the structural description of protein:DNA interactions and specificity, as well as the challenges facing the field in the future.
7. McHarris DM, Barr DA. Truncated variants of the GCN4 transcription activator protein bind DNA with dramatically different dynamical motifs. J Chem Inf Model 2014; 54:2869-75. [PMID: 25204850 DOI: 10.1021/ci500448e] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
The yeast protein GCN4 is a transcriptional activator in the basic leucine zipper (bZip) family, whose distinguishing feature is the "chopstick-like" homodimer of alpha helices formed at the DNA-binding interface. While experiments have shown that truncated versions of the protein retain biologically relevant DNA-binding affinity, we present the results of a computational study revealing that these variants show a wide variety of dynamical modes in their interaction with the target DNA sequence. We have performed all-atom molecular dynamics simulations of the full-length GCN4 protein as well as three truncated variants; our data indicate that the truncated mutants show dramatically different correlation patterns. We conclude that although the truncated mutants still retain DNA-binding ability, the bZip interface present in the full-length protein provides important stability for the protein-DNA complex.
Affiliation(s)
- Danielle M McHarris
  - Department of Chemistry and Biochemistry, Utica College, 1600 Burrstone Road, Utica, New York 13502, United States
8. Levy-Sakin M, Grunwald A, Kim S, Gassman NR, Gottfried A, Antelman J, Kim Y, Ho S, Samuel R, Michalet X, Lin RR, Dertinger T, Kim AS, Chung S, Colyer RA, Weinhold E, Weiss S, Ebenstein Y. Toward single-molecule optical mapping of the epigenome. ACS Nano 2014; 8:14-26. [PMID: 24328256 PMCID: PMC4022788 DOI: 10.1021/nn4050694] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 05/12/2023]
Abstract
The past decade has seen an explosive growth in the utilization of single-molecule techniques for the study of complex systems. The ability to resolve phenomena otherwise masked by ensemble averaging has made these approaches especially attractive for the study of biological systems, where stochastic events lead to inherent inhomogeneity at the population level. The complex composition of the genome has made it an ideal system to study at the single-molecule level, and methods aimed at resolving genetic information from long, individual, genomic DNA molecules have been in use for the last 30 years. These methods, and particularly optical-based mapping of DNA, have been instrumental in highlighting genomic variation and contributed significantly to the assembly of many genomes including the human genome. Nanotechnology and nanoscopy have been a strong driving force for advancing genomic mapping approaches, allowing both better manipulation of DNA on the nanoscale and enhanced optical resolving power for analysis of genomic information. During the past few years, these developments have been adopted also for epigenetic studies. The common principle for these studies is the use of advanced optical microscopy for the detection of fluorescently labeled epigenetic marks on long, extended DNA molecules. Here we will discuss recent single-molecule studies for the mapping of chromatin composition and epigenetic DNA modifications, such as DNA methylation.
Affiliation(s)
- Michal Levy-Sakin
  - Raymond and Beverly Sackler Faculty of Exact Sciences, School of Chemistry, Tel Aviv University, Tel Aviv, Israel
- Assaf Grunwald
  - Raymond and Beverly Sackler Faculty of Exact Sciences, School of Chemistry, Tel Aviv University, Tel Aviv, Israel
- Soohong Kim
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Natalie R. Gassman
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Anna Gottfried
  - Institute of Organic Chemistry, RWTH Aachen University, Aachen, Germany
- Josh Antelman
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Younggyu Kim
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Sam Ho
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Robin Samuel
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Xavier Michalet
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Ron R. Lin
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Thomas Dertinger
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Andrew S. Kim
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Sangyoon Chung
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Ryan A. Colyer
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
- Elmar Weinhold
  - Institute of Organic Chemistry, RWTH Aachen University, Aachen, Germany
- Shimon Weiss
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, USA
  - Corresponding authors: (Y. Ebenstein), (S. Weiss)
- Yuval Ebenstein
  - Raymond and Beverly Sackler Faculty of Exact Sciences, School of Chemistry, Tel Aviv University, Tel Aviv, Israel
  - Corresponding authors: (Y. Ebenstein), (S. Weiss)
9. On the use of knowledge-based potentials for the evaluation of models of protein-protein, protein-DNA, and protein-RNA interactions. Advances in Protein Chemistry and Structural Biology 2014; 94:77-120. [PMID: 24629186 DOI: 10.1016/b978-0-12-800168-4.00004-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 12/11/2022]
Abstract
Proteins are the bricks and mortar of cells, playing structural and functional roles. In order to perform their function, they interact with each other as well as with other biomolecules such as DNA or RNA. Therefore, to fathom the function of a protein, we require knowing its partners and the atomic details of its interactions (i.e., the structure of the complex). However, the amount of protein interactions with an experimentally determined three-dimensional structure is scarce. Therefore, computational techniques such as homology modeling are foremost to fill this gap. Protein interactions can be modeled using as templates the interactions of homologous proteins, if the structure of the complex is known, or using docking methods. In both approaches, the estimation of the quality of models is essential. There are several ways to address this problem. In this review, we focus on the use of knowledge-based potentials for the analysis of protein interactions. We describe the procedure to derive statistical potentials and split them into different energetic terms that can be used for different purposes. We extensively discuss the fields where knowledge-based potentials have been successfully applied to (1) model protein-protein, protein-DNA, and protein-RNA interactions and (2) predict binding sites (in the protein and in the DNA). Moreover, we provide ready-to-use resources for docking and benchmarking protein interactions.
10.
Abstract
Predicting binding sites of a transcription factor in the genome is an important, but challenging, issue in studying gene regulation. In the past decade, a large number of protein–DNA co-crystallized structures available in the Protein Data Bank have facilitated the understanding of interacting mechanisms between transcription factors and their binding sites. Recent studies have shown that both physics-based and knowledge-based potential functions can be applied to protein–DNA complex structures to deliver position weight matrices (PWMs) that are consistent with the experimental data. To further use the available structural models, the proposed Web server, PiDNA, aims at first constructing reliable PWMs by applying an atomic-level knowledge-based scoring function on numerous in silico mutated complex structures, and then using the PWM constructed by the structure models with small energy changes to predict the interaction between proteins and DNA sequences. With PiDNA, the users can easily predict the relative preference of all the DNA sequences with limited mutations from the native sequence co-crystallized in the model in a single run. More predictions on sequences with unlimited mutations can be realized by additional requests or file uploading. Three types of information can be downloaded after prediction: (i) the ranked list of mutated sequences, (ii) the PWM constructed by the favourable mutated structures, and (iii) any mutated protein–DNA complex structure models specified by the user. This study first shows that the constructed PWMs are similar to the annotated PWMs collected from databases or literature. Second, the prediction accuracy of PiDNA in detecting relatively high-specificity sites is evaluated by comparing the ranked lists against in vitro experiments from protein-binding microarrays. Finally, PiDNA is shown to be able to select the experimentally validated binding sites from 10 000 random sites with high accuracy. 
With PiDNA, the users can design biological experiments based on the predicted sequence specificity and/or request mutated structure models for further protein design. As well, it is expected that PiDNA can be incorporated with chromatin immunoprecipitation data to refine large-scale inference of in vivo protein–DNA interactions. PiDNA is available at: http://dna.bime.ntu.edu.tw/pidna.
Affiliation(s)
- Chih-Kang Lin
  - Center for Systems Biology, National Taiwan University, Taipei 106, Taiwan
11. Xu B, Schones DE, Wang Y, Liang H, Li G. A structural-based strategy for recognition of transcription factor binding sites. PLoS One 2013; 8:e52460. [PMID: 23320072 PMCID: PMC3540023 DOI: 10.1371/journal.pone.0052460] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 09/08/2012] [Accepted: 11/19/2012] [Indexed: 12/30/2022] Open
Abstract
Scanning through genomes for potential transcription factor binding sites (TFBSs) is becoming increasingly important in this post-genomic era. The position weight matrix (PWM) is the standard representation of TFBSs utilized when scanning through sequences for potential binding sites. However, many transcription factor (TF) motifs are short and highly degenerate, and methods utilizing PWMs to scan for sites are plagued by false positives. Furthermore, many important TFs do not have well-characterized PWMs, making identification of potential binding sites even more difficult. One approach to the identification of sites for these TFs has been to use the 3D structure of the TF to predict the DNA structure around the TF and then to generate a PWM from the predicted 3D complex structure. However, this approach is dependent on the similarity of the predicted structure to the native structure. We introduce here a novel approach to identify TFBSs utilizing structure information that can be applied to TFs without characterized PWMs, as long as a 3D complex structure (TF/DNA) exists. This approach utilizes an energy function that is uniquely trained on each structure. Our approach leads to increased prediction accuracy and robustness compared with those using a more general energy function. The software is freely available upon request.
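As a minimal sketch of the PWM scanning baseline that this abstract describes, the Python below builds a log-odds matrix from aligned sites and slides it along a sequence. It is illustrative only and is not the authors' structure-based method; the function names, the uniform background of 0.25, the pseudocount, the threshold, and the toy sites are all assumptions introduced here.

```python
import math

def build_log_odds(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds PWM (one dict per position) from aligned, equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((c / total) / background) for b, c in counts.items()})
    return pwm

def scan(sequence, pwm, threshold):
    """Return (offset, score) for every window scoring at or above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical toy sites, not taken from any of the cited studies.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_log_odds(toy_sites)
hits = scan("CCTGACTCAGG", pwm, threshold=5.0)
```

Lowering the threshold admits more windows, which is the source of the false positives the abstract mentions for short, degenerate motifs.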
Affiliation(s)
- Beisi Xu
  - Laboratory of Molecular Modeling and Design, State Key Laboratory of Molecular Reaction Dynamics, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian, Liaoning, China
  - Department of Microbiology, Immunology and Biochemistry, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
  - Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Dustin E. Schones
  - Department of Cancer Biology, Beckman Research Institute, City of Hope, Duarte, California, United States of America
- Yongmei Wang
  - Department of Chemistry, University of Memphis, Memphis, Tennessee, United States of America
- Haojun Liang
  - Department of Polymer Science and Engineering, University of Science and Technology of China, Hefei, Anhui, China
- Guohui Li
  - Laboratory of Molecular Modeling and Design, State Key Laboratory of Molecular Reaction Dynamics, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian, Liaoning, China
12. Chang DTH, Li WS, Bai YH, Wu WS. YGA: identifying distinct biological features between yeast gene sets. Gene 2012; 518:26-34. [PMID: 23266802 DOI: 10.1016/j.gene.2012.11.089] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 10/02/2012] [Accepted: 11/27/2012] [Indexed: 12/01/2022]
Abstract
The advance of high-throughput experimental technologies generates many gene sets with different biological meanings, where many important insights can only be extracted by identifying the biological (regulatory/functional) features that are distinct between different gene sets (e.g. essential vs. non-essential genes, TATA box-containing vs. TATA box-less genes, induced vs. repressed genes under certain biological conditions). Although many servers have been developed to identify enriched features in a gene set, most of them were designed to analyze one gene set at a time but cannot compare two gene sets. Moreover, the features used in existing servers were mainly focused on functional annotations (GO terms), pathways, transcription factor binding sites (TFBSs) and/or protein-protein interactions (PPIs). In yeast, various important regulatory features, including promoter bendability, nucleosome occupancy, 5'-UTR length, and TF-gene regulation evidence, are available but have not been used in any enrichment analysis servers. This motivates us to develop the Yeast Genes Analyzer (YGA), a web server that simultaneously analyzes various biological (regulatory/functional) features of two gene sets and performs statistical tests to identify the distinct features between them. Many well-studied gene sets such as essential, stress-response, TATA box-containing and cell cycle genes were pre-compiled in YGA for users, if they have only one gene set, to compare with. In comparison with the existing enrichment analysis servers, YGA tests more comprehensive regulatory features (e.g. promoter bendability, nucleosome occupancy, 5'-UTR length, experimental evidence of TF-gene binding and TF-gene regulation) and functional features (e.g. PPI, GO terms, pathways and functional groups of genes, including essential/non-essential genes, stress-induced/-repressed genes, TATA box-containing/-less genes, occupied/depleted proximal-nucleosome genes and cell cycle genes). 
Furthermore, YGA uses various statistical tests to provide objective comparison measures. The two major contributions of YGA, comprehensive features and statistical comparison, help to mine important information that cannot be obtained from other servers. The sophisticated analysis tools of YGA can identify distinct biological features between two gene sets, which help biologists to form new hypotheses about the underlying biological mechanisms responsible for the observed difference between these two gene sets. YGA can be accessed from the following web pages: http://cosbi.ee.ncku.edu.tw/yga/ and http://yga.ee.ncku.edu.tw/.
Affiliation(s)
- Darby Tien-Hao Chang
  - Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Taiwan
13. Maienschein-Cline M, Dinner AR, Hlavacek WS, Mu F. Improved predictions of transcription factor binding sites using physicochemical features of DNA. Nucleic Acids Res 2012; 40:e175. [PMID: 22923524 PMCID: PMC3526315 DOI: 10.1093/nar/gks771] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 01/07/2023] Open
Abstract
Typical approaches for predicting transcription factor binding sites (TFBSs) involve use of a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features. The accuracy of each of the variant methods considered was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
14. Chang DTH, Ke CH, Lin JH, Chiang JH. AutoBind: automatic extraction of protein-ligand-binding affinity data from biological literature. Bioinformatics 2012; 28:2162-8. [PMID: 22753780 DOI: 10.1093/bioinformatics/bts367] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Abstract
MOTIVATION Determination of the binding affinity of a protein-ligand complex is important to quantitatively specify whether a particular small molecule will bind to the target protein. Besides, collection of comprehensive datasets for protein-ligand complexes and their corresponding binding affinities is crucial in developing accurate scoring functions for the prediction of the binding affinities of previously unknown protein-ligand complexes. In the past decades, several databases of protein-ligand-binding affinities have been created via visual extraction from literature. However, such approaches are time-consuming and most of these databases are updated only a few times per year. Hence, there is an immediate demand for an automatic extraction method with high precision for binding affinity collection. RESULT We have created a new database of protein-ligand-binding affinity data, AutoBind, based on automatic information retrieval. We first compiled a collection of 1586 articles where the binding affinities have been marked manually. Based on this annotated collection, we designed four sentence patterns that are used to scan full-text articles as well as a scoring function to rank the sentences that match our patterns. The proposed sentence patterns can effectively identify the binding affinities in full-text articles. Our assessment shows that AutoBind achieved 84.22% precision and 79.07% recall on the testing corpus. Currently, 13 616 protein-ligand complexes and the corresponding binding affinities have been deposited in AutoBind from 17 221 articles. AVAILABILITY AutoBind is automatically updated on a monthly basis, and it is freely available at http://autobind.csie.ncku.edu.tw/ and http://autobind.mc.ntu.edu.tw/. All of the deposited binding affinities have been refined and approved manually before being released.
Affiliation(s)
- Darby Tien-Hao Chang
  - Department of Electrical Engineering, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 70101, Taiwan
15. Chien TY, Lin CK, Lin CW, Weng YZ, Chen CY, Chang DTH. DBD2BS: connecting a DNA-binding protein with its binding sites. Nucleic Acids Res 2012; 40:W173-9. [PMID: 22693214 PMCID: PMC3394304 DOI: 10.1093/nar/gks564] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 03/04/2012] [Revised: 05/07/2012] [Accepted: 05/19/2012] [Indexed: 11/25/2022] Open
Abstract
By binding to short and highly conserved DNA sequences in genomes, DNA-binding proteins initiate, enhance or repress biological processes. Accurately identifying such binding sites, often represented by position weight matrices (PWMs), is an important step in understanding the control mechanisms of cells. When given coordinates of a DNA-binding domain (DBD) bound with DNA, a potential function can be used to estimate the change of binding affinity after base substitutions, where the changes can be summarized as a PWM. This technique provides an effective alternative when the chromatin immunoprecipitation data are unavailable for PWM inference. To facilitate the procedure of predicting PWMs based on protein-DNA complexes or even structures of the unbound state, the web server, DBD2BS, is presented in this study. The DBD2BS uses an atom-level knowledge-based potential function to predict PWMs characterizing the sequences to which the query DBD structure can bind. For unbound queries, a list of 1066 DBD-DNA complexes (including 1813 protein chains) is compiled for use as templates for synthesizing bound structures. The DBD2BS provides users with an easy-to-use interface for visualizing the PWMs predicted based on different templates and the spatial relationships of the query protein, the DBDs and the DNAs. The DBD2BS is the first attempt to predict PWMs of DBDs from unbound structures rather than from bound ones. This approach increases the number of existing protein structures that can be exploited when analyzing protein-DNA interactions. In a recent study, the authors showed that the kernel adopted by the DBD2BS can generate PWMs consistent with those obtained from the experimental data. The use of DBD2BS to predict PWMs can be incorporated with sequence-based methods to discover binding sites in genome-wide studies. Available at: http://dbd2bs.csie.ntu.edu.tw/, http://dbd2bs.csbb.ntu.edu.tw/, and http://dbd2bs.ee.ncku.edu.tw.
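The idea of summarizing predicted binding-energy changes as a PWM can be sketched as below. This is a hedged illustration, not the DBD2BS implementation: it assumes a Boltzmann weighting of per-position energy changes, and the function name, the `kT` parameter, and the toy energy values are all hypothetical.

```python
import math

def energies_to_pwm(delta_energy, kT=1.0):
    """Turn per-position energy changes (one dict per position) into base probabilities.

    Assumption: lower energy change means more favourable binding, and each
    base is weighted by exp(-dE / kT) before normalizing the column.
    """
    pwm = []
    for column in delta_energy:
        weights = {base: math.exp(-energy / kT) for base, energy in column.items()}
        partition = sum(weights.values())
        pwm.append({base: w / partition for base, w in weights.items()})
    return pwm

# Hypothetical energy changes relative to the native base (arbitrary units).
toy_energies = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 2.0},  # position strongly prefers A
    {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0},  # position with no preference
]
pwm = energies_to_pwm(toy_energies)
```

A smaller `kT` sharpens each column toward the lowest-energy base, while a larger value flattens it toward uniform.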
Affiliation(s)
- Ting-Ying Chien, Chih-Kang Lin, Chih-Wei Lin, Yi-Zhong Weng, Chien-Yu Chen, Darby Tien-Hao Chang
- Department of Computer Science and Information Engineering, Center for Systems Biology, Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei 106, Taiwan, and Department of Electrical Engineering, National Cheng Kung University, Tainan 701, Taiwan
|
16
|
Gabdoulline R, Eckweiler D, Kel A, Stegmaier P. 3DTF: a web server for predicting transcription factor PWMs using 3D structure-based energy calculations. Nucleic Acids Res 2012; 40:W180-5. [PMID: 22693215 PMCID: PMC3394331 DOI: 10.1093/nar/gks551] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 12/02/2022]
Abstract
We present the webserver 3D transcription factor (3DTF) to compute position-specific weight matrices (PWMs) of transcription factors using a knowledge-based statistical potential derived from crystallographic data on protein–DNA complexes. Analysis of available structures that can be used to construct PWMs shows that there are hundreds of 3D structures from which PWMs could be derived, as well as thousands of proteins homologous to these. Therefore, we created 3DTF, which delivers binding matrices given the experimental or modeled protein–DNA complex. The webserver can be used by biologists to derive novel PWMs for transcription factors lacking known binding sites and is freely accessible at http://www.gene-regulation.com/pub/programs/3dtf/.
Affiliation(s)
- R Gabdoulline
- Heinrich-Heine University of Duesseldorf, Universitaetstr. 1, 40225 Duesseldorf, Germany
|
17
|
Dey S, Pal A, Guharoy M, Sonavane S, Chakrabarti P. Characterization and prediction of the binding site in DNA-binding proteins: improvement of accuracy by combining residue composition, evolutionary conservation and structural parameters. Nucleic Acids Res 2012; 40:7150-61. [PMID: 22641851 PMCID: PMC3424558 DOI: 10.1093/nar/gks405] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 12/25/2022]
Abstract
We present a set of four parameters that in combination can predict DNA-binding residues on protein structures to a high degree of accuracy. These are the number of evolutionarily conserved residues (Ncons) and their spatial clustering (ρe), hydrogen bond donor capability (Dp) and residue propensity (Rp). We first used these parameters to characterize 130 interfaces in a set of 126 DNA-binding proteins (DBPs). The applicability of these parameters, both individually and in combination, to distinguish the true binding region from the rest of the protein surface was then analyzed. Rp shows the best performance, identifying the true interface with the top rank in 83% of cases. Importantly, we also used the unbound-bound test cases of the protein–DNA docking benchmark to test the efficacy of our method. When applied to the unbound form of the DBPs, Rp can distinguish 86% of cases. Finally, we have applied the SVM approach for recognizing the interface region using the above parameters along with the individual amino acid composition as attributes. The accuracy of prediction is 90.5% for the bound structures and 93.6% for the unbound form of the proteins.
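The idea of ranking candidate surface regions by a residue propensity score can be sketched as below. This is an illustration under stated assumptions, not the authors' method: the abstract does not define how Rp is computed, so the mean-propensity score, the `PROPENSITY` table, the function names and the patch compositions are all hypothetical placeholders.

```python
# Hypothetical per-residue propensity values, for illustration only.
# They are NOT taken from the paper.
PROPENSITY = {
    "R": 1.0, "K": 0.8, "S": 0.1, "G": 0.0,
    "L": -0.5, "E": -0.8, "D": -0.9,
}

def patch_score(residues):
    """Mean propensity of the residues in one surface patch."""
    if not residues:
        return 0.0
    # Residues missing from the table contribute a neutral 0.0.
    return sum(PROPENSITY.get(r, 0.0) for r in residues) / len(residues)

def rank_patches(patches):
    """Return patch names ordered from highest to lowest score."""
    return sorted(patches, key=lambda name: patch_score(patches[name]), reverse=True)

# Hypothetical patches given as one-letter residue strings.
patches = {
    "patch_a": "RKRSG",
    "patch_b": "DELGS",
    "patch_c": "KSGLL",
}
ranking = rank_patches(patches)
```

Under these toy values the patch richest in arginine and lysine ranks first. A full pipeline in the spirit of the abstract would combine such a score with the other parameters (Ncons, ρe, Dp) as features for a classifier.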
Affiliation(s)
- Sucharita Dey
- Bioinformatics Centre, Bose Institute, P-1/12 CIT Scheme VIIM, Kolkata 700 054, India
|