1
|
Ljubic M, D'Ercole C, Waheed Y, de Marco A, Borisek J, De March M. Computational study of the HLTF ATPase remodeling domain suggests its activity on dsDNA and implications in damage tolerance. J Struct Biol 2024:108149. [PMID: 39491691 DOI: 10.1016/j.jsb.2024.108149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2024] [Revised: 10/04/2024] [Accepted: 10/28/2024] [Indexed: 11/05/2024]
Abstract
The Helicase-Like Transcription Factor (HLTF) is member of the SWI/SNF-family of ATP dependent chromatin remodellers known primarily for maintaining genome stability. Biochemical and cellular assays support its multiple roles in DNA Damage Tolerance. However, the lack of sufficient structural data limits the comprehension of the molecular basis of its modes of action. In this work we have modelled and characterized the HLTF ATPase remodeling domain by using bioinformatic tools and all-atoms molecular dynamics simulations. In-silico results suggested that its binding to dsDNA is mainly mediated by the positively charged residues Arg563 and Lys913, found conserved in HLTF homologs, and Arg620 and Lys999, found only in HLTF. Interestingly, these residues are mutated in cancer cells. During translocation on dsDNA, HLTF remains persistently bound through the N-terminal ATPase subunit. However, DNA advancement occurs only in the presence of the synergic-anticorrelated action of both motor lobes. In contrast, the C-terminal facilitates substrate remodeling through DNA deformation and generation of bulges according to a wave-model. Finally, the large conformational change suggested between the two motor-remodeling subunits might be activated upon the release of PARP1 on stalled fork and be responsible for the intervention of HLTF-HIRAN in the formation of D-loop and 4-way junction DNA structures.
Collapse
Affiliation(s)
- Martin Ljubic
- Theory Department, National Institute of Chemistry, Hajdrihova 19, 1000 Ljubljana, Slovenia
| | - Claudia D'Ercole
- Laboratory for Environmental and Life Sciences, University of Nova Gorica, Vipavska cesta 13, Rožna Dolina, 5000 Nova Gorica, Slovenia
| | - Yossma Waheed
- Laboratory for Environmental and Life Sciences, University of Nova Gorica, Vipavska cesta 13, Rožna Dolina, 5000 Nova Gorica, Slovenia; National Institute of Science and Technology, Sector H-12, Islamabad Capital Territory, Pakistan
| | - Ario de Marco
- Laboratory for Environmental and Life Sciences, University of Nova Gorica, Vipavska cesta 13, Rožna Dolina, 5000 Nova Gorica, Slovenia
| | - Jure Borisek
- Theory Department, National Institute of Chemistry, Hajdrihova 19, 1000 Ljubljana, Slovenia
| | - Matteo De March
- Laboratory for Environmental and Life Sciences, University of Nova Gorica, Vipavska cesta 13, Rožna Dolina, 5000 Nova Gorica, Slovenia.
| |
Collapse
|
2
|
Daanial Khan Y, Alkhalifah T, Alturise F, Hassan Butt A. DeepDBS: Identification of DNA-binding sites in protein sequences by using deep representations and random forest. Methods 2024; 231:26-36. [PMID: 39270885 DOI: 10.1016/j.ymeth.2024.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2024] [Revised: 08/26/2024] [Accepted: 09/04/2024] [Indexed: 09/15/2024] Open
Abstract
Interactions of biological molecules in organisms are considered to be primary factors for the lifecycle of that organism. Various important biological functions are dependent on such interactions and among different kinds of interactions, the protein DNA interactions are very important for the processes of transcription, regulation of gene expression, DNA repairing and packaging. Thus, keeping the knowledge of such interactions and the sites of those interactions is necessary to study the mechanism of various biological processes. As experimental identification through biological assays is quite resource-demanding, costly and error-prone, scientists opt for the computational methods for efficient and accurate identification of such DNA-protein interaction sites. Thus, herein, we propose a novel and accurate method namely DeepDBS for the identification of DNA-binding sites in proteins, using primary amino acid sequences of proteins under study. From protein sequences, deep representations were computed through a one-dimensional convolution neural network (1D-CNN), recurrent neural network (RNN) and long short-term memory (LSTM) network and were further used to train a Random Forest classifier. Random Forest with LSTM-based features outperformed the other models, as well as the existing state-of-the-art methods with an accuracy score of 0.99 for self-consistency test, 10-fold cross-validation, 5-fold cross-validation, and jackknife validation while 0.92 for independent dataset testing. It is concluded based on results that the DeepDBS can help accurate and efficient identification of DNA binding sites (DBS) in proteins.
Collapse
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Punjab 54770, Pakistan
| | - Tamim Alkhalifah
- Department of Computer Engineering, College of Computer, Qassim University, Buraydah, Saudi Arabia
| | - Fahad Alturise
- Department of Cybersecurity, College of Computer, Qassim University, Buraydah 52571, Saudi Arabia
| | - Ahmad Hassan Butt
- Department of Computer Science, Faculty of Computing and Information Technology, University of the Punjab, Lahore 54000, Punjab, Pakistan.
| |
Collapse
|
3
|
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Collapse
Affiliation(s)
- Pengzhen Jia
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Fuhao Zhang
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
| | - Chaojin Wu
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| |
Collapse
|
4
|
Zhang J, Basu S, Kurgan L. HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins. Nucleic Acids Res 2024; 52:e10. [PMID: 38048333 PMCID: PMC10810184 DOI: 10.1093/nar/gkad1131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2023] [Accepted: 11/10/2023] [Indexed: 12/06/2023] Open
Abstract
Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups, those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) vs. intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. Majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses deep transformer network to combine predictions generated by three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, PR China
| | - Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
5
|
Zhu YH, Liu Z, Liu Y, Ji Z, Yu DJ. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction. Brief Bioinform 2024; 25:bbae040. [PMID: 38349057 PMCID: PMC10939370 DOI: 10.1093/bib/bbae040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2023] [Revised: 01/02/2024] [Accepted: 01/22/2024] [Indexed: 02/15/2024] Open
Abstract
Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
| | - Zi Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Yan Liu
- School of Information Engineering, Yangzhou University, Yangzhou 225000, China
| | - Zhiwei Ji
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| |
Collapse
|
6
|
Qian Y, Shang T, Guo F, Wang C, Cui Z, Ding Y, Wu H. Identification of DNA-binding protein based multiple kernel model. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:13149-13170. [PMID: 37501482 DOI: 10.3934/mbe.2023586] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
DNA-binding proteins (DBPs) play a critical role in the development of drugs for treating genetic diseases and in DNA biology research. It is essential for predicting DNA-binding proteins more accurately and efficiently. In this paper, a Laplacian Local Kernel Alignment-based Restricted Kernel Machine (LapLKA-RKM) is proposed to predict DBPs. In detail, we first extract features from the protein sequence using six methods. Second, the Radial Basis Function (RBF) kernel function is utilized to construct pre-defined kernel metrics. Then, these metrics are combined linearly by weights calculated by LapLKA. Finally, the fused kernel is input to RKM for training and prediction. Independent tests and leave-one-out cross-validation were used to validate the performance of our method on a small dataset and two large datasets. Importantly, we built an online platform to represent our model, which is now freely accessible via http://8.130.69.121:8082/.
Collapse
Affiliation(s)
- Yuqing Qian
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Tingting Shang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Chunliang Wang
- The Second Affiliated Hospital of Soochow University, Suzhou, China
| | - Zhiming Cui
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| |
Collapse
|
7
|
Sun C, Feng Y, Fan G. IDPsBind: a repository of binding sites for intrinsically disordered proteins complexes with known 3D structures. BMC Mol Cell Biol 2022; 23:33. [PMID: 35883018 PMCID: PMC9327236 DOI: 10.1186/s12860-022-00434-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2021] [Accepted: 07/14/2022] [Indexed: 11/10/2022] Open
Abstract
Abstract
Background
Intrinsically disordered proteins (IDPs) lack a stable three-dimensional structure under physiological conditions but play crucial roles in many biological processes. Intrinsically disordered proteins perform various biological functions by interacting with other ligands.
Results
Here, we present a database, IDPsBind, which displays interacting sites between IDPs and interacting ligands by using the distance threshold method in known 3D structure IDPs complexes from the PDB database. IDPsBind contains 9626 IDPs complexes and 880 intrinsically disordered proteins verified by experiments. The current release of the IDPsBind database is defined as version 1.0. IDPsBind is freely accessible at http://www.s-bioinformatics.cn/idpsbind/home/.
Conclusions
IDPsBind provides more comprehensive interaction sites for IDPs complexes of known 3D structures. It can not only help the subsequent studies of the interaction mechanism of intrinsically disordered proteins but also provides a suitable background for developing the algorithms for predicting the interaction sites of intrinsically disordered proteins.
Collapse
|
8
|
Yu S, Peng D, Zhu W, Liao B, Wang P, Yang D, Wu F. Hybrid_DBP: Prediction of DNA-binding proteins using hybrid features and convolutional neural networks. Front Pharmacol 2022; 13:1031759. [PMID: 36299898 PMCID: PMC9589247 DOI: 10.3389/fphar.2022.1031759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Accepted: 09/27/2022] [Indexed: 11/21/2022] Open
Abstract
DNA-binding proteins (DBP) play an essential role in the genetics and evolution of organisms. A particular DNA sequence could provide underlying therapeutic benefits for hereditary diseases and cancers. Studying these proteins can timely and effectively understand their mechanistic analysis and play a particular function in disease prevention and treatment. The limitation of identifying DNA-binding protein members from the sequence database is time-consuming, costly, and ineffective. Therefore, efficient methods for improving DBP classification are crucial to disease research. In this paper, we developed a novel predictor Hybrid _DBP, which identified potential DBP by using hybrid features and convolutional neural networks. The method combines two feature selection methods, MonoDiKGap and Kmer, and then used MRMD2.0 to remove redundant features. According to the results, 94% of DBP were correctly recognized, and the accuracy of the independent test set reached 91.2%. This means Hybrid_ DBP can become a useful prediction tool for predicting DBP.
Collapse
Affiliation(s)
- Shaoyou Yu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Dejun Peng
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- *Correspondence: Wen Zhu,
| | - Bo Liao
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Peng Wang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Dongxuan Yang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Fangxiang Wu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
9
|
Comparative Analysis on Alignment-Based and Pretrained Feature Representations for the Identification of DNA-Binding Proteins. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:5847242. [PMID: 35799660 PMCID: PMC9256349 DOI: 10.1155/2022/5847242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Accepted: 06/07/2022] [Indexed: 11/17/2022]
Abstract
The interaction between DNA and protein is vital for the development of a living body. Previous numerous studies on in silico identification of DNA-binding proteins (DBPs) usually include features extracted from the alignment-based (pseudo) position-specific scoring matrix (PSSM), leading to limited application due to its time-consuming generation. Few researchers have paid attention to the application of pretrained language models at the scale of evolution to the identification of DBPs. To this end, we present comprehensive insights into a comparison study on alignment-based PSSM and pretrained evolutionary scale modeling (ESM) representations in the field of DBP classification. The comparison is conducted by extracting information from PSSM and ESM representations using four unified averaging operations and by performing various feature selection (FS) methods. Experimental results demonstrate that the pretrained ESM representation outperforms the PSSM-derived features in a fair comparison perspective. The pretrained feature presentation deserves wide application to the area of in silico DBP identification as well as other function annotation issues. Finally, it is also confirmed that an ensemble scheme by aggregating various trained FS models can significantly improve the classification performance of DBPs.
Collapse
|
10
|
Refactoring transcription factors for metabolic engineering. Biotechnol Adv 2022; 57:107935. [PMID: 35271945 DOI: 10.1016/j.biotechadv.2022.107935] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2021] [Revised: 02/04/2022] [Accepted: 03/03/2022] [Indexed: 12/19/2022]
Abstract
Due to the ability to regulate target metabolic pathways globally and dynamically, metabolic regulation systems composed of transcription factors have been widely used in metabolic engineering and synthetic biology. This review introduced the categories, action principles, prediction strategies, and related databases of transcription factors. Then, the application of global transcription machinery engineering technology and the transcription factor-based biosensors and quorum sensing systems are overviewed. In addition, strategies for optimizing the transcriptional regulatory tools' performance by refactoring transcription factors are summarized. Finally, the current limitations and prospects of constructing various regulatory tools based on transcription factors are discussed. This review will provide theoretical guidance for the rational design and construction of transcription factor-based metabolic regulation systems.
Collapse
|
11
|
Wei J, Chen S, Zong L, Gao X, Li Y. Protein-RNA interaction prediction with deep learning: structure matters. Brief Bioinform 2022; 23:bbab540. [PMID: 34929730 PMCID: PMC8790951 DOI: 10.1093/bib/bbab540] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2021] [Revised: 11/14/2021] [Accepted: 11/22/2021] [Indexed: 12/11/2022] Open
Abstract
Protein-RNA interactions are of vital importance to a variety of cellular activities. Both experimental and computational techniques have been developed to study the interactions. Because of the limitation of the previous database, especially the lack of protein structure data, most of the existing computational methods rely heavily on the sequence data, with only a small portion of the methods utilizing the structural information. Recently, AlphaFold has revolutionized the entire protein and biology field. Foreseeably, the protein-RNA interaction prediction will also be promoted significantly in the upcoming years. In this work, we give a thorough review of this field, surveying both the binding site and binding preference prediction problems and covering the commonly used datasets, features and models. We also point out the potential challenges and opportunities in this field. This survey summarizes the development of the RNA-binding protein-RNA interaction field in the past and foresees its future development in the post-AlphaFold era.
Collapse
Affiliation(s)
- Junkang Wei
- Department of Computer Science and Engineering (CSE), The Chinese
University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
| | - Siyuan Chen
- Computational Bioscience Research Center (CBRC),
King Abdullah University of Science and Technology (KAUST),
23955-6900, Thuwal, Saudi Arabia
| | - Licheng Zong
- Department of Computer Science and Engineering (CSE), The Chinese
University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC),
King Abdullah University of Science and Technology (KAUST),
23955-6900, Thuwal, Saudi Arabia
| | - Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese
University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
- The CUHK Shenzhen Research Institute, Hi-Tech Park, 518057,
Shenzhen, China
| |
Collapse
|
12
|
Zhang S, Zhao L, Zheng CH, Xia J. A feature-based approach to predict hot spots in protein-DNA binding interfaces. Brief Bioinform 2021; 21:1038-1046. [PMID: 30957840 DOI: 10.1093/bib/bbz037] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2019] [Revised: 02/20/2019] [Accepted: 03/07/2019] [Indexed: 12/21/2022] Open
Abstract
DNA-binding hot spot residues of proteins are dominant and fundamental interface residues that contribute most of the binding free energy of protein-DNA interfaces. As experimental methods for identifying hot spots are expensive and time consuming, computational approaches are urgently required in predicting hot spots on a large scale. In this work, we systematically assessed a wide variety of 114 features from a combination of the protein sequence, structure, network and solvent accessible information and their combinations along with various feature selection strategies for hot spot prediction. We then trained and compared four commonly used machine learning models, namely, support vector machine (SVM), random forest, Naïve Bayes and k-nearest neighbor, for the identification of hot spots using 10-fold cross-validation and the independent test set. Our results show that (1) features based on the solvent accessible surface area have significant effect on hot spot prediction; (2) different but complementary features generally enhance the prediction performance; and (3) SVM outperforms other machine learning methods on both training and independent test sets. In an effort to improve predictive performance, we developed a feature-based method, namely, PrPDH (Prediction of Protein-DNA binding Hot spots), for the prediction of hot spots in protein-DNA binding interfaces using SVM based on the selected 10 optimal features. Comparative results on benchmark data sets indicate that our predictor is able to achieve generally better performance in predicting hot spots compared to the state-of-the-art predictors. A user-friendly web server for PrPDH is well established and is freely available at http://bioinfo.ahu.edu.cn:8080/PrPDH.
Collapse
Affiliation(s)
- Sijia Zhang
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Le Zhao
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Chun-Hou Zheng
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, School of Computer Science and Technology, Anhui University, Hefei, Anhui, China
| |
Collapse
|
13
|
Zhang Y, Ni J, Gao Y. RF-SVM: Identification of DNA-binding proteins based on comprehensive feature representation methods and support vector machine. Proteins 2021; 90:395-404. [PMID: 34455627 DOI: 10.1002/prot.26229] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2021] [Revised: 08/10/2021] [Accepted: 08/24/2021] [Indexed: 01/07/2023]
Abstract
Protein-DNA interactions play an important role in biological progress, such as DNA replication, repair, and modification processes. In order to have a better understanding of its functions, the one of the most important steps is the identification of DNA-binding proteins. We propose a DNA-binding protein predictor, namely, RF-SVM, which contains four types features, that is, pseudo amino acid composition (PseAAC), amino acid distribution (AAD), adjacent amino acid composition frequency (ACF) and Local-DPP. Random Forest algorithm is utilized for selecting top 174 features, which are established the predictor model with the support vector machine (SVM) on training dataset UniSwiss-Tr. Finally, RF-SVM method is compared with other existing methods on test dataset UniSwiss-Tst. The experimental results demonstrated that RF-SVM has accuracy of 84.25%. Meanwhile, we discover that the physicochemical properties of amino acids for OOBM770101(H), CIDH920104(H), MIYS990104(H), NISK860101(H), VINM940103(H), and SNEP660101(A) have contribution to predict DNA-binding proteins. The main code and datasets can gain in https://github.com/NiJianWei996/RF-SVM.
Collapse
Affiliation(s)
- Yanping Zhang
- Department of Mathematics, School of Science, Hebei University of Engineering, Handan, China
| | - Jianwei Ni
- Department of Mathematics, School of Science, Hebei University of Engineering, Handan, China
| | - Ya Gao
- Department of Mathematics, School of Science, Hebei University of Engineering, Handan, China
| |
Collapse
|
14
|
Zhang J, Chen Q, Liu B. DeepDRBP-2L: A New Genome Annotation Predictor for Identifying DNA-Binding Proteins and RNA-Binding Proteins Using Convolutional Neural Network and Long Short-Term Memory. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1451-1463. [PMID: 31722485 DOI: 10.1109/tcbb.2019.2952338] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) are two kinds of crucial proteins, which are associated with various cellule activities and some important diseases. Accurate identification of DBPs and RBPs facilitate both theoretical research and real world application. Existing sequence-based DBP predictors can accurately identify DBPs but incorrectly predict many RBPs as DBPs, and vice versa, resulting in low prediction precision. Moreover, some proteins (DRBPs) interacting with both DNA and RNA play important roles in gene expression and cannot be identified by existing computational methods. In this study, a two-level predictor named DeepDRBP-2L was proposed by combining Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM). It is the first computational method that is able to identify DBPs, RBPs and DRBPs. Rigorous cross-validations and independent tests showed that DeepDRBP-2L is able to overcome the shortcoming of the existing methods and can go one further step to identify DRBPs. Application of DeepDRBP-2L to tomato genome further demonstrated its performance. The webserver of DeepDRBP-2L is freely available at http://bliulab.net/DeepDRBP-2L.
Collapse
|
15
|
UMAP-DBP: An Improved DNA-Binding Proteins Prediction Method Based on Uniform Manifold Approximation and Projection. Protein J 2021; 40:562-575. [PMID: 34176069 DOI: 10.1007/s10930-021-10011-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/10/2021] [Indexed: 10/21/2022]
Abstract
DNA-binding proteins play a vital role in cellular processes. It is an extremely urgent to develop a high-throughput method for efficiently identifying DNA-binding proteins. According to the current research situation, some methods in machine learning and deep learning show excellent computational speed and accuracy, which are worthy of application. In this work, a novel predictor was proposed to predict DNA binding proteins called UMAP-DBP. Firstly, the feature extraction of primary protein sequence was realized based on physicochemical distance transformation, Profile-based auto-cross covariance and General series correlation pseudo amino acid composition. Secondly, uniform manifold approximation and projection (UMAP) and feature importance score methods were used for feature selection; there is a progressive relationship between them. Finally, the Adaboost operation engine with jackknife test were adopted for predicting DNA-binding proteins. For the jackknife test on the BP1075 and BP594, we obtained an overall accuracy of 82.97% and 82.14%, Cohen's kappa (CK) of 0.66 and 0.64, respectively. The results illustrate that a feasible method has been developed for predicting DNA-binding proteins by UMAP and Adaboost. This is the first study in which UMAP has been successfully applied to identify DNA-binding proteins. All the datasets and codes are accessible at https://github.com/Wang-Jinyue/UMAP-DBP .
Collapse
|
16
|
Gao M, Skolnick J. A General Framework to Learn Tertiary Structure for Protein Sequence Characterization. FRONTIERS IN BIOINFORMATICS 2021; 1. [PMID: 34308415 PMCID: PMC8301223 DOI: 10.3389/fbinf.2021.689960] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
During the past five years, deep-learning algorithms have enabled ground-breaking progress towards the prediction of tertiary structure from a protein sequence. Very recently, we developed SAdLSA, a new computational algorithm for protein sequence comparison via deep-learning of protein structural alignments. SAdLSA shows significant improvement over established sequence alignment methods. In this contribution, we show that SAdLSA provides a general machine-learning framework for structurally characterizing protein sequences. By aligning a protein sequence against itself, SAdLSA generates a fold distogram for the input sequence, including challenging cases whose structural folds were not present in the training set. About 70% of the predicted distograms are statistically significant. Although at present the accuracy of the intra-sequence distogram predicted by SAdLSA self-alignment is not as good as deep-learning algorithms specifically trained for distogram prediction, it is remarkable that the prediction of single protein structures is encoded by an algorithm that learns ensembles of pairwise structural comparisons, without being explicitly trained to recognize individual structural folds. As such, SAdLSA can not only predict protein folds for individual sequences, but also detects subtle, yet significant, structural relationships between multiple protein sequences using the same deep-learning neural network. The former reduces to a special case in this general framework for protein sequence annotation.
Collapse
Affiliation(s)
- Mu Gao
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, United States
| | - Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, United States
| |
Collapse
|
17
|
Yang L, Li X, Shu T, Wang P, Li X. PseKNC and Adaboost-Based Method for DNA-Binding Proteins Recognition. INT J PATTERN RECOGN 2021. [DOI: 10.1142/s0218001421500221] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
DNA-binding proteins are an essential part of the DNA. It also an integral component during life processes of various organisms, for instance, DNA recombination, replication, and so on. Recognition of such proteins helps medical researchers pinpoint the cause of disease. Traditional techniques of identifying DNA-binding proteins are expensive and time-consuming. Machine learning methods can identify these proteins quickly and efficiently. However, the accuracies of the existing related methods were not high enough. In this paper, we propose a framework to identify DNA-binding proteins. The proposed framework first uses PseKNC (ps), MomoKGap (mo), and MomoDiKGap (md) methods to combine three algorithms to extract features. Further, we apply Adaboost weight ranking to select optimal feature subsets from the above three types of features. Based on the selected features, three algorithms (k-nearest neighbor (knn), Support Vector Machine (SVM), and Random Forest (RF)) are applied to classify it. Finally, three predictors for identifying DNA-binding proteins are established, including [Formula: see text], [Formula: see text], [Formula: see text]. We utilize benchmark and independent datasets to train and evaluate the proposed framework. Three tests are performed, including Jackknife test, 10-fold cross-validation and independent test. Among them, the accuracy of ps+md is the highest. We named the model with the best result as psmdDBPs and applied it to identify DNA-binding proteins.
Collapse
Affiliation(s)
- Lina Yang
- School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
| | - Xiangyu Li
- School of Computer, Electronics and Information, Guangxi University, Nanning, P. R. China
| | - Ting Shu
- Guangdong-Hongkong-Macao Greater Bay Area, Weather Research Center for Monitoring Warning and Forecasting, (Shenzhen Institute of Meteorological Innovation), Shenzhen, P. R. China
| | - Patrick Wang
- Computer and Information Science, Northeastern University, Boston, USA
| | - Xichun Li
- Guangxi Normal University for Nationalities, Chongzuo, P. R. China
| |
Collapse
|
18
|
Zhang S, Zhu F, Yu Q, Zhu X. Identifying DNA-binding proteins based on multi-features and LASSO feature selection. Biopolymers 2021; 112:e23419. [PMID: 33476047 DOI: 10.1002/bip.23419] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2021] [Revised: 01/08/2021] [Accepted: 01/08/2021] [Indexed: 01/22/2023]
Abstract
DNA-binding proteins perform an indispensable function in the maintenance and processing of genetic information and are inefficiently identified by traditional experimental methods due to their huge quantities. On the contrary, machine learning methods as an emerging technique demonstrate satisfactory speed and accuracy when used to study these molecules. This work focuses on extracting four different features from primary and secondary sequence features: Reduced sequence and index-vectors (RS), Pseudo-amino acid components (PseAACS), Position-specific scoring matrix-Auto Cross Covariance Transform (PSSM-ACCT), and Position-specific scoring matrix-Discrete Wavelet Transform (PSSM-DWT). Using the LASSO dimension reduction method, we experiment on the combination of feature submodels to obtain the optimized number of top rank features. These features are respectively input into the training Ensemble subspace discriminant, Ensemble bagged tree and KNN to predict the DNA-binding proteins. Three different datasets, PDB594, PDB1075, and PDB186, are adopted to evaluate the performance of the as-proposed approach in this work. The PDB1075 and PDB594 datasets are adopted for the five-fold cross-validation, and the PDB186 is used for the independent experiment. In the five-fold cross-validation, both the PDB1075 and PDB594 show extremely high accuracy, reaching 86.98% and 88.9% by Ensemble subspace discriminant, respectively. The accuracy of independent experiment by multi-classifiers voting is 83.33%, which suggests that the methodology proposed in this work is capable of predicting DNA-binding proteins effectively.
Collapse
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, China
| | - Fu Zhu
- School of Mathematics and Statistics, Xidian University, Xi'an, China
| | - Qianhao Yu
- School of Artificial Intelligence, Xidian University, Xi'an, China
| | - Xiaoyue Zhu
- School of Electronic Engineering, Xidian University, Xi'an, China
| |
Collapse
|
19
|
Zhang Y, Chen P, Gao Y, Ni J, Wang X. DBP-PSSM: Combination of Evolutionary Profiles with the XGBoost Algorithm to Improve the Identification of DNA-binding Proteins. Comb Chem High Throughput Screen 2020; 25:3-12. [PMID: 33238837 DOI: 10.2174/1386207323999201124203531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2020] [Revised: 10/16/2020] [Accepted: 10/29/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND AND OBJECTIVE DNA-binding proteins play important roles in a variety of biological processes, such as gene transcription and regulation, DNA replication and repair, DNA recombination and packaging, and the formation of chromatin and ribosomes. Therefore, it is urgent to develop a computational method to improve the recognition efficiency of DNA-binding proteins. METHODS We proposed a novel method, DBP-PSSM, which constructed the features from amino acid composition and evolutionary information of protein sequences. The maximum relevance, minimum redundancy (mRMR) was employed to select the optimal features for establishing the XGBoost classifier, therefore, the novel model of prediction DNA-binding proteins, DBP-PSSM, was established with 5-fold cross-validation on the training dataset. RESULTS DBP-PSSM achieved an accuracy of 81.18% and MCC of 0.657 in a test dataset, which outperformed the many existing methods. These results demonstrated that our method can effectively predict DNA-binding proteins. CONCLUSION The data and source code are provided at https://github.com/784221489/DNA-binding.
Collapse
Affiliation(s)
- Yanping Zhang
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Pengcheng Chen
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Ya Gao
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Jianwei Ni
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| | - Xiaosheng Wang
- School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038,China
| |
Collapse
|
20
|
Predicting Hot Spot Residues at Protein-DNA Binding Interfaces Based on Sequence Information. Interdiscip Sci 2020; 13:1-11. [PMID: 33068261 DOI: 10.1007/s12539-020-00399-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2020] [Revised: 09/27/2020] [Accepted: 10/01/2020] [Indexed: 10/23/2022]
Abstract
Hot spot residues at protein-DNA binding interfaces are hugely important for investigating the underlying mechanism of molecular recognition. Currently, there are a few tools available for identifying the hot spot residues in the protein-DNA complexes. In addition, the three-dimensional protein structures are needed in these tools. However, it is well known that the three-dimensional structures are unavailable for most proteins. Considering the limitation, we proposed a method, named SPDH, for predicting hot spot residues only based on protein sequences. Firstly, we obtained 133 features from physicochemical property, conservation, predicted solvent accessible surface area and structure. Then, we systematically assessed these features based on various feature selection methods to obtain the optimal feature subset and compared the models using four classical machine learning algorithms (support vector machine, random forest, logistic regression, and k-nearest neighbor) on the training dataset. We found that the variability of physicochemical property features between wild and mutative types was important on improving the performance of the prediction model. On the independent test set, our method achieved the performance with AUC of 0.760 and sensitivity of 0.808, and outperformed other methods. The data and source code can be downloaded at https://github.com/xialab-ahu/SPDH .
Collapse
|
21
|
Chauhan S, Ahmad S. Enabling full‐length evolutionary profiles based deep convolutional neural network for predicting DNA‐binding proteins from sequence. Proteins 2019; 88:15-30. [DOI: 10.1002/prot.25763] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2019] [Revised: 06/01/2019] [Accepted: 06/15/2019] [Indexed: 12/22/2022]
Affiliation(s)
- Sucheta Chauhan
- School of Computational and Integrative SciencesJawaharlal Nehru University New Delhi India
| | - Shandar Ahmad
- School of Computational and Integrative SciencesJawaharlal Nehru University New Delhi India
| |
Collapse
|
22
|
Ali F, Ahmed S, Swati ZNK, Akbar S. DP-BINDER: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information. J Comput Aided Mol Des 2019; 33:645-658. [PMID: 31123959 DOI: 10.1007/s10822-019-00207-x] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2019] [Accepted: 05/18/2019] [Indexed: 12/28/2022]
Abstract
DNA-binding proteins (DBPs) participate in various biological processes including DNA replication, recombination, and repair. In the human genome, about 6-7% of these proteins are utilized for genes encoding. DBPs shape the DNA into a compact structure known chromatin while some of these proteins regulate the chromosome packaging and transcription process. In the pharmaceutical industry, DBPs are used as a key component of antibiotics, steroids, and cancer drugs. These proteins also involve in biophysical, biological, and biochemical studies of DNA. Due to the crucial role in various biological activities, identification of DBPs is a hot issue in protein science. A series of experimental and computational methods have been proposed, however, some methods didn't achieve the desired results while some are inadequate in its accuracy and authenticity. Still, it is highly desired to present more intelligent computational predictors. In this work, we introduce an innovative computational method namely DP-BINDER based on physicochemical and evolutionary information. We captured local highly decisive features from physicochemical properties of primary protein sequences via normalized Moreau-Broto autocorrelation (NMBAC) and evolutionary information by position specific scoring matrix-transition probability composition (PSSM-TPC) and pseudo-position specific scoring matrix (PsePSSM) using training and independent datasets. The optimal features were selected by the support vector machine-recursive feature elimination and correlation bias reduction (SVM-RFE + CBR) from fused features and were fed into random forest (RF) and support vector machine (SVM). Our method attained 92.46% and 89.58% accuracy with jackknife and ten-fold cross-validation, respectively on the training dataset, while 81.17% accuracy on the independent dataset for prediction of DBPs. These results demonstrate that our method attained the highest success rate in the literature. The superiority of DP-BINDER over existing approaches due to several reasons including abstraction of local dominant features via effective feature descriptors, utilization of appropriate feature selection algorithms and effective classifier.
Collapse
Affiliation(s)
- Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Saeed Ahmed
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Zar Nawab Khan Swati
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Department of Computer Science, Karakoram International University, Gilgit, Gilgit-Baltistan, 15100, Pakistan
| | - Shahid Akbar
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
| |
Collapse
|
23
|
Zhu YH, Hu J, Song XN, Yu DJ. DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines. J Chem Inf Model 2019; 59:3057-3071. [PMID: 30943723 DOI: 10.1021/acs.jcim.8b00749] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Abstract
Accurate identification of protein-DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Mathew's correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based protein-DNA binding site predictors.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering , Nanjing University of Science and Technology , Xiaolingwei 200 , Nanjing 210094 , P. R. China
| | - Jun Hu
- College of Information Engineering , Zhejiang University of Technology , Hangzhou 310023 , P. R. China
| | - Xiao-Ning Song
- School of Internet of Things , Jiangnan University , 1800 Lihu Road , Wuxi 214122 , P. R. China
| | - Dong-Jun Yu
- School of Computer Science and Engineering , Nanjing University of Science and Technology , Xiaolingwei 200 , Nanjing 210094 , P. R. China
| |
Collapse
|
24
|
Emamjomeh A, Choobineh D, Hajieghrari B, MahdiNezhad N, Khodavirdipour A. DNA-protein interaction: identification, prediction and data analysis. Mol Biol Rep 2019; 46:3571-3596. [PMID: 30915687 DOI: 10.1007/s11033-019-04763-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2018] [Accepted: 03/14/2019] [Indexed: 12/30/2022]
Abstract
Life in living organisms is dependent on specific and purposeful interaction between other molecules. Such purposeful interactions make the various processes inside the cells and the bodies of living organisms possible. DNA-protein interactions, among all the types of interactions between different molecules, are of considerable importance. Currently, with the development of numerous experimental techniques, diverse methods are convenient for recognition and investigating such interactions. While the traditional experimental techniques to identify DNA-protein complexes are time-consuming and are unsuitable for genome-scale studies, the current high throughput approaches are more efficient in determining such interaction at a large-scale, but they are clearly too costly to be practice for daily applications. Hence, according to the availability of much information related to different biological sequences and clearing different dimensions of conditions in which such interactions are formed, with the developments related to the computer, mathematics, and statistics motivate scientists to develop bioinformatics tools for prediction the interaction site(s). Until now, there has been much progress in this field. In this review, the factors and conditions governing the interaction and the laboratory techniques for examining such interactions are addressed. In addition, developed bioinformatics tools are introduced and compared for this reason and, in the end, several suggestions are offered for the promotion of such tools in prediction with much more precision.
Collapse
Affiliation(s)
- Abbasali Emamjomeh
- Laboratory of Computational Biotechnology and Bioinformatics (CBB), Department of Plant Breeding and Biotechnology (PBB), University of Zabol, Zabol, 98615-538, Iran.
| | - Darush Choobineh
- Agricultural Biotechnology, Department of Plant Breeding and Biotechnology (PBB), Faculty of Agriculture, University of Zabol, Zabol, Iran
| | - Behzad Hajieghrari
- Department of Agricultural Biotechnology, College of Agriculture, Jahrom University, Jahrom, 74135-111, Iran.
| | - Nafiseh MahdiNezhad
- Laboratory of Computational Biotechnology and Bioinformatics (CBB), Department of Plant Breeding and Biotechnology (PBB), University of Zabol, Zabol, 98615-538, Iran
| | - Amir Khodavirdipour
- Division of Human Genetics, Department of Anatomy, St. John's hospital, Bangalore, India
| |
Collapse
|
25
|
Litfin T, Yang Y, Zhou Y. SPOT-Peptide: Template-Based Prediction of Peptide-Binding Proteins and Peptide-Binding Sites. J Chem Inf Model 2019; 59:924-930. [DOI: 10.1021/acs.jcim.8b00777] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Affiliation(s)
- Thomas Litfin
- School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
| | - Yuedong Yang
- School of Data and Computer Science, Sun-Yat Sen University, Guangzhou, Guangdong 510006, China
| | - Yaoqi Zhou
- School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
- Institute for Glycomics, Griffith University, Southport, QLD 4222, Australia
| |
Collapse
|
26
|
Poursheikhali Asghari M, Abdolmaleki P. Prediction of RNA- and DNA-Binding Proteins Using Various Machine Learning Classifiers. Avicenna J Med Biotechnol 2019; 11:104-111. [PMID: 30800250 PMCID: PMC6359699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Nucleic acid-binding proteins play major roles in different biological processes, such as transcription, splicing and translation. Therefore, the nucleic acid-binding function prediction of proteins is a step toward full functional annotation of proteins. The aim of our research was the improvement of nucleic-acid binding function prediction. METHODS In the current study, nine machine-learning algorithms were used to predict RNA- and DNA-binding proteins and also to discriminate between RNA-binding proteins and DNA-binding proteins. The electrostatic features were utilized for prediction of each function in corresponding adapted protein datasets. The leave-one-out cross-validation process was used to measure the performance of employed classifiers. RESULTS Radial basis function classifier gave the best results in predicting RNA- and DNA-binding proteins in comparison with other classifiers applied. In discriminating between RNA- and DNA-binding proteins, multilayer perceptron classifier was the best one. CONCLUSION Our findings show that the prediction of nucleic acid-binding function based on these simple electrostatic features can be improved by applied classifiers. Moreover, a reasonable progress to distinguish between RNA- and DNA-binding proteins has been achieved.
Collapse
Affiliation(s)
| | - Parviz Abdolmaleki
- Corresponding author: Parviz Abdolmaleki, Ph.D., Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran, Tel: +98 21 82883404, Fax: +98 21 82884457, E-mail:,
| |
Collapse
|
27
|
Set of approaches based on 3D structure and position specific-scoring matrix for predicting DNA-binding proteins. Bioinformatics 2018; 35:1844-1851. [DOI: 10.1093/bioinformatics/bty912] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2018] [Revised: 10/08/2018] [Accepted: 10/31/2018] [Indexed: 11/14/2022] Open
|
28
|
DPP-PseAAC: A DNA-binding protein prediction model using Chou's general PseAAC. J Theor Biol 2018; 452:22-34. [PMID: 29753757 DOI: 10.1016/j.jtbi.2018.05.006] [Citation(s) in RCA: 96] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2018] [Revised: 04/21/2018] [Accepted: 05/04/2018] [Indexed: 11/21/2022]
Abstract
A DNA-binding protein (DNA-BP) is a protein that can bind and interact with a DNA. Identification of DNA-BPs using experimental methods is expensive as well as time consuming. As such, fast and accurate computational methods are sought for predicting whether a protein can bind with a DNA or not. In this paper, we focus on building a new computational model to identify DNA-BPs in an efficient and accurate way. Our model extracts meaningful information directly from the protein sequences, without any dependence on functional domain or structural information. After feature extraction, we have employed Random Forest (RF) model to rank the features. Afterwards, we have used Recursive Feature Elimination (RFE) method to extract an optimal set of features and trained a prediction model using Support Vector Machine (SVM) with linear kernel. Our proposed method, named as DNA-binding Protein Prediction model using Chou's general PseAAC (DPP-PseAAC), demonstrates superior performance compared to the state-of-the-art predictors on standard benchmark dataset. DPP-PseAAC achieves accuracy values of 93.21%, 95.91% and 77.42% for 10-fold cross-validation test, jackknife test and independent test respectively. The source code of DPP-PseAAC, along with relevant dataset and detailed experimental results, can be found at https://github.com/srautonu/DNABinding. A publicly accessible web interface has also been established at: http://77.68.43.135:8080/DPP-PseAAC/.
Collapse
|
29
|
Abstract
The increasing number of protein structures with uncharacterized function necessitates the development of in silico prediction methods for functional annotations on proteins. In this chapter, different kinds of computational approaches are briefly introduced to predict DNA-binding residues on surface of DNA-binding proteins, and the merits and limitations of these methods are mainly discussed. This chapter focuses on the structure-based approaches and mainly discusses the framework of machine learning methods in application to DNA-binding prediction task.
Collapse
|
30
|
iDNAProt-ES: Identification of DNA-binding Proteins Using Evolutionary and Structural Features. Sci Rep 2017; 7:14938. [PMID: 29097781 PMCID: PMC5668250 DOI: 10.1038/s41598-017-14945-1] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2017] [Accepted: 10/19/2017] [Indexed: 11/12/2022] Open
Abstract
DNA-binding proteins play a very important role in the structural composition of the DNA. In addition, they regulate and effect various cellular processes like transcription, DNA replication, DNA recombination, repair and modification. The experimental methods used to identify DNA-binding proteins are expensive and time consuming and thus attracted researchers from computational field to address the problem. In this paper, we present iDNAProt-ES, a DNA-binding protein prediction method that utilizes both sequence based evolutionary and structure based features of proteins to identify their DNA-binding functionality. We used recursive feature elimination to extract an optimal set of features and train them using Support Vector Machine (SVM) with linear kernel to select the final model. Our proposed method significantly outperforms the existing state-of-the-art predictors on standard benchmark dataset. The accuracy of the predictor is 90.18% using jack knife test and 88.87% using 10-fold cross validation on the benchmark dataset. The accuracy of the predictor on the independent dataset is 80.64% which is also significantly better than the state-of-the-art methods. iDNAProt-ES is a novel prediction method that uses evolutionary and structural based features. We believe the superior performance of iDNAProt-ES will motivate the researchers to use this method to identify DNA-binding proteins. iDNAProt-ES is publicly available as a web server at: http://brl.uiu.ac.bd/iDNAProt-ES/.
Collapse
|
31
|
Improved detection of DNA-binding proteins via compression technology on PSSM information. PLoS One 2017; 12:e0185587. [PMID: 28961273 PMCID: PMC5621689 DOI: 10.1371/journal.pone.0185587] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2017] [Accepted: 09/17/2017] [Indexed: 12/04/2022] Open
Abstract
Since the importance of DNA-binding proteins in multiple biomolecular functions has been recognized, an increasing number of researchers are attempting to identify DNA-binding proteins. In recent years, the machine learning methods have become more and more compelling in the case of protein sequence data soaring, because of their favorable speed and accuracy. In this paper, we extract three features from the protein sequence, namely NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-specific scoring matrix—Discrete Wavelet Transform), and PSSM-DCT (Position-specific scoring matrix—Discrete Cosine Transform). We also employ feature selection algorithm on these feature vectors. Then, these features are fed into the training SVM (support vector machine) model as classifier to predict DNA-binding proteins. Our method applys three datasets, namely PDB1075, PDB594 and PDB186, to evaluate the performance of our approach. The PDB1075 and PDB594 datasets are employed for Jackknife test and the PDB186 dataset is used for the independent test. Our method achieves the best accuracy in the Jacknife test, from 79.20% to 86.23% and 80.5% to 86.20% on PDB1075 and PDB594 datasets, respectively. In the independent test, the accuracy of our method comes to 76.3%. The performance of independent test also shows that our method has a certain ability to be effectively used for DNA-binding protein prediction. The data and source code are at https://doi.org/10.6084/m9.figshare.5104084.
Collapse
|
32
|
Wei L, Tang J, Zou Q. Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2016.06.026] [Citation(s) in RCA: 196] [Impact Index Per Article: 28.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
33
|
Zhang J, Gao B, Chai H, Ma Z, Yang G. Identification of DNA-binding proteins using multi-features fusion and binary firefly optimization algorithm. BMC Bioinformatics 2016; 17:323. [PMID: 27565741 PMCID: PMC5002159 DOI: 10.1186/s12859-016-1201-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2016] [Accepted: 08/24/2016] [Indexed: 11/13/2022] Open
Abstract
Background DNA-binding proteins (DBPs) play fundamental roles in many biological processes. Therefore, the developing of effective computational tools for identifying DBPs is becoming highly desirable. Results In this study, we proposed an accurate method for the prediction of DBPs. Firstly, we focused on the challenge of improving DBP prediction accuracy with information solely from the sequence. Secondly, we used multiple informative features to encode the protein. These features included evolutionary conservation profile, secondary structure motifs, and physicochemical properties. Thirdly, we introduced a novel improved Binary Firefly Algorithm (BFA) to remove redundant or noisy features as well as select optimal parameters for the classifier. The experimental results of our predictor on two benchmark datasets outperformed many state-of-the-art predictors, which revealed the effectiveness of our method. The promising prediction performance on a new-compiled independent testing dataset from PDB and a large-scale dataset from UniProt proved the good generalization ability of our method. In addition, the BFA forged in this research would be of great potential in practical applications in optimization fields, especially in feature selection problems. Conclusions A highly accurate method was proposed for the identification of DBPs. A user-friendly web-server named iDbP (identification of DNA-binding Proteins) was constructed and provided for academic use. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1201-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, People's Republic of China
| | - Bo Gao
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, People's Republic of China
| | - Haiting Chai
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, People's Republic of China
| | - Zhiqiang Ma
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, People's Republic of China
| | - Guifu Yang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, People's Republic of China. .,Office of Informatization Management and Planning, Northeast Normal University, Changchun, 130117, People's Republic of China.
| |
Collapse
|
34
|
Kawabata T. HOMCOS: an updated server to search and model complex 3D structures. ACTA ACUST UNITED AC 2016; 17:83-99. [PMID: 27522608 PMCID: PMC5274653 DOI: 10.1007/s10969-016-9208-y] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2015] [Accepted: 08/05/2016] [Indexed: 01/18/2023]
Abstract
The HOMCOS server (http://homcos.pdbj.org) was updated for both searching and modeling the 3D complexes for all molecules in the PDB. As compared to the previous HOMCOS server, the current server targets all of the molecules in the PDB including proteins, nucleic acids, small compounds and metal ions. Their binding relationships are stored in the database. Five services are available for users. For the services “Modeling a Homo Protein Multimer” and “Modeling a Hetero Protein Multimer”, a user can input one or two proteins as the queries, while for the service “Protein-Compound Complex”, a user can input one chemical compound and one protein. The server searches similar molecules by BLAST and KCOMBU. Based on each similar complex found, a simple sequence-replaced model is quickly generated by replacing the residue names and numbers with those of the query protein. A target compound is flexibly superimposed onto the template compound using the program fkcombu. If monomeric 3D structures are input as the query, then template-based docking can be performed. For the service “Searching Contact Molecules for a Query Protein”, a user inputs one protein sequence as the query, and then the server searches for its homologous proteins in PDB and summarizes their contacting molecules as the predicted contacting molecules. The results are summarized in “Summary Bars” or “Site Table”display. The latter shows the results as a one-site-one-row table, which is useful for annotating the effects of mutations. The service “Searching Contact Molecules for a Query Compound” is also available.
Collapse
Affiliation(s)
- Takeshi Kawabata
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan.
| |
Collapse
|
35
|
gDNA-Prot: Predict DNA-binding proteins by employing support vector machine and a novel numerical characterization of protein sequence. J Theor Biol 2016; 406:8-16. [PMID: 27378005 DOI: 10.1016/j.jtbi.2016.06.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2016] [Revised: 05/19/2016] [Accepted: 06/01/2016] [Indexed: 11/24/2022]
Abstract
DNA-binding proteins are the functional proteins in cells, which play an important role in various essential biological activities. An effective and fast computational method gDNA-Prot is proposed to predict DNA-binding proteins in this paper, which is a DNA-binding predictor that combines the support vector machine classifier and a novel kind of feature called graphical representation. The DNA-binding protein sequence information was described with the 20 probabilities of amino acids and the 23 new numerical graphical representation features of a protein sequence, based on 23 physicochemical properties of 20 amino acids. The Principal Components Analysis (PCA) was employed as feature selection method for removing the irrelevant features and reducing redundant features. The Sigmod function and Min-max normalization methods for PCA were applied to accelerate the training speed and obtain higher accuracy. Experiments demonstrated that the Principal Components Analysis with Sigmod function generated the best performance. The gDNA-Prot method was also compared with the DNAbinder, iDNA-Prot and DNA-Prot. The results suggested that gDNA-Prot outperformed the DNAbinder and iDNA-Prot. Although the DNA-Prot outperformed gDNA-Prot, gDNA-Prot was faster and convenient to predict the DNA-binding proteins. Additionally, the proposed gNDA-Prot method is available at http://sourceforge.net/projects/gdnaprot.
Collapse
|
36
|
Paz I, Kligun E, Bengad B, Mandel-Gutfreund Y. BindUP: a web server for non-homology-based prediction of DNA and RNA binding proteins. Nucleic Acids Res 2016; 44:W568-74. [PMID: 27198220 PMCID: PMC4987955 DOI: 10.1093/nar/gkw454] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2016] [Accepted: 05/11/2016] [Indexed: 12/12/2022] Open
Abstract
Gene expression is a multi-step process involving many layers of regulation. The main regulators of the pathway are DNA and RNA binding proteins. While over the years, a large number of DNA and RNA binding proteins have been identified and extensively studied, it is still expected that many other proteins, some with yet another known function, are awaiting to be discovered. Here we present a new web server, BindUP, freely accessible through the website http://bindup.technion.ac.il/, for predicting DNA and RNA binding proteins using a non-homology-based approach. Our method is based on the electrostatic features of the protein surface and other general properties of the protein. BindUP predicts nucleic acid binding function given the proteins three-dimensional structure or a structural model. Additionally, BindUP provides information on the largest electrostatic surface patches, visualized on the server. The server was tested on several datasets of DNA and RNA binding proteins, including proteins which do not possess DNA or RNA binding domains and have no similarity to known nucleic acid binding proteins, achieving very high accuracy. BindUP is applicable in either single or batch modes and can be applied for testing hundreds of proteins simultaneously in a highly efficient manner.
Collapse
Affiliation(s)
- Inbal Paz
- Department of Biology, Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel
| | - Efrat Kligun
- Department of Biology, Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel
| | - Barak Bengad
- Department of Biology, Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel
| | - Yael Mandel-Gutfreund
- Department of Biology, Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel
| |
Collapse
|
37
|
Liu B, Wang S, Dong Q, Li S, Liu X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Trans Nanobioscience 2016; 15:328-334. [PMID: 28113908 DOI: 10.1109/tnb.2016.2555951] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. With the rapid development of next generation of sequencing technique, the number of protein sequences is unprecedentedly increasing. Thus it is necessary to develop computational methods to identify the DNA-binding proteins only based on the protein sequence information. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) and the auto-cross covariance transformation. The protein sequences are first converted into profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. The sequence order effect can be effectively captured by this scheme. These vectors are then fed into Support Vector Machine (SVM) to discriminate the DNA-binding proteins from the non DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and Matthew correlation coefficient of 0.5 by a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset shows that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA binding protein identification. .
Collapse
|
38
|
Abstract
Native proteins perform an amazing variety of biochemical functions, including enzymatic catalysis, and can engage in protein-protein and protein-DNA interactions that are essential for life. A key question is how special are these functional properties of proteins. Are they extremely rare, or are they an intrinsic feature? Comparison to the properties of compact conformations of artificially generated compact protein structures selected for thermodynamic stability but not any type of function, the artificial (ART) protein library, demonstrates that a remarkable number of the properties of native-like proteins are recapitulated. These include the complete set of small molecule ligand-binding pockets and most protein-protein interfaces. ART structures are predicted to be capable of weakly binding metabolites and cover a significant fraction of metabolic pathways, with the most enriched pathways including ancient ones such as glycolysis. Native-like active sites are also found in ART proteins. A small fraction of ART proteins are predicted to have strong protein-protein and protein-DNA interactions. Overall, it appears that biochemical function is an intrinsic feature of proteins which nature has significantly optimized during evolution. These studies raise questions as to the relative roles of specificity and promiscuity in the biochemical function and control of cells that need investigation.
Collapse
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA, USA
| | - Mu Gao
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA, USA
| | - Hongyi Zhou
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA, USA
| |
Collapse
|
39
|
Lopez MB, Garcia MN, Grasso D, Bintz J, Molejon MI, Velez G, Lomberk G, Neira JL, Urrutia R, Iovanna J. Functional Characterization of Nupr1L, A Novel p53-Regulated Isoform of the High-Mobility Group (HMG)-Related Protumoral Protein Nupr1. J Cell Physiol 2015; 230:2936-50. [PMID: 25899918 DOI: 10.1002/jcp.25022] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2014] [Accepted: 04/15/2015] [Indexed: 12/18/2022]
Abstract
We have previously demonstrated a crucial role of nuclear protein 1 (NUPR1) in tumor development and progression. In this work, we report the functional characterization of a novel Nupr1-like isoform (NUPR1L) and its functional interaction with the protumoral factor NUPR1. Through the use of primary sequence analysis, threading, and homology-based molecular modeling, as well as expression and immunolocalization, studies reveal that NUPR1L displays properties, which are similar to member of the HMG-like family of chromatin regulators, including its ability to translocate to the cell nucleus and bind to DNA. Analysis of the NUPR1L promoter showed the presence of two p53-response elements at positions -37 and -7, respectively. Experiments using reporter assays combined with site-directed mutagenesis and using cells with controllable p53 expression demonstrate that both of these sequences are responsible for the regulation of NUPR1L expression by p53. Congruently, NUPR1L gene expression is activated in response to DNA damage induced by oxaliplatin treatment or cell cycle arrest induced by serum starvation, two well-validated methods to achieve p53 activation. Interestingly, expression of NUPR1L downregulates the expression of NUPR1, its closely related protumoral isoform, by a mechanism that involves the inhibition of its promoter activity. At the cellular level, overexpression of NUPR1L induces G1 cell cycle arrest and a decrease in their cell viability, an effect that is mediated, at least in part, by downregulating NUPR1 expression. Combined, these experiments constitute the first functional characterization of NUPR1L as a new p53-induced gene, which negatively regulates the protumoral factor NUPR1.
Collapse
Affiliation(s)
- Maria Belen Lopez
- Centre de Recherche en Cancérologie de Marseille (CRCM), INSERM U1068, CNRS UMR 7258, Aix-Marseille Université and Institut Paoli-Calmettes, Parc Scientifique et Technologique de Luminy, Marseille, France
| | - Maria Noé Garcia
- Centre de Recherche en Cancérologie de Marseille (CRCM), INSERM U1068, CNRS UMR 7258, Aix-Marseille Université and Institut Paoli-Calmettes, Parc Scientifique et Technologique de Luminy, Marseille, France
| | - Daniel Grasso
- Centre de Recherche en Cancérologie de Marseille (CRCM), INSERM U1068, CNRS UMR 7258, Aix-Marseille Université and Institut Paoli-Calmettes, Parc Scientifique et Technologique de Luminy, Marseille, France
| | - Jennifer Bintz
- Centre de Recherche en Cancérologie de Marseille (CRCM), INSERM U1068, CNRS UMR 7258, Aix-Marseille Université and Institut Paoli-Calmettes, Parc Scientifique et Technologique de Luminy, Marseille, France
| | - Maria Inés Molejon
- Centre de Recherche en Cancérologie de Marseille (CRCM), INSERM U1068, CNRS UMR 7258, Aix-Marseille Université and Institut Paoli-Calmettes, Parc Scientifique et Technologique de Luminy, Marseille, France
| | - Gabriel Velez
- Laboratory of Epigenetics and Chromatin Dynamics, Gastroenterology Research Unit, Departments of Biochemistry and Molecular Biology, Biophysics, and Medicine, Mayo Clinic, Rochester, Minnesota
| | - Gwen Lomberk
- Laboratory of Epigenetics and Chromatin Dynamics, Gastroenterology Research Unit, Departments of Biochemistry and Molecular Biology, Biophysics, and Medicine, Mayo Clinic, Rochester, Minnesota
| | - Jose Luis Neira
- Instituto de Biología Molecular y Celular, Universidad Miguel Hernández, Elche (Alicante), Spain
| | - Raul Urrutia
- Laboratory of Epigenetics and Chromatin Dynamics, Gastroenterology Research Unit, Departments of Biochemistry and Molecular Biology, Biophysics, and Medicine, Mayo Clinic, Rochester, Minnesota
| | - Juan Iovanna
- Centre de Recherche en Cancérologie de Marseille (CRCM), INSERM U1068, CNRS UMR 7258, Aix-Marseille Université and Institut Paoli-Calmettes, Parc Scientifique et Technologique de Luminy, Marseille, France
| |
Collapse
|
40
|
DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci Rep 2015; 5:15479. [PMID: 26482832 PMCID: PMC4611492 DOI: 10.1038/srep15479] [Citation(s) in RCA: 82] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2015] [Accepted: 09/28/2015] [Indexed: 02/01/2023] Open
Abstract
DNA-binding proteins play an important role in most cellular processes. Therefore, it is necessary to develop an efficient predictor for identifying DNA-binding proteins only based on the sequence information of proteins. The bottleneck for constructing a useful predictor is to find suitable features capturing the characteristics of DNA binding proteins. We applied PseAAC to DNA binding protein identification, and PseAAC was further improved by incorporating the evolutionary information by using profile-based protein representation. Finally, Combined with Support Vector Machines (SVMs), a predictor called iDNAPro-PseAAC was proposed. Experimental results on an updated benchmark dataset showed that iDNAPro-PseAAC outperformed some state-of-the-art approaches, and it can achieve stable performance on an independent dataset. By using an ensemble learning approach to incorporate more negative samples (non-DNA binding proteins) in the training process, the performance of iDNAPro-PseAAC was further improved. The web server of iDNAPro-PseAAC is available at http://bioinformatics.hitsz.edu.cn/iDNAPro-PseAAC/.
Collapse
|
41
|
Zhao ZJ, Chen JT, Yuan JY, Yin XX, Song HY, Wang XC. Optimization modeling of single-chain antibody against hepatoma based on similarity algorithm. Int J Clin Exp Med 2015; 8:14827-14836. [PMID: 26628964 PMCID: PMC4658853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2015] [Accepted: 09/07/2015] [Indexed: 06/05/2023]
Abstract
The purposes was to establish optimal modeling of single-chain antibody molecules based on similarity algorithm and seek the connecting peptides that had the minimal effect on the structure and bioactivity of the variable region of heavy chain (VH) and that of light chain (VL) in a single-chain antibody against liver cancer. After the Linker with different lengths (n=0~7) had been added into single chain fragment variable (ScFv), modeling of the overall sequences of VH, VL and ScFv were conducted respectively. Meanwhile, the peptide chain structure of (Gly4Ser)n was adopted for the connecting peptide. Then the spatial spherical shell layer alignment algorithm based on spherical polar coordinates was utilized for comparing the structural similarity of VH and VL before and after adding connecting peptide. Equally, in order to determine the stability of VH and VL, MATLAB was applied for analysis of the fore and aft distances and the diffusion radius. Indirect ELISA method was used to detect single-chain antibody immunological activity of Linker with different lengths. The MTT assay was utilized for the examination of the inhibition rate of single-chain antibody with different lengths of Linker to liver cancer cell. When n=4, the structural similarity between VH together with VL and their original ones was the highest. When n=3, the influence of connecting peptide on the stability of VH and VL was minimum. When n>3, the fore and aft distances changed little due to the increase and fold of the length of peptide chain. The results of ELISA detection showed that when n=4, affinity of single chain antibody to liver cancer cells was much higher. The MTT test also indicated that when n=4, the inhibition rate of the connecting peptide on hepatoma carcinoma cell reached the highest, and that came second when n=3. When n=4, the structural stability and biological functions of anti-hepatoma single-chain antibody were both favorable. This study has provided a basis for the design and construction of single-chain antibody.
Collapse
Affiliation(s)
- Zhi-Jun Zhao
- Department of Oncology, The First Hospital Affiliated to Henan UniversityKaifeng, China
| | - Jing-Tao Chen
- Department of Oncology, The First Hospital Affiliated to Henan UniversityKaifeng, China
| | - Jia-Ying Yuan
- Department of Ultrasound Diagnosis, The Directly Under The Hospital of Henan Military RegionZhengzhou, China
| | - Xiao-Xiang Yin
- Department of Oncology, The First Hospital Affiliated to Henan UniversityKaifeng, China
| | - Hua-Yong Song
- Department of Oncology, The First Hospital Affiliated to Henan UniversityKaifeng, China
| | - Xin-Chun Wang
- Department of Oncology, The First Hospital Affiliated to Henan UniversityKaifeng, China
| |
Collapse
|
42
|
Motion GB, Howden AJM, Huitema E, Jones S. DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool. Nucleic Acids Res 2015; 43:e158. [PMID: 26304539 PMCID: PMC4678848 DOI: 10.1093/nar/gkv805] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2015] [Accepted: 07/28/2015] [Indexed: 11/26/2022] Open
Abstract
There are currently 151 plants with draft genomes available but levels of functional annotation for putative protein products are low. Therefore, accurate computational predictions are essential to annotate genomes in the first instance, and to provide focus for the more costly and time consuming functional assays that follow. DNA-binding proteins are an important class of proteins that require annotation, but current computational methods are not applicable for genome wide predictions in plant species. Here, we explore the use of species and lineage specific models for the prediction of DNA-binding proteins in plants. We show that a species specific support vector machine model based on Arabidopsis sequence data is more accurate (accuracy 81%) than a generic model (74%), and based on this we develop a plant specific model for predicting DNA-binding proteins. We apply this model to the tomato proteome and demonstrate its ability to perform accurate high-throughput prediction of DNA-binding proteins. In doing so, we have annotated 36 currently uncharacterised proteins by assigning a putative DNA-binding function. Our model is publically available and we propose it be used in combination with existing tools to help increase annotation levels of DNA-binding proteins encoded in plant genomes.
Collapse
Affiliation(s)
- Graham B Motion
- Division of Plant Sciences, University of Dundee at the James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| | - Andrew J M Howden
- Division of Plant Sciences, University of Dundee at the James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| | - Edgar Huitema
- Division of Plant Sciences, University of Dundee at the James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| | - Susan Jones
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| |
Collapse
|
43
|
Yan J, Friedrich S, Kurgan L. A comprehensive comparative review of sequence-based predictors of DNA- and RNA-binding residues. Brief Bioinform 2015; 17:88-105. [DOI: 10.1093/bib/bbv023] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2014] [Indexed: 01/07/2023] Open
|
44
|
Bahrami S, Ehsani R, Drabløs F. A property-based analysis of human transcription factors. BMC Res Notes 2015; 8:82. [PMID: 25890365 PMCID: PMC4373352 DOI: 10.1186/s13104-015-1039-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2014] [Accepted: 03/02/2015] [Indexed: 01/17/2023] Open
Abstract
Background Transcription factors are essential proteins for regulating gene expression. This regulation depends upon specific features of the transcription factors, including how they interact with DNA, how they interact with each other, and how they are post-translationally modified. Reliable information about key properties associated with transcription factors will therefore be useful for data analysis, in particular of data from high-throughput experiments. Results We have used an existing list of 1978 human proteins described as transcription factors to make a well-annotated data set, which includes information on Pfam domains, DNA-binding domains, post-translational modifications and protein–protein interactions. We have then used this data set for enrichment analysis. We have investigated correlations within this set of features, and between the features and more general protein properties. We have also used the data set to analyze previously published gene lists associated with cell differentiation, cancer, and tissue distribution. Conclusions The study shows that well-annotated feature list for transcription factors is a useful resource for extensive data analysis; both of transcription factor properties in general and of properties associated with specific processes. However, the study also shows that such analyses are easily biased by incomplete coverage in experimental data, and by how gene sets are defined. Electronic supplementary material The online version of this article (doi:10.1186/s13104-015-1039-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Shahram Bahrami
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, P.O. Box 8905, , NO-7491, Trondheim, Norway. .,St. Olavs Hospital, NO-7006, Trondheim, Norway.
| | - Rezvan Ehsani
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, P.O. Box 8905, , NO-7491, Trondheim, Norway.
| | - Finn Drabløs
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, P.O. Box 8905, , NO-7491, Trondheim, Norway.
| |
Collapse
|
45
|
An overview of the prediction of protein DNA-binding sites. Int J Mol Sci 2015; 16:5194-215. [PMID: 25756377 PMCID: PMC4394471 DOI: 10.3390/ijms16035194] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2014] [Revised: 02/21/2015] [Accepted: 02/27/2015] [Indexed: 02/06/2023] Open
Abstract
Interactions between proteins and DNA play an important role in many essential biological processes such as DNA replication, transcription, splicing, and repair. The identification of amino acid residues involved in DNA-binding sites is critical for understanding the mechanism of these biological activities. In the last decade, numerous computational approaches have been developed to predict protein DNA-binding sites based on protein sequence and/or structural information, which play an important role in complementing experimental strategies. At this time, approaches can be divided into three categories: sequence-based DNA-binding site prediction, structure-based DNA-binding site prediction, and homology modeling and threading. In this article, we review existing research on computational methods to predict protein DNA-binding sites, which includes data sets, various residue sequence/structural features, machine learning methods for comparison and selection, evaluation methods, performance comparison of different tools, and future directions in protein DNA-binding site prediction. In particular, we detail the meta-analysis of protein DNA-binding sites. We also propose specific implications that are likely to result in novel prediction methods, increased performance, or practical applications.
Collapse
|
46
|
Xu R, Zhou J, Wang H, He Y, Wang X, Liu B. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC SYSTEMS BIOLOGY 2015; 9 Suppl 1:S10. [PMID: 25708928 PMCID: PMC4331676 DOI: 10.1186/1752-0509-9-s1-s10] [Citation(s) in RCA: 64] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
BACKGROUND DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation. There have been several computational methods proposed in the literature to deal with the DNA-binding protein identification. However, most of them can't provide an invaluable knowledge base for our understanding of DNA-protein interactions. RESULTS We firstly presented a new protein sequence encoding method called PSSM Distance Transformation, and then constructed a DNA-binding protein identification method (SVM-PSSM-DT) by combining PSSM Distance Transformation with support vector machine (SVM). First, the PSSM profiles are generated by using the PSI-BLAST program to search the non-redundant (NR) database. Next, the PSSM profiles are transformed into uniform numeric representations appropriately by distance transformation scheme. Lastly, the resulting uniform numeric representations are inputted into a SVM classifier for prediction. Thus whether a sequence can bind to DNA or not can be determined. In benchmark test on 525 DNA-binding and 550 non DNA-binding proteins using jackknife validation, the present model achieved an ACC of 79.96%, MCC of 0.622 and AUC of 86.50%. This performance is considerably better than most of the existing state-of-the-art predictive methods. When tested on a recently constructed independent dataset PDB186, SVM-PSSM-DT also achieved the best performance with ACC of 80.00%, MCC of 0.647 and AUC of 87.40%, and outperformed some existing state-of-the-art methods. CONCLUSIONS The experiment results demonstrate that PSSM Distance Transformation is an available protein sequence encoding method and SVM-PSSM-DT is a useful tool for identifying the DNA-binding proteins. A user-friendly web-server of SVM-PSSM-DT was constructed, which is freely accessible to the public at the web-site on http://bioinformatics.hitsz.edu.cn/PSSM-DT/.
Collapse
Affiliation(s)
- Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Hongpeng Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Yulan He
- School of Engineering & Applied Science, Aston University, Birmingham, UK
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| |
Collapse
|
47
|
Samant M, Jethva M, Hasija Y. INTERACT-O-FINDER: A Tool for Prediction of DNA-Binding Proteins Using Sequence Features. Int J Pept Res Ther 2014. [DOI: 10.1007/s10989-014-9446-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
48
|
newDNA-Prot: Prediction of DNA-binding proteins by employing support vector machine and a comprehensive sequence representation. Comput Biol Chem 2014; 52:51-9. [PMID: 25240115 DOI: 10.1016/j.compbiolchem.2014.09.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2014] [Revised: 09/05/2014] [Accepted: 09/06/2014] [Indexed: 11/21/2022]
Abstract
Identification of DNA-binding proteins is essential in studying cellular activities as the DNA-binding proteins play a pivotal role in gene regulation. In this study, we propose newDNA-Prot, a DNA-binding protein predictor that employs support vector machine classifier and a comprehensive feature representation. The sequence representation are categorized into 6 groups: primary sequence based, evolutionary profile based, predicted secondary structure based, predicted relative solvent accessibility based, physicochemical property based and biological function based features. The mRMR, wrapper and two-stage feature selection methods are employed for removing irrelevant features and reducing redundant features. Experiments demonstrate that the two-stage method performs better than the mRMR and wrapper methods. We also perform a statistical analysis on the selected features and results show that more than 95% of the selected features are statistically significant and they cover all 6 feature groups. The newDNA-Prot method is compared with several state of the art algorithms, including iDNA-Prot, DNAbinder and DNA-Prot. The results demonstrate that newDNA-Prot method outperforms the iDNA-Prot, DNAbinder and DNA-Prot methods. More specific, newDNA-Prot improves the runner-up method, DNA-Prot for around 10% on several evaluation measures. The proposed newDNA-Prot method is available at http://sourceforge.net/projects/newdnaprot/
Collapse
|
49
|
Liu B, Xu J, Lan X, Xu R, Zhou J, Wang X, Chou KC. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS One 2014; 9:e106691. [PMID: 25184541 PMCID: PMC4153653 DOI: 10.1371/journal.pone.0106691] [Citation(s) in RCA: 208] [Impact Index Per Article: 20.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2014] [Accepted: 07/31/2014] [Indexed: 11/18/2022] Open
Abstract
Playing crucial roles in various cellular processes, such as recognition of specific nucleotide sequences, regulation of transcription, and regulation of gene expression, DNA-binding proteins are essential ingredients for both eukaryotic and prokaryotic proteomes. With the avalanche of protein sequences generated in the postgenomic age, it is a critical challenge to develop automated methods for accurate and rapidly identifying DNA-binding proteins based on their sequence information alone. Here, a novel predictor, called "iDNA-Prot|dis", was established by incorporating the amino acid distance-pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition (PseAAC) vector. The former can capture the characteristics of DNA-binding proteins so as to enhance its prediction quality, while the latter can reduce the dimension of PseAAC vector so as to speed up its prediction process. It was observed by the rigorous jackknife and independent dataset tests that the new predictor outperformed the existing predictors for the same purpose. As a user-friendly web-server, iDNA-Prot|dis is accessible to the public at http://bioinformatics.hitsz.edu.cn/iDNA-Prot_dis/. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step protocol guide is provided on how to use the web-server to get their desired results without the need to follow the complicated mathematic equations that are presented in this paper just for the integrity of its developing process. It is anticipated that the iDNA-Prot|dis predictor may become a useful high throughput tool for large-scale analysis of DNA-binding proteins, or at the very least, play a complementary role to the existing predictors in this regard.
Collapse
Affiliation(s)
- Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China
- Gordon Life Science Institute, Belmont, Massachusetts, United States of America
- * E-mail: (BL); (KCC)
| | - Jinghao Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Xun Lan
- Stanford University, Stanford, California, United States of America
| | - Ruifeng Xu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Jiyun Zhou
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Belmont, Massachusetts, United States of America
- Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
- * E-mail: (BL); (KCC)
| |
Collapse
|
50
|
Zhao H, Wang J, Zhou Y, Yang Y. Predicting DNA-binding proteins and binding residues by complex structure prediction and application to human proteome. PLoS One 2014; 9:e96694. [PMID: 24792350 PMCID: PMC4008587 DOI: 10.1371/journal.pone.0096694] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2014] [Accepted: 04/10/2014] [Indexed: 12/25/2022] Open
Abstract
As more and more protein sequences are uncovered from increasingly inexpensive sequencing techniques, an urgent task is to find their functions. This work presents a highly reliable computational technique for predicting DNA-binding function at the level of protein-DNA complex structures, rather than low-resolution two-state prediction of DNA-binding as most existing techniques do. The method first predicts protein-DNA complex structure by utilizing the template-based structure prediction technique HHblits, followed by binding affinity prediction based on a knowledge-based energy function (Distance-scaled finite ideal-gas reference state for protein-DNA interactions). A leave-one-out cross validation of the method based on 179 DNA-binding and 3797 non-binding protein domains achieves a Matthews correlation coefficient (MCC) of 0.77 with high precision (94%) and high sensitivity (65%). We further found 51% sensitivity for 82 newly determined structures of DNA-binding proteins and 56% sensitivity for the human proteome. In addition, the method provides a reasonably accurate prediction of DNA-binding residues in proteins based on predicted DNA-binding complex structures. Its application to human proteome leads to more than 300 novel DNA-binding proteins; some of these predicted structures were validated by known structures of homologous proteins in APO forms. The method [SPOT-Seq (DNA)] is available as an on-line server at http://sparks-lab.org.
Collapse
Affiliation(s)
- Huiying Zhao
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
| | - Jihua Wang
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Dezhou University, Dezhou, Shandong, China
| | - Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- Shandong Provincial Key Laboratory of Functional Macromolecular Biophysics, Dezhou University, Dezhou, Shandong, China
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Southport, Queensland, Australia
- * E-mail: (YZ); (YY)
| | - Yuedong Yang
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana, United States of America
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, United States of America
- Institute for Glycomics and School of Information and Communication Technique, Griffith University, Southport, Queensland, Australia
- * E-mail: (YZ); (YY)
| |
Collapse
|