1
Pradhan UK, Meher PK, Naha S, Das R, Gupta A, Parsad R. ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins. Protein Sci 2024; 33:e5015. [PMID: 38747369 PMCID: PMC11094783 DOI: 10.1002/pro.5015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 04/18/2024] [Accepted: 04/21/2024] [Indexed: 05/19/2024]
Abstract
Prokaryotic DNA binding proteins (DBPs) play pivotal roles in governing gene regulation, DNA replication, and various cellular functions. Accurate computational models for predicting prokaryotic DBPs hold immense promise in accelerating the discovery of novel proteins, fostering a deeper understanding of prokaryotic biology, and facilitating the development of targeted therapeutics for potential disease interventions. However, existing generic prediction models often exhibit lower accuracy in predicting prokaryotic DBPs. To address this gap, we introduce ProkDBP, a novel machine learning-driven computational model for prediction of prokaryotic DBPs. For prediction, a total of nine shallow learning algorithms and five deep learning models were utilized, with the shallow learning models demonstrating higher performance metrics compared to their deep learning counterparts. The light gradient boosting machine (LGBM), coupled with evolutionarily significant features selected via random forest variable importance measure (RF-VIM), yielded the highest five-fold cross-validation accuracy. The model achieved the highest auROC (0.9534) and auPRC (0.9575) among the 14 machine learning models evaluated. Additionally, ProkDBP demonstrated substantial performance with an independent dataset, exhibiting higher values of auROC (0.9332) and auPRC (0.9371). Notably, when benchmarked against several cutting-edge existing models, ProkDBP showcased superior predictive accuracy. Furthermore, to promote accessibility and usability, ProkDBP (https://iasri-sg.icar.gov.in/prokdbp/) is available as an online prediction tool, enabling free access to interested users. This tool stands as a significant contribution, enhancing the repertoire of resources for accurate and efficient prediction of prokaryotic DBPs.
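The auROC values quoted in this abstract summarize how well a model ranks binding proteins above non-binding ones. As a rough illustration only, the sketch below computes auROC through the rank-sum identity (the fraction of positive/negative pairs ordered correctly, with ties counted as half). The function name `au_roc` and the toy labels and scores are assumptions made for this example; they are not taken from ProkDBP.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: iterable of 0/1 values (1 = DNA-binding, by assumption here).
    scores: predicted scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three positives, three negatives.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(au_roc(toy_labels, toy_scores))
```

The quadratic pairwise loop is chosen for readability; for large datasets a sort-based rank computation gives the same value more cheaply.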
Affiliation(s)
- Upendra Kumar Pradhan
  - Division of Statistical Genetics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
- Prabina Kumar Meher
  - Division of Statistical Genetics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
- Sanchita Naha
  - Division of Computer Applications, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
- Ritwika Das
  - Division of Agricultural Bioinformatics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
- Ajit Gupta
  - Division of Statistical Genetics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
- Rajender Parsad
  - ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
2
Hosen MF, Mahmud SH, Ahmed K, Chen W, Moni MA, Deng HW, Shoombuatong W, Hasan MM. DeepDNAbP: A deep learning-based hybrid approach to improve the identification of deoxyribonucleic acid-binding proteins. Comput Biol Med 2022; 145:105433. [DOI: 10.1016/j.compbiomed.2022.105433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2021] [Revised: 03/11/2022] [Accepted: 03/20/2022] [Indexed: 11/03/2022]
3
DNAPred_Prot: Identification of DNA-Binding Proteins Using Composition- and Position-Based Features. Appl Bionics Biomech 2022; 2022:5483115. [PMID: 35465187 PMCID: PMC9020926 DOI: 10.1155/2022/5483115] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/15/2021] [Revised: 12/25/2021] [Accepted: 02/05/2022] [Indexed: 12/29/2022]
Abstract
In the domain of genome annotation, the identification of DNA-binding proteins is one of the crucial challenges. DNA is considered a blueprint for the cell. It contains all the information necessary for building and maintaining the traits of an organism. It is DNA that makes a living thing a living thing. Protein interaction with DNA plays an essential role in regulating DNA functions such as DNA repair, transcription, and regulation. Identification of these proteins is a crucial task for understanding the regulation of genes. Several methods have been developed to identify the binding sites of DNA and protein depending upon the structures and sequences, but they were costly and time-consuming. Therefore, we propose a methodology named “DNAPred_Prot”, which uses various position and frequency-dependent features from protein sequences for efficient and effective prediction of DNA-binding proteins. Using 10-fold cross-validation and jackknife testing, accuracies of 94.95% and 95.11% were obtained, respectively. The results of SVM and ANN were also compared with those of a random forest classifier. The robustness of the proposed model was evaluated using the independent dataset PDB186, on which it achieved an accuracy of 91.47%. These results suggest that the proposed methodology performs better than other existing methods for the identification of DNA-binding proteins.
4
Nanni L, Brahnam S. Robust ensemble of handcrafted and learned approaches for DNA-binding proteins. Applied Computing and Informatics 2021. [DOI: 10.1108/aci-03-2021-0051] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
Purpose
Automatic DNA-binding protein (DNA-BP) classification is now an essential proteomic technology. Unfortunately, many systems reported in the literature are tested on only one or two datasets/tasks. The purpose of this study is to create the most optimal and universal system for DNA-BP classification, one that performs competitively across several DNA-BP classification tasks.
Design/methodology/approach
Efficient DNA-BP classifier systems require the discovery of powerful protein representations and feature extraction methods. Experiments were performed that combined and compared descriptors extracted from state-of-the-art matrix/image protein representations. These descriptors were trained on separate support vector machines (SVMs) and evaluated. Convolutional neural networks with different parameter settings were fine-tuned on two matrix representations of proteins. Decisions were fused with the SVMs using the weighted sum rule and evaluated to experimentally derive the most powerful general-purpose DNA-BP classifier system.
Findings
The best ensemble proposed here produced comparable, if not superior, classification results on a broad and fair comparison with the literature across four different datasets representing a variety of DNA-BP classification tasks, thereby demonstrating both the power and generalizability of the proposed system.
Originality/value
Most DNA-BP methods proposed in the literature are validated on only one (rarely two) datasets/tasks. In this work, the authors report the performance of their general-purpose DNA-BP system on four datasets representing different DNA-BP classification tasks. The excellent results of the proposed best classifier system demonstrate the power of the proposed approach. These results can now be used for baseline comparisons by other researchers in the field.
5
Zhang Y, Chen P, Gao Y, Ni J, Wang X. DBP-PSSM: Combination of Evolutionary Profiles with the XGBoost Algorithm to Improve the Identification of DNA-binding Proteins. Comb Chem High Throughput Screen 2020; 25:3-12. [PMID: 33238837 DOI: 10.2174/1386207323999201124203531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/27/2020] [Revised: 10/16/2020] [Accepted: 10/29/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND AND OBJECTIVE: DNA-binding proteins play important roles in a variety of biological processes, such as gene transcription and regulation, DNA replication and repair, DNA recombination and packaging, and the formation of chromatin and ribosomes. Therefore, it is urgent to develop a computational method to improve the recognition efficiency of DNA-binding proteins. METHODS: We proposed a novel method, DBP-PSSM, which constructed the features from amino acid composition and evolutionary information of protein sequences. The maximum relevance, minimum redundancy (mRMR) method was employed to select the optimal features for establishing the XGBoost classifier. The novel model for predicting DNA-binding proteins, DBP-PSSM, was thus established with 5-fold cross-validation on the training dataset. RESULTS: DBP-PSSM achieved an accuracy of 81.18% and an MCC of 0.657 on a test dataset, outperforming many existing methods. These results demonstrated that our method can effectively predict DNA-binding proteins. CONCLUSION: The data and source code are provided at https://github.com/784221489/DNA-binding.
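Accuracy and the Matthews correlation coefficient (MCC) recur throughout these abstracts. A minimal sketch of how both follow from a binary confusion matrix is given below; the function name and the counts used in the example are hypothetical and are not taken from any of the cited studies.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts: 50 binders (40 found), 50 non-binders (45 rejected).
acc, mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={acc:.3f}  MCC={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is one reason it is commonly reported alongside accuracy on imbalanced benchmark sets.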
Affiliation(s)
- Yanping Zhang
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Pengcheng Chen
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Ya Gao
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Jianwei Ni
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
- Xiaosheng Wang
  - School of Mathematics and Physics Science and Engineering, Hebei University of Engineering, Handan 056038, China
6
PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method. BioMed Research International 2020; 2020:7297631. [PMID: 32352006 PMCID: PMC7174956 DOI: 10.1155/2020/7297631] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/02/2020] [Accepted: 04/01/2020] [Indexed: 12/02/2022]
Abstract
DNA-binding proteins (DBPs) play vital roles in all aspects of genetic activities. However, the identification of DBPs by using wet-lab experimental approaches is often time-consuming and laborious. In this study, we develop a novel computational method, called PredDBP-Stack, to predict DBPs solely based on protein sequences. First, amino acid composition (AAC) and transition probability composition (TPC) extracted from the hidden Markov model (HMM) profile are adopted to represent a protein. Next, we establish a stacked ensemble model to identify DBPs, which involves two stages of learning. In the first stage, the four base classifiers are trained with the features of HMM-based compositions. In the second stage, the prediction probabilities of these base classifiers are used as inputs to the meta-classifier to perform the final prediction of DBPs. Based on the PDB1075 benchmark dataset, we conduct a jackknife cross validation with the proposed PredDBP-Stack predictor and obtain a balanced sensitivity and specificity of 92.47% and 92.36%, respectively. This outcome outperforms most of the existing classifiers. Furthermore, our method also achieves superior performance and model robustness on the PDB186 independent dataset. This demonstrates that the PredDBP-Stack is an effective classifier for accurately identifying DBPs based on protein sequence information alone.
7
HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection. Computational and Mathematical Methods in Medicine 2020; 2020:1384749. [PMID: 32300371 PMCID: PMC7142336 DOI: 10.1155/2020/1384749] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/29/2020] [Accepted: 03/16/2020] [Indexed: 02/08/2023]
Abstract
Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science due to its crucial role in all aspects of biological activities. Even though considerable efforts have been devoted to developing powerful computational methods to solve this problem, it is still a challenging task in the field of bioinformatics. A hidden Markov model (HMM) profile has been proved to provide important clues for improving the prediction performance of DBPs. In this paper, we propose a method, called HMMPred, which extracts the features of amino acid composition and auto- and cross-covariance transformation from the HMM profiles, to help train a machine learning model for identification of DBPs. Then, a feature selection technique is performed based on the extreme gradient boosting (XGBoost) algorithm. Finally, the selected optimal features are fed into a support vector machine (SVM) classifier to predict DBPs. The experimental results tested on two benchmark datasets show that the proposed method is superior to most of the existing methods and could serve as an alternative tool to identify DBPs.
8
Zhou L, Song X, Yu DJ, Sun J. Sequence-based Detection of DNA-binding Proteins using Multiple-view Features Allied with Feature Selection. Mol Inform 2020; 39:e2000006. [PMID: 32144887 DOI: 10.1002/minf.202000006] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/13/2020] [Accepted: 03/05/2020] [Indexed: 12/12/2022]
Abstract
DNA-binding proteins play essential roles in many molecular functions and gene regulation. Therefore, it becomes highly desirable to develop effective computational techniques for detecting DNA-binding proteins. In this paper, we proposed a new method, iDBP-DEP, which performs DNA-binding prediction by using discriminative features derived from multi-view feature sources, including evolutionary profile, dipeptide composition, and physicochemical properties, with feature selection. We evaluated iDBP-DEP on two benchmark datasets, i.e., PDB1075 and PDB594, by a rigorous jackknife test. Compared with the state-of-the-art sequence-based DNA-binding predictors, the proposed iDBP-DEP achieved 1.8% and 3.0% improvements in accuracy (Acc) and Matthews correlation coefficient (MCC), respectively, on the PDB1075 dataset, and 7.4% and 14.8% improvements in Acc and MCC, respectively, on PDB594. The independent validation test with PDB186 shows that the proposed method achieved the best performance on Acc (80.1%) and MCC (0.684), which further demonstrated the robustness of iDBP-DEP for the detection of DNA-binding proteins. Datasets and codes used in this study are freely available at https://githup.com/Zll-codeside/iDBP-DEP.
Affiliation(s)
- Liling Zhou
  - School of Internet of Things Engineering, Jiangnan University, Wuxi, China
- Xiaoning Song
  - School of Internet of Things Engineering, Jiangnan University, Wuxi, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
- Jun Sun
  - School of Internet of Things Engineering, Jiangnan University, Wuxi, China
9
Set of approaches based on 3D structure and position specific-scoring matrix for predicting DNA-binding proteins. Bioinformatics 2018; 35:1844-1851. [DOI: 10.1093/bioinformatics/bty912] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 07/26/2018] [Revised: 10/08/2018] [Accepted: 10/31/2018] [Indexed: 11/14/2022]
10
Mishra A, Pokhrel P, Hoque MT. StackDPPred: a stacking based prediction of DNA-binding protein from sequence. Bioinformatics 2018; 35:433-441. [DOI: 10.1093/bioinformatics/bty653] [Citation(s) in RCA: 64] [Impact Index Per Article: 10.7] [Received: 03/27/2018] [Accepted: 07/18/2018] [Indexed: 12/12/2022]
Affiliation(s)
- Avdesh Mishra
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Pujan Pokhrel
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Md Tamjidul Hoque
  - Department of Computer Science, University of New Orleans, New Orleans, LA, USA
11
Improved detection of DNA-binding proteins via compression technology on PSSM information. PLoS One 2017; 12:e0185587. [PMID: 28961273 PMCID: PMC5621689 DOI: 10.1371/journal.pone.0185587] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 06/14/2017] [Accepted: 09/17/2017] [Indexed: 12/04/2022]
Abstract
Since the importance of DNA-binding proteins in multiple biomolecular functions has been recognized, an increasing number of researchers are attempting to identify DNA-binding proteins. In recent years, machine learning methods have become more and more compelling as the volume of protein sequence data soars, because of their favorable speed and accuracy. In this paper, we extract three features from the protein sequence, namely NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-specific scoring matrix with Discrete Wavelet Transform), and PSSM-DCT (Position-specific scoring matrix with Discrete Cosine Transform). We also employ a feature selection algorithm on these feature vectors. Then, these features are fed into an SVM (support vector machine) classifier to predict DNA-binding proteins. We use three datasets, namely PDB1075, PDB594 and PDB186, to evaluate the performance of our approach. The PDB1075 and PDB594 datasets are employed for the jackknife test and the PDB186 dataset is used for the independent test. Our method achieves the best accuracy in the jackknife test, from 79.20% to 86.23% and 80.5% to 86.20% on the PDB1075 and PDB594 datasets, respectively. In the independent test, the accuracy of our method comes to 76.3%. The performance in the independent test also shows that our method has a certain ability to be effectively used for DNA-binding protein prediction. The data and source code are at https://doi.org/10.6084/m9.figshare.5104084.
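To make the idea of compressing a PSSM with a discrete cosine transform concrete, the sketch below applies an unnormalized one-dimensional DCT-II to each PSSM column and keeps only the first few coefficients, giving a fixed-length vector whatever the protein length. This is a simplified illustration under stated assumptions: the column-wise 1D transform, the `keep` parameter, and the function names are choices made here and may differ from the transform actually used in the cited paper.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II: X_k = sum_i x_i * cos(pi/N * (i + 0.5) * k)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def compress_pssm_columns(pssm, keep):
    """Keep the first `keep` DCT coefficients of every PSSM column.

    pssm: list of rows (one per residue), each a list of per-amino-acid scores.
    Assumes the protein has at least `keep` residues.
    """
    n_cols = len(pssm[0])
    features = []
    for col in range(n_cols):
        column = [row[col] for row in pssm]
        features.extend(dct_ii(column)[:keep])  # low-frequency part only
    return features


# Hypothetical 4-residue, 2-column "PSSM" with constant columns.
toy_pssm = [[1.0, 2.0]] * 4
print(compress_pssm_columns(toy_pssm, keep=2))
```

Keeping only low-frequency coefficients is what makes this a compression step: slowly varying trends along the sequence are retained while residue-to-residue fluctuations are discarded.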
12
Wei L, Tang J, Zou Q. Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2016.06.026] [Citation(s) in RCA: 196] [Impact Index Per Article: 28.0] [Indexed: 11/26/2022]
13
Ding H, Feng PM, Chen W, Lin H. Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis. Molecular BioSystems 2015; 10:2229-35. [PMID: 24931825 DOI: 10.1039/c4mb00316k] [Citation(s) in RCA: 106] [Impact Index Per Article: 11.8] [Indexed: 11/21/2022]
Abstract
The bacteriophage virion proteins play extremely important roles in the fate of host bacterial cells. Accurate identification of bacteriophage virion proteins is very important for understanding their functions and clarifying the lysis mechanism of bacterial cells. In this study, a new sequence-based method was developed to identify phage virion proteins. In the new method, the protein sequences were initially formulated by the g-gap dipeptide compositions. Subsequently, the analysis of variance (ANOVA) with incremental feature selection (IFS) was used to search for the optimal feature set. It was observed that, in jackknife cross-validation, the optimal feature set including 160 optimized features can produce the maximum accuracy of 85.02%. By performing feature analysis, we found that the correlation between two amino acids with one gap was more important than other correlations for phage virion protein prediction and that some of the 1-gap dipeptides were important and mainly contributed to the virion protein prediction. This analysis will provide novel insights into the function of phage virion proteins. On the basis of the proposed method, an online web-server, PVPred, was established and can be freely accessed from the website (http://lin.uestc.edu.cn/server/PVPred). We believe that the PVPred will become a powerful tool to study phage virion proteins and to guide the related experimental validations.
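The g-gap dipeptide composition mentioned in this abstract counts pairs of residues separated by g intervening positions, so a 1-gap dipeptide pairs residue i with residue i+2. The sketch below is a minimal illustration, assuming the 20 standard amino acids, frequency normalization over the counted pairs, and that pairs containing non-standard letters are skipped; the function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence, gap):
    """Return a 400-dimensional frequency vector of g-gap dipeptides.

    A g-gap dipeptide pairs residue i with residue i + gap + 1.
    Pairs containing letters outside the 20 standard amino acids are skipped.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - gap - 1):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


# Toy sequence: with gap=1 the only pairs are A..A and C..C.
vector = g_gap_dipeptide_composition("ACAC", gap=1)
print(len(vector), vector[0], vector[21])
```

With `gap=0` the function reduces to the ordinary adjacent dipeptide composition, which makes it easy to compare different gap values with the same feature length.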
Affiliation(s)
- Hui Ding
  - Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
14
Xu R, Zhou J, Wang H, He Y, Wang X, Liu B. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC Systems Biology 2015; 9 Suppl 1:S10. [PMID: 25708928 PMCID: PMC4331676 DOI: 10.1186/1752-0509-9-s1-s10] [Citation(s) in RCA: 64] [Impact Index Per Article: 7.1] [Indexed: 02/05/2023]
Abstract
BACKGROUND: DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation. There have been several computational methods proposed in the literature to deal with the DNA-binding protein identification. However, most of them can't provide an invaluable knowledge base for our understanding of DNA-protein interactions. RESULTS: We first presented a new protein sequence encoding method called PSSM Distance Transformation, and then constructed a DNA-binding protein identification method (SVM-PSSM-DT) by combining PSSM Distance Transformation with support vector machine (SVM). First, the PSSM profiles are generated by using the PSI-BLAST program to search the non-redundant (NR) database. Next, the PSSM profiles are transformed into uniform numeric representations by the distance transformation scheme. Lastly, the resulting uniform numeric representations are input into an SVM classifier for prediction. Thus, whether a sequence can bind to DNA or not can be determined. In a benchmark test on 525 DNA-binding and 550 non DNA-binding proteins using jackknife validation, the present model achieved an ACC of 79.96%, MCC of 0.622 and AUC of 86.50%. This performance is considerably better than most of the existing state-of-the-art predictive methods. When tested on a recently constructed independent dataset PDB186, SVM-PSSM-DT also achieved the best performance with ACC of 80.00%, MCC of 0.647 and AUC of 87.40%, and outperformed some existing state-of-the-art methods. CONCLUSIONS: The experimental results demonstrate that PSSM Distance Transformation is an available protein sequence encoding method and SVM-PSSM-DT is a useful tool for identifying the DNA-binding proteins.
A user-friendly web-server of SVM-PSSM-DT was constructed, which is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/PSSM-DT/.
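The abstract does not spell out the distance transformation, so the following is only one plausible reading, offered as a sketch: for each amino-acid column of the PSSM and each distance d, average the product of the scores at positions i and i+d. This yields a vector of fixed length (columns times `max_distance`) regardless of sequence length, which is what a "uniform numeric representation" requires. The function name, the `max_distance` parameter, and the restriction to same-column products are assumptions made for illustration.

```python
def pssm_same_residue_distance_transform(pssm, max_distance):
    """Fixed-length encoding of a variable-length PSSM (illustrative only).

    pssm: list of rows (one per residue), each a list of per-amino-acid scores.
    For every column and every distance d in 1..max_distance, the feature is
    the mean of pssm[i][col] * pssm[i + d][col] over all valid positions i.
    """
    length = len(pssm)
    if length <= max_distance:
        raise ValueError("sequence must be longer than max_distance")
    n_cols = len(pssm[0])
    features = []
    for col in range(n_cols):
        for d in range(1, max_distance + 1):
            total = sum(pssm[i][col] * pssm[i + d][col] for i in range(length - d))
            features.append(total / (length - d))
    return features


# Hypothetical 4-residue, 2-column matrix standing in for a real 20-column PSSM.
toy_pssm = [[1, 0], [2, 1], [3, 0], [4, 1]]
print(pssm_same_residue_distance_transform(toy_pssm, max_distance=2))
```

A real PSSM produced by PSI-BLAST has 20 columns, so with `max_distance=D` this sketch would produce a 20 × D feature vector suitable as input to a classifier such as an SVM.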
Affiliation(s)
- Ruifeng Xu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Jiyun Zhou
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Hongpeng Wang
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Yulan He
  - School of Engineering & Applied Science, Aston University, Birmingham, UK
- Xiaolong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
  - Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong, China
15
Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes. PLoS One 2014; 9:e86703. [PMID: 24475169 PMCID: PMC3901691 DOI: 10.1371/journal.pone.0086703] [Citation(s) in RCA: 112] [Impact Index Per Article: 11.2] [Received: 09/29/2013] [Accepted: 12/10/2013] [Indexed: 11/22/2022]
Abstract
Developing an efficient method for determination of the DNA-binding proteins, due to their vital roles in gene regulation, is becoming highly desired since it would be invaluable to advance our understanding of protein functions. In this study, we proposed a new method for the prediction of the DNA-binding proteins, by performing feature ranking using random forest and wrapper-based feature selection using a forward best-first search strategy. The features comprise information from primary sequence, predicted secondary structure, predicted relative solvent accessibility, and position specific scoring matrix. The proposed method, called DBPPred, used Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers, including decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. As a result, the proposed DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 according to the five-fold cross validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 were performed by the proposed model trained on the entire PDB594 dataset and by five other existing methods (including iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); the proposed DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. The independent tests performed by the proposed DBPPred on a large non-DNA binding protein dataset and two RNA binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between the DNA-binding and the non DNA-binding proteins.
All of the experimental results indicate that the proposed DBPPred can be an alternative perspective predictor for large-scale determination of DNA-binding proteins.
16
Ding H, Feng PM, Chen W, Lin H. Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis. Molecular BioSystems 2014. [PMID: 24931825 DOI: 10.1039/c4mb00316k] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022]
Abstract
The bacteriophage virion proteins play extremely important roles in the fate of host bacterial cells.
Affiliation(s)
- Hui Ding
  - Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Peng-Mian Feng
  - School of Public Health, Hebei United University, Tangshan 063000, China
- Wei Chen
  - Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
17
Min JL, Xiao X, Chou KC. iEzy-drug: a web server for identifying the interaction between enzymes and drugs in cellular networking. BioMed Research International 2013; 2013:701317. [PMID: 24371828 PMCID: PMC3858977 DOI: 10.1155/2013/701317] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Received: 08/07/2013] [Accepted: 09/17/2013] [Indexed: 01/16/2023]
Abstract
With the features of extremely high selectivity and efficiency in catalyzing almost all the chemical reactions in cells, enzymes play vitally important roles for the life of an organism and hence have become frequent targets for drug design. An essential step in developing drugs by targeting enzymes is to identify drug-enzyme interactions in cells. It is both time-consuming and costly to do this by experimental techniques alone. Although some computational methods were developed in this regard based on the knowledge of the three-dimensional structure of the enzyme, unfortunately their usage is quite limited because three-dimensional structures of many enzymes are still unknown. Here, we reported a sequence-based predictor, called "iEzy-Drug," in which each drug compound was formulated by a molecular fingerprint with 258 feature components, each enzyme by Chou's pseudo amino acid composition generated via incorporating sequential evolution information and physicochemical features derived from its sequence, and the prediction engine was operated by the fuzzy K-nearest neighbor algorithm. The overall success rate achieved by iEzy-Drug via rigorous cross-validations was about 91%. Moreover, to maximize the convenience for the majority of experimental scientists, a user-friendly web server was established, by which users can easily obtain their desired results.
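For readers unfamiliar with the fuzzy K-nearest neighbor algorithm named in this abstract, the sketch below shows one common formulation: each of the k nearest training samples votes for its class with weight 1 / d^(2/(m-1)), and the normalized votes are read as class memberships. The use of crisp training labels, Euclidean distance, the defaults `k=3` and `m=2.0`, and the toy feature vectors are all assumptions for illustration and are not taken from iEzy-Drug.

```python
import math


def fuzzy_knn_memberships(train_x, train_y, query, k=3, m=2.0):
    """Return {label: membership} for `query` using inverse-distance voting.

    train_x: list of feature tuples; train_y: list of class labels.
    m > 1 is the fuzzifier; larger m flattens the distance weighting.
    """
    neighbours = sorted(
        zip(train_x, train_y), key=lambda xy: math.dist(xy[0], query)
    )[:k]
    exponent = 2.0 / (m - 1.0)
    weights = {}
    for x, y in neighbours:
        d = max(math.dist(x, query), 1e-12)  # guard against zero distance
        weights[y] = weights.get(y, 0.0) + 1.0 / d ** exponent
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Hypothetical 2-D feature vectors: label 1 = interacting, 0 = non-interacting.
train_x = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
train_y = [1, 1, 0, 0]
print(fuzzy_knn_memberships(train_x, train_y, query=(2.0, 0.0)))
```

Compared with majority-vote KNN, the membership values give a graded confidence for each class, and the predicted class is simply the label with the largest membership.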
Affiliation(s)
- Jian-Liang Min
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
- Xuan Xiao
  - Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
  - Information School, ZheJiang Textile & Fashion College, NingBo 315211, China
  - Gordon Life Science Institute, Belmont, MA 02478, USA
- Kuo-Chen Chou
  - Gordon Life Science Institute, Belmont, MA 02478, USA
  - Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia
18
Predicting DNA binding proteins using support vector machine with hybrid fractal features. J Theor Biol 2013; 343:186-92. [PMID: 24189096 DOI: 10.1016/j.jtbi.2013.10.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 05/14/2013] [Revised: 08/12/2013] [Accepted: 10/17/2013] [Indexed: 11/20/2022]
Abstract
DNA-binding proteins play a vitally important role in many biological processes. Prediction of DNA-binding proteins from amino acid sequence is a significant but not yet satisfactorily resolved scientific problem. Chaos game representation (CGR) investigates the patterns hidden in protein sequences, and visually reveals previously unknown structure. Fractal dimensions (FD) are good tools to measure sizes of complex, highly irregular geometric objects. In order to extract the intrinsic correlation with DNA-binding property from protein sequences, the CGR algorithm, fractal dimension and amino acid composition are applied to formulate the numerical features of protein samples in this paper. Seven groups of features are extracted, which can be computed directly from the primary sequence, and each group is evaluated by the 10-fold cross-validation test and the jackknife test. Comparing the results of numerical experiments, the group of amino acid composition and fractal dimension (21-dimension vector) gives the best result: an average accuracy of 81.82% and an average Matthews correlation coefficient (MCC) of 0.6017. The resulting predictor is also compared with the existing method DNA-Prot and shows better performance.
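The 21-dimension vector mentioned in the abstract above (amino acid composition plus one fractal dimension) can be illustrated with a minimal sketch. The function names are hypothetical, and the fractal dimension is taken as an already-computed input because the abstract does not specify how the CGR-based value is derived.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid (20 values)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]


def hybrid_feature_vector(sequence, fractal_dimension):
    """Append one fractal-dimension value to the composition (21 values).

    `fractal_dimension` is a placeholder input: how the cited work computes it
    from the chaos game representation is not stated in the abstract.
    """
    return amino_acid_composition(sequence) + [float(fractal_dimension)]
```

A vector built this way could then be passed to any classifier, such as a support vector machine, for training and cross-validation.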
|
19
|
Zou C, Gong J, Li H. An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics 2013; 14:90. [PMID: 23497329 PMCID: PMC3602657 DOI: 10.1186/1471-2105-14-90] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Received: 09/29/2012] [Accepted: 03/04/2013] [Indexed: 11/10/2022]
Abstract
Background: DNA-binding proteins (DNA-BPs) play a pivotal role in both eukaryotic and prokaryotic proteomes. There have been several computational methods proposed in the literature to deal with the DNA-BPs, many informative features and properties were used and proved to have significant impact on this problem. However the ultimate goal of Bioinformatics is to be able to predict the DNA-BPs directly from primary sequence.
Results: In this work, the focus is how to transform these informative features into uniform numeric representation appropriately and improve the prediction accuracy of our SVM-based classifier for DNA-BPs. A systematic representation of some selected features known to perform well is investigated here. Firstly, four kinds of protein properties are obtained and used to describe the protein sequence. Secondly, three different feature transformation methods (OCTD, AC and SAA) are adopted to obtain numeric feature vectors from three main levels: Global, Nonlocal and Local of protein sequence and their performances are exhaustively investigated. At last, the mRMR-IFS feature selection method and ensemble learning approach are utilized to determine the best prediction model. Besides, the optimal features selected by mRMR-IFS are illustrated based on the observed results which may provide useful insights for revealing the mechanisms of protein-DNA interactions. For five-fold cross-validation over the DNAdset and DNAaset, we obtained an overall accuracy of 0.940 and 0.811, MCC of 0.881 and 0.614 respectively.
Conclusions: The good results suggest that it can efficiently develop an entirely sequence-based protocol that transforms and integrates informative features from different scales used by SVM to predict DNA-BPs accurately. Moreover, a novel systematic framework for sequence descriptor-based protein function prediction is proposed here.
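As a rough illustration of a nonlocal transformation such as the AC method named in the abstract above, the sketch below computes an auto-covariance over a per-residue property profile. It assumes that AC denotes an auto-covariance style transform and uses the common textbook normalization; the exact definition used by the authors is not given in the abstract, and the function name and input profile are hypothetical.

```python
def auto_covariance(profile, lag):
    """Auto-covariance of a numeric per-residue profile at a given lag.

    Assumed definition: the mean over positions i of
    (P[i] - mean) * (P[i + lag] - mean), for i = 0 .. n - lag - 1.
    """
    n = len(profile)
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < len(profile)")
    mean = sum(profile) / n
    total = sum(
        (profile[i] - mean) * (profile[i + lag] - mean) for i in range(n - lag)
    )
    return total / (n - lag)


def ac_features(profile, max_lag):
    """Fixed-length feature vector: one auto-covariance value per lag."""
    return [auto_covariance(profile, lag) for lag in range(1, max_lag + 1)]
```

Because the output length depends only on `max_lag`, proteins of different lengths map to vectors of the same size, which is what an SVM requires.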
Affiliation(s)
- Chuanxin Zou
- Shanghai Key Laboratory of New Drug Design, State Key Laboratory of Bioreactor Engineering, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
|
20
|
Learning protein multi-view features in complex space. Amino Acids 2013; 44:1365-79. [DOI: 10.1007/s00726-013-1472-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 06/13/2012] [Accepted: 02/13/2013] [Indexed: 12/11/2022]
|
21
|
Ma X, Guo J, Liu HD, Xie JM, Sun X. Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1766-1775. [PMID: 22868682 DOI: 10.1109/tcbb.2012.106] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Indexed: 06/01/2023]
Abstract
The recognition of DNA-binding residues in proteins is critical to our understanding of the mechanisms of DNA-protein interactions, gene expression, and for guiding drug design. Therefore, a prediction method DNABR (DNA Binding Residues) is proposed for predicting DNA-binding residues in protein sequences using the random forest (RF) classifier with sequence-based features. Two types of novel sequence features are proposed in this study, which reflect the information about the conservation of physicochemical properties of the amino acids, and the correlation of amino acids between different sequence positions in terms of physicochemical properties. The first type of feature uses the evolutionary information combined with the conservation of physicochemical properties of the amino acids while the second reflects the dependency effect of amino acids with regard to polarity charge and hydrophobic properties in the protein sequences. Those two features and an orthogonal binary vector which reflects the characteristics of 20 types of amino acids are used to build the DNABR, a model to predict DNA-binding residues in proteins. The DNABR model achieves a value of 0.6586 for Matthews correlation coefficient (MCC) and 93.04 percent overall accuracy (ACC) with a 68.47 percent sensitivity (SE) and 98.16 percent specificity (SP). The comparisons with each feature demonstrate that these two novel features contribute most to the improvement in predictive ability. Furthermore, performance comparisons with other approaches clearly show that DNABR has an excellent prediction performance for detecting binding residues in putative DNA-binding proteins. The DNABR web-server system is freely available at http://www.cbi.seu.edu.cn/DNABR/.
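The four metrics reported in the abstract above (MCC, ACC, SE, SP) all derive from the same confusion-matrix counts. The following is a minimal sketch using the standard definitions; the function name and the example counts are hypothetical and are not taken from the DNABR study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, SE, SP and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    acc = (tp + tn) / total
    se = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc}
```

When non-binding residues greatly outnumber binding ones, accuracy is dominated by the negative class, which is why a high ACC can coexist with a much lower SE and a moderate MCC.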
Affiliation(s)
- Xin Ma
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University and Nanjing Audit University, Nanjing, P.R. China.
|
22
|
Lin WZ, Fang JA, Xiao X, Chou KC. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS One 2011; 6:e24756. [PMID: 21935457 PMCID: PMC3174210 DOI: 10.1371/journal.pone.0024756] [Citation(s) in RCA: 194] [Impact Index Per Article: 14.9] [Received: 07/24/2011] [Accepted: 08/16/2011] [Indexed: 11/18/2022]
Abstract
DNA-binding proteins play crucial roles in various cellular processes. Developing high throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating the features into the general form of pseudo amino acid composition that were extracted from protein sequences via the “grey model” and by adopting the random forest operation engine, we proposed a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding proteins or non-DNA binding proteins based on their amino acid sequences information alone. The overall success rate by iDNA-Prot was 83.96% that was obtained via jackknife tests on a newly constructed stringent benchmark dataset in which none of the proteins included has pairwise sequence identity to any other in a same subset. In addition to achieving high success rate, the computational time for iDNA-Prot is remarkably shorter in comparison with the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high throughput tool for large-scale analysis of DNA-binding proteins. As a user-friendly web-server, iDNA-Prot is freely accessible to the public at the web-site on http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results.
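The jackknife test mentioned in the abstract above holds out each protein in turn, trains on all the others, and counts how often the held-out sample is classified correctly. The sketch below shows that procedure only; it substitutes a simple one-nearest-neighbor rule for the random forest used by iDNA-Prot, and all names and the toy data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_label(training, query):
    """Label of the training sample closest to `query` (stand-in classifier)."""
    _, label = min(training, key=lambda item: squared_distance(item[0], query))
    return label


def jackknife_success_rate(samples):
    """Leave-one-out success rate over a list of (features, label) pairs."""
    if len(samples) < 2:
        raise ValueError("jackknife testing needs at least two samples")
    correct = 0
    for i, (features, label) in enumerate(samples):
        # Train on every sample except the one currently held out.
        training = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(training, features) == label:
            correct += 1
    return correct / len(samples)


if __name__ == "__main__":
    toy = [([0.0, 0.0], 0), ([0.1, 0.0], 0), ([1.0, 1.0], 1), ([0.9, 1.0], 1)]
    print(jackknife_success_rate(toy))
```

Unlike k-fold cross-validation, the jackknife result involves no random partitioning, so it yields the same value on every run for a given dataset and deterministic classifier.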
Affiliation(s)
- Wei-Zhong Lin
- Information Science and Technology School, Donghua University, Shanghai, China
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, China
- Jian-An Fang
- Information Science and Technology School, Donghua University, Shanghai, China
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, China
- Gordon Life Science Institute, San Diego, California, United States of America
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
|
23
|
Zhang SW, Chen W, Yang F, Pan Q. Using Chou's pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach. Amino Acids 2008; 35:591-8. [PMID: 18427713 DOI: 10.1007/s00726-008-0086-x] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.4] [Received: 02/19/2008] [Accepted: 02/28/2008] [Indexed: 12/11/2022]
Abstract
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, which associate through noncovalent interactions and, occasionally, disulfide bonds to form protein quaternary structures. It has long been known that the functions of proteins are closely related to their quaternary structures; some examples include enzymes, hemoglobin, DNA polymerase, and ion channels. However, it is extremely labor-expensive and even impossible to quickly determine the structures of hundreds of thousands of protein sequences solely from experiments. Since the number of protein sequences entering databanks is increasing rapidly, it is highly desirable to develop computational methods for classifying the quaternary structures of proteins from their primary sequences. Since the concept of Chou's pseudo amino acid composition (PseAAC) was introduced, a variety of approaches, such as residue conservation scores, von Neumann entropy, multiscale energy, autocorrelation function, moment descriptors, and cellular automata, have been utilized to formulate the PseAAC for predicting different attributes of proteins. Here, in a different approach, a sequence-segmented PseAAC is introduced to represent protein samples. Meanwhile, multiclass SVM classifier modules were adopted to classify protein quaternary structures. As a demonstration, the dataset constructed by Chou and Cai [(2003) Proteins 53:282-289] was adopted as a benchmark dataset. The overall jackknife success rates thus obtained were 88.2-89.1%, indicating that the new approach is quite promising for predicting protein quaternary structure.
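The segmentation idea in the abstract above can be sketched as follows: split the sequence into consecutive segments, describe each segment separately, and concatenate the results. This sketch uses plain amino acid composition per segment and omits the sequence-order terms of the full PseAAC; the function names and the default of three segments are assumptions, not details from the paper.

```python
# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Relative frequency of each standard amino acid within one segment.

    Non-standard characters count toward the segment length but have no slot.
    """
    n = len(segment)
    return [segment.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def split_into_segments(sequence, n_segments):
    """Split a sequence into consecutive segments of near-equal length."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    length = len(sequence)
    bounds = [round(i * length / n_segments) for i in range(n_segments + 1)]
    return [sequence[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def segmented_composition(sequence, n_segments=3):
    """Concatenate per-segment compositions into one vector (20 * n_segments)."""
    features = []
    for segment in split_into_segments(sequence.upper(), n_segments):
        features.extend(composition(segment))
    return features
```

Compared with a single whole-sequence composition, the segmented vector retains coarse positional information, for example whether a residue type is enriched near the N-terminus or the C-terminus.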
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, 710072, Xi'an, China.
|
24
|
Zhao XM, Chen L, Aihara K. Protein function prediction with high-throughput data. Amino Acids 2008; 35:517-30. [PMID: 18427717 DOI: 10.1007/s00726-008-0077-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 01/31/2008] [Accepted: 03/13/2008] [Indexed: 12/12/2022]
Abstract
Protein function prediction is one of the main challenges in the post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework on protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, Tokyo, 151-0064, Japan
|
25
|
An ensemble of support vector machines for predicting the membrane protein type directly from the amino acid sequence. Amino Acids 2008; 35:573-80. [DOI: 10.1007/s00726-008-0083-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 01/21/2008] [Accepted: 02/26/2008] [Indexed: 11/26/2022]
|
26
|
An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids 2008; 36:167-75. [DOI: 10.1007/s00726-008-0044-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 11/30/2007] [Accepted: 02/07/2008] [Indexed: 10/22/2022]
|