1
Liu Y, Liu Y, Li Z. Protein-Protein Interaction Prediction via Structure-Based Deep Learning. Proteins 2024. [PMID: 38923590 DOI: 10.1002/prot.26721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 05/04/2024] [Accepted: 06/04/2024] [Indexed: 06/28/2024]
Abstract
Protein-protein interactions (PPIs) play an essential role in life activities. Many artificial intelligence algorithms based on protein sequence information have been developed to predict PPIs. However, these models have difficulty handling variable sequence lengths and suffer from poor generalization and low prediction accuracy. In this study, we proposed a novel end-to-end deep learning framework, RSPPI, which combines a residual neural network (ResNet) and spatial pyramid pooling (SPP) to predict PPIs from the physicochemical properties of the protein sequence and spatial structural information. In the RSPPI model, ResNet was employed to extract the structural and physicochemical information from the protein three-dimensional structure and primary sequence; the SPP layer was used to transform feature maps into a single vector, removing the fixed-length input requirement. The RSPPI model showed excellent cross-species performance and outperformed several state-of-the-art methods based on either protein sequence or gene ontology in most evaluation metrics. The RSPPI model provides a novel strategy for developing AI-based PPI prediction algorithms.
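The role of the SPP layer described in this abstract can be illustrated with a minimal sketch. It is not the RSPPI implementation: the function name `spatial_pyramid_pool`, the pyramid levels (1, 2, 4), and the use of max pooling are assumptions chosen for illustration. The point is only that feature maps of different sizes are reduced to vectors of the same length.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map into a fixed-length vector.

    For each pyramid level n the map is divided into an n x n grid and the
    maximum of every cell is kept, so the output length is the sum of n*n
    over all levels, regardless of the input size.
    Pyramid levels and max pooling are illustrative assumptions.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * rows) // n                 # floor of the lower edge
            r1 = ((i + 1) * rows + n - 1) // n   # ceiling of the upper edge
            for j in range(n):
                c0 = (j * cols) // n
                c1 = ((j + 1) * cols + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two inputs of different size yield vectors of identical length (1 + 4 + 16).
small = [[float(r * 7 + c) for c in range(7)] for r in range(5)]
large = [[float(r + c) for c in range(13)] for r in range(9)]
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(large)))
```

Because the bin edges use floor and ceiling, every cell contains at least one element even when the map is smaller than the grid; neighbouring cells then overlap slightly.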
Affiliation(s)
- Yucong Liu
  - Shanghai Key Laboratory of Mechanics in Energy Engineering, Shanghai Institute of Applied Mathematics and Mechanics, School of Mechanics and Engineering Science, Shanghai University, Shanghai, China
- Yijun Liu
  - Shanghai Key Laboratory of Mechanics in Energy Engineering, Shanghai Institute of Applied Mathematics and Mechanics, School of Mechanics and Engineering Science, Shanghai University, Shanghai, China
- Zhenhai Li
  - Shanghai Key Laboratory of Mechanics in Energy Engineering, Shanghai Institute of Applied Mathematics and Mechanics, School of Mechanics and Engineering Science, Shanghai University, Shanghai, China
2
Zou H, Yin Z. m7G-DPP: Identifying N7-methylguanosine sites based on dinucleotide physicochemical properties of RNA. Biophys Chem 2021; 279:106697. [PMID: 34628276 DOI: 10.1016/j.bpc.2021.106697] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/18/2021] [Revised: 10/01/2021] [Accepted: 10/02/2021] [Indexed: 11/17/2022]
Abstract
N7-methylguanosine (m7G) modification is one of the most common post-transcriptional RNA modifications and plays a vital role in the regulation of gene expression. Dysfunction of m7G may result in developmental defects and the onset of some serious diseases. Thus, fast and accurate identification of m7G sites is an urgent task. Because experimental approaches are costly and time-consuming, researchers have turned their attention to computational models. In the current study, we proposed a novel predictor called m7G-DPP to identify m7G sites. In the predictor, the RNA sequences were first encoded by the physicochemical (PC) properties of dinucleotides. Then, a sliding window approach was adopted to divide the PC matrix into multiple matrices, and Pearson's correlation coefficient (PCC), dynamic time warping (DTW), and distance correlation (DC) were employed to extract classification features at each window. Next, the least absolute shrinkage and selection operator (LASSO) algorithm was applied to select discriminative features. Finally, the selected features were fed into a support vector machine to identify m7G sites. Experimental results showed that the proposed method is effective and may play a complementary role in current m7G site prediction studies. The MATLAB code and dataset can be obtained from https://figshare.com/articles/online_resource/m7G-DPP/15000348.
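The sliding-window step can be sketched as follows for the PCC features only. The abstract does not say which quantities are correlated inside each window, so this sketch assumes that every pair of property rows is compared; the names `pearson` and `window_pcc_features`, the window size, and the step are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # convention chosen here for constant rows
    return cov / (sx * sy)


def window_pcc_features(pc_matrix, window, step=1):
    """Slide a window over the columns of the PC matrix.

    pc_matrix holds one row per physicochemical property and one column per
    dinucleotide position. Within each window, every pair of property rows
    contributes one PCC value (an assumption, not the published recipe).
    """
    length = len(pc_matrix[0])
    features = []
    for start in range(0, length - window + 1, step):
        rows = [row[start:start + window] for row in pc_matrix]
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                features.append(pearson(rows[i], rows[j]))
    return features


# Toy matrix: 3 properties x 4 positions gives 2 windows x 3 pairs = 6 features.
toy = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
print(window_pcc_features(toy, window=3))
```

In the pipeline described above, the resulting feature vector would then go through LASSO selection and an SVM, which are not shown here.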
Affiliation(s)
- Hongliang Zou
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang 330003, China
- Zhijian Yin
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang 330003, China
3
Yang L, Han Y, Zhang H, Li W, Dai Y. Prediction of Protein-Protein Interactions with Local Weight-Sharing Mechanism in Deep Learning. BioMed Research International 2020; 2020:5072520. [PMID: 32626745 PMCID: PMC7312734 DOI: 10.1155/2020/5072520] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/19/2019] [Revised: 03/04/2020] [Accepted: 05/21/2020] [Indexed: 12/30/2022]
Abstract
Protein-protein interactions (PPIs) are important for almost all cellular processes, including metabolic cycles, DNA transcription and replication, and signaling cascades. The experimental methods for identifying PPIs are time-consuming and expensive. Therefore, it is important to develop computational approaches for predicting PPIs. In this paper, an improved machine learning model is proposed for the study of protein-protein interactions. Considering the factors that affect PPI prediction, a method of feature extraction and fusion is proposed to increase the variety of features used in the prediction. In addition, because the input order of the two proteins can affect the result, we propose a "Y-type" Bi-RNN model and train the network with a method that requires both backward and forward training. To limit the additional training time caused by the extra backward or forward pass, this paper proposes a weight-sharing policy that minimizes the number of parameters in training. The experimental results show that the proposed method can achieve an accuracy of 99.57%, recall of 99.36%, sensitivity of 99.76%, precision of 99.74%, MCC of 99.14%, and AUC of 99.56% on the benchmark dataset.
Affiliation(s)
- Lei Yang
  - College of Computer Science and Engineering, Northeastern University, Shenyang, China
  - Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, China
- Yukun Han
  - College of Computer Science and Engineering, Northeastern University, Shenyang, China
- Huixue Zhang
  - College of Computer Science and Engineering, Northeastern University, Shenyang, China
- Wenlong Li
  - College of Software, Northeastern University, Shenyang, China
- Yu Dai
  - College of Software, Northeastern University, Shenyang, China
4
Wekesa JS, Meng J, Luan Y. Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction. Genomics 2020; 112:2928-2936. [PMID: 32437848 DOI: 10.1016/j.ygeno.2020.05.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/17/2019] [Revised: 04/22/2020] [Accepted: 05/05/2020] [Indexed: 12/28/2022]
Abstract
Long non-coding RNAs (lncRNAs) play key roles in regulating cellular biological processes through diverse molecular mechanisms, including binding to RNA-binding proteins. The majority of plant lncRNAs are functionally uncharacterized; thus, accurate prediction of plant lncRNA-protein interaction is imperative for subsequent functional studies. We present an integrative model, namely DRPLPI, whose distinguishing feature is prediction by multi-feature fusion. Structural features and four groups of sequence features are used, including tri-nucleotide composition, gapped k-mer, recursive complement and binary profile. We design a multi-head self-attention long short-term memory encoder-decoder network to extract generative high-level features. To obtain robust results, DRPLPI combines categorical boosting and extra trees into a single meta-learner. Experiments on Zea mays and Arabidopsis thaliana obtained 0.9820 and 0.9652 area under the precision/recall curve (AUPRC), respectively. The proposed method shows significant enhancement in prediction performance compared with existing state-of-the-art methods.
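Of the sequence features listed in this abstract, tri-nucleotide composition is the simplest to show. The sketch below computes it as the relative frequency of each of the 64 possible 3-mers; the function name `kmer_composition`, the RNA alphabet ordering, and the choice to skip windows containing ambiguous characters are assumptions, not details taken from DRPLPI.

```python
from itertools import product


def kmer_composition(seq, k=3, alphabet="ACGU"):
    """Return the relative frequency of every k-mer over the alphabet.

    The vector has len(alphabet) ** k entries in lexicographic order of
    the alphabet string. Windows with characters outside the alphabet
    are skipped (an illustrative choice).
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


vector = kmer_composition("ACGUACGU")
print(len(vector), round(sum(vector), 6))
```

The same function covers plain k-mer features for other values of `k`; gapped k-mers would need a different windowing scheme and are not covered here.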
Affiliation(s)
- Jael Sanyanda Wekesa
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China; School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
- Jun Meng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China
- Yushi Luan
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116023, China
5
Liu B, Leng L, Sun X, Wang Y, Ma J, Zhu Y. ECMPride: prediction of human extracellular matrix proteins based on the ideal dataset using hybrid features with domain evidence. PeerJ 2020; 8:e9066. [PMID: 32377454 PMCID: PMC7195829 DOI: 10.7717/peerj.9066] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/10/2019] [Accepted: 04/05/2020] [Indexed: 01/28/2023] Open
Abstract
Extracellular matrix (ECM) proteins play an essential role in various biological processes in multicellular organisms, and their abnormal regulation can lead to many diseases. For large-scale ECM protein identification, especially through proteomic-based techniques, a theoretical reference database of ECM proteins is required. In this study, based on the experimentally verified ECM datasets and by the integration of protein domain features and a machine learning model, we developed ECMPride, a flexible and scalable tool for predicting ECM proteins. ECMPride achieved excellent performance in predicting ECM proteins, with appropriate balanced accuracy and sensitivity, and the performance of ECMPride was shown to be superior to the previously developed tool. A new theoretical dataset of human ECM components was also established by applying ECMPride to all human entries in the SwissProt database, containing a significant number of putative ECM proteins as well as the abundant biological annotations. This dataset might serve as a valuable reference resource for ECM protein identification.
Affiliation(s)
- Binghui Liu
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, China
- Ling Leng
  - Department of Central Laboratory, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, China
- Xuer Sun
  - Tissue Engineering Lab, Institute of Health Service and Transfusion Medicine, Beijing, China
- Yunfang Wang
  - Tissue Engineering Lab, Institute of Health Service and Transfusion Medicine, Beijing, China
- Jie Ma
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, China
- Yunping Zhu
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing, China; Basic Medical School, Anhui Medical University, Anhui, China
6
Prediction of Extracellular Matrix Proteins by Fusing Multiple Feature Information, Elastic Net, and Random Forest Algorithm. Mathematics 2020. [DOI: 10.3390/math8020169] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 12/26/2022]
Abstract
Extracellular matrix (ECM) proteins play an important role in a series of biological processes of cells. The study of ECM proteins helps to further our understanding of their biological functions. We propose ECMP-RF (extracellular matrix proteins prediction by random forest) to predict ECM proteins. First, the features of the protein sequence are extracted by combining encoding based on grouped weight, pseudo amino-acid composition, pseudo position-specific scoring matrix, a local descriptor, and an autocorrelation descriptor. Second, the synthetic minority oversampling technique (SMOTE) algorithm is employed to handle the class-imbalanced data, and the elastic net (EN) is used to reduce the dimension of the feature vectors. Finally, the random forest (RF) classifier is used to predict the ECM proteins. Leave-one-out cross-validation shows that the balanced accuracy of the training and testing datasets is 97.3% and 97.9%, respectively. ECMP-RF performs significantly better than other state-of-the-art predictors.
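The SMOTE step mentioned in this abstract balances the classes by creating synthetic minority samples on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below shows that core idea only; the function name `smote`, the neighbour count `k`, and the fixed seed are illustrative assumptions, and it is not the implementation used by ECMP-RF.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Each synthetic point is base + gap * (neighbour - base), where the
    neighbour is drawn from the k nearest minority samples of the base and
    gap is uniform in [0, 1). Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Rank the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority, n_new=3):
    print(point)
```

Because every synthetic point is a convex combination of two existing samples, it always stays inside the bounding box of the minority class.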
7
Yang R, Zhang C, Gao R, Zhang L, Song Q. Predicting FAD Interacting Residues with Feature Selection and Comprehensive Sequence Descriptors. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:2046-2056. [PMID: 29993986 DOI: 10.1109/tcbb.2018.2824332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The function of a flavoprotein is determined to a great extent by the binding sites on its surface that interact with flavin adenine dinucleotide (FAD). Malfunction or dysregulation of FAD binding leads to a series of diseases. Therefore, accurately identifying FAD interacting residues (FIRs) provides insights into the molecular mechanisms of flavoprotein-related biological processes and disease progression. In this paper, a new computational method is proposed for identifying FIRs from protein sequences. Various sequence-derived discriminative features are explored. We analyze how these features differ between FIRs and non-FIRs. We also investigate the predictive capabilities of both individual features and combinations of features. A relief algorithm followed by incremental feature selection (relief-IFS) is then adopted to search for the optimal features. Finally, a random forest (RF) module is used to predict FIRs based on the optimal features. Using a 5-fold cross-validation test, the proposed method performs well, with a sensitivity of 0.847, a specificity of 0.933, an accuracy of 0.890, and a Matthews correlation coefficient (MCC) of 0.782, thereby outperforming previous methods. These results indicate that our method is relatively successful at predicting FIRs.
8
Ru X, Li L, Wang C. Identification of Phage Viral Proteins With Hybrid Sequence Features. Front Microbiol 2019; 10:507. [PMID: 30972038 PMCID: PMC6443926 DOI: 10.3389/fmicb.2019.00507] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/24/2018] [Accepted: 02/27/2019] [Indexed: 02/01/2023] Open
Abstract
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by a feature representation algorithm based on sequence information (CCPA) with those extracted by a feature representation algorithm based on sequence and structure information. After extracting features, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, with the strongest correlation to the class labels and low redundancy between features. Given the randomness of the samples selected by the random forest classification algorithm and of the features used to produce each node variable, a random forest method is employed with 10-fold cross-validation for bacteriophage protein classification. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
Affiliation(s)
- Xiaoqing Ru
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Lihong Li
  - School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
- Chunyu Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
9
Yang Q, Jia C, Li T. Prediction of aptamer-protein interacting pairs based on sparse autoencoder feature extraction and an ensemble classifier. Math Biosci 2019; 311:103-108. [PMID: 30880100 DOI: 10.1016/j.mbs.2019.01.009] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/10/2018] [Revised: 01/29/2019] [Accepted: 01/29/2019] [Indexed: 10/27/2022]
Abstract
Aptamer-protein interacting pairs play important roles in physiological functions and structural characterization. Identifying aptamer-protein interacting pairs remains challenging and limited, despite the many applications of aptamers. Therefore, it is vital to construct a high-performance prediction model for identifying aptamer-target interacting pairs. In this study, a novel ensemble method is presented to predict aptamer-protein interacting pairs by integrating sequence characteristics derived from aptamers and the target proteins. The features extracted for aptamers were the compositions of amino acids and pseudo K-tuple nucleotides. In addition, a sparse autoencoder was used to characterize features for the target protein sequences. To remove redundant features, gradient boosting decision tree (GBDT) and incremental feature selection (IFS) methods were used to obtain the optimum combination of sequence characteristics. Based on 616 selected features, an ensemble of three support vector machine (SVM) sub-classifiers was used to construct our prediction model. Evaluated on an independent dataset, our predictor obtained an accuracy of 75.7%, a Matthews correlation coefficient of 0.478, and a Youden's Index of 0.538, which were superior to the values reached using other existing predictors. The results show that our model can be used to distinguish novel aptamer-protein interacting pairs and to reveal the interrelation between aptamers and proteins.
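The evaluation measures reported in this and neighbouring abstracts all derive from the four cells of a binary confusion matrix. A minimal sketch of the standard definitions follows; the function name `binary_metrics` and the example counts are hypothetical and are not taken from any of the cited papers.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    youden = sensitivity + specificity - 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden_index": youden,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

Youden's Index and MCC both account for the negative class, which is why they are often reported alongside accuracy on imbalanced datasets.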
Affiliation(s)
- Qing Yang
  - Institute of Environmental Systems Biology, College of Environmental and Engineering, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
- Cangzhi Jia
  - School of Science, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
- Taoying Li
  - Department of Maritime Economics and Management, Dalian Maritime University, No. 1 Linghai Road, Dalian 116026, China
10
McKown EN, DeAguero JL, Canan BD, Kilic A, Zhu Y, Janssen PM, Delfín DA. Impaired adhesion of induced pluripotent stem cell-derived cardiac progenitor cells (iPSC-CPCs) to isolated extracellular matrix from failing hearts. Heliyon 2018; 4:e00870. [PMID: 30364772 PMCID: PMC6197956 DOI: 10.1016/j.heliyon.2018.e00870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2018] [Revised: 09/21/2018] [Accepted: 10/15/2018] [Indexed: 11/18/2022] Open
Abstract
We tested the hypothesis that induced pluripotent stem cell-derived cardiac progenitor cells (iPSC-CPCs) are less able to adhere to the extracellular matrix (ECM) derived from failing human hearts with dilated cardiomyopathy compared to nonfailing human heart ECM. We also hypothesized that morphological development, cell beating rates, and mRNA levels of Nkx2.5 and cardiac troponin T would be altered after culturing the iPSC-CPCs on the failing heart ECM under cardiomyocyte differentiation conditions. We used microscopy to distinguish between adhered and unadhered cells, and to determine morphological development and cell beating. We used qPCR to determine mRNA levels. iPSC-CPCs show a significantly reduced ability to adhere to the ECM of failing hearts and higher expression of Nkx2.5 mRNA. However, morphological development, cell beating rates, and cardiac troponin T levels were not significantly altered in the cells cultured on the failing heart ECM. Our study shows that the failing heart ECM from patients with dilated cardiomyopathy impairs initial iPSC-CPC adhesion and may have a modest effect on the ability of the cells to transdifferentiate into cardiomyocytes.
Affiliation(s)
- Elizabeth N. McKown
  - The University of New Mexico College of Pharmacy, Department of Pharmaceutical Sciences, MSC09 5360, 1 University of New Mexico, Albuquerque, NM 87131, USA
- Joshua L. DeAguero
  - The University of New Mexico College of Pharmacy, Department of Pharmaceutical Sciences, MSC09 5360, 1 University of New Mexico, Albuquerque, NM 87131, USA
- Benjamin D. Canan
  - The Ohio State University College of Medicine, Department of Physiology and Cell Biology and the Davis Heart Lung Research Institute, 200 Hamilton Hall, 1645 Neil Avenue, Columbus, OH 43210, USA
- Ahmet Kilic
  - The Ohio State University College of Medicine, Department of Surgery and the Davis Heart Lung Research Institute, Richard M. Ross Heart Hospital, 452 West 10th Ave., Columbus, OH 43210, USA
- Yiliang Zhu
  - The University of New Mexico School of Medicine, Department of Internal Medicine, MSC10 5550, 1 University of New Mexico, Albuquerque, NM 87131, USA
- Paul M.L. Janssen
  - The Ohio State University College of Medicine, Department of Physiology and Cell Biology and the Davis Heart Lung Research Institute, 200 Hamilton Hall, 1645 Neil Avenue, Columbus, OH 43210, USA
- Dawn A. Delfín
  - The University of New Mexico College of Pharmacy, Department of Pharmaceutical Sciences, MSC09 5360, 1 University of New Mexico, Albuquerque, NM 87131, USA
  - Corresponding author.
11
Zhang W, Yue X, Liu F, Chen Y, Tu S, Zhang X. A unified frame of predicting side effects of drugs by using linear neighborhood similarity. BMC Systems Biology 2017; 11:101. [PMID: 29297371 PMCID: PMC5751767 DOI: 10.1186/s12918-017-0477-2] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Indexed: 01/06/2023]
Abstract
BACKGROUND: Drug side effects are one of the main concerns in drug discovery and have attracted wide attention. Investigating drug side effects is of great importance, and computational prediction can help to guide wet-lab experiments. As far as we know, a great number of computational methods have been proposed for side effect prediction. The assumption that similar drugs may induce the same side effects is usually employed for modeling, and how to calculate the drug-drug similarity is critical in side effect prediction. RESULTS: In this paper, we present a novel measure of drug-drug similarity named "linear neighborhood similarity", which is calculated in a drug feature space by exploring linear neighborhood relationships. Then, we transfer the similarity from the feature space into the side effect space, and predict drug side effects by propagating known side effect information through a similarity-based graph. Under a unified frame based on the linear neighborhood similarity, we propose the method "LNSM" and its extension "LNSM-SMI" to predict side effects of new drugs, and propose the method "LNSM-MSE" to predict unobserved side effects of approved drugs. CONCLUSIONS: We evaluate the performance of LNSM and LNSM-SMI in predicting side effects of new drugs, and evaluate the performance of LNSM-MSE in predicting missing side effects of approved drugs. The results demonstrate that the linear neighborhood similarity can improve the performance of side effect prediction, and that the linear neighborhood similarity-based methods can outperform existing side effect prediction methods. More importantly, the proposed methods can predict side effects of new drugs as well as unobserved side effects of approved drugs under a unified frame.
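Propagating known side effect information through a similarity-based graph can be sketched with the standard iterative label propagation update, F <- alpha * W * F + (1 - alpha) * Y. The abstract does not state the exact update rule or how the linear neighborhood similarity matrix is computed, so the rule, the function name `propagate_labels`, the parameter `alpha`, and the toy matrices below are all assumptions.

```python
def propagate_labels(W, Y, alpha=0.5, iterations=50):
    """Iterative label propagation on a similarity graph.

    W: row-normalized drug-drug similarity matrix (n x n), assumed given.
    Y: known drug/side-effect associations (n x m), 1 = known, 0 = unknown.
    Returns a score matrix F; higher scores suggest likelier associations.
    """
    n, m = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    for _ in range(iterations):
        F = [[alpha * sum(W[i][k] * F[k][j] for k in range(n))
              + (1 - alpha) * Y[i][j]
              for j in range(m)]
             for i in range(n)]
    return F


# Toy chain of three drugs: drug 0 has a known side effect, and drug 1 is
# similar to both neighbours, so it receives a higher score than drug 2.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
Y = [[1.0], [0.0], [0.0]]
for row in propagate_labels(W, Y):
    print(row)
```

With `alpha` below 1 and a row-normalized `W`, the iteration converges to a fixed point, so the number of iterations mainly controls precision.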
Affiliation(s)
- Wen Zhang
  - School of Computer, Wuhan University, Wuhan, 430072, China
- Xiang Yue
  - International School of Software, Wuhan University, Wuhan, 430072, China
- Feng Liu
  - International School of Software, Wuhan University, Wuhan, 430072, China
- Yanlin Chen
  - School of Mathematics and Statistics, Wuhan University, Wuhan, 430072, China
- Shikui Tu
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Xining Zhang
  - School of Computer, Wuhan University, Wuhan, 430072, China
12
Muñoz E, Nováček V, Vandenbussche PY. Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models. Brief Bioinform 2017; 20:190-202. [DOI: 10.1093/bib/bbx099] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 04/12/2017] [Indexed: 01/02/2023] Open
Affiliation(s)
- Emir Muñoz
  - Fujitsu Ireland Ltd., Co. Dublin, Ireland
  - Insight Centre for Data Analytics, NUI Galway, Co. Galway, Ireland
- Vít Nováček
  - Insight Centre for Data Analytics, NUI Galway, Co. Galway, Ireland
13
Zhang L, Zhang C, Gao R, Yang R, Song Q. Sequence Based Prediction of Antioxidant Proteins Using a Classifier Selection Strategy. PLoS One 2016; 11:e0163274. [PMID: 27662651 PMCID: PMC5035026 DOI: 10.1371/journal.pone.0163274] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 04/22/2016] [Accepted: 09/05/2016] [Indexed: 11/28/2022] Open
Abstract
Antioxidant proteins perform significant functions in maintaining oxidation/antioxidation balance and have therapeutic potential for some diseases. Accurate identification of antioxidant proteins could contribute to revealing physiological processes of oxidation/antioxidation balance and developing novel antioxidation-based drugs. In this study, an ensemble method is presented to predict antioxidant proteins with hybrid features, incorporating SSI (Secondary Structure Information), PSSM (Position Specific Scoring Matrix), RSA (Relative Solvent Accessibility), and CTD (Composition, Transition, Distribution). The prediction results of the ensemble predictor are determined by averaging the prediction results of multiple base classifiers. Based on a classifier selection strategy, we obtain an optimal ensemble classifier composed of RF (Random Forest), SMO (Sequential Minimal Optimization), NNA (Nearest Neighbor Algorithm), and J48 with an accuracy of 0.925. A Relief method combined with IFS (Incremental Feature Selection) is adopted to obtain optimal features from the hybrid features. With the optimal features, the ensemble method achieves improved performance with a sensitivity of 0.95, a specificity of 0.93, an accuracy of 0.94, and an MCC (Matthews Correlation Coefficient) of 0.880, far better than the existing method. To evaluate the prediction performance objectively, the proposed method is compared with existing methods on the same independent testing dataset. Encouragingly, our method performs better than previous studies. In addition, our method achieves more balanced performance with a sensitivity of 0.878 and a specificity of 0.860. These results suggest that the proposed ensemble method can be a potential candidate for antioxidant protein prediction. For public access, we develop a user-friendly web server for antioxidant protein identification that is freely accessible at http://antioxidant.weka.cc.
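The averaging rule described in this abstract can be sketched in a few lines. The sketch assumes that each base classifier outputs a probability of the positive class per sample and that a 0.5 threshold turns the average into a label; the function name `ensemble_average`, the threshold, and the example probabilities are hypothetical.

```python
def ensemble_average(prob_lists, threshold=0.5):
    """Average per-sample positive-class probabilities across classifiers.

    prob_lists holds one list per base classifier, all of the same length.
    Returns the averaged probabilities and the thresholded 0/1 labels.
    """
    n_models = len(prob_lists)
    averaged = [sum(sample) / n_models for sample in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical outputs of four base classifiers on three proteins.
base_outputs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.1, 0.5],
    [0.6, 0.3, 0.4],
]
print(ensemble_average(base_outputs))
```

Averaging probabilities rather than hard votes lets a confident base classifier outweigh several uncertain ones, which is one reason it is a common ensemble rule.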
Affiliation(s)
- Lina Zhang
  - School of Control Science and Engineering, Shandong University, Jinan, China
- Chengjin Zhang
  - School of Control Science and Engineering, Shandong University, Jinan, China
  - School of Mechanical, Electrical and Information Engineering, Shandong University at Weihai, China
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jinan, China
- Runtao Yang
  - School of Control Science and Engineering, Shandong University, Jinan, China
- Qing Song
  - School of Electrical Engineering, University of Jinan, Jinan, China
14
Prediction of aptamer-protein interacting pairs using an ensemble classifier in combination with various protein sequence attributes. BMC Bioinformatics 2016; 17:225. [PMID: 27245069 PMCID: PMC4888498 DOI: 10.1186/s12859-016-1087-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 01/07/2016] [Accepted: 05/17/2016] [Indexed: 02/05/2023] Open
Abstract
Background: Aptamer-protein interacting pairs perform a variety of physiological functions and have therapeutic potential in organisms. Rapidly and effectively predicting aptamer-protein interacting pairs is important for designing aptamers that bind to proteins of interest, which will give insight into the mechanisms of aptamer-protein interacting pairs and support the development of aptamer-based therapies. Results: In this study, an ensemble method is presented to predict aptamer-protein interacting pairs with hybrid features. The features for aptamers are extracted from Pseudo K-tuple Nucleotide Composition (PseKNC) while the features for proteins incorporate Discrete Cosine Transformation (DCT), disorder information, and bi-gram Position Specific Scoring Matrix (PSSM). We investigate the predictive capabilities of various feature spaces. The proposed ensemble method obtains the best performance with a Youden's Index of 0.380, using the hybrid feature space of PseKNC, DCT, bi-gram PSSM, and disorder information by 10-fold cross validation. The Relief-Incremental Feature Selection (IFS) method is adopted to obtain the optimal feature set. Based on the optimal feature set, the proposed method achieves a balanced performance with a sensitivity of 0.753 and a specificity of 0.725 on the training dataset, which indicates that this method can solve the imbalanced data problem effectively. To evaluate the prediction performance objectively, an independent testing dataset is used to evaluate the proposed method. Encouragingly, our proposed method performs better than the previous study with a sensitivity of 0.738 and a Youden's Index of 0.451. Conclusions: These results suggest that the proposed method can be a potential candidate for aptamer-protein interacting pair prediction, which may contribute to finding novel aptamer-protein interacting pairs and understanding the relationship between aptamers and proteins.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1087-5) contains supplementary material, which is available to authorized users.
|
15
|
JPPRED: Prediction of Types of J-Proteins from Imbalanced Data Using an Ensemble Learning Method. BioMed Research International 2015; 2015:705156. [PMID: 26587542 PMCID: PMC4637456 DOI: 10.1155/2015/705156] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/03/2015] [Revised: 10/05/2015] [Accepted: 10/11/2015] [Indexed: 11/17/2022]
Abstract
Different types of J-proteins perform distinct functions in chaperone processes and disease development. Accurate identification of J-protein types will provide significant clues to the mechanisms of J-proteins and contribute to drug development. In this study, an ensemble predictor for J-protein prediction called JPPRED is proposed with hybrid features, including split amino acid composition (SAAC), pseudo amino acid composition (PseAAC), and position specific scoring matrix (PSSM). To deal with the imbalanced benchmark dataset, the synthetic minority oversampling technique (SMOTE) and an undersampling technique are applied. The average sensitivity of JPPRED on the individual feature spaces mentioned above lies in the range of 0.744–0.851, indicating the discriminative power of these features. In addition, JPPRED yields its highest average sensitivity, 0.875, using the hybrid feature space of SAAC, PseAAC, and PSSM. Compared with the individual base classifiers, JPPRED obtains more balanced and better performance for each type of J-protein. To evaluate the prediction performance objectively, JPPRED is compared with a previous study. Encouragingly, JPPRED obtains balanced performance for each type of J-protein, which is significantly superior to that of the existing method. It is anticipated that JPPRED can be a potential candidate for J-protein prediction.
|
16
|
Zhang L, Zhang C, Gao R, Yang R. An Ensemble Method to Distinguish Bacteriophage Virion from Non-Virion Proteins Based on Protein Sequence Characteristics. Int J Mol Sci 2015; 16:21734-58. [PMID: 26370987 PMCID: PMC4613277 DOI: 10.3390/ijms160921734] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 06/29/2015] [Revised: 08/16/2015] [Accepted: 08/25/2015] [Indexed: 11/16/2022] Open
Abstract
Bacteriophage virion proteins and non-virion proteins have distinct functions in biological processes, such as specificity determination for host bacteria, bacteriophage replication and transcription. Accurate identification of bacteriophage virion proteins from bacteriophage protein sequences is important for understanding the complex virulence mechanism in host bacteria and the influence of bacteriophages on the development of antibacterial drugs. In this study, an ensemble method for bacteriophage virion protein prediction from bacteriophage protein sequences is put forward with hybrid feature spaces incorporating CTD (composition, transition and distribution), bi-profile Bayes, PseAAC (pseudo-amino acid composition) and PSSM (position-specific scoring matrix). In 10-fold cross-validation on the training dataset, the presented method achieves a satisfactory prediction result with a sensitivity of 0.870, a specificity of 0.830, an accuracy of 0.850 and a Matthews correlation coefficient (MCC) of 0.701. To evaluate the prediction performance objectively, the proposed method is also tested on an independent dataset. Encouragingly, it performs better than previous studies, with a sensitivity of 0.853, a specificity of 0.815, an accuracy of 0.831 and an MCC of 0.662 on the independent testing dataset. These results suggest that the proposed method can be a potential candidate for bacteriophage virion protein prediction, which may provide a useful tool to find novel antibacterial drugs and to understand the relationship between bacteriophages and host bacteria. For the convenience of the vast majority of experimental scientists, a user-friendly and publicly accessible web server for the proposed ensemble method is established.
Affiliation(s)
- Lina Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
- Chengjin Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
- School of Mechanical, Electrical and Information Engineering, Shandong University, Weihai 264209, China.
- Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
- Runtao Yang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
|