51
Luo X, Ju W, Qu M, Gu Y, Chen C, Deng M, Hua XS, Zhang M. CLEAR: Cluster-Enhanced Contrast for Self-Supervised Graph Representation Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; PP:899-912. [PMID: 35675236 DOI: 10.1109/tnnls.2022.3177775] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
This article studies self-supervised graph representation learning, which is critical to various tasks, such as protein property prediction. Existing methods typically aggregate representations of each individual node as graph representations, but fail to comprehensively explore local substructures (i.e., motifs and subgraphs), which also play important roles in many graph mining tasks. In this article, we propose a self-supervised graph representation learning framework named Cluster-Enhanced Contrast (CLEAR) that models the structural semantics of a graph from graph-level and substructure-level granularities, i.e., global semantics and local semantics, respectively. Specifically, we use graph-level augmentation strategies followed by a graph neural network-based encoder to explore global semantics. As for local semantics, we first use graph clustering techniques to partition each whole graph into several subgraphs while preserving as much semantic information as possible. We further employ a self-attention interaction module to aggregate the semantics of all subgraphs into a local-view graph representation. Moreover, we integrate both global semantics and local semantics into a multiview graph contrastive learning framework, enhancing the semantic-discriminative ability of graph representations. Extensive experiments on various real-world benchmarks demonstrate the efficacy of the proposed method over current graph self-supervised representation learning approaches on both graph classification and transfer learning tasks.
52
A Novel Ensemble Learning-Based Computational Method to Predict Protein-Protein Interactions from Protein Primary Sequences. BIOLOGY 2022; 11:biology11050775. [PMID: 35625503 PMCID: PMC9139052 DOI: 10.3390/biology11050775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 05/10/2022] [Accepted: 05/11/2022] [Indexed: 11/16/2022]
Abstract
Simple Summary: Protein–protein interactions (PPIs) play a central role in the evolution and progression of various biological processes. In this article, we constructed a novel ensemble-learning-based model to predict potential PPIs, which only utilized the protein sequence information. The presented method used the Discrete Hilbert transform to extract amino acid sequence information from position-specific scoring matrices. These extracted features were then fed into a rotation forest for training and prediction. When applying our method to three datasets (Yeast, Human, and Oryza sativa) for detecting PPIs, we obtained excellent prediction performance. Furthermore, the comparison results indicated that our computational model is effective and robust in predicting potential PPI pairs.
Abstract: Protein–protein interactions (PPIs) are crucial for understanding cellular processes, including signal cascades, DNA transcription, metabolic cycles, and repair. In the past decade, a multitude of high-throughput methods have been introduced to detect PPIs. However, these techniques are time-consuming, laborious, and always suffer from high false negative rates. Therefore, there is a great need for new computational methods as a supplemental tool for PPI prediction. In this article, we present a novel sequence-based model to predict PPIs that combines the Discrete Hilbert transform (DHT) and Rotation Forest (RoF). The method has three stages: first, each amino acid sequence is transformed into a Position-Specific Scoring Matrix (PSSM), which contains rich information about protein evolution. Then, a 400-dimensional DHT descriptor is constructed for each protein pair. Finally, these feature descriptors are fed to the RoF classifier to identify the potential PPI class. When evaluated on the Yeast, Human, and Oryza sativa PPI datasets, the proposed model yielded prediction accuracies of 91.93, 96.35, and 94.24%, respectively. We also conducted numerous experiments on cross-species PPI datasets, where the method likewise showed strong predictive capacity. To further assess the prediction ability of the proposed approach, we compared RoF with four powerful classifiers: Support Vector Machine (SVM), Random Forest (RF), K-nearest Neighbor (KNN), and AdaBoost. We also compared it with some existing state-of-the-art methods. These comprehensive experimental results further confirm the excellence and feasibility of the proposed approach. In future work, we hope it can serve as a supplemental tool for proteomics analysis.
53
Lin K, Quan X, Jin C, Shi Z, Yang J. An Interpretable Double-Scale Attention Model for Enzyme Protein Class Prediction Based on Transformer Encoders and Multi-Scale Convolutions. Front Genet 2022; 13:885627. [PMID: 35432476 PMCID: PMC9012241 DOI: 10.3389/fgene.2022.885627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Accepted: 03/07/2022] [Indexed: 12/01/2022] Open
Abstract
Background: Classification and annotation of enzyme proteins are fundamental for enzyme research on biological metabolism. Enzyme Commission (EC) numbers provide a standard for hierarchical enzyme class prediction, on which several computational methods have been proposed. However, most of these methods are dependent on prior distribution information and none explicitly quantifies amino-acid-level relations and possible contribution of sub-sequences.
Methods: In this study, we propose a double-scale attention enzyme class prediction model named DAttProt with high reusability and interpretability. DAttProt encodes sequences with self-supervised Transformer encoders in pre-training and gathers local features with multi-scale convolutions in fine-tuning. Specifically, a probabilistic double-scale attention weight matrix is designed to aggregate multi-scale features and positional prediction scores. Finally, a fully connected linear classifier performs the final inference from the aggregated features and prediction scores.
Results: On the DEEPre and ECPred datasets, DAttProt performs competitively with the compared methods on level 0 and outperforms them on deeper task levels, reaching 0.788 accuracy on level 2 of DEEPre and 0.967 macro-F1 on level 1 of ECPred. Moreover, through a case study, we demonstrate that the double-scale attention matrix learns to discover and focus on the positions and scales of bio-functional sub-sequences in the protein.
Conclusion: DAttProt provides an effective and interpretable method for enzyme class prediction. It can predict enzyme protein classes accurately and, furthermore, discover enzymatic functional sub-sequences such as protein motifs from both positional and spatial scales.
Affiliation(s)
- Ken Lin
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Xiongwen Quan
  - College of Artificial Intelligence, Nankai University, Tianjin, China
  - *Correspondence: Xiongwen Quan
- Chen Jin
  - College of Computer Science, Nankai University, Tianjin, China
- Zhuangwei Shi
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Jinglong Yang
  - College of Artificial Intelligence, Nankai University, Tianjin, China
54
Pan J, You ZH, Li LP, Huang WZ, Guo JX, Yu CQ, Wang LP, Zhao ZY. DWPPI: A Deep Learning Approach for Predicting Protein–Protein Interactions in Plants Based on Multi-Source Information With a Large-Scale Biological Network. Front Bioeng Biotechnol 2022; 10:807522. [PMID: 35387292 PMCID: PMC8978800 DOI: 10.3389/fbioe.2022.807522] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/02/2021] [Accepted: 02/25/2022] [Indexed: 12/30/2022] Open
Abstract
The prediction of protein–protein interactions (PPIs) in plants is vital for probing cell function. Although multiple high-throughput approaches in the biological domain have been developed to identify PPIs, these methods become laborious and time-consuming as the complexity of PPI networks increases. Thus, it is essential to develop an effective and feasible computational method for the prediction of PPIs in plants. In this study, we present a network embedding-based method, called DWPPI, for predicting the interactions between different plant proteins based on multi-source information combined with deep neural networks (DNN). The DWPPI model fuses the protein natural language sequence information (attribute information) and protein behavior information to represent plant proteins as feature vectors and finally sends these features to a deep learning–based classifier for prediction. To validate the prediction performance of DWPPI, we evaluated it on three model plant datasets: Arabidopsis thaliana (A. thaliana), maize (Zea mays), and rice (Oryza sativa). The experimental results with fivefold cross-validation demonstrated that DWPPI performs well, with AUC (area under the ROC curve) values of 0.9548, 0.9867, and 0.9213, respectively. To further verify the predictive capacity of DWPPI, we compared it with several state-of-the-art machine learning classifiers. Moreover, case studies were performed with the AC149810.2_FGP003 protein. As a result, 14 of the top 20 PPI pairs identified by DWPPI with the highest scores were confirmed by the literature. These results suggest that the DWPPI model can act as a promising tool for related plant molecular biology.
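The fivefold cross-validation mentioned in this abstract can be illustrated with a minimal sketch. This is not the DWPPI authors' code: the function name `k_fold_indices` and the contiguous, unshuffled split are assumptions chosen for clarity, and in practice the samples would normally be shuffled (and often stratified by class) before splitting.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds.

    Hypothetical helper for illustration; no shuffling or stratification.
    """
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each sample appears in exactly one test fold; a model would be trained on
# `train` and scored (for example by AUC) on `test`, then the scores averaged.
folds = list(k_fold_indices(12, k=5))
```

Averaging the per-fold metric gives a single cross-validated estimate of the kind typically reported alongside such experiments.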
Affiliation(s)
- Jie Pan
  - School of Information Engineering, Xijing University, Xi’an, China
- Zhu-Hong You
  - School of Information Engineering, Xijing University, Xi’an, China
- Li-Ping Li
  - School of Information Engineering, Xijing University, Xi’an, China
  - College of Grassland and Environment Science, Xinjiang Agricultural University, Urumqi, China
  - *Correspondence: Li-Ping Li; Chang-Qing Yu
- Wen-Zhun Huang
  - School of Information Engineering, Xijing University, Xi’an, China
- Jian-Xin Guo
  - School of Information Engineering, Xijing University, Xi’an, China
- Chang-Qing Yu
  - School of Information Engineering, Xijing University, Xi’an, China
  - *Correspondence: Li-Ping Li; Chang-Qing Yu
- Li-Ping Wang
  - School of Information Engineering, Xijing University, Xi’an, China
- Zheng-Yang Zhao
  - School of Information Engineering, Xijing University, Xi’an, China
55
Zhou X, Song H, Li J. Residue-Frustration-Based Prediction of Protein-Protein Interactions Using Machine Learning. J Phys Chem B 2022; 126:1719-1727. [PMID: 35170967 DOI: 10.1021/acs.jpcb.1c10525] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
Abstract
The study of protein-protein interactions (PPIs) is important for understanding the function of proteins. However, it is still a challenge to investigate transient protein-protein interactions experimentally. Hence, the computational prediction of protein-protein interactions is drawing growing attention. Statistics-based features have been widely used in studies of protein structure prediction and protein folding. Due to the scarcity of experimental PPI data, it is difficult to construct a conventional statistical feature for PPI prediction, and the application of statistics-based features is very limited in this field. In this paper, we explored the application of frustration, a statistical potential, in PPI prediction. By comparing the energetic contribution of the extra stabilization energy from a given residue pair in the native protein with the statistics of the energies, we obtained the residue pair's frustration index. By counting the residue pairs with a high frustration index, we then obtained the highly frustrated density, a residue-frustration-based feature that describes the tendency of residues to be involved in PPI. Highly frustrated density, as well as structure-based features, were then used to describe protein residues and combined with a long short-term memory (LSTM) neural network to predict PPI residue pairs. Our model correctly predicted 75% of dimers when only the top 2‰ of residue pairs were selected in each dimer. Our model, which considers statistics-based features, is significantly different from models based on the chemical features of residues. We found that frustration can effectively describe the tendency of residues to be involved in PPI. Frustration-based features can replace chemical features in combination with machine learning and achieve better performance in PPI prediction. This reveals the great potential of statistical potentials such as frustration in PPI prediction.
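The "top 2‰ of residue pairs" criterion in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the helper names `top_per_mille` and `dimer_is_hit` are hypothetical, the toy scores are made up, and counting a dimer as correctly predicted when at least one selected pair is a true interface pair is an assumption, since the abstract does not define the criterion.

```python
import math


def top_per_mille(scores, per_mille=2.0):
    """Return the highest-scoring residue pairs, keeping the top `per_mille` per
    thousand of all candidate pairs (at least one pair)."""
    k = max(1, math.ceil(len(scores) * per_mille / 1000.0))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def dimer_is_hit(scores, true_pairs, per_mille=2.0):
    """Assumed criterion: a dimer is a hit if any selected pair is a true pair."""
    return any(pair in true_pairs for pair in top_per_mille(scores, per_mille))


# Toy dimer: 40 x 25 = 1000 candidate residue pairs with made-up scores that
# peak at the pair (10, 5); 2 per mille of 1000 pairs keeps two pairs.
scores = {
    (i, j): 1.0 / (1 + abs(i - 10) + abs(j - 5))
    for i in range(40)
    for j in range(25)
}
selected = top_per_mille(scores)
```

Under this reading, the reported 75% would be the fraction of dimers for which `dimer_is_hit` returns true.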
Affiliation(s)
- Xiaozhou Zhou
  - Zhejiang Province Key Laboratory of Quantum Technology and Device, Institute of Quantitative Biology, Department of Physics, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Haoyu Song
  - Zhejiang Province Key Laboratory of Quantum Technology and Device, Institute of Quantitative Biology, Department of Physics, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Jingyuan Li
  - Zhejiang Province Key Laboratory of Quantum Technology and Device, Institute of Quantitative Biology, Department of Physics, Zhejiang University, Hangzhou 310027, Zhejiang, China
56
Liu Z, Ren Z, Yan L, Li F. DeepLRR: An Online Webserver for Leucine-Rich-Repeat Containing Protein Characterization Based on Deep Learning. PLANTS (BASEL, SWITZERLAND) 2022; 11:plants11010136. [PMID: 35009139 PMCID: PMC8796025 DOI: 10.3390/plants11010136] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/29/2021] [Revised: 12/31/2021] [Accepted: 01/01/2022] [Indexed: 05/26/2023]
Abstract
Members of the leucine-rich repeat (LRR) superfamily play critical roles in multiple biological processes. As the LRR unit sequence is highly variable, accurately predicting the number and location of LRR units in proteins is a highly challenging task in the field of bioinformatics. Existing methods still need to be improved, especially similarity-based methods. We introduce our DeepLRR method, based on a convolutional neural network (CNN) model and LRR features, to predict the number and location of LRR units in proteins. We compared DeepLRR with six existing methods using a dataset containing 572 LRR proteins, and it outperformed all of them in terms of overall F1 score. In addition, DeepLRR integrates the identification of plant disease-resistance proteins (NLR, LRR-RLK, LRR-RLP) and non-canonical domains. With DeepLRR, 223, 191 and 183 LRR-RLK genes in the Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa ssp. Japonica) and tomato (Solanum lycopersicum) genomes were re-annotated, respectively. Chromosome mapping and gene cluster analysis revealed that 24.2% (54/223), 29.8% (57/191) and 16.9% (31/183) of LRR-RLK genes formed gene cluster structures in Arabidopsis, rice and tomato, respectively. Finally, we explored the evolutionary relationship and domain composition of LRR-RLK genes in each plant and the distributions of known receptor and co-receptor pairs. This provides a new perspective for the identification of potential receptors and co-receptors.
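The clustered-gene percentages quoted in this abstract follow directly from the counts given in parentheses. A short arithmetic check, using only those counts (the variable and function names are illustrative):

```python
# (clustered LRR-RLK genes, re-annotated LRR-RLK genes) as quoted in the abstract.
clustered = {
    "Arabidopsis": (54, 223),
    "rice": (57, 191),
    "tomato": (31, 183),
}


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


fractions = {species: percent(*counts) for species, counts in clustered.items()}
```

The three values reproduce the 24.2%, 29.8% and 16.9% stated in the text.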
Affiliation(s)
- Zhenya Liu
  - Key Lab of Horticultural Plant Biology (MOE), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
- Zirui Ren
  - College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Lunyi Yan
  - College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Feng Li
  - Key Lab of Horticultural Plant Biology (MOE), College of Horticulture and Forestry Sciences, Huazhong Agricultural University, Wuhan 430070, China
57
Wang L, Zhong C. gGATLDA: lncRNA-disease association prediction based on graph-level graph attention network. BMC Bioinformatics 2022; 23:11. [PMID: 34983363 PMCID: PMC8729153 DOI: 10.1186/s12859-021-04548-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 09/18/2021] [Accepted: 12/21/2021] [Indexed: 01/20/2023] Open
Abstract
Background: Long non-coding RNAs (lncRNAs) are related to human diseases by regulating gene expression. Identifying lncRNA-disease associations (LDAs) will contribute to the diagnosis, treatment, and prognosis of diseases. However, the identification of LDAs by biological experiments is time-consuming, costly and inefficient. Therefore, the development of efficient and high-accuracy computational methods for predicting LDAs is of great significance.
Results: In this paper, we propose a novel computational method (gGATLDA) to predict LDAs based on a graph-level graph attention network. Firstly, we extract the enclosing subgraph of each lncRNA-disease pair. Secondly, we construct the feature vectors by integrating lncRNA similarity and disease similarity as node attributes in the subgraphs. Finally, we train a graph neural network (GNN) model by feeding it the subgraphs and feature vectors, and use the trained GNN model to predict potential lncRNA-disease association scores. The experimental results show that our method achieves a higher area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPR), accuracy and F1-score than the state-of-the-art methods in fivefold cross-validation. Case studies show that our method can effectively identify lncRNAs associated with breast cancer, gastric cancer, prostate cancer, and renal cancer.
Conclusion: The experimental results indicate that our method is a useful approach for predicting potential LDAs.
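The first step described in this abstract, extracting the enclosing subgraph of a lncRNA-disease pair, can be sketched in plain Python. This is a generic illustration and not the gGATLDA implementation: the function names, the default of one hop, the toy edge list, and the choice to hide the target link itself (a common convention in subgraph-based link prediction) are all assumptions.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from (lncRNA, disease) association pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def enclosing_subgraph(adj, lnc, dis, hops=1):
    """Nodes within `hops` of either endpoint and the edges among them.

    The edge between `lnc` and `dis` is removed (assumed convention) so that a
    model scoring the pair cannot simply read the answer from the subgraph.
    """
    nodes = {lnc, dis}
    frontier = {lnc, dis}
    for _ in range(hops):
        frontier = {n for f in frontier for n in adj.get(f, ())} - nodes
        nodes |= frontier
    edges = {
        tuple(sorted((u, v)))
        for u in nodes
        for v in adj.get(u, ())
        if v in nodes
    }
    edges.discard(tuple(sorted((lnc, dis))))
    return nodes, edges


# Hypothetical toy association network.
adj = build_adjacency(
    [("lncA", "d1"), ("lncA", "d2"), ("lncB", "d1"), ("lncC", "d3")]
)
nodes, edges = enclosing_subgraph(adj, "lncA", "d1", hops=1)
```

In the pipeline the abstract describes, each such subgraph would then be given node attributes built from lncRNA and disease similarities before being passed to the graph neural network.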
Affiliation(s)
- Li Wang
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
  - School of Computer, Electronics and Information, Guangxi University, Nanning, China
- Cheng Zhong
  - School of Computer, Electronics and Information, Guangxi University, Nanning, China
  - Key Laboratory of Parallel and Distributed Computing in Guangxi Colleges and Universities, Guangxi University, Nanning, China
58
From complete cross-docking to partners identification and binding sites predictions. PLoS Comput Biol 2022; 18:e1009825. [PMID: 35089918 PMCID: PMC8827487 DOI: 10.1371/journal.pcbi.1009825] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/29/2021] [Revised: 02/09/2022] [Accepted: 01/11/2022] [Indexed: 11/19/2022] Open
Abstract
Proteins ensure their biological functions by interacting with each other. Hence, characterising protein interactions is fundamental for our understanding of the cellular machinery, and for improving medicine and bioengineering. Over the past years, a large body of experimental data has been accumulated on who interacts with whom and in what manner. However, these data are highly heterogeneous and sometimes contradictory, noisy, and biased. Ab initio methods provide a means to a "blind" protein-protein interaction network reconstruction. Here, we report on a molecular cross-docking-based approach for the identification of protein partners. The docking algorithm uses a coarse-grained representation of the protein structures and treats them as rigid bodies. We applied the approach to a few hundred proteins, in their unbound conformations, and we systematically investigated the influence of several key ingredients, such as the size and quality of the interfaces, and the scoring function. We achieved significant improvement compared to previous works, and a very high discriminative power on some specific functional classes. We provide a readout of the contributions of shape and physico-chemical complementarity, interface matching, and specificity, in the predictions. In addition, we assessed the ability of the approach to account for multiple usages of protein surfaces, and we compared it with a sequence-based deep learning method. This work may contribute to guiding the exploitation of the large amounts of protein structural models now available toward the discovery of unexpected partners and their complex structure characterisation.
59
Wu Y, Zeng M, Fei Z, Yu Y, Wu FX, Li M. KAICD: A knowledge attention-based deep learning framework for automatic ICD coding. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2020.05.115] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/19/2022]
60
Tang M, Wu L, Yu X, Chu Z, Jin S, Liu J. Prediction of Protein-Protein Interaction Sites Based on Stratified Attentional Mechanisms. Front Genet 2021; 12:784863. [PMID: 34880910 PMCID: PMC8647646 DOI: 10.3389/fgene.2021.784863] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/28/2021] [Accepted: 10/08/2021] [Indexed: 11/19/2022] Open
Abstract
Proteins are the basic substances that carry out human life activities, and they often perform their biological functions, such as cell transmission and signal transduction, through interactions with other biological macromolecules. Predicting the interaction sites between proteins can deepen the understanding of the principles of protein interactions, but traditional experimental methods are time-consuming and labor-intensive. In this study, a new hierarchical attention network, named HANPPIS, is proposed to predict protein–protein interaction (PPI) sites; it incorporates six effective features: protein sequence, position-specific scoring matrix (PSSM), secondary structure, pre-training vector, hydrophilicity, and amino acid position. Experiments showed that our model obtained effective results that were better than those of existing advanced computational methods. More importantly, we used a double-layer attention mechanism to improve the interpretability of the model and, to a certain extent, address the “black box” problem of deep neural networks, which can serve as a reference for localizing sites at the biological level.
Affiliation(s)
- Minli Tang
  - Department of Computer Science and Technology, Xiamen University, Xiamen, China
  - School of Big Data Engineering, Kaili University, Kaili, China
- Longxin Wu
  - Department of Computer Science and Technology, Xiamen University, Xiamen, China
- Xinyu Yu
  - Department of Computer Science and Technology, Xiamen University, Xiamen, China
- Zhaoqi Chu
  - Department of Instrumental and Electrical Engineering, School of Aerospace Engineering, Xiamen University, Xiamen, China
- Shuting Jin
  - Department of Computer Science and Technology, Xiamen University, Xiamen, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
- Juan Liu
  - Department of Instrumental and Electrical Engineering, School of Aerospace Engineering, Xiamen University, Xiamen, China
61
Zhang F, Song H, Zeng M, Wu FX, Li Y, Pan Y, Li M. A Deep Learning Framework for Gene Ontology Annotations With Sequence- and Network-Based Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2208-2217. [PMID: 31985440 DOI: 10.1109/tcbb.2020.2968882] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 06/10/2023]
Abstract
Knowledge of protein functions plays an important role in biology and medicine. With the rapid development of high-throughput technologies, a huge number of proteins have been discovered. However, a great number of proteins remain without functional annotations. A protein usually has multiple functions, and some functions or biological processes require interactions among several proteins. Additionally, Gene Ontology provides a useful classification for protein functions and contains more than 40,000 terms. We propose a deep learning framework called DeepGOA to predict protein functions from protein sequences and protein-protein interaction (PPI) networks. For protein sequences, we extract two types of information: sequence semantic information and subsequence-based features. We use the word2vec technique to numerically represent protein sequences, and utilize a bidirectional long short-term memory network (Bi-LSTM) and a multi-scale convolutional neural network (multi-scale CNN) to obtain the global and local semantic features of protein sequences, respectively. Additionally, we use the InterPro tool to scan protein sequences to extract subsequence-based information, such as domains and motifs. This information is then fed into a neural network to generate high-quality features. For the PPI network, the DeepWalk algorithm is applied to generate network embeddings. Then the two types of features are concatenated to predict protein functions. To evaluate the performance of DeepGOA, several different evaluation methods and metrics are utilized. The experimental results show that DeepGOA outperforms DeepGO and BLAST.
62
Wang P, Zhang G, Yu ZG, Huang G. A Deep Learning and XGBoost-Based Method for Predicting Protein-Protein Interaction Sites. Front Genet 2021; 12:752732. [PMID: 34764983 PMCID: PMC8576272 DOI: 10.3389/fgene.2021.752732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Accepted: 09/20/2021] [Indexed: 11/29/2022] Open
Abstract
Knowledge about protein-protein interactions is beneficial in understanding cellular mechanisms. Protein-protein interactions are usually determined according to their protein-protein interaction sites. Due to the limitations of current techniques, it is still a challenging task to detect protein-protein interaction sites. In this article, we presented a method based on deep learning and XGBoost (called DeepPPISP-XGB) for predicting protein-protein interaction sites. The deep learning model served as a feature extractor to remove redundant information from protein sequences. The Extreme Gradient Boosting algorithm was used to construct a classifier for predicting protein-protein interaction sites. The DeepPPISP-XGB achieved the following results: area under the receiver operating characteristic curve of 0.681, a recall of 0.624, and area under the precision-recall curve of 0.339, being competitive with the state-of-the-art methods. We also validated the positive role of global features in predicting protein-protein interaction sites.
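The metrics reported in this abstract (area under the ROC curve and recall) have simple definitions that a short sketch can make concrete. The code below is a plain-Python illustration on made-up labels and scores, not the DeepPPISP-XGB evaluation code; the function names and the 0.5 decision threshold are assumptions.

```python
def recall(y_true, y_pred):
    """Fraction of true interaction sites that are predicted as sites."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive outscores a
    randomly chosen negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up per-residue labels (1 = interaction site) and classifier scores.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

auc_value = roc_auc(y_true, scores)
recall_value = recall(y_true, y_pred)
```

The pairwise form of AUC is quadratic in the number of residues, so it suits small illustrations; rank-based formulas give the same value more efficiently on full datasets.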
Affiliation(s)
- Pan Wang
  - School of Electrical Engineering, Shaoyang University, Shaoyang, China
- Guiyang Zhang
  - School of Electrical Engineering, Shaoyang University, Shaoyang, China
- Zu-Guo Yu
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, China
- Guohua Huang
  - School of Electrical Engineering, Shaoyang University, Shaoyang, China
63
Pan J, Li LP, You ZH, Yu CQ, Ren ZH, Guan YJ. Prediction of Protein-Protein Interactions in Arabidopsis, Maize, and Rice by Combining Deep Neural Network With Discrete Hilbert Transform. Front Genet 2021; 12:745228. [PMID: 34616437 PMCID: PMC8488469 DOI: 10.3389/fgene.2021.745228] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/21/2021] [Accepted: 08/18/2021] [Indexed: 11/21/2022] Open
Abstract
Protein-protein interactions (PPIs) in plants play an essential role in the regulation of biological processes. However, traditional experimental methods are expensive, time-consuming, and need sophisticated technical equipment. These drawbacks motivated the development of novel computational approaches to predict PPIs in plants. In this article, a new deep learning framework, which combines the discrete Hilbert transform (DHT) with deep neural networks (DNN), is presented to predict PPIs in plants. More specifically, plant protein sequences were first transformed into a position-specific scoring matrix (PSSM). Then, DHT was employed to capture features from the PSSM. To improve the prediction accuracy, we used the singular value decomposition algorithm to decrease noise and reduce the dimensions of the feature descriptors. Finally, these feature vectors were fed into the DNN for training and prediction. When evaluating our method on three plant PPI datasets (Arabidopsis thaliana, maize, and rice), we achieved good predictive performance, with average area under the receiver operating characteristic curve values of 0.8369, 0.9466, and 0.9440, respectively. To fully verify the predictive ability of our method, we compared it with different feature descriptors and machine learning classifiers. Moreover, to further demonstrate the generality of our approach, we also tested it on the yeast and human PPI datasets. Experimental results indicated that our method is an efficient and promising computational model for predicting potential interacting plant protein pairs.
Affiliation(s)
- Jie Pan
  - School of Information Engineering, Xijing University, Xi’an, China
- Li-Ping Li
  - School of Information Engineering, Xijing University, Xi’an, China
64
Mahdipour E, Ghasemzadeh M. The protein-protein interaction network alignment using recurrent neural network. Med Biol Eng Comput 2021; 59:2263-2286. [PMID: 34529185 DOI: 10.1007/s11517-021-02428-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/27/2021] [Accepted: 08/05/2021] [Indexed: 11/29/2022]
Abstract
The main challenge of biological network alignment is that the problem of finding the alignments in two graphs is NP-hard. The discovery of protein-protein interaction (PPI) networks is of great importance in bioinformatics due to their utilization in identifying cellular pathways, finding new medicines, and disease recognition. In this regard, we describe the network alignment method in the form of a classification problem for the very first time and introduce a deep network that finds the alignment of nodes present in the two networks. We call this method RENA, which means Network Alignment using REcurrent neural network. The proposed solution consists of three phases. In the first phase, we obtain the sequence and topological similarities from the networks' structure. In the second phase, the dataset needed to transform the problem into a classification problem is created from the obtained features. In the third phase, we predict the nodes' alignment between two networks using deep learning. We used the BioGRID dataset for RENA evaluation. The RENA method is compared with three classification approaches: support vector machine, K-nearest neighbors, and linear discriminant analysis. The experimental results demonstrate the efficiency of the RENA method and 100% accuracy in PPI network alignment prediction.
Affiliation(s)
- Elham Mahdipour
- Computer Engineering Department at Khavaran Institute of Higher Education, Mashhad, Iran.
65
Bouvier B. Protein-Protein Interface Topology as a Predictor of Secondary Structure and Molecular Function Using Convolutional Deep Learning. J Chem Inf Model 2021; 61:3292-3303. [PMID: 34225449 DOI: 10.1021/acs.jcim.1c00644] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/20/2023]
Abstract
To power the specific recognition and binding of protein partners into functional complexes, a wealth of information about the structure and function of the partners is necessarily encoded into the global shape of protein-protein interfaces and their local topological features. To identify whether this is the case, this study uses convolutional deep learning methods (typically leveraged for 2D image recognition) on 3D voxel representations of protein-protein interfaces colored by burial depth. A novel two-stage network fed with voxelizations of each interface at two distinct resolutions achieves balance between performance and computational cost. From the shape of the interfaces, the network tries to predict the presence of secondary structure motifs at the interface and the molecular function of the corresponding complex. Secondary structure and certain classes of function are found to be very well predicted, validating the hypothesis that interface shape is a conveyor of higher-level information. Interface patterns triggering the recognition of specific classes are also identified and described.
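The idea of voxelizing one interface at two distinct resolutions can be illustrated with a minimal Python sketch. The coordinates, voxel sizes, and the `voxelize` helper are hypothetical and chosen only for illustration; the burial-depth coloring and the network itself are not modeled here.

```python
from collections import Counter


def voxelize(points, voxel_size):
    """Map 3D points to integer voxel indices and count the points per voxel."""
    grid = Counter()
    for x, y, z in points:
        index = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[index] += 1
    return grid


# Hypothetical interface atom coordinates in angstroms (illustrative only).
atoms = [(0.5, 0.5, 0.5), (1.5, 0.2, 0.1), (3.9, 3.9, 3.9), (4.1, 0.0, 0.0)]

# The same interface at a coarse and a fine resolution (assumed voxel sizes).
coarse = voxelize(atoms, voxel_size=4.0)
fine = voxelize(atoms, voxel_size=1.0)

print(len(coarse), "occupied coarse voxels;", len(fine), "occupied fine voxels")
```

The coarse grid is cheap but merges nearby atoms into one voxel, while the fine grid keeps them apart at a higher cost, which is the trade-off the abstract refers to.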
Affiliation(s)
- Benjamin Bouvier
- Laboratoire de Glycochimie, des Antimicrobiens et des Agroressources, CNRS UMR7378/Université de Picardie Jules Verne, 10 rue Baudelocque, 80039 Amiens Cedex, France
66
Wang Y, Li Z, Zhang Y, Ma Y, Huang Q, Chen X, Dai Z, Zou X. Performance improvement for a 2D convolutional neural network by using SSC encoding on protein-protein interaction tasks. BMC Bioinformatics 2021; 22:184. [PMID: 33845759 PMCID: PMC8042949 DOI: 10.1186/s12859-021-04111-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/22/2020] [Accepted: 03/30/2021] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, laborious, and time-consuming to determine protein-protein interactions (PPIs), and there is a strong demand for effective bioinformatics approaches to identify potential PPIs. Considering the large amount of PPI data, a high-performance processor can be utilized to enhance the capability of the deep learning method and directly predict protein sequences. RESULTS We propose the Sequence-Statistics-Content protein sequence encoding format (SSC) based on information extraction from the original sequence for further performance improvement of the convolutional neural network. The original protein sequences are encoded in the three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which can increase the unique sequence features to enhance the performance of the deep learning model. On predicting protein-protein interaction tasks, the results using the 2D convolutional neural network (2D CNN) with the SSC encoding method are better than those of the 1D CNN with one hot encoding. The independent validation of new interactions from the HIPPIE database (version 2.1 published on July 18, 2017) and the validation of directly predicted results by applying a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model. CONCLUSION The proposed protein sequence encoding method is efficient at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods. 
Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/.
Affiliation(s)
- Yang Wang
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Zhanchao Li
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Yanfei Zhang
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Yingjun Ma
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Qixing Huang
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Xingyu Chen
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
- Zong Dai
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Research Institute of Sun Yat-Sen University in Shenzhen, Shenzhen, 518000, People's Republic of China
- Xiaoyong Zou
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Research Institute of Sun Yat-Sen University in Shenzhen, Shenzhen, 518000, People's Republic of China
67
iEnhancer-GAN: A Deep Learning Framework in Combination with Word Embedding and Sequence Generative Adversarial Net to Identify Enhancers and Their Strength. Int J Mol Sci 2021; 22:ijms22073589. [PMID: 33808317 PMCID: PMC8036415 DOI: 10.3390/ijms22073589] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/03/2021] [Revised: 03/10/2021] [Accepted: 03/24/2021] [Indexed: 12/13/2022] Open
Abstract
As critical components of DNA, enhancers can efficiently and specifically manipulate the spatial and temporal regulation of gene transcription. Malfunction or dysregulation of enhancers is implicated in a range of human pathologies. Therefore, identifying enhancers and their strength may provide insights into the molecular mechanisms of gene transcription and facilitate the discovery of candidate drug targets. In this paper, a new predictor of enhancers and their strength, iEnhancer-GAN, is proposed based on a deep learning framework in combination with word embedding and a sequence generative adversarial net (Seq-GAN). Considering the relatively small training dataset, the Seq-GAN is designed to generate artificial sequences. Given that each functional element in DNA sequences is analogous to a “word” in linguistics, word segmentation methods are proposed to divide DNA sequences into “words”, and the skip-gram model is employed to transform the “words” into numerical vectors. In view of its powerful ability to extract high-level abstract features, a convolutional neural network (CNN) architecture is constructed to perform the identification tasks, and the word vectors of DNA sequences are vertically concatenated to form the embedding matrices used as the input of the CNN. Experimental results demonstrate the effectiveness of the Seq-GAN in expanding the training dataset, the possibility of applying word segmentation methods to extract “words” from DNA sequences, the feasibility of implementing the skip-gram model to encode DNA sequences, and the powerful prediction ability of the CNN. Compared with other state-of-the-art methods on the training dataset and independent test dataset, the proposed method achieves a significantly improved overall performance. It is anticipated that the proposed method will help advance enhancer-related fields.
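The step of dividing DNA sequences into “words” and pairing them for a skip-gram model can be sketched as follows. This is a minimal illustration that assumes fixed-length k-mers as the segmentation scheme; the word length, stride, window size, and both helper functions are hypothetical and are not taken from the paper.

```python
def kmer_words(sequence, k=3, stride=1):
    """Split a DNA sequence into k-length 'words' taken every `stride` bases."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def skipgram_pairs(words, window=1):
    """List (center, context) pairs, the training examples a skip-gram model consumes."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


overlapping = kmer_words("ACGTAC", k=3, stride=1)      # sliding window
non_overlapping = kmer_words("ACGTAC", k=3, stride=3)  # disjoint words
pairs = skipgram_pairs(non_overlapping, window=1)

print(overlapping)
print(non_overlapping)
print(pairs)
```

Overlapping words yield more training pairs from a short sequence, while disjoint words give a shorter embedding matrix; which choice suits a given dataset is an empirical question.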
68
Ding L, Xie S, Zhang S, Shen H, Zhong H, Li D, Shi P, Chi L, Zhang Q. Delayed Comparison and Apriori Algorithm (DCAA): A Tool for Discovering Protein-Protein Interactions From Time-Series Phosphoproteomic Data. Front Mol Biosci 2020; 7:606570. [PMID: 33363212 PMCID: PMC7758479 DOI: 10.3389/fmolb.2020.606570] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/15/2020] [Accepted: 11/02/2020] [Indexed: 01/04/2023] Open
Abstract
Analysis of high-throughput omics data is one of the most important approaches for obtaining information regarding interactions between proteins/genes. Time-series omics data are a series of omics data points indexed in time order and normally contain more abundant information about the interactions between biological macromolecules than static omics data. In addition, phosphorylation is a key posttranslational modification (PTM) that is indicative of possible protein function changes in cellular processes. Analysis of time-series phosphoproteomic data should provide more meaningful information about protein interactions. However, although many algorithms, databases, and websites have been developed to analyze omics data, the tools dedicated to discovering molecular interactions from time-series omics data, especially from time-series phosphoproteomic data, are still scarce. Moreover, most reported tools ignore the lag between functional alterations and the corresponding changes in protein synthesis/PTM and are highly dependent on previous knowledge, resulting in high false-positive rates and difficulties in finding newly discovered protein–protein interactions (PPIs). Therefore, in the present study, we developed a new method to discover protein–protein interactions with the delayed comparison and Apriori algorithm (DCAA) to address the aforementioned problems. DCAA is based on the idea that there is a lag between functional alterations and the corresponding changes in protein synthesis/PTM. The Apriori algorithm was used to mine association rules from the relationships between items in a dataset and find PPIs based on time-series phosphoproteomic data. The advantage of DCAA is that it does not rely on previous knowledge and the PPI database. The analysis of actual time-series phosphoproteomic data showed that more than 68% of the protein interactions/regulatory relationships predicted by DCAA were accurate. 
As an analytical tool for PPIs that does not rely on a priori knowledge, DCAA should be useful to predict PPIs from time-series omics data, and this approach is not limited to phosphoproteomic data.
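The combination of a time lag with association-rule statistics can be illustrated with a small Python sketch. It is not the DCAA implementation: the up/down labeling, the single rule form, the lag of one step, and the toy abundance values are all assumptions made for illustration.

```python
def changes(series):
    """Label each step of a time series as 'up', 'down', or 'flat'."""
    labels = []
    for prev, curr in zip(series, series[1:]):
        labels.append("up" if curr > prev else "down" if curr < prev else "flat")
    return labels


def lagged_rule(a, b, lag=1, event="up"):
    """Support and confidence of the rule: a shows `event` at step t -> b shows `event` at step t + lag."""
    a_changes, b_changes = changes(a), changes(b)
    pairs = list(zip(a_changes, b_changes[lag:]))  # delayed comparison
    if not pairs:
        return 0.0, 0.0
    antecedent = sum(1 for x, _ in pairs if x == event)
    both = sum(1 for x, y in pairs if x == event and y == event)
    support = both / len(pairs)
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence


# Toy phosphorylation levels (made up): protein_b follows protein_a one step later.
protein_a = [1.0, 2.0, 1.5, 2.5, 2.0, 3.0]
protein_b = [1.0, 1.0, 2.0, 1.5, 2.5, 2.0]

print(lagged_rule(protein_a, protein_b, lag=1))
print(lagged_rule(protein_a, protein_b, lag=0))
```

In this toy data the rule only becomes visible once the comparison is delayed by one step, which is the motivation the abstract gives for accounting for the lag.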
Affiliation(s)
- Lianhong Ding
- School of Information, Beijing Wuzi University, Beijing, China
- Shaoshuai Xie
- National Glycoengineering Research Center, Shandong University, Qingdao, China
- Shucui Zhang
- The Key Laboratory of Cardiovascular Remodeling and Function Research, Chinese Ministry of Education, Chinese National Health Commission and Chinese Academy of Medical Sciences, Qilu Hospital of Shandong University, Jinan, China
- Hangyu Shen
- National Center for Materials Service Safety, University of Science and Technology Beijing, Beijing, China
- Huaqiang Zhong
- National Center for Materials Service Safety, University of Science and Technology Beijing, Beijing, China
- Daoyuan Li
- National Glycoengineering Research Center, Shandong University, Qingdao, China
- Peng Shi
- National Center for Materials Service Safety, University of Science and Technology Beijing, Beijing, China
- Lianli Chi
- National Glycoengineering Research Center, Shandong University, Qingdao, China
- Qunye Zhang
- The Key Laboratory of Cardiovascular Remodeling and Function Research, Chinese Ministry of Education, Chinese National Health Commission and Chinese Academy of Medical Sciences, Qilu Hospital of Shandong University, Jinan, China
69
Timmons PB, Hewage CM. ENNAACT is a novel tool which employs neural networks for anticancer activity classification for therapeutic peptides. Biomed Pharmacother 2020; 133:111051. [PMID: 33254015 DOI: 10.1016/j.biopha.2020.111051] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 08/28/2020] [Revised: 10/08/2020] [Accepted: 11/19/2020] [Indexed: 12/12/2022] Open
Abstract
The prevalence of cancer as a threat to human life, responsible for 9.6 million deaths worldwide in 2018, motivates the search for new anticancer agents. While many options are currently available for treatment, these are often expensive and impact the human body unfavourably. Anticancer peptides represent a promising emerging field of anticancer therapeutics, which are characterized by a favourable toxicity profile. The development of accurate in silico methods for anticancer peptide prediction is of paramount importance, as the amount of available sequence data is growing each year. This study leverages advances in machine learning research to produce a novel sequence-based deep neural network classifier for anticancer peptide activity. The classifier achieves performance comparable to the best-in-class, with a cross-validated accuracy of 98.3%, Matthews correlation coefficient of 0.91 and an Area Under the Curve of 0.95. This innovative classifier is available as a web server at https://research.timmons.eu/ennaact, facilitating in silico screening and design of new anticancer peptide chemotherapeutics by the research community.
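Accuracy and the Matthews correlation coefficient (MCC) are reported side by side because they can diverge on imbalanced data. The sketch below computes both from a confusion matrix using the standard definitions; the counts are made up for illustration and are not ENNAACT's results.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 if undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical imbalanced test set: 10 positives, 90 negatives, half the positives missed.
tp, tn, fp, fn = 5, 90, 0, 5

acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy={acc:.3f}, MCC={mcc:.3f}")
```

With these counts the accuracy stays high even though half of the positive class is missed, while the MCC is noticeably lower, which is why both figures are worth reading together.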
Affiliation(s)
- Patrick Brendan Timmons
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
- Chandralal M Hewage
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
70
Guo L, Wang Y, Xu X, Cheng KK, Long Y, Xu J, Li S, Dong J. DeepPSP: A Global-Local Information-Based Deep Neural Network for the Prediction of Protein Phosphorylation Sites. J Proteome Res 2020; 20:346-356. [PMID: 33241931 DOI: 10.1021/acs.jproteome.0c00431] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/17/2022]
Abstract
Identification of phosphorylation sites is an important step in the function study and drug design of proteins. In recent years, there have been increasing applications of computational methods in the identification of phosphorylation sites because of their low cost and high speed. Most of the currently available methods focus on using local information around potential phosphorylation sites for prediction and do not take the global information of the protein sequence into consideration. Here, we demonstrated that the global information of protein sequences may also be critical for phosphorylation site prediction. In this paper, a new deep neural network model, called DeepPSP, was proposed for the prediction of protein phosphorylation sites. In the DeepPSP model, two parallel modules were introduced to extract both local and global features from protein sequences. Two squeeze-and-excitation blocks and one bidirectional long short-term memory block were introduced into each module to capture effective representations of the sequences. Comparative studies were carried out to evaluate the performance of DeepPSP and four other prediction methods using public data sets. The F1-score, area under receiver operating characteristic curves (AUROC), and area under precision-recall curves (AUPRC) of DeepPSP were found to be 0.4819, 0.82, and 0.50, respectively, for S/T general site prediction and 0.4206, 0.73, and 0.39, respectively, for Y general site prediction. Compared with the MusiteDeep method, the F1-score, AUROC, and AUPRC of DeepPSP were found to increase by 8.6, 2.5, and 8.7%, respectively, for S/T general site prediction and by 20.6, 5.8, and 18.2%, respectively, for Y general site prediction. Among the tested methods, the developed DeepPSP method was also found to produce the best results for different kinase-specific site predictions including CDK, mitogen-activated protein kinase, CAMK, AGC, and CMGC. 
Taken together, the developed DeepPSP method may offer a more accurate phosphorylation site prediction by including global information. It may serve as an alternative model with better performance and interpretability for protein phosphorylation site prediction.
Affiliation(s)
- Lei Guo
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
- Yongpei Wang
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
- Xiangnan Xu
- School of Mathematics and Statistics, The University of Sydney, Sydney, New South Wales 2006, Australia
- Kian-Kai Cheng
- Innovation Centre in Agritechnology, Universiti Teknologi Malaysia, Muar, Johor 84600, Malaysia
- Yichi Long
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
- Jingjing Xu
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
- Sanshu Li
- Institute of Genomics, Medical School, Huaqiao University, Xiamen 361021, China
- Jiyang Dong
- Department of Electronic Science, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
71
Bian H, Guo M, Wang J. Recognition of Mitochondrial Proteins in Plasmodium Based on the Tripeptide Composition. Front Cell Dev Biol 2020; 8:578901. [PMID: 33043014 PMCID: PMC7525148 DOI: 10.3389/fcell.2020.578901] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/01/2020] [Accepted: 08/13/2020] [Indexed: 01/31/2023] Open
Abstract
Mitochondria play essential roles in eukaryotic cells, especially in Plasmodium cells. They have several unusual evolutionary and functional features that are vital for disease diagnosis and drug design. Thus, predicting mitochondrial proteins of Plasmodium has become a worthwhile task. However, existing computational methods can only predict mitochondrial proteins of Plasmodium falciparum (P. falciparum for short), and these methods have low accuracy. It is highly desirable to design a classifier with high accuracy for predicting mitochondrial proteins for all Plasmodium species, not only P. falciparum. We proposed a novel method, named PM-OTC, for predicting mitochondrial proteins in Plasmodium. PM-OTC uses the Support Vector Machine (SVM) as the classifier and the selected tripeptide composition as the features. We adopted the 5-fold cross-validation method to train and test PM-OTC. Results demonstrate that PM-OTC achieves an accuracy of 94.91% and that the performance of PM-OTC is superior to that of other methods.
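Tripeptide composition turns a protein sequence into a fixed-length vector of 8000 frequencies, one per possible three-residue word over the 20 standard amino acids. The following is a minimal sketch of that feature extraction only; the feature selection and SVM training used by PM-OTC are not shown, and the example sequence is made up.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20^3 = 8000 possible tripeptides, in a fixed order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def tripeptide_composition(sequence):
    """Return the frequency of each of the 8000 tripeptides in `sequence`."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))
    # Tripeptides containing non-standard residues are ignored.
    total = sum(counts[t] for t in TRIPEPTIDES)
    if total == 0:
        return [0.0] * len(TRIPEPTIDES)
    return [counts[t] / total for t in TRIPEPTIDES]


# Hypothetical short sequence; real proteins are far longer.
features = tripeptide_composition("MKVLAAMKV")
print(len(features), features[TRIPEPTIDES.index("MKV")])
```

Because most of the 8000 entries are zero for any single protein, a selection step that keeps only informative tripeptides, as the abstract mentions, is a natural complement to this encoding.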
Affiliation(s)
- Haodong Bian
- School of Computer Science, Inner Mongolia University, Hohhot, China
- Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China; Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing, China
- Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, China; State Key Laboratories of Reproductive Regulation & Breeding of Grassland Livestock, Hohhot, China
72
Timmons PB, Hewage CM. HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks. Sci Rep 2020; 10:10869. [PMID: 32616760 PMCID: PMC7331684 DOI: 10.1038/s41598-020-67701-3] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 03/03/2020] [Accepted: 06/09/2020] [Indexed: 12/11/2022] Open
Abstract
The growing prevalence of resistance to antibiotics motivates the search for new antibacterial agents. Antimicrobial peptides are a diverse class of well-studied membrane-active peptides which function as part of the innate host defence system, and form a promising avenue in antibiotic drug research. Some antimicrobial peptides exhibit toxicity against eukaryotic membranes, typically characterised by hemolytic activity assays, but currently, the understanding of what differentiates hemolytic and non-hemolytic peptides is limited. This study leverages advances in machine learning research to produce a novel artificial neural network classifier for the prediction of hemolytic activity from a peptide's primary sequence. The classifier achieves best-in-class performance, with cross-validated accuracy of [Formula: see text] and Matthews correlation coefficient of 0.71. This innovative classifier is available as a web server at https://research.timmons.eu/happenn, allowing the research community to utilise it for in silico screening of peptide drug candidates for high therapeutic efficacies.
Affiliation(s)
- Patrick Brendan Timmons
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
- Chandralal M Hewage
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
73
Zeng M, Lu C, Zhang F, Li Y, Wu FX, Li Y, Li M. SDLDA: lncRNA-disease association prediction based on singular value decomposition and deep learning. Methods 2020; 179:73-80. [DOI: 10.1016/j.ymeth.2020.05.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/10/2020] [Revised: 04/24/2020] [Accepted: 05/02/2020] [Indexed: 12/20/2022] Open
74
Deng A, Zhang H, Wang W, Zhang J, Fan D, Chen P, Wang B. Developing Computational Model to Predict Protein-Protein Interaction Sites Based on the XGBoost Algorithm. Int J Mol Sci 2020; 21:E2274. [PMID: 32218345 PMCID: PMC7178137 DOI: 10.3390/ijms21072274] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 02/03/2020] [Revised: 03/10/2020] [Accepted: 03/23/2020] [Indexed: 12/27/2022] Open
Abstract
The study of protein-protein interaction is of great biological significance, and the prediction of protein-protein interaction sites can promote the understanding of cell biological activity and will be helpful for drug development. However, uneven distribution between interaction and non-interaction sites is common because only a small number of protein interactions have been confirmed by experimental techniques, which greatly affects the predictive capability of computational methods. In this work, two imbalanced data processing strategies based on the XGBoost algorithm were proposed to re-balance the original dataset using the inherent relationship between positive and negative samples for the prediction of protein-protein interaction sites. Herein, a feature extraction method was applied to represent the protein interaction sites based on the evolutionary conservation of proteins, and the influence of overlapping regions of positive and negative samples on prediction performance was considered. Our method showed good prediction performance, such as prediction accuracy of 0.807 and MCC of 0.614, on an original dataset with 10,455 surface residues but only 2297 interface residues. Experimental results demonstrated the effectiveness of our XGBoost-based method.
Affiliation(s)
- Aijun Deng
- Key Laboratory of Metallurgical Emission Reduction & Resources Recycling (Anhui University of Technology), Ministry of Education, Ma'anshan 243002, China
- School of Metallurgical Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Department of Engineering, University of Leicester, Leicester LE1 7RH, UK
- Huan Zhang
- School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Wenyan Wang
- School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Jun Zhang
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei 230032, China
- Dingdong Fan
- School of Metallurgical Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Peng Chen
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei 230032, China
- Bing Wang
- Key Laboratory of Metallurgical Emission Reduction & Resources Recycling (Anhui University of Technology), Ministry of Education, Ma'anshan 243002, China
- School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei 230032, China
75
Zeng M, Li M, Wu FX, Li Y, Pan Y. DeepEP: a deep learning framework for identifying essential proteins. BMC Bioinformatics 2019; 20:506. [PMID: 31787076 PMCID: PMC6886168 DOI: 10.1186/s12859-019-3076-y] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 11/20/2022] Open
Abstract
Background Essential proteins are crucial for cellular life, and thus the identification of essential proteins is an important topic and a challenging problem for researchers. Recently, many computational approaches have been proposed to handle this problem. However, traditional centrality methods cannot fully represent the topological features of biological networks. In addition, identifying essential proteins is an imbalanced learning problem, but few current shallow machine learning-based methods are designed to handle the imbalanced characteristics. Results We develop DeepEP based on a deep learning framework that uses the node2vec technique, multi-scale convolutional neural networks and a sampling technique to identify essential proteins. In DeepEP, the node2vec technique is applied to automatically learn topological and semantic features for each protein in the protein-protein interaction (PPI) network. Gene expression profiles are treated as images and multi-scale convolutional neural networks are applied to extract their patterns. In addition, DeepEP uses a sampling method to alleviate the imbalanced characteristics. The sampling method samples the same number of majority and minority samples in a training epoch, so that training is not biased toward either class. The experimental results show that DeepEP outperforms traditional centrality methods. Moreover, DeepEP is better than shallow machine learning-based methods. Detailed analyses show that the dense vectors generated by the node2vec technique contribute substantially to the improved performance. It is clear that the node2vec technique effectively captures the topological and semantic properties of the PPI network. The sampling method also improves the performance of identifying essential proteins. Conclusion We demonstrate that DeepEP improves the prediction performance by integrating multiple deep learning techniques and a sampling method. DeepEP is more effective than existing methods.
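The per-epoch balanced sampling described in the abstract can be sketched in a few lines of Python. This is an illustration under assumptions: the minority class is assumed to be the smaller one and is used in full each epoch, the majority class is subsampled without replacement, and the protein identifiers and the `balanced_epoch` helper are hypothetical.

```python
import random


def balanced_epoch(majority, minority, rng):
    """Draw as many majority samples as there are minority samples, then shuffle."""
    sampled_majority = rng.sample(majority, k=len(minority))
    batch = [(x, 0) for x in sampled_majority] + [(x, 1) for x in minority]
    rng.shuffle(batch)
    return batch


rng = random.Random(0)  # fixed seed so the sketch is reproducible
non_essential = [f"N{i}" for i in range(100)]  # majority class, label 0
essential = [f"E{i}" for i in range(10)]       # minority class, label 1

epoch = balanced_epoch(non_essential, essential, rng)
print(len(epoch), sum(label for _, label in epoch))
```

Calling `balanced_epoch` again for the next epoch draws a different majority subset, so over many epochs the model still sees most of the majority class while each epoch stays balanced.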
Affiliation(s)
- Min Zeng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, People's Republic of China
- Min Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, People's Republic of China
- Fang-Xiang Wu
- Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA
- Yi Pan
- Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA