1
|
Yu Z, Yu J, Wang H, Zhang S, Zhao L, Shi S. PhosAF: An integrated deep learning architecture for predicting protein phosphorylation sites with AlphaFold2 predicted structures. Anal Biochem 2024; 690:115510. [PMID: 38513769 DOI: 10.1016/j.ab.2024.115510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2023] [Revised: 03/14/2024] [Accepted: 03/18/2024] [Indexed: 03/23/2024]
Abstract
Phosphorylation is indispensable in comprehending biological processes, while biological experimental methods for identifying phosphorylation sites are tedious and arduous. With the rapid growth of biotechnology, deep learning methods have made significant progress in site prediction tasks. Nevertheless, most existing predictors only consider protein sequence information, that limits the capture of protein spatial information. Building upon the latest advancement in protein structure prediction by AlphaFold2, a novel integrated deep learning architecture PhosAF is developed to predict phosphorylation sites in human proteins by integrating CMA-Net and MFC-Net, which considers sequence and structure information predicted by AlphaFold2. Here, CMA-Net module is composed of multiple convolutional neural network layers and multi-head attention is appended to obtaining the local and long-term dependencies of sequence features. Meanwhile, the MFC-Net module composed of deep neural network layers is used to capture the complex representations of evolutionary and structure features. Furthermore, different features are combined to predict the final phosphorylation sites. In addition, we put forward a new strategy to construct reliable negative samples via protein secondary structures. Experimental results on independent test data and case study indicate that our model PhosAF surpasses the current most advanced methods in phosphorylation site prediction.
Collapse
Affiliation(s)
- Ziyuan Yu
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
| | - Jialin Yu
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
| | - Hongmei Wang
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
| | - Shuai Zhang
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
| | - Long Zhao
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
| | - Shaoping Shi
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China; Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, 330031, China.
| |
Collapse
|
2
|
Gutierrez CS, Kassim AA, Gutierrez BD, Raines RT. Sitetack: A Deep Learning Model that Improves PTM Prediction by Using Known PTMs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.03.596298. [PMID: 38895359 PMCID: PMC11185516 DOI: 10.1101/2024.06.03.596298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/21/2024]
Abstract
Post-translational modifications (PTMs) increase the diversity of the proteome and are vital to organismal life and therapeutic strategies. Deep learning has been used to predict PTM locations. Still, limitations in datasets and their analyses compromise success. Here we evaluate the use of known PTM sites in prediction via sequence-based deep learning algorithms. Specifically, PTM locations were encoded as a separate amino acid before sequences were encoded via word embedding and passed into a convolutional neural network that predicts the probability of a modification at a given site. Without labeling known PTMs, our model is on par with others. With labeling, however, we improved significantly upon extant models. Moreover, knowing PTM locations can increase the predictability of a different PTM. Our findings highlight the importance of PTMs for the installation of additional PTMs. We anticipate that including known PTM locations will enhance the performance of other proteomic machine learning algorithms.
Collapse
|
3
|
Zhou Z, Yeung W, Soleymani S, Gravel N, Salcedo M, Li S, Kannan N. Using explainable machine learning to uncover the kinase-substrate interaction landscape. Bioinformatics 2024; 40:btae033. [PMID: 38244571 PMCID: PMC10868336 DOI: 10.1093/bioinformatics/btae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2023] [Revised: 12/09/2023] [Accepted: 01/17/2024] [Indexed: 01/22/2024] Open
Abstract
MOTIVATION Phosphorylation, a post-translational modification regulated by protein kinase enzymes, plays an essential role in almost all cellular processes. Understanding how each of the nearly 500 human protein kinases selectively phosphorylates their substrates is a foundational challenge in bioinformatics and cell signaling. Although deep learning models have been a popular means to predict kinase-substrate relationships, existing models often lack interpretability and are trained on datasets skewed toward a subset of well-studied kinases. RESULTS Here we leverage recent peptide library datasets generated to determine substrate specificity profiles of 300 serine/threonine kinases to develop an explainable Transformer model for kinase-peptide interaction prediction. The model, trained solely on primary sequences, achieved state-of-the-art performance. Its unique multitask learning paradigm built within the model enables predictions on virtually any kinase-peptide pair, including predictions on 139 kinases not used in peptide library screens. Furthermore, we employed explainable machine learning methods to elucidate the model's inner workings. Through analysis of learned embeddings at different training stages, we demonstrate that the model employs a unique strategy of substrate prediction considering both substrate motif patterns and kinase evolutionary features. SHapley Additive exPlanation (SHAP) analysis reveals key specificity determining residues in the peptide sequence. Finally, we provide a web interface for predicting kinase-substrate associations for user-defined sequences and a resource for visualizing the learned kinase-substrate associations. AVAILABILITY AND IMPLEMENTATION All code and data are available at https://github.com/esbgkannan/Phosformer-ST. Web server is available at https://phosformer.netlify.app.
Collapse
Affiliation(s)
- Zhongliang Zhou
- School of Computing, University of Georgia, Athens, GA 30602, United States
| | - Wayland Yeung
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
| | - Saber Soleymani
- School of Computing, University of Georgia, Athens, GA 30602, United States
| | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
| | - Mariah Salcedo
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, United States
| | - Sheng Li
- School of Data Science, University of Virginia, Charlottesville, VA 22903, United States
| | - Natarajan Kannan
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, United States
| |
Collapse
|
4
|
Shrestha P, Kandel J, Tayara H, Chong KT. DL-SPhos: Prediction of serine phosphorylation sites using transformer language model. Comput Biol Med 2024; 169:107925. [PMID: 38183701 DOI: 10.1016/j.compbiomed.2024.107925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2023] [Revised: 12/21/2023] [Accepted: 01/01/2024] [Indexed: 01/08/2024]
Abstract
Serine phosphorylation plays a pivotal role in the pathogenesis of various cellular processes and diseases. Roughly 81% of human diseases have links to phosphorylation, and an overwhelming 86.4% of protein phosphorylation takes place at serine residues. In eukaryotes, over a quarter of proteins undergo phosphorylation, with more than half implicated in numerous disorders, notably cancer and reproductive system diseases. This study primarily focuses on serine-phosphorylation-driven pathogenesis and the critical role of conserved motif identification. While numerous techniques exist for predicting serine phosphorylation sites, traditional wet lab experiments are resource-intensive. Our paper introduces a cutting-edge deep learning tool for predicting S phosphorylation sites, integrating explainable AI for motif identification, a transformer language model, and deep neural network components. We trained our model on protein sequences from UniProt, validated it against the dbPTM benchmark dataset, and employed the PTMD dataset to explore motifs related to mammalian disorders. Our results highlight that our model surpasses other deep learning predictors by a significant 3%. Furthermore, we utilized the local interpretable model-agnostic explanations (LIME) approach to shed light on the predictions, emphasizing the amino acid residues crucial for S phosphorylation. Notably, our model also outperformed competitors in kinase-specific serine phosphorylation prediction on benchmark datasets.
Collapse
Affiliation(s)
- Palistha Shrestha
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
| | - Jeevan Kandel
- Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea; Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| |
Collapse
|
5
|
Wu J, Liu B, Zhang J, Wang Z, Li J. DL-PPI: a method on prediction of sequenced protein-protein interaction based on deep learning. BMC Bioinformatics 2023; 24:473. [PMID: 38097937 PMCID: PMC10722729 DOI: 10.1186/s12859-023-05594-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2023] [Accepted: 12/01/2023] [Indexed: 12/17/2023] Open
Abstract
PURPOSE Sequenced Protein-Protein Interaction (PPI) prediction represents a pivotal area of study in biology, playing a crucial role in elucidating the mechanistic underpinnings of diseases and facilitating the design of novel therapeutic interventions. Conventional methods for extracting features through experimental processes have proven to be both costly and exceedingly complex. In light of these challenges, the scientific community has turned to computational approaches, particularly those grounded in deep learning methodologies. Despite the progress achieved by current deep learning technologies, their effectiveness diminishes when applied to larger, unfamiliar datasets. RESULTS In this study, the paper introduces a novel deep learning framework, termed DL-PPI, for predicting PPIs based on sequence data. The proposed framework comprises two key components aimed at improving the accuracy of feature extraction from individual protein sequences and capturing relationships between proteins in unfamiliar datasets. 1. Protein Node Feature Extraction Module: To enhance the accuracy of feature extraction from individual protein sequences and facilitate the understanding of relationships between proteins in unknown datasets, the paper devised a novel protein node feature extraction module utilizing the Inception method. This module efficiently captures relevant patterns and representations within protein sequences, enabling more informative feature extraction. 2. Feature-Relational Reasoning Network (FRN): In the Global Feature Extraction module of our model, the paper developed a novel FRN that leveraged Graph Neural Networks to determine interactions between pairs of input proteins. The FRN effectively captures the underlying relational information between proteins, contributing to improved PPI predictions. DL-PPI framework demonstrates state-of-the-art performance in the realm of sequence-based PPI prediction.
Collapse
Affiliation(s)
- Jiahui Wu
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Bo Liu
- School of Mathematical and Computational Sciences, Massey University, Auckland, 0745, New Zealand.
| | - Jidong Zhang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Zhihan Wang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Jianqiang Li
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| |
Collapse
|
6
|
Cong H, Liu H, Cao Y, Liang C, Chen Y. Protein-protein interaction site prediction by model ensembling with hybrid feature and self-attention. BMC Bioinformatics 2023; 24:456. [PMID: 38053020 DOI: 10.1186/s12859-023-05592-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2022] [Accepted: 11/30/2023] [Indexed: 12/07/2023] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) are crucial in various biological functions and cellular processes. Thus, many computational approaches have been proposed to predict PPI sites. Although significant progress has been made, these methods still have limitations in encoding the characteristics of each amino acid in sequences. Many feature extraction methods rely on the sliding window technique, which simply merges all the features of residues into a vector. The importance of some key residues may be weakened in the feature vector, leading to poor performance. RESULTS We propose a novel sequence-based method for PPI sites prediction. The new network model, PPINet, contains multiple feature processing paths. For a residue, the PPINet extracts the features of the targeted residue and its context separately. These two types of features are processed by two paths in the network and combined to form a protein representation, where the two types of features are of relatively equal importance. The model ensembling technique is applied to make use of more features. The base models are trained with different features and then ensembled via stacking. In addition, a data balancing strategy is presented, by which our model can get significant improvement on highly unbalanced data. CONCLUSION The proposed method is evaluated on a fused dataset constructed from Dset186, Dset_72, and PDBset_164, as well as the public Dset_448 dataset. Compared with current state-of-the-art methods, the performance of our method is better than the others. In the most important metrics, such as AUPRC and recall, it surpasses the second-best programmer on the latter dataset by 6.9% and 4.7%, respectively. We also demonstrated that the improvement is essentially due to using the ensemble model, especially, the hybrid feature. We share our code for reproducibility and future research at https://github.com/CandiceCong/StackingPPINet .
Collapse
Affiliation(s)
- Hanhan Cong
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
- Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan, China
| | - Hong Liu
- School of Information Science and Engineering, Shandong Normal University, Jinan, China.
- Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan, China.
| | - Yi Cao
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Cheng Liang
- School of Information Science and Engineering, Shandong Normal University, Jinan, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| |
Collapse
|
7
|
Esmaili F, Pourmirzaei M, Ramazi S, Shojaeilangari S, Yavari E. A Review of Machine Learning and Algorithmic Methods for Protein Phosphorylation Site Prediction. GENOMICS, PROTEOMICS & BIOINFORMATICS 2023; 21:1266-1285. [PMID: 37863385 PMCID: PMC11082408 DOI: 10.1016/j.gpb.2023.03.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/16/2022] [Revised: 01/16/2023] [Accepted: 03/23/2023] [Indexed: 10/22/2023]
Abstract
Post-translational modifications (PTMs) have key roles in extending the functional diversity of proteins and, as a result, regulating diverse cellular processes in prokaryotic and eukaryotic organisms. Phosphorylation modification is a vital PTM that occurs in most proteins and plays a significant role in many biological processes. Disorders in the phosphorylation process lead to multiple diseases, including neurological disorders and cancers. The purpose of this review is to organize this body of knowledge associated with phosphorylation site (p-site) prediction to facilitate future research in this field. At first, we comprehensively review all related databases and introduce all steps regarding dataset creation, data preprocessing, and method evaluation in p-site prediction. Next, we investigate p-site prediction methods, which are divided into two computational groups: algorithmic and machine learning (ML). Additionally, it is shown that there are basically two main approaches for p-site prediction by ML: conventional and end-to-end deep learning methods, both of which are given an overview. Moreover, this review introduces the most important feature extraction techniques, which have mostly been used in p-site prediction. Finally, we create three test sets from new proteins related to the released version of the database of protein post-translational modifications (dbPTM) in 2022 based on general and human species. Evaluating online p-site prediction tools on newly added proteins introduced in the dbPTM 2022 release, distinct from those in the dbPTM 2019 release, reveals their limitations. In other words, the actual performance of these online p-site prediction tools on unseen proteins is notably lower than the results reported in their respective research papers.
Collapse
Affiliation(s)
- Farzaneh Esmaili
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Mahdi Pourmirzaei
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| | - Shahin Ramazi
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran 14115-111, Iran.
| | - Seyedehsamaneh Shojaeilangari
- Biomedical Engineering Group, Department of Electrical Engineering and Information Technology, Iranian Research Organization for Science and Technology (IROST), Tehran 33535-111, Iran
| | - Elham Yavari
- Department of Information Technology, Tarbiat Modares University, Tehran 14115-111, Iran
| |
Collapse
|
8
|
Xie J, Quan L, Wang X, Wu H, Jin Z, Pan D, Chen T, Wu T, Lyu Q. DeepMPSF: A Deep Learning Network for Predicting General Protein Phosphorylation Sites Based on Multiple Protein Sequence Features. J Chem Inf Model 2023; 63:7258-7271. [PMID: 37931253 DOI: 10.1021/acs.jcim.3c00996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2023]
Abstract
Phosphorylation, as one of the most important post-translational modifications, plays a key role in various cellular physiological processes and disease occurrences. In recent years, computer technology has been gradually applied to the prediction of protein phosphorylation sites. However, most existing methods rely on simple protein sequence features that provide limited contextual information. To overcome this limitation, we propose DeepMPSF, a phosphorylation site prediction model based on multiple protein sequence features. There are two types of features: sequence semantic features, which comprise protein residue type information and relative position information within protein sequence, and protein background biophysical features, which include global semantic information containing more comprehensive protein background information obtained from pretrained models. To extract these features, DeepMPSF employs two separate subnetworks: the S71SFE module and the BBFE module, which automatically extract high-level semantic features. Our model incorporates a learning strategy for handling imbalanced datasets through ensemble learning during training and prediction. DeepMPSF is trained and evaluated on a well-established dataset of human proteins. Comparing the analysis with other benchmark methods reveals that DeepMPSF outperforms in predicting both S/T residues and Y residues. In particular, DeepMPSF showed excellent generalization performance in cross-species blind test performance, with an average improvement of 5.63%/5.72%, 22.28%/25.94%, 20.11%/17.49%, and 26.40%/28.33% for Mus musculus/Rattus norvegicus test sets in area under curves (AUCs) of ROC curve, AUC of the PR curve, F1-score, and MCC metrics, respectively. Furthermore, it also shows excellent performance in the latest updated case of natural proteins with functional phosphorylation sites. Through an ablation study and visual analysis, we uncover that the design of different feature modules significantly contributes to the accurate classification of DeepMPSF, which provides valuable insights for predicting phosphorylation sites and offers effective support for future downstream research.
Collapse
Affiliation(s)
- Jingxin Xie
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Lijun Quan
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Xuejiao Wang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Hongjie Wu
- Suzhou University of Science and Technology, Suzhou 215006, China
| | - Zhi Jin
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Deng Pan
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Taoning Chen
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Tingfang Wu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Qiang Lyu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| |
Collapse
|
9
|
Mou M, Pan Z, Zhou Z, Zheng L, Zhang H, Shi S, Li F, Sun X, Zhu F. A Transformer-Based Ensemble Framework for the Prediction of Protein-Protein Interaction Sites. RESEARCH (WASHINGTON, D.C.) 2023; 6:0240. [PMID: 37771850 PMCID: PMC10528219 DOI: 10.34133/research.0240] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/02/2023] [Accepted: 09/08/2023] [Indexed: 09/30/2023]
Abstract
The identification of protein-protein interaction (PPI) sites is essential in the research of protein function and the discovery of new drugs. So far, a variety of computational tools based on machine learning have been developed to accelerate the identification of PPI sites. However, existing methods suffer from the low predictive accuracy or the limited scope of application. Specifically, some methods learned only global or local sequential features, leading to low predictive accuracy, while others achieved improved performance by extracting residue interactions from structures but were limited in their application scope for the serious dependence on precise structure information. There is an urgent need to develop a method that integrates comprehensive information to realize proteome-wide accurate profiling of PPI sites. Herein, a novel ensemble framework for PPI sites prediction, EnsemPPIS, was therefore proposed based on transformer and gated convolutional networks. EnsemPPIS can effectively capture not only global and local patterns but also residue interactions. Specifically, EnsemPPIS was unique in (a) extracting residue interactions from protein sequences with transformer and (b) further integrating global and local sequential features with the ensemble learning strategy. Compared with various existing methods, EnsemPPIS exhibited either superior performance or broader applicability on multiple PPI sites prediction tasks. Moreover, pattern analysis based on the interpretability of EnsemPPIS demonstrated that EnsemPPIS was fully capable of learning residue interactions within the local structure of PPI sites using only sequence information. The web server of EnsemPPIS is freely available at http://idrblab.org/ensemppis.
Collapse
Affiliation(s)
- Minjie Mou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Ziqi Pan
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Zhimeng Zhou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Lingyan Zheng
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Hanyu Zhang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Shuiyang Shi
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Fengcheng Li
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Xiuna Sun
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital,
Zhejiang UniversitySchool of Medicine, National Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| |
Collapse
|
10
|
Liang Z, Liu T, Li Q, Zhang G, Zhang B, Du X, Liu J, Chen Z, Ding H, Hu G, Lin H, Zhu F, Luo C. Deciphering the functional landscape of phosphosites with deep neural network. Cell Rep 2023; 42:113048. [PMID: 37659078 DOI: 10.1016/j.celrep.2023.113048] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2023] [Revised: 07/11/2023] [Accepted: 08/11/2023] [Indexed: 09/04/2023] Open
Abstract
Current biochemical approaches have only identified the most well-characterized kinases for a tiny fraction of the phosphoproteome, and the functional assignments of phosphosites are almost negligible. Herein, we analyze the substrate preference catalyzed by a specific kinase and present a novel integrated deep neural network model named FuncPhos-SEQ for functional assignment of human proteome-level phosphosites. FuncPhos-SEQ incorporates phosphosite motif information from a protein sequence using multiple convolutional neural network (CNN) channels and network features from protein-protein interactions (PPIs) using network embedding and deep neural network (DNN) channels. These concatenated features are jointly fed into a heterogeneous feature network to prioritize functional phosphosites. Combined with a series of in vitro and cellular biochemical assays, we confirm that NADK-S48/50 phosphorylation could activate its enzymatic activity. In addition, ERK1/2 are discovered as the primary kinases responsible for NADK-S48/50 phosphorylation. Moreover, FuncPhos-SEQ is developed as an online server.
Collapse
Affiliation(s)
- Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Tonghai Liu
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Qi Li
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Guangyu Zhang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Bei Zhang
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Xikun Du
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
| | - Jingqiu Liu
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Zhifeng Chen
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Hong Ding
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Guang Hu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Hao Lin
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China.
| | - Cheng Luo
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan 528437, China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China; School of Life Science and Technology, Shanghai Tech University, 100 Haike Road, Shanghai 201210, China; School of Pharmacy, Fujian Medical University, Fuzhou 350122, China.
| |
Collapse
|
11
|
Pakhrin SC, Pokharel S, Pratyush P, Chaudhari M, Ismail HD, Kc DB. LMPhosSite: A Deep Learning-Based Approach for General Protein Phosphorylation Site Prediction Using Embeddings from the Local Window Sequence and Pretrained Protein Language Model. J Proteome Res 2023; 22:2548-2557. [PMID: 37459437 DOI: 10.1021/acs.jproteome.2c00667] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/05/2023]
Abstract
Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although there exist several computational tools to predict phosphorylation sites, existing tools have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for the general phosphorylation site prediction that integrates embeddings from the local window sequence and the contextualized embedding obtained using global (overall) protein sequence from a pretrained protein language model to improve the prediction performance. Thus, the LMPhosSite consists of two base-models: one for capturing effective local representation and the other for capturing global per-residue contextualized embedding from a pretrained protein language model. The output of these base-models is integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthew's correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of the general phosphorylation sites in proteins.
Collapse
Affiliation(s)
- Subash C Pakhrin
- School of Computing, Wichita State University, 1845 Fairmount St., Wichita, Kansas 67260, United States
- Department of Computer Science & Engineering Technology, University of Houston-Downtown, 1 Main St., Houston, Texas 77002, United States
| | - Suresh Pokharel
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
| | - Pawel Pratyush
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
| | - Meenal Chaudhari
- Department of Biology, North Carolina A&T State University, Greensboro, North Carolina 27411, United States
| | - Hamid D Ismail
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
| | - Dukka B Kc
- Department of Computer Science, Michigan Technological University, Houghton, Michigan 49931, United States
| |
Collapse
|
12
|
Lee M. Recent Advances in Deep Learning for Protein-Protein Interaction Analysis: A Comprehensive Review. Molecules 2023; 28:5169. [PMID: 37446831 DOI: 10.3390/molecules28135169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2023] [Revised: 06/30/2023] [Accepted: 06/30/2023] [Indexed: 07/15/2023] Open
Abstract
Deep learning, a potent branch of artificial intelligence, is steadily leaving its transformative imprint across multiple disciplines. Within computational biology, it is expediting progress in the understanding of Protein-Protein Interactions (PPIs), key components governing a wide array of biological functionalities. Hence, an in-depth exploration of PPIs is crucial for decoding the intricate biological system dynamics and unveiling potential avenues for therapeutic interventions. As the deployment of deep learning techniques in PPI analysis proliferates at an accelerated pace, there exists an immediate demand for an exhaustive review that encapsulates and critically assesses these novel developments. Addressing this requirement, this review offers a detailed analysis of the literature from 2021 to 2023, highlighting the cutting-edge deep learning methodologies harnessed for PPI analysis. Thus, this review stands as a crucial reference for researchers in the discipline, presenting an overview of the recent studies in the field. This consolidation helps elucidate the dynamic paradigm of PPI analysis, the evolution of deep learning techniques, and their interdependent dynamics. This scrutiny is expected to serve as a vital aid for researchers, both well-established and newcomers, assisting them in maneuvering the rapidly shifting terrain of deep learning applications in PPI analysis.
Collapse
Affiliation(s)
- Minhyeok Lee
- School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
| |
Collapse
|
13
|
Wang Z, Deng Z, Zhang W, Lou Q, Choi KS, Wei Z, Wang L, Wu J. MMSMAPlus: a multi-view multi-scale multi-attention embedding model for protein function prediction. Brief Bioinform 2023:7187109. [PMID: 37258453 DOI: 10.1093/bib/bbad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2022] [Revised: 04/16/2023] [Accepted: 05/08/2023] [Indexed: 06/02/2023] Open
Abstract
Protein is the most important component in organisms and plays an indispensable role in life activities. In recent years, a large number of intelligent methods have been proposed to predict protein function. These methods obtain different types of protein information, including sequence, structure and interaction network. Among them, protein sequences have gained significant attention where methods are investigated to extract the information from different views of features. However, how to fully exploit the views for effective protein sequence analysis remains a challenge. In this regard, we propose a multi-view, multi-scale and multi-attention deep neural model (MMSMA) for protein function prediction. First, MMSMA extracts multi-view features from protein sequences, including one-hot encoding features, evolutionary information features, deep semantic features and overlapping property features based on physiochemistry. Second, a specific multi-scale multi-attention deep network model (MSMA) is built for each view to realize the deep feature learning and preliminary classification. In MSMA, both multi-scale local patterns and long-range dependence from protein sequences can be captured. Third, a multi-view adaptive decision mechanism is developed to make a comprehensive decision based on the classification results of all the views. To further improve the prediction performance, an extended version of MMSMA, MMSMAPlus, is proposed to integrate homology-based protein prediction under the framework of multi-view deep neural model. Experimental results show that the MMSMAPlus has promising performance and is significantly superior to the state-of-the-art methods. The source code can be found at https://github.com/wzy-2020/MMSMAPlus.
Collapse
Affiliation(s)
- Zhongyu Wang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Wei Zhang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | - Qiongdan Lou
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
| | | | - Zhisheng Wei
- National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
| | - Lei Wang
- National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
| | - Jing Wu
- National Key Laboratory of Food Science and Resource Mining, Jiangnan University, Wuxi, China
| |
Collapse
|
14
|
Varshney N, Mishra AK. Deep Learning in Phosphoproteomics: Methods and Application in Cancer Drug Discovery. Proteomes 2023; 11:proteomes11020016. [PMID: 37218921 DOI: 10.3390/proteomes11020016] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Revised: 04/24/2023] [Accepted: 04/25/2023] [Indexed: 05/24/2023] Open
Abstract
Protein phosphorylation is a key post-translational modification (PTM) that is a central regulatory mechanism of many cellular signaling pathways. Several protein kinases and phosphatases precisely control this biochemical process. Defects in the functions of these proteins have been implicated in many diseases, including cancer. Mass spectrometry (MS)-based analysis of biological samples provides in-depth coverage of phosphoproteome. A large amount of MS data available in public repositories has unveiled big data in the field of phosphoproteomics. To address the challenges associated with handling large data and expanding confidence in phosphorylation site prediction, the development of many computational algorithms and machine learning-based approaches have gained momentum in recent years. Together, the emergence of experimental methods with high resolution and sensitivity and data mining algorithms has provided robust analytical platforms for quantitative proteomics. In this review, we compile a comprehensive collection of bioinformatic resources used for the prediction of phosphorylation sites, and their potential therapeutic applications in the context of cancer.
Collapse
Affiliation(s)
- Neha Varshney
- Division of Biological Sciences, Department of Cellular and Molecular Medicine, University of California, San Diego, CA 93093, USA
- Ludwig Institute for Cancer Research, La Jolla, CA 92093, USA
| | - Abhinava K Mishra
- Molecular, Cellular and Developmental Biology Department, University of California, Santa Barbara, CA 93106, USA
| |
Collapse
|
15
|
Gong P, Cheng L, Zhang Z, Meng A, Li E, Chen J, Zhang L. Multi-omics integration method based on attention deep learning network for biomedical data classification. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2023; 231:107377. [PMID: 36739624 DOI: 10.1016/j.cmpb.2023.107377] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Revised: 01/06/2023] [Accepted: 01/25/2023] [Indexed: 06/18/2023]
Abstract
BACKGROUND AND OBJECTIVE Integrating multi-omics data for the comprehensive analysis of the biological processes in human diseases has become one of the most challenging tasks of bioinformatics. Deep learning (DL) algorithms have recently become one of the most promising multi-omics data integration analysis methods. However, existing DL-based studies almost integrate the multi-omics data by concatenation in the input data space or the learned feature space, ignoring the correlations between patients and omics. METHODS We propose a novel multi-omics integration method, called Multi-omics Attention Deep Learning Network (MOADLN), which is used for biomedical data classification. Firstly, for each type of omics data, we use three fully-connected layers and the self-attention mechanism to reduce dimensionality, and construct the correlations between patients, respectively. Then, we apply the feature vector learned from self-attention to generate the initial category labels. Secondly, for the initial label predicted of each omics data, we use an effective Multi-Omics Correlation Discovery Network (MOCDN) to learn the cross-omic correlations in the label space. Finally, we use the softmax classifier for label prediction. RESULTS We demonstrate that our method outperforms several state-of-the-art methods on two datasets with mRNA expression data, DNA methylation data, and miRNA expression data. In addition, we identified essential biomarkers of relevant diseases by MOADLN, and the generality of MOADLN is also demonstrated in the KIRP and KIRC datasets. CONCLUSIONS MOADLN jointly explores correlations between patients in intra-omics and correlations of cross-omics in label space, which is an effective DL-based classification of biomedical data.
Collapse
Affiliation(s)
- Ping Gong
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China.
| | - Lei Cheng
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
| | - Zhiyuan Zhang
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
| | - Ao Meng
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
| | - Enshuo Li
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
| | - Jie Chen
- Department of Radiation Oncology, Affiliated Hospital of Xuzhou Medical University, Xuzhou, CN, China
| | - Longzhen Zhang
- Department of Radiation Oncology, Affiliated Hospital of Xuzhou Medical University, Xuzhou, CN, China
| |
Collapse
|
16
|
Zhou H, Tan W, Shi S. DeepGpgs: a novel deep learning framework for predicting arginine methylation sites combined with Gaussian prior and gated self-attention mechanism. Brief Bioinform 2023; 24:7000314. [PMID: 36694944 DOI: 10.1093/bib/bbad018] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2022] [Revised: 12/26/2022] [Accepted: 01/04/2023] [Indexed: 01/26/2023] Open
Abstract
Protein arginine methylation is an important posttranslational modification (PTM) associated with protein functional diversity and pathological conditions including cancer. Identification of methylation binding sites facilitates a better understanding of the molecular function of proteins. Recent developments in the field of deep neural networks have led to a proliferation of deep learning-based methylation identification studies because of their fast and accurate prediction. In this paper, we propose DeepGpgs, an advanced deep learning model incorporating Gaussian prior and gated attention mechanism. We introduce a residual network channel to extract the evolutionary information of proteins. Then we combine the adaptive embedding with bidirectional long short-term memory networks to form a context-shared encoder layer. A gated multi-head attention mechanism is followed to obtain the global information about the sequence. A Gaussian prior is injected into the sequence to assist in predicting PTMs. We also propose a weighted joint loss function to alleviate the false negative problem. We empirically show that DeepGpgs improves Matthews correlation coefficient by 6.3% on the arginine methylation independent test set compared with the existing state-of-the-art methylation site prediction methods. Furthermore, DeepGpgs has good robustness in phosphorylation site prediction of SARS-CoV-2, which indicates that DeepGpgs has good transferability and the potential to be extended to other modification sites prediction. The open-source code and data of the DeepGpgs can be obtained from https://github.com/saizhou1/DeepGpgs.
Collapse
Affiliation(s)
- Haiwei Zhou
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Wenxi Tan
- School of Mathematical Sciences, Fudan University, Shanghai 200433, China
| | - Shaoping Shi
- Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
| |
Collapse
|
17
|
Wang X, Ding Z, Wang R, Lin X. Deepro-Glu: combination of convolutional neural network and Bi-LSTM models using ProtBert and handcrafted features to identify lysine glutarylation sites. Brief Bioinform 2023; 24:6991122. [PMID: 36653898 DOI: 10.1093/bib/bbac631] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2022] [Revised: 12/11/2022] [Accepted: 12/28/2022] [Indexed: 01/20/2023] Open
Abstract
Lysine glutarylation (Kglu) is a newly discovered post-translational modification of proteins with important roles in mitochondrial functions, oxidative damage, etc. The established biological experimental methods to identify glutarylation sites are often time-consuming and costly. Therefore, there is an urgent need to develop computational methods for efficient and accurate identification of glutarylation sites. Most of the existing computational methods only utilize handcrafted features to construct the prediction model and do not consider the positive impact of the pre-trained protein language model on the prediction performance. Based on this, we develop an ensemble deep-learning predictor Deepro-Glu that combines convolutional neural network and bidirectional long short-term memory network using the deep learning features and traditional handcrafted features to predict lysine glutaryation sites. The deep learning features are generated from the pre-trained protein language model called ProtBert, and the handcrafted features consist of sequence-based features, physicochemical property-based features and evolution information-based features. Furthermore, the attention mechanism is used to efficiently integrate the deep learning features and the handcrafted features by learning the appropriate attention weights. 10-fold cross-validation and independent tests demonstrate that Deepro-Glu achieves competitive or superior performance than the state-of-the-art methods. The source codes and data are publicly available at https://github.com/xwanggroup/Deepro-Glu.
Collapse
Affiliation(s)
- Xiao Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Zhaoyuan Ding
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Rong Wang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, No. 136, Science Avenue, 450002, Zhengzhou, China
| | - Xi Lin
- Instiute of Artificial Intelligence, Xiamen University, No.4221, Xiang'an South Road, 361000, Xiamen, China
| |
Collapse
|
18
|
Zhu F, Deng L, Dai Y, Zhang G, Meng F, Luo C, Hu G, Liang Z. PPICT: an integrated deep neural network for predicting inter-protein PTM cross-talk. Brief Bioinform 2023; 24:7035113. [PMID: 36781207 DOI: 10.1093/bib/bbad052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Revised: 01/11/2023] [Accepted: 01/26/2023] [Indexed: 02/15/2023] Open
Abstract
Post-translational modifications (PTMs) fine-tune various signaling pathways not only by the modification of a single residue, but also by the interplay of different modifications on residue pairs within or between proteins, defined as PTM cross-talk. As a challenging question, less attention has been given to PTM dynamics underlying cross-talk residue pairs and structural information underlying protein-protein interaction (PPI) graph, limiting the progress in this PTM functional research. Here we propose a novel integrated deep neural network PPICT (Predictor for PTM Inter-protein Cross-Talk), which predicts PTM cross-talk by combining protein sequence-structure-dynamics information and structural information for PPI graph. We find that cross-talk events preferentially occur among residues with high co-evolution and high potential in allosteric regulation. To make full use of the complex associations between protein evolutionary and biophysical features, and protein pair features, a heterogeneous feature combination net is introduced in the final prediction of PPICT. The comprehensive test results show that the proposed PPICT method significantly improves the prediction performance with an AUC value of 0.869, outperforming the existing state-of-the-art methods. Additionally, the PPICT method can capture the potential PTM cross-talks involved in the functional regulatory PTMs on modifying enzymes and their catalyzed PTM substrates. Therefore, PPICT represents an effective tool for identifying PTM cross-talk between proteins at the proteome level and highlights the hints for cross-talk between different signal pathways introduced by PTMs.
Collapse
Affiliation(s)
- Fei Zhu
- School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, 215123, Suzhou, China
| | - Lei Deng
- School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
| | - Yuhao Dai
- School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
| | - Guangyu Zhang
- School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
| | - Fanwang Meng
- Department of Chemistry and Chemical Biology, McMaster University, L8S 4L8, Ontario, Canada
| | - Cheng Luo
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 201203, Shanghai, China
| | - Guang Hu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, 215123, Suzhou, China
| | - Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, 215123, Suzhou, China
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 201203, Shanghai, China
- Key Laboratory of Systems Biomedicine (Ministry of Education), Center for Systems Biomedicine, Shanghai Jiao Tong University, 200240, Shanghai, China
| |
Collapse
|
19
|
Zhou Z, Yeung W, Gravel N, Salcedo M, Soleymani S, Li S, Kannan N. Phosformer: an explainable transformer model for protein kinase-specific phosphorylation predictions. Bioinformatics 2023; 39:7000331. [PMID: 36692152 PMCID: PMC9900213 DOI: 10.1093/bioinformatics/btad046] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2022] [Revised: 01/16/2023] [Accepted: 01/23/2023] [Indexed: 01/25/2023] Open
Abstract
MOTIVATION The human genome encodes over 500 distinct protein kinases which regulate nearly all cellular processes by the specific phosphorylation of protein substrates. While advances in mass spectrometry and proteomics studies have identified thousands of phosphorylation sites across species, information on the specific kinases that phosphorylate these sites is currently lacking for the vast majority of phosphosites. Recently, there has been a major focus on the development of computational models for predicting kinase-substrate associations. However, most current models only allow predictions on a subset of well-studied kinases. Furthermore, the utilization of hand-curated features and imbalances in training and testing datasets pose unique challenges in the development of accurate predictive models for kinase-specific phosphorylation prediction. Motivated by the recent development of universal protein language models which automatically generate context-aware features from primary sequence information, we sought to develop a unified framework for kinase-specific phosphosite prediction, allowing for greater investigative utility and enabling substrate predictions at the whole kinome level. RESULTS We present a deep learning model for kinase-specific phosphosite prediction, termed Phosformer, which predicts the probability of phosphorylation given an arbitrary pair of unaligned kinase and substrate peptide sequences. We demonstrate that Phosformer implicitly learns evolutionary and functional features during training, removing the need for feature curation and engineering. Further analyses reveal that Phosformer also learns substrate specificity motifs and is able to distinguish between functionally distinct kinase families. Benchmarks indicate that Phosformer exhibits significant improvements compared to the state-of-the-art models, while also presenting a more generalized, unified, and interpretable predictive framework. AVAILABILITY AND IMPLEMENTATION Code and data are available at https://github.com/esbgkannan/phosformer. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, GA 30602, USA
| | - Mariah Salcedo
- Department of Biochemistry and Molecular Biology, University of Georgia, GA 30602, USA
| | | | - Sheng Li
- To whom correspondence should be addressed. or
| | | |
Collapse
|
20
|
Sun Q, Cheng L, Meng A, Ge S, Chen J, Zhang L, Gong P. SADLN: Self-attention based deep learning network of integrating multi-omics data for cancer subtype recognition. Front Genet 2023; 13:1032768. [PMID: 36685873 PMCID: PMC9846505 DOI: 10.3389/fgene.2022.1032768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2022] [Accepted: 12/15/2022] [Indexed: 01/05/2023] Open
Abstract
Integrating multi-omics data for cancer subtype recognition is an important task in bioinformatics. Recently, deep learning has been applied to recognize the subtype of cancers. However, existing studies almost integrate the multi-omics data simply by concatenation as the single data and then learn a latent low-dimensional representation through a deep learning model, which did not consider the distribution differently of omics data. Moreover, these methods ignore the relationship of samples. To tackle these problems, we proposed SADLN: A self-attention based deep learning network of integrating multi-omics data for cancer subtype recognition. SADLN combined encoder, self-attention, decoder, and discriminator into a unified framework, which can not only integrate multi-omics data but also adaptively model the sample's relationship for learning an accurately latent low-dimensional representation. With the integrated representation learned from the network, SADLN used Gaussian Mixture Model to identify cancer subtypes. Experiments on ten cancer datasets of TCGA demonstrated the advantages of SADLN compared to ten methods. The Self-Attention Based Deep Learning Network (SADLN) is an effective method of integrating multi-omics data for cancer subtype recognition.
Collapse
Affiliation(s)
- Qiuwen Sun
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, China
| | - Lei Cheng
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, China
| | - Ao Meng
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, China
| | - Shuguang Ge
- School of Information and Control Engineering, University of Mining and Technology, Xuzhou, China
| | - Jie Chen
- Department of Radiation Oncology, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
| | - Longzhen Zhang
- Department of Radiation Oncology, Affiliated Hospital of Xuzhou Medical University, Xuzhou, China
| | - Ping Gong
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, China,*Correspondence: Ping Gong,
| |
Collapse
|
21
|
A Novel Capsule Network with Attention Routing to Identify Prokaryote Phosphorylation Sites. Biomolecules 2022; 12:biom12121854. [PMID: 36551282 PMCID: PMC9775645 DOI: 10.3390/biom12121854] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2022] [Revised: 12/07/2022] [Accepted: 12/09/2022] [Indexed: 12/14/2022] Open
Abstract
By denaturing proteins and promoting the formation of multiprotein complexes, protein phosphorylation has important effects on the activity of protein functional molecules and cell signaling. The regulation of protein phosphorylation allows microbes to respond rapidly and reversibly to specific environmental stimuli or niches, which is closely related to the molecular mechanisms of bacterial drug resistance. Accurate prediction of phosphorylation sites (p-site) of prokaryotes can contribute to addressing bacterial resistance and providing new perspectives for developing novel antibacterial drugs. Most existing studies focus on human phosphorylation sites, while tools targeting phosphorylation site identification of prokaryotic proteins are still relatively scarce. This study designs a capsule network-based prediction technique for p-site in prokaryotes. To address the poor scalability and unreliability of dynamic routing processes in the output space of capsule networks, a more reliable way is introduced to learn the consistency between capsules. We incorporate a self-attention mechanism into the routing algorithm to capture the global information of the capsule, reducing the computational effort while enriching the representation capability of the capsule. Aiming at the weak robustness of the model, EcapsP improves the prediction accuracy and stability by introducing shortcuts and unconditional reconfiguration. In addition, the study compares and analyzes the prediction performance based on word vectors, physicochemical properties, and mixing characteristics in predicting serine (Ser/S), threonine (Thr/T), and tyrosine (Tyr/Y) p-site. The comprehensive experimental results show that the accuracy of the developed technique is close to 70% for the identification of the three phosphorylation sites in prokaryotes. Importantly, in side-by-side comparisons with other state-of-the-art predictors, our method improves the Matthews correlation coefficient (MCC) by approximately 7%. The results demonstrate the superiority of EcapsP in terms of high performance and reliability.
Collapse
|
22
|
Zeng Y, Liu D, Wang Y. Identification of phosphorylation site using S-padding strategy based convolutional neural network. Health Inf Sci Syst 2022; 10:29. [PMID: 36124094 PMCID: PMC9481819 DOI: 10.1007/s13755-022-00196-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2022] [Accepted: 08/25/2022] [Indexed: 10/14/2022] Open
Abstract
Purpose Abnormal phosphorylation has been proved to associate with a variety of human diseases, and the identification of phosphorylation sites is one of the research hotspots in healthcare. The study of phosphorylation site prediction in deep learning models often introduces a variety of information, and the utilization of complex models limits the usage scenarios of the models. Methods An enhanced deep learning method with S-padding strategy based on convolutional neural network is proposed in this paper. The S-padding strategy forms a three-dimensional matrix with extension information from original amino acid sequences, and a corresponding 2D-CNN model is designed to abstract the comprehensive features of phosphorylation site area in protein sequences. Results The fivefold cross-validation experiments are conducted, and the results show the performance of the proposed method on human dataset can achieve an accuracy of 89.68 % on serine/threonine sites and 88.16 % on tyrosine sites, respectively. Furthermore, phosphorylation site prediction on different organisms obtains the accuracy, sensitivity, and specificity of over 0.85, indicating a potential capability on phosphorylation site prediction task. Comparison result with existing models shows that the proposed method obtains better performance on both accuracy and AUC value, and the proposed method can further improve performance with sufficient training data. Conclusion This method enables proteome-wide predictions via models trained on a large amount of phosphorylation data, further exploiting the potential of protein phosphorylation site identification, and helping to provide insights into phosphorylation mechanisms.
Collapse
Affiliation(s)
- Yanjiao Zeng
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| | - Dongning Liu
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| | - Yang Wang
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006 Guangdong China
| |
Collapse
|
23
|
Li X, Zhang S, Shi H. An improved residual network using deep fusion for identifying RNA 5-methylcytosine sites. Bioinformatics 2022; 38:4271-4277. [PMID: 35866985 DOI: 10.1093/bioinformatics/btac532] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2022] [Revised: 06/30/2022] [Accepted: 07/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION 5-Methylcytosine (m5C) is a crucial post-transcriptional modification. With the development of technology, it is widely found in various RNAs. Numerous studies have indicated that m5C plays an essential role in various activities of organisms, such as tRNA recognition, stabilization of RNA structure, RNA metabolism and so on. Traditional identification is costly and time-consuming by wet biological experiments. Therefore, computational models are commonly used to identify the m5C sites. Due to the vast computing advantages of deep learning, it is feasible to construct the predictive model through deep learning algorithms. RESULTS In this study, we construct a model to identify m5C based on a deep fusion approach with an improved residual network. First, sequence features are extracted from the RNA sequences using Kmer, K-tuple nucleotide frequency component (KNFC), Pseudo dinucleotide composition (PseDNC) and Physical and chemical property (PCP). Kmer and KNFC extract information from a statistical point of view. PseDNC and PCP extract information from the physicochemical properties of RNA sequences. Then, two parts of information are fused with new features using bidirectional long- and short-term memory and attention mechanisms, respectively. Immediately after, the fused features are fed into the improved residual network for classification. Finally, 10-fold cross-validation and independent set testing are used to verify the credibility of the model. The results show that the accuracy reaches 91.87%, 95.55%, 92.27% and 95.60% on the training sets and independent test sets of Arabidopsis thaliana and M.musculus, respectively. This is a considerable improvement compared to previous studies and demonstrates the robust performance of our model. AVAILABILITY AND IMPLEMENTATION The data and code related to the study are available at https://github.com/alivelxj/m5c-DFRESG.
Collapse
Affiliation(s)
- Xinjie Li
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Hongyan Shi
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| |
Collapse
|
24
|
Shi H, Zhang S, Li X. R5hmCFDV: computational identification of RNA 5-hydroxymethylcytosine based on deep feature fusion and deep voting. Brief Bioinform 2022; 23:6658858. [PMID: 35945157 DOI: 10.1093/bib/bbac341] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2022] [Revised: 07/17/2022] [Accepted: 07/25/2022] [Indexed: 11/13/2022] Open
Abstract
RNA 5-hydroxymethylcytosine (5hmC) is a kind of RNA modification, which is related to the life activities of many organisms. Studying its distribution is very important to reveal its biological function. Previously, high-throughput sequencing was used to identify 5hmC, but it is expensive and inefficient. Therefore, machine learning is used to identify 5hmC sites. Here, we design a model called R5hmCFDV, which is mainly divided into feature representation, feature fusion and classification. (i) Pseudo dinucleotide composition, dinucleotide binary profile and frequency, natural vector and physicochemical property are used to extract features from four aspects: nucleotide composition, coding, natural language and physical and chemical properties. (ii) To strengthen the relevance of features, we construct a novel feature fusion method. Firstly, the attention mechanism is employed to process four single features, stitch them together and feed them to the convolution layer. After that, the output data are processed by BiGRU and BiLSTM, respectively. Finally, the features of these two parts are fused by the multiply function. (iii) We design the deep voting algorithm for classification by imitating the soft voting mechanism in the Python package. The base classifiers contain deep neural network (DNN), convolutional neural network (CNN) and improved gated recurrent unit (GRU). And then using the principle of soft voting, the corresponding weights are assigned to the predicted probabilities of the three classifiers. The predicted probability values are multiplied by the corresponding weights and then summed to obtain the final prediction results. We use 10-fold cross-validation to evaluate the model, and the evaluation indicators are significantly improved. The prediction accuracy of the two datasets is as high as 95.41% and 93.50%, respectively. It demonstrates the stronger competitiveness and generalization performance of our model. In addition, all datasets and source codes can be found at https://github.com/HongyanShi026/R5hmCFDV.
Collapse
Affiliation(s)
- Hongyan Shi
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Xinjie Li
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| |
Collapse
|
25
|
Mini-review: Recent advances in post-translational modification site prediction based on deep learning. Comput Struct Biotechnol J 2022; 20:3522-3532. [PMID: 35860402 PMCID: PMC9284371 DOI: 10.1016/j.csbj.2022.06.045] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2022] [Revised: 06/21/2022] [Accepted: 06/21/2022] [Indexed: 11/23/2022] Open
Abstract
Post-translational modifications (PTMs) are closely linked to numerous diseases, playing a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and diseases therapy. Compared to traditional machine learning methods, the deep learning approaches for PTM prediction provide accurate and rapid screening, guiding the downstream wet experiments to leverage the screened information for focused studies. In this paper, we reviewed the recent works in deep learning to identify phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarized PTM databases and discussed future directions with critical insights.
Collapse
Key Words
- AAindex, Amino acid index
- ATP, Adenosine triphosphate
- AUC, Area under curve
- Ac, Acetylation
- BE, Binary encoding
- BLOSUM, Blocks substitution matrix
- Bi-LSTM, Bidirectional LSTM
- CKSAAP, Composition of k-spaced amino acid Pairs
- CNN, Convolutional neural network
- CNNOH, CNN with the one-hot encoding
- CNNWE, CNN with the word-embedding encoding
- CNNrgb, CNN red green blue
- CV, Cross-validation
- DC-CNN, Densely connected convolutional neural network
- DL, Deep learning
- DNNs, Deep neural networks
- Deep learning
- E. coli, Escherichia coli
- EBGW, Encoding based on grouped weight
- EGAAC, Enhanced grouped amino acids content
- IG, Information gain
- K, Lysine
- KNN, k nearest neighbor
- LASSO, Least absolute shrinkage and selection operator
- LSTM, Long short-term memory
- LSTMWE, LSTM with the word-embedding encoding
- M.musculus, Mus musculus
- MDC, Modular densely connected convolutional networks
- MDCAN, Multilane dense convolutional attention network
- ML, Machine learning
- MLP, Multilayer perceptron
- MMI, Multivariate mutual information
- Machine learning
- Mass spectrometry
- NMBroto, Normalized Moreau-Broto autocorrelation
- P, Proline
- PSP, PhosphoSitePlus
- PSSM, Position-specific scoring matrix
- PTM, Post-translational modifications
- Ph, Phosphorylation
- Post-translational modification
- Prediction
- PseAAC, Pseudo-amino acid composition
- R, Arginine
- RF, Random forest
- RNN, Recurrent neural network
- ROC, Receiver operating characteristic
- S, Serine
- S. typhimurium, Salmonella typhimurium
- S.cerevisiae, Saccharomyces cerevisiae
- SE, Squeeze and excitation
- SEV, Split to Equal Validation
- ST, Source and target
- SUMO, Small ubiquitin-like modifier
- SVM, Support vector machines
- T, Threonine
- Ub, Ubiquitination
- Y, Tyrosine
- ZSL, Zero-shot learning
Collapse
|
26
|
Zhu F, Yang S, Meng F, Zheng Y, Ku X, Luo C, Hu G, Liang Z. Leveraging Protein Dynamics to Identify Functional Phosphorylation Sites using Deep Learning Models. J Chem Inf Model 2022; 62:3331-3345. [PMID: 35816597 DOI: 10.1021/acs.jcim.2c00484] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
Accurate prediction of post-translational modifications (PTMs) is of great significance in understanding cellular processes, by modulating protein structure and dynamics. Nowadays, with the rapid growth of protein data at different "omics" levels, machine learning models largely enriched the prediction of PTMs. However, most machine learning models only rely on protein sequence and little structural information. The lack of the systematic dynamics analysis underlying PTMs largely limits the PTM functional predictions. In this research, we present two dynamics-centric deep learning models, namely, cDL-PAU and cDL-FuncPhos, by incorporating sequence, structure, and dynamics-based features to elucidate the molecular basis and underlying functional landscape of PTMs. cDL-PAU achieved satisfactory area under the curve (AUC) scores of 0.804-0.888 for predicting phosphorylation, acetylation, and ubiquitination (PAU) sites, while cDL-FuncPhos achieved an AUC value of 0.771 for predicting functional phosphorylation (FuncPhos) sites, displaying reliable improvements. Through a feature selection, the dynamics-based coupling and commute ability show large contributions in discovering PAU sites and FuncPhos sites, suggesting the allosteric propensity for important PTMs. The application of cDL-FuncPhos in three oncoproteins not only corroborates its strong performance in FuncPhos prioritization but also gains insight into the physical basis for the functions. The source code and data set of cDL-PAU and cDL-FuncPhos are available at https://github.com/ComputeSuda/PTM_ML.
Collapse
Affiliation(s)
- Fei Zhu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China.,School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Sijie Yang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
| | - Fanwang Meng
- Department of Chemistry and Chemical Biology, McMaster University, Hamilton L8S 4L8, Ontario, Canada
| | - Yuxiang Zheng
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
| | - Xin Ku
- Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Cheng Luo
- State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| | - Guang Hu
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China
| | - Zhongjie Liang
- Center for Systems Biology, Department of Bioinformatics, School of Biology and Basic Medical Sciences, Soochow University, Suzhou 215123, China.,Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China.,State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| |
Collapse
|
27
|
DeepDA-Ace: A Novel Domain Adaptation Method for Species-Specific Acetylation Site Prediction. MATHEMATICS 2022. [DOI: 10.3390/math10142364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
Protein lysine acetylation is an important type of post-translational modification (PTM), and it plays a crucial role in various cellular processes. Recently, although many researchers have focused on developing tools for acetylation site prediction based on computational methods, most of these tools are based on traditional machine learning algorithms for acetylation site prediction without species specificity, still maintained as a single prediction model. Recent studies have shown that the acetylation sites of distinct species have evident location-specific differences; however, there is currently no integrated prediction model that can effectively predict acetylation sites cross all species. Therefore, to enhance the scope of species-specific level, it is necessary to establish a framework for species-specific acetylation site prediction. In this work, we propose a domain adaptation framework DeepDA-Ace for species-specific acetylation site prediction, including Rattus norvegicus, Schistosoma japonicum, Arabidopsis thaliana, and other types of species. In DeepDA-Ace, an attention based densely connected convolutional neural network is designed to capture sequence features, and the semantic adversarial learning strategy is proposed to align features of different species so as to achieve knowledge transfer. The DeepDA-Ace outperformed both the general prediction model and fine-tuning based species-specific model across most types of species. The experiment results have demonstrated that DeepDA-Ace is superior to the general and fine-tuning methods, and its precision exceeds 0.75 on most species. In addition, our method achieves at least 5% improvement over the existing acetylation prediction tools.
Collapse
|
28
|
Wang B, Wang Y, Chen Y, Gao M, Ren J, Guo Y, Situ C, Qi Y, Zhu H, Li Y, Guo X. DeepSCP: utilizing deep learning to boost single-cell proteome coverage. Brief Bioinform 2022; 23:6598882. [DOI: 10.1093/bib/bbac214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2022] [Revised: 04/20/2022] [Accepted: 05/06/2022] [Indexed: 11/12/2022] Open
Abstract
Abstract
Multiplexed single-cell proteomes (SCPs) quantification by mass spectrometry greatly improves the SCP coverage. However, it still suffers from a low number of protein identifications and there is much room to boost proteins identification by computational methods. In this study, we present a novel framework DeepSCP, utilizing deep learning to boost SCP coverage. DeepSCP constructs a series of features of peptide-spectrum matches (PSMs) by predicting the retention time based on the multiple SCP sample sets and fragment ion intensities based on deep learning, and predicts PSM labels with an optimized-ensemble learning model. Evaluation of DeepSCP on public and in-house SCP datasets showed superior performances compared with other state-of-the-art methods. DeepSCP identified more confident peptides and proteins by controlling q-value at 0.01 using target–decoy competition method. As a convenient and low-cost computing framework, DeepSCP will help boost single-cell proteome identification and facilitate the future development and application of single-cell proteomics.
Collapse
Affiliation(s)
- Bing Wang
- School of Medicine , Southeast University, Nanjing 210009 , China
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Yue Wang
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Yu Chen
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Mengmeng Gao
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Jie Ren
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Yueshuai Guo
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Chenghao Situ
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Yaling Qi
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Hui Zhu
- Department of Clinical Laboratory , Sir Run Run Hospital, Nanjing Medical University, Nanjing 211166 , China
| | - Yan Li
- School of Medicine , Southeast University, Nanjing 210009 , China
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| | - Xuejiang Guo
- Department of Histology and Embryology , State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing 211166 , China
| |
Collapse
|
29
|
Jiang M, Zhang R, Xia Y, Jia G, Yin Y, Wang P, Wu J, Ge R. i2APP: A Two-Step Machine Learning Framework For Antiparasitic Peptides Identification. Front Genet 2022; 13:884589. [PMID: 35571057 PMCID: PMC9091563 DOI: 10.3389/fgene.2022.884589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2022] [Accepted: 04/11/2022] [Indexed: 11/18/2022] Open
Abstract
Parasites can cause enormous damage to their hosts. Studies have shown that antiparasitic peptides can inhibit the growth and development of parasites and even kill them. Because traditional biological methods to determine the activity of antiparasitic peptides are time-consuming and costly, a method for large-scale prediction of antiparasitic peptides is urgently needed. We propose a computational approach called i2APP that can efficiently identify APPs using a two-step machine learning (ML) framework. First, in order to solve the imbalance of positive and negative samples in the training set, a random under sampling method is used to generate a balanced training data set. Then, the physical and chemical features and terminus-based features are extracted, and the first classification is performed by Light Gradient Boosting Machine (LGBM) and Support Vector Machine (SVM) to obtain 264-dimensional higher level features. These features are selected by Maximal Information Coefficient (MIC) and the features with the big MIC values are retained. Finally, the SVM algorithm is used for the second classification in the optimized feature space. Thus the prediction model i2APP is fully constructed. On independent datasets, the accuracy and AUC of i2APP are 0.913 and 0.935, respectively, which are better than the state-of-arts methods. The key idea of the proposed method is that multi-level features are extracted from peptide sequences and the higher-level features can distinguish well the APPs and non-APPs.
Collapse
Affiliation(s)
- Minchao Jiang
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Renfeng Zhang
- Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, China
| | - Yixiao Xia
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Gangyong Jia
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Yuyu Yin
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
| | - Pu Wang
- Computer School, Hubei University of Arts and Science, Xiangyang, China
- *Correspondence: Pu Wang, ; Jian Wu, ; Ruiquan Ge,
| | - Jian Wu
- MyGenostics Inc., Beijing, China
- *Correspondence: Pu Wang, ; Jian Wu, ; Ruiquan Ge,
| | - Ruiquan Ge
- School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
- *Correspondence: Pu Wang, ; Jian Wu, ; Ruiquan Ge,
| |
Collapse
|
30
|
Guan X, Wang Y, Shao W, Li Z, Huang S, Zhang D. S2Snet: deep learning for low molecular weight RNA identification with nanopore. Brief Bioinform 2022; 23:6562681. [DOI: 10.1093/bib/bbac098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2021] [Revised: 02/24/2022] [Accepted: 02/25/2022] [Indexed: 11/12/2022] Open
Abstract
Abstract
Ribonucleic acid (RNA) is a pivotal nucleic acid that plays a crucial role in regulating many biological activities. Recently, one study utilized a machine learning algorithm to automatically classify RNA structural events generated by a Mycobacterium smegmatis porin A nanopore trap. Although it can achieve desirable classification results, compared with deep learning (DL) methods, this classic machine learning requires domain knowledge to manually extract features, which is sophisticated, labor-intensive and time-consuming. Meanwhile, the generated original RNA structural events are not strictly equal in length, which is incompatible with the input requirements of DL models. To alleviate this issue, we propose a sequence-to-sequence (S2S) module that transforms the unequal length sequence (UELS) to the equal length sequence. Furthermore, to automatically extract features from the RNA structural events, we propose a sequence-to-sequence neural network based on DL. In addition, we add an attention mechanism to capture vital information for classification, such as dwell time and blockage amplitude. Through quantitative and qualitative analysis, the experimental results have achieved about a 2% performance increase (accuracy) compared to the previous method. The proposed method can also be applied to other nanopore platforms, such as the famous Oxford nanopore. It is worth noting that the proposed method is not only aimed at pursuing state-of-the-art performance but also provides an overall idea to process nanopore data with UELS.
Collapse
Affiliation(s)
- Xiaoyu Guan
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| | - Yuqin Wang
- State Key Laboratory of Analytical Chemistry for Life Sciences, School of Chemistry and Chemical Engineering, Nanjing University, 210023, Nanjing, China
- Chemistry and Biomedicine Innovation Center (ChemBIC), Nanjing University, 210023, Nanjing, China
| | - Wei Shao
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| | - Zhongnian Li
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| | - Shuo Huang
- State Key Laboratory of Analytical Chemistry for Life Sciences, School of Chemistry and Chemical Engineering, Nanjing University, 210023, Nanjing, China
- Chemistry and Biomedicine Innovation Center (ChemBIC), Nanjing University, 210023, Nanjing, China
| | - Daoqiang Zhang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| |
Collapse
|
31
|
Zhou J, Xiong W, Wang Y, Guan J. Protein Function Prediction Based on PPI Networks: Network Reconstruction vs Edge Enrichment. Front Genet 2022; 12:758131. [PMID: 34970299 PMCID: PMC8712557 DOI: 10.3389/fgene.2021.758131] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2021] [Accepted: 11/11/2021] [Indexed: 01/21/2023] Open
Abstract
Over the past decades, massive amounts of protein-protein interaction (PPI) data have been accumulated due to the advancement of high-throughput technologies, and but data quality issues (noise or incompleteness) of PPI have been still affecting protein function prediction accuracy based on PPI networks. Although two main strategies of network reconstruction and edge enrichment have been reported on the effectiveness of boosting the prediction performance in numerous literature studies, there still lack comparative studies of the performance differences between network reconstruction and edge enrichment. Inspired by the question, this study first uses three protein similarity metrics (local, global and sequence) for network reconstruction and edge enrichment in PPI networks, and then evaluates the performance differences of network reconstruction, edge enrichment and the original networks on two real PPI datasets. The experimental results demonstrate that edge enrichment work better than both network reconstruction and original networks. Moreover, for the edge enrichment of PPI networks, the sequence similarity outperformes both local and global similarity. In summary, our study can help biologists select suitable pre-processing schemes and achieve better protein function prediction for PPI networks.
Collapse
Affiliation(s)
- Jiaogen Zhou
- Jiangsu Provincial Engineering Research Center for Intelligent Monitoring and Ecological Management of Pond and Reservoir Water Environment, Huaiyin Normal University, Huian, China
| | - Wei Xiong
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai, China
| | - Yang Wang
- Department of Computer Science and Technology, Tongji University, Shanghai, China
| | - Jihong Guan
- Department of Computer Science and Technology, Tongji University, Shanghai, China
| |
Collapse
|
32
|
Guo X, He H, Yu J, Shi S. PKSPS: a novel method for predicting kinase of specific phosphorylation sites based on maximum weighted bipartite matching algorithm and phosphorylation sequence enrichment analysis. Brief Bioinform 2021; 23:6398688. [PMID: 34661630 DOI: 10.1093/bib/bbab436] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2021] [Revised: 09/10/2021] [Accepted: 09/21/2021] [Indexed: 11/14/2022] Open
Abstract
With the development of biotechnology, a large number of phosphorylation sites have been experimentally confirmed and collected, but only a few of them have kinase annotations. Since experimental methods to detect kinases at specific phosphorylation sites are expensive and accidental, some computational methods have been proposed to predict the kinase of these sites, but most methods only consider single sequence information or single functional network information. In this study, a new method Predicting Kinase of Specific Phosphorylation Sites (PKSPS) is developed to predict kinases of specific phosphorylation sites in human proteins by combining PKSPS-Net with PKSPS-Seq, which considers protein-protein interaction (PPI) network information and sequence information. For PKSPS-Net, kinase-kinase and substrate-substrate similarity are quantified based on the topological similarity of proteins in the PPI network, and maximum weighted bipartite matching algorithm is proposed to predict kinase-substrate relationship. In PKSPS-Seq, phosphorylation sequence enrichment analysis is used to analyze the similarity of local sequences around phosphorylation sites and predict the kinase of specific phosphorylation sites (KSP). PKSPS has been proved to be more effective than the PKSPS-Net or PKSPS-Seq on different sets of kinases. Further comparison results show that the PKSPS method performs better than existing methods. Finally, the case study demonstrates the effectiveness of the PKSPS in predicting kinases of specific phosphorylation sites. The open source code and data of the PKSPS can be obtained from https://github.com/guoxinyunncu/PKSPS.
Collapse
Affiliation(s)
- Xinyun Guo
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| | - Huan He
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| | - Jialin Yu
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| | - Shaoping Shi
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| |
Collapse
|