1
Chang X, Zhu Y, Chen Y, Li L. DeepNphos: A deep-learning architecture for prediction of N-phosphorylation sites. Comput Biol Med 2024; 170:108079. [PMID: 38295472 DOI: 10.1016/j.compbiomed.2024.108079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Revised: 01/25/2024] [Accepted: 01/27/2024] [Indexed: 02/02/2024]
Abstract
MOTIVATION Phosphorylation, a prevalent post-translational modification, plays a crucial role in regulating cellular activities. It encompasses O-phosphorylation (e.g., phosphoserine) and N-phosphorylation (e.g., phospho-lysine (pK), phospho-arginine (pR), and phospho-histidine (pH)). While significant research has focused on O-phosphorylation, yielding various algorithms that predict O-phosphorylation sites with commendable performance, models designed to predict N-phosphorylation sites have been notably absent. This study introduces an integrated model named DeepNphos for predicting N-phosphorylation sites, developed from the analysis of thousands of experimentally identified pK, pR and pH sites. RESULTS We observe that the Convolutional Neural Network (CNN) model with the One-Hot encoding feature performs favorably compared with other models when predicting pK, pR, and pH sites. Additionally, pK exhibits similarities to other lysine modification types, and integrating the CNN model with a deep-transfer learning (DTL) strategy based on tens of thousands of known lysine modification sites could enhance pK prediction performance. In contrast, pR exhibits little similarity to other arginine modification types, and the integration of DTL has minimal impact on pR prediction performance. Furthermore, we refrained from incorporating the DTL strategy in predicting pH sites, given the scarcity of histidine modification sites other than pH. The final classifiers for predicting pK, pR, and pH sites achieve AUC values of 0.856, 0.805 and 0.802, respectively, in ten-fold cross-validation. Overall, DeepNphos is the first classifier for predicting N-phosphorylation sites, accessible at https://github.com/ChangXulinmessi/DeepNPhos.
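As an illustration of the One-Hot encoding feature mentioned in this abstract, the following minimal Python sketch extracts fixed-length sequence windows centred on candidate residues (K, R or H) and encodes each position as a 20-dimensional 0/1 vector. The window half-width `flank`, the padding symbol `PAD` and the function names are assumptions made for illustration; they are not taken from the DeepNphos code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical padding symbol for positions beyond the protein termini


def extract_windows(sequence, target, flank=3):
    """Return (1-based position, window) pairs centred on every `target` residue."""
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == target:
            # Residue i of `sequence` sits at index i + flank in `padded`,
            # so the slice below places it in the middle of the window.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


def one_hot(window):
    """Encode a window as a list of 20-dimensional 0/1 vectors.

    Padding (or any non-standard symbol) becomes an all-zero vector.
    """
    return [[1 if residue == aa else 0 for aa in AMINO_ACIDS] for residue in window]
```

A matrix of shape (window length, 20) produced this way is the usual input to a one-dimensional CNN; the choice of all-zero vectors for padding is one common convention among several.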
Affiliation(s)
- Xulin Chang
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, China
- Yafei Zhu
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, China
- Yu Chen
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071, China
- Lei Li
  - School of Health and Life Sciences, University of Health and Rehabilitation Sciences, Qingdao, 266000, China
2
Jiang Y, Yan R, Wang X. PlantNh-Kcr: a deep learning model for predicting non-histone crotonylation sites in plants. Plant Methods 2024; 20:28. [PMID: 38360730 PMCID: PMC10870457 DOI: 10.1186/s13007-024-01157-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/07/2024] [Indexed: 02/17/2024]
Abstract
BACKGROUND Lysine crotonylation (Kcr) is a crucial protein post-translational modification found in histone and non-histone proteins. It plays a pivotal role in regulating diverse biological processes in both animals and plants, including gene transcription and replication, cell metabolism and differentiation, as well as photosynthesis. Despite the significance of Kcr, detecting Kcr sites through biological experiments is often time-consuming and expensive, and only a fraction of crotonylated peptides can be identified. This highlights the need for efficient and rapid prediction of Kcr sites through computational methods. Currently, several machine learning models exist for predicting Kcr sites in humans, yet models tailored for plants are rare. Furthermore, no downloadable Kcr site predictors or datasets have been developed specifically for plants. To address this gap, it is imperative to integrate existing Kcr sites detected in plant experiments and establish a dedicated computational model for plants. RESULTS Most plant Kcr sites are located on non-histone proteins. In this study, we collected non-histone Kcr sites from five plants: wheat, tobacco, rice, peanut, and papaya. We then conducted a comprehensive analysis of the amino acid distribution surrounding these sites. To develop a predictive model for plant non-histone Kcr sites, we combined a convolutional neural network (CNN), a bidirectional long short-term memory network (BiLSTM), and an attention mechanism to build a deep learning model called PlantNh-Kcr. In both five-fold cross-validation and independent tests, PlantNh-Kcr outperformed multiple conventional machine learning models and other deep learning models. Furthermore, we analyzed species-specific effects on the PlantNh-Kcr model and found that a general model trained using data from multiple species outperforms species-specific models.
CONCLUSION PlantNh-Kcr represents a valuable tool for predicting plant non-histone Kcr sites. We expect that this model will aid in addressing key challenges and tasks in the study of plant crotonylation sites.
Affiliation(s)
- Yanming Jiang
  - College of Mathematics and Computer Sciences, Shanxi Normal University, Taiyuan, 030031, China
- Renxiang Yan
  - The Key Laboratory of Marine Enzyme Engineering of Fujian Province, Fuzhou University, Fuzhou, 350002, China
  - College of Biological Science and Engineering, Fuzhou University, Fuzhou, 350002, China
- Xiaofeng Wang
  - College of Mathematics and Computer Sciences, Shanxi Normal University, Taiyuan, 030031, China
3
Yang YH, Yang JT, Liu JF. Lactylation prediction models based on protein sequence and structural feature fusion. Brief Bioinform 2024; 25:bbad539. [PMID: 38385873 PMCID: PMC10939394 DOI: 10.1093/bib/bbad539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 11/14/2023] [Accepted: 12/27/2023] [Indexed: 02/23/2024] Open
Abstract
Lysine lactylation (Kla) is a newly discovered posttranslational modification that is involved in important life activities, such as glycolysis-related cell function, macrophage polarization and nervous system regulation, and has received widespread attention due to the Warburg effect in tumor cells. In this work, we first design a natural language processing method to automatically extract the 3D structural features of Kla sites, avoiding potential biases caused by manually designed structural features. Then, we establish two Kla prediction frameworks, the attention-based feature fusion Kla model (ABFF-Kla) and the embedding-based feature fusion Kla model (EBFF-Kla), which integrate the sequence features and the structure features based on the attention layer and the embedding layer, respectively. The results indicate that ABFF-Kla and EBFF-Kla, which fuse features from protein sequences and spatial structures, have better predictive performance than models that use only sequence features. Our work provides an approach for the automatic extraction of protein structural features, as well as a flexible framework for Kla prediction. The source code and the training data of ABFF-Kla and EBFF-Kla are publicly deposited at: https://github.com/ispotato/Lactylation_model.
Affiliation(s)
- Ye-Hong Yang
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Biochemistry and Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, No.5, Dongdan 3, Dongcheng District Municipality of Beijing, Beijing 100005, China
- Jun-Tao Yang
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Biochemistry and Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, No.5, Dongdan 3, Dongcheng District Municipality of Beijing, Beijing 100005, China
  - Plastic Surgery Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100144, PR China
- Jiang-Feng Liu
  - State Key Laboratory of Common Mechanism Research for Major Diseases, Department of Biochemistry and Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, No.5, Dongdan 3, Dongcheng District Municipality of Beijing, Beijing 100005, China
  - Plastic Surgery Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100144, PR China
4
Zhang T, Jia J, Chen C, Zhang Y, Yu B. BiGRUD-SA: Protein S-sulfenylation sites prediction based on BiGRU and self-attention. Comput Biol Med 2023; 163:107145. [PMID: 37336062 DOI: 10.1016/j.compbiomed.2023.107145] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/13/2023] [Revised: 05/18/2023] [Accepted: 06/06/2023] [Indexed: 06/21/2023]
Abstract
S-sulfenylation is a vital post-translational modification (PTM) of proteins, which is an intermediate in other redox reactions and has implications for signal transduction and protein function regulation. However, there are many restrictions on the experimental identification of S-sulfenylation sites. Therefore, predicting S-sulfenylation sites by computational methods is fundamental to studying protein function and related biological mechanisms. In this paper, we propose a method named BiGRUD-SA, based on a bi-directional gated recurrent unit (BiGRU) and a self-attention mechanism, to predict protein S-sulfenylation sites. We first use AAC, BLOSUM62, AAindex, EAAC and GAAC to extract features and fuse them to obtain the original feature space. Next, we use the SMOTE-Tomek method to handle data imbalance. Then, we input the processed data to the BiGRU and use the self-attention mechanism for further feature extraction. Finally, we input the resulting data to deep neural networks (DNN) to identify S-sulfenylation sites. The accuracies on the training set and the independent test set are 96.66% and 95.91%, respectively, which indicates that our method is conducive to identifying S-sulfenylation sites. Furthermore, we use a data set of S-sulfenylation sites in Arabidopsis thaliana to verify the generalization ability of the BiGRUD-SA method, and obtain better prediction results.
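To make the feature-extraction and fusion step of this abstract concrete, here is a minimal Python sketch of two of the named descriptors, amino acid composition (AAC) and grouped amino acid composition (GAAC), fused by simple concatenation. The five-class residue grouping is a commonly used one and is assumed here, as is concatenation as the fusion operation; the abstract does not state either detail, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Residue grouping assumed for GAAC (a common choice, not taken from the abstract).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def _standard_residues(peptide):
    """Keep only the 20 standard residues (drops padding or unknown symbols)."""
    return [r for r in peptide if r in AMINO_ACIDS]


def aac(peptide):
    """Fraction of each of the 20 standard amino acids in the peptide."""
    residues = _standard_residues(peptide)
    total = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def gaac(peptide):
    """Fraction of residues falling into each physicochemical group."""
    residues = _standard_residues(peptide)
    total = len(residues) or 1
    return [sum(residues.count(aa) for aa in members) / total
            for members in GROUPS.values()]


def fused_features(peptide):
    """Concatenate AAC (20 values) and GAAC (5 values) into one vector."""
    return aac(peptide) + gaac(peptide)
```

Concatenation keeps each descriptor interpretable but lets the longer descriptor dominate the vector length, which is one reason fused features are often rescaled before training.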
Affiliation(s)
- Tingting Zhang
  - College of Computer Science and Technology, Shandong University, Qingdao, 266237, China; College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Jihua Jia
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
- Cheng Chen
  - College of Computer Science and Technology, Shandong University, Qingdao, 266237, China
- Yaqun Zhang
  - College of Mathematics and Big Data, Dezhou University, Dezhou, 253023, China
- Bin Yu
  - College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China; School of Data Science, University of Science and Technology of China, Hefei, 230027, China
5
Yue ZX, Yan TC, Xu HQ, Liu YH, Hong YF, Chen GX, Xie T, Tao L. A systematic review on the state-of-the-art strategies for protein representation. Comput Biol Med 2023; 152:106440. [PMID: 36543002 DOI: 10.1016/j.compbiomed.2022.106440] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/15/2022] [Revised: 12/08/2022] [Accepted: 12/15/2022] [Indexed: 12/23/2022]
Abstract
The study of drug-target protein interaction is a key step in drug research. In recent years, machine learning techniques have become attractive for research, including drug research, due to their automated nature, predictive power, and expected efficiency. Protein representation is a key step in the study of drug-target protein interaction by machine learning and plays a fundamental role in achieving accurate results. With the progress of machine learning, protein representation methods have gradually attracted attention and have consequently developed rapidly. Therefore, in this review, we systematically classify current protein representation methods, comprehensively review them, and discuss the latest advances of interest. According to the information extraction methods and information sources, these representation methods are generally divided into structure-based and sequence-based representation methods. Each primary class can be further divided into specific subcategories. The particular representation methods covered include both traditional and the latest approaches. This review contains a comprehensive assessment of the various methods, which researchers can use as a reference for their specific protein-related research requirements, including drug research.
Affiliation(s)
- Zi-Xuan Yue
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Tian-Ci Yan
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Hong-Quan Xu
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yu-Hong Liu
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Yan-Feng Hong
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Gong-Xing Chen
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Tian Xie
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
- Lin Tao
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
6
Jiang H, Shang S, Sha Y, Zhang L, He N, Li L. EdeepSADPr: an extensive deep-learning architecture for prediction of the in situ crosstalks of serine phosphorylation and ADP-ribosylation. Front Cell Dev Biol 2023; 11:1149535. [PMID: 37187615 PMCID: PMC10175571 DOI: 10.3389/fcell.2023.1149535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2023] [Accepted: 04/17/2023] [Indexed: 05/17/2023] Open
Abstract
The in situ post-translational modification (PTM) crosstalk refers to the interactions between different types of PTMs that occur on the same residue site of a protein. The crosstalk sites generally have different characteristics from those with a single PTM type. Studies targeting the latter's features have been widely conducted, while studies on the former's characteristics are rare. For example, the characteristics of serine phosphorylation (pS) and serine ADP-ribosylation (SADPr) have been investigated, whereas those of their in situ crosstalks (pSADPr) are unknown. In this study, we collected 3,250 human pSADPr, 7,520 SADPr, 151,227 pS and 80,096 unmodified serine sites and explored the features of the pSADPr sites. We found that the characteristics of pSADPr sites are more similar to those of SADPr sites than to those of pS or unmodified serine sites. Moreover, the crosstalk sites are likely to be phosphorylated by some kinase families (e.g., AGC, CAMK, STE and TKL) rather than others (e.g., CK1 and CMGC). Additionally, we constructed three classifiers to predict pSADPr sites from the pS dataset, the SADPr dataset and the protein sequences separately. We built and evaluated five deep-learning classifiers in ten-fold cross-validation and independent test datasets. We also used the classifiers as base classifiers to develop a few stacking-based ensemble classifiers to improve performance. The best classifiers had AUC values of 0.700, 0.914 and 0.954 for recognizing pSADPr sites from the SADPr, pS and unmodified serine sites, respectively. The lowest prediction accuracy was obtained when separating pSADPr and SADPr sites, which is consistent with the observation that pSADPr's characteristics are more similar to those of SADPr than to the rest. Finally, we developed an online tool for extensively predicting human pSADPr sites based on the CNNOH classifier, dubbed EdeepSADPr. It is freely available through http://edeepsadpr.bioinfogo.org/. We expect our investigation will promote a comprehensive understanding of crosstalks.
Affiliation(s)
- Haoqiang Jiang
  - College of Basic Medicine, Qingdao University, Qingdao, China
  - Sino Genomics Technology Co., Ltd., Qingdao, China
- Shipeng Shang
  - College of Basic Medicine, Qingdao University, Qingdao, China
- Yutong Sha
  - College of Basic Medicine, Qingdao University, Qingdao, China
- Lin Zhang
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Ningning He
  - College of Basic Medicine, Qingdao University, Qingdao, China
- Lei Li
  - College of Basic Medicine, Qingdao University, Qingdao, China
  - Faculty of Biomedical and Rehabilitation Engineering, University of Health and Rehabilitation Sciences, Qingdao, China
  - *Correspondence: Lei Li
7
Meng Y, Zhang L, Zhang L, Wang Z, Wang X, Li C, Chen Y, Shang S, Li L. CysModDB: a comprehensive platform with the integration of manually curated resources and analysis tools for cysteine posttranslational modifications. Brief Bioinform 2022; 23:6775608. [PMID: 36305460 PMCID: PMC9677505 DOI: 10.1093/bib/bbac460] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/25/2022] [Revised: 08/27/2022] [Accepted: 09/26/2022] [Indexed: 12/14/2022] Open
Abstract
The unique chemical reactivity of cysteine residues results in various posttranslational modifications (PTMs), which are implicated in regulating a range of fundamental biological processes. With the advent of chemical proteomics technology, thousands of cysteine PTM (CysPTM) sites have been identified from multiple species. A few CysPTM-based databases have been developed, but they mainly focus on data collection rather than on annotation and analytical integration. Here, we present a platform, dubbed CysModDB, that integrates comprehensive CysPTM resources and analysis tools. CysModDB contains five parts: (1) 70 536 experimentally verified CysPTM sites with annotations of sample origin and enrichment techniques, (2) 21 654 modified proteins annotated with functional regions and structure information, (3) cross-references to external databases such as the protein-protein interactions database, (4) online computational tools for predicting CysPTM sites and (5) integrated analysis tools such as gene enrichment and investigation of sequence features. These parts are integrated using a customized graphic browser and a Basket. The browser uses graphs to represent the distribution of modified sites with different CysPTM types on protein sequences and maps these sites to the protein structures and functional regions, which assists in exploring cross-talks between the modified sites and their potential effect on protein functions. The Basket connects proteins and CysPTM sites to the analysis tools. In summary, CysModDB is an integrated platform to facilitate CysPTM research, freely accessible via https://cysmoddb.bioinfogo.org/.
Affiliation(s)
- Laizhi Zhang
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Ziyu Wang
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Xuanwen Wang
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Chan Li
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Yu Chen
  - College of Computer Science and Technology, Qingdao University, Qingdao, China
- Shipeng Shang
- Lei Li
- Corresponding authors: Lei Li, Faculty of Biomedical and Rehabilitation Engineering, University of Health and Rehabilitation Sciences, Qingdao 266071, China. Tel/Fax: +86 532 8581 2983; Shipeng Shang, School of Basic Medicine, Qingdao University, Qingdao 266071, China. Tel.: +86 532 8595 1111; Fax: +86 532 8581 2983
8
Zhao J, Jiang H, Zou G, Lin Q, Wang Q, Liu J, Ma L. CNNArginineMe: A CNN structure for training models for predicting arginine methylation sites based on the One-Hot encoding of peptide sequence. Front Genet 2022; 13:1036862. [PMID: 36324513 PMCID: PMC9618650 DOI: 10.3389/fgene.2022.1036862] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/06/2022] [Accepted: 10/04/2022] [Indexed: 11/30/2022] Open
Abstract
Protein arginine methylation (PRme), a post-translational modification, plays a critical role in numerous cellular processes and regulates critical cellular functions. Though several in silico models for predicting PRme sites have been reported, new models may need to be developed because the number of identified PRme sites has increased significantly. In this study, we constructed multiple machine-learning and deep-learning models. The deep-learning CNN model combined with the One-Hot encoding showed the best performance and was dubbed CNNArginineMe. CNNArginineMe achieved the best AUC in comparisons with several reported predictors. Additionally, we employed CNNArginineMe to predict the arginine methylation proteome and performed functional analysis. The arginine-methylated proteome is significantly enriched in the amyotrophic lateral sclerosis (ALS) pathway. CNNArginineMe is freely available at https://github.com/guoyangzou/CNNArginineMe.
Affiliation(s)
- Jiaojiao Zhao
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Haoqiang Jiang
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Guoyang Zou
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Qian Lin
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- Qiang Wang
  - Oncology Department, Shandong Second Provincial General Hospital, Jinan, China
- Jia Liu
  - Department of Pharmacology, School of Pharmacy, Qingdao University, Qingdao, China
- Leina Ma
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
  - *Correspondence: Leina Ma
9
Zhu Y, Liu Y, Chen Y, Li L. ResSUMO: A Deep Learning Architecture Based on Residual Structure for Prediction of Lysine SUMOylation Sites. Cells 2022; 11:2646. [PMID: 36078053 PMCID: PMC9454673 DOI: 10.3390/cells11172646] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/25/2022] [Revised: 08/18/2022] [Accepted: 08/22/2022] [Indexed: 12/26/2022] Open
Abstract
Lysine SUMOylation plays an essential role in various biological functions. Several approaches integrating various algorithms have been developed for predicting SUMOylation sites based on a limited dataset. Recently, the number of identified SUMOylation sites has significantly increased due to investigation at the proteomics scale. We collected modification data and found that the reported approaches performed poorly on our collected data. Therefore, it is essential to explore the characteristics of this modification and construct prediction models with improved performance based on an enlarged dataset. In this study, we constructed and compared 16 classifiers by integrating four different algorithms and four encoding features selected from 11 sequence-based or physicochemical features. We found that the convolutional neural network (CNN) model integrated with a residual structure, dubbed ResSUMO, performed favorably when compared with the traditional machine learning and CNN models in both cross-validation and independent tests. The area under the receiver operating characteristic (ROC) curve for ResSUMO was around 0.80, superior to that of the reported predictors. We also found that increasing the depth of neural networks in the CNN models did not improve prediction performance due to the degradation problem, but the residual structure could be included to optimize the neural networks and improve performance. This indicates that residual neural networks have the potential to be broadly applied in the prediction of other types of modification sites with great effectiveness and robustness. Furthermore, the online ResSUMO service is freely accessible.
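Several abstracts in this list, including this one, report the area under the ROC curve (AUC). As a reminder of what that number measures, the minimal Python sketch below computes AUC from its rank interpretation: the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half. The function name and the quadratic pairwise loop are illustrative choices, not the evaluation code of any cited predictor.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0/1 ground-truth values
    scores: iterable of predicted scores (higher means more likely positive)
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value around 0.80 means roughly four out of five positive/negative pairs are ordered correctly.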
Affiliation(s)
- Yafei Zhu
  - College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Yuhai Liu
  - Dawning International Information Industry, Co., Ltd., Qingdao 266101, China
- Yu Chen
  - College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
- Lei Li
  - College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
  - Faculty of Biomedical and Rehabilitation Engineering, University of Health and Rehabilitation Sciences, Qingdao 266001, China
10
Mini-review: Recent advances in post-translational modification site prediction based on deep learning. Comput Struct Biotechnol J 2022; 20:3522-3532. [PMID: 35860402 PMCID: PMC9284371 DOI: 10.1016/j.csbj.2022.06.045] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 03/11/2022] [Revised: 06/21/2022] [Accepted: 06/21/2022] [Indexed: 11/23/2022] Open
Abstract
Post-translational modifications (PTMs) are closely linked to numerous diseases, playing a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and disease therapy. Compared to traditional machine learning methods, deep learning approaches for PTM prediction provide accurate and rapid screening, guiding downstream wet-lab experiments to use the screened information for focused studies. In this paper, we review recent works that use deep learning to identify phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarize PTM databases and discuss future directions with critical insights.
Key Words
- AAindex, Amino acid index
- ATP, Adenosine triphosphate
- AUC, Area under curve
- Ac, Acetylation
- BE, Binary encoding
- BLOSUM, Blocks substitution matrix
- Bi-LSTM, Bidirectional LSTM
- CKSAAP, Composition of k-spaced amino acid Pairs
- CNN, Convolutional neural network
- CNNOH, CNN with the one-hot encoding
- CNNWE, CNN with the word-embedding encoding
- CNNrgb, CNN red green blue
- CV, Cross-validation
- DC-CNN, Densely connected convolutional neural network
- DL, Deep learning
- DNNs, Deep neural networks
- Deep learning
- E. coli, Escherichia coli
- EBGW, Encoding based on grouped weight
- EGAAC, Enhanced grouped amino acids content
- IG, Information gain
- K, Lysine
- KNN, k nearest neighbor
- LASSO, Least absolute shrinkage and selection operator
- LSTM, Long short-term memory
- LSTMWE, LSTM with the word-embedding encoding
- M.musculus, Mus musculus
- MDC, Modular densely connected convolutional networks
- MDCAN, Multilane dense convolutional attention network
- ML, Machine learning
- MLP, Multilayer perceptron
- MMI, Multivariate mutual information
- Machine learning
- Mass spectrometry
- NMBroto, Normalized Moreau-Broto autocorrelation
- P, Proline
- PSP, PhosphoSitePlus
- PSSM, Position-specific scoring matrix
- PTM, Post-translational modifications
- Ph, Phosphorylation
- Post-translational modification
- Prediction
- PseAAC, Pseudo-amino acid composition
- R, Arginine
- RF, Random forest
- RNN, Recurrent neural network
- ROC, Receiver operating characteristic
- S, Serine
- S. typhimurium, Salmonella typhimurium
- S.cerevisiae, Saccharomyces cerevisiae
- SE, Squeeze and excitation
- SEV, Split to Equal Validation
- ST, Source and target
- SUMO, Small ubiquitin-like modifier
- SVM, Support vector machines
- T, Threonine
- Ub, Ubiquitination
- Y, Tyrosine
- ZSL, Zero-shot learning
11
Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: Lysine methylome as an example. PLoS Comput Biol 2021; 17:e1009682. [PMID: 34879076 PMCID: PMC8687584 DOI: 10.1371/journal.pcbi.1009682] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2021] [Revised: 12/20/2021] [Accepted: 11/25/2021] [Indexed: 11/19/2022] Open
Abstract
Many computational classifiers have been developed to predict different types of post-translational modification (PTM) sites. Their performance is measured using cross-validation or an independent test, in which experimental data from different sources are mixed and randomly split into training and test sets. However, the self-reported performance of most classifiers under this measure is generally higher than their performance on new experimental data, which suggests that cross-validation overestimates the generalization ability of a classifier. Here, we proposed a generalization estimate method, dubbed the experiment-split test, in which the experimental sources for the training set differ from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome (Kme) as an example and developed a deep learning-based Kme site predictor (called DeepKme) with outstanding performance. We assessed the experiment-split test by comparing it with the cross-validation method and found that the performance measured using the experiment-split test is lower than that measured by cross-validation. Because the test data of the experiment-split method were derived from an independent experimental source, this method could reflect the generalization of the predictor. Therefore, we believe that the experiment-split method can be applied to benchmark the practical performance of a given PTM model. DeepKme is freely accessible via https://github.com/guoyangzou/DeepKme.

The performance of a model for predicting post-translational modification sites is commonly evaluated using cross-validation, where data derived from different experimental sources are mixed and randomly separated into a training dataset and a validation dataset. However, the performance measured through cross-validation is generally higher than the performance on new experimental data, indicating that cross-validation overestimates the generalization of a model. In this study, we proposed a generalization estimate method, dubbed the experiment-split test, in which the experimental sources for the training set differ from those for the test set, so that the test set simulates data derived from a new experiment. We took the prediction of the lysine methylome as an example and developed a deep learning-based Kme site predictor, DeepKme, with outstanding performance. We found that the performance measured by the experiment-split method is lower than that measured by cross-validation. Because the test data of the experiment-split method were derived from an independent experimental source, this method could reflect the generalization of the prediction model. Therefore, the experiment-split method can be applied to benchmark practical prediction performance.
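The contrast between a pooled random split and an experiment-split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DeepKme implementation: the record layout (`site_id`, `label`, `source`), the source names `exp_A` to `exp_C`, and both function names are hypothetical.

```python
import random


def random_split(samples, test_fraction=0.2, seed=0):
    """Conventional split: pool all sources, shuffle, then cut."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def experiment_split(samples, held_out_sources):
    """Experiment-split: every sample from a held-out source goes to the test set."""
    held_out = set(held_out_sources)
    train = [s for s in samples if s["source"] not in held_out]
    test = [s for s in samples if s["source"] in held_out]
    return train, test


# Hypothetical toy records; real data would carry peptide sequences and labels.
samples = [
    {"site_id": i, "label": i % 2, "source": "exp_" + "ABC"[i % 3]}
    for i in range(12)
]

train, test = experiment_split(samples, ["exp_C"])
train_sources = {s["source"] for s in train}
test_sources = {s["source"] for s in test}

pooled_train, pooled_test = random_split(samples)
```

With the experiment-split, `train_sources` and `test_sources` never overlap, so the test set plays the role of a new experiment. The pooled split gives no such guarantee.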
|
12
|
Al-Saggaf UM, Usman M, Naseem I, Moinuddin M, Jiman AA, Alsaggaf MU, Alshoubaki HK, Khan S. ECM-LSE: Prediction of Extracellular Matrix Proteins Using Deep Latent Space Encoding of k-Spaced Amino Acid Pairs. Front Bioeng Biotechnol 2021; 9:752658. [PMID: 34722479 PMCID: PMC8552119 DOI: 10.3389/fbioe.2021.752658] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/03/2021] [Accepted: 09/13/2021] [Indexed: 12/26/2022]
Abstract
Extracellular matrix (ECM) proteins create complex networks of macromolecules that fill the extracellular spaces of living tissues. They provide structural support and play an important role in maintaining cellular functions. Identification of ECM proteins can play a vital role in studying various types of diseases. Conventional wet lab-based methods are reliable; however, they are expensive and time-consuming and are, therefore, not scalable. In this research, we propose a novel sequence-based machine learning approach for the prediction of ECM proteins. In the proposed method, composition of k-spaced amino acid pair (CKSAAP) features are encoded into a classifiable latent space (LS) with the help of deep latent space encoding (LSE). A comprehensive ablation analysis is conducted to evaluate the performance of the proposed method. Results are compared with other state-of-the-art methods on the benchmark dataset, and the proposed ECM-LSE approach is shown to comprehensively outperform the contemporary methods.
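As an illustration of the CKSAAP features mentioned above, the following Python sketch computes the normalized frequency of each of the 400 amino acid pairs separated by k residues, for every gap from 0 to `k_max`. It follows the commonly used definition of CKSAAP. The function name, the default `k_max`, and the handling of non-standard residues (ignored) are assumptions, and the abstract does not state which gap values ECM-LSE uses.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order: AA, AC, AD, ...
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, k_max=2):
    """Return a 400 * (k_max + 1) vector of k-spaced pair frequencies."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        # Number of positions i for which residue i + k + 1 exists.
        n_windows = max(len(sequence) - k - 1, 0)
        for i in range(n_windows):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # pairs with non-standard residues are ignored
                counts[pair] += 1
        total = max(n_windows, 1)  # avoid division by zero on short input
        features.extend(counts[p] / total for p in PAIRS)
    return features


vec = cksaap("ACDACD", k_max=1)
```

For the toy sequence `ACDACD`, the pair `AC` occurs in two of the five adjacent windows (k = 0), and the pair `AD` occurs in two of the four windows with one residue in between (k = 1). In ECM-LSE, vectors of this kind are the input that the latent space encoder compresses before classification.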
Affiliation(s)
- Ubaid M. Al-Saggaf
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Muhammad Usman
  - Department of Computer Engineering, Chosun University, Gwangju, South Korea
- Imran Naseem
  - Research and Development, Love For Data, Karachi, Pakistan
  - School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, WA, Australia
  - College of Engineering, Karachi Institute of Economics and Technology, Korangi Creek, Karachi, Pakistan
- Muhammad Moinuddin
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmad A. Jiman
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Mohammed U. Alsaggaf
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
- Hitham K. Alshoubaki
  - Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
  - Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Shujaat Khan
  - Department of Bio and Brain Engineering, Daejeon, South Korea
|
13
|
Sha Y, Ma C, Wei X, Liu Y, Chen Y, Li L. DeepSADPr: A hybrid-learning architecture for serine ADP-ribosylation site prediction. Methods 2021; 203:575-583. [PMID: 34560250 DOI: 10.1016/j.ymeth.2021.09.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/14/2021] [Revised: 09/13/2021] [Accepted: 09/16/2021] [Indexed: 01/28/2023]
Abstract
Protein adenosine diphosphate-ribosylation (ADPr) is caused by the covalent binding of one or more ADP-ribose moieties to a target protein and regulates the biological functions of the target protein. To fully understand the regulatory mechanism of ADP-ribosylation, the essential step is the identification of ADPr sites across the proteome. As experimental approaches are costly and time-consuming, it is necessary to develop a computational tool to predict ADPr sites. Recently, serine has been found to be the major residue type for ADP-ribosylation, but no predictor is available. In this study, we collected thousands of experimentally validated human ADPr sites on serine residues and constructed several different machine-learning classifiers. We found that the hybrid model, dubbed DeepSADPr, which integrates a one-dimensional convolutional neural network (CNN) with the One-Hot encoding approach and the word-embedding approach, compared favourably to other models in terms of both ten-fold cross-validation and an independent test. Its AUC value reached 0.935 for ten-fold cross-validation. Its sensitivity, accuracy and Matthews correlation coefficient reached 0.933, 0.867 and 0.740, respectively, at a fixed specificity of 0.80. Overall, DeepSADPr is the first classifier for predicting serine ADPr sites and is available at http://www.bioinfogo.org/DeepSADPr.
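The reported sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) are linked by the confusion matrix. The sketch below recomputes accuracy and MCC from sensitivity and specificity. It assumes an evaluation set with equal numbers of positive and negative sites, which the abstract does not state, and the function name and the `pos_fraction` parameter are hypothetical.

```python
import math


def metrics_from_rates(sensitivity, specificity, pos_fraction=0.5):
    """Derive accuracy and MCC from sensitivity and specificity.

    pos_fraction is the share of positive samples; 0.5 is an assumption,
    not a figure taken from the abstract.
    """
    # Confusion-matrix cells expressed as fractions of all samples.
    tp = sensitivity * pos_fraction
    fn = (1 - sensitivity) * pos_fraction
    tn = specificity * (1 - pos_fraction)
    fp = (1 - specificity) * (1 - pos_fraction)

    accuracy = tp + tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


acc, mcc = metrics_from_rates(0.933, 0.80)
print(f"accuracy ~ {acc:.2f}, MCC ~ {mcc:.2f}")
```

Under the balanced-set assumption, the recomputed values land within rounding of the reported accuracy and MCC, which suggests the four figures describe a single operating point. With a different class ratio, accuracy and MCC would shift even at the same sensitivity and specificity.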
Affiliation(s)
- Yutong Sha
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Chenglong Ma
  - College of Life Sciences, Qingdao University, Qingdao 266071, China
- Xilin Wei
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Yuhai Liu
  - Dawning International Information Industry, Co., Ltd., Qingdao 266101, China
- Yu Chen
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
- Lei Li
  - College of Computer Science & Technology, Qingdao University, Qingdao 266071, China
  - School of Basic Medicine, Qingdao University, Qingdao 266071, China
  - College of Life Sciences, Qingdao University, Qingdao 266071, China
|
14
|
Nallapareddy V, Bogam S, Devarakonda H, Paliwal S, Bandyopadhyay D. DeepCys: Structure-based multiple cysteine function prediction method trained on deep neural network: Case study on domains of unknown functions belonging to COX2 domains. Proteins 2021; 89:745-761. [PMID: 33580578 DOI: 10.1002/prot.26056] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/15/2020] [Accepted: 01/31/2021] [Indexed: 12/29/2022]
Abstract
Cysteine (Cys) is the most reactive amino acid and participates in a wide range of biological functions. In silico predictions complement experiments in meeting the need for functional characterization. Algorithms that predict multiple Cys functions are scarce, in contrast to algorithms that predict a specific function. Here we present a deep neural network-based method for predicting multiple Cys functions, available as a web server (DeepCys) (https://deepcys.herokuapp.com/). The DeepCys model was trained and tested on two independent datasets curated from protein crystal structures. The method requires three inputs for a given Cys, namely the PDB identifier (ID), chain ID and residue ID. It outputs the probabilities of four cysteine functions, namely disulphide, metal-binding, thioether and sulphenylation, and predicts the most probable Cys function. The algorithm exploits local and global protein properties, such as sequence and secondary structure motifs, buried fractions, microenvironments and protein/enzyme class. DeepCys outperformed most of the multiple and specific Cys function algorithms. This method can predict the maximum number of cysteine functions. Moreover, it is the first to explicitly predict the thioether function. This tool was used to elucidate the cysteine functions in domains of unknown function belonging to cytochrome c oxidase subunit II-like transmembrane domains. Apart from the web server, a standalone program is also available on GitHub (https://github.com/vam-sin/deepcys).
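The input and output contract described above (three identifiers in, four class probabilities out, then the most probable class) can be sketched as follows. This is not the DeepCys code or its API: the `CysQuery` type, the function name, the placeholder identifiers and the probability values are all made up for illustration, and the ordering of the four classes is an assumption.

```python
from dataclasses import dataclass

# The four functions named in the abstract; the ordering here is an assumption.
CYS_FUNCTIONS = ("disulphide", "metal-binding", "thioether", "sulphenylation")


@dataclass(frozen=True)
class CysQuery:
    """The three inputs the abstract lists for one cysteine residue."""
    pdb_id: str
    chain_id: str
    residue_id: int


def most_probable_function(probabilities):
    """Pick the class with the highest probability from a 4-element sequence."""
    if len(probabilities) != len(CYS_FUNCTIONS):
        raise ValueError("expected one probability per cysteine function")
    best = max(range(len(CYS_FUNCTIONS)), key=lambda i: probabilities[i])
    return CYS_FUNCTIONS[best], probabilities[best]


query = CysQuery(pdb_id="1ABC", chain_id="A", residue_id=42)  # placeholder IDs
probs = [0.10, 0.70, 0.05, 0.15]  # made-up model output, not real predictions
label, confidence = most_probable_function(probs)
```

Returning the full probability vector alongside the top label lets a user see how close the runner-up classes are.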
Affiliation(s)
- Vamsi Nallapareddy
  - Department of Biological Sciences, Birla Institute of Technology and Science, Hyderabad, Telangana, India
- Shubham Bogam
  - Department of Biological Sciences, Birla Institute of Technology and Science, Hyderabad, Telangana, India
- Himaja Devarakonda
  - Department of Biological Sciences, Birla Institute of Technology and Science, Hyderabad, Telangana, India
- Shubham Paliwal
  - Department of Biological Sciences, Birla Institute of Technology and Science, Hyderabad, Telangana, India
- Debashree Bandyopadhyay
  - Department of Biological Sciences, Birla Institute of Technology and Science, Hyderabad, Telangana, India
|