1
Parvez A, Ali SD, Tayara H, Chong KT. Stacking based ensemble learning framework for identification of nitrotyrosine sites. Comput Biol Med 2024; 183:109200. [PMID: 39366143 DOI: 10.1016/j.compbiomed.2024.109200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Revised: 09/02/2024] [Accepted: 09/22/2024] [Indexed: 10/06/2024]
Abstract
Protein nitrotyrosine is an essential post-translational modification that results from the nitration of tyrosine residues. This modification is known to be associated with the regulation and characterization of several biological functions and diseases. Accurate identification of nitrotyrosine sites therefore plays a significant role in elucidating the associated biological processes. In this regard, we report an accurate computational tool, iNTyro-Stack, for the identification of protein nitrotyrosine sites. iNTyro-Stack is a machine-learning model based on a stacking algorithm, in which the base classifiers are selected for their performance. The feature map is a linear combination of amino acid composition encoding schemes, including the composition of k-spaced amino acid pairs and tri-peptide composition. Recursive feature elimination is used to select the significant features. The performance of the proposed method is evaluated using k-fold cross-validation and independent testing. iNTyro-Stack achieved an accuracy of 86.3% and a Matthews correlation coefficient (MCC) of 72.6% in cross-validation. Its generalization capability was further validated on an imbalanced independent test set, where it attained an accuracy of 69.32%. iNTyro-Stack outperforms existing state-of-the-art methods under both evaluation approaches. A GitHub repository for reproducing the method and results of iNTyro-Stack is available at: https://github.com/waleed551/iNTyro-Stack/.
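The two encodings named in the abstract above can be illustrated with a minimal sketch. It is not taken from the iNTyro-Stack repository: the function names, the `k_max=2` gap limit, and the reading of "linear combination" as plain concatenation are all assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cksaap(seq, k_max=2):
    """Composition of k-spaced amino acid pairs for gaps k = 0..k_max.

    Returns 400 normalized pair frequencies per gap value.
    k_max=2 is an illustrative choice, not a value from the abstract.
    """
    features = []
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        n_pairs = 0
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
                n_pairs += 1
        features.extend(c / n_pairs if n_pairs else 0.0 for c in counts.values())
    return features


def tripeptide_composition(seq):
    """Normalized frequency of each of the 8000 possible tripeptides."""
    counts = {"".join(t): 0 for t in product(AMINO_ACIDS, repeat=3)}
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            total += 1
    return [c / total if total else 0.0 for c in counts.values()]


def encode(seq, k_max=2):
    """Concatenate both encodings (assumed combination scheme)."""
    return cksaap(seq, k_max) + tripeptide_composition(seq)
```

With these assumptions, `encode("ACDEYGHIKLM")` yields a vector of 3 × 400 + 8000 = 9200 values, which is the kind of high-dimensional input that a feature-selection step such as recursive feature elimination would then prune.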
Affiliation(s)
- Aiman Parvez
  - Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju, 54896, South Korea
- Syed Danish Ali
  - Department of Electrical Engineering, The University of Azad Jammu and Kashmir, Muzaffarabad, 13100, Pakistan; Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
- Hilal Tayara
  - Department of International Science and Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
- Kil To Chong
  - Department of International Science and Engineering, Jeonbuk National University, Jeonju, 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju, 54896, South Korea
2
Wang X, Zhang Z, Liu C. iACP-DFSRA: identification of anticancer peptides based on a dual-channel fusion strategy of ResCNN and Attention. J Mol Biol 2024:168810. [PMID: 39362624 DOI: 10.1016/j.jmb.2024.168810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Revised: 09/10/2024] [Accepted: 09/27/2024] [Indexed: 10/05/2024]
Abstract
Anticancer peptides (ACPs) have been widely applied in the treatment of cancer owing to their good safety, acceptable side effects, and high selectivity. However, the number of experimentally validated ACPs is limited because experimental identification is extremely expensive. Hence, accurate and cost-effective identification methods for ACPs are urgently needed. In this work, we propose a deep learning-based model, named iACP-DFSRA, for ACP identification. Specifically, we adopt two kinds of sequence embedding, the ProtBert_BFD pre-trained language model and handcrafted features, to encode protein sequences. LightGBM is then used for feature selection, and the selected features are fed into a ResCNN and an Attention mechanism to extract local and global features, respectively. Finally, the concatenated features are deeply fused by an Attention mechanism so that the model pays more attention to key features, and predictions are made by a fully connected layer. The results of 10-fold cross-validation demonstrated that the iACP-DFSRA model delivered improved results in most metrics, with Sp of 94.15%, Sn of 95.32%, Acc of 94.74% and MCC of 89.48%, compared to the latest AACFlow model. Indeed, the iACP-DFSRA model is the only model with Acc > 90% and MCC > 80% on the independent test dataset. We further demonstrated the superiority of our model on additional datasets. In addition, t-SNE and SHAP interpretation analysis demonstrated that it is crucial to use two channels for feature extraction and to use the Attention mechanism for deep fusion, which helps iACP-DFSRA predict ACPs more effectively.
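The four metrics quoted in the abstract above (Sn, Sp, Acc, MCC) all derive from a single confusion matrix. The sketch below shows the standard formulas. The counts are hypothetical: they assume a balanced set of 10,000 positives and 10,000 negatives, scaled so that Sn and Sp match the reported 95.32% and 94.15%. The abstract does not state the class balance, so this is only a consistency check under that assumption.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical balanced counts chosen to reproduce the reported Sn and Sp.
m = binary_metrics(tp=9532, fn=468, tn=9415, fp=585)
print({name: round(value, 4) for name, value in m.items()})
```

Under the balanced-class assumption, accuracy is simply the mean of Sn and Sp, and the resulting Acc and MCC land close to the reported 94.74% and 89.48%. Note that MCC is natively a value in [-1, 1]; the abstracts in this list report it as a percentage.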
Affiliation(s)
- Xin Wang
  - School of Science, Dalian Maritime University, Dalian, 116026, China
- Zimeng Zhang
  - School of Science, Dalian Maritime University, Dalian, 116026, China
- Chang Liu
  - School of Science, Dalian Maritime University, Dalian, 116026, China
3
Spadaro A, Sharma A, Dehzangi I. Predicting lysine methylation sites using a convolutional neural network. Methods 2024; 226:127-132. [PMID: 38604414 DOI: 10.1016/j.ymeth.2024.04.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Revised: 12/15/2023] [Accepted: 04/07/2024] [Indexed: 04/13/2024] Open
Abstract
Protein lysine methylation is a type of post-translational modification that plays an important role in the regulation of both histone and non-histone protein function. Deregulation caused by lysine methyltransferases has been identified as the cause of several diseases, including cancer as well as both mental and developmental disorders. Identifying lysine methylation sites is a critical step in both early diagnosis and drug design. This study proposes a new machine learning method called CNN-Meth for predicting lysine methylation sites using a convolutional neural network (CNN). Our model is trained using evolutionary, structural, and physicochemical-based representations along with binary encoding. Unlike previous studies, instead of extracting handcrafted features, we use a CNN to automatically extract features from different representations of amino acids to avoid information loss. Automated feature extraction from these representations of amino acids, as well as a CNN as a classifier, has never been used for this problem. Our results demonstrate that CNN-Meth can significantly outperform previous methods for predicting methylation sites. It achieves 96.0%, 85.1%, 96.4%, and 0.65 in terms of accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC), respectively. CNN-Meth and its source code are publicly available at https://github.com/MLBC-lab/CNN-Meth.
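The binary encoding mentioned in the abstract above is commonly implemented as a one-hot matrix over a fixed window centered on each candidate lysine. The following is a minimal sketch of that idea, not code from the CNN-Meth repository. The function names, the `flank` width, and the `X` padding symbol are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def lysine_windows(seq, flank=3, pad="X"):
    """Yield (position, window) for every lysine (K) in the sequence.

    Each window has 2 * flank + 1 residues with the lysine at its center.
    Sequence ends are padded with a placeholder symbol (assumed to be 'X').
    """
    padded = pad * flank + seq + pad * flank
    for i, residue in enumerate(seq):
        if residue == "K":
            yield i, padded[i:i + 2 * flank + 1]


def one_hot(window):
    """Binary-encode a window as a len(window) x 20 matrix.

    Padding and non-standard residues become all-zero rows.
    """
    return [[1 if aa == residue else 0 for aa in AMINO_ACIDS] for residue in window]


for position, window in lysine_windows("MAKRTK", flank=2):
    matrix = one_hot(window)
    print(position, window, len(matrix), "x", len(matrix[0]))
```

A matrix of this shape is what a convolutional layer would scan along the sequence axis. The evolutionary, structural, and physicochemical representations named in the abstract would add further channels or columns per residue, which this sketch does not attempt to reproduce.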
Affiliation(s)
- Austin Spadaro
  - Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
- Alok Sharma
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Iman Dehzangi
  - Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States; Department of Computer Science, Rutgers University, Camden, NJ, United States
4
Shrestha P, Kandel J, Tayara H, Chong KT. DL-SPhos: Prediction of serine phosphorylation sites using transformer language model. Comput Biol Med 2024; 169:107925. [PMID: 38183701 DOI: 10.1016/j.compbiomed.2024.107925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 12/21/2023] [Accepted: 01/01/2024] [Indexed: 01/08/2024]
Abstract
Serine phosphorylation plays a pivotal role in various cellular processes and in the pathogenesis of many diseases. Roughly 81% of human diseases have links to phosphorylation, and an overwhelming 86.4% of protein phosphorylation takes place at serine residues. In eukaryotes, over a quarter of proteins undergo phosphorylation, with more than half implicated in numerous disorders, notably cancer and reproductive system diseases. This study primarily focuses on serine-phosphorylation-driven pathogenesis and the critical role of conserved motif identification. Traditional wet-lab experiments for locating serine phosphorylation sites are resource-intensive, and numerous computational techniques exist for predicting these sites. Our paper introduces a deep learning tool for predicting serine (S) phosphorylation sites, integrating explainable AI for motif identification, a transformer language model, and deep neural network components. We trained our model on protein sequences from UniProt, validated it against the dbPTM benchmark dataset, and employed the PTMD dataset to explore motifs related to mammalian disorders. Our results highlight that our model surpasses other deep learning predictors by a significant 3%. Furthermore, we utilized the local interpretable model-agnostic explanations (LIME) approach to shed light on the predictions, emphasizing the amino acid residues crucial for S phosphorylation. Notably, our model also outperformed competitors in kinase-specific serine phosphorylation prediction on benchmark datasets.
Affiliation(s)
- Palistha Shrestha
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
- Jeevan Kandel
  - Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
5
Braghetto A, Orlandini E, Baiesi M. Interpretable Machine Learning of Amino Acid Patterns in Proteins: A Statistical Ensemble Approach. J Chem Theory Comput 2023; 19:6011-6022. [PMID: 37552831 PMCID: PMC10500975 DOI: 10.1021/acs.jctc.3c00383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Indexed: 08/10/2023]
Abstract
Explainable and interpretable unsupervised machine learning helps one to understand the underlying structure of data. We introduce an ensemble analysis of machine learning models to consolidate their interpretation. Its application shows that restricted Boltzmann machines compress consistently into a few bits the information stored in a sequence of five amino acids at the start or end of α-helices or β-sheets. The weights learned by the machines reveal unexpected properties of the amino acids and the secondary structure of proteins: (i) His and Thr have a negligible contribution to the amphiphilic pattern of α-helices; (ii) there is a class of α-helices particularly rich in Ala at their end; (iii) Pro occupies most often slots otherwise occupied by polar or charged amino acids, and its presence at the start of helices is relevant; (iv) Glu and especially Asp on one side and Val, Leu, Iso, and Phe on the other display the strongest tendency to mark amphiphilic patterns, i.e., extreme values of an effective hydrophobicity, though they are not the most powerful (non)hydrophobic amino acids.
Affiliation(s)
- Anna Braghetto
  - Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
  - INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
- Enzo Orlandini
  - Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
  - INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
- Marco Baiesi
  - Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
  - INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
6
ACP-ADA: A Boosting Method with Data Augmentation for Improved Prediction of Anticancer Peptides. Int J Mol Sci 2022; 23:ijms232012194. [PMID: 36293050 PMCID: PMC9603247 DOI: 10.3390/ijms232012194] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/29/2022] [Revised: 10/08/2022] [Accepted: 10/11/2022] [Indexed: 11/30/2022] Open
Abstract
Cancer is the second-leading cause of death worldwide, and therapeutic peptides that target and destroy cancer cells have received a great deal of interest in recent years. Traditional wet-lab experiments are expensive and inefficient for identifying novel anticancer peptides (ACPs); therefore, the development of an effective computational approach is essential to recognize ACP candidates before experimental methods are used. In this study, we propose ACP-ADA, an AdaBoost algorithm with random forest as the base learner, which integrates binary profile features, amino acid index, and amino acid composition into a 210-dimensional feature vector to represent the peptides. Training samples were augmented in the feature space to increase the sample size and further improve the performance of the model when samples are insufficient. Furthermore, we used five-fold cross-validation to find model parameters, and the cross-validation results showed that, with this feature combination and data augmentation, ACP-ADA outperforms existing methods on the reported performance metrics. Specifically, ACP-ADA recorded an average accuracy of 86.4% and a Matthews correlation coefficient of 74.01% for dataset ACP740, and 90.83% and 81.65% for dataset ACP240; consequently, it can be a very useful tool in drug development and biomedical research.
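Augmenting training samples in feature space, as described in the abstract above, can be sketched as follows. The abstract does not say how the new samples were generated, so the Gaussian perturbation, the `scale` and `copies` parameters, and the function name are all assumptions for illustration rather than the ACP-ADA procedure.

```python
import random


def augment(features, labels, copies=1, scale=0.01, seed=0):
    """Return the original samples plus noisy copies of each one.

    Every copy adds small zero-mean Gaussian noise to each feature value
    and keeps the label of the sample it was derived from.
    """
    rng = random.Random(seed)  # fixed seed so the augmentation is repeatable
    aug_x, aug_y = list(features), list(labels)
    for x, y in zip(features, labels):
        for _ in range(copies):
            aug_x.append([v + rng.gauss(0.0, scale) for v in x])
            aug_y.append(y)
    return aug_x, aug_y


# Toy 2-dimensional vectors standing in for the 210-dimensional features.
X = [[0.10, 0.20], [0.30, 0.40]]
y = [1, 0]
X_aug, y_aug = augment(X, y, copies=2)
print(len(X_aug), y_aug)
```

In a cross-validation setting, augmentation of this kind would normally be applied to the training folds only, so that perturbed copies of a validation sample never leak into training.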