1
Zuo Y, Fang X, Wan J, He W, Liu X, Zeng X, Deng Z. PreMLS: The undersampling technique based on ClusterCentroids to predict multiple lysine sites. PLoS Comput Biol 2024; 20:e1012544. [PMID: 39436947 DOI: 10.1371/journal.pcbi.1012544] [Received: 06/12/2024] [Accepted: 10/09/2024] [Indexed: 10/25/2024] Open
Abstract
After translation, a protein can undergo modifications in which small chemical moieties are covalently attached to lysine residues. These modifications significantly alter the protein's fundamental physicochemical properties, and with them its 3D structure and activity, enabling it to modulate key physiological processes. Such modulation includes inhibiting cancer cell growth, delaying ovarian aging, regulating metabolic diseases, and ameliorating depression. Consequently, identifying and understanding post-translational lysine modifications is of substantial value to biological research and drug development. Post-translational modifications (PTMs) at lysine (K) sites are among the most common protein modifications. However, research on K-PTMs has largely centered on identifying individual modification types, and techniques for balancing the data remain relatively scarce. In this study, a classification system was developed to predict multiple concurrent modifications at a single lysine residue. First, a well-established multi-label position-specific triad amino acid propensity algorithm was used for feature encoding. Subsequently, PreMLS, a novel ClusterCentroids undersampling algorithm based on MiniBatchKmeans, was introduced to eliminate redundant or similar majority-class samples, thereby mitigating class imbalance. A convolutional neural network architecture was constructed specifically for biological sequence analysis to predict multiple lysine modification sites. Evaluated through five-fold cross-validation and independent testing, the model significantly outperformed existing models such as iMul-kSite and predML-Site. The results help prioritize potential lysine modification sites, facilitating subsequent biological assays and advancing pharmaceutical research. To enhance accessibility, an open-access prediction script has been made available for the multi-label model developed in this study.
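The cluster-centroid undersampling idea described in the abstract can be sketched as follows. This is a minimal illustration, not the PreMLS implementation: it uses plain Lloyd's k-means in pure Python rather than MiniBatchKmeans, it handles a single binary label rather than the multi-label setting, and the function names and the synthetic 2-D data are assumptions made for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def cluster_centroid_undersample(majority, minority, seed=0):
    """Replace the majority class with len(minority) cluster centroids."""
    centroids = kmeans(majority, k=len(minority), seed=seed)
    X = centroids + [list(p) for p in minority]
    y = [0] * len(centroids) + [1] * len(minority)
    return X, y

# Synthetic, hypothetical data: 200 majority samples versus 20 minority samples.
rng = random.Random(42)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
minority = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(20)]

X_bal, y_bal = cluster_centroid_undersample(majority, minority)
print(len(X_bal), y_bal.count(0), y_bal.count(1))  # 40 20 20
```

Because similar majority samples fall into the same cluster and are summarized by one centroid, redundancy is reduced while the overall shape of the majority class is roughly preserved. A mini-batch variant would update centroids from random subsets of the data, which mainly matters for large datasets.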
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Xingze Fang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Jiayong Wan
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
- Wenying He
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Xiangrong Liu
  - Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen, China
- Xiangxiang Zeng
  - School of Information Science and Engineering, Hunan University, Changsha, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
2
Zhang X, Zhao S, Su X, Xu L. From docking to dynamics: Unveiling the potential non-peptide and non-covalent inhibitors of Mpro from natural products. Comput Biol Med 2024; 181:108963. [PMID: 39216402 DOI: 10.1016/j.compbiomed.2024.108963] [Received: 04/24/2024] [Revised: 07/05/2024] [Accepted: 07/26/2024] [Indexed: 09/04/2024]
Abstract
MOTIVATION: This study investigates non-covalent, non-peptide inhibitors of Mpro, a crucial protein target, using a comprehensive approach that integrates molecular docking, molecular dynamics simulations, and top-hits activity prediction. The focus is on elucidating the non-covalent and non-peptide binding modes of potential inhibitors with Mpro.
METHODS: We used structure-based semi-flexible molecular docking, binding-score screening, and ADME screening to screen compounds from CMNPD and HERB in silico. These methods allowed us to identify potential candidates based on their binding scores and their interactions with the binding site of the main protease. To further evaluate the stability of these interactions, we conducted molecular dynamics simulations and calculated binding energies. Finally, a top-hits activity prediction method was used to prioritize compounds by their predicted inhibitory potential.
RESULTS: Through a combination of binding energy calculations and activity predictions, we identified six potential inhibitor molecules exhibiting promising activity against Mpro. These compounds demonstrated favorable binding interactions and stability profiles, making them attractive candidates for further experimental validation and drug development efforts targeting Mpro.
Affiliation(s)
- Xin Zhang
  - The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Shulin Zhao
  - The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, China
- Xi Su
  - Foshan Women and Children Hospital, Foshan, China
- Lifeng Xu
  - The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, China
3
Ahmed Z, Shahzadi K, Temesgen SA, Ahmad B, Chen X, Ning L, Zulfiqar H, Lin H, Jin YT. A protein pre-trained model-based approach for the identification of the liquid-liquid phase separation (LLPS) proteins. Int J Biol Macromol 2024; 277:134146. [PMID: 39067723 DOI: 10.1016/j.ijbiomac.2024.134146] [Received: 05/02/2024] [Revised: 07/06/2024] [Accepted: 07/23/2024] [Indexed: 07/30/2024]
Abstract
Liquid-liquid phase separation (LLPS) regulates many biological processes, including RNA metabolism, chromatin rearrangement, and signal transduction. Aberrant LLPS can lead to serious diseases, so identifying LLPS proteins is crucial. Traditional biochemistry-based methods for identifying LLPS proteins are costly, time-consuming, and laborious. In contrast, artificial intelligence-based approaches are fast and cost-effective and can be a better alternative. Previous methods employed word2vec in conjunction with machine learning or deep learning algorithms. Although word2vec captures word semantics and relationships, it may not capture features relevant to protein classification, such as physicochemical properties, evolutionary relationships, or structural features. Other studies often trained models on a limited set of features, including planar π contact frequency, pi-pi, and β-pairing propensities. To overcome these shortcomings, this study first constructed a reliable dataset of 1206 protein sequences, comprising 603 LLPS and 603 non-LLPS sequences. A computational model was then proposed that identifies LLPS proteins by capturing the semantic information of protein sequences directly, using the transformer-based ESM2-36 pre-trained model in conjunction with a convolutional neural network. The model achieved accuracies of 85.68% and 89.67% on the training and test data, respectively, surpassing the accuracy of previous studies. This performance demonstrates the potential of our computational method as an efficient alternative for identifying LLPS proteins.
Affiliation(s)
- Zahoor Ahmed
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
- Kiran Shahzadi
  - Department of Biotechnology, Women University of Azad Jammu and Kashmir, Bagh, Azad Kashmir, Pakistan
- Sebu Aboma Temesgen
  - School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China
- Basharat Ahmad
  - School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China
- Xiang Chen
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
- Lin Ning
  - School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China; School of Healthcare Technology, Chengdu Neusoft University, Chengdu, China
- Hasan Zulfiqar
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
- Hao Lin
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, China
- Yan-Ting Jin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, 611731 Chengdu, China
4
Liu YX, Song JL, Li XM, Lin H, Cao YN. Identification of target genes co-regulated by four key histone modifications of five key regions in hepatocellular carcinoma. Methods 2024; 231:165-177. [PMID: 39349287 DOI: 10.1016/j.ymeth.2024.09.017] [Received: 06/27/2024] [Revised: 08/27/2024] [Accepted: 09/27/2024] [Indexed: 10/02/2024] Open
Abstract
Hepatocellular carcinoma (HCC) is a cancer with high morbidity and mortality. Studies have shown that histone modification plays an important regulatory role in the occurrence and development of HCC. However, the specific regulatory effects of histone modifications on gene expression in HCC are still unclear. This study focuses on HepG2 and hepatocyte cell lines. First, the distribution of histone modification signals in the two cell lines was calculated and analyzed. Then, using the random forest algorithm, we analyzed the effects of different histone modifications and their modified regions on gene expression in the two cell lines. This analysis yielded four key histone modifications (H3K36me3, H3K4me3, H3K79me2, and H3K9ac) and five key regions that co-regulate gene expression. Subsequently, target genes regulated by the key histone modifications in the key regions were screened. Combined with clinical data, Cox regression analysis and Kaplan-Meier survival analysis were performed on the target genes, and four key target genes related to prognosis (CBX2, CEBPZOS, LDHA, and UMPS) were identified. Finally, immune infiltration analysis and drug sensitivity analysis of the key target genes confirmed their potential role in HCC. Our results provide a theoretical basis for exploring the occurrence of HCC and propose potential biomarkers associated with histone modifications, which may be potential drug targets for the clinical treatment of HCC.
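The Kaplan-Meier step mentioned in the abstract can be illustrated with the standard product-limit estimator, in which the survival probability is multiplied by (1 - d/n) at each event time, where d is the number of events and n the number still at risk. The follow-up times and event flags below are made-up toy values, not the study's clinical data, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up durations
    events: 1 if the event was observed, 0 if the subject was censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set.
        n_at_risk -= removed
    return curve

# Toy cohort: five subjects, two of them censored.
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
print([(t, round(s, 3)) for t, s in km_curve])  # [(1, 0.8), (2, 0.6), (3, 0.3)]
```

In a prognosis analysis, one such curve would typically be computed for each expression group of a candidate gene and the curves then compared; that comparison step is not shown here.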
Affiliation(s)
- Yu-Xian Liu
  - School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China
- Jia-Le Song
  - School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China
- Xiao-Ming Li
  - School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Yan-Ni Cao
  - School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China
5
Zhang W, Ding Y, Wei L, Guo X, Ni F. Therapeutic peptides identification via kernel risk sensitive loss-based k-nearest neighbor model and multi-Laplacian regularization. Brief Bioinform 2024; 25:bbae534. [PMID: 39438076 PMCID: PMC11495874 DOI: 10.1093/bib/bbae534] [Received: 07/14/2024] [Revised: 08/30/2024] [Accepted: 10/08/2024] [Indexed: 10/25/2024] Open
Abstract
Therapeutic peptides are therapeutic agents synthesized from natural amino acids. They can serve as carriers for precise drug delivery and can activate the immune system to prevent and treat various diseases. However, screening therapeutic peptides with biochemical assays is expensive, time-consuming, and limited by experimental conditions and biological samples, and it may raise ethical considerations at the clinical stage. In contrast, screening with machine learning and computational methods is efficient and automated, and it can accurately predict potential therapeutic peptides. In this study, a k-nearest neighbor model based on multi-Laplacian regularization and kernel risk sensitive loss was proposed. The model introduces a kernel risk sensitive loss function into the K-local hyperplane distance nearest neighbor model and combines it with Laplacian regularization to predict therapeutic peptides. The findings indicate that the proposed approach achieves satisfactory results and can effectively predict therapeutic peptide sequences.
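To give a feel for why a kernel risk sensitive loss is attractive, the sketch below computes the empirical form commonly used in the robust-learning literature, (1/λ) · mean(exp(λ(1 − κσ(e)))) with a Gaussian kernel κσ(e) = exp(−e²/(2σ²)). This definition and the parameter names `sigma` and `lam` are assumptions for illustration; the abstract does not state the exact variant used, nor how it is combined with the hyperplane-distance model or the multi-Laplacian term.

```python
import math

def gaussian_kernel(e, sigma):
    """Gaussian kernel evaluated on a scalar error e."""
    return math.exp(-e * e / (2 * sigma * sigma))

def krsl(errors, sigma=1.0, lam=1.0):
    """Empirical kernel risk sensitive loss over a list of scalar errors."""
    total = sum(math.exp(lam * (1 - gaussian_kernel(e, sigma))) for e in errors)
    return total / (lam * len(errors))

def mse(errors):
    """Mean squared error, shown for comparison."""
    return sum(e * e for e in errors) / len(errors)

clean = [0.1, -0.2, 0.05, 0.0]
with_outlier = clean + [50.0]  # one grossly wrong residual

# The squared error is dominated by the outlier, while the kernel-based
# loss is bounded above by exp(lam) / lam and therefore changes only mildly.
print(mse(clean), mse(with_outlier))
print(krsl(clean), krsl(with_outlier))
```

The bounded range is the design point: a single corrupted sample cannot contribute more than exp(lam)/lam to the loss, which is why losses of this family are used when the training labels or features are noisy.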
Affiliation(s)
- Wenyu Zhang
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 2006 Xiyuan Avenue, High tech Zone, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1 Chengdian Road, Kecheng District, Quzhou 324000, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1 Chengdian Road, Kecheng District, Quzhou 324000, China
- Leyi Wei
  - Macao Polytechnic University, Gomes Street, Macau Peninsula, Macau 999078, China
- Xiaoyi Guo
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1 Chengdian Road, Kecheng District, Quzhou 324000, China
- Fengming Ni
  - Department of Gastroenterology, The First Hospital of Jilin University, No. 71 Xinmin Street, Chaoyang District, Changchun 130021, China
6
Zuo Y, Zhang B, He W, Bi Y, Liu X, Zeng X, Deng Z. MSlocPRED: deep transfer learning-based identification of multi-label mRNA subcellular localization. Brief Bioinform 2024; 25:bbae504. [PMID: 39401145 PMCID: PMC11472759 DOI: 10.1093/bib/bbae504] [Received: 05/29/2024] [Revised: 08/19/2024] [Accepted: 09/30/2024] [Indexed: 10/17/2024] Open
Abstract
Subcellular localization of messenger ribonucleic acid (mRNA) is a universal mechanism for precise and efficient control of the translation process. Although researchers have constructed many computational methods for predicting mRNA subcellular localization, very few are designed for mRNAs with multiple localization annotations, and their generalization performance could be improved. In this study, the prediction model MSlocPRED was constructed to identify multi-label mRNA subcellular localization. First, the preprocessed Dataset 1 and Dataset 2 were transformed into images. The proposed MDNDO-SMDU resampling technique was then used to balance the number of samples per category in the training dataset. Finally, deep transfer learning was used to construct the predictive model MSlocPRED, which identifies subcellular localization for 16 classes (Dataset 1) and 18 classes (Dataset 2). Comparative tests of different resampling techniques show that the technique proposed in this study is more effective as a preprocessing step for subcellular localization. Prediction results on datasets constructed by intercepting different lengths of the NC end (the 5' and 3' untranslated regions, which flank the protein-coding sequence and influence mRNA function without encoding protein) show that, for both Dataset 1 and Dataset 2, performance is best when 35 nucleotides are intercepted from the NC end. Independent testing and five-fold cross-validation comparisons with established prediction tools show that MSlocPRED is significantly better at identifying multi-label mRNA subcellular localization. Additionally, SHapley Additive exPlanations was used to explain how the MSlocPRED model arrives at its predictions.
The predictive model and associated datasets are available on GitHub: https://github.com/ZBYnb1/MSlocPRED/tree/main.
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
- Bangyi Zhang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
- Wenying He
  - School of Artificial Intelligence, Hebei University of Technology, 5340 Xiping Road, Beichen District, Tianjin 300130, China
- Yue Bi
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Wellington Rd, Clayton VIC 3800, Australia
- Xiangrong Liu
  - Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, 422 Siming South Road, Siming District, Xiamen City, Fujian 361005, China
- Xiangxiang Zeng
  - School of Information Science and Engineering, Hunan University, Yuelu District, Changsha 410012, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, No. 1800 Lihu Avenue, Binhu District, Wuxi 214000, China
7
Zuo Y, Wan M, Shen Y, Wang X, He W, Bi Y, Liu X, Deng Z. ILYCROsite: Identification of lysine crotonylation sites based on FCM-GRNN undersampling technique. Comput Biol Chem 2024; 113:108212. [PMID: 39277959 DOI: 10.1016/j.compbiolchem.2024.108212] [Received: 07/08/2024] [Revised: 09/02/2024] [Accepted: 09/12/2024] [Indexed: 09/17/2024]
Abstract
Protein lysine crotonylation is an important post-translational modification that regulates various cellular activities. For example, histone crotonylation affects chromatin structure and promotes histone replacement. Identifying and understanding lysine crotonylation sites is crucial in protein research. However, as the number of known non-histone crotonylation sites grows, existing classifiers based on traditional machine learning may encounter performance limitations. To address this problem, and given the advantages of deep learning for sequence data analysis, this study presents ILYCROsite, an MLP-Attention-based model for identifying crotonylation sites. First, three feature extraction strategies, namely Amino Acid Composition, K-mer, and Distance-based residue features, were used to encode crotonylated and non-crotonylated sequences. Then, to balance the training dataset, the FCM-GRNN undersampling algorithm, which combines fuzzy clustering with a generalized regression neural network, was introduced. Finally, to improve the effectiveness of crotonylation site identification, we explored various classification algorithms. Based on experimental performance comparisons, a multilayer perceptron (MLP) combined with a stacked self-attention mechanism was selected to construct the prediction model ILYCROsite. Results from independent testing and five-fold cross-validation demonstrated that ILYCROsite performs excellently. Notably, on the independent test set, ILYCROsite achieved an AUC of 87.93%, significantly better than existing state-of-the-art models.
In addition, SHAP (Shapley Additive exPlanations) values were used to analyze the importance of features and their impact on model predictions. To make the prediction model easier for researchers to use, we developed a program that identifies crotonylation sites in a given protein sequence. The data and code for this program are available at: https://github.com/wmqskr/ILYCROsite.
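Two of the three encodings named in the abstract, Amino Acid Composition and K-mer, are simple frequency vectors and can be sketched in a few lines. This is an illustration under assumptions, not the ILYCROsite code: the function names, the choice of k = 2, and the example peptide window are hypothetical, and the distance-based residue encoding is omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(seq):
    """Amino Acid Composition: fraction of each of the 20 residues."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def kmer_freq(seq, k=2):
    """K-mer frequencies over all 20**k possible k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        sub = seq[i:i + k]
        if sub in counts:  # skip k-mers containing non-standard residues
            counts[sub] += 1
    return [counts[m] / total for m in kmers]

# Hypothetical peptide window centred on a lysine (K).
window = "AKLKKAGKE"
aac_vec = aac(window)
kmer_vec = kmer_freq(window, k=2)
features = aac_vec + kmer_vec  # simple concatenation of the two encodings

print(len(aac_vec), len(kmer_vec), len(features))  # 20 400 420
```

Concatenating encodings like this is what makes the later balancing and classification steps operate on a single fixed-length vector per candidate site.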
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Minquan Wan
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Yang Shen
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Xinheng Wang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Wenying He
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
- Yue Bi
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
- Xiangrong Liu
  - Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
8
Wu CY, Xu ZX, Li N, Qi DY, Hao ZH, Wu HY, Gao R, Jin YT. Accurately identifying positive and negative regulation of apoptosis using fusion features and machine learning methods. Comput Biol Chem 2024; 113:108207. [PMID: 39265463 DOI: 10.1016/j.compbiolchem.2024.108207] [Received: 06/28/2024] [Revised: 08/20/2024] [Accepted: 09/06/2024] [Indexed: 09/14/2024]
Abstract
Apoptotic proteins play a crucial role in apoptosis, ensuring a balance between cell proliferation and death. Further elucidating the regulatory mechanisms of apoptosis will therefore enhance our understanding of the functions of these proteins. However, developing computational methods that accurately distinguish positive from negative regulation of apoptosis remains a significant challenge. This work proposes a machine learning model based on multi-feature fusion to identify proteins involved in positive and negative regulation of apoptosis. First, we constructed a reliable benchmark dataset containing 200 proteins that positively regulate apoptosis and 241 that negatively regulate it. Subsequently, we developed a classifier that combines a support vector machine (SVM) with pseudo composition of k-spaced amino acid pairs (PseCKSAAP), composition transition distribution (CTD), dipeptide deviation from expected mean (DDE), and PSSM-composition features to identify these proteins. Analysis of variance (ANOVA) was employed to select the optimized features that yield the maximum prediction performance. On independent data, the proposed model achieved an accuracy of 0.781 and an AUROC of 0.837, demonstrating its strong predictive capability.
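The ANOVA-based selection step can be illustrated with the one-way F statistic, the ratio of between-class to within-class variance of a feature; features with larger F separate the classes better and are ranked first. The sketch below is a generic illustration with made-up feature values, not the authors' pipeline, and the function and variable names are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature.

    groups: list of lists, one list of feature values per class.
    Assumes the within-class variance is not zero.
    """
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Toy data: each feature maps to (values in class 0, values in class 1).
toy_features = {
    "feat_a": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),  # classes well separated
    "feat_b": ([1.0, 5.0, 3.0], [2.0, 4.0, 3.0]),  # identical class means
}

scores = {name: anova_f(list(groups)) for name, groups in toy_features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'feat_a': 13.5, 'feat_b': 0.0}
print(ranking)  # ['feat_a', 'feat_b']
```

In practice the ranked list is usually truncated at the point where adding further features no longer improves cross-validated performance.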
Affiliation(s)
- Cheng-Yan Wu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China
- Zhi-Xue Xu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China
- Nan Li
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China
- Dan-Yang Qi
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China
- Zhi-Hong Hao
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China
- Hong-Ye Wu
  - Key Laboratory of Magnetism and Magnetic Materials at Universities of Inner Mongolia Autonomous Region, Baotou Teacher's College, Baotou 014010, China
- Ru Gao
  - The People's Hospital of Wenjiang, Chengdu, Sichuan 611130, China
- Yan-Ting Jin
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
9
Zuo Y, Zhang B, Dong Y, He W, Bi Y, Liu X, Zeng X, Deng Z. Glypred: Lysine Glycation Site Prediction via CCU-LightGBM-BiLSTM Framework with Multi-Head Attention Mechanism. J Chem Inf Model 2024; 64:6699-6711. [PMID: 39121059 DOI: 10.1021/acs.jcim.4c01034] [Indexed: 08/11/2024]
Abstract
Glycation, a type of posttranslational modification, preferentially occurs on lysine and arginine residues, impairing protein functionality and altering characteristics. This process is linked to diseases such as Alzheimer's, diabetes, and atherosclerosis. Traditional wet lab experiments are time-consuming, whereas machine learning has significantly streamlined the prediction of protein glycation sites. Despite promising results, challenges remain, including data imbalance, feature redundancy, and suboptimal classifier performance. This research introduces Glypred, a lysine glycation site prediction model combining ClusterCentroids Undersampling (CCU), LightGBM, and bidirectional long short-term memory network (BiLSTM) methodologies, with an additional multihead attention mechanism integrated into the BiLSTM. To achieve this, the study undertakes several key steps: selecting diverse feature types to capture comprehensive protein information, employing a cluster-based undersampling strategy to balance the data set, using LightGBM for feature selection to enhance model performance, and implementing a bidirectional LSTM network for accurate classification. Together, these approaches ensure that Glypred effectively identifies glycation sites with high accuracy and robustness. For feature encoding, five distinct feature types (AAC, KMER, DR, PWAA, and EBGW) were selected to capture a broad spectrum of protein sequence and biological information. These encoded features were integrated and validated to ensure comprehensive protein information acquisition. To address the issue of highly imbalanced positive and negative samples, various undersampling algorithms, including random undersampling, NearMiss, edited nearest neighbor rule, and CCU, were evaluated. CCU was ultimately chosen to remove redundant nonglycated training data, establishing a balanced data set that enhances the model's accuracy and robustness.
For feature selection, the LightGBM ensemble learning algorithm was employed to reduce feature dimensionality by identifying the most significant features. This approach accelerates model training, enhances generalization capabilities, and ensures good transferability of the model. Finally, a bidirectional long short-term memory network was used as the classifier, with a network structure designed to capture glycation modification site features from both forward and backward directions. To prevent overfitting, appropriate regularization parameters and dropout rates were introduced, achieving efficient classification. Experimental results show that Glypred achieved optimal performance. This model provides new insights for bioinformatics and encourages the application of similar strategies in other fields. A lysine glycation site prediction software tool was also developed using the PyQt5 library, offering researchers an auxiliary screening tool to reduce workload and improve efficiency. The software and data sets are available on GitHub: https://github.com/ZBYnb/Glypred.
Affiliation(s)
- Yun Zuo
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Bangyi Zhang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Yinkang Dong
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
- Wenying He
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China
- Yue Bi
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Clayton 3800, Australia
- Xiangrong Liu
  - Department of Computer Science and Technology, National Institute for Data Science in Health and Medicine, Xiamen Key Laboratory of Intelligent Storage and Computing, Xiamen University, Xiamen 361005, China
- Xiangxiang Zeng
  - School of Information Science and Engineering, Hunan University, Changsha 410012, China
- Zhaohong Deng
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214000, China
10
Meng C, Pei Y, Bu Y, Liu Q, Li Q, Zou Q, Zhang Y. IIFS2.0: An Improved Incremental Feature Selection Method for Protein Sequence Processing Based on a Caching Strategy. J Mol Biol 2024:168741. [PMID: 39122168 DOI: 10.1016/j.jmb.2024.168741] [Received: 06/05/2024] [Revised: 07/08/2024] [Accepted: 08/05/2024] [Indexed: 08/12/2024]
Abstract
The purpose of feature selection in protein sequence recognition problems is to select the optimal feature set and use it as training input for classifiers and discover key sequence features of specific proteins. In the feature selection process, relevant features associated with the target task will be retained, and irrelevant and redundant features will be removed. Therefore, in an ideal state, a feature combination with smaller feature dimensions and higher performance indicators is desired. This paper proposes an algorithm called IIFS2.0 based on the cache elimination strategy, which takes the local optimal combination of cached feature subsets as a breakthrough point. It searches for a new feature combination method through the cache elimination strategy to avoid the drawbacks of human factors and excessive reliance on feature sorting results. We validated and analyzed its effectiveness on the protein dataset, demonstrating that IIFS2.0 significantly reduces the dimensionality of feature combinations while also improving various evaluation indicators. In addition, we provide IIFS2.0 on https://112.124.26.17:8006/ for researchers to use.
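For context, the classical incremental feature selection (IFS) loop that methods of this kind build on can be sketched as below: features are added one at a time in ranked order, each prefix is scored, and the best-scoring prefix is kept. This is the plain baseline, not IIFS2.0; the cache elimination strategy is not described in enough detail in the abstract to reproduce, and the scoring function and feature names here are hypothetical stand-ins for a real classifier evaluation.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Classical IFS: score every prefix of a ranked list, keep the best one."""
    best_subset, best_score = [], float("-inf")
    current = []
    for feature in ranked_features:
        current.append(feature)
        score = evaluate(current)
        if score > best_score:  # strict: ties keep the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Hypothetical per-feature gains and a fixed cost per feature, standing in
# for cross-validated performance of a classifier trained on the subset.
TOY_GAIN = {"f1": 30, "f2": 20, "f3": 0, "f4": 10}
COST_PER_FEATURE = 5

def toy_evaluate(subset):
    return sum(TOY_GAIN[f] for f in subset) - COST_PER_FEATURE * len(subset)

ranked = ["f1", "f2", "f3", "f4"]
ifs_subset, ifs_score = incremental_feature_selection(ranked, toy_evaluate)
print(ifs_subset, ifs_score)                 # ['f1', 'f2'] 40
print(toy_evaluate(["f1", "f2", "f4"]))      # 45
```

The toy run shows the limitation the abstract alludes to: because only prefixes of the ranking are considered, the higher-scoring subset that skips `f3` is never examined. Reducing this dependence on the feature ordering is the stated motivation for searching beyond the ranked prefixes.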
Affiliation(s)
- Chaolu Meng
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
| | - Yue Pei
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yongbo Bu
- College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China; Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, China
| | - Qing Liu
- Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
| | - Qun Li
- Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.
| | - Ying Zhang
- Department of Pain, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China; Department of Anesthesiology, The Affiliated Traditional Chinese Medicine Hospital of Southwest Medical University, Luzhou 646000, Sichuan, China.
| |
|
11
|
Petrovskiy DV, Nikolsky KS, Kulikova LI, Rudnev VR, Butkova TV, Malsagova KA, Kopylov AT, Kaysheva AL. PowerNovo: de novo peptide sequencing via tandem mass spectrometry using an ensemble of transformer and BERT models. Sci Rep 2024; 14:15000. [PMID: 38951578 PMCID: PMC11217302 DOI: 10.1038/s41598-024-65861-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Accepted: 06/25/2024] [Indexed: 07/03/2024] Open
Abstract
The primary objective of analyzing the data obtained in a mass spectrometry-based proteomic experiment is peptide and protein identification, that is, correct assignment of each tandem mass spectrum to one amino acid sequence. Comparing empirical fragment spectra with theoretically predicted ones, or matching them against a collected spectral library, are commonly accepted strategies for identifying proteins and determining their amino acid sequences. Although these approaches are widely used and appreciably efficient for well-characterized model organisms or measured proteins, they cannot detect novel peptide sequences that are rare or have not been previously annotated. This study presents PowerNovo, a tool for de novo sequencing of proteins using tandem mass spectra acquired on a variety of mass analyzers with different fragmentation techniques. PowerNovo uses an ensemble of models for peptide sequencing: a model for detecting regularities in tandem mass spectra, precursors, and fragment ions, and a natural language processing model that assesses peptide sequence quality and helps reconstruct noisy sequences. Testing showed that the performance of PowerNovo is comparable to or even better than that of the widely used PointNovo, DeepNovo, Casanovo, and Novor packages. PowerNovo also provides a complete processing pipeline for mass spectrometry data: along with predicting the peptide sequence, it includes peptide assembly and protein inference blocks.
|
12
|
Li H, Jiang L, Yang K, Shang S, Li M, Lv Z. iNP_ESM: Neuropeptide Identification Based on Evolutionary Scale Modeling and Unified Representation Embedding Features. Int J Mol Sci 2024; 25:7049. [PMID: 39000158 PMCID: PMC11240975 DOI: 10.3390/ijms25137049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2024] [Revised: 06/17/2024] [Accepted: 06/25/2024] [Indexed: 07/16/2024] Open
Abstract
Neuropeptides are biomolecules with crucial physiological functions. Accurate identification of neuropeptides is essential for understanding nervous system regulatory mechanisms. However, traditional analysis methods are expensive and laborious, and the development of effective machine learning models continues to be a subject of current research. Hence, in this research, we constructed an SVM-based machine learning neuropeptide predictor, iNP_ESM, by integrating protein language models Evolutionary Scale Modeling (ESM) and Unified Representation (UniRep) for the first time. Our model utilized feature fusion and feature selection strategies to improve prediction accuracy during optimization. In addition, we validated the effectiveness of the optimization strategy with UMAP (Uniform Manifold Approximation and Projection) visualization. iNP_ESM outperforms existing models on a variety of machine learning evaluation metrics, with an accuracy of up to 0.937 in cross-validation and 0.928 in independent testing, demonstrating optimal neuropeptide recognition capabilities. We anticipate improved neuropeptide data in the future, and we believe that the iNP_ESM model will have broader applications in the research and clinical treatment of neurological diseases.
Affiliation(s)
- Honghao Li
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| | - Liangzhen Jiang
- College of Food and Biological Engineering, Chengdu University, Chengdu 610106, China
- Country Key Laboratory of Coarse Cereal Processing, Ministry of Agriculture and Rural Affairs, Chengdu 610106, China
| | - Kaixiang Yang
- College of Software Engineering, Sichuan University, Chengdu 610041, China
| | - Shulin Shang
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| | - Mingxin Li
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| | - Zhibin Lv
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| |
|
13
|
Feng C, Wei H, Li X, Feng B, Xu C, Zhu X, Liu R. A stacking-based algorithm for antifreeze protein identification using combined physicochemical, pseudo amino acid composition, and reduction property features. Comput Biol Med 2024; 176:108534. [PMID: 38754217 DOI: 10.1016/j.compbiomed.2024.108534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 04/03/2024] [Accepted: 04/28/2024] [Indexed: 05/18/2024]
Abstract
Antifreeze proteins have wide applications in the medical and food industries. In this study, we propose a stacking-based classifier that can effectively identify antifreeze proteins. Initially, feature extraction was performed in three aspects: reduction properties, scalable pseudo amino acid composition, and physicochemical properties. A hybrid feature set comprised of the combined information from these three categories was obtained. Subsequently, we trained the training set based on LightGBM, XGBoost, and RandomForest algorithms, and the training outcomes were passed to the Logistic algorithm for matching, thereby establishing a stacking algorithm. The proposed algorithm was tested on the test set and an independent validation set. Experimental data indicates that the algorithm achieved a recognition accuracy of 98.3 %, and an accuracy of 98.5 % on the validation set. Lastly, we analyzed the reasons why numerical features achieved high recognition capabilities from multiple aspects. Data dimensionality reduction and the analysis from two-dimensional and three-dimensional views revealed separability between positive and negative samples, and the protein three-dimensional structure further demonstrated significant differences in related features between the two samples. Analysis of the classifier revealed that Hr*Hr, HrHr, and Sc-PseAAC_1, 188D(152,116,57,183) were among the seven most important numerical features affecting algorithm recognition. For Hr*Hr and HrHr, supportive sequence level evidence for the reduction dictionary was found in terms of conservation area analysis, multiple sequence alignment, and amino acid conservative substitution. Moreover, the importance of the reduction dictionary was recognized through a comparative analysis of importance before and after the reduction, realizing the effectiveness of the dictionary in improving feature importance. 
A decision tree model has been utilized to discern the distinctions between dipeptides associated with the physical and chemical properties of His(H), Iso(I), Leu(L), and Lys(K) and other dipeptides. We finally analyzed the other seven features of importance, and data analysis confirmed that hydrophobicity, secondary structure, charge properties, van der Waals forces, and solvent accessibility are also factors affecting the antifreeze capability of proteins.
Affiliation(s)
- Changli Feng
- Department of Information Science and Technology, Taishan University, Taian, 271000, China.
| | - Haiyan Wei
- Department of Information Science and Technology, Taishan University, Taian, 271000, China.
| | - Xin Li
- Department of Information Science and Technology, Taishan University, Taian, 271000, China.
| | - Bin Feng
- Department of Information Science and Technology, Taishan University, Taian, 271000, China.
| | - Chugui Xu
- Department of Information Science and Technology, Taishan University, Taian, 271000, China.
| | - Xiaorong Zhu
- Department of Information Science and Technology, Taishan University, Taian, 271000, China.
| | - Ruijun Liu
- School of Software, Beihang University, Beijing, 100191, China.
| |
|
14
|
Jiao S, Ye X, Sakurai T, Zou Q, Liu R. Integrated convolution and self-attention for improving peptide toxicity prediction. Bioinformatics 2024; 40:btae297. [PMID: 38696758 DOI: 10.1093/bioinformatics/btae297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 04/02/2024] [Accepted: 04/30/2024] [Indexed: 05/04/2024]
Abstract
MOTIVATION Peptides are promising agents for the treatment of a variety of diseases due to their specificity and efficacy. However, the development of peptide-based drugs is often hindered by the potential toxicity of peptides, which poses a significant barrier to their clinical application. Traditional experimental methods for evaluating peptide toxicity are time-consuming and costly, making the development process inefficient. Therefore, there is an urgent need for computational tools specifically designed to predict peptide toxicity accurately and rapidly, facilitating the identification of safe peptide candidates for drug development. RESULTS We provide here a novel computational approach, CAPTP, which leverages the power of convolution and self-attention to enhance the prediction of peptide toxicity from amino acid sequences. CAPTP demonstrates outstanding performance, achieving a Matthews correlation coefficient of approximately 0.82 in both cross-validation settings and on independent test datasets. This performance surpasses that of existing state-of-the-art peptide toxicity predictors. Importantly, CAPTP maintains its robustness and generalizability even when dealing with data imbalances. Further analysis by CAPTP reveals that certain sequential patterns, particularly in the head and central regions of peptides, are crucial in determining their toxicity. This insight can significantly inform and guide the design of safer peptide drugs. AVAILABILITY AND IMPLEMENTATION The source code for CAPTP is freely available at https://github.com/jiaoshihu/CAPTP.
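The Matthews correlation coefficient reported above is computed from the four counts of a binary confusion matrix. The following is a minimal sketch of the standard formula only; the function name and the example counts are illustrative assumptions and are not taken from the CAPTP code or results.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention for the
    case where a whole row or column of the confusion matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to illustrate the arithmetic.
print(matthews_corrcoef(tp=90, tn=90, fp=10, fn=10))
```

Unlike plain accuracy, the coefficient uses all four cells of the confusion matrix, which is why it is often preferred for imbalanced data such as toxic versus non-toxic peptides.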
Affiliation(s)
- Shihu Jiao
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Ruijun Liu
- School of Software, Beihang University, Beijing 100191, China
| |
|
15
|
Lou R, Shui W. Acquisition and Analysis of DIA-Based Proteomic Data: A Comprehensive Survey in 2023. Mol Cell Proteomics 2024; 23:100712. [PMID: 38182042 PMCID: PMC10847697 DOI: 10.1016/j.mcpro.2024.100712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 12/27/2023] [Accepted: 01/02/2024] [Indexed: 01/07/2024] Open
Abstract
Data-independent acquisition (DIA) mass spectrometry (MS) has emerged as a powerful technology for high-throughput, accurate, and reproducible quantitative proteomics. This review provides a comprehensive overview of recent advances in both the experimental and computational methods for DIA proteomics, from data acquisition schemes to analysis strategies and software tools. DIA acquisition schemes are categorized based on the design of precursor isolation windows, highlighting wide-window, overlapping-window, narrow-window, scanning quadrupole-based, and parallel accumulation-serial fragmentation-enhanced DIA methods. For DIA data analysis, major strategies are classified into spectrum reconstruction, sequence-based search, library-based search, de novo sequencing, and sequencing-independent approaches. A wide array of software tools implementing these strategies are reviewed, with details on their overall workflows and scoring approaches at different steps. The generation and optimization of spectral libraries, which are critical resources for DIA analysis, are also discussed. Publicly available benchmark datasets covering global proteomics and phosphoproteomics are summarized to facilitate performance evaluation of various software tools and analysis workflows. Continued advances and synergistic developments of versatile components in DIA workflows are expected to further enhance the power of DIA-based proteomics.
Affiliation(s)
- Ronghui Lou
- iHuman Institute, ShanghaiTech University, Shanghai, China; School of Life Science and Technology, ShanghaiTech University, Shanghai, China.
| | - Wenqing Shui
- iHuman Institute, ShanghaiTech University, Shanghai, China; School of Life Science and Technology, ShanghaiTech University, Shanghai, China.
| |
|
16
|
Wu J, Liu B, Zhang J, Wang Z, Li J. DL-PPI: a method on prediction of sequenced protein-protein interaction based on deep learning. BMC Bioinformatics 2023; 24:473. [PMID: 38097937 PMCID: PMC10722729 DOI: 10.1186/s12859-023-05594-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 08/08/2023] [Accepted: 12/01/2023] [Indexed: 12/17/2023] Open
Abstract
PURPOSE Sequenced Protein-Protein Interaction (PPI) prediction represents a pivotal area of study in biology, playing a crucial role in elucidating the mechanistic underpinnings of diseases and facilitating the design of novel therapeutic interventions. Conventional methods for extracting features through experimental processes have proven to be both costly and exceedingly complex. In light of these challenges, the scientific community has turned to computational approaches, particularly those grounded in deep learning methodologies. Despite the progress achieved by current deep learning technologies, their effectiveness diminishes when applied to larger, unfamiliar datasets. RESULTS This paper introduces a novel deep learning framework, termed DL-PPI, for predicting PPIs based on sequence data. The proposed framework comprises two key components aimed at improving the accuracy of feature extraction from individual protein sequences and capturing relationships between proteins in unfamiliar datasets. 1. Protein Node Feature Extraction Module: To enhance the accuracy of feature extraction from individual protein sequences and facilitate the understanding of relationships between proteins in unknown datasets, the authors devised a novel protein node feature extraction module utilizing the Inception method. This module efficiently captures relevant patterns and representations within protein sequences, enabling more informative feature extraction. 2. Feature-Relational Reasoning Network (FRN): In the Global Feature Extraction module of the model, the authors developed a novel FRN that leverages Graph Neural Networks to determine interactions between pairs of input proteins. The FRN effectively captures the underlying relational information between proteins, contributing to improved PPI predictions. The DL-PPI framework demonstrates state-of-the-art performance in the realm of sequence-based PPI prediction.
Affiliation(s)
- Jiahui Wu
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Bo Liu
- School of Mathematical and Computational Sciences, Massey University, Auckland, 0745, New Zealand.
| | - Jidong Zhang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Zhihan Wang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Jianqiang Li
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| |
|
17
|
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
- AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan
| |
|
18
|
Yu X, Hu J, Zhang Y. SNN6mA: Improved DNA N6-methyladenine site prediction using Siamese network-based feature embedding. Comput Biol Med 2023; 166:107533. [PMID: 37793205 DOI: 10.1016/j.compbiomed.2023.107533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Revised: 09/01/2023] [Accepted: 09/27/2023] [Indexed: 10/06/2023]
Abstract
DNA N6-methyladenine (6mA) is one of the most common and abundant modifications, and it plays essential roles in various biological processes and cellular functions. Therefore, the accurate identification of DNA 6mA sites is of great importance for a better understanding of its regulatory mechanisms and biological functions. Although significant progress has been made, there is still room for further improvement in 6mA site prediction in DNA sequences. In this study, we report a smart but accurate 6mA predictor, termed SNN6mA, using a Siamese network. To be specific, DNA segments are first encoded into feature vectors using the one-hot encoding scheme; then, these original feature vectors are mapped to a low-dimensional embedding space derived from the Siamese network to capture more discriminative features; finally, the obtained low-dimensional features are fed to a fully connected neural network to perform the final prediction. Stringent benchmarking tests on the datasets of two species demonstrated that the proposed SNN6mA is superior to the state-of-the-art 6mA predictors. Detailed data analyses show that the major advantage of SNN6mA lies in its use of the Siamese network, which can map the original features into a low-dimensional embedding space with more discriminative capability. In summary, the proposed SNN6mA is the first attempt to use a Siamese network for 6mA site prediction and could be easily extended to predict other types of modifications. The codes and datasets used in the study are freely available at https://github.com/YuXuan-Glasgow/SNN6mA for academic use.
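The first step of the pipeline described above, one-hot encoding of a DNA segment, can be sketched as follows. This is a minimal illustration only: the A/C/G/T channel order, the all-zero vector for unknown bases such as N, and the function name are assumptions, not details taken from the SNN6mA repository.

```python
ALPHABET = "ACGT"  # assumed channel order; the original code may differ


def one_hot_dna(segment):
    """Flatten a DNA segment into a 4 * len(segment) one-hot vector.

    Bases outside ALPHABET (for example N) become four zeros, which is an
    assumption made here for illustration.
    """
    vector = []
    for base in segment.upper():
        vector.extend(1 if base == ref else 0 for ref in ALPHABET)
    return vector


print(one_hot_dna("AC"))  # [1, 0, 0, 0, 0, 1, 0, 0]
```

In the method as described, vectors like this would then be passed to the Siamese network to obtain the low-dimensional embedding.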
Affiliation(s)
- Xuan Yu
- Glasgow College, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Ying Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| |
|
19
|
Bustillo L, Laino T, Rodrigues T. The rise of automated curiosity-driven discoveries in chemistry. Chem Sci 2023; 14:10378-10384. [PMID: 37799997 PMCID: PMC10548516 DOI: 10.1039/d3sc03367h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2023] [Accepted: 09/07/2023] [Indexed: 10/07/2023] Open
Abstract
The quest for generating novel chemistry knowledge is critical in scientific advancement, and machine learning (ML) has emerged as an asset in this pursuit. Through interpolation among learned patterns, ML can tackle tasks that were previously deemed demanding to machines. This distinctive capacity of ML provides invaluable aid to bench chemists in their daily work. However, current ML tools are typically designed to prioritize experiments with the highest likelihood of success, i.e., higher predictive confidence. In this perspective, we build on current trends that suggest a future in which ML could be just as beneficial in exploring uncharted search spaces through simulated curiosity. We discuss how low and 'negative' data can catalyse one-/few-shot learning, and how the broader use of curious ML and novelty detection algorithms can propel the next wave of chemical discoveries. We anticipate that ML for curiosity-driven research will help the community overcome potentially biased assumptions and uncover unexpected findings in the chemical sciences at an accelerated pace.
Affiliation(s)
- Latimah Bustillo
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa Lisbon Portugal
| | - Teodoro Laino
- IBM Research Europe Säumerstrasse 4 8803 Rüschlikon Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis) Zurich Switzerland
| | - Tiago Rodrigues
- Research Institute for Medicines (iMed), Faculdade de Farmácia, Universidade de Lisboa Lisbon Portugal
| |
|
20
|
Son J, Na S, Paek E. DbyDeep: Exploration of MS-Detectable Peptides via Deep Learning. Anal Chem 2023; 95:11193-11200. [PMID: 37459568 PMCID: PMC10401496 DOI: 10.1021/acs.analchem.3c00460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Accepted: 07/05/2023] [Indexed: 08/02/2023]
Abstract
Predicting peptide detectability is useful in a variety of mass spectrometry (MS)-based proteomics applications, particularly targeted proteomics. However, most machine learning-based computational methods have relied solely on information from the peptide itself, such as its amino acid sequences or physicochemical properties, despite the fact that peptides detected by MS are dependent on many factors, including protein sample preparation, digestion, separation, ionization, and precursor selection during MS experiments. DbyDeep (Detectability by Deep learning) is an innovative end-to-end LSTM network model for peptide detectability prediction that incorporates sequence contexts of peptides and their cleavage sites (by protease). Utilizing the cleavage site contexts could improve the performance of prediction, and DbyDeep outperformed existing methods in predicting peptides recognizable from multiple MS/MS data sets with diverse species and MS instruments. We argue for the necessity of a learning model that encompasses several contexts associated with peptide detection, as opposed to depending just on peptide sequences. There is a Python implementation of DbyDeep at https://github.com/BISCodeRepo/DbyDeep.
Affiliation(s)
- Juho Son
- Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
| | - Seungjin Na
- Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
- Institute for Artificial Intelligence Research, Hanyang University, Seoul 04763, Republic of Korea
| | - Eunok Paek
- Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
- Institute for Artificial Intelligence Research, Hanyang University, Seoul 04763, Republic of Korea
| |
|
21
|
PD-BertEDL: An Ensemble Deep Learning Method Using BERT and Multivariate Representation to Predict Peptide Detectability. Int J Mol Sci 2022; 23:ijms232012385. [PMID: 36293242 PMCID: PMC9604182 DOI: 10.3390/ijms232012385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 10/11/2022] [Accepted: 10/12/2022] [Indexed: 12/03/2022] Open
Abstract
Peptide detectability is defined as the probability of identifying a peptide from a mixture of standard samples, which is a key step in protein identification and analysis. Exploring effective methods for predicting peptide detectability is helpful for disease treatment and clinical research. However, most existing computational methods for predicting peptide detectability rely on a single type of information. With the increasing complexity of feature representation, it is necessary to explore the influence of multivariate information on peptide detectability. Thus, we propose an ensemble deep learning method, PD-BertEDL. Bidirectional encoder representations from transformers (BERT) is introduced to capture the context information of peptides. Context information, sequence information, and physicochemical information of peptides were combined to construct the multivariate feature space of peptides. We use different deep learning methods to capture high-quality features from the different categories of peptide information and use an average fusion strategy to integrate the three model prediction results, addressing the heterogeneity problem and enhancing the robustness and adaptability of the model. The experimental results show that PD-BertEDL is superior to the existing prediction methods; it can effectively predict peptide detectability and provide strong support for protein identification and quantitative analysis, as well as disease treatment.
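The average fusion strategy mentioned above can be read as an element-wise mean of the per-peptide probabilities produced by the three models. The sketch below assumes exactly that reading; the function name and the example probabilities are hypothetical and not drawn from the PD-BertEDL implementation.

```python
def average_fusion(*model_probs):
    """Element-wise mean of per-model probability lists of equal length."""
    if not model_probs:
        raise ValueError("at least one model output is required")
    length = len(model_probs[0])
    if any(len(probs) != length for probs in model_probs):
        raise ValueError("all model outputs must have the same length")
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


# Hypothetical detectability scores for two peptides from three models.
context_model = [0.9, 0.2]
sequence_model = [0.6, 0.4]
physchem_model = [0.3, 0.6]
print(average_fusion(context_model, sequence_model, physchem_model))
```

Averaging gives each model equal weight, so no single information source dominates the fused prediction.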
|
22
|
Ekvall M, Truong P, Gabriel W, Wilhelm M, Käll L. Prosit Transformer: A transformer for Prediction of MS2 Spectrum Intensities. J Proteome Res 2022; 21:1359-1364. [PMID: 35413196 PMCID: PMC9087333 DOI: 10.1021/acs.jproteome.1c00870] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/28/2022]
Abstract
Machine learning has been an integral part of interpreting data from mass spectrometry (MS)-based proteomics for a long time. Relatively recently, a machine-learning structure appeared successful in other areas of bioinformatics, Transformers. Furthermore, the implementation of Transformers within bioinformatics has become relatively convenient due to transfer learning, i.e., adapting a network trained for other tasks to new functionality. Transfer learning makes these relatively large networks more accessible as it generally requires less data, and the training time improves substantially. We implemented a Transformer based on the pretrained model TAPE to predict MS2 intensities. TAPE is a general model trained to predict missing residues from protein sequences. Despite being trained for a different task, we could modify its behavior by adding a prediction head at the end of the TAPE model and fine-tune it using the spectrum intensity from the training set to the well-known predictor Prosit. We demonstrate that the predictor, which we call Prosit Transformer, outperforms the recurrent neural-network-based predictor Prosit, increasing the median angular similarity on its hold-out set from 0.908 to 0.929. We believe that Transformers will significantly increase prediction accuracy for other types of predictions within MS-based proteomics.
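The angular similarity quoted above compares a predicted and an observed intensity vector. A minimal sketch follows, assuming the normalized spectral angle form 1 - 2·arccos(cosine similarity)/π that is commonly used for fragment intensity prediction; the abstract itself does not spell out the formula, and the function name and example vectors are illustrative.

```python
import math


def angular_similarity(predicted, observed):
    """Normalized spectral angle between two intensity vectors.

    Assumed definition: 1 - 2 * arccos(cosine similarity) / pi, so that
    identical directions score 1.0 and orthogonal vectors score 0.0.
    """
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(
        sum(o * o for o in observed)
    )
    if norm == 0:
        return 0.0  # convention chosen here for empty or all-zero spectra
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding drift
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Hypothetical fragment intensities, for illustration only.
print(angular_similarity([1.0, 0.5, 0.0], [0.9, 0.6, 0.1]))
```

Because the score depends only on the angle between the vectors, scaling either spectrum by a constant leaves it unchanged.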
Affiliation(s)
- Markus Ekvall
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology (KTH), Box 1031, SE-17121 Solna, Sweden
| | - Patrick Truong
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology (KTH), Box 1031, SE-17121 Solna, Sweden
| | - Wassim Gabriel
- Computational Mass Spectrometry, Technical University of Munich (TUM), D-85354 Freising, Germany
| | - Mathias Wilhelm
- Computational Mass Spectrometry, Technical University of Munich (TUM), D-85354 Freising, Germany
| | - Lukas Käll
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, Royal Institute of Technology (KTH), Box 1031, SE-17121 Solna, Sweden
| |
|
23
|
Yang Y, Lin L, Qiao L. Deep learning approaches for data-independent acquisition proteomics. Expert Rev Proteomics 2021; 18:1031-1043. [PMID: 34918987 DOI: 10.1080/14789450.2021.2020654] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/30/2023]
Abstract
INTRODUCTION: Data-independent acquisition (DIA) is an emerging technology for large-scale proteomic studies. DIA data analysis methods are evolving rapidly, and deep learning has become prominent in this field. AREAS COVERED: This review provides an overview of the deep learning methods used for DIA data analysis, including spectral library prediction, feature scoring, and statistical control in peptide-centric analysis, as well as de novo peptide sequencing. Literature searches were performed for articles, including preprints, up to December 2021 from the PubMed, Scopus, and Web of Science databases. EXPERT OPINION: While spectral library prediction has overcome the limited proteome coverage of experimental libraries, the statistical burden imposed by the large query space remains a challenge when using proteome-wide predicted libraries. Analysis of post-translational modifications is another promising direction for deep learning-based DIA methods.
Affiliation(s)
- Yi Yang: Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
- Ling Lin: Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
- Liang Qiao: Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
|
24
|
Zhu Y, Yin S, Zheng J, Shi Y, Jia C. O-glycosylation site prediction for Homo sapiens by combining properties and sequence features with support vector machine. J Bioinform Comput Biol 2021; 20:2150029. [PMID: 34806952 DOI: 10.1142/s0219720021500293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
O-glycosylation is a protein posttranslational modification that plays an important regulatory role in almost all cells. It is related to a large number of physiological and pathological phenomena. Recognizing O-glycosylation sites is the key to further investigating the molecular mechanism of this modification. This study aimed to collect a reliable dataset on Homo sapiens and to develop an O-glycosylation predictor for Homo sapiens, named Captor, based on multiple features. A random undersampling method and a synthetic minority oversampling technique were employed to deal with imbalanced data. In addition, the Kruskal-Wallis (K-W) test was adopted to optimize feature vectors and improve the performance of the model. After a comprehensive comparison of traditional machine learning classifiers and deep learning models, a support vector machine was selected for its superior performance and used to train and optimize the final prediction model. On the independent test set, Captor outperformed the existing O-glycosylation tool, suggesting that Captor could provide more instructive guidance for further experimental research on O-glycosylation. The source code and datasets are available at https://github.com/YanZhu06/Captor/.
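As a rough illustration of the random undersampling step mentioned in the abstract, the sketch below keeps a random subset of each class equal in size to the smallest class. It is a generic standard-library example, not Captor's code; the function name `random_undersample` and the `seed` parameter are hypothetical, and the oversampling (SMOTE) and Kruskal-Wallis steps are not shown.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Balance classes by randomly discarding samples from larger classes.

    Every class is reduced to the size of the smallest class. This is a
    generic sketch and not necessarily the exact procedure used by Captor.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    n_keep = min(len(group) for group in by_class.values())

    kept_samples, kept_labels = [], []
    for label, group in by_class.items():
        # Sample without replacement from each class.
        chosen = rng.sample(group, n_keep)
        kept_samples.extend(chosen)
        kept_labels.extend([label] * n_keep)
    return kept_samples, kept_labels
```

Undersampling discards information from the majority class, which is one reason it is often paired with an oversampling technique for the minority class, as the abstract describes.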
Affiliation(s)
- Yan Zhu: School of Science, Dalian Maritime University, Dalian 116026, P. R. China
- Shuwan Yin: School of Science, Dalian Maritime University, Dalian 116026, P. R. China
- Jia Zheng: School of Science, Dalian Maritime University, Dalian 116026, P. R. China
- Yixia Shi: School of Mathematics and Statistics, Lingnan Normal University, Zhanjiang 524048, P. R. China
- Cangzhi Jia: School of Science, Dalian Maritime University, Dalian 116026, P. R. China
|
25
|
Prediction of Peptide Detectability Based on CapsNet and Convolutional Block Attention Module. Int J Mol Sci 2021; 22:ijms222112080. [PMID: 34769509 PMCID: PMC8584443 DOI: 10.3390/ijms222112080] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/18/2021] [Revised: 10/30/2021] [Accepted: 11/02/2021] [Indexed: 11/17/2022] Open
Abstract
In proteomics, the complexity of sampling during experiments means that mass spectrometry experiments still suffer from limited reproducibility, and peptide identification and quantification results remain subject to randomness. Predicting peptide detectability can make these results more accurate, so such prediction is of considerable research value. This study proposes a novel method for predicting peptide detectability that combines the capsule network (CapsNet) with the convolutional block attention module (CBAM). First, the residue conical coordinate (RCC), the amino acid composition (AAC), the dipeptide composition (DPC), and the sequence embedding code (SEC) are extracted as peptide chain features. These features are then divided into biological features and sequence features and fed separately into the CapsNet network. In addition, the CBAM attention module is added to the network to assign weights to channels and spatial positions, with the aim of enhancing feature learning and improving network training. To verify its effectiveness, the proposed method is compared with several other popular methods. The experimental results show that the proposed method outperforms those methods on most performance measures.
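Two of the features named in the abstract, the amino acid composition (AAC) and the dipeptide composition (DPC), have simple standard definitions and can be sketched as follows. This is a minimal illustration assuming the 20 standard amino acids; the function names are hypothetical, and the RCC and SEC features as well as the network itself are not shown.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide):
    """Return 20 values: the frequency of each standard residue in the peptide."""
    if not peptide:
        return [0.0] * len(AMINO_ACIDS)
    return [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]


def dipeptide_composition(peptide):
    """Return 400 values: the frequency of each ordered residue pair.

    Pairs are taken from overlapping windows of length 2, so a peptide of
    length n contributes n - 1 dipeptides.
    """
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    if not pairs:
        return [0.0] * (len(AMINO_ACIDS) ** 2)
    return [pairs.count(a + b) / len(pairs)
            for a, b in product(AMINO_ACIDS, repeat=2)]
```

Non-standard residue codes are ignored by both functions, so for peptides containing them the feature values sum to less than 1.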
|
26
|
Sun B, Smialowski P, Straub T, Imhof A. Investigation and Highly Accurate Prediction of Missed Tryptic Cleavages by Deep Learning. J Proteome Res 2021; 20:3749-3757. [PMID: 34137619 DOI: 10.1021/acs.jproteome.1c00346] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Trypsin is one of the most important and widely used proteolytic enzymes in mass spectrometry (MS)-based proteomic research. It exclusively cleaves peptide bonds at the C-terminus of lysine and arginine. However, cleavage is also affected by several factors, including specific surrounding amino acids, resulting in frequent incomplete proteolysis and subsequent issues in peptide identification and quantification. Accurate annotation of missed cleavages is crucial to database searching in MS analysis. Here, we present deep-learning predicting missed cleavages (dpMC), a novel algorithm for the prediction of missed trypsin cleavage sites. This algorithm predicts missed cleavages with very high accuracy, with areas under the curve (AUCs) in cross-validation and holdout testing above 0.99, along with a mean F1 score and Matthews correlation coefficient (MCC) of 0.9677 and 0.9349, respectively. We tested our algorithm on data sets from different species and different experimental conditions, and it outperforms other currently available prediction methods. In addition, the method provides better insight into the detailed rules of trypsin cleavage when coupled with propensity and motif analysis. Moreover, our method can be integrated into database searching in MS analysis to identify and quantify mass spectra effectively and efficiently.
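To make the notion of a missed cleavage concrete, the sketch below performs a simple in silico tryptic digestion using only the rule stated in the abstract (cleavage C-terminal to lysine and arginine) and lists the peptides produced when up to a given number of sites are left uncleaved. It is not the dpMC algorithm; the function names and the `max_missed` parameter are hypothetical, and the influence of surrounding residues is deliberately ignored.

```python
def cleavage_sites(protein):
    """Positions where trypsin may cut: directly after each K or R.

    The final residue is skipped because there is no bond after it.
    """
    return [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]


def digest(protein, max_missed=1):
    """Return (peptide, missed_cleavages) pairs for a tryptic digest.

    Simplifying assumption: every K/R site is cleavable; context effects
    from neighbouring residues are not modelled.
    """
    if not protein:
        return []
    bounds = [0] + cleavage_sites(protein) + [len(protein)]
    peptides = []
    for start in range(len(bounds) - 1):
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(bounds):
                break
            # A peptide spanning `missed` internal sites has that many
            # missed cleavages.
            peptides.append((protein[bounds[start]:bounds[end]], missed))
    return peptides
```

Each extra allowed missed cleavage enlarges the set of candidate peptides a database search has to consider, which is why knowing which sites are likely to be missed helps restrict the search space.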
Affiliation(s)
- Bo Sun: Biomedical Center, Protein Analysis Unit, Faculty of Medicine, Ludwig-Maximilians-Universität München, Großhaderner Strasse 9, 82152 Planegg-Martinsried, Germany
- Pawel Smialowski: Institute of Stem Cell Research, Helmholtz Center Munich, German Research Center for Environmental Health, 85764 Munich, Germany; Biomedical Center, Computational Biology Unit, Faculty of Medicine, Ludwig-Maximilians-Universität München, Großhaderner Strasse 9, 82152 Planegg-Martinsried, Germany
- Tobias Straub: Biomedical Center, Computational Biology Unit, Faculty of Medicine, Ludwig-Maximilians-Universität München, Großhaderner Strasse 9, 82152 Planegg-Martinsried, Germany
- Axel Imhof: Biomedical Center, Protein Analysis Unit, Faculty of Medicine, Ludwig-Maximilians-Universität München, Großhaderner Strasse 9, 82152 Planegg-Martinsried, Germany
|