1
Rathore AS, Choudhury S, Arora A, Tijare P, Raghava GPS. ToxinPred 3.0: An improved method for predicting the toxicity of peptides. Comput Biol Med 2024; 179:108926. [PMID: 39038391 DOI: 10.1016/j.compbiomed.2024.108926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Revised: 05/17/2024] [Accepted: 07/17/2024] [Indexed: 07/24/2024]
Abstract
Toxicity emerges as a prominent challenge in the design of therapeutic peptides, causing the failure of numerous peptides during clinical trials. In 2013, our group developed ToxinPred, a computational method that has been extensively adopted by the scientific community for predicting peptide toxicity. In this paper, we propose a refined variant of ToxinPred that showcases improved reliability and accuracy in predicting peptide toxicity. Initially, we utilized a similarity/alignment-based approach employing BLAST to predict toxic peptides, which yielded satisfactory accuracy; however, the method suffered from inadequate coverage. Subsequently, we employed a motif-based approach using MERCI software to uncover specific patterns or motifs that are exclusively observed in toxic peptides. Searching for these motifs in peptides allowed us to predict toxic peptides with a high level of specificity but with poor sensitivity. To overcome the coverage limitations, we developed alignment-free methods using machine/deep learning techniques to balance the sensitivity and specificity of prediction. A deep learning model (ANN - LSTM with fixed sequence length) developed using one-hot encoding achieved a maximum AUROC of 0.93 with an MCC of 0.71 on an independent dataset. A machine learning model (extra tree) developed using compositional features of peptides achieved a maximum AUROC of 0.95 with an MCC of 0.78. We also developed large language models and achieved a maximum AUC of 0.93 using ESM2-t33. Finally, we developed hybrid or ensemble methods combining two or more methods to enhance performance. Our specific hybrid method, which combines a motif-based approach with a machine learning-based model, achieved a maximum AUROC of 0.98 with an MCC of 0.81 on an independent dataset. In this study, all models were trained and tested on 80% of the data using five-fold cross-validation and evaluated on the remaining 20% of the data, referred to as the independent dataset.
The evaluation of all methods on an independent dataset revealed that the method proposed in this study exhibited better performance than existing methods. To cater to the needs of the scientific community, we have developed standalone software, a pip package, and a web-based server, ToxinPred3 (https://github.com/raghavagps/toxinpred3 and https://webs.iiitd.edu.in/raghava/toxinpred3/).
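The evaluation protocol and the compositional features mentioned in this abstract can be illustrated with a minimal sketch. It is not the ToxinPred3 code: the function names (`aa_composition`, `split_80_20`, `five_folds`, `mcc`), the random seed, and the unstratified shuffling are assumptions made here for illustration only.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(peptide):
    """Fraction of each standard residue in the peptide (a 20-value vector)."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    # Non-standard residues are ignored, so the vector may sum to less than 1.
    return [peptide.count(a) / len(peptide) for a in AMINO_ACIDS]


def split_80_20(items, seed=0):
    """Shuffle and hold out 20% as the independent set (seed is an assumption)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(round(0.8 * len(items)))
    return items[:cut], items[cut:]


def five_folds(items, k=5):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

In this sketch, cross-validation runs only on the 80% portion returned by `split_80_20`, and the held-out 20% is touched once at the end, which mirrors the protocol described in the abstract.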
Affiliation(s)
- Anand Singh Rathore
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Shubham Choudhury
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Akanksha Arora
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Purva Tijare
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
- Gajendra P S Raghava
  - Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
2
Nguyen VN, Ho TT, Doan TD, Le NQK. Using a hybrid neural network architecture for DNA sequence representation: A study on N4-methylcytosine sites. Comput Biol Med 2024; 178:108664. [PMID: 38875905 DOI: 10.1016/j.compbiomed.2024.108664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Revised: 05/11/2024] [Accepted: 05/26/2024] [Indexed: 06/16/2024]
Abstract
N4-methylcytosine (4mC) is a modified form of cytosine found in DNA, contributing to epigenetic regulation. It exists in various genomes, including the Rosaceae family encompassing significant fruit crops like apples, cherries, and roses. Previous investigations have examined the distribution and functional implications of 4mC sites within the Rosaceae genome, focusing on their potential roles in gene expression regulation, environmental adaptation, and evolution. This research aims to improve the accuracy of predicting 4mC sites within the genome of Fragaria vesca, a Rosaceae plant species. Building upon the original 4mc-w2vec method, which combines word embedding processing and a convolutional neural network (CNN), we have incorporated additional feature encoding techniques and leveraged pre-trained natural language processing (NLP) models with different deep learning architectures including different forms of CNN, recurrent neural networks (RNN) and long short-term memory (LSTM). Our assessments have shown that the best model is derived from a CNN model using fastText encoding. This model demonstrates enhanced performance, achieving a sensitivity of 0.909, specificity of 0.77, and accuracy of 0.879 on an independent dataset. Furthermore, our model surpasses previously published works on the same dataset, thus showcasing its superior predictive capabilities.
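Word-embedding approaches such as the one described above typically treat a DNA sequence as a sentence of overlapping k-mer "words" before an embedding is learned. The sketch below shows that tokenization step together with the three reported metrics. The k-mer length `k=3`, the stride, and the function names are assumptions; the abstract does not state which settings the authors used.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mers (k and stride are assumed)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Each token list can be fed to an embedding model as one "sentence".
tokens = kmer_tokens("ACGTAC")
```

A sequence shorter than `k` yields an empty token list, so callers may want to filter such inputs before training an embedding.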
Affiliation(s)
- Van-Nui Nguyen
  - University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, Viet Nam
- Trang-Thi Ho
  - Department of Computer Science and Information Engineering, TamKang University, New Taipei, 251301, Taiwan
- Thu-Dung Doan
  - International Degree Program in Animal Vaccine Technology, International College, National Pingtung University of Science and Technology, Pingtung, Taiwan
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, 110, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, 110, Taiwan
  - AIBioMed Research Group, Taipei Medical University, Taipei, 110, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, 110, Taiwan
3
Arif R, Kanwal S, Ahmed S, Kabir M. A Computational Predictor for Accurate Identification of Tumor Homing Peptides by Integrating Sequential and Deep BiLSTM Features. Interdiscip Sci 2024; 16:503-518. [PMID: 38733473 DOI: 10.1007/s12539-024-00628-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 03/16/2024] [Accepted: 03/27/2024] [Indexed: 05/13/2024]
Abstract
Cancer remains a severe illness, and current research indicates that tumor homing peptides (THPs) play an important part in cancer therapy. The identification of THPs can provide crucial insights for the drug-discovery and pharmaceutical industries, as they allow for tailored medication delivery towards cancer cells. These peptides have a high affinity for particular receptors present on tumor surfaces, allowing for the creation of precision medications that reduce off-target consequences and enhance cancer patient treatment results. Wet-lab techniques are considered essential tools for studying THPs; however, they are labor-intensive and time-consuming, making the prediction of THPs a challenging task for researchers. Computational techniques, on the other hand, are considered significant tools for identifying THPs from sequence data. Although many strategies have been presented to predict new THPs, there is still a need to develop a robust method with higher rates of success. In this paper, we developed a novel framework, THP-DF, for accurately identifying THPs on a large scale. Firstly, the peptide sequences are encoded through various sequential features. Secondly, each feature is passed to BiLSTM and attention layers to extract simplified deep features. Finally, an ensemble framework is formed by integrating sequential and deep features, which are fed to a support vector machine, with 10-fold cross-validation used to validate its efficiency. The experimental results showed that THP-DF worked better on both the [Formula: see text] and [Formula: see text] datasets, achieving accuracy of > 95%, which is higher than that of existing predictors on both datasets. This indicates that the proposed predictor could be a beneficial tool to precisely and rapidly identify THPs and will contribute to cutting-edge cancer treatment strategies and pharmaceuticals.
Affiliation(s)
- Roha Arif
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
- Sameera Kanwal
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
- Saeed Ahmed
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
- Muhammad Kabir
  - School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
4
Beltrán JF, Herrera-Belén L, Parraguez-Contreras F, Farías JG, Machuca-Sepúlveda J, Short S. MultiToxPred 1.0: a novel comprehensive tool for predicting 27 classes of protein toxins using an ensemble machine learning approach. BMC Bioinformatics 2024; 25:148. [PMID: 38609877 PMCID: PMC11010298 DOI: 10.1186/s12859-024-05748-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 03/14/2024] [Indexed: 04/14/2024] Open
Abstract
Protein toxins are defense mechanisms and adaptations found in various organisms and microorganisms, and their use in scientific research as therapeutic candidates is gaining relevance due to their effectiveness and specificity against cellular targets. However, discovering these toxins is time-consuming and expensive. In silico tools, particularly those based on machine learning and deep learning, have emerged as valuable resources to address this challenge. Existing tools primarily focus on binary classification, determining whether a protein is a toxin or not, and occasionally identifying specific types of toxins. For the first time, we propose a novel approach capable of classifying protein toxins into 27 distinct categories based on their mode of action within cells. To accomplish this, we assessed multiple machine learning techniques and found that an ensemble model incorporating the Light Gradient Boosting Machine and Quadratic Discriminant Analysis algorithms exhibited the best performance. During the tenfold cross-validation on the training dataset, our model exhibited notable metrics: 0.840 accuracy, 0.827 F1 score, 0.836 precision, 0.840 sensitivity, and 0.989 AUC. In the testing stage, using an independent dataset, the model achieved 0.846 accuracy, 0.838 F1 score, 0.847 precision, 0.849 sensitivity, and 0.991 AUC. These results present a powerful next-generation tool called MultiToxPred 1.0, accessible through a web application. We believe that MultiToxPred 1.0 has the potential to become an indispensable resource for researchers, facilitating the efficient identification of protein toxins. By leveraging this tool, scientists can accelerate their search for these toxins and advance their understanding of their therapeutic potential.
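One common way to combine two probabilistic classifiers, such as the two named in this abstract, is soft voting: averaging their per-class probabilities and taking the top class. The abstract does not say how MultiToxPred 1.0 combines its models, so the weighting scheme and the function names below are assumptions, and the probability vectors are made-up inputs rather than real model outputs.

```python
def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two class-probability vectors, renormalized to sum to 1."""
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same set of classes")
    mixed = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(prob_a, prob_b)]
    total = sum(mixed)
    return [m / total for m in mixed]


def predict_class(prob_a, prob_b, weight_a=0.5):
    """Index of the class with the highest combined probability."""
    mixed = soft_vote(prob_a, prob_b, weight_a)
    return max(range(len(mixed)), key=mixed.__getitem__)


# Hypothetical three-class scores from two models for one protein.
combined = soft_vote([0.6, 0.4, 0.0], [0.2, 0.7, 0.1])
```

With 27 classes, the same functions apply unchanged; only the length of the probability vectors grows.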
Affiliation(s)
- Jorge F Beltrán
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Lisandra Herrera-Belén
  - Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
- Fernanda Parraguez-Contreras
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Jorge G Farías
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Jorge Machuca-Sepúlveda
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
- Stefania Short
  - Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar, 01145, Temuco, Chile
5
Lee TF, Chang CH, Shao JC, Liu YH, Chiu CL, Hsieh YW, Lee SH, Chao PJ, Yeh SA. Revolution of Medical Review: The Application of Meta-Analysis and Convolutional Neural Network-Natural Language Processing in Classifying the Literature for Head and Neck Cancer Radiotherapy. Cancer Control 2024; 31:10732748241286688. [PMID: 39323027 PMCID: PMC11439162 DOI: 10.1177/10732748241286688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 08/20/2024] [Accepted: 09/06/2024] [Indexed: 09/27/2024] Open
Abstract
This study explored the application of meta-analysis and convolutional neural network-natural language processing (CNN-NLP) technologies in classifying literature concerning radiotherapy for head and neck cancer. It aims to enhance both the efficiency and accuracy of literature reviews. By integrating statistical analysis with deep learning, this research successfully identified key studies related to normal tissue complication probability (NTCP) from a vast corpus of literature. This demonstrates the advantages of these technologies in recognizing professional terminology and extracting relevant information. The findings not only improve the quality of literature reviews but also offer new insights for future research on optimizing medical studies through AI technologies. Despite the challenges related to data quality and model generalization, this work provides clear directions for future research.
Affiliation(s)
- Tsair-Fwu Lee
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
  - Graduate Institute of Clinical Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
  - Department of Medical Imaging and Radiological Sciences, Kaohsiung Medical University, Kaohsiung, Taiwan
- Chu-Ho Chang
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Jen-Chung Shao
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Yen-Hsien Liu
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Chien-Liang Chiu
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Yang-Wei Hsieh
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Shen-Hao Lee
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
- Pei-Ju Chao
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
  - Department of Radiation Oncology, E-DA Hospital, Kaohsiung, Taiwan
- Shyh-An Yeh
  - Medical Physics and Informatics Laboratory of Electronics Engineering, National Kaohsiung University of Sciences and Technology, Kaohsiung, Taiwan
  - Department of Medical Imaging and Radiological Sciences, I-Shou University, Kaohsiung, Taiwan
  - Department of Radiation Oncology, E-DA Hospital, Kaohsiung, Taiwan
6
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
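The self-attention mechanism mentioned above can be illustrated with a minimal scaled dot-product sketch. To keep it short, the query, key, and value projections are taken to be the identity, which is a simplifying assumption; real transformer models learn separate projection matrices, and the function names here are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity Q/K/V projections (assumed)."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score this position against every position, regardless of distance.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted sum of all value vectors.
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, vectors)) for c in range(dim)]
        )
    return outputs, weights


# Three toy residue embeddings; the first and last are identical.
outputs, weights = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Because every position is scored against every other position in a single step, the first token attends to the identical last token as strongly as to itself, however far apart they sit in the sequence.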
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
  - AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
  - Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan