1
Nafi MMI. Predicting C- and S-linked Glycosylation sites from protein sequences using protein language models. Comput Biol Med 2025; 189:109956. [PMID: 40073495 DOI: 10.1016/j.compbiomed.2025.109956] [Received: 11/18/2024] [Revised: 02/25/2025] [Accepted: 02/27/2025] [Indexed: 03/14/2025]
Abstract
Among the tasks involving post-translational modifications (PTMs), predicting C-linked and S-linked glycosites is essential, yet experimental techniques such as Capillary Electrophoresis (CE), Enzymatic Deglycosylation, and Mass Spectrometry (MS) are expensive. Computational techniques are therefore required to predict these glycosites. Here, different language model embeddings and sequential features were explored. Two feature selection methods, Recursive Feature Elimination (RFE) and Particle Swarm Optimization (PSO), were used to identify the optimal feature set. Cross-validation results were used to choose the final models. Three sampling strategies for handling imbalanced datasets were examined: Random undersampling, Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive Synthetic Sampling Approach for Imbalanced Learning (ADASYN). Two models are proposed: DeepCSEmbed-C for C-linked and DeepCSEmbed-S for S-linked glycosylation prediction. DeepCSEmbed-C is a dual-branch deep learning model comprising a Feedforward Neural Network (FNN) branch and an Inception branch, coupled with a Random undersampling strategy. DeepCSEmbed-S is a Categorical Boosting (CAT) model with the SMOTE oversampling strategy. DeepCSEmbed-C outperformed available state-of-the-art (SOTA) methods, achieving 92.9% sensitivity, 95.1% F1-score and 90.6% MCC on the Independent dataset. Datasets and Python scripts for training and testing the models are freely accessible at https://github.com/nafcoder/DeepCSEmbed.
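As a rough illustration of the random undersampling strategy named in the abstract, the sketch below balances a dataset by discarding majority-class samples at random. The function name `random_undersample`, the toy labels, and the fixed seed are assumptions made for illustration; they are not taken from the DeepCSEmbed repository.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many items as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Size of the smallest class decides how many samples each class keeps.
    n_keep = min(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        for x in rng.sample(items, n_keep):
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Toy data (hypothetical): 2 positive sites and 8 negative sites.
X = list(range(10))
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_undersample(X, y)
```

After the call, both classes contain two samples each; all minority samples are retained and only majority samples are discarded.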

2
Rizzuto V, Settino M, Stroffolini G, Covello G, Vanags J, Naccarato M, Montanari R, de Lossada CR, Mazzotta C, Forestiero A, Adornetto C, Rechichi M, Ricca F, Greco G, Laganovska G, Borroni D. Ocular surface microbiome: Influences of physiological, environmental, and lifestyle factors. Comput Biol Med 2025; 190:110046. [PMID: 40174504 DOI: 10.1016/j.compbiomed.2025.110046] [Received: 07/17/2024] [Revised: 01/22/2025] [Accepted: 03/16/2025] [Indexed: 04/04/2025]
Abstract
PURPOSE The ocular surface (OS) microbiome is influenced by various factors and affects ocular health. Understanding its composition and dynamics is crucial for developing targeted interventions for ocular diseases. This study aims to identify host variables, including physiological, environmental, and lifestyle (PEL) factors, that influence the ocular microbiome composition and to establish valid associations between the ocular microbiome and health outcomes. METHODS 16S rRNA gene sequencing was performed on OS samples collected from 135 healthy individuals using eSwab. DNA was extracted, libraries prepared, and PCR products purified and analyzed. PEL confounding factors were identified, and a cross-validation strategy using various bioinformatics methods, including machine learning, was used to identify features that classify microbial profiles. RESULTS Nationality, allergy, sport practice, and eyeglasses usage are significant PEL confounding factors influencing the eye microbiome. Alpha-diversity analysis revealed significant differences between Spanish and Italian subjects (p-value < 0.001), with a median Shannon index of 1.05 for Spanish subjects and 0.59 for Italian subjects. Additionally, 8 microbial genera were significantly associated with eyeglasses usage. Beta-diversity analysis indicated significant differences in microbial community composition based on nationality, age, sport, and eyeglasses usage. Differential abundance analysis identified several microbial genera associated with these PEL factors. The Support Vector Machine (SVM) model for Nationality achieved an accuracy of 100%, with an AUC-ROC score of 1.0, indicating excellent performance in classifying microbial profiles. CONCLUSION This study underscores the importance of considering PEL factors when studying the ocular microbiome. Our findings highlight the complex interplay between environmental, lifestyle, and demographic factors in shaping the OS microbiome. Future research should further explore these interactions to develop personalized approaches for managing ocular health.
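The Shannon index reported in the alpha-diversity results can be computed from per-genus read counts as H = -sum(p_i * ln p_i). The sketch below is a minimal illustration; the use of the natural logarithm and the toy count vectors are assumptions, since the abstract does not state the log base or the pipeline used.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h

# Hypothetical samples: four equally abundant genera vs. one dominant genus.
even = shannon_index([25, 25, 25, 25])
skewed = shannon_index([97, 1, 1, 1])
```

A perfectly even community of four genera gives H = ln 4, while a community dominated by one genus gives a much lower value, which is the kind of contrast a difference in median Shannon index reflects.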
Affiliation(s)
- Vincenzo Rizzuto
  - Clinic of Ophthalmology, P. Stradins Clinical University Hospital, Riga, Latvia; School of Advanced Studies, Center for Neuroscience, University of Camerino, Camerino, Italy; Latvian American Eye Center (LAAC), Riga, Latvia
- Marzia Settino
  - Department of Mathematics and Computer Science, University of Calabria, Rende, Italy; Institute of High Performance Computing and Networks-National Research Council (ICAR-CNR), Rende, Italy
- Giacomo Stroffolini
  - Department of Infectious-Tropical Diseases and Microbiology, IRCCS Sacro Cuore Don Calabria Hospital, Verona, Italy
- Giuseppe Covello
  - Department of Surgical, Medical, Molecular Pathology and Critical Care Medicine, University of Pisa, Pisa, Italy
- Juris Vanags
  - Department of Ophthalmology, Riga Stradins University, Riga, Latvia; Clinic of Ophthalmology, P. Stradins Clinical University Hospital, Riga, Latvia
- Marta Naccarato
  - Clinic of Ophthalmology, P. Stradins Clinical University Hospital, Riga, Latvia; Iris Medical Center, Cosenza, Italy
- Roberto Montanari
  - Pharmacology Institute, Heidelberg University Hospital, Heidelberg, Germany
- Carlos Rocha de Lossada
  - Eyemetagenomics Ltd., London, United Kingdom; Ophthalmology Department, QVision, Almeria, Spain; Ophthalmology Department, Hospital Regional Universitario of Malaga, Malaga, Spain; Department of Surgery, Ophthalmology Area, University of Seville, Seville, Spain
- Cosimo Mazzotta
  - Siena Crosslinking Center, Siena, Italy; Departmental Ophthalmology Unit, USL Toscana Sud Est, Siena, Italy; Postgraduate Ophthalmology School, University of Siena, Siena, Italy
- Agostino Forestiero
  - Institute of High Performance Computing and Networks-National Research Council (ICAR-CNR), Rende, Italy
- Francesco Ricca
  - Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
- Gianluigi Greco
  - Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
- Guna Laganovska
  - Department of Ophthalmology, Riga Stradins University, Riga, Latvia; Clinic of Ophthalmology, P. Stradins Clinical University Hospital, Riga, Latvia
- Davide Borroni
  - Department of Ophthalmology, Riga Stradins University, Riga, Latvia; Eyemetagenomics Ltd., London, United Kingdom; Centro Oculistico Borroni, Gallarate, Italy

3
Akbar S, Raza A, Awan HH, Zou Q, Alghamdi W, Saeed A. pNPs-CapsNet: Predicting Neuropeptides Using Protein Language Models and FastText Encoding-Based Weighted Multi-View Feature Integration with Deep Capsule Neural Network. ACS Omega 2025; 10:12403-12416. [PMID: 40191328 PMCID: PMC11966582 DOI: 10.1021/acsomega.4c11449] [Received: 12/24/2024] [Revised: 02/04/2025] [Accepted: 03/07/2025] [Indexed: 04/09/2025]
Abstract
Neuropeptides (NPs) are critical signaling molecules that are essential in numerous physiological processes and possess significant therapeutic potential. Computational prediction of NPs has emerged as a promising alternative to traditional experimental methods, which are often labor-intensive, time-consuming, and expensive. Recent advancements in computational peptide models provide a cost-effective approach to identifying NPs, which are characterized by high selectivity toward target cells and minimal side effects. In this study, we propose a novel deep capsule neural network-based computational model, namely pNPs-CapsNet, to accurately discriminate NPs from non-NPs. Input samples are numerically encoded using pretrained protein language models, including ESM, ProtBERT-BFD, and ProtT5, to extract attention mechanism-based contextual and semantic features. A differential evolution-based weighted feature integration method is utilized to construct a multiview vector. Additionally, a two-tier feature selection strategy, comprising MRMD and SHAP analysis, is developed to identify and select optimal features. Finally, the novel capsule neural network (CapsNet) is trained using the selected optimal feature set. The proposed pNPs-CapsNet model achieved a predictive accuracy of 98.10% and an AUC of 0.98. To validate its generalization capability, the model was evaluated on independent samples, where it achieved an accuracy of 95.21% and an AUC of 0.96. The pNPs-CapsNet model outperforms existing state-of-the-art models, demonstrating 4% and 2.5% improved predictive accuracy for training and independent data sets, respectively. The demonstrated efficacy and consistency of pNPs-CapsNet underline its potential as a valuable and robust tool for advancing drug discovery and academic research.
Affiliation(s)
- Shahid Akbar
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan 23200, Khyber Pakhtunkhwa, Pakistan
- Ali Raza
  - Department of Computer Science, Bahria University, Islamabad 44220, Pakistan
- Hamid Hussain Awan
  - Department of Computer Science, Rawalpindi Women University, Rawalpindi 46300, Punjab, Pakistan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, PR China
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Aamir Saeed
  - Department of Computer Science and IT, University of Engineering and Technology, Jalozai Campus, Peshawar 25000, Pakistan

4
|
Asim MN, Asif T, Mehmood F, Dengel A. Peptide classification landscape: An in-depth systematic literature review on peptide types, databases, datasets, predictors architectures and performance. Comput Biol Med 2025; 188:109821. [PMID: 39987697 DOI: 10.1016/j.compbiomed.2025.109821] [Received: 09/28/2024] [Revised: 02/03/2025] [Accepted: 02/05/2025] [Indexed: 02/25/2025]
Abstract
Peptides are gaining significant attention in diverse fields; the pharmaceutical market, for example, has seen a steady rise in peptide-based therapeutics over the past six decades. Peptides have been utilized in the development of distinct applications, including inhibitors of SARS-CoV-2 and treatments for conditions like cancer and diabetes. Distinct types of peptides possess unique characteristics, and the development of peptide-specific applications requires discriminating one peptide type from others. To the best of our knowledge, approximately 230 Artificial Intelligence (AI) driven applications have been developed for 22 distinct types of peptides, yet there remains significant room for the development of new predictors. This comprehensive review addresses that gap by providing a consolidated platform for the development of AI-driven peptide classification applications. The paper offers several key contributions. It presents the biological foundations of 22 unique peptide types and categorizes them into four main classes: Regulatory, Therapeutic, Nutritional, and Delivery Peptides. It offers an in-depth overview of 47 databases that have been used to develop peptide classification benchmark datasets. It summarizes details of 288 benchmark datasets that are used in the development of diverse types of AI-driven peptide classification applications. It provides a detailed summary of 197 sequence representation learning methods and 94 classifiers that have been used to develop 230 distinct AI-driven peptide classification applications. Across 22 distinct peptide classification tasks and 288 benchmark datasets, it reports the performance values of 230 AI-driven peptide classification applications. It summarizes the experimental settings and evaluation measures that have been employed to assess the performance of AI-driven peptide classification applications. The primary focus of this manuscript is to consolidate scattered information into a single comprehensive platform. This resource will greatly assist researchers who are interested in developing new AI-driven peptide classification applications.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany; Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Tayyaba Asif
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- Faiza Mehmood
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; Institute of Data Sciences, University of Engineering and Technology, Lahore, Pakistan
- Andreas Dengel
  - German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany; Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany

5
Wei Z, Shen Y, Tang X, Wen J, Song Y, Wei M, Cheng J, Zhu X. AVPpred-BWR: antiviral peptides prediction via biological words representation. Bioinformatics 2025; 41:btaf126. [PMID: 40152250 PMCID: PMC11968319 DOI: 10.1093/bioinformatics/btaf126] [Received: 09/11/2024] [Revised: 02/17/2025] [Accepted: 03/26/2025] [Indexed: 03/29/2025] Open
Abstract
MOTIVATION Antiviral peptides (AVPs) are short chains of amino acids that show great potential as antiviral drugs. The traditional approach to identifying AVPs (e.g. wet-lab experiments) is time-consuming and laborious, while cutting-edge computational methods predict them with limited accuracy. RESULTS In this article, we propose an AVP prediction model based on biological words representation, dubbed AVPpred-BWR. Based on the fact that the secondary structures of AVPs mainly consist of α-helix and loop, we explore the biological words of 1mer (corresponding to loops) and 4mer (4 continuous residues, corresponding to α-helix). That is, the peptide sequences are decomposed into biological words, and the concealed sequential information is then represented by training Word2Vec models. Moreover, in order to extract multi-scale features, we leverage a CNN-Transformer framework to process the embeddings of 1mer and 4mer generated by the Word2Vec models. To the best of our knowledge, this is the first time that word segmentation of protein primary structure sequences has been based on the regularity of protein secondary structure. AVPpred-BWR shows clear improvements over its competitors on the independent test set (e.g. improvements of 4.6% and 11.0% for AUROC and MCC, respectively, compared to UniDL4BioPep). AVAILABILITY AND IMPLEMENTATION AVPpred-BWR is publicly available at: https://github.com/zyweizm/AVPpred-BWR or https://zenodo.org/records/14880447 (doi: 10.5281/zenodo.14880447).
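The decomposition of a peptide into 1mer and 4mer "biological words" can be pictured with the sketch below. It assumes overlapping windows with stride 1, which the abstract does not specify; the function name `to_words` and the example peptide are hypothetical and are not taken from the AVPpred-BWR code.

```python
def to_words(sequence, k):
    """Split a peptide into overlapping k-mers (stride 1); k=1 yields single residues."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

peptide = "GLFDIVKK"  # hypothetical sequence, for illustration only
words_1mer = to_words(peptide, 1)
words_4mer = to_words(peptide, 4)
```

Each resulting word list plays the role of a "sentence" that a Word2Vec model could be trained on; sequences shorter than k simply produce an empty list.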
Affiliation(s)
- Zhuoyu Wei
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Yongqi Shen
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiang Tang
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Jian Wen
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Youyi Song
  - School of Science, China Pharmaceutical University, Nanjing 210009, China
- Mingqiang Wei
  - School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
- Jing Cheng
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
- Xiaolei Zhu
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China

6
Shamas M, Tauseef H, Ahmad A, Raza A, Ghadi YY, Mamyrbayev O, Momynzhanova K, Alahmadi TJ. Classification of pulmonary diseases from chest radiographs using deep transfer learning. PLoS One 2025; 20:e0316929. [PMID: 40096069 PMCID: PMC11913265 DOI: 10.1371/journal.pone.0316929] [Received: 09/17/2024] [Accepted: 12/18/2024] [Indexed: 03/19/2025] Open
Abstract
Pulmonary diseases are leading causes of disability and death worldwide. Early diagnosis of pulmonary diseases can reduce the fatality rate. Chest radiographs are commonly used to diagnose pulmonary diseases. In clinical practice, diagnosing pulmonary diseases using chest radiographs is challenging due to overlapping and complex anatomical structures, variability in radiographs, and their quality, so a medical specialist with extensive professional experience is required. With the use of Convolutional Neural Networks in the medical field, diagnosis can be improved by automatically detecting and classifying these diseases. This paper explores the effectiveness of Convolutional Neural Networks and transfer learning in improving predictive outcomes for fifteen different pulmonary diseases using chest radiographs. Our proposed deep transfer learning-based computational model achieved promising results compared to existing state-of-the-art methods. Our model reported an overall specificity of 97.92%, a sensitivity of 97.30%, a precision of 97.94%, and an Area under the Curve of 97.61%. These results suggest that the proposed model will be a valuable tool for practitioners in decision-making and in efficiently diagnosing various pulmonary diseases.
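For readers less familiar with the reported metrics, the sketch below computes sensitivity, specificity, and precision from the confusion counts of a single class treated one-vs-rest. The counts are invented for illustration, and how the paper aggregates per-class values into "overall" figures is not stated in the abstract.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and precision from one-vs-rest confusion counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate (recall)
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "precision": _ratio(tp, tp + fp),    # positive predictive value
    }

# Hypothetical counts for one disease class.
m = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```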
Affiliation(s)
- Muneeba Shamas
  - Department of Computer Science, Lahore College for Women University, Lahore, Pakistan
- Huma Tauseef
  - Department of Computer Science, Lahore College for Women University, Lahore, Pakistan
- Ashfaq Ahmad
  - Department of Computer Science, MY University, Islamabad, Pakistan
- Ali Raza
  - Department of Computer Science, MY University, Islamabad, Pakistan
- Yazeed Yasin Ghadi
  - Department of Computer Science, Al Ain University, Abu Dhabi, United Arab Emirates
- Orken Mamyrbayev
  - Institute of Information and Computational Technologies, Almaty, Kazakhstan
- Tahani Jaser Alahmadi
  - Department of Information Systems, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia

7
Madni HA, Umer RM, Zottin S, Marr C, Foresti GL. FL-W3S: Cross-domain federated learning for weakly supervised semantic segmentation of white blood cells. Int J Med Inform 2025; 195:105806. [PMID: 39854783 DOI: 10.1016/j.ijmedinf.2025.105806] [Received: 05/04/2024] [Revised: 01/10/2025] [Accepted: 01/21/2025] [Indexed: 01/26/2025]
Abstract
BACKGROUND Segmentation models for clinical data experience severe performance degradation when trained on a single client from one domain and distributed to other clients from different domains. Federated Learning (FL) provides a solution by enabling multi-party collaborative learning without compromising the confidentiality of clients' private data. METHODS In this paper, we propose a cross-domain FL method for Weakly Supervised Semantic Segmentation (FL-W3S) of white blood cells in microscopic images. We perform model training on multiple clients with different data distributions to obtain a global aggregated model using only image-level class labels for semantic segmentation of white blood cells. A multi-class token transformer model learns the relationship between patch tokens and class tokens during collaborative learning and generates class-specific localization maps for mask predictions. To rectify the localization maps, we use patch-level pairwise affinity obtained from patch-to-patch transformer attention. RESULTS We evaluate the performance of the proposed semantic segmentation method on two datasets of white blood cells from different domains. Our experimental results show that the proposed method improves performance by 2.56% and 1.39% on the two datasets over existing state-of-the-art methods. CONCLUSION The combination of federated learning for collaborative model training while preserving data privacy, alongside white blood cell segmentation techniques for precise cell identification, enhances diagnostic accuracy and personalized treatment strategies in clinical applications, particularly in hematology and pathology. More specifically, it involves isolating white blood cells from blood smears for further analysis such as automated blood cell counting, morphological analysis, cell classification, disease diagnosis and monitoring.
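The abstract mentions a "global aggregated model" but does not name the aggregation rule. One common choice is a FedAvg-style weighted average of client parameters, sketched below under that assumption; the function name, the flat parameter lists, and the client sizes are all hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighting each client by its sample count."""
    if len(client_weights) != len(client_sizes):
        raise ValueError("one size is required per client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    global_weights = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * size / total
    return global_weights

# Two hypothetical clients holding 100 and 300 images respectively.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
global_model = federated_average(clients, sizes)
```

Only parameter vectors and sample counts cross the client boundary in this scheme; the images themselves stay with each client, which is the privacy property the abstract emphasizes.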
Affiliation(s)
- Hussain Ahmad Madni
  - Department of Computer Science and Artificial Intelligence, University of Udine, 33100, Italy
- Rao Muhammad Umer
  - Institute of AI for Health, Helmholtz Zentrum München, 85764 Munich, Germany
- Silvia Zottin
  - Department of Computer Science and Artificial Intelligence, University of Udine, 33100, Italy
- Carsten Marr
  - Institute of AI for Health, Helmholtz Zentrum München, 85764 Munich, Germany
- Gian Luca Foresti
  - Department of Computer Science and Artificial Intelligence, University of Udine, 33100, Italy

8
Fan J, Weng W, Chen Q, Wu H, Wu J. PDG2Seq: Periodic Dynamic Graph to Sequence Model for Traffic Flow Prediction. Neural Netw 2025; 183:106941. [PMID: 39642644 DOI: 10.1016/j.neunet.2024.106941] [Received: 02/18/2024] [Revised: 09/23/2024] [Accepted: 11/16/2024] [Indexed: 12/09/2024]
Abstract
Traffic flow prediction is the foundation of intelligent traffic management systems. Current methods prioritize the development of intricate models to capture spatio-temporal correlations, yet they often neglect latent features within traffic flow. Firstly, the correlation among different road nodes is dynamic rather than static. Secondly, traffic data exhibits evident periodicity, yet current research does little to explore and utilize periodic features. Lastly, current models typically rely solely on historical data, which limits their ability to capture future trend changes in traffic flow. To address these limitations, this paper proposes a Periodic Dynamic Graph to Sequence Model (PDG2Seq) for traffic flow prediction. PDG2Seq consists of the Periodic Feature Selection Module (PFSM) and the Periodic Dynamic Graph Convolutional Gated Recurrent Unit (PDCGRU) to further extract the spatio-temporal features of dynamic real-time traffic. The PFSM extracts learned periodic features using time points as indices, while the PDCGRU leverages the extracted periodic features from the PFSM and dynamic features from traffic flow to generate a Periodic Dynamic Graph for extracting spatio-temporal features. In the decoding phase, PDG2Seq utilizes periodic features corresponding to the prediction target to capture future trend changes, leading to more accurate predictions. Comprehensive experiments conducted on four large-scale datasets substantiate the superiority of PDG2Seq over existing state-of-the-art baselines. The code is available at https://github.com/wengwenchao123/PDG2Seq.
Affiliation(s)
- Jin Fan
  - Hangzhou Dianzi University, Hangzhou, China; Zhejiang Provincial Key Laboratory of Industrial Internet in Discrete Industries, Hangzhou, China
- Wenchao Weng
  - Zhejiang University of Technology, Hangzhou, China
- Qikai Chen
  - Hangzhou Dianzi University, Hangzhou, China
- Huifeng Wu
  - Hangzhou Dianzi University, Hangzhou, China; Zhejiang Provincial Key Laboratory of Industrial Internet in Discrete Industries, Hangzhou, China
- Jia Wu
  - Macquarie University, Sydney, Australia

9
Gaurav A, Gupta BB, Arya V, Attar RW, Bansal S, Alhomoud A, Chui KT. Smart waste classification in IoT-enabled smart cities using VGG16 and Cat Swarm Optimized random forest. PLoS One 2025; 20:e0316930. [PMID: 40019915 PMCID: PMC11870384 DOI: 10.1371/journal.pone.0316930] [Received: 10/03/2024] [Accepted: 12/18/2024] [Indexed: 03/03/2025] Open
Abstract
Effective waste management is becoming a crucial component of sustainable urban development as smart cities increasingly adopt smart technologies. IoT-enabled smart waste categorization systems may greatly enhance garbage sorting and recycling mechanisms. In this context, this work presents a waste categorization model based on transfer learning, using the VGG16 model for feature extraction and a Random Forest classifier tuned by Cat Swarm Optimization (CSO). On a Kaggle garbage categorization dataset, the model outperformed conventional models like SVM, XGBoost, and logistic regression. With an accuracy of 85% and an AUC of 0.85, the Random Forest model shows better precision, recall, and F1-score than standard machine learning models.
Affiliation(s)
- Akshat Gaurav
  - Ronin Institute, Montclair, New Jersey, United States of America
- Brij Bhooshan Gupta
  - Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan
  - Department of Medical Research, China Medical University Hospital, China Medical University, Taichung, Taiwan
  - Symbiosis Centre for Information Technology (SCIT), Symbiosis International University, Pune, India
  - School of Cybersecurity, Korea University, Seoul, South Korea
  - Kyung Hee University, Dongdaemun-gu, Seoul, Korea
- Varsha Arya
  - Hong Kong Metropolitan University, Hong Kong SAR, China
  - Center for Interdisciplinary Research, University of Petroleum and Energy Studies (UPES), Dehradun, India
- Razaz Waheeb Attar
  - Management Department, College of Business Administration, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
- Shavi Bansal
  - Insights2Techinfo, Jaipur, India
  - UCRD, Chandigarh University, Chandigarh, India
- Ahmed Alhomoud
  - Department of Computer Sciences, College of Science, Northern Border University, Rafha, Saudi Arabia
- Kwok Tai Chui
  - Hong Kong Metropolitan University, Hong Kong SAR, China

10
Timoneda JC, Vera SV. Behind the mask: Random and selective masking in transformer models applied to specialized social science texts. PLoS One 2025; 20:e0318421. [PMID: 39982967 PMCID: PMC11844826 DOI: 10.1371/journal.pone.0318421] [Received: 09/25/2024] [Accepted: 01/16/2025] [Indexed: 02/23/2025] Open
Abstract
Transformer models such as BERT and RoBERTa are increasingly popular in the social sciences to generate data through supervised text classification. These models can be further trained through Masked Language Modeling (MLM) to increase performance in specialized applications. MLM uses a default masking rate of 15 percent, and few works have investigated how different masking rates may affect performance. Importantly, there are no systematic tests on whether selectively masking certain words improves classifier accuracy. In this article, we further train a set of models to classify fake news around the coronavirus pandemic using 15, 25, 40, 60 and 80 percent random and selective masking. We find that a masking rate of 40 percent, both random and selective, improves within-category performance but has little impact on overall performance. This finding has important implications for scholars looking to build BERT and RoBERTa classifiers, especially those where one specific category is more relevant to their research.
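The difference between random and selective masking at a given rate can be sketched as follows. This is a simplified illustration, not the authors' procedure: the keyword set, the example sentence, and the rule "mask keywords first, then fill the remaining budget at random" are assumptions, and the standard BERT practice of sometimes substituting a random token or leaving the token unchanged is omitted.

```python
import random

def mask_tokens(tokens, rate, keywords=None, mask_token="[MASK]", seed=0):
    """Mask round(rate * len(tokens)) positions.

    With keywords=None the positions are drawn uniformly at random.
    With a keyword set, positions holding a keyword are masked first and any
    remaining budget is filled at random from the other positions.
    """
    rng = random.Random(seed)
    n_mask = round(rate * len(tokens))
    positions = list(range(len(tokens)))
    if keywords:
        preferred = [i for i in positions if tokens[i].lower() in keywords]
        others = [i for i in positions if tokens[i].lower() not in keywords]
        rng.shuffle(preferred)
        rng.shuffle(others)
        chosen = set((preferred + others)[:n_mask])
    else:
        chosen = set(rng.sample(positions, n_mask))
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]

# Hypothetical 10-token sentence; a 40 percent rate masks 4 tokens.
tokens = "the new vaccine claim spread quickly on social media today".split()
random_masked = mask_tokens(tokens, rate=0.40)
selective_masked = mask_tokens(tokens, rate=0.40, keywords={"vaccine", "claim"})
```

In the selective variant, the two keyword positions are always masked and the remaining two masks fall on randomly chosen other tokens.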
Affiliation(s)
- Joan C Timoneda
  - Department of Political Science, Purdue University, West Lafayette, Indiana, United States of America

11
Masud A, Hosen MB, Habibullah M, Anannya M, Kaiser MS. Image captioning in Bengali language using visual attention. PLoS One 2025; 20:e0309364. [PMID: 39946345 PMCID: PMC11825021 DOI: 10.1371/journal.pone.0309364] [Received: 02/25/2024] [Accepted: 08/11/2024] [Indexed: 02/16/2025] Open
Abstract
Automatically generating image captions poses one of the most challenging applications within artificial intelligence due to its integration of computer vision and natural language processing algorithms. This task becomes notably more formidable when dealing with a language as intricate as Bengali and the overall scarcity of Bengali-captioned image databases. In this investigation, a meticulously human-annotated dataset of Bengali captions has been curated specifically for the encompassing collection of pictures. Simultaneously, an innovative end-to-end architecture has been introduced to craft pertinent image descriptions in the Bengali language, leveraging an attention-driven decoder. Initially, the amalgamation of images' spatial and temporal attributes is facilitated by Gated Recurrent Units, constituting the input features. These features are subsequently fed into the attention layer alongside embedded caption features. The attention mechanism scrutinizes the interrelation between visual and linguistic representations, encompassing both categories of representations. Later, a comprehensive recursive unit comprising two layers employs the amalgamated attention traits to construct coherent sentences. Utilizing our furnished dataset, this model undergoes training, culminating in achievements of a 43% BLEU-4 score, a 39% METEOR score, and a 47% ROUGE score. Compared to all preceding endeavors in Bengali image captioning, these outcomes signify the pinnacle of current attainable standards.
Affiliation(s)
- Adiba Masud
  - Institute of Information Technology, Jahangirnagar University, Dhaka, Bangladesh
  - Department of Software Engineering, Daffodil International University, Dhaka, Bangladesh
- Md. Biplob Hosen
  - Institute of Information Technology, Jahangirnagar University, Dhaka, Bangladesh
- Md. Habibullah
  - Institute of Information Technology, Jahangirnagar University, Dhaka, Bangladesh
- Mehrin Anannya
  - Institute of Information Technology, Jahangirnagar University, Dhaka, Bangladesh
- M. Shamim Kaiser
  - Institute of Information Technology, Jahangirnagar University, Dhaka, Bangladesh
|
12
|
Hemmatian J, Hajizadeh R, Nazari F. Addressing imbalanced data classification with Cluster-Based Reduced Noise SMOTE. PLoS One 2025; 20:e0317396. [PMID: 39928607 PMCID: PMC11809912 DOI: 10.1371/journal.pone.0317396] [Received: 09/07/2024] [Accepted: 12/29/2024] [Indexed: 02/12/2025] Open
Abstract
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, with SMOTE's number of neighbors set to 5, CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall.
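For readers unfamiliar with the SMOTE step that this abstract builds on, the following is a minimal sketch of plain SMOTE-style interpolation only. It is not the authors' CRN-SMOTE and it omits the cluster-based noise reduction entirely; the function name `smote_sketch` and its parameters are hypothetical, chosen for illustration, with `k=5` mirroring the neighbor count mentioned in the abstract.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE idea;
    illustrative only, not the CRN-SMOTE method)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in smote_sketch(points, n_new=3, k=2):
        print(row)
```

Because every synthetic point lies on a segment between two existing minority samples, noisy minority samples get amplified too, which is the motivation for pairing SMOTE with a noise reduction step.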
Affiliation(s)
- Rassoul Hajizadeh
  - Machine Learning and Deep Learning Laboratory, Faculty of Engineering Modern Technologies, Amol University of Special Modern Technologies, Amol, Iran
- Fakhroddin Nazari
  - Faculty of Engineering Modern Technologies, Amol University of Special Modern Technologies, Amol, Iran
|
13
|
Feifei W, Wenrou S, Jinyue S, Qiaochu D, Jingjing L, Jin L, Junxiang L, Xuhui L, Xiao L, Congfen H. Anti-ageing mechanism of topical bioactive ingredient composition on skin based on network pharmacology. Int J Cosmet Sci 2025; 47:134-154. [PMID: 39246148 DOI: 10.1111/ics.13005] [Received: 04/16/2024] [Revised: 06/16/2024] [Accepted: 06/28/2024] [Indexed: 09/10/2024]
Abstract
OBJECTIVE To elucidate the anti-ageing mechanism of the combination of eight ingredients on the skin from a multidimensional view of the skin. METHODS The target pathway mechanisms of composition to delay skin ageing were investigated by a network pharmacology approach and experimentally validated at three levels: epidermal, dermal, and tissue. RESULTS We identified 24 statistically significant skin ageing-related pathways, encompassing crucial processes such as epidermal barrier repair, dermal collagen and elastin production, inhibition of reactive oxygen species (ROS), as well as modulation of acetylcholine and acetylcholine receptor binding. Furthermore, our in vitro experimental findings exhibited the following outcomes: the composition promotes fibroblast proliferation and the expression of barrier-related genes in the epidermis; it also stimulated the expression of collagen I, collagen III, and elastic fibre while inhibiting ROS and β-Gal levels in HDF cells within the dermis. Additionally, Spilanthol in the Acmella oleracea extract contained in the composition demonstrated neuro-relaxing activity in Zebrafish embryo, suggesting its potential as an anti-wrinkle ingredient at the hypodermis level. CONCLUSIONS In vitro experiments validated the anti-ageing mechanism of composition at multiple skin levels. This framework can be extended to unravel the functional mechanisms of other clinically validated compositions, including traditional folk recipes utilized in cosmeceuticals.
Affiliation(s)
- Wang Feifei
  - Yunnan Botanee Bio-Technology Group Co., Ltd., Yunnan, China
  - Yunnan Yunke Characteristic Plant Extraction Laboratory Co., Ltd., Yunnan, China
- Su Wenrou
  - Yunnan Botanee Bio-Technology Group Co., Ltd., Yunnan, China
  - Yunnan Yunke Characteristic Plant Extraction Laboratory Co., Ltd., Yunnan, China
- Sun Jinyue
  - AGECODE R&D Center, Yangtze Delta Region Institute of Tsinghua University, Zhejiang, China
  - Beijing Key Lab of Plant Resources Research and Development, Beijing Technology and Business University, Beijing, China
- Du Qiaochu
  - Yunnan Botanee Bio-Technology Group Co., Ltd., Yunnan, China
  - Yunnan Yunke Characteristic Plant Extraction Laboratory Co., Ltd., Yunnan, China
- Li Jingjing
  - Yunnan Botanee Bio-Technology Group Co., Ltd., Yunnan, China
  - Yunnan Yunke Characteristic Plant Extraction Laboratory Co., Ltd., Yunnan, China
- Liu Jin
  - Yunnan Botanee Bio-Technology Group Co., Ltd., Yunnan, China
  - Yunnan Yunke Characteristic Plant Extraction Laboratory Co., Ltd., Yunnan, China
- Li Junxiang
  - AGECODE R&D Center, Yangtze Delta Region Institute of Tsinghua University, Zhejiang, China
  - Harvest Biotech (Zhejiang) Co., Ltd., Zhejiang, China
- Li Xuhui
  - AGECODE R&D Center, Yangtze Delta Region Institute of Tsinghua University, Zhejiang, China
  - Zhejiang Provincial Key Laboratory of Applied Enzymology, Yangtze Delta Region Institute of Tsinghua University, Zhejiang, China
- Lin Xiao
  - School of Life Sciences, Northwestern Polytechnical University, Xi'an, China
- He Congfen
  - Beijing Key Lab of Plant Resources Research and Development, Beijing Technology and Business University, Beijing, China
|
14
|
Yue J, Li T, Xu J, Chen Z, Li Y, Liang S, Liu Z, Wang Y. Discovery of anticancer peptides from natural and generated sequences using deep learning. Int J Biol Macromol 2025; 290:138880. [PMID: 39706427 DOI: 10.1016/j.ijbiomac.2024.138880] [Received: 10/23/2024] [Revised: 12/10/2024] [Accepted: 12/16/2024] [Indexed: 12/23/2024]
Abstract
Anticancer peptides (ACPs) demonstrate significant potential in clinical cancer treatment due to their ability to selectively target and kill cancer cells. In recent years, numerous artificial intelligence (AI) algorithms have been developed. However, many predictive methods lack sufficient wet lab validation, thereby constraining the progress of models and impeding the discovery of novel ACPs. This study proposes a comprehensive research strategy by introducing CNBT-ACPred, an ACP prediction model based on a three-channel deep learning architecture, supported by extensive in vitro and in vivo experiments. CNBT-ACPred achieved an accuracy of 0.9554 and a Matthews Correlation Coefficient (MCC) of 0.8602. Compared to existing excellent models, CNBT-ACPred increased accuracy by at least 5 % and improved MCC by 15 %. Predictions were conducted on over 3.8 million sequences from Uniprot, along with 100,000 sequences generated by a deep generative model, ultimately identifying 37 out of 41 candidate peptides from >30 species that exhibited effective in vitro tumor inhibitory activity. Among these, tPep14 demonstrated significant anticancer effects in two mouse xenograft models without detectable toxicity. Finally, the study revealed correlations between the amino acid composition, structure, and function of the identified ACP candidates.
Affiliation(s)
- Jianda Yue
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
- Tingting Li
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
- Jiawei Xu
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
- Zihui Chen
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
- Yaqi Li
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
- Songping Liang
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
- Zhonghua Liu
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
- Ying Wang
  - The National and Local Joint Engineering Laboratory of Animal Peptide Drug Development, College of Life Sciences, Hunan Normal University, Changsha 410081, Hunan, China; Peptide and small molecule drug R&D plateform, Furong Laboratory, Hunan Normal University, Changsha 410081, Hunan, China; Institute of Interdisciplinary Studies, Hunan Normal University, Changsha 410081, Hunan, China
|
15
|
Li L, Wang R, Zou M, Guo F, Ren Y. Enhanced ResNet-50 for garbage classification: Feature fusion and depth-separable convolutions. PLoS One 2025; 20:e0317999. [PMID: 39869568 PMCID: PMC11771864 DOI: 10.1371/journal.pone.0317999] [Received: 09/23/2024] [Accepted: 01/08/2025] [Indexed: 01/29/2025] Open
Abstract
As people's material living standards continue to improve, the types and quantities of household garbage they generate rapidly increase. Therefore, it is urgent to develop a reasonable and effective method for garbage classification. This is important for resource recycling and environmental improvement and contributes to the sustainable development of production and the economy. However, existing deep learning-based garbage image classification models generally suffer from low classification accuracy, insufficient robustness, and slow detection speed due to the large number of model parameters. To this end, a new garbage image classification model is proposed, with the ResNet-50 network as the core architecture. Specifically, first, a redundancy-weighted feature fusion module is proposed, enabling the model to fully leverage valuable feature information, thereby improving its performance. At the same time, the module filters out redundant information from multi-scale features, reducing the number of model parameters. Second, the standard 3×3 convolutions in ResNet-50 are replaced with depth-separable convolutions, significantly improving the model's computational efficiency while preserving the feature extraction capability of the original convolutional structure. Finally, to address the issue of class imbalance, a weighting factor is added to the Focal Loss, aiming to mitigate the negative impact of class imbalance on model performance and enhance the model's robustness. Experimental results on the TrashNet dataset show that the proposed model effectively reduces the number of parameters, improves detection speed, and achieves an accuracy of 94.13%, surpassing the vast majority of existing deep learning-based waste image classification models, demonstrating its solid practical value.
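The abstract mentions adding a weighting factor to the Focal Loss to counter class imbalance. The sketch below shows the commonly used class-weighted form, FL = -w_c (1 - p_t)^gamma log(p_t); the exact weighting used by the authors is not given in the abstract, so the function name `weighted_focal_loss`, the per-class weights, and `gamma=2.0` are assumptions for illustration.

```python
import math


def weighted_focal_loss(probs, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for a single sample (illustrative sketch).

    probs         -- predicted class probabilities (should sum to 1)
    target        -- index of the true class
    class_weights -- per-class weighting factors (hypothetical values)
    gamma         -- focusing parameter; gamma=0 gives weighted cross-entropy
    """
    # Clamp to avoid log(0) for degenerate predictions.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    weights = [1.0, 3.0]  # assumed: the rarer class 1 is up-weighted
    print(weighted_focal_loss([0.9, 0.1], 0, weights))  # easy majority sample
    print(weighted_focal_loss([0.9, 0.1], 1, weights))  # hard minority sample
```

The `(1 - p_t) ** gamma` term shrinks the loss of well-classified samples, while the class weight scales up the contribution of under-represented classes.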
Affiliation(s)
- Lingbo Li
  - Library of Information Center, Zhejiang Technical Institute of Economics, Hangzhou, China
- Runpu Wang
  - School of Computer Science Engineering, University of New South Wales, Canberra, Australia
- Miaojie Zou
  - Faculty of Business and Economics, Monash University, Melbourne, Australia
- Fusen Guo
  - School of Systems and Computing, University of New South Wales, Canberra, Australia
- Yuheng Ren
  - School of Business Economics, European Union University, Montreux, Switzerland
|
16
|
Iqbal MW, Shahab M, Ullah Z, Zheng G, Anjum I, Shazly GA, Mengistie AA, Sun X, Yuan Q. Integrating machine learning and structure-based approaches for repurposing potent tyrosine protein kinase Src inhibitors to treat inflammatory disorders. Sci Rep 2025; 15:1836. [PMID: 39805859 PMCID: PMC11730308 DOI: 10.1038/s41598-024-83767-9] [Received: 09/03/2024] [Accepted: 12/17/2024] [Indexed: 01/16/2025] Open
Abstract
Tyrosine-protein kinase Src plays a key role in cell proliferation and growth under favorable conditions, but its overexpression and genetic mutations can lead to the progression of various inflammatory diseases. Due to the specificity and selectivity problems of previously discovered inhibitors like dasatinib and bosutinib, we employed an integrated machine learning and structure-based drug repurposing strategy to find novel, targeted, and non-toxic Src kinase inhibitors. Different machine learning models, including random forest (RF), k-nearest neighbors (K-NN), decision tree, and support vector machine (SVM), were trained using already available bioactivity data of Src kinase targeting compounds. The performance evaluation of these models demonstrated SVM as the best model, which was further utilized to shortlist 51 highly potent compounds by screening an FDA-approved library of 1040 drugs. Molecular docking and molecular dynamic simulation were subsequently employed to evaluate the binding affinity and stability of the proposed compounds. Orlistat, acarbose and afatinib were identified as the potent leads, demonstrating stable conformations and stronger interactions, validated by root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration (RoG), and hydrogen bond analyses. Molecular Mechanics/Generalized Born Surface Area (MMGBSA) analysis validated their binding affinities by providing comparably lower binding free energies for orlistat (-33.4743 ± 3.8908), acarbose (-19.5455 ± 5.4702), and afatinib (-36.4944 ± 5.4929) than the control, dasatinib (-13.7785 ± 5.8058). Finally, toxicity analysis identified orlistat and acarbose as the possibly safer therapeutics and eliminated afatinib, which showed significant toxicity concerns. Our investigation supports the use of advanced computational methods in drug discovery and suggests further experimental validation of the proposed Src kinase inhibitors for their safer use against inflammatory diseases. The ultimate aim of this study is to advance the development of effective treatments for inflammatory diseases linked with Src overexpression.
Affiliation(s)
- Muhammad Waleed Iqbal
  - State Key Laboratory of Chemical Resources Engineering, Beijing University of Chemical Technology, Beijing, 100029, People's Republic of China
- Muhammad Shahab
  - State Key Laboratory of Chemical Resources Engineering, Beijing University of Chemical Technology, Beijing, 100029, People's Republic of China
- Zakir Ullah
  - State Key Laboratory of Chemical Resources Engineering, Beijing University of Chemical Technology, Beijing, 100029, People's Republic of China
- Guojun Zheng
  - State Key Laboratory of Chemical Resources Engineering, Beijing University of Chemical Technology, Beijing, 100029, People's Republic of China
- Irfan Anjum
  - Department of Basic Medical Sciences, Shifa College of Pharmaceutical Sciences, Shifa Tameer-e-Millat University, Islamabad, 44000, Pakistan
- Gamal A Shazly
  - Department of Pharmaceutics, College of Pharmacy, King Saud University, Riyadh, 11451, Saudi Arabia
- Xinxiao Sun
  - State Key Laboratory of Chemical Resources Engineering, Beijing University of Chemical Technology, Beijing, 100029, People's Republic of China
- Qipeng Yuan
  - State Key Laboratory of Chemical Resources Engineering, Beijing University of Chemical Technology, Beijing, 100029, People's Republic of China
|
17
|
Shahid, Hayat M, Alghamdi W, Akbar S, Raza A, Kadir RA, Sarker MR. pACP-HybDeep: predicting anticancer peptides using binary tree growth based transformer and structural feature encoding with deep-hybrid learning. Sci Rep 2025; 15:565. [PMID: 39747941 PMCID: PMC11695694 DOI: 10.1038/s41598-024-84146-0] [Received: 08/12/2024] [Accepted: 12/20/2024] [Indexed: 01/04/2025] Open
Abstract
Worldwide, cancer remains a significant health concern due to its high mortality rates. Despite numerous traditional therapies and wet-laboratory methods for treating cancer-affected cells, these approaches often face limitations, including high costs and substantial side effects. Recently, the high selectivity of peptides has garnered significant attention from scientists due to their reliable targeted actions and minimal adverse effects. Building on the significant outcomes of existing computational models, we propose a highly reliable and effective model, namely pACP-HybDeep, for the accurate prediction of anticancer peptides. In this model, training peptides are numerically encoded using an attention-based ProtBERT-BFD encoder to extract semantic features along with CTDT-based structural information. Furthermore, a k-nearest neighbor-based binary tree growth (BTG) algorithm is employed to select an optimal feature set from the multi-perspective vector. The selected feature vector is subsequently used to train a CNN + RNN-based deep learning model. Our proposed pACP-HybDeep model demonstrated a high training accuracy of 95.33% and an AUC of 0.97. To validate its generalization capabilities, the model was evaluated on the independent datasets Ind-S1, Ind-S2, and Ind-S3, achieving accuracies of 94.92%, 92.26%, and 91.16%, respectively. The demonstrated efficacy and reliability of the pACP-HybDeep model on test datasets establish it as a valuable tool for researchers in academia and pharmaceutical drug design.
Affiliation(s)
- Shahid
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Maqsood Hayat
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Shahid Akbar
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Ali Raza
  - Department of Computer Science, MY University, Islamabad, 45750, Pakistan
- Rabiah Abdul Kadir
  - Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
- Mahidur R Sarker
  - Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
  - Universidad de Diseño, Innovación y Tecnología, UDIT, Av. Alfonso XIII, 97, 28016, Madrid, Spain
|
18
|
Fang S, Hong S, Li Q, Li P, Coats T, Zou B, Kong G. Cross-modal similar clinical case retrieval using a modular model based on contrastive learning and k-nearest neighbor search. Int J Med Inform 2025; 193:105680. [PMID: 39500035 DOI: 10.1016/j.ijmedinf.2024.105680] [Received: 05/01/2024] [Revised: 09/20/2024] [Accepted: 10/28/2024] [Indexed: 12/01/2024]
Abstract
OBJECTIVE Electronic health record systems have made it possible for clinicians to use previously encountered similar cases to support clinical decision-making. However, most studies of similar case retrieval were based on single-modal data, and existing studies on cross-modal clinical case retrieval were limited. We aimed to develop a CRoss-Modal Retrieval (CRMR) model to retrieve similar clinical cases recorded in different data modalities. MATERIALS AND METHODS The publicly available Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset was used for model development and testing. The CRMR model was designed as a modular model containing two feature extraction models, two feature transformation models, one feature transformation optimization model, and one case retrieval model. The ability to retrieve similar clinical cases recorded in different data modalities was facilitated by the use of contrastive deep learning and k-nearest neighbor search. RESULTS The average retrieval precision of the developed CRMR model, denoted as AP@k, was 76.9% at k = 5, 76.7% at k = 10, 76.5% at k = 20, 76.3% at k = 50, and 77.9% at k = 100, where k is the number of similar cases returned after retrieval. The average retrieval time varied from 0.013 ms to 0.016 ms with k varying from 5 to 100. Moreover, the model can retrieve similar cases with the same multiple radiographic manifestations as the query case. DISCUSSION The CRMR model has shown promising cross-modal retrieval performance in clinical case analysis, with the potential for future scalability and improvement in handling diverse disease types and data modalities. The CRMR model has promising potential to aid clinicians in making optimal and explainable clinical decisions.
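As an illustration of the AP@k figures reported above, the sketch below computes precision over the top k retrieved cases and averages it across queries. The relevance criterion used here (a retrieved case counts as relevant when its label equals the query label) is an assumption made for this example; the abstract does not state how the authors define a similar case, and the function names are hypothetical.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Fraction of the top-k retrieved cases whose label matches the query."""
    top = retrieved_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == query_label) / len(top)


def average_precision_at_k(queries, k):
    """Mean of precision@k over (query_label, ranked_retrieved_labels) pairs."""
    scores = [precision_at_k(ranked, label, k) for label, ranked in queries]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: each query has a label and a ranked list of retrieved labels.
    toy_queries = [
        ("a", ["a", "b", "a", "a"]),
        ("b", ["a", "a", "b", "b"]),
    ]
    for k in (2, 4):
        print(k, average_precision_at_k(toy_queries, k))
```

Under this definition the value at a larger k need not be lower than at a smaller k, which is consistent with the reported AP@100 exceeding AP@50.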
Affiliation(s)
- Shichao Fang
  - National Institute of Health Data Science, Peking University, Beijing, China; Advanced Institute of Information Technology, Peking University, Hangzhou, Zhejiang, China; Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK; King's College Hospital NHS Foundation Trust, London, UK
- Shenda Hong
  - National Institute of Health Data Science, Peking University, Beijing, China
- Qing Li
  - Advanced Institute of Information Technology, Peking University, Hangzhou, Zhejiang, China
- Pengfei Li
  - Advanced Institute of Information Technology, Peking University, Hangzhou, Zhejiang, China
- Tim Coats
  - Emergency Medicine Academic Group, Department of Cardiovascular Sciences, University of Leicester, Leicester, UK
- Beiji Zou
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan, China
- Guilan Kong
  - National Institute of Health Data Science, Peking University, Beijing, China; Advanced Institute of Information Technology, Peking University, Hangzhou, Zhejiang, China
|
19
|
Han S, Jung H. NATE: Non-pArameTric approach for Explainable credit scoring on imbalanced class. PLoS One 2024; 19:e0316454. [PMID: 39739883 DOI: 10.1371/journal.pone.0316454] [Received: 09/07/2024] [Accepted: 12/11/2024] [Indexed: 01/02/2025] Open
Abstract
Credit scoring models play a crucial role for financial institutions in evaluating borrower risk and sustaining profitability. Logistic regression is widely used in credit scoring due to its robustness, interpretability, and computational efficiency; however, its predictive power decreases when applied to complex or non-linear datasets, resulting in reduced accuracy. In contrast, tree-based machine learning models often provide enhanced predictive performance but struggle with interpretability. Furthermore, imbalanced class distributions, which are prevalent in credit scoring, can adversely impact model accuracy and robustness, as the majority class tends to dominate. Despite these challenges, research that comprehensively addresses both the predictive performance and explainability aspects within the credit scoring domain remains limited. This paper introduces the Non-pArameTric oversampling approach for Explainable credit scoring (NATE), a framework designed to address these challenges by combining oversampling techniques with tree-based classifiers to enhance model performance and interpretability. NATE incorporates class balancing methods to mitigate the impact of imbalanced data distributions and integrates interpretability features to elucidate the model's decision-making process. Experimental results show that NATE substantially outperforms traditional logistic regression in credit risk classification, with improvements of 19.33% in AUC, 71.56% in MCC, and 85.33% in F1 Score. Oversampling approaches, particularly when used with gradient boosting, demonstrated superior effectiveness compared to undersampling, achieving optimal metrics of AUC: 0.9649, MCC: 0.8104, and F1 Score: 0.9072. Moreover, NATE enhances interpretability by providing detailed insights into feature contributions, aiding in understanding individual predictions. 
These findings highlight NATE's capability in managing class imbalance, improving predictive performance, and enhancing model interpretability, demonstrating its potential as a reliable and transparent tool for credit scoring applications.
Affiliation(s)
- Seongil Han
  - School of Computing & Mathematical Sciences, University of London, Birkbeck College, London, United Kingdom
- Haemin Jung
  - Department of Industrial & Management Engineering, Korea National University of Transportation, Chungju, South Korea
|
20
|
Akbar S, Ullah M, Raza A, Zou Q, Alghamdi W. DeepAIPs-Pred: Predicting Anti-Inflammatory Peptides Using Local Evolutionary Transformation Images and Structural Embedding-Based Optimal Descriptors with Self-Normalized BiTCNs. J Chem Inf Model 2024; 64:9609-9625. [PMID: 39625463 DOI: 10.1021/acs.jcim.4c01758] [Indexed: 12/24/2024]
Abstract
Inflammation is a biological response to harmful stimuli, playing a crucial role in facilitating tissue repair by eradicating pathogenic microorganisms. However, when inflammation becomes chronic, it leads to numerous serious disorders, particularly in autoimmune diseases. Anti-inflammatory peptides (AIPs) have emerged as promising therapeutic agents due to their high specificity, potency, and low toxicity. However, identifying AIPs using traditional in vivo methods is time-consuming and expensive. Recent advancements in computational-based intelligent models for peptides have offered a cost-effective alternative for identifying various inflammatory diseases, owing to their selectivity toward targeted cells with low side effects. In this paper, we propose a novel computational model, namely, DeepAIPs-Pred, for the accurate prediction of AIP sequences. The training samples are represented using LBP-PSSM- and LBP-SMR-based evolutionary image transformation methods. Additionally, to capture contextual semantic features, we employed attention-based ProtBERT-BFD embedding, and QLC for structural features. Furthermore, differential evolution (DE)-based weighted feature integration is utilized to produce a multiview feature vector. SMOTE-Tomek Links is introduced to address the class imbalance problem, and a two-layer feature selection technique is proposed to reduce and select the optimal features. Finally, the novel self-normalized bidirectional temporal convolutional networks (SnBiTCN) are trained using optimal features, achieving a significant predictive accuracy of 94.92% and an AUC of 0.97. The generalization of our proposed model is validated using two independent datasets, on which it improves accuracy over the existing state-of-the-art model by ∼2% on Ind-I and ∼10% on Ind-II. The efficacy and reliability of DeepAIPs-Pred highlight its potential as a valuable and promising tool for drug development and academic research.
Affiliation(s)
- Shahid Akbar
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP 23200, Pakistan
- Matee Ullah
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Ali Raza
  - Department of Computer Science, MY University, Islamabad 45750, Pakistan
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
|
21
|
Al-Omari AM, Akkam YH, Zyout A, Younis S, Tawalbeh SM, Al-Sawalmeh K, Al Fahoum A, Arnold J. Accelerating antimicrobial peptide design: Leveraging deep learning for rapid discovery. PLoS One 2024; 19:e0315477. [PMID: 39705302 DOI: 10.1371/journal.pone.0315477] [Received: 04/25/2024] [Accepted: 11/26/2024] [Indexed: 12/22/2024] Open
Abstract
Antimicrobial peptides (AMPs) are excellent at fighting many different infections. This demonstrates how important it is to make new AMPs that are even better at eliminating infections. The fundamental transformation in a variety of scientific disciplines, which led to the emergence of machine learning techniques, has presented significant opportunities for the development of antimicrobial peptides. Machine learning and deep learning are used to predict antimicrobial peptide efficacy in this study. The main purpose is to overcome the constraints of traditional experimental methods. The Gram-negative bacterium Escherichia coli is the model organism in this study. The investigation assesses 1,360 peptide sequences that exhibit anti-E. coli activity. These peptides' minimal inhibitory concentrations have been observed to be correlated with a set of 34 physicochemical characteristics. Two distinct methodologies are implemented. The first method uses the pre-computed physicochemical attributes of peptides as the fundamental input data for a machine-learning classification approach. In the second method, these fundamental peptide features are converted into signal images, which are then passed to a deep learning neural network. The first and second methods achieve accuracies of 74% and 92.9%, respectively. The proposed methods were developed to target a single microorganism (Gram-negative E. coli); however, they offer a framework that could potentially be adapted for other types of antimicrobial, antiviral, and anticancer peptides with further validation. Furthermore, they have the potential to result in significant time and cost reductions, as well as the development of innovative AMP-based treatments. This research contributes to the advancement of deep learning-based AMP drug discovery methodologies by generating potent peptides for drug development and application. This discovery has significant implications for the processing of biological data and for computational pharmacology.
Affiliation(s)
- Ahmad M Al-Omari
- Biomedical Systems and Informatics Engineering Department, College of Engineering, Yarmouk University, Irbid, Jordan
| | - Yazan H Akkam
- Medicinal Chemistry and Pharmacognosy Department, Faculty of Pharmacy, Yarmouk University, Irbid, Jordan
| | - Ala'a Zyout
- Biomedical Systems and Informatics Engineering Department, College of Engineering, Yarmouk University, Irbid, Jordan
| | - Shayma'a Younis
- Biomedical Systems and Informatics Engineering Department, College of Engineering, Yarmouk University, Irbid, Jordan
| | - Shefa M Tawalbeh
- Biomedical Systems and Informatics Engineering Department, College of Engineering, Yarmouk University, Irbid, Jordan
| | - Khaled Al-Sawalmeh
- Department of Basic Pathological Sciences, College of Medicine, Yarmouk University, Irbid, Jordan
| | - Amjed Al Fahoum
- Biomedical Systems and Informatics Engineering Department, College of Engineering, Yarmouk University, Irbid, Jordan
| | - Jonathan Arnold
- Genetics Department, University of Georgia, Athens, GA, United States of America
|
22
|
Kalal V, Jha BK. Cancer detection with various classification models: A comprehensive feature analysis using HMM to extract a nucleotide pattern. Comput Biol Chem 2024; 113:108215. [PMID: 39378821 DOI: 10.1016/j.compbiolchem.2024.108215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Revised: 09/04/2024] [Accepted: 09/15/2024] [Indexed: 10/10/2024]
Abstract
This work presents a novel feature extraction method for identifying complex patterns in genomic sequences by employing the Hidden Markov Model (HMM). In this study, we use HMM to identify gene nucleotide patterns that are specific to malignant and non-malignant cells. Crucial genetic components DNA and RNA are involved in many biological processes that impact both healthy and malignant cells. Early patient identification is essential to successful cancer diagnosis and therapy. Varying nucleotide patterns indicate different cellular responses, which are important to understanding the molecular causes of cancer and associated disorders. We present a detailed study of nucleotide patterns in whole raw nucleotide sequences with variations in both protein sequence (CDS) and non-protein sequence (NCDS) in both malignant and non-malignant cells. Nucleotide prediction has been achieved while computational expenses are reduced by using the proposed HMM for feature extraction and selection. The classification models implemented in this work for cancer detection are Gradient-Boosted Decision Trees (GBDT), Random Forests (RF), Decision Trees (DT), and Support Vector Machines (SVM) with kernels. The suggested classification model's accuracy and 10-fold cross-validation have been validated via comprehensive case studies. The results reveal that DT and ensemble learning techniques significantly differentiate between malignant and non-malignant DNA sequences. SVM with suitable kernels improves cancer detection accuracy significantly. Combining feature reduction approaches with nucleotide pattern classifiers based on Hidden Markov models improves performance and ensures reliable cancer detection.
Affiliation(s)
- Vijay Kalal
- Department of Mathematics, School of Technology, Pandit Deendayal Energy University, Raysan, Gandhinagar, Gujarat 382007, India.
| | - Brajesh Kumar Jha
- Department of Mathematics, School of Technology, Pandit Deendayal Energy University, Raysan, Gandhinagar, Gujarat 382007, India.
|
23
|
Wang Y, Fang C. Cycle-ESM: Generation-assisted classification of antifungal peptides using ESM protein language model. Comput Biol Chem 2024; 113:108240. [PMID: 39437594 DOI: 10.1016/j.compbiolchem.2024.108240] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2024] [Revised: 09/29/2024] [Accepted: 10/04/2024] [Indexed: 10/25/2024]
Abstract
The rising prevalence of invasive fungal infections and the emergence of antifungal resistance highlight the urgent need for new antifungal medications. Antifungal peptides have emerged as promising alternatives to traditional antimicrobial agents. The identification of natural or synthetic antifungal peptides is crucial for advancing antifungal drug development. Typically, the availability of antifungal samples is limited, and significant sequence diversity exists among antifungal peptides, posing challenges for high-throughput screening. To address the identification challenge of antifungal peptides with limited sample availability, this study introduces the Cycle ESM method. Initially, the method utilises the ESM protein language model to generate additional data on antifungal peptides, serving as a data augmentation technique to enhance model training effectiveness. Subsequently, the ESM is employed in conjunction with a textCNN model to construct a classifier for peptide prediction, with a comprehensive exploration of peptide characteristics to improve prediction accuracy. Experimental results demonstrate that the performance of the Cycle ESM method surpasses that of existing methods across three distinct antifungal peptide datasets. This study presents a novel approach to antifungal peptide prediction and offers innovative insights for addressing classification problems with limited sample availability.
Affiliation(s)
- YiMing Wang
- Beijing Institute of Petrochemical Technology, Beijing, 102617, China
| | - Chun Fang
- Beijing Institute of Petrochemical Technology, Beijing, 102617, China.
|
24
|
Qi D, Liu T. VotePLMs-AFP: Identification of antifreeze proteins using transformer-embedding features and ensemble learning. Biochim Biophys Acta Gen Subj 2024; 1868:130721. [PMID: 39426757 DOI: 10.1016/j.bbagen.2024.130721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Revised: 09/24/2024] [Accepted: 10/11/2024] [Indexed: 10/21/2024]
Abstract
Antifreeze proteins (AFPs) are a unique class of biomolecules capable of protecting other proteins, cell membranes, and cellular structures within organisms from damage caused by freezing conditions. Given the significance of AFPs in various domains such as biotechnology, agriculture, and medicine, several machine learning methods have been developed to identify AFPs. However, due to the complexity and diversity of AFPs, the predictive performance of existing methods is limited. Therefore, there is an urgent need to develop an efficient and rapid computational method for accurately predicting AFPs. In this study, we proposed a novel predictor based on transformer-embedding features and ensemble learning for the identification of AFPs, termed VotePLMs-AFP. Firstly, three types of feature descriptors were extracted from pre-trained protein language models (PLMs) during the feature extraction process. Subsequently, we analyzed six combinations generated by these three embeddings to explore the optimal feature set, which was input into the soft voting-based ensemble learning classifier for the identification of AFPs. Finally, we evaluated the model on the two benchmark datasets. The experimental results show that our model achieves high prediction accuracy in 10-fold cross-validation (CV) and independent set testing, outperforming existing state-of-the-art methods. Therefore, our model could serve as an effective tool for predicting AFPs.
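The soft voting step mentioned in this abstract can be illustrated with a short sketch: each base classifier outputs class probabilities, the probabilities are averaged (optionally with weights), and the class with the highest average wins. The function name, the equal default weights, and the example probabilities are assumptions for illustration; the abstract does not state which base classifiers or weights VotePLMs-AFP uses.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-class probabilities across classifiers.

    prob_lists: one probability vector per classifier, all of the same length.
    Returns (predicted class index, averaged probability vector).
    """
    n_models = len(prob_lists)
    weights = weights or [1.0] * n_models  # equal weights unless specified
    total = sum(weights)
    n_classes = len(prob_lists[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg

# Hypothetical outputs of three classifiers for one protein: [P(non-AFP), P(AFP)].
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
label, avg = soft_vote(probs)
```

Unlike hard (majority) voting, this keeps each classifier's confidence, so one very confident model can outweigh two marginal ones.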
Affiliation(s)
- Dawei Qi
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
| | - Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China.
|
25
|
Lu Q, Xu J, Zhang R, Liu H, Wang M, Liu X, Yue Z, Gao Y. RiceSNP-ABST: a deep learning approach to identify abiotic stress-associated single nucleotide polymorphisms in rice. Brief Bioinform 2024; 26:bbae702. [PMID: 39757606 PMCID: PMC11962596 DOI: 10.1093/bib/bbae702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 11/16/2024] [Accepted: 12/23/2024] [Indexed: 01/07/2025] Open
Abstract
Given the adverse effects faced by rice due to abiotic stresses, the precise and rapid identification of single nucleotide polymorphisms (SNPs) associated with abiotic stress traits (ABST-SNPs) in rice is crucial for developing resistant rice varieties. The scarcity of high-quality data related to abiotic stress in rice has hindered the development of computational models and constrained research efforts aimed at rice improvement and breeding. Genome-wide association studies provide a better statistical power to consider ABST-SNPs in rice. Meanwhile, deep learning methods have shown their capability in predicting disease- or phenotype-associated loci, but have primarily focused on human species. Therefore, developing predictive models for identifying ABST-SNPs in rice is both urgent and valuable. In this paper, a model called RiceSNP-ABST is proposed for predicting ABST-SNPs in rice. Firstly, six training datasets were generated using a novel strategy for negative sample construction. Secondly, four feature encoding methods were proposed based on DNA sequence fragments, followed by feature selection. Finally, convolutional neural networks with residual connections were used to determine whether the sequences contained rice ABST-SNPs. RiceSNP-ABST outperformed traditional machine learning and state-of-the-art methods on the benchmark dataset and demonstrated consistent generalization on an independent dataset and cross-species datasets. Notably, multi-granularity causal structure learning was employed to elucidate the relationships among DNA structural features, aiming to identify key genetic variants more effectively. The web-based tool for the RiceSNP-ABST can be accessed at http://rice-snp-abst.aielab.cc.
Affiliation(s)
- Quan Lu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Jiajun Xu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Renyi Zhang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Hangcheng Liu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Meng Wang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Xiaoshuang Liu
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Zhenyu Yue
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Yujia Gao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
|
26
|
Najafi H, Savoji K, Mirzaeibonehkhater M, Moravvej SV, Alizadehsani R, Pedrammehr S. A Novel Method for 3D Lung Tumor Reconstruction Using Generative Models. Diagnostics (Basel) 2024; 14:2604. [PMID: 39594270 PMCID: PMC11592759 DOI: 10.3390/diagnostics14222604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Revised: 11/02/2024] [Accepted: 11/12/2024] [Indexed: 11/28/2024] Open
Abstract
BACKGROUND Lung cancer remains a significant health concern, and the effectiveness of early detection significantly enhances patient survival rates. Identifying lung tumors with high precision is a challenge due to the complex nature of tumor structures and the surrounding lung tissues. METHODS To address these hurdles, this paper presents an innovative three-step approach that leverages Generative Adversarial Networks (GAN), Long Short-Term Memory (LSTM), and VGG16 algorithms for the accurate reconstruction of three-dimensional (3D) lung tumor images. The first challenge we address is the accurate segmentation of lung tissues from CT images, a task complicated by the overwhelming presence of non-lung pixels, which can lead to classifier imbalance. Our solution employs a GAN model trained with a reinforcement learning (RL)-based algorithm to mitigate this imbalance and enhance segmentation accuracy. The second challenge involves precisely detecting tumors within the segmented lung regions. We introduce a second GAN model with a novel loss function that significantly improves tumor detection accuracy. Following successful segmentation and tumor detection, the VGG16 algorithm is utilized for feature extraction, preparing the data for the final 3D reconstruction. These features are then processed through an LSTM network and converted into a format suitable for the reconstructive GAN. This GAN, equipped with dilated convolution layers in its discriminator, captures extensive contextual information, enabling the accurate reconstruction of the tumor's 3D structure. RESULTS The effectiveness of our method is demonstrated through rigorous evaluation against established techniques using the LIDC-IDRI dataset and standard performance metrics, showcasing its superior performance and potential for enhancing early lung cancer detection. CONCLUSIONS This study highlights the benefits of combining GANs, LSTM, and VGG16 into a unified framework. 
This approach significantly improves the accuracy of detecting and reconstructing lung tumors, promising to enhance diagnostic methods and patient results in lung cancer treatment.
Affiliation(s)
- Hamidreza Najafi
- Biomedical Engineering Department, School of Electrical Engineering, Iran University of Science and Technology, Tehran 16846-13114, Iran;
| | - Kimia Savoji
- Biomedical Data Science and Informatics, School of Computing, Clemson University, Clemson, SC 29634, USA;
| | - Marzieh Mirzaeibonehkhater
- Department of Electrical and Computer Engineering, Indiana University-Purdue University, Indianapolis, IN 46202, USA;
| | - Seyed Vahid Moravvej
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran;
| | - Roohallah Alizadehsani
- Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Geelong, VIC 3216, Australia;
| | - Siamak Pedrammehr
- Faculty of Design, Tabriz Islamic Art University, Tabriz 51647-36931, Iran
|
27
|
Noor S, Naseem A, Awan HH, Aslam W, Khan S, AlQahtani SA, Ahmad N. Deep-m5U: a deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration. BMC Bioinformatics 2024; 25:360. [PMID: 39563239 DOI: 10.1186/s12859-024-05978-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Accepted: 11/06/2024] [Indexed: 11/21/2024] Open
Abstract
BACKGROUND RNA 5-methyluridine (m5U) modifications play a crucial role in biological processes, making their accurate identification a key focus in computational biology. This paper introduces Deep-m5U, a robust predictor designed to enhance the prediction of m5U modifications. The proposed method, named Deep-m5U, utilizes a hybrid pseudo-K-tuple nucleotide composition (PseKNC) for sequence formulation, a Shapley Additive exPlanations (SHAP) algorithm for discriminant feature selection, and a deep neural network (DNN) as the classifier. RESULTS The model was evaluated using two benchmark datasets, i.e., Full Transcript and Mature mRNA. Deep-m5U achieved overall accuracies of 91.47% and 95.86% for the Full Transcript and Mature mRNA datasets with 10-fold cross-validation, and for independent samples, the model attained 92.94% and 95.17% accuracy. CONCLUSION Compared to existing models, Deep-m5U showed approximately 5.23% and 3.73% higher accuracy on the training data and 3.95% and 3.26% higher accuracy on independent samples for the Full Transcript and Mature mRNA datasets, respectively. The reliability and effectiveness of Deep-m5U make it a valuable tool for scientists and a potential asset in pharmaceutical design and research.
Affiliation(s)
- Sumaiya Noor
- Business and Management Sciences Department, Purdue University, West Lafayette, IN, USA
| | - Afshan Naseem
- Institute of Oceanography and Environment (INOS), Universiti Malaysia Terengganu, 21030, Kuala Nerus, Terengganu, Malaysia
| | - Hamid Hussain Awan
- Department of Computer Science, Muslim Youth University, Islamabad, Pakistan
| | - Wasiq Aslam
- Department of Computer Science, Muslim Youth University, Islamabad, Pakistan
| | - Salman Khan
- New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| | - Salman A AlQahtani
- New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| | - Nijad Ahmad
- Department of Computer Science, Khurasan University, Jalalabad, Afghanistan.
|
28
|
Yan K. Syntactic analysis of SMOSS model combined with improved LSTM model: Taking English writing teaching as an example. PLoS One 2024; 19:e0312049. [PMID: 39546444 PMCID: PMC11567549 DOI: 10.1371/journal.pone.0312049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2024] [Accepted: 09/30/2024] [Indexed: 11/17/2024] Open
Abstract
This paper explores the method of combining the Sequential Matching on Sliding Window Sequences (SMOSS) model with an improved Long Short-Term Memory (LSTM) model in English writing teaching to improve learners' syntactic understanding and writing ability, thus effectively improving the quality of English writing teaching. Firstly, this paper analyzes the structure of the SMOSS model. Secondly, this paper optimizes the traditional LSTM model by using Connectionist Temporal Classification (CTC), and proposes an English text error detection model. Meanwhile, this paper combines the SMOSS model with the optimized LSTM model to form a comprehensive syntactic analysis framework, and designs and implements the structure and code of the framework. Finally, on the one hand, the semantic disambiguation performance of the model is tested using the SemCor data set. On the other hand, taking English writing teaching as an example, the proposed method is further verified by designing a comparative experiment in groups. The results show that: (1) In the word sense disambiguation experiments, the accuracy of the proposed SMOSS-LSTM model is lowest at a context range of "3+3", rises at "5+5", peaks at "7+7", and declines at "10+10"; (2) The accuracy of syntactic analysis reached 89.5% in the experimental group, compared with only 73.2% in the control group; (3) In English text error detection, the detection accuracy of the proposed model in the experimental group is as high as 94.8%, significantly better than the traditional SMOSS-based text error detection method, whose accuracy is only 68.3%; (4) Compared with other existing studies, although the proposed model is slightly inferior to Bidirectional Encoder Representations from Transformers (BERT) in word sense disambiguation, it performs well in syntactic analysis and English text error detection, and its comprehensive performance is excellent. This paper verifies the effectiveness and practicability of applying the SMOSS model and the improved LSTM model to the syntactic analysis task in English writing teaching, and provides new ideas and methods for the application of syntactic analysis in English teaching.
Affiliation(s)
- Ke Yan
- Department of Public Instruction, Nanyang Medical College, Nanyang, Henan, China
|
29
|
Beltrán JF, Herrera-Belén L, Yáñez AJ, Jimenez L. Prediction of viral oncoproteins through the combination of generative adversarial networks and machine learning techniques. Sci Rep 2024; 14:27108. [PMID: 39511292 PMCID: PMC11543823 DOI: 10.1038/s41598-024-77028-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2024] [Accepted: 10/18/2024] [Indexed: 11/15/2024] Open
Abstract
Viral oncoproteins play crucial roles in transforming normal cells into cancer cells, representing a significant factor in the etiology of various cancers. Traditionally, identifying these oncoproteins is both time-consuming and costly. With advancements in computational biology, bioinformatics tools based on machine learning have emerged as effective methods for predicting biological activities. Here, for the first time, we propose an innovative approach that combines Generative Adversarial Networks (GANs) with supervised learning methods to enhance the accuracy and generalizability of viral oncoprotein prediction. Our methodology evaluated multiple machine learning models, including Random Forest, Multilayer Perceptron, Light Gradient Boosting Machine, eXtreme Gradient Boosting, and Support Vector Machine. In ten-fold cross-validation on our training dataset, the GAN-enhanced Random Forest model demonstrated superior performance metrics: 0.976 accuracy, 0.976 F1 score, 0.977 precision, 0.976 sensitivity, and 1.0 AUC. During independent testing, this model achieved 0.982 accuracy, 0.982 F1 score, 0.982 precision, 0.982 sensitivity, and 1.0 AUC. These results establish our new tool, VirOncoTarget, accessible via a web application. We anticipate that VirOncoTarget will be a valuable resource for researchers, enabling rapid and reliable viral oncoprotein prediction and advancing our understanding of their role in cancer biology.
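The accuracy, precision, sensitivity, and F1 figures quoted in this abstract are all derived from confusion-matrix counts. The sketch below shows the standard binary definitions; the counts are invented for illustration and are not taken from the paper, and the abstract does not state whether its reported values are per-class or averaged across classes.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }

# Hypothetical counts: 45 true positives, 5 false positives,
# 40 true negatives, 10 false negatives.
metrics = binary_metrics(tp=45, fp=5, tn=40, fn=10)
```

F1 is the harmonic mean of precision and sensitivity, so it can be written equivalently as 2·TP / (2·TP + FP + FN).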
Affiliation(s)
- Jorge F Beltrán
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile.
| | - Lisandra Herrera-Belén
- Departamento de Ciencias Básicas, Facultad de Ciencias, Universidad Santo Tomas, Temuco, Chile
| | - Alejandro J Yáñez
- Departamento de Investigación y Desarrollo, Greenvolution SpA, Puerto Varas, Chile
- Interdisciplinary Center for Aquaculture Research (INCAR), Concepcion, Chile
| | - Luis Jimenez
- Department of Chemical Engineering, Faculty of Engineering and Science, Universidad de La Frontera, Ave. Francisco Salazar 01145, Temuco, Chile
|
30
|
Qureshi MS, Qureshi MB, Iqrar U, Raza A, Ghadi YY, Innab N, Alajmi M, Qahmash A. AI based predictive acceptability model for effective vaccine delivery in healthcare systems. Sci Rep 2024; 14:26657. [PMID: 39496689 PMCID: PMC11535025 DOI: 10.1038/s41598-024-76891-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2024] [Accepted: 10/17/2024] [Indexed: 11/06/2024] Open
Abstract
Vaccine acceptance is a crucial component of a viable immunization program in a healthcare system, yet disparities in the adoption rates of new and existing vaccines prevail across regions. Disparities in the rate of vaccine acceptance result in low immunization coverage and slow uptake of newly introduced vaccines. This research presents an innovative AI-driven predictive model, designed to accurately forecast vaccine acceptance within immunization programs, while providing high interpretability. Firstly, this study classifies vaccine acceptability into Low, Medium, Partial High, and High categories. Secondly, this study implements the Feature Importance method to make the model highly interpretable for healthcare providers. Thirdly, our findings highlight the impact of demographic and socio-demographic factors on vaccine acceptance, providing valuable insights for policymakers to improve immunization rates. A sample dataset containing 7150 data records with 31 demographic and socioeconomic attributes from PDHS (2017-2018) is used in this paper. Using the LightGBM algorithm, the proposed model, constructed on the basis of different machine-learning procedures, achieved 98% accuracy in predicting the acceptability of vaccines included in the immunization program. The association rules suggest that higher SES, region, parents' occupation, and mother's education are associated with vaccine acceptability.
Affiliation(s)
- Muhammad Shuaib Qureshi
- School of Computing Sciences, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Haripur, KPK, Pakistan
| | - Muhammad Bilal Qureshi
- Department of Computer Science & IT, University of Lakki Marwat, Lakki Marwat, KPK, 28420, Pakistan
| | - Urooj Iqrar
- Department of Computer Science, Shaheed Zulfikar Ali Bhutto Institute of Science and Technology, Islamabad, 46000, Pakistan
| | - Ali Raza
- Department of Computer Science, MY University, Islamabad, Pakistan.
| | - Yazeed Yasin Ghadi
- Department of Computer Science, Al Ain University, 15551, Abu Dhabi, United Arab Emirates
| | - Nisreen Innab
- Department of Computer Science and Information Systems, College of Applied Sciences, AlMaarefa University, 13713, Diriyah, Riyadh, Saudi Arabia
| | - Masoud Alajmi
- Department of Computer Engineering, College of Computers and Information Technology, Taif University, 21944, Taif, Saudi Arabia
| | - Ayman Qahmash
- Department of Informatics and computer systems, College of Computer Science, King Khalid University, Abha, Saudi Arabia.
|
31
|
Zhang Z, Lu Y, Wang T, Wei X, Wei Z. Joint Dual Feature Distillation and Gradient Progressive Pruning for BERT compression. Neural Netw 2024; 179:106533. [PMID: 39079378 DOI: 10.1016/j.neunet.2024.106533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 06/24/2024] [Accepted: 07/09/2024] [Indexed: 09/18/2024]
Abstract
The increasing size of pre-trained language models has led to a growing interest in model compression. Pruning and distillation are the primary methods employed to compress these models. Existing pruning and distillation methods are effective in maintaining model accuracy and reducing its size. However, they come with limitations. For instance, pruning is often suboptimal and biased because it is transformed into a continuous optimization problem. Distillation relies primarily on one-to-one layer mappings for knowledge transfer, which leads to underutilization of the rich knowledge in the teacher. Therefore, we propose a method of joint pruning and distillation for automatic pruning of pre-trained language models. Specifically, we first propose Gradient Progressive Pruning (GPP), which achieves a smooth transition of indicator vector values from real to binary by progressively converging the values of unimportant units' indicator vectors to zero before the end of the search phase. This effectively overcomes the limitations of traditional pruning methods while supporting compression with higher sparsity. In addition, we propose Dual Feature Distillation (DFD). DFD adaptively fuses teacher features globally and student features locally, and then uses the dual features of global teacher features and local student features for knowledge distillation. This realizes a "preview-review" mechanism that can better extract useful information from multi-level teacher information and transfer it to the student. Comparative experiments on the GLUE benchmark dataset and ablation experiments indicate that our method outperforms other state-of-the-art methods.
Affiliation(s)
- Zhou Zhang
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China.
- Yang Lu
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Anhui Mine IOT and Security Monitoring Technology Key Laboratory, Hefei 230088, China.
- Tengfei Wang
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China.
- Xing Wei
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Intelligent Manufacturing Institute of Hefei University of Technology, Hefei 230009, China.
- Zhen Wei
  - School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China; Intelligent Manufacturing Institute of Hefei University of Technology, Hefei 230009, China.
|
32
|
Shaon MSH, Karim T, Ali MM, Ahmed K, Bui FM, Chen L, Moni MA. A robust deep learning approach for identification of RNA 5-methyluridine sites. Sci Rep 2024; 14:25688. [PMID: 39465261 PMCID: PMC11514282 DOI: 10.1038/s41598-024-76148-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Accepted: 10/10/2024] [Indexed: 10/29/2024] Open
Abstract
RNA 5-methyluridine (m5U) sites play a significant role in understanding RNA modifications, which influence numerous biological processes such as gene expression and cellular functioning. Consequently, the identification of m5U sites can play a vital role in understanding the integrity, structure, and function of RNA molecules. Therefore, this study introduces GRUpred-m5U, a novel deep learning framework based on a gated recurrent unit, for mature RNA and full transcript RNA datasets. We used three descriptor groups: nucleic acid composition, pseudo nucleic acid composition, and physicochemical properties, which include five feature extraction methods: ENAC, Kmer, DPCP, DPCP type 2, and PseDNC. Initially, we aggregated all the feature extraction methods and created a new merged set. Three hybrid models were developed employing deep-learning methods and evaluated through 10-fold cross-validation with seven evaluation metrics. After a comprehensive evaluation, the GRUpred-m5U model outperformed the other applied models, obtaining 98.41% and 96.70% accuracy on the two datasets, respectively. To our knowledge, the proposed model outperformed all existing state-of-the-art methods. The proposed supervised machine learning model was evaluated using unsupervised machine learning techniques such as principal component analysis (PCA), and it was observed that the proposed method provided a valid performance for identifying m5U. Considering its multi-layered construction, the GRUpred-m5U model has tremendous potential for future applications in the biological industry. The model, which consisted of neurons processing complicated input, excelled at pattern recognition and produced reliable results. Despite its greater size, the model obtained accurate results, which is essential in detecting m5U.
Affiliation(s)
- Tasmin Karim
  - Department of Computer Science and Informatics, Oakland University, Rochester, MI, 48309, USA
- Md Mamun Ali
  - Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
  - Department of Software Engineering, Daffodil Smart City (DSC), Daffodil International University, Birulia, Savar, Dhaka, 1216, Bangladesh
- Kawsar Ahmed
  - Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada.
  - Group of Bio-photomatiχ, Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, 1902, Tangail, Bangladesh.
  - Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Dhaka, 1216, Birulia, Bangladesh.
- Francis M Bui
  - Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Li Chen
  - Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Mohammad Ali Moni
  - AI & Digital Health Technology, Artificial Intelligence & Cyber Future Institute, Charles Sturt University, Bathurst, NSW, 2795, Australia.
  - AI & Digital Health Technology, Rural Health Research Institute, Charles Sturt University, Orange, NSW, 2800, Australia.
|
33
|
Castro-Silva JA, Moreno-García MN, Guachi-Guachi L, Peluffo-Ordóñez DH. Novel hippocampus-centered methodology for informative instance selection in Alzheimer's disease data. Heliyon 2024; 10:e37552. [PMID: 39381107 PMCID: PMC11456841 DOI: 10.1016/j.heliyon.2024.e37552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 08/30/2024] [Accepted: 09/05/2024] [Indexed: 10/10/2024] Open
Abstract
The quantity and quality of a dataset play a crucial role in the performance of prediction models. Increasing the amount of data increases the computational requirements and can introduce negligible variations, outliers, and noise. These significantly impact the model performance. Thus, instance selection techniques are crucial for building prediction models with informative data, reducing the dataset size, improving performance, and minimizing computational costs. This study proposed a novel methodology for identifying the most informative two-dimensional slices derived from magnetic resonance imaging (MRI) to study Alzheimer's disease. The efficacy of our methodology was attributable to a hippocampus-centered analysis using data from multiple atlases. The methodology was evaluated by constructing convolutional neural networks to identify Alzheimer's disease, using a consolidated dataset constructed from three standard datasets: Alzheimer's Disease Neuroimaging Initiative, Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing, and Open Access Series of Imaging Studies. The proposed methodology demonstrated a commendable subject-level classification accuracy of approximately 95.00% when distinguishing between normal cognition and Alzheimer's.
Affiliation(s)
- Juan A. Castro-Silva
  - Universidad de Salamanca, Salamanca, Spain
  - Universidad Surcolombiana, Neiva, Colombia
- Diego H. Peluffo-Ordóñez
  - College of Computing, Mohammed VI Polytechnic University, Lot 660, Hay Moulay Rachid Ben Guerir, 43150, Morocco
  - SDAS Research Group (https://sdas-group.com/), Ben Guerir 43150, Morocco
  - Faculty of Engineering, Corporación Universitaria Autónoma de Nariño, Pasto 520001, Colombia
|
34
|
Aruna AS, Babu KRR, Deepthi K. A deep drug prediction framework for viral infectious diseases using an optimizer-based ensemble of convolutional neural network: COVID-19 as a case study. Mol Divers 2024:10.1007/s11030-024-11003-7. [PMID: 39379663 DOI: 10.1007/s11030-024-11003-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2024] [Accepted: 09/26/2024] [Indexed: 10/10/2024]
Abstract
The SARS-CoV-2 outbreak highlights the persistent vulnerability of humanity to epidemics and emerging microbial threats, emphasizing the lack of time to develop disease-specific treatments. Therefore, it appears beneficial to utilize existing resources and therapies. Computational drug repositioning is an effective strategy that redirects authorized drugs to new therapeutic purposes. This strategy holds significant promise for newly emerging diseases, as drug discovery is a lengthy and expensive process. In this study, we present an ensemble method based on a convolutional neural network integrated with a genetic algorithm and a deep forest classifier for virus-drug association prediction (CGDVDA). We generated feature vectors by combining drug chemical structure and virus genomic sequence-based similarities, and extracted prominent deep features by applying the convolutional neural network. The convolved features are optimized using the genetic algorithm and classified using the ensemble deep forest classifier to predict novel virus-drug associations. The proposed method predicts drugs for COVID-19 and other viral diseases in the dataset. The model achieved a ROC-AUC score of 0.9159 under fivefold cross-validation. We compared the performance of the model with state-of-the-art approaches and classifiers. The experimental results and case studies illustrate the efficacy of CGDVDA in predicting drugs against viral infectious diseases.
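The ROC-AUC value reported in this abstract can be read as the probability that a randomly chosen true association receives a higher score than a randomly chosen non-association. The minimal sketch below computes it by pairwise comparison. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the CGDVDA code or data.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = known association, 0 = no known association.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]
auc = roc_auc(toy_labels, toy_scores)
```

In a k-fold setting, a function like this would typically be applied to the held-out fold in each round and the k values averaged; the pairwise loop is quadratic, so a rank-based implementation is preferable for large datasets.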
Affiliation(s)
- A S Aruna
  - Dept. of Information Technology, Government Engineering College Palakkad, APJ Abdul Kalam Technological University, Palakkad, Kerala, 678633, India.
  - Department of Computer Science, College of Engineering Vadakara, Kozhikode, Kerala, 673105, India.
- K R Remesh Babu
  - Dept. of Information Technology, Government Engineering College Palakkad, APJ Abdul Kalam Technological University, Palakkad, Kerala, 678633, India
- K Deepthi
  - Department of Computer Science, Central University of Kerala (Govt. of India), Kasaragod, Kerala, 671320, India
|
35
|
Kilimci ZH, Yalcin M. ACP-ESM: A novel framework for classification of anticancer peptides using protein-oriented transformer approach. Artif Intell Med 2024; 156:102951. [PMID: 39173421 DOI: 10.1016/j.artmed.2024.102951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 07/19/2024] [Accepted: 08/13/2024] [Indexed: 08/24/2024]
Abstract
Anticancer peptides (ACPs) are a class of molecules that have gained significant attention in the field of cancer research and therapy. ACPs are short chains of amino acids, the building blocks of proteins, and they possess the ability to selectively target and kill cancer cells. One of the key advantages of ACPs is their ability to selectively target cancer cells while sparing healthy cells to a greater extent. This selectivity is often attributed to differences in the surface properties of cancer cells compared to normal cells. That is why ACPs are being investigated as potential candidates for cancer therapy. ACPs may be used alone or in combination with other treatment modalities like chemotherapy and radiation therapy. While ACPs hold promise as a novel approach to cancer treatment, there are challenges to overcome, including optimizing their stability, improving selectivity, enhancing their delivery to cancer cells, coping with the continuously increasing number of peptide sequences, and developing a reliable and precise prediction model. In this work, we propose an efficient transformer-based framework that identifies ACPs with a reliable and precise prediction model. For this purpose, four different transformer models, namely ESM, ProtBERT, BioBERT, and SciBERT, are employed to detect ACPs from amino acid sequences. To demonstrate the contribution of the proposed framework, extensive experiments are carried out on datasets widely used in the literature: two versions of AntiCp2, cACP-DeepGram, and ACP-740. Experimental results show that the proposed model improves classification accuracy compared with previous studies. The proposed framework, ESM, achieves 96.45% accuracy on the AntiCp2 dataset, 97.66% accuracy on the cACP-DeepGram dataset, and 88.51% accuracy on the ACP-740 dataset, thereby setting a new state of the art. The code of the proposed framework is publicly available on GitHub (https://github.com/mstf-yalcin/acp-esm).
Affiliation(s)
- Zeynep Hilal Kilimci
  - Department of Information Systems Engineering, Kocaeli University, 41001, Kocaeli, Turkey.
- Mustafa Yalcin
  - Department of Information Systems Engineering, Kocaeli University, 41001, Kocaeli, Turkey.
|
36
|
Wen J, Ding Z, Wei Z, Xia H, Zhang Y, Zhu X. NeuroPpred-SHE: An interpretable neuropeptides prediction model based on selected features from hand-crafted features and embeddings of T5 model. Comput Biol Med 2024; 181:109048. [PMID: 39182368 DOI: 10.1016/j.compbiomed.2024.109048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 08/13/2024] [Accepted: 08/18/2024] [Indexed: 08/27/2024]
Abstract
Neuropeptides are the most ubiquitous neurotransmitters in the immune system, regulating various biological processes. Neuropeptides play a significant role in the discovery of new drugs and targets for nervous system disorders. Traditional experimental methods for identifying neuropeptides are time-consuming and costly. Although several computational methods have been developed to predict neuropeptides, their accuracy is still not satisfactory because of the limited representability of the extracted features. In this work, we propose an efficient and interpretable model, NeuroPpred-SHE, for predicting neuropeptides by selecting the optimal feature subset from both hand-crafted features and embeddings of a protein language model. Specifically, we first employed a pre-trained T5 protein language model to extract embedding features and twelve other encoding methods to extract hand-crafted features from peptide sequences. Secondly, we fused the embedding features and hand-crafted features to enhance feature representability. Thirdly, we utilized random forest (RF), Max-Relevance and Min-Redundancy (mRMR) and eXtreme Gradient Boosting (XGBoost) methods to select the optimal feature subset from the fused features. Finally, we employed five machine learning methods (GBDT, XGBoost, SVM, MLP, and LightGBM) to build the models. Our results show that the model based on GBDT achieves the best performance. Furthermore, our final model was compared with other state-of-the-art methods on an independent test set; the results indicate that our model achieves an AUROC of 97.8%, which is higher than that of all the other state-of-the-art predictors. Our model is available at: https://github.com/wenjean/NeuroPpred-SHE.
Affiliation(s)
- Jian Wen
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
- Zhijie Ding
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
- Zhuoyu Wei
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
- Hongwei Xia
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China
- Yong Zhang
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China.
- Xiaolei Zhu
  - School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, 230036, China.
|
37
|
İhtiyar MN, Özgür A. Generative language models on nucleotide sequences of human genes. Sci Rep 2024; 14:22204. [PMID: 39333252 PMCID: PMC11437190 DOI: 10.1038/s41598-024-72512-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 09/09/2024] [Indexed: 09/29/2024] Open
Abstract
Language models, especially transformer-based ones, have achieved colossal success in natural language processing. To be precise, studies like BERT for natural language understanding and works like GPT-3 for natural language generation are very important. If we consider DNA sequences as a text written with an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABERT in the field of DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. Therefore, we have focused on the development of an autoregressive generative language model such as GPT-3 for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we decided to conduct our study on a smaller scale and focus on nucleotide sequences of human genes, i.e. unique parts of DNA with specific functions, rather than the whole DNA. This decision has not significantly changed the structure of the problem, as both DNA and genes can be considered as 1D sequences consisting of four different nucleotides without losing much information and without oversimplification. First of all, we systematically studied an almost entirely unexplored problem and observed that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. Another beneficial point was learning how to work with generative models on languages we do not understand, unlike natural languages. The importance of using real-world tasks beyond classical metrics such as perplexity was noted. In addition, we examined whether the data-hungry nature of these models can be altered by selecting a language with minimal vocabulary size, four due to four different types of nucleotides. The reason for reviewing this was that choosing such a language might make the problem easier. However, in this study, we found that this did not change the amount of data required very much.
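The abstract mentions N-gram models and perplexity as a classical evaluation metric. As a hedged illustration only, the sketch below trains a bigram model with add-one smoothing on toy nucleotide strings and computes perplexity. The function names `train_bigram` and `perplexity`, the smoothing choice, and the training strings are assumptions made for this example and do not reproduce the authors' setup.

```python
import math
from collections import Counter


def train_bigram(sequences, alphabet="ACGT"):
    """Count bigrams and return an add-one-smoothed conditional probability function."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1
    vocab_size = len(alphabet)

    def prob(prev, nxt):
        # Laplace smoothing keeps unseen transitions from getting zero probability.
        return (pair_counts[(prev, nxt)] + 1) / (context_counts[prev] + vocab_size)

    return prob


def perplexity(prob, seq):
    """Perplexity of seq under the bigram model, conditioning on its first symbol."""
    log_sum = sum(math.log(prob(p, n)) for p, n in zip(seq, seq[1:]))
    return math.exp(-log_sum / (len(seq) - 1))


# Toy training data, not real gene sequences.
model = train_bigram(["ACGTACGTAC", "ACGTTGCA"])
ppl_seen = perplexity(model, "ACGTACGT")    # pattern that resembles the training data
ppl_unseen = perplexity(model, "AAAAAAAA")  # transition A->A never occurs in training
```

Lower perplexity means the model finds the sequence less surprising. With a four-letter alphabet, a model that assigned uniform probability to every next nucleotide would have a perplexity of exactly 4, which gives a simple baseline for comparison.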
Affiliation(s)
- Musa Nuri İhtiyar
  - Department of Computer Engineering, Boğaziçi University, 34342, Istanbul, Turkey.
- Arzucan Özgür
  - Department of Computer Engineering, Boğaziçi University, 34342, Istanbul, Turkey.
|
38
|
Ghafoor H, Asim MN, Ibrahim MA, Dengel A. ProSol-multi: Protein solubility prediction via amino acids multi-level correlation and discriminative distribution. Heliyon 2024; 10:e36041. [PMID: 39281576 PMCID: PMC11401092 DOI: 10.1016/j.heliyon.2024.e36041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Revised: 08/01/2024] [Accepted: 08/08/2024] [Indexed: 09/18/2024] Open
Abstract
Protein solubility prediction is useful for the careful selection of highly effective candidate proteins for drug development. In recombinant protein synthesis, solubility prediction is valuable for optimizing key protein characteristics, including stability, functionality, and ease of purification. It contains valuable information about potential biomarkers or therapeutic targets and helps in early forecasting of neurodegenerative diseases, cancer, and cardiovascular disorders. Traditional wet-lab experimental approaches to determining protein solubility are error-prone, time-consuming, and costly. Researchers have harnessed Artificial Intelligence approaches to replace experimental approaches with computational predictors. These predictors infer the solubility of proteins by analyzing amino acid distributions in raw protein sequences. There is still considerable room for the development of robust computational predictors because existing predictors still fail to extract a comprehensive discriminative distribution of amino acids. To more precisely discriminate soluble proteins from insoluble proteins, this paper presents the ProSol-Multi predictor, which makes use of a novel MLCDE encoder and a Random Forest classifier. The MLCDE encoder transforms protein sequences into informative statistical vectors by capturing the multi-level correlation and discriminative distribution of amino acids within raw protein sequences. The performance of the proposed encoder is evaluated against 56 existing protein sequence encoding methods on a widely used protein solubility prediction benchmark dataset under two different experimental settings, namely intrinsic and extrinsic. Intrinsic evaluation reveals that, among all sequence encoders, the proposed MLCDE encoder manages to generate non-overlapping clusters of soluble and insoluble classes. In extrinsic evaluation, 10 machine learning classifiers achieve better performance with the proposed MLCDE encoder than with the 56 existing protein sequence encoders. Moreover, across 4 public benchmark datasets, the proposed ProSol-Multi predictor outperforms 20 existing predictors by an average of 3% in accuracy and 2% in MCC and AU-ROC. The ProSol-Multi interactive web application is available at https://sds_genetic_analysis.opendfki.de/ProSol-Multi.
Affiliation(s)
- Hina Ghafoor
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Ali Ibrahim
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Andreas Dengel
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
|
39
|
Gunduz H. Comparative analysis of BERT and FastText representations on crowdfunding campaign success prediction. PeerJ Comput Sci 2024; 10:e2316. [PMID: 39314718 PMCID: PMC11419673 DOI: 10.7717/peerj-cs.2316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 08/19/2024] [Indexed: 09/25/2024]
Abstract
Crowdfunding has become a popular financing method, attracting investors, businesses, and entrepreneurs. However, many campaigns fail to secure funding, making it crucial to reduce participation risks using artificial intelligence (AI). This study investigates the effectiveness of advanced AI techniques in predicting the success of crowdfunding campaigns on Kickstarter by analyzing campaign blurbs. We compare the performance of two widely used text representation models, bidirectional encoder representations from transformers (BERT) and FastText, in conjunction with long short-term memory (LSTM) and gradient boosting machine (GBM) classifiers. Our analysis involves preprocessing campaign blurbs, extracting features using BERT and FastText, and evaluating the predictive performance of these features with LSTM and GBM models. All experimental results show that BERT representations significantly outperform FastText, with the highest accuracy of 0.745 achieved using a fine-tuned BERT model combined with LSTM. These findings highlight the importance of using deep contextual embeddings and the benefits of fine-tuning pre-trained models for domain-specific applications. The results are benchmarked against existing methods, demonstrating the superiority of our approach. This study provides valuable insights for improving predictive models in the crowdfunding domain, offering practical implications for campaign creators and investors.
Affiliation(s)
- Hakan Gunduz
  - Software Engineering Department, Kocaeli University, Kocaeli, Marmara, Turkey
|
40
|
Uddin I, Awan HH, Khalid M, Khan S, Akbar S, Sarker MR, Abdolrasol MGM, Alghamdi TAH. A hybrid residue based sequential encoding mechanism with XGBoost improved ensemble model for identifying 5-hydroxymethylcytosine modifications. Sci Rep 2024; 14:20819. [PMID: 39242695 PMCID: PMC11379919 DOI: 10.1038/s41598-024-71568-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Accepted: 08/29/2024] [Indexed: 09/09/2024] Open
Abstract
RNA modifications play an important role in actively controlling recently created formation in cellular regulation mechanisms, which link them to gene expression and protein. The RNA modifications have numerous alterations, presenting broad glimpses of RNA's operations and character. The modification process by the TET enzyme oxidation is the crucial change associated with cytosine hydroxymethylation. The effect of CR is an alteration in specific biochemical ways of the organism, such as gene expression and epigenetic alterations. Traditional laboratory systems that identify 5-hydroxymethylcytosine (5hmC) samples are expensive and time-consuming compared to other methods. To address this challenge, the paper proposed XGB5hmC, a machine learning algorithm based on a robust gradient boosting algorithm (XGBoost), with different residue-based formulation methods to identify 5hmC samples. Their results were amalgamated, and six different frequency residue-based encoding features were fused to form a hybrid vector in order to enhance model discrimination capabilities. In addition, the proposed model incorporates SHAP (Shapley Additive Explanations) based feature selection to demonstrate model interpretability by highlighting the most contributory features. Among the applied machine learning algorithms, the XGBoost ensemble model evaluated with tenfold cross-validation achieved better results than existing state-of-the-art models. Our model reported an accuracy of 89.97%, sensitivity of 87.78%, specificity of 94.45%, F1-score of 0.8934, and MCC of 0.8764. This study highlights the potential to provide valuable insights for enhancing medical assessment and treatment protocols, representing a significant advancement in RNA modification analysis.
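The metrics listed in this abstract (accuracy, sensitivity, specificity, F1-score, and MCC) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the XGB5hmC study.

```python
import math


def _safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = _safe_div(tp, tp + fn)  # recall on the positive class
    specificity = _safe_div(tn, tn + fp)  # recall on the negative class
    precision = _safe_div(tp, tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": _safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

Accuracy, sensitivity, specificity, and F1 lie in [0, 1] and are often quoted as percentages, whereas MCC lies in [-1, 1] and is usually reported as a plain coefficient.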
Collapse
Affiliation(s)
- Islam Uddin
  - Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Hamid Hussain Awan
  - Department of Computer Science, Muslim Youth University, Islamabad, Pakistan
- Majdi Khalid
  - Department of Computer Science and Artificial Intelligence, College of Computing, Umm Al-Qura University, Makkah, 21955, Saudi Arabia
- Salman Khan
  - Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
- Shahid Akbar
  - Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan.
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China.
- Mahidur R Sarker
  - Institute of Visual Informatics, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia
  - Universidad de Diseño, Innovación y Tecnología, UDIT, Av. Alfonso XIII, 97, 28016, Madrid, Spain
- Maher G M Abdolrasol
  - Institute of Sustainable Energy, Universiti Tenaga Nasional, Kajang, 43000, Malaysia
- Thamer A H Alghamdi
  - Wolfson Centre for Magnetics, School of Engineering, Cardiff University, Cardiff, CF24 3AA, UK.
  - Electrical Engineering Department, Faculty of Engineering, Al-Baha University, Al-Baha, 65779, Saudi Arabia.
|
41
|
Wang S, Luo B. Academic achievement prediction in higher education through interpretable modeling. PLoS One 2024; 19:e0309838. [PMID: 39236050 PMCID: PMC11376577 DOI: 10.1371/journal.pone.0309838] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Accepted: 08/20/2024] [Indexed: 09/07/2024] Open
Abstract
Student academic achievement is an important indicator for evaluating the quality of education. In particular, achievement prediction empowers educators to tailor their instructional approaches, thereby fostering advancements in both student performance and overall educational quality. However, extracting valuable insights from vast educational data to develop effective strategies for evaluating student performance remains a significant challenge for higher education institutions. Traditional machine learning (ML) algorithms often struggle to clearly delineate the interplay between the factors that influence academic success and the resulting grades. To address these challenges, this paper introduces the XGB-SHAP model, a novel approach for predicting student achievement that combines Extreme Gradient Boosting (XGBoost) with SHapley Additive exPlanations (SHAP). The model was applied to a dataset from a public university in Wuhan, encompassing the academic records of 87 students who were enrolled in a Japanese course between September 2021 and June 2023. The findings indicate that the model excels in accuracy, achieving a mean absolute error (MAE) of approximately 6 and an R-squared value near 0.82, surpassing three other ML models. The model further uncovers how different instructional modes influence the factors that contribute to student achievement. This insight supports the need for a customized approach to feature selection that aligns with the specific characteristics of each teaching mode. Furthermore, the model highlights the importance of incorporating self-directed learning skills into student-related indicators when predicting academic performance.
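For readers unfamiliar with the two regression metrics quoted in this abstract, the sketch below computes mean absolute error and R-squared from their textbook definitions. The function names and the toy grade values are assumptions made for illustration and have no connection to the study's dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical grades on a 0-100 scale.
observed = [60.0, 70.0, 80.0, 90.0]
predicted = [62.0, 68.0, 83.0, 87.0]
mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
```

MAE is expressed in the units of the target (grade points here), so an MAE of about 6 means predictions miss by roughly six points on average, while R-squared is unitless and measures the share of grade variance the model explains.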
Affiliation(s)
- Sixuan Wang
  - School of Foreign Languages, Wuhan Business University, Wuhan, Hubei, People's Republic of China
- Bin Luo
  - School of Foreign Languages, Wuhan Business University, Wuhan, Hubei, People's Republic of China
|
42
|
Kalal V, Jha BK. A Kernelized Classification Approach for Cancer Recognition Using Markovian Analysis of DNA Structure Patterns as Feature Mining. Cell Biochem Biophys 2024; 82:2249-2274. [PMID: 38847942 DOI: 10.1007/s12013-024-01336-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/22/2024] [Indexed: 10/02/2024]
Abstract
Nucleotide-based molecules called DNA and RNA are essential for several biological processes that affect both normal and cancerous cells. They contain the critical genetic material needed for normal cell growth and functioning. The DNA structure patterns that make up the genetic code affect cells' growth, behavior, and control. Different DNA structure patterns indicate different physiological effects in the cell. Knowledge of these patterns is necessary to identify the molecular origins of cancer and other disorders. Analyzing these patterns can help in the early detection of diseases, which is essential for the effectiveness of cancer research and therapy. The novelty of this study lies in examining the patterns of dinucleotide structure in several genomic regions, including the non-coding region sequence (N-CDS), coding region sequence (CDS), and whole raw DNA sequence (W.R. sequence). It provides an in-depth discussion of dinucleotide patterns related to these diverse genetic environments and covers both malignant and non-malignant DNA sequences. The Markovian modeling that predicts dinucleotide probabilities also reduces feature complexity and minimizes computational costs compared to the approaches of Kernelized Logistic Regression (KLR) and Support Vector Machine (SVM). This technique is evaluated in essential case studies using accuracy metrics and 10-fold cross-validation. The classifier and the feature reduction generated by Markovian probability operate well together and can help predict cancer. Our findings successfully distinguish DNA sequences related to cancer from those associated with non-cancerous diseases by analyzing the W.R. DNA sequence as well as its CDS and N-CDS regions.
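To make the idea of Markovian dinucleotide probabilities concrete, the sketch below estimates first-order transition probabilities between adjacent nucleotides and flattens them into a 16-dimensional feature vector. This is a generic illustration under stated assumptions, not the authors' implementation: the function names, the handling of ambiguous symbols, and the toy sequence are all hypothetical, and the downstream classifier is omitted.

```python
def transition_probabilities(seq, alphabet="ACGT"):
    """Estimate first-order Markov transition probabilities P(next | current)."""
    counts = {a: {b: 0 for b in alphabet} for a in alphabet}
    for current, nxt in zip(seq, seq[1:]):
        # Pairs containing symbols outside the alphabet (for example N) are skipped.
        if current in counts and nxt in counts[current]:
            counts[current][nxt] += 1
    probs = {}
    for a in alphabet:
        row_total = sum(counts[a].values())
        probs[a] = {
            b: (counts[a][b] / row_total if row_total else 0.0) for b in alphabet
        }
    return probs


def dinucleotide_features(seq, alphabet="ACGT"):
    """Flatten the 4 x 4 transition matrix into a 16-dimensional feature vector."""
    probs = transition_probabilities(seq, alphabet)
    return [probs[a][b] for a in alphabet for b in alphabet]


toy_sequence = "AACGTTAC"  # made-up sequence, not from the study's data
probs = transition_probabilities(toy_sequence)
features = dinucleotide_features(toy_sequence)
```

Whatever the sequence length, the feature vector stays at 16 values, which is one way such a representation can keep feature complexity low before a classifier is applied.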
Affiliation(s)
- Vijay Kalal
- Department of Mathematics, School of Technology, Pandit Deendayal Energy University, Raysan, Gandhinagar, Gujarat, 382007, India
| | - Brajesh Kumar Jha
- Department of Mathematics, School of Technology, Pandit Deendayal Energy University, Raysan, Gandhinagar, Gujarat, 382007, India.
| |
|
43
|
Xu Y, Zhang S, Zhu F, Liang Y. A deep learning model for anti-inflammatory peptides identification based on deep variational autoencoder and contrastive learning. Sci Rep 2024; 14:18451. [PMID: 39117712 PMCID: PMC11310449 DOI: 10.1038/s41598-024-69419-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2024] [Accepted: 08/05/2024] [Indexed: 08/10/2024] Open
Abstract
As a class of biologically active molecules with significant immunomodulatory and anti-inflammatory effects, anti-inflammatory peptides have important application value in the medical and biotechnology fields due to their unique biological functions. Research on the identification of anti-inflammatory peptides provides important theoretical foundations and practical value for a deeper understanding of the biological mechanisms of inflammation and immune regulation, as well as for the development of new drugs and biotechnological applications. Therefore, it is necessary to develop more advanced computational models for identifying anti-inflammatory peptides. In this study, we propose a deep learning model named DAC-AIPs based on variational autoencoder and contrastive learning for accurate identification of anti-inflammatory peptides. In the sequence encoding part, the incorporation of multi-hot encoding helps capture richer sequence information. The autoencoder, composed of convolutional layers and linear layers, can learn latent features and reconstruct features, with variational inference enhancing the representation capability of latent features. Additionally, the introduction of contrastive learning aims to improve the model's classification ability. Through cross-validation and independent dataset testing experiments, DAC-AIPs achieves superior performance compared to existing state-of-the-art models. In cross-validation, the classification accuracy of DAC-AIPs reached around 88%, which is 7% higher than previous models. Furthermore, various ablation experiments and interpretability experiments validate the effectiveness of DAC-AIPs. Finally, a user-friendly online predictor is designed to enhance the practicality of the model, and the server is freely accessible at http://dac-aips.online.
Affiliation(s)
- Yujie Xu
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, People's Republic of China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, People's Republic of China.
| | - Feng Zhu
- Center for Translational Medicine, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an, 710061, People's Republic of China
| | - Yunyun Liang
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, People's Republic of China
| |
|
44
|
Rukh G, Akbar S, Rehman G, Alarfaj FK, Zou Q. StackedEnC-AOP: prediction of antioxidant proteins using transform evolutionary and sequential features based multi-scale vector with stacked ensemble learning. BMC Bioinformatics 2024; 25:256. [PMID: 39098908 PMCID: PMC11298090 DOI: 10.1186/s12859-024-05884-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Accepted: 07/29/2024] [Indexed: 08/06/2024] Open
Abstract
BACKGROUND: Antioxidant proteins are involved in several biological processes and can protect DNA and cells from the damage of free radicals. These proteins regulate the body's oxidative stress and perform a significant role in many antioxidant-based drugs. The current in vitro-based medications are costly, time-consuming, and unable to efficiently screen and identify the targeted motif of antioxidant proteins. METHODS: In this work, we propose an accurate prediction method, namely StackedEnC-AOP, to discriminate antioxidant proteins. The training sequences are encoded by incorporating a discrete wavelet transform (DWT) into the evolutionary matrix, decomposing the PSSM-based images via two levels of DWT to form a pseudo position-specific scoring matrix (PsePSSM-DWT) based embedded vector. Additionally, the evolutionary difference formula and composite physicochemical properties methods are employed to collect the structural and sequential descriptors. The combined vector of sequential features, evolutionary descriptors, and physicochemical properties is then produced to cover the flaws of individual encoding schemes. To reduce the computational cost of the combined feature vector, the optimal features are chosen using minimum redundancy and maximum relevance (mRMR). The optimal feature vector is used to train a stacking-based ensemble meta-model. RESULTS: Our developed StackedEnC-AOP method reported a prediction accuracy of 98.40% and an AUC of 0.99 on the training sequences. To validate the model, the trained StackedEnC-AOP model was evaluated on an independent set and achieved an accuracy of 96.92% and an AUC of 0.98. CONCLUSION: Our proposed StackedEnC-AOP strategy performed significantly better than current computational models, with ~5% and ~3% improved accuracy on the training and independent sets, respectively. The efficacy and consistency of our proposed StackedEnC-AOP make it a valuable tool for data scientists, and it can play a key role in academic research and drug design.
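The two-level DWT decomposition mentioned in the METHODS can be sketched as follows. This is a minimal illustration that assumes the orthonormal Haar wavelet and treats a single PSSM column as a 1-D signal; the abstract does not state which wavelet family the authors used, and the input values are made up.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (approximation, detail)."""
    if len(signal) % 2:
        signal = signal + [signal[-1]]  # pad odd lengths by repeating the last value
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def two_level_dwt(signal):
    """Decompose a signal into level-2 approximation, level-2 detail, level-1 detail."""
    a1, d1 = haar_step(list(signal))
    a2, d2 = haar_step(a1)
    return a2, d2, d1

# Hypothetical PSSM column (scores of one amino acid along eight sequence positions).
pssm_column = [1.0, 3.0, 2.0, 2.0, -1.0, 1.0, 0.0, 4.0]
a2, d2, d1 = two_level_dwt(pssm_column)
```

Summary statistics of each coefficient band (for example mean and standard deviation) could then serve as fixed-length descriptors; that summarisation step is likewise an assumption rather than the published recipe.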
Affiliation(s)
- Gul Rukh
- Department of Zoology, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
| | - Shahid Akbar
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China
- Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
| | - Gauhar Rehman
- Department of Zoology, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
| | - Fawaz Khaled Alarfaj
- Department of Management Information Systems (MIS), School of Business, King Faisal University (KFU), 31982, Al-Ahsa, Saudi Arabia
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, People's Republic of China.
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, People's Republic of China.
| |
|
45
|
Yu JC, Ni K, Chen CT. ENCAP: Computational prediction of tumor T cell antigens with ensemble classifiers and diverse sequence features. PLoS One 2024; 19:e0307176. [PMID: 39024250 PMCID: PMC11257298 DOI: 10.1371/journal.pone.0307176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Accepted: 07/01/2024] [Indexed: 07/20/2024] Open
Abstract
Cancer immunotherapy enhances the body's natural immune system to combat cancer, offering the advantage of reduced side effects compared to traditional treatments because of its high selectivity and efficacy. Utilizing computational methods to identify tumor T cell antigens (TTCAs) is valuable in unraveling the biological mechanisms and enhancing the effectiveness of immunotherapy. In this study, we present ENCAP, a predictor for TTCAs based on ensemble classifiers and diverse sequence features. Sequences were encoded as a feature vector of 4349 entries based on 57 different feature types, followed by feature engineering and hyperparameter optimization for the machine learning models. The selected feature subsets of ENCAP are primarily composed of physicochemical properties, with several features specifically related to hydrophobicity and amphiphilicity. Two publicly available datasets were used for performance evaluation. ENCAP yields an AUC (Area Under the ROC Curve) of 0.768 and an MCC (Matthews Correlation Coefficient) of 0.522 on the first independent test set. On the second test set, it achieves an AUC of 0.960 and an MCC of 0.789. Performance evaluations show that ENCAP generates 4.8% and 13.5% improvements in MCC over the state-of-the-art methods on two popular TTCA datasets, respectively. For the third test dataset of 71 experimentally validated TTCAs from the literature, ENCAP yields a prediction accuracy of 0.873, achieving improvements ranging from 12% to 25.7% compared to three state-of-the-art methods. In general, the prediction accuracy is higher for sequences with fewer hydrophobic residues and more hydrophilic and charged residues. The source code of ENCAP is freely available at https://github.com/YnnJ456/ENCAP.
Affiliation(s)
- Jen-Chieh Yu
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
| | - Kuan Ni
- Graduate Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
| | - Ching-Tai Chen
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Center for Precision Health Research, Asia University, Taichung, Taiwan
| |
|
46
|
Zhang L, Hu X, Xiao K, Kong L. Effective identification and differential analysis of anticancer peptides. Biosystems 2024; 241:105246. [PMID: 38848816 DOI: 10.1016/j.biosystems.2024.105246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2024] [Revised: 05/27/2024] [Accepted: 06/04/2024] [Indexed: 06/09/2024]
Abstract
Anticancer peptides (ACPs) have recently emerged as promising cancer therapeutics due to their selectivity and lower toxicity. However, the number of experimentally validated ACPs is limited, and identifying ACPs from large-scale sequence data is time-consuming and expensive. Therefore, it is critical to develop and improve upon existing computational models for identifying ACPs. In this study, a computational method named ACP_DA was proposed based on peptide residue composition and physicochemical property information. To curtail overfitting and reduce computational costs, a sequential forward selection method was utilized to construct the optimal feature groups. Subsequently, the feature vectors were fed into a light gradient boosting machine classifier for model construction. An independent set test showed that ACP_DA achieved the highest Matthews correlation coefficient of 0.63 and accuracy of 0.8129, displaying at least a 2% enhancement compared to state-of-the-art methods. The satisfactory results demonstrate the effectiveness of ACP_DA as a powerful tool for identifying ACPs, with the potential to significantly contribute to the development and optimization of promising therapies. The data and source code are available at https://github.com/Zlclab/ACP_DA.
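Sequential forward selection, as named in this abstract, can be sketched generically: start from an empty set and greedily add whichever feature group most improves a score, stopping when nothing helps. The group names (`AAC`, `DPC`, `PCP`, `noise`), their usefulness values, and the toy scoring function are hypothetical stand-ins; in practice the score would be a cross-validated metric of the classifier.

```python
def sequential_forward_selection(candidates, score, max_features=None):
    """Greedily add the candidate that most improves score(selected); stop when none does."""
    selected, best = [], float("-inf")
    remaining = list(candidates)
    limit = max_features if max_features is not None else len(remaining)
    while remaining and len(selected) < limit:
        # Score every one-step extension of the current selection.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no candidate improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Hypothetical usefulness of each feature group (made-up numbers for illustration).
USEFULNESS = {"AAC": 0.50, "DPC": 0.20, "PCP": 0.30, "noise": 0.0}

def toy_score(groups):
    """Stand-in for a cross-validated metric: summed usefulness minus a size penalty."""
    return sum(USEFULNESS[g] for g in groups) - 0.05 * len(groups)

chosen, final_score = sequential_forward_selection(list(USEFULNESS), toy_score)
```

With these toy numbers the uninformative group is never added, because its only effect is the size penalty.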
Affiliation(s)
- Lichao Zhang
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, PR China; Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, PR China
| | - Xueli Hu
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, PR China
| | - Kang Xiao
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, PR China
| | - Liang Kong
- Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, PR China; School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, PR China.
| |
|
47
|
Harun-Or-Roshid M, Pham NT, Manavalan B, Kurata H. Meta-2OM: A multi-classifier meta-model for the accurate prediction of RNA 2'-O-methylation sites in human RNA. PLoS One 2024; 19:e0305406. [PMID: 38924058 PMCID: PMC11207182 DOI: 10.1371/journal.pone.0305406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 05/29/2024] [Indexed: 06/28/2024] Open
Abstract
2'-O-methylation (2-OM or Nm) is a widespread RNA modification observed in various RNA types like tRNA, mRNA, rRNA, miRNA, piRNA, and snRNA, which plays a crucial role in several biological functional mechanisms and innate immunity. To comprehend its modification mechanisms and potential epigenetic regulation, it is necessary to accurately identify 2-OM sites. However, biological experiments can be tedious, time-consuming, and expensive. Furthermore, currently available computational methods face challenges due to inadequate datasets and limited classification capabilities. To address these challenges, we proposed Meta-2OM, a cutting-edge predictor that can accurately identify 2-OM sites in human RNA. In brief, we applied a meta-learning approach that considered eight conventional machine learning algorithms, including tree-based classifiers and decision boundary-based classifiers, and eighteen different feature encoding algorithms that cover physicochemical, compositional, position-specific and natural language processing information. The predicted probabilities of 2-OM sites from the baseline models are then combined and used to train a logistic regression model that generates the final prediction. Consequently, Meta-2OM achieved excellent performance in both 5-fold cross-validation training and independent testing, outperforming all existing state-of-the-art methods. Specifically, on the independent test set, Meta-2OM achieved an overall accuracy of 0.870, sensitivity of 0.836, specificity of 0.904, and Matthews correlation coefficient of 0.743. To facilitate its use, a user-friendly web server and standalone program have been developed and are freely available at http://kurata35.bio.kyutech.ac.jp/Meta-2OM and https://github.com/kuratahiroyuki/Meta-2OM.
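The meta-learning step described in this abstract, in which baseline-model probabilities become the inputs of a logistic regression, can be sketched as below. This is a simplified illustration with three hypothetical baseline models and made-up probabilities, using a hand-written gradient-descent fit so the example stays dependency-free; it is not the Meta-2OM implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Probability of the positive class for one meta-feature vector."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Each row holds the positive-class probabilities output by three hypothetical
# baseline models for one site (made-up values); labels mark true sites.
meta_X = [[0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.7, 0.9, 0.6],
          [0.2, 0.3, 0.1], [0.1, 0.4, 0.3], [0.3, 0.2, 0.2]]
meta_y = [1, 1, 1, 0, 0, 0]
meta_model = fit_logistic(meta_X, meta_y)
```

In a real stacking pipeline the rows of `meta_X` would be out-of-fold predictions, so that the meta-model is not trained on probabilities the baseline models produced for their own training data.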
Affiliation(s)
- Md. Harun-Or-Roshid
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
| | - Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, Republic of Korea
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
| |
|
48
|
Ipkovich Á, Czvetkó T, A. Acosta L, Lee S, Nzimenyera I, Sebestyén V, Abonyi J. Network science and explainable AI-based life cycle management of sustainability models. PLoS One 2024; 19:e0300531. [PMID: 38870225 PMCID: PMC11175538 DOI: 10.1371/journal.pone.0300531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 02/29/2024] [Indexed: 06/15/2024] Open
Abstract
Model-based assessment of the potential impacts of variables on the Sustainable Development Goals (SDGs) can provide valuable additional information about possible policy intervention points. In the context of sustainability planning, machine learning techniques can provide data-driven solutions throughout the modeling life cycle. In a changing environment, existing models must be continuously reviewed and developed for effective decision support. Thus, we propose to use the Machine Learning Operations (MLOps) life cycle framework. A novel approach for model identification and development is introduced, which involves utilizing the Shapley value to determine the individual direct and indirect contributions of each variable towards the output, as well as network analysis to identify key drivers and support the identification and validation of possible policy intervention points. The applicability of the methods is demonstrated through a case study of the Hungarian water model developed by the Global Green Growth Institute. Based on the model exploration of the water efficiency and water stress SDG indicators (SDG 6.4.1 and 6.4.2 in the examined period), water reuse and water circularity offer a more effective intervention option than pricing and the use of internal or external renewable water resources.
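The Shapley value used here to attribute a model output to individual variables can be computed exactly for a handful of variables by averaging each variable's marginal contribution over all orderings. The sketch below uses a made-up three-variable model with one interaction term; the variable names `a`, `b`, `c` and all numbers are illustrative and unrelated to the Hungarian water model.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_model(active):
    """Hypothetical model output given the set of 'switched on' variables."""
    out = 0.0
    if "a" in active:
        out += 4.0
    if "b" in active:
        out += 1.0
    if "a" in active and "c" in active:
        out += 2.0  # interaction: c only matters when a is present
    return out

phi = shapley_values(["a", "b", "c"], toy_model)
```

The interaction term is split evenly between `a` and `c`, and the values sum to the difference between the full-model and empty-model outputs. Exact enumeration grows factorially with the number of variables, so larger models need sampling-based approximations.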
Affiliation(s)
- Ádám Ipkovich
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
| | - Tímea Czvetkó
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
| | - Lilibeth A. Acosta
- Climate Action and Inclusive Development (CAID) Unit, Global Green Growth Institute, Jung-gu, Seoul, Republic of Korea
| | - Sanga Lee
- Climate Action and Inclusive Development (CAID) Unit, Global Green Growth Institute, Jung-gu, Seoul, Republic of Korea
| | - Innocent Nzimenyera
- Climate Action and Inclusive Development (CAID) Unit, Global Green Growth Institute, Jung-gu, Seoul, Republic of Korea
| | - Viktor Sebestyén
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
- Sustainability Solutions Research Lab, Faculty of Engineering, University of Pannonia, Veszprém, Hungary
| | - János Abonyi
- HUN-REN-PE Complex Systems Monitoring Research Group, University of Pannonia, Veszprém, Hungary
| |
|
49
|
Jia Y, Yu Z, Hong Z. Semantic aware-based instruction embedding for binary code similarity detection. PLoS One 2024; 19:e0305299. [PMID: 38861533 PMCID: PMC11166306 DOI: 10.1371/journal.pone.0305299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Accepted: 05/27/2024] [Indexed: 06/13/2024] Open
Abstract
Binary code similarity detection plays a crucial role in various applications within binary security, including vulnerability detection and malicious software analysis. However, existing methods suffer from limited differentiation in binary embedding representations across different compilation environments and lack dynamic high-level semantics. Moreover, current approaches often neglect multi-level semantic feature extraction, thereby failing to acquire precise semantic information about the binary code. To address these limitations, this paper introduces a novel detection solution called BinBcla. This method employs an enhanced pre-training model to generate instruction embeddings with dynamic semantics for binary functions. Subsequently, a multi-feature fusion technique is used to extract local semantic information and long-distance global features from the code, with self-attention employed to capture the code's structural information. Finally, an improved cosine similarity method is employed to learn relationships among all elements of the distance vectors, thereby enhancing the model's robustness to new sample functions. Experiments are conducted across different architectures, compilers, and optimization levels. The results indicate that BinBcla achieves higher accuracy, precision and F1 score compared to existing methods.
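For orientation, the baseline that the abstract's "improved cosine similarity" builds on is the plain cosine similarity between two function embeddings. The sketch below shows only that standard measure, not BinBcla's learned variant; the three-dimensional vectors and their names are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: the same source function built by two toolchains,
# plus an unrelated function.
func_gcc_o0 = [0.9, 0.1, 0.4]
func_clang_o2 = [0.8, 0.2, 0.5]
unrelated = [-0.3, 0.9, -0.2]

sim_same = cosine_similarity(func_gcc_o0, func_clang_o2)
sim_diff = cosine_similarity(func_gcc_o0, unrelated)
```

A detector would typically threshold this score or feed it to a classifier; because plain cosine similarity weights every dimension equally, a learned alternative can be attractive.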
Affiliation(s)
- Yuhao Jia
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, China
| | - Zhicheng Yu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, China
| | - Zhen Hong
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang, China
| |
|
50
|
Chen T, Kabir MF. Explainable machine learning approach for cancer prediction through binarilization of RNA sequencing data. PLoS One 2024; 19:e0302947. [PMID: 38728288 PMCID: PMC11086842 DOI: 10.1371/journal.pone.0302947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Accepted: 04/15/2024] [Indexed: 05/12/2024] Open
Abstract
In recent years, researchers have demonstrated the effectiveness and speed of machine learning-based cancer diagnosis models. However, it is difficult to explain the results generated by machine learning models, especially ones that utilize complex high-dimensional data like RNA sequencing data. In this study, we propose the binarilization technique as a novel way to treat RNA sequencing data and use it to construct explainable cancer prediction models. We tested our proposed data processing technique on five different models, namely neural network, random forest, xgboost, support vector machine, and decision tree, using four cancer datasets collected from the National Cancer Institute Genomic Data Commons. Since our datasets are imbalanced, we evaluated the performance of all models using metrics designed for imbalanced data, such as geometric mean, Matthews correlation coefficient, F-Measure, and area under the receiver operating characteristic curve. Our approach showed comparable performance while relying on fewer features. Additionally, we demonstrated that data binarilization offers higher explainability by revealing how each feature affects the prediction. These results demonstrate the potential of the data binarilization technique in improving the performance and explainability of RNA sequencing based cancer prediction models.
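Two ideas in this abstract lend themselves to a short sketch: turning expression values into 0/1 features, and scoring a classifier with imbalance-aware metrics. The thresholding rule below (above the median counts as 1) is an assumption, since the abstract does not specify how binarilization is performed, and all values and labels are made up.

```python
import math
from statistics import median

def binarize(values, threshold):
    """Map each expression value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity."""
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (0.0 when the denominator vanishes)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical expression of one gene across five samples.
expression = [0.0, 5.2, 12.1, 0.3, 8.8]
bits = binarize(expression, median(expression))

# Hypothetical imbalanced evaluation: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
counts = confusion(y_true, y_pred)
gm, m = g_mean(*counts), mcc(*counts)
```

In this toy evaluation plain accuracy is 0.8, while the geometric mean and MCC are noticeably lower because one of only three positives is missed, which is why such metrics are preferred on imbalanced data.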
Affiliation(s)
- Tianjie Chen
- Department of Computer Science, Pennsylvania State University Harrisburg, Middletown, Pennsylvania, United States of America
| | - Md Faisal Kabir
- Department of Computer Science, Pennsylvania State University Harrisburg, Middletown, Pennsylvania, United States of America
| |
|