1
|
Arif M, Musleh S, Ghulam A, Fida H, Alqahtani Y, Alam T. StackDPPred: Multiclass prediction of defensin peptides using stacked ensemble learning with optimized features. Methods 2024; 230:129-139. [PMID: 39173785 DOI: 10.1016/j.ymeth.2024.08.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2024] [Revised: 07/30/2024] [Accepted: 08/13/2024] [Indexed: 08/24/2024] Open
Abstract
Host defense or antimicrobial peptides (AMPs) are promising candidates for protecting host against microbial pathogens for example bacteria, virus, fungi, yeast. Defensins are the type of AMPs that act as potential therapeutic drug agent and perform vital role in various biological process. Conventional Experiments to identify defensin peptides (DPs) are time consuming and expensive. Thus, the shortcomings of wet lab experiments are leveraged by computational methods to accurately predict the functional types of DPs. In this paper, we aim to propose a novel multi-class ensemble-based prediction model called StackDPPred for identifying the properties of DPs. The peptide sequences are encoded using split amino acid composition (SAAC), segmented position specific scoring matrix (SegPSSM), histogram of oriented gradients-based PSSM (HOGPSSM) and feature extraction based graphical and statistical (FEGS) descriptors. Next, principal component analysis (PCA) is used to select the best subset of attributes. After that, the optimized features are fed into single machine learning and stacking-based ensemble classifiers. Furthermore, the ablation study demonstrates the robustness and efficacy of the stacking approach using reduced features for predicting DPs and their families. The proposed StackDPPred method improves the overall accuracy by 13.41% and 7.62% compared to existing DPs predictors iDPF-PseRAAC and iDEF-PseRAAC, respectively on validation test. Additionally, we applied the local interpretable model-agnostic explanations (LIME) algorithm to understand the contribution of selected features to the overall prediction. We believe, StackDPPred could serve as a valuable tool accelerating the screening of large-scale DPs and peptide-based drug discovery process.
Collapse
Affiliation(s)
- Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Saleh Musleh
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Ali Ghulam
- Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
| | - Huma Fida
- Department of Microbiology, Abdul Wali Khan University Mardan, 23200, KPK, Pakistan
| | | | - Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
| |
Collapse
|
2
|
Arif M, Musleh S, Fida H, Alam T. PLMACPred prediction of anticancer peptides based on protein language model and wavelet denoising transformation. Sci Rep 2024; 14:16992. [PMID: 39043738 PMCID: PMC11266708 DOI: 10.1038/s41598-024-67433-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2024] [Accepted: 07/11/2024] [Indexed: 07/25/2024] Open
Abstract
Anticancer peptides (ACPs) perform a promising role in discovering anti-cancer drugs. The growing research on ACPs as therapeutic agent is increasing due to its minimal side effects. However, identifying novel ACPs using wet-lab experiments are generally time-consuming, labor-intensive, and expensive. Leveraging computational methods for fast and accurate prediction of ACPs would harness the drug discovery process. Herein, a machine learning-based predictor, called PLMACPred, is developed for identifying ACPs from peptide sequence only. PLMACPred adopted a set of encoding schemes representing evolutionary-property, composition-property, and protein language model (PLM), i.e., evolutionary scale modeling (ESM-2)- and ProtT5-based embedding to encode peptides. Then, two-dimensional (2D) wavelet denoising (WD) was employed to remove the noise from extracted features. Finally, ensemble-based cascade deep forest (CDF) model was developed to identify ACP. PLMACPred model attained superior performance on all three benchmark datasets, namely, ACPmain, ACPAlter, and ACP740 over tenfold cross validation and independent dataset. PLMACPred outperformed the existing models and improved the prediction accuracy by 18.53%, 2.4%, 7.59% on ACPmain, ACPalter, ACP740 dataset, respectively. We showed that embedding from ProtT5 and ESM-2 was capable of capturing better contextual information from the entire sequence than the other encoding schemes for ACP prediction. For the explainability of proposed model, SHAP (SHapley Additive exPlanations) method was used to analyze the feature effect on the ACP prediction. A list of novel sequence motifs was proposed from the ACP sequence using MEME suites. We believe, PLMACPred will support in accelerating the discovery of novel ACPs as well as other activities of microbial peptides.
Collapse
Affiliation(s)
- Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Saleh Musleh
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Huma Fida
- Department of Microbiology, Abdul Wali Khan University, Mardan, KPK, Pakistan
| | - Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.
| |
Collapse
|
3
|
Arif R, Kanwal S, Ahmed S, Kabir M. A Computational Predictor for Accurate Identification of Tumor Homing Peptides by Integrating Sequential and Deep BiLSTM Features. Interdiscip Sci 2024; 16:503-518. [PMID: 38733473 DOI: 10.1007/s12539-024-00628-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 03/16/2024] [Accepted: 03/27/2024] [Indexed: 05/13/2024]
Abstract
Cancer remains a severe illness, and current research indicates that tumor homing peptides (THPs) play an important part in cancer therapy. The identification of THPs can provide crucial insights for drug-discovery and pharmaceutical industries as they allow for tailored medication delivery towards cancer cells. These peptides have a high affinity enabling particular receptors present upon tumor surfaces, allowing for the creation of precision medications that reduce off-target consequences and enhance cancer patient treatment results. Wet-lab techniques are considered essential tools for studying THPs; however, they're labor-extensive and time-consuming, therefore making prediction of THPs a challenging task for the researchers. Computational-techniques, on the other hand, are considered significant tools in identifying THPs according to the sequence data. Despite many strategies have been presented to predict new THP, there is still a need to develop a robust method with higher rates of success. In this paper, we developed a novel framework, THP-DF, for accurately identifying THPs on a large-scale. Firstly, the peptide sequences are encoded through various sequential features. Secondly, each feature is passed to BiLSTM and attention layers to extract simplified deep features. Finally, an ensemble-framework is formed via integrating sequential- and deep features which are fed to a support vector machine which with 10-fold cross-validation to carry to validate the efficiency. The experimental results showed that THP-DF worked better on both [Formula: see text] and [Formula: see text] datasets by achieving accuracy of > 95% which are higher than existing predictors both datasets. This indicates that the proposed predictor could be a beneficial tool to precisely and rapidly identify THPs and will contribute to the cutting-edge cancer treatment strategies and pharmaceuticals.
Collapse
Affiliation(s)
- Roha Arif
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
| | - Sameera Kanwal
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
| | - Saeed Ahmed
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan
| | - Muhammad Kabir
- School of Systems and Technology, University of Management and Technology, Lahore, 54782, Pakistan.
| |
Collapse
|
4
|
Khalid M, Ali F, Alghamdi W, Alzahrani A, Alsini R, Alzahrani A. An ensemble computational model for prediction of clathrin protein by coupling machine learning with discrete cosine transform. J Biomol Struct Dyn 2024:1-9. [PMID: 38498362 DOI: 10.1080/07391102.2024.2329777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2023] [Accepted: 02/19/2024] [Indexed: 03/20/2024]
Abstract
Clathrin protein (CP) plays a pivotal role in numerous cellular processes, including endocytosis, signal transduction, and neuronal function. Dysregulation of CP has been associated with a spectrum of diseases. Given its involvement in various cellular functions, CP has garnered significant attention for its potential applications in drug design and medicine, ranging from targeted drug delivery to addressing viral infections, neurological disorders, and cancer. The accurate identification of CP is crucial for unraveling its function and devising novel therapeutic strategies. Computational methods offer a rapid, cost-effective, and less labor-intensive alternative to traditional identification methods, making them especially appealing for high-throughput screening. This paper introduces CL-Pred, a novel computational method for CP identification. CL-Pred leverages three feature descriptors: Dipeptide Deviation from Expected Mean (DDE), Bigram Position Specific Scoring Matrix (BiPSSM), and Position Specific Scoring Matrix-Tetra Slice-Discrete Cosine Transform (PSSM-TS-DCT). The model is trained using three classifiers: Support Vector Machine (SVM), Extremely Randomized Tree (ERT), and Light eXtreme Gradient Boosting (LiXGB). Notably, the LiXGB-based model achieves outstanding performance, demonstrating accuracies of 94.63% and 93.65% on the training and testing datasets, respectively. The proposed CL-Pred method is poised to significantly advance our comprehension of clathrin-mediated endocytosis, cellular physiology, and disease pathogenesis. Furthermore, it holds promise for identifying potential drug targets across a spectrum of diseases.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Majdi Khalid
- Department of Computer Science and Artificial Intelligence, College of Computing, Umm Al-Qura University, Makkah, Saudi Arabia
| | - Farman Ali
- Sarhad University of Science and Information Technology Peshawar, Mardan Campus, Mardan, Pakistan
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Abdulrahman Alzahrani
- Department of Information System and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Raed Alsini
- Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Alzahrani
- College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| |
Collapse
|
5
|
Zou H. iDPPIV-SI: identifying dipeptidyl peptidase IV inhibitory peptides by using multiple sequence information. J Biomol Struct Dyn 2024; 42:2144-2152. [PMID: 37125813 DOI: 10.1080/07391102.2023.2203257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2022] [Accepted: 04/10/2023] [Indexed: 05/02/2023]
Abstract
Currently, diabetes has become a great threaten for people's health in the world. Recent study shows that dipeptidyl peptidase IV (DPP-IV) inhibitory peptides may be a potential pharmaceutical agent to treat diabetes. Thus, there is a need to discriminate DPP-IV inhibitory peptides from non-DPP-IV inhibitory peptides. To address this issue, a novel computational model called iDPPIV-SI was developed in this study. In the first, 50 different types of physicochemical (PC) properties were employed to denote the peptide sequences. Three different feature descriptors including the 1-order, 2-order correlation methods and discrete wavelet transform were applied to collect useful information from the PC matrix. Furthermore, the least absolute shrinkage and selection operator (LASSO) algorithm was employed to select these most discriminative features. All of these chosen features were fed into support vector machine (SVM) for identifying DPP-IV inhibitory peptides. The iDPPIV-SI achieved 91.26% and 98.12% classification accuracies on the training and independent dataset, respectively. There is a significantly improvement in the classification performance by the proposed method, as compared with the state-of-the-art predictors. The datasets and MATLAB codes (based on MATLAB2015b) used in current study are available at https://figshare.com/articles/online_resource/iDPPIV-SI/20085878.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| |
Collapse
|
6
|
Arif M, Fang G, Fida H, Musleh S, Yu DJ, Alam T. iMRSAPred: Improved Prediction of Anti-MRSA Peptides Using Physicochemical and Pairwise Contact-Energy Properties of Amino Acids. ACS OMEGA 2024; 9:2874-2883. [PMID: 38250405 PMCID: PMC10795061 DOI: 10.1021/acsomega.3c08303] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/22/2023] [Revised: 12/06/2023] [Accepted: 12/13/2023] [Indexed: 01/23/2024]
Abstract
Methicillin-resistant Staphylococcus aureus (MRSA) is a growing concern for human lives worldwide. Anti-MRSA peptides act as potential antibiotic agents and play significant role to combat MRSA infection. Traditional laboratory-based methods for annotating Anti-MRSA peptides are although precise but quite challenging, costly, and time-consuming. Therefore, computational methods capable of identifying Anti-MRSA peptides accelerate the drug designing process for treating bacterial infections. In this study, we developed a novel sequence-based predictor "iMRSAPred" for screening Anti-MRSA peptides by incorporating energy estimation and physiochemical and sequential information. We successfully resolved the skewed imbalance phenomena by using synthetic minority oversampling technique plus Tomek link (SMOTETomek) algorithm. Furthermore, the Shapley additive explanation method was leveraged to analyze the impact of top-ranked features in the prediction task. We evaluated multiple machine learning algorithms, i.e., CatBoost, Cascade Deep Forest, Kernel and Tree Boosting, support vector machine, and HistGBoost classifiers by 10-fold cross-validation and independent testing. The proposed iMRSAPred method significantly improved the overall performance in terms of accuracy and Matthew's correlation coefficient (MCC) by 5.45 and 0.083%, respectively, on the training data set. On the independent data set, iMRSAPred improved accuracy and MCC by 3.98 and 0.055%, respectively. We believe that the proposed method would be useful in large-scale Anti-MRSA peptide prediction and provide insights into other bioactive peptides.
Collapse
Affiliation(s)
- Muhammad Arif
- College
of Science and Engineering, Hamad Bin Khalifa
University, Doha 34110, Qatar
| | - Ge Fang
- State
Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing University of Posts Telecommunications
9 Wenyuan Road, Nanjing 210023, P. R. China
- Center
for Research Innovation and Biomedical Informatics, Faculty of Medical
Technology, Mahidol University, Bankok 10700, Thailand
| | - Huma Fida
- Department
of Microbiology, Abdul Wali Khan University, Mardan 23200, KPK, Pakistan
| | - Saleh Musleh
- College
of Science and Engineering, Hamad Bin Khalifa
University, Doha 34110, Qatar
| | - Dong-Jun Yu
- School
of Computer Science and Engineering, Nanjing
University of Science and Technology, Nanjing 210023, China
| | - Tanvir Alam
- College
of Science and Engineering, Hamad Bin Khalifa
University, Doha 34110, Qatar
| |
Collapse
|
7
|
Yu Z, Yin Z, Zou H. iAMY-RECMFF: Identifying amyloidgenic peptides by using residue pairwise energy content matrix and features fusion algorithm. J Bioinform Comput Biol 2023; 21:2350023. [PMID: 37899353 DOI: 10.1142/s0219720023500233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/31/2023]
Abstract
Various diseases, including Huntington's disease, Alzheimer's disease, and Parkinson's disease, have been reported to be linked to amyloid. Therefore, it is crucial to distinguish amyloid from non-amyloid proteins or peptides. While experimental approaches are typically preferred, they are costly and time-consuming. In this study, we have developed a machine learning framework called iAMY-RECMFF to discriminate amyloidgenic from non-amyloidgenic peptides. In our model, we first encoded the peptide sequences using the residue pairwise energy content matrix. We then utilized Pearson's correlation coefficient and distance correlation to extract useful information from this matrix. Additionally, we employed an improved similarity network fusion algorithm to integrate features from different perspectives. The Fisher approach was adopted to select the optimal feature subset. Finally, the selected features were inputted into a support vector machine for identifying amyloidgenic peptides. Experimental results demonstrate that our proposed method significantly improves the identification of amyloidgenic peptides compared to existing predictors. This suggests that our method may serve as a powerful tool in identifying amyloidgenic peptides. To facilitate academic use, the dataset and codes used in the current study are accessible at https://figshare.com/articles/online_resource/iAMY-RECMFF/22816916.
Collapse
Affiliation(s)
- Zizheng Yu
- School of Communications and Electronics Jiangxi, Science and Technology Normal University, Nanchang 330013, P. R. China
| | - Zhijian Yin
- School of Communications and Electronics Jiangxi, Science and Technology Normal University, Nanchang 330013, P. R. China
- Jiangxi Engineering Research Center of Unattended Perception System and Artificial Intelligence Technology Jiangxi Science and Technology Normal University, Jiangxi 330088, P. R. China
| | - Hongliang Zou
- School of Communications and Electronics Jiangxi, Science and Technology Normal University, Nanchang 330013, P. R. China
- Jiangxi Engineering Research Center of Unattended Perception System and Artificial Intelligence Technology Jiangxi Science and Technology Normal University, Jiangxi 330088, P. R. China
| |
Collapse
|
8
|
Zou H, Yu W. Integrating Low-Order and High-Order Correlation Information for Identifying Phage Virion Proteins. J Comput Biol 2023; 30:1131-1143. [PMID: 37729064 DOI: 10.1089/cmb.2022.0237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/22/2023] Open
Abstract
Phage virion proteins (PVPs) play an important role in the host cell. Fast and accurate identification of PVPs is beneficial for the discovery and development of related drugs. Although wet experimental approaches are the first choice to identify PVPs, they are costly and time-consuming. Thus, researchers have turned their attention to computational models, which can speed up related studies. Therefore, we proposed a novel machine-learning model to identify PVPs in the current study. First, 50 different types of physicochemical properties were used to denote protein sequences. Next, two different approaches, including Pearson's correlation coefficient (PCC) and maximal information coefficient (MIC), were employed to extract discriminative information. Further, to capture the high-order correlation information, we used PCC and MIC once again. After that, we adopted the least absolute shrinkage and selection operator algorithm to select the optimal feature subset. Finally, these chosen features were fed into a support vector machine to discriminate PVPs from phage non-virion proteins. We performed experiments on two different datasets to validate the effectiveness of our proposed method. Experimental results showed a significant improvement in performance compared with state-of-the-art approaches. It indicates that the proposed computational model may become a powerful predictor in identifying PVPs.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| | - Wanting Yu
- College of Animal Science and Technology, Jiangxi Agricultural University, Nanchang, China
| |
Collapse
|
9
|
Pan J, You Z, You W, Zhao T, Feng C, Zhang X, Ren F, Ma S, Wu F, Wang S, Sun Y. PTBGRP: predicting phage-bacteria interactions with graph representation learning on microbial heterogeneous information network. Brief Bioinform 2023; 24:bbad328. [PMID: 37742053 DOI: 10.1093/bib/bbad328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2023] [Revised: 08/14/2023] [Accepted: 08/30/2023] [Indexed: 09/25/2023] Open
Abstract
Identifying the potential bacteriophages (phage) candidate to treat bacterial infections plays an essential role in the research of human pathogens. Computational approaches are recognized as a valid way to predict bacteria and target phages. However, most of the current methods only utilize lower-order biological information without considering the higher-order connectivity patterns, which helps to improve the predictive accuracy. Therefore, we developed a novel microbial heterogeneous interaction network (MHIN)-based model called PTBGRP to predict new phages for bacterial hosts. Specifically, PTBGRP first constructs an MHIN by integrating phage-bacteria interaction (PBI) and six bacteria-bacteria interaction networks with their biological attributes. Then, different representation learning methods are deployed to extract higher-level biological features and lower-level topological features from MHIN. Finally, PTBGRP employs a deep neural network as the classifier to predict unknown PBI pairs based on the fused biological information. Experiment results demonstrated that PTBGRP achieves the best performance on the corresponding ESKAPE pathogens and PBI dataset when compared with state-of-art methods. In addition, case studies of Klebsiella pneumoniae and Staphylococcus aureus further indicate that the consideration of rich heterogeneous information enables PTBGRP to accurately predict PBI from a more comprehensive perspective. The webserver of the PTBGRP predictor is freely available at http://120.77.11.78/PTBGRP/.
Collapse
Affiliation(s)
- Jie Pan
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| | - Zhuhong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| | - Wencai You
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| | - Tian Zhao
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| | - Chenlu Feng
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| | - Xuexia Zhang
- North China Pharmaceutical Group, Shijiazhuang 050015, Hebei, China
- National Microbial Medicine Engineering & Research Center, Shijiazhuang 050015, Hebei, China
| | - Fengzhi Ren
- North China Pharmaceutical Group, Shijiazhuang 050015, Hebei, China
- National Microbial Medicine Engineering & Research Center, Shijiazhuang 050015, Hebei, China
| | - Sanxing Ma
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| | - Fan Wu
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| | - Shiwei Wang
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| | - Yanmei Sun
- Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
| |
Collapse
|
10
|
Mursalim MKN, Mengko TLER, Hertadi R, Purwarianti A, Susanty M. BiCaps-DBP: Predicting DNA-binding proteins from protein sequences using Bi-LSTM and a 1D-capsule network. Comput Biol Med 2023; 163:107241. [PMID: 37437362 DOI: 10.1016/j.compbiomed.2023.107241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2023] [Revised: 06/23/2023] [Accepted: 07/07/2023] [Indexed: 07/14/2023]
Abstract
Predicting DNA-binding proteins (DBPs) based solely on primary sequences is one of the most challenging problems in genome annotation. DBPs play a crucial role in various biological processes, including DNA replication, transcription, repair, and splicing. Some DBPs are essential in pharmaceutical research on various human cancers and autoimmune diseases. Existing experimental methods for identifying DBPs are time-consuming and costly. Therefore, developing a rapid and accurate computational technique is necessary to address the issue. This study introduces BiCaps-DBP, a deep learning-based method that improves DBP prediction performance by combining bidirectional long short-term memory with a 1D-capsule network. This study uses three training and independent datasets to evaluate the proposed model's generalizability and robustness. Based on three independent datasets, BiCaps-DBP achieved 1.05%, 5.79% and 0.40% higher accuracies than an existing predictor for PDB2272, PDB186 and PDB20000, respectively. These outcomes indicate that the proposed method is a promising DBP predictor.
Collapse
Affiliation(s)
- Muhammad K N Mursalim
- School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, 40132, Indonesia; Department of Informatics Engineering, Universal University, Batam, Indonesia
| | - Tati L E R Mengko
- School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, 40132, Indonesia.
| | - Rukman Hertadi
- Faculty of Mathematics and Natural Sciences, Bandung Institute of Technology, Bandung, 40132, Indonesia
| | - Ayu Purwarianti
- School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, 40132, Indonesia; Center for Artificial Intelligence (U-CoE AI-VLB), Bandung Institute of Technology, Bandung, 40132, Indonesia
| | - Meredita Susanty
- School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, 40132, Indonesia; Department of Computer Science, Pertamina University, Jakarta, Indonesia
| |
Collapse
|
11
|
Yogadasan N, Doxey AC, Chuong SDX. A Machine Learning Framework Identifies Plastid-Encoded Proteins Harboring C3 and C4 Distinguishing Sequence Information. Genome Biol Evol 2023; 15:evad129. [PMID: 37462292 PMCID: PMC10368328 DOI: 10.1093/gbe/evad129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/12/2023] [Indexed: 07/27/2023] Open
Abstract
C4 photosynthesis is known to have at least 61 independent origins across plant lineages making it one of the most notable examples of convergent evolution. Of the >60 independent origins, a predicted 22-24 origins, encompassing greater than 50% of all known C4 species, exist within the Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae (PACMAD) clade of the Poaceae family. This clade is therefore primed with species ideal for the study of genomic changes associated with the acquisition of the C4 photosynthetic trait. In this study, we take advantage of the growing availability of sequenced plastid genomes and employ a machine learning (ML) approach to screen for plastid genes harboring C3 and C4 distinguishing information in PACMAD species. We demonstrate that certain plastid-encoded protein sequences possess distinguishing and informative sequence information that allows them to train accurate ML C3/C4 classification models. Our RbcL-trained model, for example, informs a C3/C4 classifier with greater than 99% accuracy. Accurate prediction of photosynthetic type from individual sequences suggests biologically relevant, and potentially differing roles of these sequence products in C3 versus C4 metabolism. With this ML framework, we have identified several key sequences and sites that are most predictive of C3/C4 status, including RbcL, subunits of the NAD(P)H dehydrogenase complex, and specific residues within, further highlighting their potential significance in the evolution and/or maintenance of C4 photosynthetic machinery. This general approach can be applied to uncover intricate associations between other similar genotype-phenotype relationships.
Collapse
Affiliation(s)
| | - Andrew C Doxey
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
| | - Simon D X Chuong
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
| |
Collapse
|
12
|
Shang J, Peng C, Tang X, Sun Y. PhaVIP: Phage VIrion Protein classification based on chaos game representation and Vision Transformer. Bioinformatics 2023; 39:i30-i39. [PMID: 37387136 DOI: 10.1093/bioinformatics/btad229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages in different microbiomes with low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, the structural proteins, such as major tail, baseplate, etc. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is a great demand to develop a computational method for fast and accurate phage virion protein (PVP) classification. RESULTS In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence "images". Our method, PhaVIP, has two main functions: classifying PVP and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets with increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins. AVAILABILITY AND IMPLEMENTATION The web server of PhaVIP is available via: https://phage.ee.cityu.edu.hk/phavip. The source code of PhaVIP is available via: https://github.com/KennthShang/PhaVIP.
Collapse
Affiliation(s)
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Cheng Peng
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| |
Collapse
|
13
|
Ali F, Alghamdi W, Almagrabi AO, Alghushairy O, Banjar A, Khalid M. Deep-AGP: Prediction of angiogenic protein by integrating two-dimensional convolutional neural network with discrete cosine transform. Int J Biol Macromol 2023; 243:125296. [PMID: 37301349 DOI: 10.1016/j.ijbiomac.2023.125296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2023] [Revised: 06/05/2023] [Accepted: 06/07/2023] [Indexed: 06/12/2023]
Abstract
Angiogenic proteins (AGPs) play a primary role in the formation of new blood vessels from pre-existing ones. AGPs have diverse applications in cancer, including serving as biomarkers, guiding anti-angiogenic therapies, and aiding in tumor imaging. Understanding the role of AGPs in cardiovascular and neurodegenerative diseases is vital for developing new diagnostic tools and therapeutic approaches. Considering the significance of AGPs, in this research, we first time established a computational model using deep learning for identifying AGPs. First, we constructed a sequence-based dataset. Second, we explored features by designing a novel feature encoder, called position-specific scoring matrix-decomposition-discrete cosine transform (PSSM-DC-DCT) and existing descriptors including Dipeptide Deviation from Expected Mean (DDE) and bigram-position-specific scoring matrix (Bi-PSSM). Third, each feature set is fed into two-dimensional convolutional neural network (2D-CNN) and machine learning classifiers. Finally, the performance of each learning model is validated by 10-fold cross-validation (CV). The experimental results demonstrate that 2D-CNN with proposed novel feature descriptor achieved the highest success rate on both training and testing datasets. In addition to being an accurate predictor for identification of angiogenic proteins, our proposed method (Deep-AGP) might be fruitful in understanding cancer, cardiovascular, and neurodegenerative diseases, development of their novel therapeutic methods and drug designing.
Collapse
Affiliation(s)
- Farman Ali
- Sarhad University of Science and Information Technology Peshawar, Mardan Campus, Pakistan.
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Alaa Omran Almagrabi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia.
| | - Omar Alghushairy
- Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Ameen Banjar
- Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Majdi Khalid
- Department of Computer Science, College of Computers and Information Systems, Umm Al-Qura University, Makkah 21955, Saudi Arabia
| |
Collapse
|
14
|
Ali F, Kumar H, Alghamdi W, Kateb FA, Alarfaj FK. Recent Advances in Machine Learning-Based Models for Prediction of Antiviral Peptides. ARCHIVES OF COMPUTATIONAL METHODS IN ENGINEERING : STATE OF THE ART REVIEWS 2023; 30:1-12. [PMID: 37359746 PMCID: PMC10148704 DOI: 10.1007/s11831-023-09933-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/05/2023] [Accepted: 04/19/2023] [Indexed: 06/28/2023]
Abstract
Viruses have killed and infected millions of people across the world. It causes several chronic diseases like COVID-19, HIV, and hepatitis. To cope with such diseases and virus infections, antiviral peptides (AVPs) have been applied in the design of drugs. Keeping in view the significant role in pharmaceutical industry and other research fields, identification of AVPs is highly indispensable. In this connection, experimental and computational methods were proposed to identify AVPs. However, more accurate predictors for boosting AVPs identification are highly desirable. This work presents a thorough study and reports the available predictors of AVPs. We explained applied datasets, feature representation approaches, classification algorithms, and evaluation parameters of performance. In this study, the limitations of the existing studies and the best methods were emphasized. Provided the pros and cons of the applied classifiers. The future insights demonstrate efficient feature encoding approaches, best feature optimization schemes, and effective classification techniques that can improve the performance of novel method for accurate prediction of AVPs.
Collapse
Affiliation(s)
- Farman Ali
- Sarhad University of Science and Information Technology Peshawar, Mardan Campus, Khyber Pakhtunkhwa, Pakistan
| | - Harish Kumar
- Department of Computer Science, College of Computer Science, King Khalid University, Abha, Saudi Arabia
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 21589 Saudi Arabia
| | - Faris A. Kateb
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 21589 Saudi Arabia
| | - Fawaz Khaled Alarfaj
- Department of Management Information Systems, King Faisal University, Hufof, Saudi Arabia
| |
Collapse
|
15
|
He Z, Zheng D, Wang H. Accurate few-shot object counting with Hough matching feature enhancement. Front Comput Neurosci 2023; 17:1145219. [PMID: 37065544 PMCID: PMC10098187 DOI: 10.3389/fncom.2023.1145219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2023] [Accepted: 03/02/2023] [Indexed: 04/03/2023] Open
Abstract
IntroductionGiven some exemplars, few-shot object counting aims to count the corresponding class objects in query images. However, when there are many target objects or background interference in the query image, some target objects may have occlusion and overlap, which causes a decrease in counting accuracy.MethodsTo overcome the problem, we propose a novel Hough matching feature enhancement network. First, we extract the image feature with a fixed convolutional network and refine it through local self-attention. And we design an exemplar feature aggregation module to enhance the commonality of the exemplar feature. Then, we build a Hough space to vote for candidate object regions. The Hough matching outputs reliable similarity maps between exemplars and the query image. Finally, we augment the query feature with exemplar features according to the similarity maps, and we use a cascade structure to further enhance the query feature.ResultsExperiment results on FSC-147 show that our network performs best compared to the existing methods, and the mean absolute counting error on the test set improves from 14.32 to 12.74.DiscussionAblation experiments demonstrate that Hough matching helps to achieve more accurate counting compared with previous matching methods.
Collapse
Affiliation(s)
- Zhiquan He
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen, China
- Guangdong Multimedia Information Service Engineering Technology Research Center, Shenzhen University, Shenzhen, China
- *Correspondence: Zhiquan He
| | - Donghong Zheng
- Guangdong Multimedia Information Service Engineering Technology Research Center, Shenzhen University, Shenzhen, China
| | - Hengyou Wang
- School of Science, Beijing University of Civil Engineering and Architecture, Beijing, China
| |
Collapse
|
16
|
Bajiya N, Dhall A, Aggarwal S, Raghava GPS. Advances in the field of phage-based therapy with special emphasis on computational resources. Brief Bioinform 2023; 24:6961791. [PMID: 36575815 DOI: 10.1093/bib/bbac574] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2022] [Revised: 11/07/2022] [Accepted: 11/25/2022] [Indexed: 12/29/2022] Open
Abstract
In the current era, one of the major challenges is to manage the treatment of drug/antibiotic-resistant strains of bacteria. Phage therapy, a century-old technique, may serve as an alternative to antibiotics in treating bacterial infections caused by drug-resistant strains of bacteria. In this review, a systematic attempt has been made to summarize phage-based therapy in depth. This review has been divided into the following two sections: general information and computer-aided phage therapy (CAPT). In the case of general information, we cover the history of phage therapy, the mechanism of action, the status of phage-based products (approved and clinical trials) and the challenges. This review emphasizes CAPT, where we have covered primary phage-associated resources, phage prediction methods and pipelines. This review covers a wide range of databases and resources, including viral genomes and proteins, phage receptors, host genomes of phages, phage-host interactions and lytic proteins. In the post-genomic era, identifying the most suitable phage for lysing a drug-resistant strain of bacterium is crucial for developing alternate treatments for drug-resistant bacteria and this remains a challenging problem. Thus, we compile all phage-associated prediction methods that include the prediction of phages for a bacterial strain, the host for a phage and the identification of interacting phage-host pairs. Most of these methods have been developed using machine learning and deep learning techniques. This review also discussed recent advances in the field of CAPT, where we briefly describe computational tools available for predicting phage virions, the life cycle of phages and prophage identification. Finally, we describe phage-based therapy's advantages, challenges and opportunities.
Collapse
Affiliation(s)
- Nisha Bajiya
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
| | - Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
| | - Suchet Aggarwal
- Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
| |
Collapse
|
17
|
Hayat M, Tahir M, Alarfaj FK, Alturki R, Gazzawe F. NLP-BCH-Ens: NLP-based intelligent computational model for discrimination of malaria parasite. Comput Biol Med 2022; 149:105962. [DOI: 10.1016/j.compbiomed.2022.105962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2022] [Revised: 07/29/2022] [Accepted: 08/13/2022] [Indexed: 11/03/2022]
|
18
|
Arif M, Kabir M, Ahmed S, Khan A, Ge F, Khelifi A, Yu DJ. DeepCPPred: A Deep Learning Framework for the Discrimination of Cell-Penetrating Peptides and Their Uptake Efficiencies. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2749-2759. [PMID: 34347603 DOI: 10.1109/tcbb.2021.3102133] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Cell-penetrating peptides (CPPs) are special peptides capable of carrying a variety of bioactive molecules, such as genetic materials, short interfering RNAs and nanoparticles, into cells. Recently, research on CPP has gained substantial interest from researchers, and the biological mechanisms of CPPS have been assessed in the context of safe drug delivery agents and therapeutic applications. Correct identification and synthesis of CPPs using traditional biochemical methods is an extremely slow, expensive and laborious task particularly due to the large volume of unannotated peptide sequences accumulating in the World Bank repository. Hence, a powerful bioinformatics predictor that rapidly identifies CPPs with a high recognition rate is urgently needed. To date, numerous computational methods have been developed for CPP prediction. However, the available machine-learning (ML) tools are unable to distinguish both the CPPs and their uptake efficiencies. This study aimed to develop a two-layer deep learning framework named DeepCPPred to identify both CPPs in the first phase and peptide uptake efficiency in the second phase. The DeepCPPred predictor first uses four types of descriptors that cover evolutionary, energy estimation, reduced sequence and amino-acid contact information. Then, the extracted features are optimized through the elastic net algorithm and fed into a cascade deep forest algorithm to build the final CPP model. The proposed method achieved 99.45 percent overall accuracy with the CPP924 benchmark dataset in the first layer and 95.43 percent accuracy in the second layer with the CPPSite3 dataset using a 5-fold cross-validation test. Thus, our proposed bioinformatics tool surpassed all the existing state-of-the-art sequence-based CPP approaches.
Collapse
|
19
|
Fang Z, Feng T, Zhou H, Chen M. DeePVP: Identification and classification of phage virion proteins using deep learning. Gigascience 2022; 11:6661052. [PMID: 35950840 PMCID: PMC9366990 DOI: 10.1093/gigascience/giac076] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 06/08/2022] [Accepted: 07/11/2022] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Many biological properties of phages are determined by phage virion proteins (PVPs), and the poor annotation of PVPs is a bottleneck for many areas of viral research, such as viral phylogenetic analysis, viral host identification, and antibacterial drug design. Because of the high diversity of PVP sequences, the PVP annotation of a phage genome remains a particularly challenging bioinformatic task. FINDINGS Based on deep learning, we developed DeePVP. The main module of DeePVP aims to discriminate PVPs from non-PVPs within a phage genome, while the extended module of DeePVP can further classify predicted PVPs into the 10 major classes of PVPs. Compared with the present state-of-the-art tools, the main module of DeePVP performs better, with a 9.05% higher F1-score in the PVP identification task. Moreover, the overall accuracy of the extended module of DeePVP in the PVP classification task is approximately 3.72% higher than that of PhANNs. Two application cases show that the predictions of DeePVP are more reliable and can better reveal the compact PVP-enriched region than the current state-of-the-art tools. Particularly, in the Escherichia phage phiEC1 genome, a novel PVP-enriched region that is conserved in many other Escherichia phage genomes was identified, indicating that DeePVP will be a useful tool for the analysis of phage genomic structures. CONCLUSIONS DeePVP outperforms state-of-the-art tools. The program is optimized in both a virtual machine with graphical user interface and a docker so that the tool can be easily run by noncomputer professionals. DeePVP is freely available at https://github.com/fangzcbio/DeePVP/.
Collapse
Affiliation(s)
- Zhencheng Fang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Tao Feng
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Hongwei Zhou
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| | - Muxuan Chen
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
| |
Collapse
|
20
|
Charoenkwan P, Schaduangrat N, Hasan MM, Moni MA, Lió P, Shoombuatong W. Empirical comparison and analysis of machine learning-based predictors for predicting and analyzing of thermophilic proteins. EXCLI JOURNAL 2022; 21:554-570. [PMID: 35651661 PMCID: PMC9150013 DOI: 10.17179/excli2022-4723] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/28/2022] [Accepted: 02/21/2022] [Indexed: 12/15/2022]
Abstract
Thermophilic proteins (TPPs) are critical for basic research and in the food industry due to their ability to maintain a thermodynamically stable fold at extremely high temperatures. Thus, the expeditious identification of novel TPPs through computational models from protein sequences is very desirable. Over the last few decades, a number of computational methods, especially machine learning (ML)-based methods, for in silico prediction of TPPs have been developed. Therefore, it is desirable to revisit these methods and summarize their advantages and disadvantages in order to further develop new computational approaches to achieve more accurate and improved prediction of TPPs. With this goal in mind, we comprehensively investigate a large collection of fourteen state-of-the-art TPP predictors in terms of their dataset size, feature encoding schemes, feature selection strategies, ML algorithms, evaluation strategies and web server/software usability. To the best of our knowledge, this article represents the first comprehensive review on the development of ML-based methods for in silico prediction of TPPs. Among these TPP predictors, they can be classified into two groups according to the interpretability of ML algorithms employed (i.e., computational black-box methods and computational white-box methods). In order to perform the comparative analysis, we conducted a comparative study on several currently available TPP predictors based on two benchmark datasets. Finally, we provide future perspectives for the design and development of new computational models for TPP prediction. We hope that this comprehensive review will facilitate researchers in selecting an appropriate TPP predictor that is the most suitable one to deal with their purposes and provide useful perspectives for the development of more effective and accurate TPP predictors.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
| | - Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Mohammad Ali Moni
- School of Health and Rehabilitation Sciences, Faculty of Health and Behavioural Sciences, the University of Queensland, St Lucia, QLD 4072, Australia
| | - Pietro Lió
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| |
Collapse
|
21
|
Ali F, Kumar H, Patil S, Kotecha K, Banjar A, Daud A. Target-DBPPred: An intelligent model for prediction of DNA-binding proteins using discrete wavelet transform based compression and light eXtreme gradient boosting. Comput Biol Med 2022; 145:105533. [PMID: 35447463 DOI: 10.1016/j.compbiomed.2022.105533] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2022] [Revised: 04/11/2022] [Accepted: 04/13/2022] [Indexed: 11/03/2022]
Abstract
DNA-protein interaction is a critical biological process that performs influential activities, including DNA transcription and recombination. DBPs (DNA-binding proteins) are closely associated with different kinds of human diseases (asthma, cancer, and AIDS), while some of the DBPs are used in the production of antibiotics, steroids, and anti-inflammatories. Several methods have been reported for the prediction of DBPs. However, a more intelligent method is still highly desirable for the accurate prediction of DBPs. This study presents an intelligent computational method, Target-DBPPred, to improve DBPs prediction. Important features from primary protein sequences are investigated via a novel feature descriptor, called EDF-PSSM-DWT (Evolutionary difference formula position-specific scoring matrix-discrete wavelet transform) and several other multi-evolutionary methods, including F-PSSM (Filtered position-specific scoring matrix), EDF-PSSM (Evolutionary difference formula position-specific scoring matrix), PSSM-DPC (Position-specific scoring matrix-dipeptide composition), and Lead-BiPSSM (Lead-bigram-position specific scoring matrix) to encapsulate diverse multivariate features. The best feature set from the features of each descriptor is selected using sequential forward selection (SFS). Further, four models are trained using Adaboost, XGB (eXtreme gradient boosting), ERT (extremely randomized trees), and LiXGB (Light eXtreme gradient boosting) classifiers. LiXGB, with the best feature set of EDF-PSSM-DWT, has attained 6.69% and 15.07% higher performance in terms of accuracies using training and testing datasets, respectively. The obtained results verify the improved performance of our proposed predictor over the existing predictors.
Collapse
Affiliation(s)
- Farman Ali
- Department of Elementary and Secondary Education, Peshawar, Khyber Pakhtunkhwa, Pakistan; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Harish Kumar
- Department of Computer Science, College of Computer Science, King Khalid University, Abha, Saudi Arabia
| | - Shruti Patil
- Symbiosis Institute of Technology, Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International University, Pune, India
| | - Ketan Kotecha
- Symbiosis Institute of Technology, Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International University, Pune, India.
| | - Ameen Banjar
- Department of Information Systems, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Ali Daud
- Key Laboratory of Oceanographic Big Data Mining & Application of Zhejiang Province, School of Information Engineering, Zhejiang Ocean University, Zhoushan, 316022, China; Department of Computer Science and Artificial Intelligence, University of Jeddah, Jeddah, Saudi Arabia.
| |
Collapse
|
22
|
Sikander R, Ghulam A, Ali F. XGB-DrugPred: computational prediction of druggable proteins using eXtreme gradient boosting and optimized features set. Sci Rep 2022; 12:5505. [PMID: 35365726 PMCID: PMC8976041 DOI: 10.1038/s41598-022-09484-3] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2021] [Accepted: 03/07/2022] [Indexed: 11/19/2022] Open
Abstract
Accurate identification of drug-targets in human body has great significance for designing novel drugs. Compared with traditional experimental methods, prediction of drug-targets via machine learning algorithms has enhanced the attention of many researchers due to fast and accurate prediction. In this study, we propose a machine learning-based method, namely XGB-DrugPred for accurate prediction of druggable proteins. The features from primary protein sequences are extracted by group dipeptide composition, reduced amino acid alphabet, and novel encoder pseudo amino acid composition segmentation. To select the best feature set, eXtreme Gradient Boosting-recursive feature elimination is implemented. The best feature set is provided to eXtreme Gradient Boosting (XGB), Random Forest, and Extremely Randomized Tree classifiers for model training and prediction. The performance of these classifiers is evaluated by tenfold cross-validation. The empirical results show that XGB-based predictor achieves the best results compared with other classifiers and existing methods in the literature.
Collapse
Affiliation(s)
- Rahu Sikander
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, China.
| | - Ali Ghulam
- Computerization and Network Section, Sindh Agriculture University, Tandojam, Pakistan
| | - Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
| |
Collapse
|
23
|
Belei C, Joeressen J, Amancio-Filho ST. Fused-Filament Fabrication of Short Carbon Fiber-Reinforced Polyamide: Parameter Optimization for Improved Performance under Uniaxial Tensile Loading. Polymers (Basel) 2022; 14:polym14071292. [PMID: 35406166 PMCID: PMC9002508 DOI: 10.3390/polym14071292] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2022] [Revised: 03/18/2022] [Accepted: 03/21/2022] [Indexed: 02/05/2023] Open
Abstract
This study intends to contribute to the state of the art of Fused-Filament Fabrication (FFF) of short-fiber-reinforced polyamides by optimizing process parameters to improve the performance of printed parts under uniaxial tensile loading. This was performed using two different approaches: a more traditional 2k full factorial design of experiments (DoE) and multiple polynomial regression using an algorithm implementing machine learning (ML) principles such as train-test split and cross-validation. Evaluated parameters included extrusion and printing bed temperatures, layer height and printing speed. It was concluded that when exposed to new observations, the ML-based model predicted the response with higher accuracy. However, the DoE fared slightly better at predicting observations where higher response values were expected, including the optimal solution, which reached an UTS of 117.1 ± 5.7 MPa. Moreover, there was an important correlation between process parameters and the response. Layer height and printing bed temperatures were considered the most influential parameters, while extrusion temperature and printing speed had a lower influence on the outcome. The general influence of parameters on the response was correlated with the degree of interlayer cohesion, which in turn affected the mechanical performance of the 3D-printed specimens.
Collapse
|
24
|
Ahmad S, Charoenkwan P, Quinn JMW, Moni MA, Hasan MM, Lio' P, Shoombuatong W. SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins. Sci Rep 2022; 12:4106. [PMID: 35260777 PMCID: PMC8904530 DOI: 10.1038/s41598-022-08173-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2022] [Accepted: 03/03/2022] [Indexed: 12/30/2022] Open
Abstract
Fast and accurate identification of phage virion proteins (PVPs) would greatly aid facilitation of antibacterial drug discovery and development. Although, several research efforts based on machine learning (ML) methods have been made for in silico identification of PVPs, these methods have certain limitations. Therefore, in this study, we propose a new computational approach, termed SCORPION, (StaCking-based Predictior fOR Phage VIrion PrOteiNs), to accurately identify PVPs using only protein primary sequences. Specifically, we explored comprehensive 13 different feature descriptors from different aspects (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) with 10 popular ML algorithms to construct a pool of optimal baseline models. These optimal baseline models were then used to generate probabilistic features (PFs) and considered as a new feature vector. Finally, we utilized a two-step feature selection strategy to determine the optimal PF feature vector and used this feature vector to develop a stacked model (SCORPION). Both tenfold cross-validation and independent test results indicate that SCORPION achieves superior predictive performance than its constitute baseline models and existing methods. We anticipate SCORPION will serve as a useful tool for the cost-effective and large-scale screening of new PVPs. The source codes and datasets for this work are available for downloading in the GitHub repository (https://github.com/saeed344/SCORPION).
Collapse
Affiliation(s)
- Saeed Ahmad
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
| | - Julian M W Quinn
- Bone Biology Division, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, NSW, 2010, Australia
| | - Mohammad Ali Moni
- Faculty of Health and Behavioural Sciences, School of Health and Rehabilitation Sciences, The University of Queensland, St Lucia, QLD, 4072, Australia
| | - Md Mehedi Hasan
- Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
| | - Pietro Lio'
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
| |
Collapse
|
25
|
Kabir M, Nantasenamat C, Kanthawong S, Charoenkwan P, Shoombuatong W. Large-scale comparative review and assessment of computational methods for phage virion proteins identification. EXCLI JOURNAL 2022; 21:11-29. [PMID: 35145365 PMCID: PMC8822302 DOI: 10.17179/excli2021-4411] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Accepted: 11/29/2021] [Indexed: 12/11/2022]
Abstract
Phage virion proteins (PVPs) are effective at recognizing and binding to host cell receptors while having no deleterious effects on human or animal cells. Understanding their functional mechanisms is regarded as a critical goal that will aid in rational antibacterial drug discovery and development. Although high-throughput experimental methods for identifying PVPs are considered the gold standard for exploring crucial PVP features, these procedures are frequently time-consuming and labor-intensive. Thusfar, more than ten sequence-based predictors have been established for the in silico identification of PVPs in conjunction with traditional experimental approaches. As a result, a revised and more thorough assessment is extremely desirable. With this purpose in mind, we first conduct a thorough survey and evaluation of a vast array of 13 state-of-the-art PVP predictors. Among these PVP predictors, they can be classified into three groups according to the types of machine learning (ML) algorithms employed (i.e. traditional ML-based methods, ensemble-based methods and deep learning-based methods). Subsequently, we explored which factors are important for building more accurate and stable predictors and this included training/independent datasets, feature encoding algorithms, feature selection methods, core algorithms, performance evaluation metrics/strategies and web servers. Finally, we provide insights and future perspectives for the design and development of new and more effective computational approaches for the detection and characterization of PVPs.
Collapse
Affiliation(s)
- Muhammad Kabir
- School of Systems and Technology, Department of Computer Science, University of Management and Technology, Lahore, Pakistan, 54770
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| | - Sakawrat Kanthawong
- Department of Microbiology, Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand, 40002
| | - Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
| |
Collapse
|
26
|
Zou H. Identifying blood‐brain barrier peptides by using amino acids physicochemical properties and features fusion method. Pept Sci (Hoboken) 2021. [DOI: 10.1002/pep2.24247] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics Jiangxi Science and Technology Normal University Nanchang China
| |
Collapse
|
27
|
Zhang S, Shi H. iR5hmcSC: Identifying RNA 5-hydroxymethylcytosine with multiple features based on stacking learning. Comput Biol Chem 2021; 95:107583. [PMID: 34562726 DOI: 10.1016/j.compbiolchem.2021.107583] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2021] [Revised: 09/02/2021] [Accepted: 09/12/2021] [Indexed: 01/27/2023]
Abstract
RNA 5-hydroxymethylcytosine (5hmC) modification is the basis of the translation of genetic information and the biological evolution. The study of its distribution in transcriptome is fundamentally crucial to reveal the biological significance of 5hmC. Biochemical experiments can use a variety of sequencing-based technologies to achieve high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, as well as expensive. Therefore, it is urgent to develop more effective and feasible computational methods. In this paper, a novel and powerful model called iR5hmcSC is designed for identifying 5hmC. Firstly, we extract the different features by K-mer, Pseudo Structure Status Composition and One-Hot encoding. Subsequently, the combination of chi-square test and logistic regression is utilized as the feature selection method to select the optimal feature sets. And then stacking learning, an ensemble learning method including random forest (RF), extra trees (EX), AdaBoost (Ada), gradient boosting decision tree (GBDT), and support vector machine (SVM), is used to recognize 5hmC and non-5hmC. Finally, 10-fold cross-validation test is performed to evaluate the model. The accuracy reaches 85.27% and 79.92% on benchmark dataset and independent dataset, respectively. The result is better than the state-of-the-art methods, which indicates that our model is a feasible tool to identify 5hmC. The datasets and source code are freely available at https://github.com/HongyanShi026/iR5hmcSC.
Collapse
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China.
| | - Hongyan Shi
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, PR China
| |
Collapse
|
28
|
iPVP-MCV: A Multi-Classifier Voting Model for the Accurate Identification of Phage Virion Proteins. Symmetry (Basel) 2021. [DOI: 10.3390/sym13081506] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
The classic structure of a bacteriophage is commonly characterized by complex symmetry. The head of the structure features icosahedral symmetry, whereas the tail features helical symmetry. The phage virion protein (PVP), a type of bacteriophage structural protein, is an essential material of the infectious viral particles and is responsible for multiple biological functions. Accurate identification of PVPs is of great significance for comprehending the interaction between phages and host bacteria and developing new antimicrobial drugs or antibiotics. However, traditional experimental approaches for identifying PVPs are often time-consuming and laborious. Therefore, the development of computational methods that can efficiently and accurately identify PVPs is desired. In this study, we proposed a multi-classifier voting model called iPVP-MCV to enhance the predictive performance of PVPs based on their amino acid sequences. First, three types of evolutionary features were extracted from the position-specific scoring matrix (PSSM) profiles to represent PVPs and non-PVPs. Then, a set of baseline models were trained based on the support vector machine (SVM) algorithm combined with each type of feature descriptors. Finally, the outputs of these baseline models were integrated to construct the proposed method iPVP-MCV by using the majority voting strategy. Our results demonstrated that the proposed iPVP-MCV model was superior to existing methods when performing the rigorous independent dataset test.
Collapse
|
29
|
Charoenkwan P, Chiangjong W, Hasan MM, Nantasenamat C, Shoombuatong W. Review and comparative analysis of machine learning-based predictors for predicting and analyzing of anti-angiogenic peptides. Curr Med Chem 2021; 29:849-864. [PMID: 34375178 DOI: 10.2174/0929867328666210810145806] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2021] [Revised: 06/17/2021] [Accepted: 06/22/2021] [Indexed: 11/22/2022]
Abstract
Cancer is one of the leading causes of death worldwide and underlying this is angiogenesis that represents one of the hallmarks of cancer. Ongoing effort is already under way in the discovery of anti-angiogenic peptides (AAPs) as a promising therapeutic route by tackling the formation of new blood vessels. As such, the identification of AAPs constitutes a viable path for understanding their mechanistic properties pertinent for the discovery of new anti-cancer drugs. In spite of the abundance of peptide sequences in public databases, experimental efforts in the identification of anti-angiogenic peptides have progressed very slowly owing to its high expenditures and laborious nature. Owing to its inherent ability to make sense of large volumes of data, machine learning (ML) represents a lucrative technique that can be harnessed for peptide-based drug discovery. In this review, we conducted a comprehensive and comparative analysis of ML-based AAP predictors in terms of their employed feature descriptors, ML algorithms, cross-validation methods and prediction performance. Moreover, the common framework of these AAP predictors and their inherent weaknesses are also discussed. Particularly, we explore future perspectives for improving the prediction accuracy and model interpretability, which represents an interesting avenue for overcoming some of the inherent weaknesses of existing AAP predictors. We anticipate that this review would assist researchers in the rapid screening and identification of promising AAPs for clinical use.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand
| | - Wararat Chiangjong
- Pediatric Translational Research Unit, Department of Pediatrics, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok 10400, Thailand
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, United States
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
| |
Collapse
|
30
|
Barukab O, Ali F, Khan SA. DBP-GAPred: An intelligent method for prediction of DNA-binding proteins types by enhanced evolutionary profile features with ensemble learning. J Bioinform Comput Biol 2021; 19:2150018. [PMID: 34291709 DOI: 10.1142/s0219720021500189] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
DNA-binding proteins (DBPs) perform an influential role in diverse biological activities like DNA replication, slicing, repair, and transcription. Some DBPs are indispensable for understanding many types of human cancers (i.e. lung, breast, and liver cancer) and chronic diseases (i.e. AIDS/HIV, asthma), while other kinds are involved in antibiotics, steroids, and anti-inflammatory drugs designing. These crucial processes are closely related to DBPs types. DBPs are categorized into single-stranded DNA-binding proteins (ssDBPs) and double-stranded DNA-binding proteins (dsDBPs). Few computational predictors have been reported for discriminating ssDBPs and dsDBPs. However, due to the limitations of the existing methods, an intelligent computational system is still highly desirable. In this work, features from protein sequences are discovered by extending the notion of dipeptide composition (DPC), evolutionary difference formula (EDF), and K-separated bigram (KSB) into the position-specific scoring matrix (PSSM). The highly intrinsic information was encoded by a compression approach named discrete cosine transform (DCT) and the model was trained with support vector machine (SVM). The prediction performance was further boosted by the genetic algorithm (GA) ensemble strategy. The novel predictor (DBP-GAPred) acquired 1.89%, 0.28%, and 6.63% higher accuracies on jackknife, 10-fold, and independent dataset tests, respectively than the best predictor. These outcomes confirm the superiority of our method over the existing predictors.
Collapse
Affiliation(s)
- Omar Barukab
- Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh 21911 Jeddah, Saudi Arabia
| | - Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, P. R. China
| | - Sher Afzal Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| |
Collapse
|
31
|
Nami Y, Imeni N, Panahi B. Application of machine learning in bacteriophage research. BMC Microbiol 2021; 21:193. [PMID: 34174831 PMCID: PMC8235560 DOI: 10.1186/s12866-021-02256-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2021] [Accepted: 06/08/2021] [Indexed: 12/20/2022] Open
Abstract
Phages are one of the key components in the structure, dynamics, and interactions of microbial communities in different bins. It has a clear impact on human health and the food industry. Bacteriophage characterization using in vitro approaches are time/cost consuming and laborious tasks. On the other hand, with the advent of new high-throughput sequencing technology, the development of a powerful computational framework to characterize the newly identified bacteriophages is inevitable for future research. Machine learning includes powerful techniques that enable the analysis of complex datasets for knowledge discovery and pattern recognition. In this study, we have conducted a comprehensive review of machine learning methods application using different types of features were applied in various aspects of bacteriophage research including, automated curation, identification, classification, host species recognition, virion protein identification, and life cycle prediction. Moreover, potential limitations and advantages of the developed frameworks were discussed.
Collapse
Affiliation(s)
- Yousef Nami
- Department of Food Biotechnology, Branch for Northwest & West Region, Agricultural Biotechnology Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran
| | - Nazila Imeni
- Young Researchers and Elite Clube, Marand Branch, Islamic Azad University, Marand, Iran
| | - Bahman Panahi
- Department of Genomics, Branch for Northwest & West Region, Agricultural Biotechnology Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran.
| |
Collapse
|
32
|
Khan ZU, Pi D. DeepSSPred: A Deep Learning Based Sulfenylation Site Predictor Via a Novel nSegmented Optimize Federated Feature Encoder. Protein Pept Lett 2021; 28:708-721. [PMID: 33267753 DOI: 10.2174/0929866527666201202103411] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2020] [Revised: 10/14/2020] [Accepted: 10/18/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND S-sulfenylation (S-sulphenylation, or sulfenic acid) proteins, are special kinds of post-translation modification, which plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Despite these aforementioned significances, and by complementing existing wet methods, several computational models have been developed for sulfenylation cysteine sites prediction. However, the performance of these models was not satisfactory due to inefficient feature schemes, severe imbalance issues, and lack of an intelligent learning engine. OBJECTIVE In this study, our motivation is to establish a strong and novel computational predictor for discrimination of sulfenylation and non-sulfenylation sites. METHODS In this study, we report an innovative bioinformatics feature encoding tool, named DeepSSPred, in which, resulting encoded features is obtained via nSegmented hybrid feature, and then the resampling technique called synthetic minority oversampling was employed to cope with the severe imbalance issue between SC-sites (minority class) and non-SC sites (majority class). State of the art 2D-Convolutional Neural Network was employed over rigorous 10-fold jackknife cross-validation technique for model validation and authentication. RESULTS Following the proposed framework, with a strong discrete presentation of feature space, machine learning engine, and unbiased presentation of the underline training data yielded into an excellent model that outperforms with all existing established studies. The proposed approach is 6% higher in terms of MCC from the first best. On an independent dataset, the existing first best study failed to provide sufficient details. The model obtained an increase of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp and 13.12% in MCC on the training data and12.13% of ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset in comparison with 2nd best method. These empirical analyses show the superlative performance of the proposed model over both training and Independent dataset in comparison with existing literature studies. CONCLUSION In this research, we have developed a novel sequence-based automated predictor for SC-sites, called DeepSSPred. The empirical simulations outcomes with a training dataset and independent validation dataset have revealed the efficacy of the proposed theoretical model. The good performance of DeepSSPred is due to several reasons, such as novel discriminative feature encoding schemes, SMOTE technique, and careful construction of the prediction model through the tuned 2D-CNN classifier. We believe that our research work will provide a potential insight into a further prediction of S-sulfenylation characteristics and functionalities. Thus, we hope that our developed predictor will significantly helpful for large scale discrimination of unknown SC-sites in particular and designing new pharmaceutical drugs in general.
Collapse
Affiliation(s)
- Zaheer Ullah Khan
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
| | - Dechang Pi
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
| |
Collapse
|
33
|
Component Parts of Bacteriophage Virions Accurately Defined by a Machine-Learning Approach Built on Evolutionary Features. mSystems 2021; 6:e0024221. [PMID: 34042467 PMCID: PMC8269216 DOI: 10.1128/msystems.00242-21] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Antimicrobial resistance (AMR) continues to evolve as a major threat to human health, and new strategies are required for the treatment of AMR infections. Bacteriophages (phages) that kill bacterial pathogens are being identified for use in phage therapies, with the intention to apply these bactericidal viruses directly into the infection sites in bespoke phage cocktails. Despite the great unsampled phage diversity for this purpose, an issue hampering the roll out of phage therapy is the poor quality annotation of many of the phage genomes, particularly for those from infrequently sampled environmental sources. We developed a computational tool called STEP3 to use the “evolutionary features” that can be recognized in genome sequences of diverse phages. These features, when integrated into an ensemble framework, achieved a stable and robust prediction performance when benchmarked against other prediction tools using phages from diverse sources. Validation of the prediction accuracy of STEP3 was conducted with high-resolution mass spectrometry analysis of two novel phages, isolated from a watercourse in the Southern Hemisphere. STEP3 provides a robust computational approach to distinguish specific and universal features in phages to improve the quality of phage cocktails and is available for use at http://step3.erc.monash.edu/. IMPORTANCE In response to the global problem of antimicrobial resistance, there are moves to use bacteriophages (phages) as therapeutic agents. Selecting which phages will be effective therapeutics relies on interpreting features contributing to shelf-life and applicability to diagnosed infections. However, the protein components of the phage virions that dictate these properties vary so much in sequence that best estimates suggest failure to recognize up to 90% of them. We have utilized this diversity in evolutionary features as an advantage, to apply machine learning for prediction accuracy for diverse components in phage virions. We benchmark this new tool showing the accurate recognition and evaluation of phage component parts using genome sequence data of phages from undersampled environments, where the richest diversity of phage still lies.
Collapse
|
34
|
Fang Z, Zhou H. VirionFinder: Identification of Complete and Partial Prokaryote Virus Virion Protein From Virome Data Using the Sequence and Biochemical Properties of Amino Acids. Front Microbiol 2021; 12:615711. [PMID: 33613485 PMCID: PMC7894196 DOI: 10.3389/fmicb.2021.615711] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2020] [Accepted: 01/04/2021] [Indexed: 01/22/2023] Open
Abstract
Viruses are some of the most abundant biological entities on Earth, and prokaryote virus are the dominant members of the viral community. Because of the diversity of prokaryote virus, functional annotation cannot be performed on a large number of genes from newly discovered prokaryote virus by searching the current database; therefore, the development of an alignment-free algorithm for functional annotation of prokaryote virus proteins is important to understand the viral community. The identification of prokaryote virus proteins (PVVPs) is a critical step for many viral analyses, such as species classification, phylogenetic analysis and the exploration of how prokaryote virus interact with their hosts. Although a series of PVVP prediction tools have been developed, the performance of these tools is still not satisfactory. Moreover, viral metagenomic data contains fragmented sequences, leading to the existence of some incomplete genes. Therefore, a tool that can identify partial prokaryote virus proteins is also needed. In this work, we present a novel algorithm, called VirionFinder, to identify the complete and partial PVVPs from non-prokaryote virus virion proteins (non-PVVPs). VirionFinder uses the sequence and biochemical properties of 20 amino acids as the mathematical model to encode the protein sequences and uses a deep learning technique to identify whether a given protein is a PVVP. Compared with the state-of-the-art tools using artificial benchmark datasets, the results show that under the same specificity (Sp), the sensitivity (Sn) of VirionFinder is approximately 10-34% much higher than the Sn of these tools on both complete and partial proteins. When evaluating related tools using real virome data, the recognition rate of PVVP-like sequences of VirionFinder is also much higher than that of the other tools. We expect that VirionFinder will be a powerful tool for identifying novel virion proteins from both complete prokaryote virus genomes and viral metagenomic data. VirionFinder is freely available at https://github.com/zhenchengfang/VirionFinder.
Collapse
Affiliation(s)
- Zhencheng Fang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- Center for Quantitative Biology, Peking University, Beijing, China
| | - Hongwei Zhou
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
- State Key Laboratory of Organ Failure Research, Southern Medical University, Guangzhou, China
| |
Collapse
|
35
|
Ahmed S, Kabir M, Arif M, Khan ZU, Yu DJ. DeepPPSite: A deep learning-based model for analysis and prediction of phosphorylation sites using efficient sequence information. Anal Biochem 2020; 612:113955. [PMID: 32949607 DOI: 10.1016/j.ab.2020.113955] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2020] [Revised: 08/30/2020] [Accepted: 09/11/2020] [Indexed: 12/29/2022]
Abstract
Phosphorylation is a ubiquitous type of post-translational modification (PTM) that occurs in both eukaryotic and prokaryotic cells where in a phosphate group binds with amino acid residues. These specific residues, i.e., serine (S), threonine (T), and tyrosine (Y), exhibit diverse functions at the molecular level. Recent studies have determined that some diseases such as cancer, diabetes, and neurodegenerative diseases are caused by abnormal phosphorylation. Based on its potential applications in biological research and drug development, the large-scale identification of phosphorylation sites has attracted interest. Existing wet-lab technologies for targeting phosphorylation sites are overpriced and time consuming. Thus, computational algorithms that can efficiently accelerate the annotation of phosphorylation sites from massive protein sequences are needed. Numerous machine learning-based methods have been implemented for phosphorylation sites prediction. However, despite extensive efforts, existing computational approaches continue to have inadequate performance, particularly in terms of overall ACC, MCC, and AUC. In this paper, we report a novel deep learning-based predictor to overcome these performance hurdles, DeepPPSite, which was constructed using a stacked long short-term memory recurrent network for predicting phosphorylation sites. The proposed technique expediently learns the protein representations from conjoint protein descriptors. The experimental results indicated that our model achieved superior performance on the training dataset for S, T and Y, with MCC values of 0.608, 0.602, and 0.558, respectively, using a 10-fold cross-validation test. We further determined the generalization efficacy of the proposed predictor DeepPPSite by conducting a rigorous independent test. The predictive MCC values were 0.358, 0.356, and 0.350 for the S, T, and Y phosphorylation sites, respectively. Rigorous cross-validation and independent validation tests for the three types of phosphorylation sites demonstrated that the designed DeepPPSite tool significantly outperforms state-of-the-art methods.
Collapse
Affiliation(s)
- Saeed Ahmed
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Muhammad Kabir
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Muhammad Arif
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Zaheer Ullah Khan
- School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China.
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| |
Collapse
|
36
|
Meta-iPVP: a sequence-based meta-predictor for improving the prediction of phage virion proteins using effective feature representation. J Comput Aided Mol Des 2020; 34:1105-1116. [DOI: 10.1007/s10822-020-00323-z] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2020] [Accepted: 06/10/2020] [Indexed: 12/11/2022]
|
37
|
Arif M, Ahmad S, Ali F, Fang G, Li M, Yu DJ. TargetCPP: accurate prediction of cell-penetrating peptides from optimized multi-scale features using gradient boost decision tree. J Comput Aided Mol Des 2020; 34:841-856. [PMID: 32180124 DOI: 10.1007/s10822-020-00307-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2019] [Accepted: 03/09/2020] [Indexed: 02/08/2023]
Abstract
Cell-penetrating peptides (CPPs) are short length permeable proteins have emerged as drugs delivery tool of therapeutic agents including genetic materials and macromolecules into cells. Recently, CPP has become a hotspot avenue for life science research and paved a new way of disease treatment without harmful impact on cell viability due to nontoxic characteristic. Therefore, the correct identification of CPPs will provide hints for medical applications. Considering the shortcomings of traditional experimental CPPs identification, it is urgently needed to design intelligent predictor for accurate identification of CPPs for the large scale uncharacterized sequences. We develop a novel computational method, called TargetCPP, to discriminate CPPs from Non-CPPs with improved accuracy. In TargetCPP, first the peptide sequences are formulated with four distinct encoding methods i.e., composite protein sequence representation, composition transition and distribution, split amino acid composition, and information theory features. These dominant feature vectors were fused and applied intelligent minimum redundancy and maximum relevancy feature selection method to choose an optimal subset of features. Finally, the predictive model is learned through different classification algorithms on the optimized features. Among these classifiers, gradient boost decision tree algorithm achieved excellent performance throughout the experiments. Notably, the TargetCPP tool attained high prediction Accuracy of 93.54% and 88.28% using jackknife and independent test, respectively. Empirical outcomes prove the superiority and potency of proposed bioinformatics method over state-of-the-art methods. It is highly anticipated that the outcomes of this study will provide a strong background for large scale prediction of CPPs and instructive guidance in clinical therapy and medical applications.
Collapse
Affiliation(s)
- Muhammad Arif
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Saeed Ahmad
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Ge Fang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Min Li
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| |
Collapse
|
38
|
Charoenkwan P, Kanthawong S, Schaduangrat N, Yana J, Shoombuatong W. PVPred-SCM: Improved Prediction and Analysis of Phage Virion Proteins Using a Scoring Card Method. Cells 2020; 9:E353. [PMID: 32028709 PMCID: PMC7072630 DOI: 10.3390/cells9020353] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2019] [Revised: 01/20/2020] [Accepted: 01/27/2020] [Indexed: 12/16/2022] Open
Abstract
Although, existing methods have been successful in predicting phage (or bacteriophage) virion proteins (PVPs) using various types of protein features and complex classifiers, such as support vector machine and naïve Bayes, these two methods do not allow interpretability. However, the characterization and analysis of PVPs might be of great significance to understanding the molecular mechanisms of bacteriophage genetics and the development of antibacterial drugs. Hence, we herein proposed a novel method (PVPred-SCM) based on the scoring card method (SCM) in conjunction with dipeptide composition to identify and characterize PVPs. In PVPred-SCM, the propensity scores of 400 dipeptides were calculated using the statistical discrimination approach. Rigorous independent validation test showed that PVPred-SCM utilizing only dipeptide composition yielded an accuracy of 77.56%, indicating that PVPred-SCM performed well relative to the state-of-the-art method utilizing a number of protein features. Furthermore, the propensity scores of dipeptides were used to provide insights into the biochemical and biophysical properties of PVPs. Upon comparison, it was found that PVPred-SCM was superior to the existing methods considering its simplicity, interpretability, and implementation. Finally, in an effort to facilitate high-throughput prediction of PVPs, we provided a user-friendly web-server for identifying the likelihood of whether or not these sequences are PVPs. It is anticipated that PVPred-SCM will become a useful tool or at least a complementary existing method for predicting and analyzing PVPs.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand;
| | - Sakawrat Kanthawong
- Department of Microbiology, Faculty of Medicine, Khon Kaen University, Khon Kaen 40002, Thailand;
| | - Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand;
| | - Janchai Yana
- Department of Chemistry, Faculty of Science and Technology, Chiang Mai Rajabhat University, Chiang Mai 50300, Thailand;
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand;
| |
Collapse
|