1
Zhang M, Cui Q, Lü Y, Li W. A feature-aware multimodal framework with auto-fusion for Alzheimer's disease diagnosis. Comput Biol Med 2024; 178:108740. [PMID: 38901184 DOI: 10.1016/j.compbiomed.2024.108740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 05/02/2024] [Accepted: 06/08/2024] [Indexed: 06/22/2024]
Abstract
Alzheimer's disease (AD), one of the most common dementias, has about 4.6 million new cases yearly worldwide. Due to the large number of suspected AD patients, early screening for the disease has become particularly important. Although there are diverse types of AD diagnosis data, such as cognitive tests, images, and risk factors, many prior investigations have concentrated on integrating only high-dimensional features with simple concatenation-based fusion, resulting in less-than-optimal outcomes for AD diagnosis. Therefore, we propose an enhanced multimodal AD diagnostic framework comprising a feature-aware module and an automatic model fusion strategy (AMFS). To preserve correlated and significant features within a low-dimensional space, the feature-aware module employs low-dimensional SHapley Additive exPlanation (SHAP) boosting feature selection as the initial step. Following this analysis, diverse tiers of low-dimensional features are extracted from patients' biological data. In the high-dimensional stage, the feature-aware module integrates cross-modal attention mechanisms to capture subtle relationships among different cognitive domains, neuroimaging modalities, and risk factors. Subsequently, we integrate the feature-aware module with graph convolutional networks (GCN) to handle heterogeneous multimodal AD data while perceiving relationships between different modalities. Lastly, our proposed AMFS autonomously learns optimal parameters for aligning two sub-models. Validation tests on two ADNI datasets show high accuracies of 95.9% and 91.9%, respectively, in AD diagnosis. The methods efficiently select features from multimodal AD data, optimizing model fusion for potential clinical assistance in diagnostics.
Affiliation(s)
- Meiwei Zhang
  - College of Electrical Engineering, Chongqing University, Chongqing, 400030, China
- Qiushi Cui
  - College of Electrical Engineering, Chongqing University, Chongqing, 400030, China
- Yang Lü
  - Department of Geriatrics, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Wenyuan Li
  - College of Electrical Engineering, Chongqing University, Chongqing, 400030, China
2
Matejin S, Gregoric ID, Radovancevic R, Paessler S, Perovic V. Risk stratification and prediction of severity of COVID-19 infection in patients with preexisting cardiovascular disease. Front Microbiol 2024; 15:1422393. [PMID: 39119143 PMCID: PMC11306936 DOI: 10.3389/fmicb.2024.1422393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Accepted: 07/16/2024] [Indexed: 08/10/2024] Open
Abstract
Introduction: Coronavirus disease 2019 (COVID-19) caused by SARS-CoV-2 is a highly contagious viral disease. Cardiovascular diseases and heart failure elevate the risk of mechanical ventilation and fatal outcomes among COVID-19 patients, while COVID-19 itself increases the likelihood of adverse cardiovascular outcomes. Methods: We collected blood samples and clinical data from hospitalized cardiovascular patients with and without proven COVID-19 infection in the time period before the vaccine became available. Statistical correlation analysis and machine learning were used to evaluate and identify individual parameters that could predict the risk of needing mechanical ventilation and patient survival. Results: Our results confirmed that COVID-19 is associated with a severe outcome and identified increased levels of ferritin, fibrinogen, and platelets, as well as decreased levels of albumin, as having a negative impact on patient survival. Additionally, patients on ACE/ARB had a lower chance of dying or needing mechanical ventilation. The machine learning models revealed that ferritin, PCO2, and CRP were the most efficient combination of parameters for predicting survival, while the combination of albumin, fibrinogen, platelets, ALP, AB titer, and D-dimer was the most efficient for predicting the likelihood of requiring mechanical ventilation. Conclusion: We believe that creating an AI-based model that uses these patient parameters to predict the cardiovascular patient's risk of mortality, severe complications, and the need for mechanical ventilation would help healthcare providers with rapid triage and redistribution of medical services, with the goal of improving overall survival. The use of the most effective combination of parameters in our models could advance risk assessment and treatment planning among the general population of cardiovascular patients.
Affiliation(s)
- Stanislava Matejin
  - Department of Advanced Cardiopulmonary Therapies and Transplantation, University of Texas Health Science Center at Houston, Houston, TX, United States
- Igor D. Gregoric
  - Department of Advanced Cardiopulmonary Therapies and Transplantation, University of Texas Health Science Center at Houston, Houston, TX, United States
- Rajko Radovancevic
  - Department of Advanced Cardiopulmonary Therapies and Transplantation, University of Texas Health Science Center at Houston, Houston, TX, United States
- Slobodan Paessler
  - Institute for Human Infections and Immunity, University of Texas Medical Branch, Galveston, TX, United States
- Vladimir Perovic
  - Laboratory of Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences Vinca, National Institute of the Republic of Serbia, University of Belgrade, Belgrade, Serbia
3
Okada N, Nozaki H, Nakamura S, Manjate EPA, Gebretsadik A, Ohtomo Y, Arima T, Kawamura Y. Optimizing multi-spectral ore sorting incorporating wavelength selection utilizing neighborhood component analysis for effective arsenic mineral detection. Sci Rep 2024; 14:11544. [PMID: 38773148 PMCID: PMC11109168 DOI: 10.1038/s41598-024-62166-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Accepted: 05/14/2024] [Indexed: 05/23/2024] Open
Abstract
Arsenic contamination not only complicates mineral processing but also poses environmental and health risks. To address these challenges, this research investigates the feasibility of utilizing hyperspectral imaging combined with machine learning techniques for the identification of arsenic-containing minerals in copper ore samples, with a focus on practical application in sorting and processing operations. Through experimentation with various copper sulfide ores, Neighborhood Component Analysis (NCA) was employed to select essential wavelength bands from hyperspectral data, which were subsequently used as inputs for machine learning algorithms to identify arsenic concentrations. Results demonstrate that by selecting a subset of informative bands using NCA, accurate mineral identification can be achieved with a significantly reduced dataset size, enabling efficient processing and analysis. Comparison with other wavelength selection methods highlights the superiority of NCA in optimizing classification accuracy. Specifically, the identification accuracy was 91.9% or higher when utilizing 8 or more bands selected by NCA and was comparable to hyperspectral data analysis with 204 bands. The findings suggest potential for cost-effective implementation of multispectral cameras in mineral processing operations. Future research directions include refining machine learning algorithms, exploring broader applications across diverse ore types, and integrating hyperspectral imaging with emerging sensor technologies for enhanced mineral processing capabilities.
Affiliation(s)
- Natsuo Okada
  - Division of Sustainable Resources Engineering, Graduate School of Engineering, Hokkaido University, Kita-13, Nishi-8, Sapporo, 060-8628, Japan
- Hiromasa Nozaki
  - Division of Sustainable Resources Engineering, Graduate School of Engineering, Hokkaido University, Kita-13, Nishi-8, Sapporo, 060-8628, Japan
- Shinichiro Nakamura
  - Division of Sustainable Resources Engineering, Graduate School of Engineering, Hokkaido University, Kita-13, Nishi-8, Sapporo, 060-8628, Japan
- Angesom Gebretsadik
  - Division of Sustainable Resources Engineering, Graduate School of Engineering, Hokkaido University, Kita-13, Nishi-8, Sapporo, 060-8628, Japan
  - Department of Mining Engineering, Aksum University, 7080, Aksum, Tigray, Ethiopia
- Yoko Ohtomo
  - Division of Sustainable Resources Engineering, Faculty of Engineering, Hokkaido University, Kita-13, Nishi-8, Sapporo, 060-8628, Japan
- Takahiko Arima
  - Division of Sustainable Resources Engineering, Faculty of Engineering, Hokkaido University, Kita-13, Nishi-8, Sapporo, 060-8628, Japan
- Youhei Kawamura
  - Division of Sustainable Resources Engineering, Faculty of Engineering, Hokkaido University, Kita-13, Nishi-8, Sapporo, 060-8628, Japan
4
Chien SC, Yang HC, Chen CY, Chien CH, Hsu CK, Chien PH, Li YCJ. Using alert dwell time to filter universal clinical alerts: A machine learning approach. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2023; 240:107696. [PMID: 37480643 DOI: 10.1016/j.cmpb.2023.107696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2023] [Revised: 06/14/2023] [Accepted: 06/24/2023] [Indexed: 07/24/2023]
Abstract
BACKGROUND: Alerts in computerized physician order entry (CPOE) systems can improve patient safety. However, alerts in rule-based systems cannot be customized based on individual patient or user characteristics. This limitation can lead to the presentation of irrelevant alerts and subsequent alert fatigue. OBJECTIVE: We used machine learning approaches with alert dwell time to filter out irrelevant alerts for physicians based on contextual factors. METHODS: We utilized five machine learning algorithms and a total of 1,120 features grouped into six categories: alert, demographic, environment, diagnosis, prescription, and laboratory results. The model output was the alert dwell time within a specified time window, whose optimal range was determined by sensitivity analysis. RESULTS: We used 813,026 records (19 categories) from the hospital's outpatient clinic data from 2020 to 2021. The sensitivity analysis showed that a time window with a range of 0.3-4.0 s had the best performance, with an area under the receiver operating characteristic curve (AUROC) of 0.73 and an area under the precision-recall curve (AUPRC) of 0.97. The model built with alert and demographic feature groups showed the best performance, with an AUROC of 0.73. The most significant individual feature groups were alert and demographic, with AUROCs of 0.66 and 0.62, respectively. CONCLUSION: Our study found that alerts and user and patient demographic features are more crucial than clinical features when constructing universal context-aware alerts. Using alert dwell time in combination with a time window is an effective way to determine the trigger status of an alert. The findings of this study can provide useful insights for researchers working on specific and universal context-aware alerts.
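The time-window idea in this abstract can be illustrated with a short sketch. It assumes, as a simplification, that the binary target is 1 when an alert's dwell time falls inside the window and 0 otherwise; the abstract does not state how the study maps this onto alert relevance, and the function name `in_window` and the sample dwell times are hypothetical.

```python
def in_window(dwell_seconds, low=0.3, high=4.0):
    """Return 1 if the dwell time lies inside [low, high] seconds, else 0.

    The default bounds are the 0.3-4.0 s range reported in the abstract;
    treating the window as inclusive is an assumption of this sketch.
    """
    return 1 if low <= dwell_seconds <= high else 0


# Hypothetical dwell times in seconds, for illustration only.
dwell_times = [0.1, 0.3, 2.5, 4.0, 12.0]
labels = [in_window(t) for t in dwell_times]
print(labels)  # [0, 1, 1, 1, 0]
```

A sensitivity analysis in this spirit would repeat model training for several candidate `(low, high)` pairs and compare the resulting AUROC and AUPRC values.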
Affiliation(s)
- Shuo-Chen Chien
  - Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - Artificial Intelligence Research and Development Center, Wan Fang Hospital, Taipei Medical University, Taipei 110, Taiwan
  - International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
- Hsuan-Chia Yang
  - Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - Research Center of Big Data and Meta-analysis, Wan Fang Hospital, Taipei Medical University, Taipei 110, Taiwan
- Chun-You Chen
  - Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - Artificial Intelligence Research and Development Center, Wan Fang Hospital, Taipei Medical University, Taipei 110, Taiwan
  - International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - Department of Radiation Oncology, Wan Fang Hospital, Taipei Medical University, Taipei 110, Taiwan
- Chia-Hui Chien
  - Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - Office of Public Affairs, Taipei Medical University, Taipei 110, Taiwan
- Chun-Kung Hsu
  - Office of Information Technology, Taipei Medical University, Taipei 110, Taiwan
- Po-Han Chien
  - Department of Finance, National Taiwan University, Taipei 110, Taiwan
- Yu-Chuan Jack Li
  - Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - International Center for Health Information and Technology, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
  - Department of Dermatology, Wan Fang Hospital, Taipei Medical University, Taipei 110, Taiwan
  - Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 110, Taiwan
5
Zou H, Yu W. Integrating Low-Order and High-Order Correlation Information for Identifying Phage Virion Proteins. J Comput Biol 2023; 30:1131-1143. [PMID: 37729064 DOI: 10.1089/cmb.2022.0237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2023] Open
Abstract
Phage virion proteins (PVPs) play an important role in the host cell. Fast and accurate identification of PVPs is beneficial for the discovery and development of related drugs. Although wet-lab experimental approaches are the first choice to identify PVPs, they are costly and time-consuming. Thus, researchers have turned their attention to computational models, which can speed up related studies. In the current study, we therefore proposed a novel machine-learning model to identify PVPs. First, 50 different types of physicochemical properties were used to denote protein sequences. Next, two different approaches, Pearson's correlation coefficient (PCC) and maximal information coefficient (MIC), were employed to extract discriminative information. Further, to capture the high-order correlation information, we used PCC and MIC once again. After that, we adopted the least absolute shrinkage and selection operator algorithm to select the optimal feature subset. Finally, these chosen features were fed into a support vector machine to discriminate PVPs from phage non-virion proteins. We performed experiments on two different datasets to validate the effectiveness of our proposed method. Experimental results showed a significant improvement in performance compared with state-of-the-art approaches. This indicates that the proposed computational model may become a powerful predictor for identifying PVPs.
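The low-order correlation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each sequence is turned into per-residue property profiles and that one feature is the PCC between two such profiles. Only two properties are used here instead of the 50 mentioned in the abstract. `HYDROPATHY` is the Kyte-Doolittle scale; `CHARGE` is a simplified neutral-pH side-chain charge chosen for illustration, and the example sequence is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified side-chain charge at neutral pH (illustrative assumption).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def profile(seq, scale, default=0.0):
    """Map a protein sequence to a list of per-residue property values."""
    return [scale.get(aa, default) for aa in seq]


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile has no defined correlation; use 0
    return sxy / math.sqrt(sxx * syy)


seq = "MKRDELLAVIG"  # hypothetical sequence
r = pearson(profile(seq, HYDROPATHY), profile(seq, CHARGE))
print(round(r, 3))
```

With many properties, computing `pearson` for every pair of profiles would yield one feature vector per protein; the abstract's high-order step and the MIC features are not covered by this sketch.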
Affiliation(s)
- Hongliang Zou
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
- Wanting Yu
  - College of Animal Science and Technology, Jiangxi Agricultural University, Nanchang, China
6
Pan J, You Z, You W, Zhao T, Feng C, Zhang X, Ren F, Ma S, Wu F, Wang S, Sun Y. PTBGRP: predicting phage-bacteria interactions with graph representation learning on microbial heterogeneous information network. Brief Bioinform 2023; 24:bbad328. [PMID: 37742053 DOI: 10.1093/bib/bbad328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 08/14/2023] [Accepted: 08/30/2023] [Indexed: 09/25/2023] Open
Abstract
Identifying potential bacteriophage (phage) candidates to treat bacterial infections plays an essential role in research on human pathogens. Computational approaches are recognized as a valid way to predict bacteria and target phages. However, most current methods only utilize lower-order biological information without considering the higher-order connectivity patterns, which help to improve predictive accuracy. Therefore, we developed a novel microbial heterogeneous interaction network (MHIN)-based model called PTBGRP to predict new phages for bacterial hosts. Specifically, PTBGRP first constructs an MHIN by integrating phage-bacteria interaction (PBI) and six bacteria-bacteria interaction networks with their biological attributes. Then, different representation learning methods are deployed to extract higher-level biological features and lower-level topological features from the MHIN. Finally, PTBGRP employs a deep neural network as the classifier to predict unknown PBI pairs based on the fused biological information. Experimental results demonstrated that PTBGRP achieves the best performance on the corresponding ESKAPE pathogens and PBI dataset when compared with state-of-the-art methods. In addition, case studies of Klebsiella pneumoniae and Staphylococcus aureus further indicate that the consideration of rich heterogeneous information enables PTBGRP to accurately predict PBI from a more comprehensive perspective. The webserver of the PTBGRP predictor is freely available at http://120.77.11.78/PTBGRP/.
Affiliation(s)
- Jie Pan
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
- Zhuhong You
  - School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
- Wencai You
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
- Tian Zhao
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
- Chenlu Feng
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
- Xuexia Zhang
  - North China Pharmaceutical Group, Shijiazhuang 050015, Hebei, China
  - National Microbial Medicine Engineering & Research Center, Shijiazhuang 050015, Hebei, China
- Fengzhi Ren
  - North China Pharmaceutical Group, Shijiazhuang 050015, Hebei, China
  - National Microbial Medicine Engineering & Research Center, Shijiazhuang 050015, Hebei, China
- Sanxing Ma
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
- Fan Wu
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
- Shiwei Wang
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
- Yanmei Sun
  - Key Laboratory of Resources Biology and Biotechnology in Western China, Ministry of Education, Provincial Key Laboratory of Biotechnology of Shaanxi Province, the College of Life Sciences, Northwest University, Xi'an 710069, China
7
Momanyi BM, Zulfiqar H, Grace-Mercure BK, Ahmed Z, Ding H, Gao H, Liu F. CFNCM: Collaborative filtering neighborhood-based model for predicting miRNA-disease associations. Comput Biol Med 2023; 163:107165. [PMID: 37315383 DOI: 10.1016/j.compbiomed.2023.107165] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 03/14/2023] [Revised: 05/31/2023] [Accepted: 06/08/2023] [Indexed: 06/16/2023]
Abstract
MicroRNAs have a significant role in the emergence of various human disorders. Consequently, it is essential to understand the existing interactions between miRNAs and diseases, as this will help scientists better study and comprehend the diseases' biological mechanisms. By predicting possible disease-related miRNAs, findings can be employed as biomarkers or drug targets to advance the detection, diagnosis, and treatment of complex human disorders. In light of the shortcomings of conventional and biological experiments, which are expensive and time-consuming, this study proposed a computational model for predicting potential miRNA-disease associations called the Collaborative Filtering Neighborhood-based Classification Model (CFNCM). The model generated integrated miRNA and disease similarity matrices using the validated associations and miRNA and disease similarity information and used them as the input features for CFNCM. To produce class labels, we first determined the association scores for new pairs using user-based collaborative filtering. With zero as the threshold, associations with scores >0 were labelled 1, indicating a potential positive association; all others were labelled 0. Then, we developed classification models using various machine-learning algorithms. By comparison, we discovered that the support vector machine (SVM) produced the best AUC of 0.96 with 10-fold cross-validation through the GridSearchCV technique for identifying optimal parameter values. In addition, the models were evaluated and verified by analyzing the top 50 breast and lung neoplasms-related miRNAs, of which 46 and 47 associations were verified in two authoritative databases, dbDEMC and miR2Disease.
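The labelling step described in this abstract (a user-based collaborative filtering score followed by a zero threshold) can be sketched as below. This is a simplified illustration under stated assumptions: miRNAs play the role of "users", similarity is the cosine between rows of a binary association matrix (the paper uses integrated similarity matrices instead), and the toy matrix `assoc` and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cf_score(assoc, i, j):
    """Score miRNA i against disease j as the similarity-weighted average of
    the other miRNAs' known associations with disease j."""
    num = den = 0.0
    for k, row in enumerate(assoc):
        if k == i:
            continue
        s = cosine(assoc[i], row)
        num += s * row[j]
        den += abs(s)
    return num / den if den else 0.0


def label(score, threshold=0.0):
    """Scores above the threshold become class 1, all others class 0."""
    return 1 if score > threshold else 0


# Toy association matrix: rows are miRNAs, columns are diseases (hypothetical).
assoc = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
print(label(cf_score(assoc, 0, 2)))  # 1: a similar miRNA is linked to disease 2
print(label(cf_score(assoc, 0, 3)))  # 0: only a dissimilar miRNA is linked to disease 3
```

The resulting 0/1 labels would then serve as targets for a downstream classifier such as the SVM mentioned in the abstract.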
Affiliation(s)
- Biffon Manyura Momanyi
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hasan Zulfiqar
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, 313001, China
- Bakanina Kissanga Grace-Mercure
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Zahoor Ahmed
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
  - Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou, Zhejiang, 313001, China
- Hui Ding
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Hui Gao
  - School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fen Liu
  - Department of Radiation Oncology, Peking University Cancer Hospital (Inner Mongolia Campus), Affiliated Cancer Hospital of Inner Mongolia Medical University, Inner Mongolia Cancer Hospital, Hohhot, China
8
Li F, Wang Z, Guo Y, Liu C, Zhu Y, Zhou Y, Li J, Liang D, Wang H. Developing a Dynamic Graph Network for Interpretable Analysis of Multi-Modal MRI Data in Parkinson's Disease Diagnosis. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2023; 2023:1-4. [PMID: 38083052 DOI: 10.1109/embc40787.2023.10340672] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/18/2023]
Abstract
With the aging of the population, Parkinson's disease (PD) poses a severe challenge to public health. Numerous computer-aided diagnosis procedures have been developed for the diagnosis of PD and the prediction of its progression. Recently, Graph Convolutional Networks (GCN) have been widely applied in deep learning to effectively integrate multi-modal features and model subject correlation. However, many GCNs used for node classification build large-scale fixed graph topologies using the entire dataset, which can make them impossible to verify independently. Furthermore, past GCN algorithms lack sufficient interpretability, limiting their real-world applications. In this paper, an Interpretable Graph-Learning Convolutional Network (iGLCN) is proposed to enhance the performance of personalized diagnosis for PD while simultaneously producing interpretable results. The proposed method can dynamically adjust the graph structure for GCN to improve diagnostic outcomes by learning the optimal underlying latent graph. Through interpretable feature learning, the proposed network can interpret diagnosis outcomes. The experiments showed that the proposed method increased flexibility while maintaining a high level of classification performance and could provide interpretable results for PD diagnosis. Clinical Relevance: The proposed method is expected to offer strong practicability, feasibility, and interpretability for Parkinson's disease diagnosis.
9
Shang J, Peng C, Tang X, Sun Y. PhaVIP: Phage VIrion Protein classification based on chaos game representation and Vision Transformer. Bioinformatics 2023; 39:i30-i39. [PMID: 37387136 DOI: 10.1093/bioinformatics/btad229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages in different microbiomes at low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, that is, structural proteins such as major tail and baseplate proteins. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is a great demand to develop a computational method for fast and accurate phage virion protein (PVP) classification. RESULTS: In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence "images". Our method, PhaVIP, has two main functions: classifying PVP and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets with increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins. AVAILABILITY AND IMPLEMENTATION: The web server of PhaVIP is available via: https://phage.ee.cityu.edu.hk/phavip. The source code of PhaVIP is available via: https://github.com/KennthShang/PhaVIP.
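Chaos game representation (CGR) can be sketched generically as follows. This is not PhaVIP's encoding, whose details are not given in the abstract. The sketch assumes the 20 standard amino acids sit on the vertices of a regular 20-gon inscribed in the unit circle, that each residue moves the current point a fixed fraction (`ratio`, an adjustable assumption) toward its vertex, and that the visited points are binned into a square count grid that serves as the "image".

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# One polygon vertex per amino acid, evenly spaced on the unit circle.
VERTICES = {
    aa: (math.cos(2 * math.pi * k / 20), math.sin(2 * math.pi * k / 20))
    for k, aa in enumerate(AMINO_ACIDS)
}


def cgr_points(seq, ratio=0.5):
    """Return the CGR trajectory of a sequence, skipping non-standard residues."""
    x, y = 0.0, 0.0  # start at the polygon centre
    points = []
    for aa in seq:
        if aa not in VERTICES:
            continue
        vx, vy = VERTICES[aa]
        x += ratio * (vx - x)  # move part of the way toward the vertex
        y += ratio * (vy - y)
        points.append((x, y))
    return points


def cgr_image(seq, size=32, ratio=0.5):
    """Bin the CGR trajectory into a size x size grid of visit counts."""
    img = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq, ratio):
        col = max(0, min(int((x + 1) / 2 * size), size - 1))
        row = max(0, min(int((y + 1) / 2 * size), size - 1))
        img[row][col] += 1
    return img


image = cgr_image("MKVLAAGIVGLLLAQ", size=8)  # hypothetical sequence
print(sum(map(sum, image)))  # 15: one count per residue
```

A grid like `image` could be normalized and passed to an image classifier; the choice of polygon layout, `ratio`, and grid size all affect how distinguishable different sequences are.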
Affiliation(s)
- Jiayu Shang
  - Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
- Cheng Peng
  - Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
- Xubo Tang
  - Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
10
Sulaiman A, Kaur S, Gupta S, Alshahrani H, Reshan MSA, Alyami S, Shaikh A. ResRandSVM: Hybrid Approach for Acute Lymphocytic Leukemia Classification in Blood Smear Images. Diagnostics (Basel) 2023; 13:2121. [PMID: 37371016 DOI: 10.3390/diagnostics13122121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 06/17/2023] [Accepted: 06/18/2023] [Indexed: 06/29/2023] Open
Abstract
Acute Lymphocytic Leukemia is a type of cancer in which the bone marrow produces abnormal white blood cells that do not function properly, crowding out healthy cells and weakening the body's immunity and thus its ability to resist infections. It spreads quickly in children's bodies and, if not treated promptly, may lead to death. The manual detection of this disease is a tedious and slow task. Machine learning and deep learning techniques are faster and more accurate than manual detection. In this paper, a deep feature selection-based approach, ResRandSVM, is proposed for the detection of Acute Lymphocytic Leukemia in blood smear images. The proposed approach uses seven deep-learning models (ResNet152, VGG16, DenseNet121, MobileNetV2, InceptionV3, EfficientNetB0, and ResNet50) for deep feature extraction from blood smear images. After that, three feature selection methods are used to extract valuable and important features: analysis of variance (ANOVA), principal component analysis (PCA), and Random Forest. Then the selected feature map is fed to four different classifiers (Adaboost, Support Vector Machine, Artificial Neural Network, and Naïve Bayes) to classify the images into leukemia and normal images. The model performs best with a combination of ResNet50 as a feature extractor, Random Forest as feature selection, and Support Vector Machine as a classifier, with an accuracy of 0.900, precision of 0.902, recall of 0.957, and F1-score of 0.929.
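As a quick consistency check on the reported metrics, the F1-score can be recomputed from the stated precision and recall using the standard harmonic-mean definition. The helper name `f1_score` below is just a local function for this check, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_precision = 0.902
reported_recall = 0.957
f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # 0.929, matching the reported F1-score
```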
Affiliation(s)
- Adel Sulaiman: Department of Computer Science, College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia
- Swapandeep Kaur: Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura 140401, Punjab, India
- Sheifali Gupta: Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura 140401, Punjab, India
- Hani Alshahrani: Department of Computer Science, College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia
- Mana Saleh Al Reshan: Department of Information Systems, College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia
- Sultan Alyami: Department of Computer Science, College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia
- Asadullah Shaikh: Department of Information Systems, College of Computer Science and Information Systems, Najran University, Najran 61441, Saudi Arabia
|
11
|
Prediction of Phage Virion Proteins Using Machine Learning Methods. Molecules 2023; 28:2238. [PMID: 36903484 PMCID: PMC10004995 DOI: 10.3390/molecules28052238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2023] [Revised: 01/27/2023] [Accepted: 02/20/2023] [Indexed: 03/04/2023] Open
Abstract
Antimicrobial resistance (AMR) is a major problem and an immediate alternative to antibiotics is the need of the hour. Research on the possible alternative products to tackle bacterial infections is ongoing worldwide. One of the most promising alternatives to antibiotics is the use of bacteriophages (phage) or phage-driven antibacterial drugs to cure bacterial infections caused by AMR bacteria. Phage-driven proteins, including holins, endolysins, and exopolysaccharides, have shown great potential in the development of antibacterial drugs. Likewise, phage virion proteins (PVPs) might also play an important role in the development of antibacterial drugs. Here, we have developed a machine learning-based prediction method to predict PVPs using phage protein sequences. We have employed well-known basic and ensemble machine learning methods with protein sequence composition features for the prediction of PVPs. We found that the gradient boosting classifier (GBC) method achieved the best accuracy of 80% on the training dataset and an accuracy of 83% on the independent dataset. The performance on the independent dataset is better than other existing methods. A user-friendly web server developed by us is freely available to all users for the prediction of PVPs from phage protein sequences. The web server might facilitate the large-scale prediction of PVPs and hypothesis-driven experimental study design.
|
12
|
Dao FY, Liu ML, Su W, Lv H, Zhang ZY, Lin H, Liu L. AcrPred: A hybrid optimization with enumerated machine learning algorithm to predict Anti-CRISPR proteins. Int J Biol Macromol 2023; 228:706-714. [PMID: 36584777 DOI: 10.1016/j.ijbiomac.2022.12.250] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Received: 11/14/2022] [Revised: 12/12/2022] [Accepted: 12/22/2022] [Indexed: 12/29/2022]
Abstract
CRISPR-Cas, as a tool for gene editing, has received extensive attention in recent years. Anti-CRISPR (Acr) proteins can inactivate the CRISPR-Cas defense system during the interference phase and can be used as a potential tool for the regulation of gene editing. In-depth study of Anti-CRISPR proteins is of great significance for the implementation of gene editing. In this study, we developed a high-accuracy prediction model based on a two-step model fusion strategy, called AcrPred, which achieved an AUC of 0.952 in independent dataset validation. To further validate the proposed model, we compared it with published tools and correctly identified 9 of 10 new Acr proteins, indicating the strong generalization ability of our model. Finally, for the convenience of wet-lab researchers, a user-friendly web server, AcrPred (Anti-CRISPR proteins Prediction), was established at http://lin-group.cn/server/AcrPred, by which users can easily identify potential Anti-CRISPR proteins.
Affiliation(s)
- Fu-Ying Dao: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; School of Biological Sciences, Nanyang Technological University, Singapore 639798, Singapore
- Meng-Lu Liu: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Su: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lv: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Zhao-Yue Zhang: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lin: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Li Liu: Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
|
13
|
A new ranking-based stability measure for feature selection algorithms. Soft Comput 2023. [DOI: 10.1007/s00500-022-07767-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023]
|
14
|
Aghayousefi R, Hosseiniyan Khatibi SM, Zununi Vahed S, Bastami M, Pirmoradi S, Teshnehlab M. A diagnostic miRNA panel to detect recurrence of ovarian cancer through artificial intelligence approaches. J Cancer Res Clin Oncol 2023; 149:325-341. [PMID: 36378340 DOI: 10.1007/s00432-022-04468-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Accepted: 11/06/2022] [Indexed: 11/17/2022]
Abstract
BACKGROUND Ovarian cancer (OC) is the deadliest gynecological malignancy, and its high recurrence rate poses a major challenge. Therefore, deep insight into the genetic and molecular mechanisms of OC recurrence can improve targeted therapeutic procedures. This study aimed to discover crucial miRNAs for the detection of tumor recurrence in OC by artificial intelligence approaches. METHOD Through the ANOVA feature selection method, we selected 100 candidate miRNAs among 588 miRNAs. A deep-learning classification model was then employed to validate the significance of the candidate miRNAs. The accuracy, F1-score (high-risk), and AUC-ROC on the classification test data based on the 100 miRNAs were 73%, 0.81, and 0.65, respectively. Association rule mining was used to discover hidden relations among the selected miRNAs. RESULT Five miRNAs, including miR-1914, miR-203, miR-135a-2, miR-149, and miR-9-1, were identified as the most frequent items among high-risk association rules. The identified miRNAs may target genes/proteins involved in epithelial-mesenchymal transition (EMT), resistance to therapy, and cancer stem cells, which are responsible for the heterogeneity and plasticity of the tumor. Our results present miR-1914 as the significant candidate miRNA and the most frequent item. Current knowledge indicates that dysregulated miR-1914 may function as a tumor suppressor or oncogene in the development of cancer. CONCLUSION These candidate miRNAs can be considered a powerful tool in the diagnosis of OC recurrence. We hypothesize that miR-1914 might open a new line of research in managing the recurrence of OC and could be a significant factor in triggering OC recurrence.
Affiliation(s)
- Reyhaneh Aghayousefi: Department of Electrical Engineering, K.N. Toosi University of Technology, Tehran, Iran
- Seyed Mahdi Hosseiniyan Khatibi: Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran; Rahat Breath and Sleep Research Center, Tabriz University of Medical Sciences, Tabriz, Iran; Kidney Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Milad Bastami: Non-Communicable Diseases Research Center, Fasa University of Medical Sciences, Fasa, Iran
- Saeed Pirmoradi: Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohammad Teshnehlab: Department of Electrical Engineering, K.N. Toosi University of Technology, Tehran, Iran
|
15
|
Identification of adaptor proteins using the ANOVA feature selection technique. Methods 2022; 208:42-47. [DOI: 10.1016/j.ymeth.2022.10.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2022] [Revised: 10/01/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022] Open
|
16
|
iEnhancer-MRBF: Identifying enhancers and their strength with a multiple Laplacian-regularized radial basis function network. Methods 2022; 208:1-8. [DOI: 10.1016/j.ymeth.2022.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Revised: 09/26/2022] [Accepted: 10/03/2022] [Indexed: 11/07/2022] Open
|
17
|
Fang Z, Feng T, Zhou H, Chen M. DeePVP: Identification and classification of phage virion proteins using deep learning. Gigascience 2022; 11:giac076. [PMID: 35950840 PMCID: PMC9366990 DOI: 10.1093/gigascience/giac076] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/08/2022] [Revised: 06/08/2022] [Accepted: 07/11/2022] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Many biological properties of phages are determined by phage virion proteins (PVPs), and the poor annotation of PVPs is a bottleneck for many areas of viral research, such as viral phylogenetic analysis, viral host identification, and antibacterial drug design. Because of the high diversity of PVP sequences, the PVP annotation of a phage genome remains a particularly challenging bioinformatic task. FINDINGS Based on deep learning, we developed DeePVP. The main module of DeePVP aims to discriminate PVPs from non-PVPs within a phage genome, while the extended module of DeePVP can further classify predicted PVPs into the 10 major classes of PVPs. Compared with the present state-of-the-art tools, the main module of DeePVP performs better, with a 9.05% higher F1-score in the PVP identification task. Moreover, the overall accuracy of the extended module of DeePVP in the PVP classification task is approximately 3.72% higher than that of PhANNs. Two application cases show that the predictions of DeePVP are more reliable and can better reveal the compact PVP-enriched region than the current state-of-the-art tools. Particularly, in the Escherichia phage phiEC1 genome, a novel PVP-enriched region that is conserved in many other Escherichia phage genomes was identified, indicating that DeePVP will be a useful tool for the analysis of phage genomic structures. CONCLUSIONS DeePVP outperforms state-of-the-art tools. The program is optimized in both a virtual machine with graphical user interface and a docker so that the tool can be easily run by noncomputer professionals. DeePVP is freely available at https://github.com/fangzcbio/DeePVP/.
Affiliation(s)
- Zhencheng Fang: Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Tao Feng: Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Hongwei Zhou: Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
- Muxuan Chen: Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou 510280, China
|
18
|
Liu S, Cui C, Chen H, Liu T. Ensemble Learning-Based Feature Selection for Phage Protein Prediction. Front Microbiol 2022; 13:932661. [PMID: 35910662 PMCID: PMC9335128 DOI: 10.3389/fmicb.2022.932661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open
Abstract
Phage has high specificity for its host recognition. As a natural enemy of bacteria, it has been used to treat super bacteria many times. Identifying phage proteins from the original sequence is very important for understanding the relationship between phage and host bacteria and developing new antimicrobial agents. However, traditional experimental methods are both expensive and time-consuming. In this study, an ensemble learning-based feature selection method is proposed to find important features for phage protein identification. The method uses four types of protein sequence-derived features, quantifies the importance of each feature by adding perturbations to the features to influence the results, and finally splices the important features among the four types of features. In addition, we analyzed the selected features and their biological significance.
Affiliation(s)
- Songbo Liu: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Chengmin Cui: Beijing Institute of Control Engineering, China Academy of Space Technology, Beijing, China
- Huipeng Chen (corresponding author): School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Tong Liu: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
|
19
|
Niu M, Zou Q. SgRNA-RF: Identification of SgRNA On-Target Activity With Imbalanced Datasets. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2442-2453. [PMID: 33979289 DOI: 10.1109/tcbb.2021.3079116] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/12/2023]
Abstract
Single-guide RNA is a guide RNA (gRNA), which guides the insertion or deletion of uridine residues into kinetoplastid during RNA editing. It is a small non-coding RNA that can pair with pre-mRNA. SgRNA is a critical component of the CRISPR/Cas9 gene knockout system and plays an important role in gene editing and gene regulation. It is important to identify sgRNAs with high on-target activity accurately and quickly. Due to its importance, several computational predictors have been proposed to predict sgRNA on-target activity. All these methods have clearly contributed to the development of this very important field. However, they also have certain limitations. In this paper, we developed a new classifier, SgRNA-RF, which extracts nucleic acid composition and structure features from sgRNA sequences with on-target activity and classifies them with the random forest algorithm. In addition, to address dataset imbalance, this paper proposes a new method called CS-Smote. We compared SgRNA-RF with state-of-the-art predictors on the five datasets and found that SgRNA-RF significantly improved the identification accuracy, with accuracies of 0.8636, 0.9161, 0.894, 0.938, 0.965, 0.77, 0.979, 0.973, respectively. The user-friendly web server that implements SgRNA-RF is freely available at http://server.malab.cn/sgRNA-RF/.
|
20
|
Chu Y, Guo S, Cui D, Fu X, Ma Y. DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data. PeerJ 2022; 10:e13404. [PMID: 35698617 PMCID: PMC9188312 DOI: 10.7717/peerj.13404] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2021] [Accepted: 04/18/2022] [Indexed: 01/14/2023] Open
Abstract
Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Due to the lack of universal gene markers and database representatives, about 50-90% of phage genes cannot be assigned functions. This makes it a challenge to identify phage genomes and annotate the functions of phage genes efficiently by homology search on a large scale, especially for new phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three specific proteins of Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three specific proteins from metagenomic data. The framework takes one-hot encoded data of the original protein sequences as input and automatically extracts predictive features during modeling. To overcome the false positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff-loss-values demonstrates high performance in terms of precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets, and the results showed that, compared to conventional alignment-based methods, our proposed framework had a particular advantage in identifying novel phage-specific protein sequences of Portal and TerL with remote homology to their counterparts in the training datasets. In summary, our study for the first time develops a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help us find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP.
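To illustrate the input representation the abstract mentions, the following is a minimal sketch of one-hot encoding a protein sequence. The 20-letter alphabet, the fixed `max_len` padding, and the all-zero rows for unknown residues are assumptions made for this example; the abstract does not specify how DeephageTP handles these details.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix of 0/1 values.

    Sequences longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Residues outside the alphabet also map to zeros.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence[:max_len].upper()):
        col = INDEX.get(residue)
        if col is not None:
            matrix[pos][col] = 1
    return matrix


# Hypothetical three-residue fragment, padded to length 5.
encoded = one_hot_encode("MKT", max_len=5)
print(len(encoded), len(encoded[0]))  # 5 20
```

A matrix of this shape is the kind of input a convolutional network can scan along the sequence axis without hand-crafted features.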
Affiliation(s)
- Yunmeng Chu: Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese, Shenzhen, Guangdong, P.R. China; Department of Bioengineering and Biotechnology, Huaqiao University, Xiamen, Fujian, P.R. China
- Shun Guo: Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese, Shenzhen, Guangdong, P.R. China
- Dachao Cui: Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese, Shenzhen, Guangdong, P.R. China
- Xiongfei Fu: Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese, Shenzhen, Guangdong, P.R. China
- Yingfei Ma: Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese, Shenzhen, Guangdong, P.R. China
|
21
|
Yan J, Zhang B, Zhou M, Kwok HF, Siu SWI. Multi-Branch-CNN: Classification of ion channel interacting peptides using multi-branch convolutional neural network. Comput Biol Med 2022; 147:105717. [PMID: 35752114 DOI: 10.1016/j.compbiomed.2022.105717] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/09/2022] [Revised: 05/18/2022] [Accepted: 06/05/2022] [Indexed: 11/03/2022]
Abstract
Ligand peptides that have high affinity for ion channels are critical for regulating ion flux across the plasma membrane. These peptides are now being considered as potential drug candidates for many diseases, such as cardiovascular disease and cancers. In this work, we developed Multi-Branch-CNN, a CNN method with multiple input branches for identifying three types of ion channel peptide binders (sodium, potassium, and calcium) from intra- and inter-feature types. As for its real-world applications, prediction models that are able to recognize novel sequences having high or low similarities to training sequences are required. To this end, we tested our models on two test sets: a general test set including sequences spanning different similarity levels to those of the training set, and a novel-test set consisting of only sequences that bear little resemblance to sequences from the training set. Our experiments showed that the Multi-Branch-CNN method performs better than thirteen traditional ML algorithms (TML13), yielding an improvement in accuracy of 3.2%, 1.2%, and 2.3% on the test sets as well as 8.8%, 14.3%, and 14.6% on the novel-test sets for sodium, potassium, and calcium ion channels, respectively. We confirmed the effectiveness of Multi-Branch-CNN by comparing it to the standard CNN method with one input branch (Single-Branch-CNN) and an ensemble method (TML13-Stack). The data sets, script files to reproduce the experiments, and the final predictive models are freely available at https://github.com/jieluyan/Multi-Branch-CNN.
Affiliation(s)
- Jielu Yan: PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China
- Bob Zhang: PAMI Research Group, Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China
- Mingliang Zhou: School of Computer Science, Chongqing University, Shapingba, Chongqing, China
- Hang Fai Kwok: Department of Biomedical Sciences, Faculty of Health Sciences, University of Macau, Taipa, Macao Special Administrative Region of China
- Shirley W I Siu: Department of Computer and Information Science, University of Macau, Taipa, Macao Special Administrative Region of China; Institute of Science and Environment, University of Saint Joseph, Estr. Marginal da Ilha Verde, Macao Special Administrative Region of China
|
22
|
Ahmad S, Charoenkwan P, Quinn JMW, Moni MA, Hasan MM, Lio' P, Shoombuatong W. SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins. Sci Rep 2022; 12:4106. [PMID: 35260777 PMCID: PMC8904530 DOI: 10.1038/s41598-022-08173-5] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 01/24/2022] [Accepted: 03/03/2022] [Indexed: 12/30/2022] Open
Abstract
Fast and accurate identification of phage virion proteins (PVPs) would greatly facilitate antibacterial drug discovery and development. Although several research efforts based on machine learning (ML) methods have been made for in silico identification of PVPs, these methods have certain limitations. Therefore, in this study, we propose a new computational approach, termed SCORPION (StaCking-based Predictior fOR Phage VIrion PrOteiNs), to accurately identify PVPs using only protein primary sequences. Specifically, we comprehensively explored 13 different feature descriptors from different aspects (i.e., compositional information, composition-transition-distribution information, position-specific information and physicochemical properties) with 10 popular ML algorithms to construct a pool of optimal baseline models. These optimal baseline models were then used to generate probabilistic features (PFs), which were combined into a new feature vector. Finally, we utilized a two-step feature selection strategy to determine the optimal PF feature vector and used this feature vector to develop a stacked model (SCORPION). Both tenfold cross-validation and independent test results indicate that SCORPION achieves better predictive performance than its constituent baseline models and existing methods. We anticipate SCORPION will serve as a useful tool for the cost-effective and large-scale screening of new PVPs. The source codes and datasets for this work are available for download in the GitHub repository (https://github.com/saeed344/SCORPION).
Affiliation(s)
- Saeed Ahmad: Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Phasit Charoenkwan: Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
- Julian M W Quinn: Bone Biology Division, Garvan Institute of Medical Research, 384 Victoria Street, Darlinghurst, NSW, 2010, Australia
- Mohammad Ali Moni: Faculty of Health and Behavioural Sciences, School of Health and Rehabilitation Sciences, The University of Queensland, St Lucia, QLD, 4072, Australia
- Md Mehedi Hasan: Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane Center for Biomedical Informatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
- Pietro Lio': Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
- Watshara Shoombuatong: Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
|
23
|
Kabir M, Nantasenamat C, Kanthawong S, Charoenkwan P, Shoombuatong W. Large-scale comparative review and assessment of computational methods for phage virion proteins identification. EXCLI J 2022; 21:11-29. [PMID: 35145365 PMCID: PMC8822302 DOI: 10.17179/excli2021-4411] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/12/2021] [Accepted: 11/29/2021] [Indexed: 12/11/2022]
Abstract
Phage virion proteins (PVPs) are effective at recognizing and binding to host cell receptors while having no deleterious effects on human or animal cells. Understanding their functional mechanisms is regarded as a critical goal that will aid in rational antibacterial drug discovery and development. Although high-throughput experimental methods for identifying PVPs are considered the gold standard for exploring crucial PVP features, these procedures are frequently time-consuming and labor-intensive. Thus far, more than ten sequence-based predictors have been established for the in silico identification of PVPs in conjunction with traditional experimental approaches. As a result, a revised and more thorough assessment is extremely desirable. With this purpose in mind, we first conduct a thorough survey and evaluation of 13 state-of-the-art PVP predictors. These PVP predictors can be classified into three groups according to the types of machine learning (ML) algorithms employed (i.e. traditional ML-based methods, ensemble-based methods and deep learning-based methods). Subsequently, we explored which factors are important for building more accurate and stable predictors, including training/independent datasets, feature encoding algorithms, feature selection methods, core algorithms, performance evaluation metrics/strategies and web servers. Finally, we provide insights and future perspectives for the design and development of new and more effective computational approaches for the detection and characterization of PVPs.
Affiliation(s)
- Muhammad Kabir: School of Systems and Technology, Department of Computer Science, University of Management and Technology, Lahore, Pakistan, 54770
- Chanin Nantasenamat: Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
- Sakawrat Kanthawong: Department of Microbiology, Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand, 40002
- Phasit Charoenkwan: Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand, 50200
- Watshara Shoombuatong: Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
|
24
|
Bonidia RP, Domingues DS, Sanches DS, de Carvalho ACPLF. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform 2022; 23:bbab434. [PMID: 34750626 PMCID: PMC8769707 DOI: 10.1093/bib/bbab434] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/29/2021] [Revised: 09/18/2021] [Accepted: 09/20/2021] [Indexed: 12/24/2022] Open
Abstract
One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy of 0.6350-0.9897), both when applying only mathematical descriptors and when combining them with well-known descriptors from the literature. Finally, through MathFeature, we outperformed several studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by bringing descriptors not available in other packages, as well as allowing non-experts to use feature extraction techniques.
Affiliation(s)
- Robson P Bonidia
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Douglas S Domingues
  - Group of Genomics and Transcriptomes in Plants, Institute of Biosciences, São Paulo State University (UNESP), Rio Claro 13506-900, Brazil
- Danilo S Sanches
  - Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio 86300-000, Brazil
- André C P L F de Carvalho
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
|
25
|
Zhao D, Teng Z, Li Y, Chen D. iAIPs: Identifying Anti-Inflammatory Peptides Using Random Forest. Front Genet 2021; 12:773202. [PMID: 34917130 PMCID: PMC8669811 DOI: 10.3389/fgene.2021.773202] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/09/2021] [Accepted: 10/08/2021] [Indexed: 12/25/2022] Open
Abstract
Recently, several anti-inflammatory peptides (AIPs) have been found in the process of the inflammatory response, and these peptides have been used to treat some inflammatory and autoimmune diseases. Therefore, accurately identifying AIPs from given amino acid sequences is critical for the discovery of novel and efficient anti-inflammatory peptide-based therapeutics and the acceleration of their application in therapy. In this paper, a random forest-based model called iAIPs for identifying AIPs is proposed. First, the original samples are encoded with three feature extraction methods: g-gap dipeptide composition (GDC), dipeptide deviation from the expected mean (DDE), and amino acid composition (AAC). Second, the optimal feature subset is generated by a two-step feature selection method, in which the features are ranked by analysis of variance (ANOVA) and the optimal feature subset is then chosen by an incremental feature selection strategy. Finally, the optimal feature subset is fed into the random forest classifier to construct the identification model. Experimental results showed that iAIPs achieved an AUC value of 0.822 on an independent test dataset, which indicated that our proposed model has better performance than the existing methods. Furthermore, the extraction of features for peptide sequences provides the basis for evolutionary analysis. The study of peptide identification is helpful to understand the diversity of species and analyze the evolutionary history of species.
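Two of the encodings named in this abstract, AAC and g-gap dipeptide composition, can be sketched as follows. This is a minimal illustration rather than the iAIPs code: the function names, the ordering of the 20-letter alphabet, and the decision to skip pairs containing non-standard residues are all assumptions made here.

```python
from itertools import product

# Assumed ordering of the 20 standard amino acids (alphabetical by letter).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq: str) -> list:
    """Amino acid composition: fraction of each standard residue in seq."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def g_gap_dipeptide(seq: str, g: int = 0) -> list:
    """Frequencies of the 400 residue pairs separated by g intervening residues."""
    seq = seq.upper()
    n_pairs = len(seq) - g - 1
    if g < 0 or n_pairs <= 0:
        raise ValueError("sequence too short for this gap")
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(n_pairs):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[p] / n_pairs for p in pairs]
```

With `g=0` the second function reduces to the ordinary dipeptide composition; concatenating vectors for several gap values is one way to build the kind of feature set that a ranking step such as ANOVA would then prune.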
Affiliation(s)
- Dongxu Zhao
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Zhixia Teng
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Yanjuan Li
  - College of Electrical and Information Engineering, Quzhou University, Quzhou, China
- Dong Chen
  - College of Electrical and Information Engineering, Quzhou University, Quzhou, China
|
26
|
Timmons PB, Hewage CM. ENNAVIA is a novel method which employs neural networks for antiviral and anti-coronavirus activity prediction for therapeutic peptides. Brief Bioinform 2021; 22:bbab258. [PMID: 34297817 PMCID: PMC8575049 DOI: 10.1093/bib/bbab258] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 04/02/2021] [Revised: 06/09/2021] [Accepted: 06/18/2021] [Indexed: 11/14/2022] Open
Abstract
Viruses represent one of the greatest threats to human health, necessitating the development of new antiviral drug candidates. Antiviral peptides often possess excellent biological activity and a favourable toxicity profile, and therefore represent a promising field of novel antiviral drugs. As the quantity of sequencing data grows annually, the development of an accurate in silico method for the prediction of peptide antiviral activities is important. This study leverages advances in deep learning and cheminformatics to produce a novel sequence-based deep neural network classifier for the prediction of antiviral peptide activity. The method outperforms the existent best-in-class, with an external test accuracy of 93.9%, Matthews correlation coefficient of 0.87 and an Area Under the Curve of 0.93 on the dataset of experimentally validated peptide activities. This cutting-edge classifier is available as an online web server at https://research.timmons.eu/ennavia, facilitating in silico screening and design of peptide antiviral drugs by the wider research community.
Affiliation(s)
- Patrick Brendan Timmons
  - UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
- Chandralal M Hewage
  - UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
|
27
|
iPVP-MCV: A Multi-Classifier Voting Model for the Accurate Identification of Phage Virion Proteins. Symmetry (Basel) 2021. [DOI: 10.3390/sym13081506] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022] Open
Abstract
The classic structure of a bacteriophage is commonly characterized by complex symmetry. The head of the structure features icosahedral symmetry, whereas the tail features helical symmetry. The phage virion protein (PVP), a type of bacteriophage structural protein, is an essential material of the infectious viral particles and is responsible for multiple biological functions. Accurate identification of PVPs is of great significance for comprehending the interaction between phages and host bacteria and developing new antimicrobial drugs or antibiotics. However, traditional experimental approaches for identifying PVPs are often time-consuming and laborious. Therefore, the development of computational methods that can efficiently and accurately identify PVPs is desired. In this study, we proposed a multi-classifier voting model called iPVP-MCV to enhance the predictive performance of PVPs based on their amino acid sequences. First, three types of evolutionary features were extracted from the position-specific scoring matrix (PSSM) profiles to represent PVPs and non-PVPs. Then, a set of baseline models were trained based on the support vector machine (SVM) algorithm combined with each type of feature descriptors. Finally, the outputs of these baseline models were integrated to construct the proposed method iPVP-MCV by using the majority voting strategy. Our results demonstrated that the proposed iPVP-MCV model was superior to existing methods when performing the rigorous independent dataset test.
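The final integration step described above, majority voting over the outputs of several baseline classifiers, can be sketched in a few lines. This is an illustrative version only, not the iPVP-MCV implementation: the function name and the tie-breaking rule (the earliest-listed model among tied labels wins) are assumptions.

```python
from collections import Counter


def majority_vote(predictions: list) -> list:
    """Combine per-model label lists into one label per sample by majority vote.

    predictions[m][i] is the label that model m assigns to sample i.
    """
    if not predictions:
        raise ValueError("need at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Assumed tie-break: take the first vote (in model order) with the top count.
        voted.append(next(v for v in sample_votes if counts[v] == top))
    return voted
```

Using an odd number of binary baseline models, as with three feature types each paired with one SVM, avoids ties altogether.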
|
28
|
Abujelala M, Karthikeyan R, Tyagi O, Du J, Mehta RK. Brain Activity-Based Metrics for Assessing Learning States in VR under Stress among Firefighters: An Explorative Machine Learning Approach in Neuroergonomics. Brain Sci 2021; 11:885. [PMID: 34209388 PMCID: PMC8304323 DOI: 10.3390/brainsci11070885] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/10/2021] [Revised: 06/26/2021] [Accepted: 06/28/2021] [Indexed: 12/02/2022] Open
Abstract
The nature of firefighters' duties requires them to work for long periods under unfavorable conditions. To perform their jobs effectively, they are required to endure long hours of extensive, stressful training. Creating such training environments is very expensive and it is difficult to guarantee trainees' safety. In this study, firefighters are trained in a virtual environment that includes virtual perturbations such as fires, alarms, and smoke. The objective of this paper is to use machine learning methods to discern encoding and retrieval states in firefighters during a visuospatial episodic memory task and explore which regions of the brain provide suitable signals to solve this classification problem. Our results show that the Random Forest algorithm could be used to distinguish between information encoding and retrieval using features extracted from fNIRS data. Our algorithm achieved an F-1 score of 0.844 and an accuracy of 79.10% when the training and testing data were obtained under similar environmental conditions. However, the algorithm's performance dropped to an F-1 score of 0.723 and an accuracy of 60.61% when evaluated on data collected under different environmental conditions than the training data. We also found that if the training and evaluation data were recorded under the same environmental conditions, the RPM, LDLPFC, and RDLPFC were the most relevant brain regions under non-stressful, stressful, and a mix of stressful and non-stressful conditions, respectively.
Affiliation(s)
- Maher Abujelala
  - Department of Industrial & Systems Engineering, Texas A & M University, College Station, TX 77843, USA
- Rohith Karthikeyan
  - Department of Mechanical Engineering, Texas A & M University, College Station, TX 77843, USA
- Oshin Tyagi
  - Department of Industrial & Systems Engineering, Texas A & M University, College Station, TX 77843, USA
- Jing Du
  - Department of Civil and Coastal Engineering, Engineering School of Sustainable Infrastructure and Environment (ESSIE), Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL 32611, USA
- Ranjana K. Mehta
  - Department of Industrial & Systems Engineering, Texas A & M University, College Station, TX 77843, USA
  - Department of Mechanical Engineering, Texas A & M University, College Station, TX 77843, USA
|
29
|
Nami Y, Imeni N, Panahi B. Application of machine learning in bacteriophage research. BMC Microbiol 2021; 21:193. [PMID: 34174831 PMCID: PMC8235560 DOI: 10.1186/s12866-021-02256-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 03/13/2021] [Accepted: 06/08/2021] [Indexed: 12/20/2022] Open
Abstract
Phages are one of the key components in the structure, dynamics, and interactions of microbial communities in different bins. They have a clear impact on human health and the food industry. Bacteriophage characterization using in vitro approaches is a time-consuming, costly, and laborious task. On the other hand, with the advent of new high-throughput sequencing technology, the development of a powerful computational framework to characterize the newly identified bacteriophages is inevitable for future research. Machine learning includes powerful techniques that enable the analysis of complex datasets for knowledge discovery and pattern recognition. In this study, we conducted a comprehensive review of machine learning methods, using different types of features, that have been applied to various aspects of bacteriophage research, including automated curation, identification, classification, host species recognition, virion protein identification, and life cycle prediction. Moreover, the potential limitations and advantages of the developed frameworks are discussed.
Affiliation(s)
- Yousef Nami
  - Department of Food Biotechnology, Branch for Northwest & West Region, Agricultural Biotechnology Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran
- Nazila Imeni
  - Young Researchers and Elite Clube, Marand Branch, Islamic Azad University, Marand, Iran
- Bahman Panahi
  - Department of Genomics, Branch for Northwest & West Region, Agricultural Biotechnology Research Institute of Iran, Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran
|
30
|
Component Parts of Bacteriophage Virions Accurately Defined by a Machine-Learning Approach Built on Evolutionary Features. mSystems 2021; 6:e0024221. [PMID: 34042467 PMCID: PMC8269216 DOI: 10.1128/msystems.00242-21] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/20/2022] Open
Abstract
Antimicrobial resistance (AMR) continues to evolve as a major threat to human health, and new strategies are required for the treatment of AMR infections. Bacteriophages (phages) that kill bacterial pathogens are being identified for use in phage therapies, with the intention to apply these bactericidal viruses directly into the infection sites in bespoke phage cocktails. Despite the great unsampled phage diversity for this purpose, an issue hampering the roll out of phage therapy is the poor quality annotation of many of the phage genomes, particularly for those from infrequently sampled environmental sources. We developed a computational tool called STEP3 to use the “evolutionary features” that can be recognized in genome sequences of diverse phages. These features, when integrated into an ensemble framework, achieved a stable and robust prediction performance when benchmarked against other prediction tools using phages from diverse sources. Validation of the prediction accuracy of STEP3 was conducted with high-resolution mass spectrometry analysis of two novel phages, isolated from a watercourse in the Southern Hemisphere. STEP3 provides a robust computational approach to distinguish specific and universal features in phages to improve the quality of phage cocktails and is available for use at http://step3.erc.monash.edu/. IMPORTANCE In response to the global problem of antimicrobial resistance, there are moves to use bacteriophages (phages) as therapeutic agents. Selecting which phages will be effective therapeutics relies on interpreting features contributing to shelf-life and applicability to diagnosed infections. However, the protein components of the phage virions that dictate these properties vary so much in sequence that best estimates suggest failure to recognize up to 90% of them. We have utilized this diversity in evolutionary features as an advantage, to apply machine learning for prediction accuracy for diverse components in phage virions. 
We benchmark this new tool showing the accurate recognition and evaluation of phage component parts using genome sequence data of phages from undersampled environments, where the richest diversity of phage still lies.
|
31
|
Deberneh HM, Kim I. Prediction of Type 2 Diabetes Based on Machine Learning Algorithm. Int J Environ Res Public Health 2021; 18:3317. [PMID: 33806973 PMCID: PMC8004981 DOI: 10.3390/ijerph18063317] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 02/02/2021] [Revised: 03/15/2021] [Accepted: 03/17/2021] [Indexed: 12/17/2022]
Abstract
Prediction of type 2 diabetes (T2D) occurrence allows a person at risk to take actions that can prevent onset or delay the progression of the disease. In this study, we developed a machine learning (ML) model to predict T2D occurrence in the following year (Y + 1) using variables in the current year (Y). The dataset for this study was collected at a private medical institute as electronic health records from 2013 to 2018. To construct the prediction model, key features were first selected using ANOVA tests, chi-squared tests, and recursive feature elimination methods. The resultant features were fasting plasma glucose (FPG), HbA1c, triglycerides, BMI, gamma-GTP, age, uric acid, sex, smoking, drinking, physical activity, and family history. We then employed logistic regression, random forest, support vector machine, XGBoost, and ensemble machine learning algorithms based on these variables to predict the outcome as normal (non-diabetic), prediabetes, or diabetes. Based on the experimental results, the performance of the prediction model proved to be reasonably good at forecasting the occurrence of T2D in the Korean population. The model can provide clinicians and patients with valuable predictive information on the likelihood of developing T2D. The cross-validation (CV) results showed that the ensemble models had a superior performance to that of the single models. The CV performance of the prediction models was improved by incorporating more medical history from the dataset.
Affiliation(s)
- Intaek Kim
  - Department of Information and Communications Engineering, Myongji University, 116 Myongji-ro, Yongin, Gyeonggi 17058, Korea
|
32
|
Effrosynidis D, Arampatzis A. An evaluation of feature selection methods for environmental data. Ecol Inform 2021. [DOI: 10.1016/j.ecoinf.2021.101224] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 10/22/2022]
|
33
|
He S, Guo F, Zou Q, Ding H. MRMD2.0: A Python Tool for Machine Learning with Feature Ranking and Reduction. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200503030350] [Citation(s) in RCA: 101] [Impact Index Per Article: 33.7] [Indexed: 11/22/2022]
Abstract
Aims:
The study aims to find a way to reduce the dimensionality of the dataset.
Background:
Dimensionality reduction is a key issue in the machine learning process. It not only improves prediction performance but can also recommend the intrinsic features and help to explore the biological expression of the machine learning “black box”.
Objective:
A variety of feature selection algorithms are used to select data features to achieve dimensionality reduction.
Methods:
First, MRMD2.0 integrated 7 different popular feature ranking algorithms with a PageRank strategy. Second, the optimized dimensionality was detected with a forward adding strategy.
Result:
We have achieved good results in our experiments.
Conclusion:
Several works have been tested with MRMD2.0, and it performed well. It can also draw performance curves according to the feature dimensionality. If users want to sacrifice accuracy for fewer features, they can select the dimensionality from the performance curves.
Other:
We developed user-friendly Python tools together with a web server. Users can upload their files in csv, arff or libsvm format. The web server then helps to rank features and find the optimized dimensionality.
Affiliation(s)
- Shida He
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Hui Ding
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
34
|
Fang Z, Zhou H. VirionFinder: Identification of Complete and Partial Prokaryote Virus Virion Protein From Virome Data Using the Sequence and Biochemical Properties of Amino Acids. Front Microbiol 2021; 12:615711. [PMID: 33613485 PMCID: PMC7894196 DOI: 10.3389/fmicb.2021.615711] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/09/2020] [Accepted: 01/04/2021] [Indexed: 01/22/2023] Open
Abstract
Viruses are some of the most abundant biological entities on Earth, and prokaryote viruses are the dominant members of the viral community. Because of the diversity of prokaryote viruses, functional annotation cannot be performed on a large number of genes from newly discovered prokaryote viruses by searching the current database; therefore, the development of an alignment-free algorithm for functional annotation of prokaryote virus proteins is important to understand the viral community. The identification of prokaryote virus virion proteins (PVVPs) is a critical step for many viral analyses, such as species classification, phylogenetic analysis and the exploration of how prokaryote viruses interact with their hosts. Although a series of PVVP prediction tools have been developed, the performance of these tools is still not satisfactory. Moreover, viral metagenomic data contains fragmented sequences, leading to the existence of some incomplete genes. Therefore, a tool that can identify partial prokaryote virus proteins is also needed. In this work, we present a novel algorithm, called VirionFinder, to identify the complete and partial PVVPs from non-prokaryote virus virion proteins (non-PVVPs). VirionFinder uses the sequence and biochemical properties of 20 amino acids as the mathematical model to encode the protein sequences and uses a deep learning technique to identify whether a given protein is a PVVP. Compared with the state-of-the-art tools using artificial benchmark datasets, the results show that under the same specificity (Sp), the sensitivity (Sn) of VirionFinder is approximately 10-34% higher than the Sn of these tools on both complete and partial proteins. When evaluating related tools using real virome data, the recognition rate of PVVP-like sequences of VirionFinder is also much higher than that of the other tools. 
We expect that VirionFinder will be a powerful tool for identifying novel virion proteins from both complete prokaryote virus genomes and viral metagenomic data. VirionFinder is freely available at https://github.com/zhenchengfang/VirionFinder.
Affiliation(s)
- Zhencheng Fang
  - Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
  - Center for Quantitative Biology, Peking University, Beijing, China
- Hongwei Zhou
  - Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China
  - State Key Laboratory of Organ Failure Research, Southern Medical University, Guangzhou, China
|
35
|
Timmons PB, Hewage CM. ENNAACT is a novel tool which employs neural networks for anticancer activity classification for therapeutic peptides. Biomed Pharmacother 2020; 133:111051. [PMID: 33254015 DOI: 10.1016/j.biopha.2020.111051] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 08/28/2020] [Revised: 10/08/2020] [Accepted: 11/19/2020] [Indexed: 12/12/2022] Open
Abstract
The prevalence of cancer as a threat to human life, responsible for 9.6 million deaths worldwide in 2018, motivates the search for new anticancer agents. While many options are currently available for treatment, these are often expensive and impact the human body unfavourably. Anticancer peptides represent a promising emerging field of anticancer therapeutics, which are characterized by a favourable toxicity profile. The development of accurate in silico methods for anticancer peptide prediction is of paramount importance, as the amount of available sequence data is growing each year. This study leverages advances in machine learning research to produce a novel sequence-based deep neural network classifier for anticancer peptide activity. The classifier achieves performance comparable to the best-in-class, with a cross-validated accuracy of 98.3%, Matthews correlation coefficient of 0.91 and an Area Under the Curve of 0.95. This innovative classifier is available as a web server at https://research.timmons.eu/ennaact, facilitating in silico screening and design of new anticancer peptide chemotherapeutics by the research community.
Affiliation(s)
- Patrick Brendan Timmons
  - UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
- Chandralal M Hewage
  - UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
|
36
|
Yang L, Gao H, Wu K, Zhang H, Li C, Tang L. Identification of Cancerlectins By Using Cascade Linear Discriminant Analysis and Optimal g-gap Tripeptide Composition. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190730103156] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 01/01/2023]
Abstract
Background:
Lectins are a diverse group of glycoproteins or glycoconjugate proteins that can be extracted from plants, invertebrates and higher animals. Cancerlectins, a kind of lectin, play a key role in the process of tumor cells interacting with each other and are being employed as therapeutic agents. A full understanding of cancerlectins is significant because it provides a tool for the future direction of cancer therapy.
Objective:
To develop an accurate, practically useful and timesaving tool to identify cancerlectins. A novel sequence-based method is proposed along with a corresponding webserver to access the proposed tool.
Methods:
First, protein features were extracted using a new feature-building scheme termed g-gap tripeptide composition. Then, the proposed cascade linear discriminant analysis (Cascade LDA) was used to alleviate the difficulties of high dimensionality, with Analysis Of Variance (ANOVA) as the feature importance criterion. Finally, a Support Vector Machine (SVM) was used as the classifier to identify cancerlectins.
Results:
The proposed method achieved an accuracy of 91.34%, with a sensitivity of 89.89%, a specificity of 92.48% and a Matthews correlation coefficient of 0.8318, based on only 13 fusion features in jackknife cross validation. This result is superior to other published methods in this domain.
Conclusion:
In this study, a new method based only on the primary structure of proteins is proposed, and experimental results show that it could be a promising tool to identify cancerlectins. An open-access webserver is made available in this work to facilitate other related works.
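The four figures reported in the Results (accuracy, sensitivity, specificity and Matthews correlation coefficient) all derive from the same binary confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, the confusion counts in the usage note are made up for illustration, and returning 0 when a denominator is zero is a convention chosen here, not something taken from the paper.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""

    def safe_div(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),   # true positive rate
        "specificity": safe_div(tn, tn + fp),   # true negative rate
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }
```

For example, `binary_metrics(40, 45, 5, 10)` gives an accuracy of 0.85, a sensitivity of 0.8 and a specificity of 0.9. In a jackknife (leave-one-out) evaluation, the counts are accumulated over all held-out samples before these ratios are taken.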
Affiliation(s)
- Liangwei Yang
  - Center for Informational Biology, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Hui Gao
  - Center for Informational Biology, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Keyu Wu
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Haotian Zhang
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Changyu Li
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Lixia Tang
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
|
37
|
Wang Y, Kang J, Li N, Zhou Y, Tang Z, He B, Huang J. NeuroCS: A Tool to Predict Cleavage Sites of Neuropeptide Precursors. Protein Pept Lett 2020; 27:337-345. [PMID: 31721688 DOI: 10.2174/0929866526666191112150636] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/01/2019] [Revised: 07/16/2019] [Accepted: 09/24/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Neuropeptides are a class of bioactive peptides produced from neuropeptide precursors through a series of extremely complex processes, mediating neuronal regulations in many aspects. Accurate identification of cleavage sites of neuropeptide precursors is of great significance for the development of neuroscience and brain science. OBJECTIVE With the explosive growth of neuropeptide precursor data, bioinformatics methods for predicting neuropeptide precursors' cleavage sites quickly and efficiently are much needed. METHODS We started by processing the neuropeptide precursor data from SwissProt and NeuroPedia into two sets of data, a training dataset and a testing dataset. Subsequently, six feature extraction schemes were applied to generate different feature sets, and then feature selection methods were used to find the optimal feature subset of each. Thereafter, the support vector machine was utilized to build models for different feature types. Finally, the performance of the models was evaluated with the independent testing dataset. RESULTS Six models were built with the support vector machine. Among them, the enhanced amino acid composition-based model reached the highest accuracy of 91.60% in the 5-fold cross validation. When evaluated with the independent testing dataset, it also showed excellent performance, with a high accuracy of 90.37% and an Area under the Receiver Operating Characteristic curve of up to 0.9576. CONCLUSION The performance of the developed model was decent. Moreover, for users' convenience, an online web server called NeuroCS was built, which is freely available at http://i.uestc.edu.cn/NeuroCS/dist/index.html#/. NeuroCS can be used to predict neuropeptide precursors' cleavage sites effectively.
Affiliation(s)
- Ying Wang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Juanjuan Kang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Ning Li
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Yuwei Zhou
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Zhongjie Tang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Bifang He
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
  - Medical College, Guizhou University, Guiyang, China
- Jian Huang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
|
38
|
Chen W, Nie F, Ding H. Recent Advances of Computational Methods for Identifying Bacteriophage Virion Proteins. Protein Pept Lett 2020; 27:259-264. [PMID: 30968770 DOI: 10.2174/0929866526666190410124642] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/20/2019] [Revised: 03/07/2019] [Accepted: 04/01/2019] [Indexed: 01/09/2023]
Abstract
Phage virion proteins (PVPs) are essential components of bacteriophages and participate in a series of biological processes. Accurate identification of phage virion proteins helps to understand the mechanism of interaction between a phage and its host bacteria. Because experimental methods are labor-intensive and time-consuming, many computational approaches have been proposed in the past few years to identify phage virion proteins. To help researchers select appropriate methods, a comprehensive review and comparison of existing computational methods for identifying phage virion proteins is needed. In this review, we summarize the existing computational methods for identifying phage virion proteins and assess their performance on an independent dataset. Finally, we present challenges and future perspectives for identifying phage virion proteins. Taken together, we hope that this review will provide clues for research on phage virion proteins.
Affiliation(s)
- Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China.,Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.,Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China
| | - Fulei Nie
- Center for Genomics and Computational Biology, School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China
| | - Hui Ding
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
39
|
Zhang J, Xu D, Hao K, Zhang Y, Chen W, Liu J, Gao R, Wu C, De Marinis Y. FS-GBDT: identification multicancer-risk module via a feature selection algorithm by integrating Fisher score and GBDT. Brief Bioinform 2020; 22:5901960. [PMID: 34020547 DOI: 10.1093/bib/bbaa189] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/21/2020] [Revised: 07/03/2020] [Accepted: 07/21/2020] [Indexed: 11/14/2022] Open
Abstract
Cancer is a highly heterogeneous disease caused by dysregulation in different cell types and tissues. However, different cancers may share common mechanisms. It is critical to identify decisive genes involved in the development and progression of cancer, and joint analysis of multiple cancers may help to discover overlapping mechanisms among different cancers. In this study, we proposed a fusion feature selection framework based on an ensemble method, named Fisher score and Gradient Boosting Decision Tree (FS-GBDT), to select robust and decisive feature genes in high-dimensional gene expression datasets. Joint analysis of 11 human cancer types was conducted to explore the key feature gene subset of cancer. To verify the efficacy of FS-GBDT, we compared it with four other common feature selection algorithms using a Support Vector Machine (SVM) classifier. FS-GBDT achieved the highest scores on the evaluation indicators, outperforming the other four methods. In addition, we performed gene ontology analysis and literature validation of the key gene subset, and this subset was classified into several functional modules. Functional modules can be used as disease markers in place of single genes, which are difficult to find reproducibly in gene chip applications, and to study the core mechanisms of cancer.
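The Fisher score named in this abstract can be illustrated with a short sketch. It covers only the Fisher-score filter step, not the GBDT stage, and it is not the authors' implementation. The function name `fisher_score`, the toy gene names, and the expression values are hypothetical, and the formula used is the common between-class over within-class scatter definition.

```python
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher score of one feature: between-class scatter / within-class scatter."""
    overall_mean = fmean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        n = len(group)
        # Class size times squared distance of the class mean from the overall mean
        between += n * (fmean(group) - overall_mean) ** 2
        # Class size times population variance inside the class
        within += n * pvariance(group)
    # A feature that is constant inside every class has zero within-class scatter
    return between / within if within > 0 else float("inf")


# Hypothetical toy data: two "genes" measured in six samples from two classes.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # separates the classes well
    "gene_b": [1.0, 9.0, 5.0, 1.0, 9.0, 5.0],  # identical distribution in both classes
}
ranking = sorted(
    expression, key=lambda g: fisher_score(expression[g], labels), reverse=True
)
print(ranking)
```

Genes with a higher score separate the classes better relative to their spread within each class, so a filter of this kind would keep the top-ranked genes before any downstream model is trained.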
Affiliation(s)
- Jialin Zhang
- School of Mathematics and Statistics at Shandong University, China
| | - Da Xu
- School of Mathematics and Statistics at Shandong University, China
| | - Kaijing Hao
- School of Mathematics and Statistics at Shandong University, China
| | - Yusen Zhang
- academic leader of Computer Engineering in Shandong University, China
| | - Wei Chen
- School of Mathematics and Statistics at Shandong University, China
| | - Jiaguo Liu
- School of Mathematics and Statistics at Shandong University, China
| | - Rui Gao
- School of Control Science and Engineering, Shandong University
| | - Chuanyan Wu
- School of Intelligent Engineering in Shandong Management University
| | | |
|
40
|
Feng P, Feng L. Recent Advances on Antioxidant Identification Based on Machine Learning Methods. Curr Drug Metab 2020; 21:804-809. [PMID: 32682368 DOI: 10.2174/1389200221666200719001449] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/01/2020] [Revised: 03/17/2020] [Accepted: 05/13/2020] [Indexed: 11/22/2022]
Abstract
Antioxidants are molecules that can prevent damage to cells caused by free radicals. Recent studies have also demonstrated that antioxidants play roles in preventing diseases. However, the number of known molecules with antioxidant activity is very small. Therefore, it is necessary to identify antioxidants from various resources. In the past several years, a series of computational methods have been proposed to identify antioxidants. In this review, we briefly summarize recent advances in computationally identifying antioxidants. The challenges and future perspectives for identifying antioxidants are also discussed. We hope this review will provide insights into research on antioxidant identification.
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Lijing Feng
- School of Sciences, North China University of Science and Technology, Tangshan 063000, China
| |
|
41
|
Timmons PB, Hewage CM. HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks. Sci Rep 2020; 10:10869. [PMID: 32616760 PMCID: PMC7331684 DOI: 10.1038/s41598-020-67701-3] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 03/03/2020] [Accepted: 06/09/2020] [Indexed: 12/11/2022] Open
Abstract
The growing prevalence of resistance to antibiotics motivates the search for new antibacterial agents. Antimicrobial peptides are a diverse class of well-studied membrane-active peptides which function as part of the innate host defence system, and form a promising avenue in antibiotic drug research. Some antimicrobial peptides exhibit toxicity against eukaryotic membranes, typically characterised by hemolytic activity assays, but currently, the understanding of what differentiates hemolytic and non-hemolytic peptides is limited. This study leverages advances in machine learning research to produce a novel artificial neural network classifier for the prediction of hemolytic activity from a peptide's primary sequence. The classifier achieves best-in-class performance, with cross-validated accuracy of [Formula: see text] and Matthews correlation coefficient of 0.71. This innovative classifier is available as a web server at https://research.timmons.eu/happenn, allowing the research community to utilise it for in silico screening of peptide drug candidates for high therapeutic efficacies.
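The Matthews correlation coefficient quoted in this abstract can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definition. The function name `matthews_corrcoef_from_counts` and the example counts are hypothetical and are not taken from the paper.

```python
from math import sqrt


def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is set to 0 when any marginal total is zero
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a hemolytic / non-hemolytic classifier (illustration only)
mcc = matthews_corrcoef_from_counts(tp=90, tn=80, fp=20, fn=10)
print(round(mcc, 3))
```

The coefficient ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right). Unlike accuracy, it stays informative when the two classes differ in size.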
Affiliation(s)
- Patrick Brendan Timmons
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland
| | - Chandralal M Hewage
- UCD School of Biomolecular and Biomedical Science, UCD Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Dublin 4, Ireland.
| |
|
42
|
Meta-iPVP: a sequence-based meta-predictor for improving the prediction of phage virion proteins using effective feature representation. J Comput Aided Mol Des 2020; 34:1105-1116. [DOI: 10.1007/s10822-020-00323-z] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 05/04/2020] [Accepted: 06/10/2020] [Indexed: 12/11/2022]
|
43
|
Li HF, Wang XF, Tang H. Predicting Bacteriophage Enzymes and Hydrolases by Using Combined Features. Front Bioeng Biotechnol 2020; 8:183. [PMID: 32266225 PMCID: PMC7105632 DOI: 10.3389/fbioe.2020.00183] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/02/2020] [Accepted: 02/24/2020] [Indexed: 12/19/2022] Open
Abstract
Bacteriophages are viruses that infect host bacteria. They have been applied in the treatment of pathogenic bacterial infections. Phage enzymes and hydrolases play the most important role in the destruction of bacterial cells. Correctly identifying the hydrolases encoded by phages is not only beneficial to the study of their function, but also conducive to antibacterial drug discovery. Thus, this work aims to recognize the enzymes and hydrolases in phages. A combination of different features was used to represent samples of phage and hydrolase. A feature selection technique based on analysis of variance was applied to optimize the features. The classification was performed using a support vector machine (SVM). The prediction process includes two steps. The first step is to identify phage enzymes. The second step is to determine whether a phage enzyme is a hydrolase or not. The jackknife cross-validated results showed that our method could produce overall accuracies of 85.1% and 94.3%, respectively, for the two predictions, demonstrating that the proposed method is promising.
Affiliation(s)
- Hong-Fei Li
- Department of Pathophysiology, Key Laboratory of Medical Electrophysiology, Ministry of Education, Southwest Medical University, Luzhou, China.,School of Computer and Information Engineering, Henan Normal University, Henan, China
| | - Xian-Fang Wang
- School of Computer and Information Engineering, Henan Normal University, Henan, China
| | - Hua Tang
- Department of Pathophysiology, Key Laboratory of Medical Electrophysiology, Ministry of Education, Southwest Medical University, Luzhou, China
| |
|
44
|
Smolarczyk T, Roterman-Konieczna I, Stapor K. Protein Secondary Structure Prediction: A Review of Progress and Directions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191017104639] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 11/22/2022]
Abstract
Background: Over the last few decades, a search for the theory of protein folding has grown into a full-fledged research field at the intersection of biology, chemistry and informatics. Despite enormous effort, there are still open questions and challenges, like understanding the rules by which amino acid sequence determines protein secondary structure.
Objective: In this review, we depict the progress of the prediction methods over the years and identify sources of improvement.
Methods: The protein secondary structure prediction problem is described followed by the discussion on theoretical limitations, description of the commonly used data sets, features and a review of three generations of methods with the focus on the most recent advances. Additionally, methods with available online servers are assessed on the independent data set.
Results: The state-of-the-art methods are currently reaching almost 88% for 3-class prediction and 76.5% for an 8-class prediction.
Conclusion: This review summarizes recent advances and outlines further research directions.
Affiliation(s)
- Tomasz Smolarczyk
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Irena Roterman-Konieczna
- Department of Bioinformatics and Telemedicine, Jagiellonian University Medical College, Krakow, Poland
| | - Katarzyna Stapor
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| |
|
45
|
Meng C, Zhang J, Ye X, Guo F, Zou Q. Review and comparative analysis of machine learning-based phage virion protein identification methods. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2020; 1868:140406. [PMID: 32135196 DOI: 10.1016/j.bbapap.2020.140406] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 01/01/2020] [Revised: 02/14/2020] [Accepted: 02/27/2020] [Indexed: 02/01/2023]
Abstract
Phage virion protein (PVP) identification plays a key role in elucidating relationships between phages and hosts. Moreover, PVP identification can facilitate the design of related biochemical entities. Recently, several machine learning approaches have emerged for this purpose and have shown their potential. In this study, the previously proposed PVP identifiers are systematically reviewed, and the related algorithms and tools are comprehensively analyzed. We summarize the common framework of these PVP identifiers and construct our own novel identifiers based upon the framework. Furthermore, we focus on a performance comparison of all PVP identifiers using a training dataset and an independent dataset. Highlighting the pros and cons of these identifiers demonstrates that g-gap DPC (dipeptide composition) features are capable of representing the characteristics of PVPs. Moreover, SVM (support vector machine) proves to be the more effective classifier for distinguishing PVPs from non-PVPs.
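The g-gap dipeptide composition mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the review: the names `g_gap_dpc`, `AMINO_ACIDS`, and `DIPEPTIDES` are hypothetical, and frequencies are normalized by the number of residue pairs made of the 20 standard amino acids.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature vector
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def g_gap_dpc(sequence, gap=0):
    """Frequencies of residue pairs separated by `gap` intervening residues."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    step = gap + 1
    total = 0
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in DIPEPTIDES]


# gap=0 gives the ordinary (adjacent) dipeptide composition
features = g_gap_dpc("ACAC", gap=0)
print(len(features), features[DIPEPTIDES.index("AC")])
```

Each sequence maps to a fixed-length vector of 400 values whatever its length, which is what lets a classifier such as an SVM be trained on proteins of different sizes.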
Affiliation(s)
- Chaolu Meng
- College of Intelligence and Computing, Tianjin University, Tianjin, China; College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Science City, Japan
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.
| |
|
46
|
Charoenkwan P, Kanthawong S, Schaduangrat N, Yana J, Shoombuatong W. PVPred-SCM: Improved Prediction and Analysis of Phage Virion Proteins Using a Scoring Card Method. Cells 2020; 9:E353. [PMID: 32028709 PMCID: PMC7072630 DOI: 10.3390/cells9020353] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 12/26/2019] [Revised: 01/20/2020] [Accepted: 01/27/2020] [Indexed: 12/16/2022] Open
Abstract
Although existing methods have been successful in predicting phage (or bacteriophage) virion proteins (PVPs) using various types of protein features and complex classifiers, such as support vector machine and naïve Bayes, these two classifiers do not allow interpretability. However, the characterization and analysis of PVPs might be of great significance to understanding the molecular mechanisms of bacteriophage genetics and the development of antibacterial drugs. Hence, we herein propose a novel method (PVPred-SCM) based on the scoring card method (SCM) in conjunction with dipeptide composition to identify and characterize PVPs. In PVPred-SCM, the propensity scores of 400 dipeptides were calculated using the statistical discrimination approach. A rigorous independent validation test showed that PVPred-SCM utilizing only dipeptide composition yielded an accuracy of 77.56%, indicating that PVPred-SCM performed well relative to the state-of-the-art method utilizing a number of protein features. Furthermore, the propensity scores of dipeptides were used to provide insights into the biochemical and biophysical properties of PVPs. Upon comparison, it was found that PVPred-SCM was superior to the existing methods considering its simplicity, interpretability, and implementation. Finally, in an effort to facilitate high-throughput prediction of PVPs, we provide a user-friendly web server for estimating the likelihood that given sequences are PVPs. It is anticipated that PVPred-SCM will become a useful tool, or at least a complement to existing methods, for predicting and analyzing PVPs.
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand;
| | - Sakawrat Kanthawong
- Department of Microbiology, Faculty of Medicine, Khon Kaen University, Khon Kaen 40002, Thailand;
| | - Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand;
| | - Janchai Yana
- Department of Chemistry, Faculty of Science and Technology, Chiang Mai Rajabhat University, Chiang Mai 50300, Thailand;
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand;
| |
|
47
|
Wang Z, He W, Tang J, Guo F. Identification of Highest-Affinity Binding Sites of Yeast Transcription Factor Families. J Chem Inf Model 2020; 60:1876-1883. [DOI: 10.1021/acs.jcim.9b01012] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/19/2022]
Affiliation(s)
- Zongyu Wang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Wenying He
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300072, P. R. China
- Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| |
|
48
|
Arif M, Ali F, Ahmad S, Kabir M, Ali Z, Hayat M. Pred-BVP-Unb: Fast prediction of bacteriophage Virion proteins using un-biased multi-perspective properties with recursive feature elimination. Genomics 2019; 112:1565-1574. [PMID: 31526842 DOI: 10.1016/j.ygeno.2019.09.006] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 07/19/2019] [Revised: 08/27/2019] [Accepted: 09/11/2019] [Indexed: 10/26/2022]
Abstract
Bacteriophage virion proteins (BVPs) are proteins of bacterial viruses and have a great impact on different biological functions of bacteria. They are widely used in genetic engineering and phage therapy applications. Correct identification of BVPs through conventional methods is slow and expensive. Thus, a bioinformatics predictor is urgently needed to accelerate the correct identification of BVPs within a huge volume of proteins. However, the performance of available prediction tools is inadequate due to the lack of useful feature representations and a severe class imbalance. In the present study, we propose an intelligent model, called Pred-BVP-Unb, for the discrimination of BVPs that employs three sequence-driven descriptors, i.e. Bi-PSSM evolutionary information, composition & translation, and split amino acid composition. The imbalance between classes was handled with a synthetic minority oversampling technique. The essential attributes were selected by a robust algorithm called recursive feature elimination. Finally, the optimal feature space was provided to a support vector machine classifier with a radial basis kernel in order to train the model. Our predictor remarkably outperforms existing approaches in the literature, achieving the highest accuracies of 92.54% and 83.06% on the benchmark and independent datasets, respectively. We expect that the Pred-BVP-Unb tool can provide useful hints for designing antibacterial drugs and help expedite the large-scale discovery of new bacteriophage virion proteins. The source code and all datasets are publicly available at https://github.com/Muhammad-Arif-NUST/BVP_Pred_Unb.
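The synthetic minority oversampling step named in this abstract works by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation idea in plain Python. It is not the authors' pipeline and omits the recursive feature elimination and SVM stages. The function name `smote_like_oversample`, its parameters, and the toy points are hypothetical.

```python
import random
from math import dist


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base), key=lambda p: dist(base, p)
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + u * (q - b) for b, q in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class feature vectors (corners of the unit square)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class fills in its own region of feature space instead of duplicating existing points. Oversampling of this kind is normally applied to the training folds only, so that evaluation data stay free of synthetic samples.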
Affiliation(s)
- Muhammad Arif
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Department of Computer Science, Abdul Wali Khan University Mardan, KP, Pakistan.
| | - Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China.
| | - Saeed Ahmad
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Muhammad Kabir
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Zakir Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, KP, Pakistan.
| |
|
49
|
Identification of Intrinsically Disordered Proteins and Regions by Length-Dependent Predictors Based on Conditional Random Fields. MOLECULAR THERAPY-NUCLEIC ACIDS 2019; 17:396-404. [PMID: 31307006 PMCID: PMC6626971 DOI: 10.1016/j.omtn.2019.06.004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/25/2019] [Revised: 06/06/2019] [Accepted: 06/07/2019] [Indexed: 01/24/2023]
Abstract
Accurate identification of intrinsically disordered proteins/regions (IDPs/IDRs) is critical for predicting protein structure and function. Previous studies have shown that IDRs of different lengths have different characteristics, and several classification-based predictors have been proposed for predicting different types of IDRs. Compared with these classification-based predictors, the previously proposed predictor IDP-CRF, a sequence labeling model based on conditional random fields (CRFs), exhibits state-of-the-art performance for predicting IDPs/IDRs. Motivated by these methods, we propose a predictor called IDP-FSP, which is an ensemble of three CRF-based predictors called IDP-FSP-L, IDP-FSP-S, and IDP-FSP-G. These three predictors are specially designed to predict long, short, and generic disordered regions, respectively, and they are constructed based on different features. To the best of our knowledge, IDP-FSP is the first predictor that combines a sequence labeling algorithm with IDRs of different lengths. Experimental results on two independent test datasets show that IDP-FSP achieves better or at least comparable predictive performance compared with 26 existing state-of-the-art methods in this field, demonstrating the effectiveness of IDP-FSP.
|
50
|
Ru X, Li L, Wang C. Identification of Phage Viral Proteins With Hybrid Sequence Features. Front Microbiol 2019; 10:507. [PMID: 30972038 PMCID: PMC6443926 DOI: 10.3389/fmicb.2019.00507] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/24/2018] [Accepted: 02/27/2019] [Indexed: 02/01/2023] Open
Abstract
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. To represent protein information more fully, this paper combines the features extracted by the feature representation algorithm based on sequence information (CCPA) with those extracted by the feature representation algorithm based on sequence and structure information. After feature extraction, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set, with the strongest correlation with class labels and low redundancy between features. Given the randomness of the samples selected by the random forest classification algorithm and of the features considered for each node variable, a random forest method is employed to perform 10-fold cross-validation on the bacteriophage protein classification. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins. These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
Affiliation(s)
- Xiaoqing Ru
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Lihong Li
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|