1
Ejiyi CJ, Cai D, Ejiyi MB, Chikwendu IA, Coker K, Oluwasanmi A, Bamisile OF, Ejiyi TU, Qin Z. Polynomial-SHAP analysis of liver disease markers for capturing of complex feature interactions in machine learning models. Comput Biol Med 2024; 182:109168. [PMID: 39342675 DOI: 10.1016/j.compbiomed.2024.109168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Revised: 09/17/2024] [Accepted: 09/17/2024] [Indexed: 10/01/2024]
Abstract
Liver disease diagnosis is pivotal for effective patient management, and machine learning techniques have shown promise in this domain. In this study, we investigate the impact of Polynomial-SHapley Additive exPlanations analysis on enhancing the performance and interpretability of machine learning models for liver disease classification. Our results demonstrate significant improvements in accuracy, precision, recall, F1 score, and Matthews correlation coefficient across various algorithms when Polynomial-SHapley Additive exPlanations analysis is applied. Specifically, the Light Gradient Boosting Machine model achieves exceptional performance with 100% accuracy in both scenarios. Furthermore, by comparing the results obtained with and without the approach, we observe substantial differences in the performance, highlighting the importance of incorporating Polynomial-SHapley Additive exPlanations analysis for improved model performance. The polynomial features and SHapley Additive exPlanations values also enhance the interpretability of machine learning models by capturing complex feature interactions, enabling users to gain deeper insights into the underlying mechanisms driving the diagnosis. Moreover, data rebalancing using the Synthetic Minority Over-sampling Technique and parameter tuning were employed to optimize the performance of the models. These findings underscore the significance of employing this analytical approach in machine-learning-based diagnostic systems for liver diseases, offering superior performance and enhanced interpretability for informed decision-making in clinical practice.
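To make the notion of polynomial features in the abstract above more concrete, the following is a minimal sketch of a degree-2 expansion, which adds squared and pairwise interaction terms to the original inputs. The function name `polynomial_features` and the two example values are hypothetical and chosen only for illustration; the study's actual pipeline, including its SHAP analysis and SMOTE rebalancing, is not reproduced here.

```python
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Return every monomial of the inputs from degree 1 up to `degree`.

    The constant (bias) term is left out. For inputs [a, b] and degree 2
    the result is [a, b, a*a, a*b, b*b].
    """
    expanded = []
    for d in range(1, degree + 1):
        # Each index tuple picks the inputs multiplied together in one term.
        for idx in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for i in idx:
                term *= row[i]
            expanded.append(term)
    return expanded


# Hypothetical values standing in for two liver markers of one patient.
features = polynomial_features([2.0, 3.0])
```

The cross term (here `a*b`) is what lets an otherwise additive model respond to two markers jointly, which is the kind of feature interaction the abstract refers to.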
Affiliation(s)
- Chukwuebuka Joseph Ejiyi
  - College of Nuclear Technology and Automation Engineering, Sichuan Industrial Internet Intelligent Monitoring and Application Engineering Research Center, Chengdu University of Technology, Sichuan, Chengdu, China; Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Dongsheng Cai
  - College of Nuclear Technology and Automation Engineering, Sichuan Industrial Internet Intelligent Monitoring and Application Engineering Research Center, Chengdu University of Technology, Sichuan, Chengdu, China
- Makuachukwu B Ejiyi
  - Pharmacy Department, University of Nigeria Nsukka, Nsukka, Enugu State, Nigeria
- Ijeoma A Chikwendu
  - School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
- Kenneth Coker
  - School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China; Department of Electrical and Electronic Engineering, Ho Technical University Ghana, Ghana
- Ariyo Oluwasanmi
  - Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Oluwatoyosi F Bamisile
  - College of Nuclear Technology and Automation Engineering, Sichuan Industrial Internet Intelligent Monitoring and Application Engineering Research Center, Chengdu University of Technology, Sichuan, Chengdu, China
- Thomas U Ejiyi
  - Department of Pure and Industrial Chemistry, University of Nigeria Nsukka, Nsukka, Enugu State, Nigeria
- Zhen Qin
  - Network and Data Security Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
2
Li J, Zou L, Ma H, Zhao J, Wang C, Li J, Hu G, Yang H, Wang B, Xu D, Xia Y, Jiang Y, Jiang X, Li N. Interpretable machine learning based on CT-derived extracellular volume fraction to predict pathological grading of hepatocellular carcinoma. Abdom Radiol (NY) 2024; 49:3383-3396. [PMID: 38703190 DOI: 10.1007/s00261-024-04313-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Revised: 03/23/2024] [Accepted: 03/25/2024] [Indexed: 05/06/2024]
Abstract
PURPOSE To develop a non-invasive auxiliary assessment method based on CT-derived extracellular volume (ECV) to predict the pathological grading (PG) of hepatocellular carcinoma (HCC). METHODS The study retrospectively analyzed 238 patients who underwent HCC resection surgery between January 2013 and April 2023. Six machine learning algorithms were employed to construct predictive models for HCC PG: logistic regression, extreme gradient boosting, Light Gradient Boosting Machine (LightGBM), random forest, adaptive boosting, and Gaussian naive Bayes. Model performance was evaluated using receiver operating characteristic curve analysis, including area under the curve (AUC), sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and F1 score. Calibration plots were used for visual evaluation of model calibration. Clinical decision curve analysis was performed to assess potential clinical utility by calculating net benefit. RESULTS 166 patients from Hospital A were allocated to the training set, while 72 patients from Hospital B (constituting 30.25% of the total sample) were assigned to the test set. The model achieved an AUC of 1.000 (95%CI: 1.000-1.000) in the training set and 0.927 (95%CI: 0.837-0.999) in the validation set, respectively. Ultimately, the model achieved an AUC of 0.909 (95%CI: 0.837-0.980) in the test set, with an accuracy of 0.778, sensitivity of 0.906, specificity of 0.789, negative predictive value of 0.556, and F1 score of 0.908. CONCLUSION This study successfully developed and validated a non-invasive auxiliary assessment method based on CT-derived ECV to predict the HCC PG, providing important supplementary information for clinical decision-making.
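The evaluation metrics listed in the abstract above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not the study's data).
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because sensitivity and NPV depend on false negatives while specificity and PPV depend on false positives, reporting all of them alongside the AUC gives a fuller picture than accuracy alone.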
Affiliation(s)
- Jie Li
  - Department of Radiology, Binzhou Medical University Hospital, No. 661 Huanghe 2nd Road, Bincheng District, Binzhou, 256600, China
- Linxuan Zou
  - Department of Radiology, Binzhou Medical University Hospital, No. 661 Huanghe 2nd Road, Bincheng District, Binzhou, 256600, China
- Heng Ma
  - Department of Radiology, Yantai Yuhuangding Hospital, Qingdao University, Yantai, 264000, China
- Jifu Zhao
  - Department of Radiology, Binzhou Medical University Hospital, No. 661 Huanghe 2nd Road, Bincheng District, Binzhou, 256600, China
- Chengyan Wang
  - Department of Radiology, Binzhou Medical University Hospital, No. 661 Huanghe 2nd Road, Bincheng District, Binzhou, 256600, China
- Jun Li
  - Department of Radiology, Yantai Affiliated Hospital of Binzhou Medical University, Yantai, 264000, China
- Guangchao Hu
  - School of Medical Imaging, Binzhou Medical University, No. 346 Guanhai Road, Laishan District, Yantai, 264003, China
- Haoran Yang
  - School of Medical Imaging, Binzhou Medical University, No. 346 Guanhai Road, Laishan District, Yantai, 264003, China
- Beizhong Wang
  - Department of Radiology, Binzhou Medical University Hospital, No. 661 Huanghe 2nd Road, Bincheng District, Binzhou, 256600, China
- Donghao Xu
  - School of Medical Imaging, Binzhou Medical University, No. 346 Guanhai Road, Laishan District, Yantai, 264003, China
- Yuanhao Xia
  - Department of Radiology, Yantai Yuhuangding Hospital, Qingdao University, Yantai, 264000, China
  - School of Medical Imaging, Binzhou Medical University, No. 346 Guanhai Road, Laishan District, Yantai, 264003, China
- Yi Jiang
  - Department of Vascular Interventional Surgery, Yantai Affiliated Hospital of Binzhou Medical University, Yantai, 264000, China
- Xingyue Jiang
  - Department of Radiology, Binzhou Medical University Hospital, No. 661 Huanghe 2nd Road, Bincheng District, Binzhou, 256600, China
- Naixuan Li
  - Department of Vascular Interventional Surgery, Yantai Affiliated Hospital of Binzhou Medical University, Yantai, 264000, China
3
Koyama H. Machine learning application in otology. Auris Nasus Larynx 2024; 51:666-673. [PMID: 38704894 DOI: 10.1016/j.anl.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Revised: 03/13/2024] [Accepted: 04/02/2024] [Indexed: 05/07/2024]
Abstract
This review presents a comprehensive history of Artificial Intelligence (AI) in the context of the revolutionary application of machine learning (ML) to medical research and clinical utilization, particularly for the benefit of researchers interested in the application of ML in otology. To this end, we discuss the key components of ML: input, output, and algorithms. In particular, some representative algorithms commonly used in medical research are discussed. Subsequently, we review ML applications in otology research, including diagnosis, influential identification, and surgical outcome prediction. In the context of surgical outcome prediction, specific surgical treatments, including cochlear implantation, active middle ear implantation, tympanoplasty, and vestibular schwannoma resection, are considered. Finally, we highlight the obstacles and challenges that need to be overcome in future research.
Affiliation(s)
- Hajime Koyama
  - Department of Otorhinolaryngology and Head and Neck Surgery, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
4
Yasin P, Yimit Y, Cai X, Aimaiti A, Sheng W, Mamat M, Nijiati M. Machine learning-enabled prediction of prolonged length of stay in hospital after surgery for tuberculosis spondylitis patients with unbalanced data: a novel approach using explainable artificial intelligence (XAI). Eur J Med Res 2024; 29:383. [PMID: 39054495 PMCID: PMC11270948 DOI: 10.1186/s40001-024-01988-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Accepted: 07/18/2024] [Indexed: 07/27/2024] Open
Abstract
BACKGROUND Tuberculosis spondylitis (TS), commonly known as Pott's disease, is a severe type of skeletal tuberculosis that typically requires surgical treatment. However, this treatment option has led to an increase in healthcare costs due to prolonged hospital stays (PLOS). Therefore, identifying risk factors associated with extended PLOS is necessary. In this research, we intended to develop an interpretable machine learning model that could predict extended PLOS and provide valuable insights for treatment; a web-based application was also implemented. METHODS We obtained patient data from the spine surgery department at our hospital. Extended postoperative length of stay (PLOS) refers to a hospitalization duration equal to or exceeding the 75th percentile following spine surgery. To identify relevant variables, we employed several approaches, such as the least absolute shrinkage and selection operator (LASSO), recursive feature elimination (RFE) based on support vector machine classification (SVC), correlation analysis, and permutation importance value. Several models were implemented, and some of them were ensembled using soft voting techniques. Models were constructed using grid search with nested cross-validation. The performance of each algorithm was assessed through various metrics, including the AUC value (area under the curve of receiver operating characteristics) and the Brier Score. Model interpretation involved utilizing methods such as Shapley additive explanations (SHAP), the Gini Impurity Index, permutation importance, and local interpretable model-agnostic explanations (LIME). Furthermore, to facilitate the practical application of the model, a web-based interface was developed and deployed.
RESULTS The study included a cohort of 580 patients, and 11 features (CRP, transfusions, infusion volume, blood loss, X-ray bone bridge, X-ray osteophyte, CT-vertebral destruction, CT-paravertebral abscess, MRI-paravertebral abscess, MRI-epidural abscess, postoperative drainage) were selected. Most of the classifiers performed well; the XGBoost model had a higher AUC value (0.86) and a lower Brier Score (0.126) than the others and was chosen as the optimal model. The results obtained from the calibration and decision curve analysis (DCA) plots demonstrate that XGBoost has achieved promising performance. After conducting tenfold cross-validation, the XGBoost model demonstrated a mean AUC of 0.85 ± 0.09. SHAP and LIME were used to display the variables' contributions to the predicted value. The stacked bar plots indicated that infusion volume was the primary contributor, as determined by Gini, permutation importance (PFI), and the LIME algorithm. CONCLUSIONS Our methods not only effectively predicted extended PLOS but also identified risk factors that can be utilized for future treatments. The XGBoost model developed in this study is easily accessible through the deployed web application and can aid in clinical research.
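Two quantities in the abstract above have simple closed forms: the outcome label (a stay at or above the 75th percentile) and the Brier Score (the mean squared difference between predicted probability and observed outcome). The sketch below illustrates both. The function names, the example stays, and the choice of the "inclusive" quantile method are assumptions made for illustration; the abstract does not state how the percentile was interpolated.

```python
import statistics


def label_extended_stay(stays):
    """Label each stay 1 if it is at or above the 75th percentile, else 0."""
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    # The "inclusive" interpolation method is an assumption of this sketch.
    q3 = statistics.quantiles(stays, n=4, method="inclusive")[2]
    return q3, [1 if s >= q3 else 0 for s in stays]


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Hypothetical lengths of stay in days (not the study's data).
stays = [5, 7, 8, 9, 10, 12, 14, 21, 30]
threshold, labels = label_extended_stay(stays)
```

A lower Brier Score indicates probabilities that are both better calibrated and more discriminating, which is why the abstract reports it next to the AUC.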
Affiliation(s)
- Parhat Yasin
  - Department of Spine Surgery, The Sixth Affiliated Hospital of Xinjiang Medical University, Urumqi, 830000, Xinjiang, People's Republic of China
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Yasen Yimit
  - Department of Radiology, The First People's Hospital of Kashi Prefecture, Kashi, 844000, Xinjiang, People's Republic of China
- Xiaoyu Cai
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Abasi Aimaiti
  - Department of Anesthesiology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Weibin Sheng
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Mardan Mamat
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Mayidili Nijiati
  - Department of Radiology, The Fourth Affiliated Hospital of Xinjiang Medical University (Xinjiang Hospital of Traditional Chinese Medicine), Urumqi, 830002, Xinjiang, People's Republic of China
  - Xinjiang Key Laboratory of Artificial Intelligence Assisted Imaging Diagnosis, Kashi, 844000, Xinjiang, People's Republic of China
5
Oliveira MF, de Albuquerque Neto MC, Leite TS, Alves PAA, Lima SVC, Silva RO. Performance evaluate of different chemometrics formalisms used for prostate cancer diagnosis by NMR-based metabolomics. Metabolomics 2023; 20:8. [PMID: 38127222 DOI: 10.1007/s11306-023-02067-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2023] [Accepted: 11/16/2023] [Indexed: 12/23/2023]
Abstract
INTRODUCTION In general, two characteristics are ever-present in NMR-based metabolomics studies: (1) they are assays aiming to classify the samples into different groups, and (2) the number of samples is smaller than the number of features (chemical shifts). It is also common to observe imbalanced datasets due to the sampling method and/or inclusion criteria. These situations can cause overfitting. However, appropriate feature selection and classification methods can help to solve this issue. OBJECTIVES To investigate the performance of metabolomics models built from combinations of feature selectors (or no feature selection) and classification algorithms, and to use the best-performing model as an NMR-based metabolomic method for prostate cancer diagnosis. METHODS We evaluated the performance of NMR-based metabolomics models for prostate cancer diagnosis using seven feature selectors and five classification formalisms. We also obtained metabolomics models without feature selection. In this study, thirty-eight volunteers with a positive diagnosis of prostate cancer and twenty-three healthy volunteers were enrolled. RESULTS Thirty-eight models obtained were evaluated using AUROC, accuracy, sensitivity, specificity, and kappa index values. The best result was obtained when Genetic Algorithm was used with Linear Discriminant Analysis, with 0.92 sensitivity, 0.83 specificity, and 0.88 accuracy. CONCLUSION The results show that choosing a proper feature selection method, classification model, and resampling method can avoid overfitting in a small metabolomic dataset. Furthermore, this approach would decrease the number of biopsies and optimize patient follow-up. 1H NMR-based metabolomics promises to be a non-invasive tool in prostate cancer diagnosis.
Affiliation(s)
- Márcio Felipe Oliveira
  - Metabonomics and Chemometrics Laboratory, Fundamental Chemistry Department, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes, s/n, Cidade Universitária, Recife, Pernambuco, Brazil
  - Fundamental Chemistry Department, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes, s/n, Cidade Universitária, Recife, Pernambuco, Brazil
- Moacir Cavalcante de Albuquerque Neto
  - Surgery Department, Clinics Hospital, Urology Clinic, Universidade Federal de Pernambuco, Av. Professor Luis Freire, s/n. Cidade Universitária, Recife, Pernambuco, Brazil
- Thiago Siqueira Leite
  - Surgery Department, Clinics Hospital, Urology Clinic, Universidade Federal de Pernambuco, Av. Professor Luis Freire, s/n. Cidade Universitária, Recife, Pernambuco, Brazil
- Paulo André Araújo Alves
  - Surgery Department, Clinics Hospital, Urology Clinic, Universidade Federal de Pernambuco, Av. Professor Luis Freire, s/n. Cidade Universitária, Recife, Pernambuco, Brazil
- Salvador Vilar Correia Lima
  - Surgery Department, Clinics Hospital, Urology Clinic, Universidade Federal de Pernambuco, Av. Professor Luis Freire, s/n. Cidade Universitária, Recife, Pernambuco, Brazil
- Ricardo Oliveira Silva
  - Metabonomics and Chemometrics Laboratory, Fundamental Chemistry Department, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes, s/n, Cidade Universitária, Recife, Pernambuco, Brazil
6
Zhang J, Cui X, Yang C, Zhong D, Sun Y, Yue X, Lan G, Zhang L, Lu L, Yuan H. A deep learning-based interpretable decision tool for predicting high risk of chemotherapy-induced nausea and vomiting in cancer patients prescribed highly emetogenic chemotherapy. Cancer Med 2023; 12:18306-18316. [PMID: 37609808 PMCID: PMC10524079 DOI: 10.1002/cam4.6428] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2023] [Revised: 07/27/2023] [Accepted: 07/31/2023] [Indexed: 08/24/2023] Open
Abstract
OBJECTIVE This study aims to develop a risk prediction model for chemotherapy-induced nausea and vomiting (CINV) in cancer patients receiving highly emetogenic chemotherapy (HEC) and identify the variables that have the most significant impact on prediction. METHODS Data from Tianjin Medical University General Hospital were collected and subjected to stepwise data preprocessing. Deep learning algorithms, including deep forest, and typical machine learning algorithms such as support vector machine (SVM), categorical boosting (CatBoost), random forest, decision tree, and neural network were used to develop the prediction model. After training the model and conducting hyperparameter optimization (HPO) through cross-validation in the training set, the performance was evaluated using the test set. Shapley additive explanations (SHAP), partial dependence plot (PDP), and Local Interpretable Model-Agnostic Explanations (LIME) techniques were employed to explain the optimal model. Model performance was assessed using AUC, F1 score, accuracy, specificity, sensitivity, and Brier score. RESULTS The deep forest model exhibited good discrimination, outperforming typical machine learning models, with an AUC of 0.850 (95%CI, 0.780-0.919), an F1 score of 0.757, an accuracy of 0.852, a specificity of 0.863, a sensitivity of 0.784, and a Brier score of 0.082. The top five important features in the model were creatinine clearance (Ccr), age, gender, anticipatory nausea and vomiting, and antiemetic regimen. Among these, Ccr had the most significant predictive value. The risk of CINV decreased with increased Ccr and age, while it was higher in the presence of anticipatory nausea and vomiting, female gender, and non-standard antiemetic regimen. CONCLUSION The deep forest model demonstrated good discrimination in predicting the risk of CINV in cancer patients prescribed HEC. Kidney function, as represented by Ccr, played a crucial role in the model's prediction. The clinical application of this predictive tool can help assess individual risks and improve patient care by proactively optimizing the use of antiemetics in cancer patients receiving HEC.
Affiliation(s)
- Jingyue Zhang
  - Department of Pharmacy, Tianjin Medical University General Hospital, Tianjin, China
- Xudong Cui
  - School of Mathematics, Tianjin University, Tianjin, China
- Chong Yang
  - Department of Pharmacy, Tianjin Medical University General Hospital, Tianjin, China
  - Department of Pharmacy, Tianjin Huanhu Hospital, Tianjin, China
- Diansheng Zhong
  - Department of Medical Oncology, Tianjin Medical University General Hospital, Tianjin, China
- Yinjuan Sun
  - Department of Medical Oncology, Tianjin Medical University General Hospital, Tianjin, China
- Xiaoxiong Yue
  - Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Gaoshuang Lan
  - Department of Pharmacy, Tianjin Medical University General Hospital, Tianjin, China
- Linlin Zhang
  - Department of Medical Oncology, Tianjin Medical University General Hospital, Tianjin, China
- Liangfu Lu
  - Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Hengjie Yuan
  - Department of Pharmacy, Tianjin Medical University General Hospital, Tianjin, China
7
Yue Y, Cao L, Chen H, Chen Y, Su Z. Towards an Optimal KELM Using the PSO-BOA Optimization Strategy with Applications in Data Classification. Biomimetics (Basel) 2023; 8:306. [PMID: 37504194 PMCID: PMC10807650 DOI: 10.3390/biomimetics8030306] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/24/2023] [Revised: 07/09/2023] [Accepted: 07/09/2023] [Indexed: 07/29/2023] Open
Abstract
The features of the kernel extreme learning machine (KELM), namely efficient processing, improved performance, and less human parameter setting, have allowed it to be effectively used for batch multi-label classification tasks. Classic classification algorithms must now contend with accuracy and space-time issues because data streams in practical applications are large, fast, multi-label, and subject to concept drift. A remaining difficulty of the KELM training procedure is that it must be repeated many times independently to optimize the model's generalization performance or the number of nodes in the hidden layer. In this paper, a kernel extreme learning machine multi-label data classification method based on the butterfly algorithm optimized by particle swarm optimization is proposed. The proposed algorithm, which fully accounts for the optimization of the model generalization ability and the number of hidden layer nodes, can train multiple KELM hidden layer networks at once while maintaining the algorithm's current time complexity and avoiding a significant number of repeated calculations. The simulation results demonstrate that, in comparison to the PSO-KELM, BBA-KELM, and BOA-KELM algorithms, the PSOBOA-KELM algorithm proposed in this paper can more effectively search the kernel extreme learning machine parameters and more effectively balance the global and local performance, resulting in a KELM prediction model with a higher prediction accuracy.
Affiliation(s)
- Yinggao Yue
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
  - Intelligent Information Systems Institute, Wenzhou University, Wenzhou 325035, China
- Li Cao
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
- Haishao Chen
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
- Yaodan Chen
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
- Zhonggen Su
  - Taishun Research Institute, Wenzhou University of Technology, Wenzhou 325035, China
8
Multiclass convolutional neural network based classification for the diagnosis of brain MRI images. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/01/2023]
9
Semisupervised Bacterial Heuristic Feature Selection Algorithm for High-Dimensional Classification with Missing Labels. Int J Intell Syst 2023. [DOI: 10.1155/2023/4196920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/24/2023]
Abstract
Feature selection is a crucial method for discovering relevant features in high-dimensional data. However, most studies primarily focus on completely labeled data, ignoring the frequent occurrence of missing labels in real-world problems. To address high-dimensional and label-missing problems in data classification simultaneously, we propose a semisupervised bacterial heuristic feature selection algorithm. To tackle the label-missing problem, a k-nearest neighbor semisupervised learning strategy is designed to reconstruct missing labels. In addition, the bacterial heuristic algorithm is improved using hierarchical population initialization, dynamic learning, and elite population evolution strategies to enhance the search capacity for various feature combinations. To verify the effectiveness of the proposed algorithm, three groups of comparison experiments based on eight datasets are employed, including two traditional feature selection methods, four bacterial heuristic feature selection algorithms, and two swarm-based heuristic feature selection algorithms. Experimental results demonstrate that the proposed algorithm has obvious advantages in terms of classification accuracy and selected feature numbers.
10
Li X, Liu G, Wang Z, Zhang L, Liu H, Ai H. Ensemble multiclassification model for aquatic toxicity of organic compounds. Aquat Toxicol 2023; 255:106379. [PMID: 36587517 DOI: 10.1016/j.aquatox.2022.106379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Revised: 12/04/2022] [Accepted: 12/19/2022] [Indexed: 06/17/2023]
Abstract
With environmental pollution becoming increasingly serious, organic compounds have become the main hazard of environmental pollution and exert substantial negative impacts on aquatic organisms. In research pertaining to the acute toxicity of organic compounds, traditional biological experimental methods are time-consuming and expensive. In addition, computer-aided binary classification models cannot accurately classify acute toxicity. Therefore, a multiclassification model is necessary for more accurate classification of acute toxicity. In this study, median lethal concentrations of 373 organic compounds in the environmental toxicology datasets ECOTOX and EAT5 were used. These chemicals were classified into four categories based on the European Economic Community criteria. Then the random forest, support vector machine, extreme gradient boosting, adaptive gradient boosting, and C5.0 decision tree algorithms and eight molecular fingerprints were used to build multiclassification base models for the acute toxicity of organic compounds. The base models were repeated 100 times with fivefold cross-validation and external validation. The ensemble model was obtained by the voting method. The best base classifier was ExtendFP-C5.0, which had accuracy, sensitivity, and specificity values of 87.30%, 87.32%, and 95.76% for external validation; the voting ensemble model achieved 96.92%, 96.93%, and 98.97%, respectively. The ensemble model achieved a higher accuracy than previously reported studies. Our study will help to further classify the acute toxicity of organic compounds to aquatic organisms and predict the hazard classes of organic compounds.
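The abstract above states only that the ensemble was obtained "by the voting method". One common reading is hard (majority) voting over the base classifiers' predicted classes, sketched below. The function name, the three example models, and the tie-breaking behaviour are assumptions for illustration, not details confirmed by the study.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model class predictions into one label per sample.

    `predictions` is a list with one list of labels per base model, all of
    the same length. Ties go to the label that appears first among the
    models' votes, because Counter.most_common keeps first-seen order for
    equal counts.
    """
    ensemble = []
    for sample_votes in zip(*predictions):
        winner, _count = Counter(sample_votes).most_common(1)[0]
        ensemble.append(winner)
    return ensemble


# Hypothetical predictions of three base models for four compounds,
# using toxicity classes 1 to 4.
model_a = [1, 2, 3, 4]
model_b = [1, 2, 4, 4]
model_c = [2, 2, 3, 1]
combined = majority_vote([model_a, model_b, model_c])
```

Voting can outperform the best single base model when the models make different mistakes, which is consistent with the improvement the abstract reports for the ensemble.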
Affiliation(s)
- Xinran Li
  - College of Life Science, Liaoning University, Shenyang, 110036, China
- Gaohua Liu
  - College of Life Science, Liaoning University, Shenyang, 110036, China
- Zhibo Wang
  - College of Life Science, Liaoning University, Shenyang, 110036, China
- Li Zhang
  - College of Life Science, Liaoning University, Shenyang, 110036, China; China Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Shenyang, China
- Hongsheng Liu
  - China Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Shenyang, China; College of Pharmacy, Liaoning University, Shenyang, 110036, China
- Haixin Ai
  - College of Life Science, Liaoning University, Shenyang, 110036, China; China Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Shenyang, China
11
Navin K, Nehemiah HK, Nancy Jane Y, Veena Saroji H. A classification framework using filter–wrapper based feature selection approach for the diagnosis of congenital heart failure. J Intell Fuzzy Syst 2023. [DOI: 10.3233/jifs-221348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
Abstract
Premature mortality from cardiovascular disease can be reduced with early detection of heart failure by analysing the patients’ risk factors and assuring accurate diagnosis. This work proposes a clinical decision support system for the diagnosis of congenital heart failure by utilizing a data pre-processing approach for dealing missing values and a filter-wrapper based method for selecting the most relevant features. Missing values are imputed using a missForest method in four out of eight heart disease datasets collected from the Machine Learning Repository maintained by University of California, Irvine. The Fast Correlation Based Filter is used as the filter approach, while the union of the Atom Search Optimization Algorithm and the Henry Gas Solubility Optimization represent the wrapper-based algorithms, with the fitness function as the combination of accuracy, G-mean, and Matthew’s correlation coefficient measured by the Support Vector Machine. A total of four boosted classifiers namely, XGBoost, AdaBoost, CatBoost, and LightGBM are trained using the selected features. The proposed work achieves an accuracy of 89%, 84%, 83%, 80% for Heart Failure Clinical Records, 81%, 80%, 83%, 82% for Single Proton Emission Computed Tomography, 90%, 82%, 93%, 80% for Single Proton Emission Computed Tomography F, 80%, 80%, 81%, 80% for Statlog Heart Disease, 80%, 85%, 83%, 86% for Cleveland Heart Disease, 82%, 85%, 85%, 82% for Hungarian Heart Disease, 80%, 81%, 79%, 82% for VA Long Beach, 97%, 89%, 98%, 97%, for Switzerland Heart Disease for four classifiers respectively. The suggested technique outperformed the other classifiers when evaluated against Random Forest, Classification and Regression Trees, Support Vector Machine, and K-Nearest Neighbor.
Affiliation(s)
- K.S. Navin: Ramanujan Computing Centre, Anna University, Chennai, India
- Y. Nancy Jane: Department of Computer Technology, Madras Institute of Technology, Chennai, India
- H. Veena Saroji: Assistant Director Planning, Directorate of Health Services, Kerala, India
|
12
|
Joseph LP, Joseph EA, Prasad R. Explainable diabetes classification using hybrid Bayesian-optimized TabNet architecture. Comput Biol Med 2022; 151:106178. [PMID: 36306578 DOI: 10.1016/j.compbiomed.2022.106178] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2022] [Revised: 09/23/2022] [Accepted: 10/01/2022] [Indexed: 12/27/2022]
Abstract
Diabetes is a deadly chronic disease that occurs when the pancreas is not able to produce ample insulin or when the body cannot use insulin effectively. If undetected, it may lead to a host of health complications. Hence, accurate and explainable early-stage detection of diabetes is essential for the proper administration of treatment options in leading a healthy and productive life. For this, we developed an interpretable TabNet model tuned via Bayesian optimization (BO). To achieve model-specific interpretability, the attention mechanism of the TabNet architecture was used, which offered local and global model explanations of the influence of the attributes on the outcomes. The model was further explained locally and globally using the more robust model-agnostic LIME and SHAP eXplainable Artificial Intelligence (XAI) tools. The proposed model outperformed all benchmarked models by obtaining high accuracies of 92.2% and 99.4% using the Pima Indians diabetes dataset (PIDD) and the early-stage diabetes risk prediction dataset (ESDRPD), respectively. Based on the XAI results, it was clear that the most influential attributes for diabetes classification using PIDD and ESDRPD were Insulin and Polyuria, respectively. The feature importance values registered were 0.301 for insulin (PIDD) and 0.206 for polyuria (ESDRPD). The high accuracy and ancillary interpretability of our objective model are expected to increase end-users' trust and confidence in early-stage detection of diabetes.
Affiliation(s)
- Lionel P Joseph: School of Mathematics, Physics, and Computing, University of Southern Queensland, Springfield, QLD, 4300, Australia
- Erica A Joseph: Umanand Prasad School of Medicine and Health Sciences, The University of Fiji, Saweni, Lautoka, Fiji
- Ramendra Prasad: Department of Science, School of Science and Technology, The University of Fiji, Saweni, Lautoka, Fiji
|
13
|
Decision Tree Modeling for Osteoporosis Screening in Postmenopausal Thai Women. INFORMATICS 2022. [DOI: 10.3390/informatics9040083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Osteoporosis is still a serious public health issue in Thailand, particularly in postmenopausal women, and new effective screening tools are required for rapid diagnosis. This study constructs and confirms a decision tree (DT)-based osteoporosis screening tool. Four DT algorithms, namely, classification and regression tree; chi-squared automatic interaction detection (CHAID); quick, unbiased, efficient statistical tree; and C4.5, were implemented on 356 patients, of whom 266 were abnormal and 90 normal. The investigation revealed that the DT algorithms did not differ significantly in accuracy, sensitivity, specificity, or area under the curve. Each algorithm possesses its characteristic performance. The optimal model is selected according to the performance of blind data testing and compared with traditional screening tools: Osteoporosis Self-Assessment for Asians and the Khon Kaen Osteoporosis Study. The Decision Tree for Postmenopausal Osteoporosis Screening (DTPOS) tool was developed from the best performance of CHAID’s algorithms. The age of 58 years and weight at a cutoff of 57.8 kg were the essential predictors of our tool. DTPOS provides a sensitivity of 92.3% and a positive predictive value of 82.8%, and, as it is simple to conduct, it might be used to rule in subjects at risk of osteopenia and osteoporosis in community-based screening.
|
14
|
Liu R, Zhan Y, Liu X, Zhang Y, Gui L, Qu Y, Nan H, Jiang Y. Stacking Ensemble Method for Gestational Diabetes Mellitus Prediction in Chinese Pregnant Women: A Prospective Cohort Study. JOURNAL OF HEALTHCARE ENGINEERING 2022; 2022:8948082. [PMID: 36147870 PMCID: PMC9489389 DOI: 10.1155/2022/8948082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/28/2022] [Accepted: 07/28/2022] [Indexed: 11/18/2022]
Abstract
Gestational diabetes mellitus (GDM) is closely related to adverse pregnancy outcomes and other diseases. Early intervention in pregnant women who are at high risk of developing GDM could help prevent adverse health consequences. The study aims to develop a simple model using the stacking ensemble method to predict GDM for women in the first trimester based on easily available factors. We used the data from the Chinese Pregnant Women Cohort Study from July 2017 to November 2018. A total of 6,848 pregnant women in the first trimester were included in the analysis. Logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost) were considered as base learners. Optimal feature subsets for each learner were chosen by using recursive feature elimination cross-validation. Then, we built a pipeline to process imbalanced data, tune hyperparameters, and evaluate model performance. The learners with the best hyperparameters were employed in the first layer of the proposed stacking method. Their predictions were obtained using optimal feature subsets and served as the meta-learner's inputs. Another LR was used as a meta-learner to obtain the final prediction results. Accuracy, specificity, error rate, and other metrics were calculated to evaluate the performance of the models. A paired samples t-test was performed to compare the model performance. In total, 967 (14.12%) women developed GDM. For base learners, the RF model had the highest accuracy (0.638 (95% confidence interval (CI) 0.628-0.648)) and specificity (0.683 (0.669-0.698)) and lowest error rate (0.362 (0.352-0.372)). The stacking method effectively improved the accuracy (0.666 (95% CI 0.663-0.670)) and specificity (0.725 (0.721-0.729)) and decreased the error rate (0.333 (0.330-0.337)). The differences in the performance between the stacking method and RF were statistically significant. Our proposed stacking method based on easily available factors has better performance than other learners such as RF.
Affiliation(s)
- Ruiyi Liu: Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yongle Zhan: Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China; School of Public Health, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China
- Xuan Liu: Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yifang Zhang: Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Luting Gui: Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Yimin Qu: Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Hairong Nan: Department of Endocrinology, Shenzhen Longhua Maternity and Child Healthcare Hospital, Shenzhen, China
- Yu Jiang: Department of Epidemiology and Biostatistics, School of Population Medicine and Public Health, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
|
15
|
Song W, Zhou X, Duan Q, Wang Q, Li Y, Li A, Zhou W, Sun L, Qiu L, Li R, Li Y. Using random forest algorithm for glomerular and tubular injury diagnosis. Front Med (Lausanne) 2022; 9:911737. [PMID: 35966858 PMCID: PMC9366016 DOI: 10.3389/fmed.2022.911737] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Accepted: 07/04/2022] [Indexed: 11/16/2022] Open
Abstract
Objectives: Chronic kidney disease (CKD) is a common chronic condition with high incidence and insidious onset. Glomerular injury (GI) and tubular injury (TI) represent early manifestations of CKD and could indicate the risk of its development. In this study, we aimed to classify GI and TI using three machine learning algorithms to promote their early diagnosis and slow the progression of CKD. Methods: Demographic information, physical examination, blood, and morning urine samples were first collected from 13,550 subjects in 10 counties in Shanxi province for classification of GI and TI. LASSO regression was employed for feature selection of explanatory variables, and the SMOTE (synthetic minority over-sampling technique) algorithm was used to balance the target datasets, i.e., GI and TI. Afterward, Random Forest (RF), Naive Bayes (NB), and logistic regression (LR) models were constructed to classify GI and TI, respectively. Results: A total of 12,330 participants were enrolled in this study, with 20 explanatory variables. The numbers of patients with GI and TI were 1,587 (12.8%) and 1,456 (11.8%), respectively. After feature selection by LASSO, 14 and 15 explanatory variables remained in these two datasets. After SMOTE, the numbers of patients and normal subjects were 6,165 and 6,165 for GI, and 6,165 and 6,164 for TI, respectively. RF outperformed NB and LR in terms of accuracy (78.14, 80.49%), sensitivity (82.00, 84.60%), specificity (74.29, 76.09%), and AUC (0.868, 0.885) for both GI and TI; the four variables contributing most to the classification were SBP, DBP, sex, and age for GI, and age, SBP, FPG, and GHb for TI. Conclusion: RF performs well in classifying GI and TI, which allows for early auxiliary diagnosis of GI and TI, thus helping to slow the progression of CKD, and shows great promise for clinical practice.
Affiliation(s)
- Wenzhu Song: School of Public Health, Shanxi Medical University, Taiyuan, China
- Xiaoshuang Zhou: Department of Nephrology, Shanxi Provincial People's Hospital (Fifth Hospital) of Shanxi Medical University, Taiyuan, China
- Qi Duan: Shanxi Provincial Key Laboratory of Kidney Disease, Taiyuan, China
- Qian Wang: Shanxi Provincial Key Laboratory of Kidney Disease, Taiyuan, China
- Yaheng Li: Shanxi Provincial Key Laboratory of Kidney Disease, Taiyuan, China
- Aizhong Li: Shanxi Provincial Key Laboratory of Kidney Disease, Taiyuan, China
- Wenjing Zhou: School of Medical Sciences, Shanxi University of Chinese Medicine, Jinzhong, China
- Lin Sun: College of Traditional Chinese Medicine and Food Engineering, Shanxi University of Chinese Medicine, Jinzhong, China
- Lixia Qiu: School of Public Health, Shanxi Medical University, Taiyuan, China
- Rongshan Li: Department of Nephrology, Shanxi Provincial People's Hospital (Fifth Hospital) of Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Kidney Disease, Taiyuan, China
- Yafeng Li: Department of Nephrology, Shanxi Provincial People's Hospital (Fifth Hospital) of Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Kidney Disease, Taiyuan, China; Core Laboratory, Shanxi Provincial People's Hospital (Fifth Hospital) of Shanxi Medical University, Taiyuan, China; Academy of Microbial Ecology, Shanxi Medical University, Taiyuan, China
|
16
|
Feature Selection Using Artificial Gorilla Troop Optimization for Biomedical Data: A Case Analysis with COVID-19 Data. MATHEMATICS 2022. [DOI: 10.3390/math10152742] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Feature selection (FS) is commonly thought of as a pre-processing strategy for determining the best subset of characteristics from a given collection of features. Here, a novel discrete artificial gorilla troop optimization (DAGTO) technique is introduced for the first time to handle FS tasks in the healthcare sector. Depending on the number and type of objective functions, four variants of the proposed method are implemented in this article, namely: (1) single-objective (SO-DAGTO), (2) bi-objective (wrapper) (MO-DAGTO1), (3) bi-objective (filter wrapper hybrid) (MO-DAGTO2), and (4) tri-objective (filter wrapper hybrid) (MO-DAGTO3) for identifying relevant features in diagnosing a particular disease. We provide an outstanding gorilla initialization strategy based on the label mutual information (MI) with the aim of increasing population variety and accelerating convergence. To verify the performance of the presented methods, ten medical datasets of variable dimensions are taken into consideration. The best of the four suggested approaches (MO-DAGTO2) is also compared with four established multi-objective FS strategies and is statistically shown to be superior. Finally, a case study with COVID-19 samples is performed to extract the critical factors related to it and to demonstrate how this method is fruitful in real-world applications.
|
17
|
Boosting chameleon swarm algorithm with consumption AEO operator for global optimization and feature selection. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108743] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
18
|
He C, Zhu S, Wu X, Zhou J, Chen Y, Qian X, Ye J. Accurate Tumor Subtype Detection with Raman Spectroscopy via Variational Autoencoder and Machine Learning. ACS OMEGA 2022; 7:10458-10468. [PMID: 35382336 PMCID: PMC8973095 DOI: 10.1021/acsomega.1c07263] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/25/2021] [Accepted: 03/09/2022] [Indexed: 05/04/2023]
Abstract
Accurate diagnosis of cancer subtypes is a great guide for the development of surgical plans and prognosis in the clinic. Raman spectroscopy, combined with machine learning algorithms, has been demonstrated to be a powerful tool for tumor identification. However, the analysis and classification of Raman spectra for biological samples with complex compositions are still challenging. In addition, the signal-to-noise ratio of the spectra also influences the accuracy of the classification. Herein, we applied the variational autoencoder (VAE) to Raman spectra for downscaling and noise reduction simultaneously. We validated the performance of the VAE algorithm at the cellular and tissue levels. VAE successfully downscaled high-dimensional Raman spectral data to two-dimensional (2D) data for three subtypes of non-small cell lung cancer cells and two subtypes of kidney cancer tissues. Gaussian naïve Bayes was applied to subtype discrimination with the 2D data after VAE encoding at both the cellular and tissue levels, significantly outperforming the discrimination results using the original spectra. Therefore, the analysis of Raman spectroscopy based on VAE and machine learning has great potential for rapid diagnosis of tumor subtypes.
Affiliation(s)
- Chang He: State Key Laboratory of Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P.R. China
- Shuo Zhu: State Key Laboratory of Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P.R. China
- Xiaorong Wu: Department of Urology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, P.R. China
- Jiale Zhou: Department of Urology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, P.R. China
- Yonghui Chen: Department of Urology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, P.R. China
- Xiaohua Qian: State Key Laboratory of Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P.R. China
- Jian Ye: State Key Laboratory of Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P.R. China; Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, P.R. China; Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai 200240, P.R. China
|
19
|
Prediction Model for Infectious Disease Health Literacy Based on Synthetic Minority Oversampling Technique Algorithm. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:8498159. [PMID: 35371281 PMCID: PMC8975663 DOI: 10.1155/2022/8498159] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/24/2021] [Revised: 02/22/2022] [Accepted: 03/08/2022] [Indexed: 12/30/2022]
Abstract
Objective: Improving health literacy in infectious diseases is a direct manifestation of the solid advance in disease control and prevention. Our study aims to explore the application of the synthetic minority oversampling technique (SMOTE) in predicting whether residents and business employees have infectious disease health literacy. Methods: The Chinese resident infectious disease health literacy evaluation scale was used to investigate the associated variables. The screened variables served as input variables, and the presence or absence of infectious disease health literacy as the outcome variable. Logistic regression, random forest, and support vector machine (SVM) models were built on the data sets before and after treatment by the SMOTE algorithm, respectively, and the performance of the models was evaluated by receiver operating characteristic (ROC) curves. Results: Logistic regression, random forest, and SVM achieved accuracies of 0.828, 0.612, and 0.654 before SMOTE algorithm processing, and the areas under the ROC curves (AUCs) of the three models were 0.754, 0.817, and 0.759, respectively. The accuracies were 0.938, 0.911, and 0.894 after SMOTE algorithm processing, and the AUCs of the three models were 0.913, 0.925, and 0.910, respectively. Conclusions: The random forest model based on SMOTE has high application value in assessing whether residents versus enterprise employees have infectious disease health literacy.
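The abstract applies SMOTE without describing it. As a rough illustration of the core idea only (each synthetic minority sample is a random interpolation between a minority point and one of its nearest minority neighbours), the following standard-library sketch may help. The function name `smote_sketch` and its parameters are hypothetical, and this is not the implementation used in the study:

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours.

    minority: list of equal-length numeric tuples (at least two points).
    k: number of nearest minority neighbours considered for each base point.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data.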
|
20
|
Mahajan P, Rana D. Feature optimization in CNN using MROA for disease classification. INTELLIGENT DECISION TECHNOLOGIES 2022. [DOI: 10.3233/idt-220097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/16/2023]
Abstract
Electronic Medical Records (EMR) carry important information about a patient’s journey. The past decade shows substantial use of Natural Language Processing (NLP)-based Information Retrieval (IR) techniques to extract insights such as symptoms, diseases, and tests from these unstructured records. The state of the art shows that convolutional neural networks (CNN) make a significant contribution to the disease classification task. A significant improvement in precise knowledge mining is possible with precise feature extraction. Feature selection addresses undesirable, unneeded, or irrelevant features. This article proposes a Modified Rider Optimization Algorithm (MROA) to choose important features by selecting optimal weights from a pool of randomly generated weights based on high accuracy and less training time in the CNN algorithm. The modified approach is trained on 114 N2C2 patients’ records to extract symptoms, diseases, and tests, which are then used to perform the disease classification task. The proposed approach is found to be accurate, with 97.77% accuracy in the disease classification and treatment prediction task from EMR.
Affiliation(s)
- Pranita Mahajan: Department of Computer Science and Engineering, SVNIT, Surat, Gujarat, India
- Dipti Rana: Department of Computer Engineering, SVNIT University, Surat, Gujarat, India
|
21
|
Isaac A, Nehemiah HK, Dunston SD, Elgin Christo V, Kannan A. Feature selection using competitive coevolution of bio-inspired algorithms for the diagnosis of pulmonary emphysema. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2021.103340] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
|
22
|
Sreejith S., Nehemiah KH, Kannan A.. A Framework to Classify Clinical Data Using a Genetic Algorithm and Artificial Flora-Optimized Neural Network. INTERNATIONAL JOURNAL OF SWARM INTELLIGENCE RESEARCH 2022. [DOI: 10.4018/ijsir.304719] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
A new classification framework for a Clinical Decision Support System, utilizing a Genetic Algorithm and an Artificial Flora Optimized Neural Network (GAFON), is presented in this paper. GAFON is an artificial neural network whose topology is optimized with the Genetic Algorithm and whose learnable parameters are optimized with the Artificial Flora Optimization algorithm. The dropout technique is used in the topology optimization phase, and weight regularization is used in the parameter optimization phase. The proposed method minimizes the co-adaptation problem, reduces over-fitting of training data, and improves the generalization of a feed-forward neural network. The classification framework has been tested on both multi-class and binary-class clinical datasets. The proposed method attained accuracy values of 86.82% for Hepatitis C Virus (HCV) for Egyptian patients, 84.91% for Vertebral Column, 95.65% for Statlog Heart Disease (SHD), and 93.79% for Early Stage Diabetes Risk Prediction (ESDRP), all datasets obtained from the UCI repository.
Affiliation(s)
- Sreejith S.: Ramanujan Computing Centre, Anna University Chennai, India
- Kannan A.: Vellore Institute of Technology, Vellore, India
|
23
|
P J, G JS, K S S, S RW. Clinical decision support system for early detection of Alzheimer's disease using an enhanced gradient boosted decision tree classifier. Health Informatics J 2022; 28:14604582221082868. [PMID: 35350906 DOI: 10.1177/14604582221082868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Alzheimer's disease (AD) is one of the most common forms of dementia, contributing to more than 70% of the cases. The factors accounting for the cause and progression of neurodegenerative diseases like AD are primarily genetic, in addition to lifestyle and environmental factors. Early and accurate diagnoses of AD empower practitioners to take timely clinical decisions and preventive actions. With this as the motivation, the work proposes a novel pattern matching and scoring method on genetic material towards devising an effective classifier. We propose a distinctive disease-causing gene sequence pattern identification using suffix trees as a base detection model with an accuracy of 91.5% in linear time complexity. A scoring mechanism is implemented to assign scores to genes based on the severity of the disease-causing and disease-resistant Single Nucleotide Polymorphisms associated with the genes. These scores are then used as a feature in the gradient boosted decision tree classifier to enhance the classification of AD versus healthy control. The proposed gene-powered EGBDT classifier is evaluated on the ADNI benchmark data set, achieves a prediction accuracy of 94.16%, and is found to be efficient compared to recent works in the literature.
Affiliation(s)
- Jayashree P: Department of Computer Technology, Anna University, Chennai, India
- Janaka Sudha G: Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Sriperumbudur, India
- Srinivasan K S: Department of Computer Technology, Anna University, Chennai, India
- Robert Wilson S: Department of Neurology, SRM Institute of Science and Technology, Kattankulathur, India
|
24
|
Zoo: Selecting Transcriptomic and Methylomic Biomarkers by Ensembling Animal-Inspired Swarm Intelligence Feature Selection Algorithms. Genes (Basel) 2021; 12:genes12111814. [PMID: 34828418 PMCID: PMC8621246 DOI: 10.3390/genes12111814] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2021] [Revised: 11/12/2021] [Accepted: 11/15/2021] [Indexed: 02/03/2023] Open
Abstract
Biological omics data such as transcriptomes and methylomes have the inherent “large p small n” paradigm, i.e., the number of features is much larger than that of the samples. A feature selection (FS) algorithm selects a subset of the transcriptomic or methylomic biomarkers in order to build a better prediction model. The hidden patterns in the FS solution space make it challenging to achieve a feature subset with satisfying prediction performances. Swarm intelligence (SI) algorithms mimic the target searching behaviors of various animals and have demonstrated promising capabilities in selecting features with good machine learning performances. Our study revealed that different SI-based feature selection algorithms contributed complementary searching capabilities in the FS solution space, and their collaboration generated a better feature subset than the individual SI feature selection algorithms. Nine SI-based feature selection algorithms were integrated to vote for the selected features, which were further refined by the dynamic recursive feature elimination framework. In most cases, the proposed Zoo algorithm outperformed the existing feature selection algorithms on transcriptomics and methylomics datasets.
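The voting step described above can be pictured as counting how many selectors chose each feature. The sketch below assumes a simple vote-count threshold; the `min_votes` parameter and the function name `vote_features` are illustrative assumptions rather than details given in the abstract, and the subsequent dynamic recursive feature elimination refinement is not shown.

```python
from collections import Counter


def vote_features(selections, min_votes):
    """Keep features chosen by at least min_votes of the individual selectors.

    selections: one iterable of feature names per feature selection algorithm.
    """
    # Each selector contributes at most one vote per feature.
    votes = Counter(feature for subset in selections for feature in set(subset))
    return sorted(feature for feature, n in votes.items() if n >= min_votes)
```

For example, with three hypothetical selectors returning `{"g1", "g2"}`, `{"g2", "g3"}`, and `{"g1", "g2"}`, a threshold of two votes keeps `g1` and `g2`.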
|
25
|
Piri J, Mohapatra P. An analytical study of modified multi-objective Harris Hawk Optimizer towards medical data feature selection. Comput Biol Med 2021; 135:104558. [PMID: 34182329 DOI: 10.1016/j.compbiomed.2021.104558] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2021] [Revised: 06/04/2021] [Accepted: 06/04/2021] [Indexed: 11/28/2022]
Abstract
Dimensionality reduction or Feature Selection (FS) is a multi-target optimization problem with two goals: improving the classification efficiency while simultaneously reducing the number of features. Harris Hawk Optimization (HHO) was introduced recently as a metaheuristic tool for solving different demanding optimization tasks. The original HHO addresses optimization problems in a continuous space, but FS is an optimization task in binary space. Therefore, in this article, a Multi-Objective Quadratic Binary HHO (MOQBHHO) technique with the K-Nearest Neighbor (KNN) method as the wrapper classifier is implemented for extracting the optimal feature subsets. Finally, this study uses the crowding distance (CD) value as a third criterion for picking the best one from the non-dominated solutions. Here, to estimate the performance of the proposed approach, twelve standard medical datasets are considered. The proposed MOQBHHO is compared with MOBHHO-S (using a sigmoid function), multi-objective genetic algorithm (MOGA), multi-objective ant lion optimization (MOALO), and NSGA-II. The experimental findings show that the proposed MOQBHHO finds a set of non-dominated feature subsets effectively in contrast to deep-based FS methods: Auto-encoder and Teacher-Student based FS (TSFS). The presented methodology is found superior in obtaining the best trade-off between two fitness assessment criteria compared to the other existing multi-objective techniques for recognizing relevant features.
|
26
|
A multiple combined method for rebalancing medical data with class imbalances. Comput Biol Med 2021; 134:104527. [PMID: 34091384 DOI: 10.1016/j.compbiomed.2021.104527] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2021] [Revised: 05/24/2021] [Accepted: 05/25/2021] [Indexed: 11/24/2022]
Abstract
Most classification algorithms assume that classes are in a balanced state. However, datasets with class imbalances are everywhere. The classes of actual medical datasets are imbalanced, severely impacting identification models and even sacrificing the classification accuracy of the minority class, even though it is the most influential and representative. The medical field has irreversible characteristics. Its tolerance rate for misjudgment is relatively low, and errors may cause irreparable harm to patients. Therefore, this study proposes a multiple combined method to rebalance medical data featuring class imbalances. The combined methods include (1) resampling methods (synthetic minority oversampling technique [SMOTE] and undersampling [US]), (2) particle swarm optimization (PSO), and (3) MetaCost. This study conducted two experiments with nine medical datasets to verify and compare the proposed method with the listed methods. A decision tree is used to generate decision rules for easy understanding of the research results. The results show that (1) the proposed method with ensemble learning can improve the area under a receiver operating characteristic curve (AUC), recall, precision, and F1 metrics; (2) MetaCost can increase sensitivity; (3) SMOTE can effectively enhance AUC; (4) US can improve sensitivity, F1, and misclassification costs in data with a high class imbalance ratio; and (5) PSO-based attribute selection can increase sensitivity and reduce data dimension. Finally, we suggest that for a dataset with an imbalance ratio >9, the US results must be used to make the decision. When the imbalance ratio is <9, the decision-maker can simultaneously consider the results of SMOTE and US to identify the best decision.
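The closing recommendation amounts to a threshold rule on the imbalance ratio. A minimal sketch follows, assuming the ratio is defined as the majority-class count divided by the minority-class count (the abstract does not define it) and treating a ratio of exactly 9, which the abstract leaves unspecified, like the lower case. The function names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


def suggested_resampling(labels, threshold=9.0):
    """Return the resampling results the abstract suggests consulting."""
    if imbalance_ratio(labels) > threshold:
        return ["US"]  # high imbalance: rely on the undersampling results
    return ["SMOTE", "US"]  # otherwise consider both (ratio == 9 handled here by assumption)
```

For instance, 95 negatives and 5 positives give a ratio of 19, so only the undersampling results would be consulted under this rule.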
|
27
|
Feature Selection and Classification of Clinical Datasets Using Bioinspired Algorithms and Super Learner. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:6662420. [PMID: 34055041 PMCID: PMC8149240 DOI: 10.1155/2021/6662420] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/19/2020] [Revised: 04/10/2021] [Accepted: 04/23/2021] [Indexed: 11/23/2022]
Abstract
A computer-aided diagnosis (CAD) system that employs a super learner to diagnose the presence or absence of a disease has been developed. Each clinical dataset is preprocessed and split into a training set (60%) and a testing set (40%). A wrapper approach that uses three bioinspired algorithms, namely, cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO), with the classification accuracy of a support vector machine (SVM) as the fitness function, has been used for feature selection. The selected features of each bioinspired algorithm are stored in three separate databases. The features selected by each bioinspired algorithm are used to train three back propagation neural networks (BPNN) independently using the conjugate gradient algorithm (CGA). Classifier testing is performed by using the testing set on each trained classifier, and the diagnostic results obtained are used to evaluate the performance of each classifier. The classification results that the three classifiers produce for each instance of the testing set, together with the class label of that instance, form the candidate instances for training and testing the super learner. The training set comprises 80% of the instances, and the testing set comprises 20% of the instances. Experimentation has been carried out using seven clinical datasets from the University of California Irvine (UCI) machine learning repository. The super learner has achieved a classification accuracy of 96.83% for the Wisconsin diagnostic breast cancer dataset (WDBC), 86.36% for the Statlog heart disease dataset (SHD), 94.74% for the hepatocellular carcinoma dataset (HCC), 90.48% for the hepatitis dataset (HD), 81.82% for the vertebral column dataset (VCD), 84% for the Cleveland heart disease dataset (CHD), and 70% for the Indian liver patient dataset (ILP).
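The way the super learner's data is assembled from the three base classifiers can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's method: the abstract does not say which model the super learner is, so `fit_lookup_meta_learner` is a hypothetical stand-in (a lookup table with a majority-vote fallback), and all function names are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def build_meta_instances(base_predictions, labels):
    """Pair each test instance's base-classifier outputs with its class label.

    base_predictions holds one list of predictions per base classifier.
    """
    return [(preds, label) for preds, label in zip(zip(*base_predictions), labels)]


def split_80_20(instances, seed=0):
    """Shuffle the meta-instances and split them 80% / 20%."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_lookup_meta_learner(train):
    """Hypothetical stand-in meta-learner (not the paper's super learner).

    Predicts the most common label seen for a prediction pattern and
    falls back to a majority vote for patterns absent from training.
    """
    votes = defaultdict(Counter)
    for pattern, label in train:
        votes[pattern][label] += 1
    table = {p: c.most_common(1)[0][0] for p, c in votes.items()}

    def predict(pattern):
        if pattern in table:
            return table[pattern]
        return Counter(pattern).most_common(1)[0][0]

    return predict
```

The point of the sketch is the data flow: the meta-learner never sees the clinical features, only the three base predictions for each test instance.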
|
28
|
RIFS2D: A two-dimensional version of a randomly restarted incremental feature selection algorithm with an application for detecting low-ranked biomarkers. Comput Biol Med 2021; 133:104405. [PMID: 33930763 DOI: 10.1016/j.compbiomed.2021.104405] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/03/2021] [Revised: 04/13/2021] [Accepted: 04/13/2021] [Indexed: 12/20/2022]
Abstract
The era of big data introduces both opportunities and challenges for biomedical researchers. One of the inherent difficulties in the biomedical research field is recruiting large cohorts of samples, while high-throughput biotechnologies may produce thousands or even millions of features for each sample. Researchers tend to evaluate the individual correlation of each feature with the class label and use the incremental feature selection (IFS) strategy to select the top-ranked features with the best prediction performance. Recent experimental data showed that a subset of consecutively ranked features randomly restarted from a low-ranked feature (an RIFS block) may outperform the subset of top-ranked features. This study proposed a feature selection algorithm, RIFS2D, that integrates multiple RIFS blocks. A comprehensive comparative experiment was conducted with the IFS, RIFS, and existing feature selection algorithms and demonstrated that a subset of low-ranked features may also achieve promising prediction performance. This study suggested that a prediction model with promising performance may be trained on low-ranked features, even when top-ranked features did not achieve satisfactory prediction performance. Further comparative experiments were conducted between RIFS2D and t-tests for the detection of early-stage breast cancer. The data showed that the RIFS2D-recommended features achieved better prediction accuracy and were targeted by more drugs than the t-test top-ranked features.
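The difference between IFS and a randomly restarted block can be sketched in a few lines. This is a simplified illustration, not the published RIFS or RIFS2D algorithm: it assumes the features are already ranked, that `score` is a caller-supplied function returning the prediction performance of a feature block, and it does not attempt the integration of multiple blocks that RIFS2D performs. Both function names are hypothetical.

```python
import random


def incremental_feature_selection(ranked, score, start=0):
    """Grow a block of consecutively ranked features beginning at `start`.

    Returns the best-scoring block and its score. With start=0 this is
    plain IFS over the top-ranked features.
    """
    best_block, best_score = [], float("-inf")
    for end in range(start + 1, len(ranked) + 1):
        block = ranked[start:end]
        s = score(block)
        if s > best_score:
            best_block, best_score = block, s
    return best_block, best_score


def randomly_restarted_ifs(ranked, score, n_restarts=5, seed=0):
    """Run IFS from rank 0 and from random lower-ranked starting points.

    Returns the best (block, score) pair found over all starting points.
    """
    rng = random.Random(seed)
    starts = [0] + [rng.randrange(1, len(ranked)) for _ in range(n_restarts)]
    results = (incremental_feature_selection(ranked, score, s) for s in starts)
    return max(results, key=lambda r: r[1])
```

Because rank 0 is always among the starting points, the restarted search in this sketch can never score below plain IFS; any gain comes from blocks that begin at a lower-ranked feature.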
|