1
Yasin P, Yimit Y, Cai X, Aimaiti A, Sheng W, Mamat M, Nijiati M. Machine learning-enabled prediction of prolonged length of stay in hospital after surgery for tuberculosis spondylitis patients with unbalanced data: a novel approach using explainable artificial intelligence (XAI). Eur J Med Res 2024; 29:383. [PMID: 39054495 PMCID: PMC11270948 DOI: 10.1186/s40001-024-01988-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Accepted: 07/18/2024] [Indexed: 07/27/2024] Open
Abstract
BACKGROUND Tuberculosis spondylitis (TS), commonly known as Pott's disease, is a severe type of skeletal tuberculosis that typically requires surgical treatment. However, this treatment option has led to an increase in healthcare costs due to prolonged hospital stays (PLOS). Therefore, identifying risk factors associated with extended PLOS is necessary. In this research, we intended to develop an interpretable machine learning model that could predict extended PLOS and provide valuable insights for treatment; a web-based application was also implemented. METHODS We obtained patient data from the spine surgery department at our hospital. Extended postoperative length of stay (PLOS) refers to a hospitalization duration equal to or exceeding the 75th percentile following spine surgery. To identify relevant variables, we employed several approaches, such as the least absolute shrinkage and selection operator (LASSO), recursive feature elimination (RFE) based on support vector machine classification (SVC), correlation analysis, and permutation importance value. Several models were implemented, and some of them were ensembled using soft voting techniques. Models were constructed using grid search with nested cross-validation. The performance of each algorithm was assessed through various metrics, including the AUC value (area under the curve of receiver operating characteristics) and the Brier Score. Model interpretation involved utilizing methods such as Shapley additive explanations (SHAP), the Gini Impurity Index, permutation importance, and local interpretable model-agnostic explanations (LIME). Furthermore, to facilitate the practical application of the model, a web-based interface was developed and deployed.
RESULTS The study included a cohort of 580 patients, and 11 features (CRP, transfusions, infusion volume, blood loss, X-ray bone bridge, X-ray osteophyte, CT-vertebral destruction, CT-paravertebral abscess, MRI-paravertebral abscess, MRI-epidural abscess, postoperative drainage) were selected. Most of the classifiers performed well, with the XGBoost model showing a higher AUC value (0.86) and a lower Brier Score (0.126) than the others. The XGBoost model was chosen as the optimal model. The results obtained from the calibration and decision curve analysis (DCA) plots demonstrate that XGBoost achieved promising performance. After conducting tenfold cross-validation, the XGBoost model demonstrated a mean AUC of 0.85 ± 0.09. SHAP and LIME were used to display the variables' contributions to the predicted value. The stacked bar plots indicated that infusion volume was the primary contributor, as determined by Gini, permutation importance (PFI), and the LIME algorithm. CONCLUSIONS Our methods not only effectively predicted extended PLOS but also identified risk factors that can be utilized for future treatments. The XGBoost model developed in this study is easily accessible through the deployed web application and can aid in clinical research.
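As a rough illustration of two techniques named in this abstract, soft voting and the Brier Score, the sketch below averages the positive-class probabilities of several classifiers and then scores the averaged prediction. It is not the authors' code: the three probability lists and the labels are made-up values, and `soft_vote` and `brier_score` are hypothetical helper names.

```python
from statistics import mean

def soft_vote(prob_lists):
    """Average the positive-class probabilities predicted by several models."""
    return [mean(sample_probs) for sample_probs in zip(*prob_lists)]

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return mean((p - y) ** 2 for p, y in zip(probs, labels))

# Hypothetical probabilities of an extended stay from three models for four patients.
model_a = [0.9, 0.2, 0.6, 0.1]
model_b = [0.8, 0.3, 0.4, 0.2]
model_c = [0.7, 0.1, 0.5, 0.3]
labels = [1, 0, 1, 0]  # 1 = extended stay, 0 = not extended (made-up outcomes)

ensemble = soft_vote([model_a, model_b, model_c])
score = brier_score(ensemble, labels)
print(ensemble, score)
```

A lower Brier Score means the predicted probabilities sit closer to the observed outcomes, which is why the abstract reports it alongside the AUC.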
Affiliation(s)
- Parhat Yasin
- Department of Spine Surgery, The Sixth Affiliated Hospital of Xinjiang Medical University, Urumqi, 830000, Xinjiang, People's Republic of China
- Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Yasen Yimit
- Department of Radiology, The First People's Hospital of Kashi Prefecture, Kashi, 844000, Xinjiang, People's Republic of China
- Xiaoyu Cai
- Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Abasi Aimaiti
- Department of Anesthesiology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Weibin Sheng
- Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Mardan Mamat
- Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Mayidili Nijiati
- Department of Radiology, The Fourth Affiliated Hospital of Xinjiang Medical University (Xinjiang Hospital of Traditional Chinese Medicine), Urumqi, 830002, Xinjiang, People's Republic of China
- Xinjiang Key Laboratory of Artificial Intelligence Assisted Imaging Diagnosis, Kashi, 844000, Xinjiang, People's Republic of China
2
Ahn S, Sung Y, Song W. Machine Learning-Based Identification of Diagnostic Biomarkers for Korean Male Sarcopenia Through Integrative DNA Methylation and Methylation Risk Score: From the Korean Genomic Epidemiology Study (KoGES). J Korean Med Sci 2024; 39:e200. [PMID: 38978487 PMCID: PMC11231442 DOI: 10.3346/jkms.2024.39.e200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 05/21/2024] [Indexed: 07/10/2024] Open
Abstract
BACKGROUND Sarcopenia, characterized by a progressive decline in muscle mass, strength, and function, is primarily attributable to aging. DNA methylation, influenced by both genetic predispositions and environmental exposures, plays a significant role in sarcopenia occurrence. This study employed machine learning (ML) methods to identify differentially methylated probes (DMPs) capable of diagnosing sarcopenia in middle-aged individuals. We also investigated the relationship between muscle strength, muscle mass, age, and sarcopenia risk as reflected in methylation profiles. METHODS Data from 509 male participants in the urban cohort of the Korean Genome Epidemiology Study_Health Examinee study were categorized into quartile groups based on the sarcopenia criteria for appendicular skeletal muscle index (ASMI) and handgrip strength (HG). To identify diagnostic biomarkers for sarcopenia, we used recursive feature elimination with cross-validation (RFECV) to pinpoint DMPs significantly associated with sarcopenia. An ensemble model, leveraging majority voting, was utilized for evaluation. Furthermore, a methylation risk score (MRS) was calculated, and its correlation with muscle strength, function, and age was assessed using likelihood ratio analysis and multinomial logistic regression. RESULTS Participants were classified into two groups based on quartile thresholds: sarcopenia (n = 37) with ASMI and HG in the lowest quartile, and normal ranges (n = 48) in the highest. In total, 238 DMPs were identified and eight probes were selected using RFECV. These DMPs were used to build an ensemble model with robust diagnostic capabilities for sarcopenia, as evidenced by an area under the receiver operating characteristic curve of 0.94. Based on eight probes, the MRS was calculated and then validated by analyzing age, HG, and ASMI among the control group (n = 424).
Age was positively correlated with high MRS (coefficient, 1.2494; odds ratio [OR], 3.4882), whereas ASMI and HG were negatively correlated with high MRS (ASMI coefficient, -0.4275; OR, 0.6521; HG coefficient, -0.3116; OR, 0.7323). CONCLUSION Overall, this study identified key epigenetic markers of sarcopenia in Korean males and developed a ML model with high diagnostic accuracy for sarcopenia. The MRS also revealed significant correlations between these markers and age, HG, and ASMI. These findings suggest that both diagnostic models and the MRS can play an important role in managing sarcopenia in middle-aged populations.
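The odds ratios quoted in this abstract are consistent with exponentiating the reported regression coefficients, which is the usual relationship in logistic-type models. The short check below is only an illustration built from the three coefficients given above; the dictionary and variable names are hypothetical.

```python
import math

# Coefficients for high MRS as reported in the abstract.
coefficients = {"age": 1.2494, "ASMI": -0.4275, "HG": -0.3116}

# In a logistic-type model, the odds ratio is exp(coefficient).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.4f}")
```

A coefficient above zero gives an odds ratio above 1 (age), and a coefficient below zero gives an odds ratio below 1 (ASMI and HG), matching the direction of the correlations described.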
Affiliation(s)
- Seohyun Ahn
- Health and Exercise Science Laboratory, Institute of Sport Science, Department of Physical Education, Seoul National University, Seoul, Korea
- Yunho Sung
- Health and Exercise Science Laboratory, Institute of Sport Science, Department of Physical Education, Seoul National University, Seoul, Korea
- Wook Song
- Health and Exercise Science Laboratory, Institute of Sport Science, Department of Physical Education, Seoul National University, Seoul, Korea
- Institute on Aging, Seoul National University, Seoul, Korea
3
Islam MA, Majumder MZH, Miah MS, Jannaty S. Precision healthcare: A deep dive into machine learning algorithms and feature selection strategies for accurate heart disease prediction. Comput Biol Med 2024; 176:108432. [PMID: 38744014 DOI: 10.1016/j.compbiomed.2024.108432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 04/06/2024] [Accepted: 04/07/2024] [Indexed: 05/16/2024]
Abstract
This paper presents a comprehensive exploration of machine learning algorithms (MLAs) and feature selection techniques for accurate heart disease prediction (HDP) in modern healthcare. By focusing on diverse datasets encompassing various challenges, the research sheds light on optimal strategies for early detection. MLAs such as Decision Trees (DT), Random Forests (RF), Support Vector Machines (SVM), Gaussian Naive Bayes (NB), and others were studied, with precision and recall metrics emphasized for robust predictions. Our study addresses challenges in real-world data through data cleaning and one-hot encoding, enhancing the integrity of our predictive models. Feature extraction techniques, namely Recursive Feature Elimination (RFE), Principal Component Analysis (PCA), and univariate feature selection, play a crucial role in identifying relevant features and reducing data dimensionality. Our findings showcase the impact of these techniques on improving prediction accuracy. Optimized models for each dataset have been achieved through grid search hyperparameter tuning, with configurations meticulously outlined. Notably, a remarkable 99.12% accuracy was achieved on the first Kaggle dataset, showcasing the potential for accurate HDP. Model robustness across diverse datasets was highlighted, with caution against overfitting. The study emphasizes the need for validation on unseen data and encourages ongoing research for generalizability. Serving as a practical guide, this research aids researchers and practitioners in HDP model development, influencing clinical decisions and healthcare resource allocation. By providing insights into effective algorithms and techniques, the paper contributes to reducing heart disease-related morbidity and mortality, supporting the healthcare community's ongoing efforts.
Affiliation(s)
- Md Ariful Islam
- Department of Robotics and Mechatronics Engineering, University of Dhaka, Dhaka, 1000, Bangladesh
- Md Sohel Miah
- Department of Computer Science and Technology, Moulvibazar Polytechnic Institute, Bangladesh
- Sumaia Jannaty
- Gonoshasthaya Samaj Vittik Medical College, Savar, Dhaka, Bangladesh
4
Ding X, Li Y, Chen S. Maximum margin and global criterion based-recursive feature selection. Neural Netw 2024; 169:597-606. [PMID: 37956576 DOI: 10.1016/j.neunet.2023.10.037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 06/19/2023] [Accepted: 10/22/2023] [Indexed: 11/15/2023]
Abstract
In this research paper, we aim to investigate and address the limitations of recursive feature elimination (RFE) and its variants in high-dimensional feature selection tasks. We identify two main challenges associated with these methods. Firstly, the feature ranking criterion utilized in these approaches is inconsistent with the maximum-margin theory. Secondly, the computation of the criterion is performed locally, lacking the ability to measure the importance of features globally. To overcome these challenges, we propose a novel feature ranking criterion called Maximum Margin and Global (MMG) criterion. This criterion utilizes the classification margin to determine the importance of features and computes it globally, enabling a more accurate assessment of feature importance. Moreover, we introduce an optimal feature subset evaluation algorithm that leverages the MMG criterion to determine the best subset of features. To enhance the efficiency of the proposed algorithms, we provide two alpha seeding strategies that significantly reduce computational costs while maintaining high accuracy. These strategies offer a practical means to expedite the feature selection process. Through extensive experiments conducted on ten benchmark datasets, we demonstrate that our proposed algorithms outperform current state-of-the-art methods. Additionally, the alpha seeding strategies yield significant speedups, further enhancing the efficiency of the feature selection process.
Affiliation(s)
- Xiaojian Ding
- College of Information Engineering, Nanjing University of Finance and Economics, Nanjing 210023, China
- Yi Li
- College of Economics and Management, Nanjing Agricultural University, Nanjing 210095, China
- Shilin Chen
- Thoracic Surgery, Nanjing Medical University Affiliated Cancer Hospital, Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, Nanjing 221005, China
5
Zieliński K, Drabczyk D, Kunicki M, Drzyzga D, Kloska A, Rumiński J. Evaluating the risk of endometriosis based on patients' self-assessment questionnaires. Reprod Biol Endocrinol 2023; 21:102. [PMID: 37898817 PMCID: PMC10612251 DOI: 10.1186/s12958-023-01156-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/06/2023] [Accepted: 10/23/2023] [Indexed: 10/30/2023] Open
Abstract
BACKGROUND Endometriosis is a condition that significantly affects the quality of life of about 10% of reproductive-aged women. It is characterized by the presence of tissue similar to the uterine lining (endometrium) outside the uterus, which can lead to scarring, adhesions, pain, and fertility issues. While numerous factors associated with endometriosis are documented, a wide range of symptoms may still be undiscovered. METHODS In this study, we employed machine learning algorithms to predict endometriosis based on the patient symptoms extracted from 13,933 questionnaires. We compared the results of feature selection obtained from various algorithms (i.e., Boruta algorithm, Recursive Feature Selection) with experts' decisions. As a benchmark model architecture, we utilized a LightGBM algorithm, along with Multivariate Imputation by Chained Equations (MICE) and k-nearest neighbors (KNN), for missing data imputation. Our primary objective was to assess the model's performance and feature importance compared to existing studies. RESULTS We identified the top 20 predictors of endometriosis, uncovering previously overlooked features such as Cesarean section, ovarian cysts, and hernia. Notably, the model's performance metrics were maximized when utilizing a combination of multiple feature selection methods. Specifically, the final model achieved an area under the receiver operating characteristic curve (AUC) of 0.85 on the training dataset and an AUC of 0.82 on the testing dataset. CONCLUSIONS The application of machine learning in diagnosing endometriosis has the potential to significantly impact clinical practice, streamlining the diagnostic process and enhancing efficiency. Our questionnaire-based prediction approach empowers individuals with endometriosis to proactively identify potential symptoms, facilitating informed discussions with healthcare professionals about diagnosis and treatment options.
Affiliation(s)
- Krystian Zieliński
- INVICTA, Research and Development Center, Sopot, Poland
- Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland
- Anna Kloska
- INVICTA, Research and Development Center, Sopot, Poland
- Department of Medical Biology and Genetics, Faculty of Biology, University of Gdańsk, Gdańsk, Poland
- Jacek Rumiński
- Department of Biomedical Engineering, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Gdańsk, Poland
6
Zheng J, Li Y, Billor N, Ahmed MI, Fang YHD, Pat B, Denney TS, Dell’Italia LJ. Understanding post-surgical decline in left ventricular function in primary mitral regurgitation using regression and machine learning models. Front Cardiovasc Med 2023; 10:1112797. [PMID: 37153472 PMCID: PMC10160646 DOI: 10.3389/fcvm.2023.1112797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Accepted: 03/28/2023] [Indexed: 05/09/2023] Open
Abstract
Background Class I echocardiographic guidelines in primary mitral regurgitation (PMR) risk left ventricular ejection fraction (LVEF) < 50% after mitral valve surgery even with pre-surgical LVEF > 60%. There are no models predicting LVEF < 50% after surgery in the complex interplay of increased preload and facilitated ejection in PMR using cardiac magnetic resonance (CMR). Objective Use regression and machine learning models to identify a combination of CMR LV remodeling and function parameters that predict LVEF < 50% after mitral valve surgery. Methods CMR with tissue tagging was performed in 51 pre-surgery PMR patients (median CMR LVEF 64%), 49 asymptomatic (median CMR LVEF 63%), and age-matched controls (median CMR LVEF 64%). To predict post-surgery LVEF < 50%, least absolute shrinkage and selection operator (LASSO), random forest (RF), extreme gradient boosting (XGBoost), and support vector machine (SVM) were developed and validated in pre-surgery PMR patients. Recursive feature elimination and LASSO reduced the number of features and model complexity. Data were split and tested 100 times and models were evaluated via stratified cross-validation to avoid overfitting. The final RF model was tested in asymptomatic PMR patients to predict post-surgical LVEF < 50% if they had gone to mitral valve surgery. Results Thirteen pre-surgery PMR patients had LVEF < 50% after mitral valve surgery. In addition to LVEF (P = 0.005) and LVESD (P = 0.13), LV sphericity index (P = 0.047) and LV mid systolic circumferential strain rate (P = 0.024) were predictors of post-surgery LVEF < 50%. Using these four parameters, logistic regression achieved 77.92% classification accuracy while RF improved the accuracy to 86.17%. This final RF model was applied to asymptomatic PMR patients and predicted that 14 (28.57%) out of 49 would have post-surgery LVEF < 50% if they had mitral valve surgery.
Conclusions These preliminary findings call for a longitudinal study to determine whether LV sphericity index and circumferential strain rate, or other combination of parameters, accurately predict post-surgical LVEF in PMR.
Affiliation(s)
- Jingyi Zheng
- Department of Mathematics and Statistics, Auburn University, Auburn, AL, United States
- Yuexin Li
- Department of Mathematics and Statistics, Auburn University, Auburn, AL, United States
- Nedret Billor
- Department of Mathematics and Statistics, Auburn University, Auburn, AL, United States
- Mustafa I. Ahmed
- Division of Cardiovascular Disease, University of Alabama at Birmingham, Birmingham, AL, United States
- Yu-Hua Dean Fang
- Department of Radiology, University of Alabama at Birmingham, Birmingham, AL, United States
- Betty Pat
- Division of Cardiovascular Disease, University of Alabama at Birmingham, Birmingham, AL, United States
- Birmingham Veterans Affairs Health Care System, Birmingham, AL, United States
- Thomas S. Denney
- Department of Electrical and Computer Engineering, Samuel Ginn College of Engineering, Auburn University, Auburn, AL, United States
- Louis J. Dell’Italia
- Division of Cardiovascular Disease, University of Alabama at Birmingham, Birmingham, AL, United States
- Birmingham Veterans Affairs Health Care System, Birmingham, AL, United States
7
Hou X, Hou J, Huang G. Bi-dimensional principal gene feature selection from big gene expression data. PLoS One 2022; 17:e0278583. [PMID: 36477666 PMCID: PMC9728919 DOI: 10.1371/journal.pone.0278583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Accepted: 11/20/2022] [Indexed: 12/12/2022] Open
Abstract
Gene expression sample data, which usually contain massive expression profiles of genes, are commonly used for disease-related gene analysis. The selection of relevant genes from a huge number of genes is always a fundamental process in applications of gene expression data. As more and more genes have been detected, the size of gene expression data becomes larger and larger; this challenges the computing efficiency of extracting the relevant and important genes from gene expression data. In this paper, we provide a novel Bi-dimensional Principal Feature Selection (BPFS) method for efficiently extracting critical genes from big gene expression data. It applies the principal component analysis (PCA) method on sample and gene domains successively, aiming at extracting the relevant gene features and reducing redundancies while losing less information. The experimental results on four real-world cancer gene expression datasets show that the proposed BPFS method greatly reduces the data size and achieves a nearly double processing speed compared to the counterpart methods, while maintaining better accuracy and effectiveness.
Affiliation(s)
- Xiaoqian Hou
- School of Information Technology, Deakin University, Melbourne, Victoria, Australia
- Jingyu Hou
- School of Information Technology, Deakin University, Melbourne, Victoria, Australia
- Guangyan Huang
- School of Information Technology, Deakin University, Melbourne, Victoria, Australia
8
iEnhancer-MRBF: Identifying enhancers and their strength with a multiple Laplacian-regularized radial basis function network. Methods 2022; 208:1-8. [DOI: 10.1016/j.ymeth.2022.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Revised: 09/26/2022] [Accepted: 10/03/2022] [Indexed: 11/07/2022] Open
9
Abdelwahab O, Awad N, Elserafy M, Badr E. A feature selection-based framework to identify biomarkers for cancer diagnosis: A focus on lung adenocarcinoma. PLoS One 2022; 17:e0269126. [PMID: 36067196 PMCID: PMC9447897 DOI: 10.1371/journal.pone.0269126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/14/2021] [Accepted: 05/15/2022] [Indexed: 12/23/2022] Open
Abstract
Lung cancer (LC) represents most of the cancer incidences in the world. There are many types of LC, but Lung Adenocarcinoma (LUAD) is the most common type. Although RNA-seq and microarray data provide a vast amount of gene expression data, most of the genes are insignificant to clinical diagnosis. Feature selection (FS) techniques overcome the high dimensionality and sparsity issues of large-scale data. We propose a framework that applies an ensemble of feature selection techniques to identify genes highly correlated with LUAD. Utilizing LUAD RNA-seq data from the Cancer Genome Atlas (TCGA), we employed mutual information (MI) and recursive feature elimination (RFE) feature selection techniques along with a support vector machine (SVM) classification model. We have also utilized Random Forest (RF) as an embedded FS technique. The results were integrated and candidate biomarker genes across all techniques were identified. The proposed framework has identified 12 potential biomarkers that are highly correlated with different LC types, especially LUAD. A predictive model has been trained utilizing the identified biomarker expression profiling and performance of 97.99% was achieved. In addition, upon performing differential gene expression analysis, we found that all 12 genes were significantly differentially expressed between normal and LUAD tissues, and strongly correlated with LUAD according to previous reports. We here propose that using multiple feature selection methods effectively reduces the number of identified biomarkers and directly affects their biological relevance.
Affiliation(s)
- Omar Abdelwahab
- University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
- Nourelislam Awad
- University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
- Center of Informatics Science, Nile University, Giza, Egypt
- Menattallah Elserafy
- University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
- Center for Genomics, Helmy Institute for Medical Sciences, Zewail City of Science and Technology, Giza, Egypt
- Eman Badr
- University of Science and Technology, Zewail City of Science and Technology, Giza, Egypt
- Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
10
Wang T, Jiao M, Wang X. Link Prediction in Complex Networks Using Recursive Feature Elimination and Stacking Ensemble Learning. Entropy (Basel) 2022; 24:1124. [PMID: 36010793 PMCID: PMC9407261 DOI: 10.3390/e24081124] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/12/2022] [Revised: 08/11/2022] [Accepted: 08/12/2022] [Indexed: 06/15/2023]
Abstract
Link prediction, an important task in the field of network analysis and modeling, predicts missing links in current networks and new links in future networks. In order to improve the performance of link prediction, we integrate global, local, and quasi-local topological information of networks. In this paper, a novel stacking ensemble framework is proposed for link prediction. Our approach employs random forest-based recursive feature elimination to select relevant structural features associated with networks and constructs a two-level stacking ensemble model involving various machine learning methods for link prediction. The lower level is composed of three base classifiers, i.e., logistic regression, gradient boosting decision tree, and XGBoost, and their outputs are then integrated with an XGBoost model in the upper level. Extensive experiments were conducted on six networks. Comparison results show that the proposed method can obtain better prediction results and robust applicability.
Affiliation(s)
- Tao Wang
- School of Mathematics and Physics, North China Electric Power University, Baoding 071003, China
- Hebei Key Laboratory of Physics and Energy Technology, North China Electric Power University, Baoding 071000, China
- Mengyu Jiao
- School of Mathematics and Physics, North China Electric Power University, Baoding 071003, China
- Xiaoxia Wang
- School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, China
11
Li Y, Shen Y, Fan X, Huang X, Yu H, Zhao G, Ma W. A novel EEG-based major depressive disorder detection framework with two-stage feature selection. BMC Med Inform Decis Mak 2022; 22:209. [PMID: 35933348 PMCID: PMC9357341 DOI: 10.1186/s12911-022-01956-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/20/2021] [Accepted: 07/29/2022] [Indexed: 11/16/2022] Open
Abstract
Background Major depressive disorder (MDD) is a common mental illness, characterized by persistent depression, sadness, despair, etc., troubling people’s daily life and work seriously. Methods In this work, we present a novel automatic MDD detection framework based on EEG signals. First of all, we derive highly MDD-correlated features, calculating the ratio of extracted features from EEG signals at frequency bands between β and α. Then, a two-stage feature selection method named PAR is presented with the sequential combination of Pearson correlation coefficient (PCC) and recursive feature elimination (RFE), where the advantages lie in minimizing the feature searching space. Finally, we employ widely used machine learning methods of support vector machine (SVM), logistic regression (LR), and linear regression (LNR) for MDD detection with the merit of feature interpretability. Results Experiment results show that our proposed MDD detection framework achieves competitive results. The accuracy and F1 score are up to 0.9895 and 0.9846, respectively. Meanwhile, the regression determination coefficient R² for MDD severity assessment is up to 0.9479. Compared with existing MDD detection methods with the best accuracy of 0.9840 and F1 score of 0.97, our proposed framework achieves the state-of-the-art MDD detection performance. Conclusions Development of this MDD detection framework can be potentially deployed into a medical system to aid physicians to screen out MDD patients.
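The two-stage idea described in this abstract, a Pearson correlation filter followed by recursive feature elimination, can be sketched roughly as follows. This is not the authors' PAR implementation: the logistic regression trained by plain gradient steps is an assumed stand-in for whatever estimator they rank features with, the toy matrix is invented, and the names `pearson`, `fit_logistic`, and `two_stage_select` are hypothetical. The sketch also assumes the features are already on comparable scales, so that weight magnitudes can be compared.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def fit_logistic(X, y, cols, epochs=200, lr=0.1):
    """Fit logistic regression on the chosen columns with plain SGD; return the weights."""
    w, b = [0.0] * len(cols), 0.0
    for _ in range(epochs):
        for row, target in zip(X, y):
            z = b + sum(wj * row[c] for wj, c in zip(w, cols))
            grad = 1.0 / (1.0 + math.exp(-z)) - target
            w = [wj - lr * grad * row[c] for wj, c in zip(w, cols)]
            b -= lr * grad
    return w

def two_stage_select(X, y, pcc_threshold, n_keep):
    """Stage 1: Pearson filter. Stage 2: drop the weakest feature until n_keep remain."""
    kept = [j for j in range(len(X[0]))
            if abs(pearson([row[j] for row in X], y)) >= pcc_threshold]
    while len(kept) > n_keep:
        w = fit_logistic(X, y, kept)  # refit on the surviving features each round
        weakest = min(range(len(kept)), key=lambda i: abs(w[i]))
        kept.pop(weakest)
    return kept

# Invented toy data: 8 samples, 4 features, values already scaled to [-1, 1].
X = [
    [1.0, 0.9, 0.5, 0.8],
    [1.0, 0.1, -0.5, 0.7],
    [1.0, 0.6, 0.5, 0.9],
    [1.0, -0.2, -0.5, 0.2],
    [-1.0, -0.3, 0.5, -0.6],
    [-1.0, 0.2, -0.5, -0.9],
    [-1.0, -0.8, 0.5, 0.1],
    [-1.0, 0.4, -0.5, -0.7],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

selected = two_stage_select(X, y, pcc_threshold=0.3, n_keep=2)
print(selected)
```

Running the cheap correlation filter first shrinks the candidate set before the more expensive refit-and-eliminate loop, which is the search-space benefit the abstract attributes to combining the two stages.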
Affiliation(s)
- Yujie Li
- School of Computer Science, South China Normal University, Guangzhou, China
- Yingshan Shen
- School of Computer Science, South China Normal University, Guangzhou, China
- Xiaomao Fan
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
- Xingxian Huang
- Department of Acupuncture and Moxibustion, Shenzhen Traditional Chinese Medicine Hospital, Shenzhen, China
- Haibo Yu
- Department of Acupuncture and Moxibustion, Shenzhen Traditional Chinese Medicine Hospital, Shenzhen, China
- Gansen Zhao
- School of Computer Science, South China Normal University, Guangzhou, China
- Wenjun Ma
- School of Computer Science, South China Normal University, Guangzhou, China
12
|
Simic V, Ebadi Torkayesh A, Ijadi Maghsoodi A. Locating a disinfection facility for hazardous healthcare waste in the COVID-19 era: a novel approach based on Fermatean fuzzy ITARA-MARCOS and random forest recursive feature elimination algorithm. ANNALS OF OPERATIONS RESEARCH 2022; 328:1-46. [PMID: 35821664 PMCID: PMC9263821 DOI: 10.1007/s10479-022-04822-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Accepted: 06/07/2022] [Indexed: 05/09/2023]
Abstract
The hazardous healthcare waste (HCW) management system is one of the most critical urban systems affected by the COVID-19 pandemic due to the increase in waste generation rate in hospitals and medical centers dealing with infected patients as well as the degree of hazardousness of generated waste due to exposure to the virus. In this regard, waste network flow would face severe problems without taking care of hazardous waste through disinfection facilities. For this purpose, this study aims to develop an advanced decision support system based on a multi-stage model that combines the random forest recursive feature elimination (RF-RFE) algorithm, the indifference threshold-based attribute ratio analysis (ITARA), and the measurement of alternatives and ranking according to compromise solution (MARCOS) method into a unique framework under the Fermatean fuzzy environment. In the first stage, the innovative Fermatean fuzzy RF-RFE algorithm extracts core criteria from a finite set of initial criteria. In the second stage, the novel Fermatean fuzzy ITARA determines the semi-objective importance of the core criteria. In the third stage, the new Fermatean fuzzy MARCOS method ranks alternatives. A real-life case study in Istanbul, Turkey, illustrates the applicability of the introduced methodology. Our empirical findings indicate that "Pendik" is the best among five candidate locations for siting a new disinfection facility for hazardous HCW in Istanbul. The sensitivity and comparative analyses confirmed that our approach is highly robust and reliable. This approach could be used to tackle other critical multi-dimensional problems related to COVID-19 and support sustainability and circular economy. Supplementary Information The online version contains supplementary material available at 10.1007/s10479-022-04822-0.
Affiliation(s)
- Vladimir Simic
  - Faculty of Transport and Traffic Engineering, University of Belgrade, Vojvode Stepe 305, 11010 Belgrade, Serbia
- Ali Ebadi Torkayesh
  - School of Business and Economics, RWTH Aachen University, 52072 Aachen, Germany
- Abtin Ijadi Maghsoodi
  - Department of Information Systems and Operations Management, Faculty of Business and Economics, Business School, University of Auckland, Auckland, 1010 New Zealand
|
13
|
Virtual reality for the observation of oncology models (VROOM): immersive analytics for oncology patient cohorts. Sci Rep 2022; 12:11337. [PMID: 35790803 PMCID: PMC9256599 DOI: 10.1038/s41598-022-15548-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 02/16/2022] [Accepted: 06/24/2022] [Indexed: 11/08/2022] Open
Abstract
The significant advancement of inexpensive and portable virtual reality (VR) and augmented reality devices has re-energised the research in the immersive analytics field. The immersive environment is different from a traditional 2D display used to analyse 3D data as it provides a unified environment that supports immersion in a 3D scene, gestural interaction, haptic feedback and spatial audio. Genomic data analysis has been used in oncology to understand better the relationship between genetic profile, cancer type, and treatment option. This paper proposes a novel immersive analytics tool for cancer patient cohorts in a virtual reality environment, virtual reality to observe oncology data models. We utilise immersive technologies to analyse the gene expression and clinical data of a cohort of cancer patients. Various machine learning algorithms and visualisation methods have also been deployed in VR to enhance the data interrogation process. This is supported with established 2D visual analytics and graphical methods in bioinformatics, such as scatter plots, descriptive statistical information, linear regression, box plot and heatmap into our visualisation. Our approach allows the clinician to interrogate the information that is familiar and meaningful to them while providing them immersive analytics capabilities to make new discoveries toward personalised medicine.
|
14
|
Xue Y, Cai X, Neri F. A multi-objective evolutionary algorithm with interval based initialization and self-adaptive crossover operator for large-scale feature selection in classification. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109420] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/05/2023]
|
15
|
Deng X, Li M, Wang L, Wan Q. RFCBF: Enhance the Performance and Stability of Fast Correlation-Based Filter. INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS 2022. [DOI: 10.1142/s1469026822500092] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
Feature selection is a preprocessing step that plays a crucial role in the domain of machine learning and data mining. Feature selection methods have been shown to be effective in removing redundant and irrelevant features, improving the learning algorithm’s prediction performance. Among the various methods of feature selection based on redundancy, the fast correlation-based filter (FCBF) is one of the most effective. In this paper, we developed a novel extension of FCBF, called resampling FCBF (RFCBF) that combines resampling technique to improve classification accuracy. We performed comprehensive experiments to compare the RFCBF with other state-of-the-art feature selection methods using three competitive classifiers (K-nearest neighbor, support vector machine, and logistic regression) on 12 publicly available datasets. The experimental results show that the RFCBF algorithm yields significantly better results than previous state-of-the-art methods in terms of classification accuracy and runtime.
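FCBF scores feature relevance and redundancy with symmetrical uncertainty (SU), a normalized form of mutual information for discrete variables. The sketch below shows only that measure, under the assumption that features have already been discretized; the resampling step that RFCBF adds is not described in enough detail here to reproduce, so it is left out. Function names are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2.0 * mutual_information / (hx + hy)

# Hypothetical discretized feature and class label.
feature = [0, 1, 0, 1]
label_same = [0, 1, 0, 1]         # fully determined by the feature
label_independent = [0, 0, 1, 1]  # statistically independent of the feature
su_same = symmetrical_uncertainty(feature, label_same)
su_independent = symmetrical_uncertainty(feature, label_independent)
```

A filter in the FCBF style would rank features by SU with the class and then discard a feature when its SU with an already selected feature is at least as large as its SU with the class.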
Affiliation(s)
- Xiongshi Deng
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang Jiangxi, P. R. China
- Min Li
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang Jiangxi, P. R. China
- Lei Wang
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang Jiangxi, P. R. China
- Qikang Wan
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang Jiangxi, P. R. China
|
16
|
Bhandari A, Tripathy BK, Jawad K, Bhatia S, Rahmani MKI, Mashat A. Cancer Detection and Prediction Using Genetic Algorithms. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:1871841. [PMID: 35615545 PMCID: PMC9126682 DOI: 10.1155/2022/1871841] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/07/2022] [Revised: 04/08/2022] [Accepted: 04/21/2022] [Indexed: 01/07/2023]
Abstract
Cancer is a wide category of diseases that is caused by the abnormal, uncontrollable growth of cells, and it is the second leading cause of death globally. Screening, early diagnosis, and prediction of recurrence give patients the best possible chance for successful treatment. However, these tests can be expensive and invasive and the results have to be interpreted by experts. Genetic algorithms (GAs) are metaheuristics that belong to the class of evolutionary algorithms. GAs can find the optimal or near-optimal solutions in huge, difficult search spaces and are widely used for search and optimization. This makes them ideal for detecting cancer by creating models to interpret the results of tests, especially noninvasive. In this article, we have comprehensively reviewed the existing literature, analyzed them critically, provided a comparative analysis of the state-of-the-art techniques, and identified the future challenges in the development of such techniques by medical professionals.
Affiliation(s)
- Khurram Jawad
  - College of Computing and Informatics, Saudi Electronic University, Riyadh, Saudi Arabia
- Surbhi Bhatia
  - Department of Information Systems, College of Computer Sciences and Information Technology, King Faisal University, Al Hasa, Saudi Arabia
- Arwa Mashat
  - Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh 21911, Saudi Arabia
|
17
|
Ji M, Xie W, Zhao M, Qian X, Chow CY, Lam KY, Yan J, Hao T. Probabilistic Prediction of Nonadherence to Psychiatric Disorder Medication from Mental Health Forum Data: Developing and Validating Bayesian Machine Learning Classifiers. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:6722321. [PMID: 35463247 PMCID: PMC9033323 DOI: 10.1155/2022/6722321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2022] [Revised: 02/16/2022] [Accepted: 03/19/2022] [Indexed: 11/18/2022]
Abstract
Background Medication nonadherence represents a major burden on national health systems. According to the World Health Organization, increasing medication adherence may have a greater impact on public health than any improvement in specific medical treatments. More research is needed to better predict populations at risk of medication nonadherence. Objective To develop clinically informative, easy-to-interpret machine learning classifiers to predict people with psychiatric disorders at risk of medication nonadherence based on the syntactic and structural features of written posts on health forums. Methods All data were collected from posts between 2016 and 2021 on mental health forum, administered by Together 4 Change, a long-running not-for-profit organisation based in Oxford, UK. The original social media data were annotated using the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity (TAASSC) system. Through applying multiple feature optimisation techniques, we developed a best-performing model using relevance vector machine (RVM) for the probabilistic prediction of medication nonadherence among online mental health forum discussants. Results The best-performing RVM model reached a mean AUC of 0.762, accuracy of 0.763, sensitivity of 0.779, and specificity of 0.742 on the testing dataset. It outperformed competing classifiers with more complex feature sets with statistically significant improvement in sensitivity and specificity, after adjusting the alpha levels with Benjamini-Hochberg correction procedure. Discussion. We used the forest plot of multiple logistic regression to explore the association between written post features in the best-performing RVM model and the binary outcome of medication adherence among online post contributors with psychiatric disorders. 
We found that increased quantities of 3 syntactic complexity features were negatively associated with psychiatric medication adherence: "dobj_stdev" (standard deviation of dependents per direct object of nonpronouns) (OR, 1.486, 95% CI, 1.202-1.838, P < 0.001), "cl_av_deps" (dependents per clause) (OR, 1.597, 95% CI, 1.202-2.122, P, 0.001), and "VP_T" (verb phrases per T-unit) (OR, 2.23, 95% CI, 1.211-4.104, P, 0.010). Finally, we illustrated the clinical use of the classifier with Bayes' monograph which gives the posterior odds and their 95% CI of positive (nonadherence) versus negative (adherence) cases as predicted by the best-performing classifier. The odds ratio of the posterior probability of positive cases was 3.9, which means that around 10 in every 13 psychiatric patients with a positive result as predicted by our model were following their medication regime. The odds ratio of the posterior probability of true negative cases was 0.4, meaning that around 10 in every 14 psychiatric patients with a negative test result after screening by our classifier were not adhering to their medications. Conclusion Psychiatric medication nonadherence is a large and increasing burden on national health systems. Using Bayesian machine learning techniques and publicly accessible online health forum data, our study illustrates the viability of developing cost-effective, informative decision aids to support the monitoring and prediction of patients at risk of medication nonadherence.
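The "10 in every 13" and "10 in every 14" figures in this abstract are consistent with converting the reported odds into proportions. A minimal sketch of that arithmetic, assuming the standard relation p = odds / (1 + odds); the function name is illustrative, and the code does not settle which outcome each proportion refers to.

```python
def odds_to_probability(odds):
    """Convert odds in favour of an event into the probability of that event."""
    return odds / (1.0 + odds)

# Reported posterior odds for a positive prediction: 3.9
p_event_given_positive = odds_to_probability(3.9)      # about 0.80, roughly 10 in 13

# Reported posterior odds for a negative prediction: 0.4
p_event_given_negative = odds_to_probability(0.4)      # about 0.29
p_no_event_given_negative = 1.0 - p_event_given_negative  # about 0.71, i.e. 10 in 14
```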
Affiliation(s)
- Meng Ji
  - School of Languages and Cultures, University of Sydney, Sydney, Australia
- Wenxiu Xie
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China
- Mengdan Zhao
  - School of Languages and Cultures, University of Sydney, Sydney, Australia
- Xiaobo Qian
  - School of Computer Science, South China Normal University, Guangzhou, Guangdong, China
- Chi-Yin Chow
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China
- Kam-Yiu Lam
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China
- Jun Yan
  - AI Lab, Yidu Cloud (Beijing) Technology Co. Ltd., Beijing, China
- Tianyong Hao
  - School of Computer Science, South China Normal University, Guangzhou, Guangdong, China
|
18
|
Xu J, Qu K, Meng X, Sun Y, Hou Q. Feature selection based on multiview entropy measures in multiperspective rough set. INT J INTELL SYST 2022. [DOI: 10.1002/int.22878] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/17/2023]
Affiliation(s)
- Jiucheng Xu
  - Engineering Lab of Intelligence Business & Internet of Things, Henan Province, Xinxiang, China
  - College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
- Kanglin Qu
  - Engineering Lab of Intelligence Business & Internet of Things, Henan Province, Xinxiang, China
  - College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
- Xiangru Meng
  - Engineering Lab of Intelligence Business & Internet of Things, Henan Province, Xinxiang, China
  - College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
- Yuanhao Sun
  - Engineering Lab of Intelligence Business & Internet of Things, Henan Province, Xinxiang, China
  - College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
- Qincheng Hou
  - Engineering Lab of Intelligence Business & Internet of Things, Henan Province, Xinxiang, China
  - College of Computer and Information Engineering, Henan Normal University, Xinxiang, China
|
19
|
Yu K, Huang M, Chen S, Feng C, Li W. GSEnet: feature extraction of gene expression data and its application to Leukemia classification. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:4881-4891. [PMID: 35430845 DOI: 10.3934/mbe.2022228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Gene expression data is highly dimensional. As disease-related genes account for only a tiny fraction, a deep learning model, namely GSEnet, is proposed to extract instructive features from gene expression data. This model consists of three modules, namely the pre-conv module, the SE-Resnet module, and the SE-conv module. Effectiveness of the proposed model on the performance improvement of 9 representative classifiers is evaluated. Seven evaluation metrics are used for this assessment on the GSE99095 dataset. Robustness and advantages of the proposed model compared with representative feature selection methods are also discussed. Results show superiority of the proposed model on the improvement of the classification precision and accuracy.
Affiliation(s)
- Kun Yu
  - College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110819, China
  - Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang, Liaoning 110819, China
- Mingxu Huang
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
- Shuaizheng Chen
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
- Chaolu Feng
  - Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang, Liaoning 110819, China
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang, Liaoning 110819, China
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
|
20
|
Canayaz M, Şehribanoğlu S, Özdağ R, Demir M. COVID-19 diagnosis on CT images with Bayes optimization-based deep neural networks and machine learning algorithms. Neural Comput Appl 2022; 34:5349-5365. [PMID: 35250180 PMCID: PMC8884105 DOI: 10.1007/s00521-022-07052-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/06/2021] [Accepted: 02/01/2022] [Indexed: 12/24/2022]
Affiliation(s)
- Murat Canayaz
  - Department of Computer Engineering, Van Yuzuncu Yil University, 65100 Van, Turkey
- Sanem Şehribanoğlu
  - Department of Econometrics, Van Yuzuncu Yil University, 65100 Van, Turkey
- Recep Özdağ
  - Department of Computer Engineering, Van Yuzuncu Yil University, 65100 Van, Turkey
- Murat Demir
  - Department of Software Engineering, Mus Alpaslan University, 49100 Mus, Turkey
|
21
|
Jaddi NS, Saniee Abadeh M. Cell separation algorithm with enhanced search behaviour in miRNA feature selection for cancer diagnosis. INFORM SYST 2022. [DOI: 10.1016/j.is.2021.101906] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
|
22
|
Deng X, Li M, Deng S, Wang L. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Med Biol Eng Comput 2022; 60:663-681. [PMID: 35028863 DOI: 10.1007/s11517-021-02476-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 04/10/2021] [Accepted: 11/23/2021] [Indexed: 12/15/2022]
Abstract
Microarray gene expression data are often accompanied by a large number of genes and a small number of samples. However, only a few of these genes are relevant to cancer, resulting in significant gene selection challenges. Hence, we propose a two-stage gene selection approach by combining extreme gradient boosting (XGBoost) and a multi-objective optimization genetic algorithm (XGBoost-MOGA) for cancer classification in microarray datasets. In the first stage, the genes are ranked using an ensemble-based feature selection using XGBoost. This stage can effectively remove irrelevant genes and yield a group comprising the most relevant genes related to the class. In the second stage, XGBoost-MOGA searches for an optimal gene subset based on the most relevant genes' group using a multi-objective optimization genetic algorithm. We performed comprehensive experiments to compare XGBoost-MOGA with other state-of-the-art feature selection methods using two well-known learning classifiers on 14 publicly available microarray expression datasets. The experimental results show that XGBoost-MOGA yields significantly better results than previous state-of-the-art algorithms in terms of various evaluation criteria, such as accuracy, F-score, precision, and recall.
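A multi-objective search such as the second stage described here returns a set of non-dominated trade-offs, not a single best subset. The sketch below illustrates only that Pareto filtering step. It assumes two objectives, higher classification accuracy and fewer selected genes, which the abstract does not state explicitly; the candidate values are made up for illustration.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives and better on one.

    Each candidate is (accuracy, n_genes): higher accuracy and fewer genes are better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Hypothetical gene subsets evaluated as (accuracy, number of genes).
candidates = [(0.90, 10), (0.92, 25), (0.85, 12), (0.92, 30), (0.80, 5)]
front = pareto_front(candidates)
```

Here `(0.85, 12)` drops out because `(0.90, 10)` is more accurate with fewer genes, and `(0.92, 30)` drops out because `(0.92, 25)` matches its accuracy with a smaller subset.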
Affiliation(s)
- Xiongshi Deng
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
- Min Li
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
- Shaobo Deng
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
- Lei Wang
  - School of Information Engineering, Nanchang Institute of Technology, Jiangxi, 330099, People's Republic of China
  - Jiangxi Province Key Laboratory of Water Information Cooperative Sensing and Intelligent Processing, Jiangxi, 330099, People's Republic of China
|
23
|
A new feature extraction technique based on improved owl search algorithm: a case study in copper electrorefining plant. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06881-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
24
|
|
25
|
Gu X, Guo J, Xiao L, Li C. Conditional mutual information-based feature selection algorithm for maximal relevance minimal redundancy. APPL INTELL 2022. [DOI: 10.1007/s10489-021-02412-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
26
|
A Machine-Learning Approach Combining Wavelet Packet Denoising with Catboost for Weather Forecasting. ATMOSPHERE 2021. [DOI: 10.3390/atmos12121618] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
Abstract
Accurate forecasting of future meteorological elements is critical and has profoundly affected human life in many aspects from rainstorm warning to flight safety. The conventional numerical weather prediction (NWP) sometimes leads to unsatisfactory performance due to inappropriate initial state settings. In this paper, a short-term weather forecasting model based on wavelet packet denoising and Catboost is proposed, which takes advantage of the fusion information combining the historical observation data with the prior knowledge from NWP. The feature selection and spatiotemporal feature addition are also explored to further improve performance. The proposed method is evaluated on the datasets provided by Beijing weather stations. Experimental results demonstrate that compared with many deep-learning or machine-learning methods such as LSTM, Seq2Seq, and random forest, the proposed Catboost model incorporated with wavelet packet denoising can achieve shorter convergence time and higher prediction accuracy.
|
27
|
Extraction of Kenyan Grassland Information Using PROBA-V Based on RFE-RF Algorithm. REMOTE SENSING 2021. [DOI: 10.3390/rs13234762] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
Abstract
Africa has the largest grassland area among all grassland ecosystems in the world. As a typical agricultural and animal husbandry country in Africa, animal husbandry plays an important role in this region. The investigation of grassland resources and timely grasping the quantity and spatial distribution of grassland resources are of great significance to the stable development of local animal husbandry economy. Therefore, this paper uses Kenya as the study area to investigate the effective and fast approach for grassland mapping with 100-m resolution using the open resources in the Google Earth Engine cloud platform. The main conclusions are as follows. (1) In the feature combination optimization part of this paper, the machine learning algorithm is used to compare the scores and standard deviations of several common algorithms combined with RFE. It is concluded that the combination of RFE and random forest algorithm has the highest stability in modeling and the best feature optimization effect. (2) After feature optimization by the RFE-RF algorithm, the number of features is reduced from 12 to 8, which compressed the original feature space and reduced the redundancy of features. The optimal combination features are applied to random forest classification, and the overall accuracy and Kappa coefficient of classification are 0.87 and 0.85, respectively. The eight features are: elevation, NDVI, EVI, SWIR, RVI, BLUE, RED, and LSWI. (3) There are great differences in topographic features among the local land types in the study area, and the addition of topographic features is more conducive to the recognition and classification of various land types. There exists “salt-and-pepper phenomenon” in pixel-oriented classification. Later research focus will combine the RFE-RF algorithm and the segmentation algorithm to achieve object-oriented land cover classification.
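The overall accuracy and Kappa coefficient reported in this abstract are both derived from a classification confusion matrix. A minimal sketch of the two computations follows, using Cohen's kappa and a made-up two-class matrix; the paper's own confusion matrix is not given here.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in cm)
    p_observed = overall_accuracy(cm)
    # Chance agreement: product of row and column marginals, summed over classes.
    p_expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical confusion matrix: rows are reference classes, columns are predictions.
cm = [[40, 10],
      [5, 45]]
accuracy = overall_accuracy(cm)
kappa = cohen_kappa(cm)
```

Kappa is lower than overall accuracy whenever some of the observed agreement could have occurred by chance, which is why the two values are usually reported together.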
|
28
|
Li Y, Li G, Guo L. Feature Selection for Regression Based on Gamma Test Nested Monte Carlo Tree Search. ENTROPY 2021; 23:e23101331. [PMID: 34682055 PMCID: PMC8535147 DOI: 10.3390/e23101331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2021] [Revised: 10/06/2021] [Accepted: 10/07/2021] [Indexed: 12/03/2022]
Abstract
This paper investigates the nested Monte Carlo tree search (NMCTS) for feature selection on regression tasks. NMCTS starts out with an empty subset and uses search results of lower nesting level simulation. Level 0 is based on random moves until the path reaches the leaf node. In order to accomplish feature selection on the regression task, the Gamma test is introduced to play the role of the reward function at the end of the simulation. The concept Vratio of the Gamma test is also combined with the original UCT-tuned1 and the design of stopping conditions in the selection and simulation phases. The proposed GNMCTS method was tested on seven numeric datasets and compared with six other feature selection methods. It shows better performance than the vanilla MCTS framework and maintains the relevant information in the original feature space. The experimental results demonstrate that GNMCTS is a robust and effective tool for feature selection. It can accomplish the task well in a reasonable computation budget.
Affiliation(s)
- Ying Li
  - Beijing Key Lab of Petroleum Data Mining, Department of Geophysics, China University of Petroleum, Beijing 102249, China
- Guohe Li
  - Beijing Key Lab of Petroleum Data Mining, Department of Geophysics, China University of Petroleum, Beijing 102249, China
  - Correspondence:
- Lingun Guo
  - Beijing Key Lab of Petroleum Data Mining, Department of Geophysics, China University of Petroleum, Beijing 102249, China
  - College of Software, Henan Normal University, Xinxiang 453007, China
|
29
|
Bommert A, Welchowski T, Schmid M, Rahnenführer J. Benchmark of filter methods for feature selection in high-dimensional gene expression survival data. Brief Bioinform 2021; 23:6366322. [PMID: 34498681 PMCID: PMC8769710 DOI: 10.1093/bib/bbab354] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 06/18/2021] [Revised: 08/05/2021] [Accepted: 08/10/2021] [Indexed: 11/30/2022] Open
Abstract
Feature selection is crucial for the analysis of high-dimensional data, but benchmark studies for data with a survival outcome are rare. We compare 14 filter methods for feature selection based on 11 high-dimensional gene expression survival data sets. The aim is to provide guidance on the choice of filter methods for other researchers and practitioners. We analyze the accuracy of predictive models that employ the features selected by the filter methods. Also, we consider the run time, the number of selected features for fitting models with high predictive accuracy as well as the feature selection stability. We conclude that the simple variance filter outperforms all other considered filter methods. This filter selects the features with the largest variance and does not take into account the survival outcome. Also, we identify the correlation-adjusted regression scores filter as a more elaborate alternative that allows fitting models with similar predictive accuracy. Additionally, we investigate the filter methods based on feature rankings, finding groups of similar filters.
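The variance filter singled out in this abstract is simple enough to state in a few lines: rank features by their variance and keep the top k, without consulting the outcome. A minimal sketch, with hypothetical gene names and values:

```python
from statistics import pvariance

def variance_filter(columns, k):
    """Return the names of the k features with the largest variance.

    `columns` maps a feature name to its list of values; the outcome is not used.
    """
    ranked = sorted(columns, key=lambda name: pvariance(columns[name]), reverse=True)
    return ranked[:k]

# Hypothetical expression values for three genes across four samples.
columns = {
    "g1": [1, 1, 1, 1],    # constant, variance 0
    "g2": [0, 10, 0, 10],  # variance 25
    "g3": [4, 5, 6, 5],    # variance 0.5
}
selected = variance_filter(columns, k=2)
```

Because the filter never looks at survival times, it is cheap to compute and cannot leak outcome information into the selection step.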
Affiliation(s)
- Andrea Bommert
  - Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227, Dortmund, Germany
- Thomas Welchowski
  - Institute of Medical Biometry, Informatics and Epidemiology (IMBIE), Medical Faculty, University of Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
- Matthias Schmid
  - Institute of Medical Biometry, Informatics and Epidemiology (IMBIE), Medical Faculty, University of Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
- Jörg Rahnenführer
  - Department of Statistics, TU Dortmund University, Vogelpothsweg 87, 44227, Dortmund, Germany
|
30
|
Yaseen ZM. An insight into machine learning models era in simulating soil, water bodies and adsorption heavy metals: Review, challenges and solutions. CHEMOSPHERE 2021; 277:130126. [PMID: 33774235 DOI: 10.1016/j.chemosphere.2021.130126] [Citation(s) in RCA: 80] [Impact Index Per Article: 26.7] [Received: 07/20/2020] [Revised: 01/23/2021] [Accepted: 02/23/2021] [Indexed: 06/12/2023]
Abstract
The development of computer aid models for heavy metals (HMs) simulation has been remarkably advanced over the past two decades. Several machine learning (ML) models have been developed for modeling HMs over the past two decades with outstanding progress. Although there have been a noticeable number of diverse ML models investigations, it is essential to have an informative vision on the progression of those computer aid models. In the current short review covering the simulation of heavy metals in contaminated soil, water bodies and removal from aqueous solution, numerous aspects on the methodological and conceptual HMs modeling are reviewed and discussed in detail. For instance, the limitation of the classical analytical methods, types of heavy metal dataset, necessity for new versions of ML models exploration, HM input parameters selection, ML models internal parameters tuning, performance metrics selection and the types of the modelled HM. The current review provides few outlooks in understanding the underlying od the ML models application for HM simulation. Tackling these modeling aspects is significantly essential for ML developers and environmental scientists to obtain creditability and scientific consistency in the domain of environmental science. Based on the discussed modeling aspects, it was concluded several future research directions, which will promote environmental scientists for better understanding of the underlying HMs simulation.
Affiliation(s)
- Zaher Mundher Yaseen
- New era and development in civil engineering research group, Scientific Research Center, Al-Ayen University, Thi-Qar, 64001, Iraq.
| |
|
31
|
Sheikhi G, Altınçay H. A novel dissimilarity metric based on feature‐to‐feature scatter frequencies for clustering‐based feature selection in biomedical data. Comput Intell 2021. [DOI: 10.1111/coin.12470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Ghazaal Sheikhi
- Department of Computer Engineering Final International University Kyrenia North Cyprus Turkey
| | - Hakan Altınçay
- Department of Computer Engineering Eastern Mediterranean University Famagusta North Cyprus Turkey
| |
|
32
|
Maâtouk O, Ayadi W, Bouziri H, Duval B. Evolutionary Local Search Algorithm for the biclustering of gene expression data based on biological knowledge. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107177] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
33
|
ZHANG HUAN, WANG XINPEI, LIU CHANGCHUN, LI YUANYANG, LIU YUANYUAN, LI PENG, YAO LIANKE, WANG JIKUO, JIAO YU. A METHOD FOR DETECTING CORONARY ARTERY STENOSIS BASED ON ECG SIGNALS. J MECH MED BIOL 2021. [DOI: 10.1142/s0219519421500032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Coronary heart disease (CHD) is a typical cardiovascular disease whose occurrence and development is a long process. Timely and accurate diagnosis of patients with varying degrees of coronary artery stenosis (VDCAS) is conducive to accurate treatment and prognosis assessment. This study aims to correctly classify VDCAS patients by utilizing multi-domain features fusion of single-lead 5-min ECG signals and machine learning methods, so as to provide reference for doctors to judge the CHD development process. ECG signals were collected from 206 subjects with CHD, mild CHD, thoracalgia and normal coronary angiograms (TNCA), and healthy. Then, the time, frequency, time–frequency, and nonlinear domain features of ECG signals were extracted to establish a multi-domain feature set. To get the optimum subset of features, the recursive feature elimination (RFE) and information gain (IG) were selected. Subsequently, eXtreme Gradient Boosting (XGBoost) and random forest (RF) were adopted for classification. Results indicated that RFE combined with XGBoost was significantly effective in classifying VDCAS patients. When the four categories of subjects (CHD, mild CHD, TNCA, and healthy) were classified, the average accuracy, sensitivity, specificity, and F1-score of the proposed method were 91.74%, 89.39%, 96.80%, and 90.09%, respectively. Besides, three categories of subjects (no stenosis, luminal narrowing [Formula: see text] 50%, and luminal narrowing [Formula: see text] 50%) and two categories of subjects (CHD and healthy) were also analyzed, and the average accuracy was 91.27% and 98.46%, respectively. The results suggest that the proposed method can provide reference for doctors to judge VDCAS patients.
Affiliation(s)
- HUAN ZHANG
- School of Control Science and Engineering, Shandong University Jinan, Shandong 250061, P. R. China
| | - XINPEI WANG
- School of Control Science and Engineering, Shandong University Jinan, Shandong 250061, P. R. China
| | - CHANGCHUN LIU
- School of Control Science and Engineering, Shandong University Jinan, Shandong 250061, P. R. China
| | - YUANYANG LI
- Department of Medical Engineering, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, Shandong 250061, P. R. China
| | - YUANYUAN LIU
- School of Control Science and Engineering, Shandong University Jinan, Shandong 250061, P. R. China
| | - PENG LI
- Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital; Division of Sleep Medicine, Harvard Medical School, Boston, MA 02115, USA
| | - LIANKE YAO
- School of Control Science and Engineering, Shandong University Jinan, Shandong 250061, P. R. China
| | - JIKUO WANG
- School of Control Science and Engineering, Shandong University Jinan, Shandong 250061, P. R. China
| | - YU JIAO
- School of Control Science and Engineering, Shandong University Jinan, Shandong 250061, P. R. China
| |
|
34
|
Xie W, Ji M, Zhao M, Zhou T, Yang F, Qian X, Chow CY, Lam KY, Hao T. Detecting Symptom Errors in Neural Machine Translation of Patient Health Information on Depressive Disorders: Developing Interpretable Bayesian Machine Learning Classifiers. Front Psychiatry 2021; 12:771562. [PMID: 34744846 PMCID: PMC8566668 DOI: 10.3389/fpsyt.2021.771562] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/06/2021] [Accepted: 09/24/2021] [Indexed: 11/13/2022] Open
Abstract
Background: Due to its convenience, wide availability, low usage cost, neural machine translation (NMT) has increasing applications in diverse clinical settings and web-based self-diagnosis of diseases. Given the developing nature of NMT tools, this can pose safety risks to multicultural communities with limited bilingual skills, low education, and low health literacy. Research is needed to scrutinise the reliability, credibility, usability of automatically translated patient health information. Objective: We aimed to develop high-performing Bayesian machine learning classifiers to assist clinical professionals and healthcare workers in assessing the quality and usability of NMT on depressive disorders. The tool did not require any prior knowledge from frontline health and medical professionals of the target language used by patients. Methods: We used Relevance Vector Machine (RVM) to increase generalisability and clinical interpretability of classifiers. It is a typical sparse Bayesian classifier less prone to overfitting with small training datasets. We optimised RVM by leveraging automatic recursive feature elimination and expert feature refinement from the perspective of health linguistics. We evaluated the diagnostic utility of the Bayesian classifier under different probability cut-offs in terms of sensitivity, specificity, positive and negative likelihood ratios against clinical thresholds for diagnostic tests. Finally, we illustrated interpretation of RVM tool in clinic using Bayes' nomogram. Results: After automatic and expert-based feature optimisation, the best-performing RVM classifier (RVM_DUFS12) gained the highest AUC (0.8872) among 52 competing models with distinct optimised, normalised features sets. It also had statistically higher sensitivity and specificity compared to other models. 
We evaluated the diagnostic utility of the best-performing model using Bayes' nomogram: it had a positive likelihood ratio (LR+) of 4.62 (95% C.I.: 2.53, 8.43), and the associated posterior probability (odds) was 83% (5.0) (95% C.I.: 73%, 90%), meaning that approximately 10 in 12 English texts with a positive test are likely to contain information that would cause clinically significant conceptual errors if translated by Google; it had a negative likelihood ratio (LR-) of 0.18 (95% C.I.: 0.10, 0.35), and the associated posterior probability (odds) was 16% (0.2) (95% C.I.: 10%, 27%), meaning that about 10 in 12 English texts with a negative test can be safely translated using Google.
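The step from a likelihood ratio to a posterior probability in the abstract above follows the standard odds form of Bayes' rule (post-test odds = pre-test odds × likelihood ratio). The sketch below is only an illustration of that arithmetic. The abstract does not report the pre-test probability, so the value 0.5 used here is an assumption, and the function names are hypothetical; the results are therefore close to, but not identical with, the reported 83% and 16%.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds (e.g. 5.0, meaning 5 to 1) into a probability."""
    return odds / (1.0 + odds)


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Apply the odds form of Bayes' rule: post-test odds = pre-test odds * LR."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre_test_probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    return odds_to_probability(pre_test_odds * likelihood_ratio)


# Odds quoted in the abstract (rounded there), converted back to probabilities.
reported_positive = odds_to_probability(5.0)
reported_negative = odds_to_probability(0.2)

# ASSUMPTION: the pre-test probability is not given in the abstract; 0.5 is illustrative.
ASSUMED_PRE_TEST_PROBABILITY = 0.5
post_positive = post_test_probability(ASSUMED_PRE_TEST_PROBABILITY, 4.62)  # LR+ from the abstract
post_negative = post_test_probability(ASSUMED_PRE_TEST_PROBABILITY, 0.18)  # LR- from the abstract

print(f"reported odds 5.0 -> probability {reported_positive:.3f}")
print(f"reported odds 0.2 -> probability {reported_negative:.3f}")
print(f"post-test probability after a positive test: {post_positive:.3f}")
print(f"post-test probability after a negative test: {post_negative:.3f}")
```

A likelihood ratio of 1 leaves the pre-test probability unchanged, which is a quick sanity check on the helper.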
Affiliation(s)
- Wenxiu Xie
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, SAR China
| | - Meng Ji
- School of Languages and Cultures, The University of Sydney, Darlington, NSW, Australia
| | - Mengdan Zhao
- School of Languages and Cultures, The University of Sydney, Darlington, NSW, Australia
| | - Tianqi Zhou
- School of Languages and Cultures, The University of Sydney, Darlington, NSW, Australia
| | - Fan Yang
- Independent Researcher, Sichuan, China
| | - Xiaobo Qian
- School of Computer Science, South China Normal University, Guangzhou, China
| | - Chi-Yin Chow
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, SAR China
| | - Kam-Yiu Lam
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, SAR China
| | - Tianyong Hao
- School of Computer Science, South China Normal University, Guangzhou, China
| |
|
35
|
|
36
|
Guo J, Jin M, Chen Y, Liu J. An embedded gene selection method using knockoffs optimizing neural network. BMC Bioinformatics 2020; 21:414. [PMID: 32962627 PMCID: PMC7510330 DOI: 10.1186/s12859-020-03717-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2020] [Accepted: 08/19/2020] [Indexed: 11/30/2022] Open
Abstract
Background: Gene selection refers to finding a small subset of discriminant genes from gene expression profiles. How to effectively select genes that affect specific phenotypic traits is an important research problem in biology. Neural networks have better fitting ability when dealing with nonlinear data, and they can capture features automatically and flexibly. In this work, we propose an embedded gene selection method using a neural network. The important genes can be obtained by calculating the weight coefficients after training is completed. To address the black-box nature of neural networks and make the training results more interpretable, we use the idea of knockoffs to construct knockoff feature genes from the original feature genes. This method not only makes the feature genes compete with each other, but also makes each feature gene compete with its knockoff feature gene. This approach can help to select the key genes that affect the decision-making of neural networks. Results: We use maize carotenoids, tocopherol methyltransferase, raffinose family oligosaccharides and human breast cancer datasets for verification and analysis. Conclusions: The experimental results demonstrate that the knockoffs optimizing neural network method has a better detection effect than the other existing algorithms, especially for processing nonlinear gene expression and phenotype data.
Affiliation(s)
- Juncheng Guo
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, 10049, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, 10049, China
| | - Min Jin
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
| | - Yuanyuan Chen
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
| | - Jianxiao Liu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China; National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
| |
|
37
|
|
38
|
A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
|
39
|
A machine learning-based framework for Predicting Treatment Failure in tuberculosis: A case study of six countries. Tuberculosis (Edinb) 2020; 123:101944. [PMID: 32741529 DOI: 10.1016/j.tube.2020.101944] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2019] [Revised: 02/19/2020] [Accepted: 04/22/2020] [Indexed: 11/24/2022]
Abstract
Tuberculosis is ranked as the 2nd deadliest disease in the world and is responsible for ten million deaths in 2017. Treatment failure is one of the main reasons behind these deaths. The reasons for treatment failure are still unknown, and the death rate due to TB is increasing. Machine learning and data analytics approaches have proved useful in the healthcare domain for finding associations among the different attributes that can affect the outcome of a disease. Timely identification of these reasons can save a patient's life. This study aims to find features that are strongly correlated with treatment failure using feature selection techniques. The features are validated using different classification algorithms. Moreover, this study provides a demographic-based feature association for six countries with a high burden of treatment failure. A verified real-life patient dataset gathered from Azerbaijan, Belarus, Georgia, India, Moldova, and Romania is used to address the problem. Two types of experiments are performed, achieving an average accuracy of 78% on the combined dataset and an accuracy of 92% on Romania's data. The results show that the features obtained through this study are highly influential in leading a patient towards treatment failure.
|
40
|
Heuristic filter feature selection methods for medical datasets. Genomics 2020; 112:1173-1181. [DOI: 10.1016/j.ygeno.2019.07.002] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2019] [Revised: 06/19/2019] [Accepted: 07/01/2019] [Indexed: 11/23/2022]
|
41
|
Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal 2020. [DOI: 10.1016/j.csda.2019.106839] [Citation(s) in RCA: 206] [Impact Index Per Article: 51.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
|
42
|
Role of microRNAs as Clinical Cancer Biomarkers for Ovarian Cancer: A Short Overview. Cells 2020; 9:cells9010169. [PMID: 31936634 PMCID: PMC7016727 DOI: 10.3390/cells9010169] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Revised: 12/28/2019] [Accepted: 01/06/2020] [Indexed: 12/15/2022] Open
Abstract
Ovarian cancer has the highest mortality rate among gynecological cancers. Early clinical signs are missing and there is an urgent need to establish early diagnosis biomarkers. MicroRNAs are promising biomarkers in this respect. In this paper, we review the most recent advances regarding the alterations of microRNAs in ovarian cancer. We briefly describe the contribution of miRNAs to the mechanisms of ovarian cancer invasion, metastasis, and chemotherapy sensitivity. We also summarize the alterations undergone by microRNAs in solid ovarian tumors, in animal models for ovarian cancer, and in various ovarian cancer cell lines, in contrast to previous reviews that focused only on circulating microRNAs as biomarkers. In this context, we consider that biomarker screening should not be limited to circulating microRNAs per se, but should also include the simultaneous detection of the same microRNA alteration in solid tumors, in order to understand the differences between the detection of nucleic acids in early vs. late stages of cancer. Moreover, in vitro and in vivo models should also validate these microRNAs, which could be very helpful as preclinical testing platforms for pharmacological and/or molecular genetic approaches targeting microRNAs. The enormous quantity of data produced by preclinical and clinical studies regarding the role of microRNAs that act synergistically in tumorigenesis mechanisms associated with ovarian cancer subtypes should be gathered, integrated, and compared by adequate methods, including molecular clustering. In this respect, molecular clustering analysis should contribute to the discovery of the best microRNA-based biomarker assays, which will enable rapid, efficient, and cost-effective detection of ovarian cancer in early stages. In conclusion, identifying the appropriate microRNAs as clinical biomarkers in ovarian cancer might improve the quality of life of patients.
|
43
|
Object-Based Tree Species Classification Using Airborne Hyperspectral Images and LiDAR Data. FORESTS 2019. [DOI: 10.3390/f11010032] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
The identification of tree species is one of the most basic and key indicators in forest resource monitoring with great significance in the actual forest resource survey and it can comprehensively improve the efficiency of forest resource monitoring. The related research has mainly focused on single tree species without considering multiple tree species, and therefore the ability to classify forest tree species in complex stand is not clear, especially in the subtropical monsoon climate region of southern China. This study combined airborne hyperspectral data with simultaneously acquired LiDAR data, to evaluate the capability of feature combinations and k-nearest neighbor (KNN) and support vector machine (SVM) classifiers to identify tree species, in southern China. First, the stratified classification method was used to remove non-forest land. Second, the feature variables were extracted from airborne hyperspectral image and LiDAR data, including independent component analysis (ICA) transformation images, spectral indices, texture features, and canopy height model (CHM). Third, random forest and recursion feature elimination methods were adopted for feature selection. Finally, we selected different feature combinations and used KNN and SVM classifiers to classify tree species. The results showed that the SVM classifier has a higher classification accuracy as compared with KNN classifier, with the highest classification accuracy of 94.68% and a Kappa coefficient of 0.937. Through feature elimination, the classification accuracy and performance of SVM classifier was further improved. Recursive feature elimination method based on SVM is better than random forest. In the spectral indices, the new constructed slope spectral index, SL2, has a certain effect on improving the classification accuracy of tree species. Texture features and CHM height information can effectively distinguish tree species with similar spectral features. 
The height information plays an important role in improving the classification accuracy of other broad-leaved species. In general, the combination of different features can improve the classification accuracy, and the proposed strategies and methods are effective for identifying tree species in complex forest types in southern China.
|
44
|
|
45
|
Sun L, Zhang X, Qian Y, Xu J, Zhang S. Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.05.072] [Citation(s) in RCA: 109] [Impact Index Per Article: 21.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
|
46
|
MLW-gcForest: A Multi-Weighted gcForest Model for Cancer Subtype Classification by Methylation Data. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9173589] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Effective cancer treatment requires a clear subtype. Due to the small sample size, high dimensionality, and class imbalances of cancer gene data, classifying cancer subtypes by traditional machine learning methods remains challenging. The gcForest algorithm is a combination of machine learning methods and a deep neural network and has been indicated to achieve better classification of small samples of data. However, the gcForest algorithm still faces many challenges when this method is applied to the classification of cancer subtypes. In this paper, we propose an improved gcForest algorithm (MLW-gcForest) to study the applicability of this method to the small sample sizes, high dimensionality, and class imbalances of genetic data. The main contributions of this algorithm are as follows: (1) Different weights are assigned to different random forests according to the classification ability of the forests. (2) We propose a sorting optimization algorithm that assigns different weights to the feature vectors generated under different sliding windows. The MLW-gcForest model is trained on the methylation data of five data sets from the cancer genome atlas (TCGA). The experimental results show that the MLW-gcForest algorithm achieves high accuracy and area under curve (AUC) values for the classification of cancer subtypes compared with those of traditional machine learning methods and state of the art methods. The results also show that methylation data can be effectively used to diagnose cancer.
|
47
|
Maâtouk O, Ayadi W, Bouziri H, Duval B. Evolutionary biclustering algorithms: an experimental study on microarray data. Soft comput 2019. [DOI: 10.1007/s00500-018-3394-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
48
|
Su R, Liu X, Wei L. MinE-RFE: determine the optimal subset from RFE by minimizing the subset-accuracy–defined energy. Brief Bioinform 2019; 21:687-698. [DOI: 10.1093/bib/bbz021] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2018] [Revised: 01/24/2019] [Accepted: 02/02/2019] [Indexed: 01/18/2023] Open
Abstract
Recursive feature elimination (RFE), as one of the most popular feature selection algorithms, has been extensively applied to bioinformatics. During the training, a group of candidate subsets are generated by iteratively eliminating the least important features from the original features. However, how to determine the optimal subset from them still remains ambiguous. Among most current studies, either overall accuracy or subset size (SS) is used to select the most predictive features. Using which one or both and how they affect the prediction performance are still open questions. In this study, we proposed MinE-RFE, a novel RFE-based feature selection approach by sufficiently considering the effect of both factors. Subset decision problem was reflected into subset-accuracy space and became an energy-minimization problem. We also provided a mathematical description of the relationship between the overall accuracy and SS using Gaussian Mixture Models together with spline fitting. Besides, we comprehensively reviewed a variety of state-of-the-art applications in bioinformatics using RFE. We compared their approaches of deciding the final subset from all the candidate subsets with MinE-RFE on diverse bioinformatics data sets. Additionally, we also compared MinE-RFE with some well-used feature selection algorithms. The comparative results demonstrate that the proposed approach exhibits the best performance among all the approaches. To facilitate the use of MinE-RFE, we further established a user-friendly web server with the implementation of the proposed approach, which is accessible at http://qgking.wicp.net/MinE/. We expect this web server will be a useful tool for research community.
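As an illustration of the candidate-subset generation described in the abstract above, the following is a minimal sketch of plain recursive feature elimination. It is not the MinE-RFE method and does not choose a final subset. The tiny logistic-regression ranker, the synthetic data, and all function names are assumptions made for this example.

```python
import math
import random


def train_linear_weights(rows, labels, feature_ids, epochs=200, learning_rate=0.1):
    """Fit a small logistic regression by batch gradient descent; return weight per feature."""
    weights = {f: 0.0 for f in feature_ids}
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad = {f: 0.0 for f in feature_ids}
        grad_bias = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(weights[f] * x[f] for f in feature_ids)
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
            error = 1.0 / (1.0 + math.exp(-z)) - y
            for f in feature_ids:
                grad[f] += error * x[f]
            grad_bias += error
        for f in feature_ids:
            weights[f] -= learning_rate * grad[f] / n
        bias -= learning_rate * grad_bias / n
    return weights


def recursive_feature_elimination(rows, labels, n_features):
    """Return nested candidate subsets, dropping the least important feature each round."""
    remaining = list(range(n_features))
    candidate_subsets = [list(remaining)]
    while len(remaining) > 1:
        weights = train_linear_weights(rows, labels, remaining)  # refit on current subset
        weakest = min(remaining, key=lambda f: abs(weights[f]))  # smallest |weight|
        remaining.remove(weakest)
        candidate_subsets.append(list(remaining))
    return candidate_subsets


# Synthetic data (hypothetical): feature 0 tracks the label, features 1 and 2 are noise.
random.seed(0)
labels = [i % 2 for i in range(40)]
rows = [[y + random.gauss(0, 0.3), random.gauss(0, 1), random.gauss(0, 1)] for y in labels]

subsets = recursive_feature_elimination(rows, labels, n_features=3)
for subset in subsets:
    print(len(subset), subset)
```

The sketch stops at producing the nested subsets; picking one of them, by accuracy, by subset size, or by both, is the open question the abstract addresses.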
Affiliation(s)
- Ran Su
- School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Xinyi Liu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Leyi Wei
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
|
49
|
Zhao D, Liu H, Zheng Y, He Y, Lu D, Lyu C. Whale optimized mixed kernel function of support vector machine for colorectal cancer diagnosis. J Biomed Inform 2019; 92:103124. [PMID: 30796977 DOI: 10.1016/j.jbi.2019.103124] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2018] [Revised: 01/15/2019] [Accepted: 02/04/2019] [Indexed: 12/17/2022]
Abstract
The microarray technique is a prevalent method for the classification and prediction of colorectal cancer (CRC). Nevertheless, microarray data suffer from the curse of dimensionality when feature genes of the disease are selected from imbalanced samples, which causes low prediction accuracy. Hence, it is of vital significance to build proper models that can avoid these problems and predict CRC more accurately. In this paper, we use an ensemble model to classify samples into healthy and CRC groups and improve prediction performance. The proposed model is composed of three functional modules. The first module removes redundant genes: the main feature genes are selected using the minimum redundancy maximum relevance (mRMR) method to reduce the dimensionality of the features, thereby improving the prediction results. The second module addresses the problem caused by imbalanced data using the hybrid sampling algorithm RUSBoost. The third module focuses on optimizing the classification algorithm. We use a mixed kernel function (MKF) based support vector machine (SVM) model to classify an unknown sample as a healthy individual or a CRC patient, and the Whale Optimization Algorithm (WOA) is then applied to find the optimal parameters of the proposed MKF-SVM. The final results show that the proposed model achieves higher G-means than other comparable models. We conclude that the RUSBoost wrapping WOA + MKF-SVM model can be applied to improve the predictive performance for colorectal cancer based on imbalanced data.
Affiliation(s)
- Dandan Zhao
- School of Information Science and Engineering, Shandong Normal University, Jinan City, China; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan City, China
| | - Hong Liu
- School of Information Science and Engineering, Shandong Normal University, Jinan City, China; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan City, China.
| | - Yuanjie Zheng
- School of Information Science and Engineering, Shandong Normal University, Jinan City, China; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan City, China
| | - Yanlin He
- School of Information Science and Engineering, Shandong Normal University, Jinan City, China; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan City, China
| | - Dianjie Lu
- School of Information Science and Engineering, Shandong Normal University, Jinan City, China; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan City, China
| | - Chen Lyu
- School of Information Science and Engineering, Shandong Normal University, Jinan City, China; Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology, Jinan City, China
| |
|
50
|
Sun L, Zhang X, Xu J, Zhang S. An Attribute Reduction Method Using Neighborhood Entropy Measures in Neighborhood Rough Sets. ENTROPY 2019; 21:e21020155. [PMID: 33266871 PMCID: PMC7514638 DOI: 10.3390/e21020155] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/08/2018] [Revised: 01/22/2019] [Accepted: 02/01/2019] [Indexed: 11/16/2022]
Abstract
Attribute reduction is an important preprocessing step for data mining and has become a hot research topic in rough set theory. Neighborhood rough set theory can overcome the shortcoming that classical rough set theory may lose some useful information in the process of discretization for continuous-valued data sets. In this paper, to improve the classification performance of complex data, a novel attribute reduction method using neighborhood entropy measures, combining algebra view with information view, in neighborhood rough sets is proposed, which has the ability of dealing with continuous data whilst maintaining the classification information of original attributes. First, to efficiently analyze the uncertainty of knowledge in neighborhood rough sets, by combining neighborhood approximate precision with neighborhood entropy, a new average neighborhood entropy, based on the strong complementarity between the algebra definition of attribute significance and the definition of information view, is presented. Then, a concept of decision neighborhood entropy is investigated for handling the uncertainty and noisiness of neighborhood decision systems, which integrates the credibility degree with the coverage degree of neighborhood decision systems to fully reflect the decision ability of attributes. Moreover, some of their properties are derived and the relationships among these measures are established, which helps to understand the essence of knowledge content and the uncertainty of neighborhood decision systems. Finally, a heuristic attribute reduction algorithm is proposed to improve the classification performance of complex data sets. The experimental results under an instance and several public data sets demonstrate that the proposed method is very effective for selecting the most relevant attributes with great classification performance.
Affiliation(s)
- Lin Sun
- College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
- Engineering Technology Research Center for Computing Intelligence and Data Mining, Henan 453007, China
| | - Xiaoyu Zhang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
| | - Jiucheng Xu
- College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
- Engineering Technology Research Center for Computing Intelligence and Data Mining, Henan 453007, China
| | - Shiguang Zhang
- College of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
- Engineering Technology Research Center for Computing Intelligence and Data Mining, Henan 453007, China
| |
|