1. Claude E, Leclercq M, Thébault P, Droit A, Uricaru R. Optimizing hybrid ensemble feature selection strategies for transcriptomic biomarker discovery in complex diseases. NAR Genom Bioinform 2024; 6:lqae079. [PMID: 38993634; PMCID: PMC11237901; DOI: 10.1093/nargab/lqae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Revised: 06/03/2024] [Accepted: 06/21/2024] [Indexed: 07/13/2024] Open
Abstract
Biomedical research takes advantage of omic data, such as transcriptomics, to unravel the complexity of diseases. A conventional strategy identifies transcriptomic biomarkers characterized by expression patterns associated with a phenotype by relying on feature selection approaches. Hybrid ensemble feature selection (HEFS) has become increasingly popular as it ensures robustness of the selected features by performing data and functional perturbations. However, it remains difficult to make the best suited choices at each step when designing such approaches. We conducted an extensive analysis of four possible HEFS scenarios for the identification of Stage IV colorectal, Stage I kidney and lung and Stage III endometrial cancer biomarkers from transcriptomic data. These scenarios investigate the use of two types of feature reduction by filters (differentially expressed genes and variance) conjointly with two types of resampling strategies (repeated holdout by distribution-balanced stratified and random stratified) for downstream feature selection through an aggregation of thousands of wrapped machine learning models. Based on our results, we emphasize the advantages of using HEFS approaches to identify complex disease biomarkers, given their ability to produce generalizable and stable results to both data and functional perturbations. Finally, we highlight critical issues that need to be considered in the design of such strategies.
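The data-perturbation half of an ensemble feature selection scheme like the one described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it combines a variance filter with repeated stratified holdout and keeps features by selection frequency, and it omits the wrapped machine learning models (the functional perturbation). All function names, the 70% training fraction, and the 0.8 frequency cut-off are assumptions made for the example.

```python
import random
from collections import Counter
from statistics import pvariance


def variance_filter(X, names, keep):
    """Return the `keep` feature names with the highest variance."""
    cols = list(zip(*X))
    ranked = sorted(range(len(names)), key=lambda j: pvariance(cols[j]), reverse=True)
    return [names[j] for j in ranked[:keep]]


def stratified_holdout(y, train_frac, rng):
    """Return training indices, sampling the same fraction from each class."""
    train = []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        train.extend(idx[: max(1, round(train_frac * len(idx)))])
    return sorted(train)


def ensemble_select(X, y, names, keep, n_repeats=50, min_freq=0.8, seed=0):
    """Keep features chosen by the filter in at least `min_freq` of the resamples."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_repeats):
        train = stratified_holdout(y, 0.7, rng)  # hypothetical 70/30 split
        counts.update(variance_filter([X[i] for i in train], names, keep))
    return sorted(f for f, c in counts.items() if c / n_repeats >= min_freq)


# Toy data: feature "a" varies strongly, "b" is constant, "c" barely varies.
X = [[10.0 * i, 1.0, 0.1 * (i % 2)] for i in range(10)]
y = [0] * 5 + [1] * 5
print(ensemble_select(X, y, ["a", "b", "c"], keep=1))
```

Aggregating by selection frequency is one simple way to reward features that survive resampling; other aggregation rules (rank averaging, for instance) would slot into `ensemble_select` in the same place.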
Affiliation(s)
- Elsa Claude
  - Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Mickaël Leclercq
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Patricia Thébault
  - Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France
- Arnaud Droit
  - Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada
- Raluca Uricaru
  - Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France

2. Bodalal Z, Hong EK, Trebeschi S, Kurilova I, Landolfi F, Bogveradze N, Castagnoli F, Randon G, Snaebjornsson P, Pietrantonio F, Lee JM, Beets G, Beets-Tan R. Non-invasive CT radiomic biomarkers predict microsatellite stability status in colorectal cancer: a multicenter validation study. Eur Radiol Exp 2024; 8:98. [PMID: 39186200; PMCID: PMC11347521; DOI: 10.1186/s41747-024-00484-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 05/30/2024] [Indexed: 08/27/2024] Open
Abstract
BACKGROUND Microsatellite instability (MSI) status is a strong predictor of response to immunotherapy in colorectal cancer. Radiogenomic approaches promise insight into the underlying tumor biology using non-invasive routine clinical images. This study investigates the association between tumor morphology and MSI versus microsatellite stability (MSS) status, validating a novel radiomic signature on an external multicenter cohort.
METHODS Preoperative computed tomography scans with matched MSI status were retrospectively collected for 243 colorectal cancer patients from three hospitals: Seoul National University Hospital (SNUH); Netherlands Cancer Institute (NKI); and Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy (INT). Radiologists delineated primary tumors in each scan, from which radiomic features were extracted. Machine learning models trained on SNUH data to identify MSI tumors underwent external validation using NKI and INT images. Performances were compared in terms of area under the receiver operating characteristic curve (AUROC).
RESULTS We identified a radiomic signature comprising seven radiomic features that were predictive of tumors with MSS or MSI (AUROC 0.69, 95% confidence interval [CI] 0.54-0.84, p = 0.018). Integrating radiomic and clinical data into an algorithm improved predictive performance to an AUROC of 0.78 (95% CI 0.60-0.91, p = 0.002) and enhanced the reliability of the predictions.
CONCLUSION Differences in the radiomic morphological phenotype between MSS and MSI tumors could be detected using radiogenomic approaches. Future research involving large-scale multicenter prospective studies that combine various diagnostic data is necessary to refine and validate more robust, potentially tumor-agnostic MSI radiogenomic models.
RELEVANCE STATEMENT Noninvasive radiomic signatures derived from computed tomography scans can predict MSI in colorectal cancer, potentially augmenting traditional biopsy-based methods and enhancing personalized treatment strategies.
KEY POINTS Noninvasive CT-based radiomics predicted MSI in colorectal cancer, enhancing stratification. A seven-feature radiomic signature differentiated tumors with MSI from those with MSS in multicenter cohorts. Integrating radiomic and clinical data improved the algorithm's predictive performance.
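The abstract reports AUROC values with 95% confidence intervals but does not say how the intervals were obtained. As a hedged illustration only, the sketch below computes AUROC from its pairwise definition and attaches a percentile bootstrap interval; the bootstrap is an assumed method here, and the function names and toy scores are hypothetical.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC; resamples lacking a class are redrawn."""
    auroc(labels, scores)  # fail early if a class is missing
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ls = [labels[i] for i in idx]
        if 0 < sum(ls) < n:
            stats.append(auroc(ls, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75: three of the four positive/negative pairs are ordered correctly
print(bootstrap_ci(labels, scores, n_boot=200))
```

With cohorts as small as an external validation set, the interval is wide, which is consistent with the broad CIs quoted in the abstract.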
Affiliation(s)
- Zuhir Bodalal
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - GROW Research Institute for Oncology and Developmental Biology, Maastricht University, Maastricht, The Netherlands
- Eun Kyoung Hong
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - GROW Research Institute for Oncology and Developmental Biology, Maastricht University, Maastricht, The Netherlands
  - Seoul National University Hospital, Seoul, South Korea
- Stefano Trebeschi
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - GROW Research Institute for Oncology and Developmental Biology, Maastricht University, Maastricht, The Netherlands
- Ieva Kurilova
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - GROW Research Institute for Oncology and Developmental Biology, Maastricht University, Maastricht, The Netherlands
- Federica Landolfi
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Radiology Unit, Sant'Andrea Hospital, Sapienza University of Rome, Rome, Italy
- Nino Bogveradze
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - GROW Research Institute for Oncology and Developmental Biology, Maastricht University, Maastricht, The Netherlands
  - Department of Radiology, American Hospital Tbilisi, Tbilisi, Georgia
- Francesca Castagnoli
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Department of Radiology, Royal Marsden Hospital, London, UK
  - Division of Radiotherapy and Imaging, The Institute of Cancer Research, London, UK
- Giovanni Randon
  - Department of Medical Oncology, Fondazione IRCCS Istituto Nazionale dei Tumori di Milano, Milan, Italy
- Petur Snaebjornsson
  - Department of Pathology, Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Faculty of Medicine, University of Iceland, Reykjavik, Iceland
- Filippo Pietrantonio
  - Department of Medical Oncology, Fondazione IRCCS Istituto Nazionale dei Tumori di Milano, Milan, Italy
  - Oncology and Hemato-oncology Department, University of Milan, Milan, Italy
- Jeong Min Lee
  - Seoul National University Hospital, Seoul, South Korea
- Geerard Beets
  - GROW Research Institute for Oncology and Developmental Biology, Maastricht University, Maastricht, The Netherlands
  - Department of Surgery, Netherlands Cancer Institute, Amsterdam, The Netherlands
- Regina Beets-Tan
  - Department of Radiology, The Netherlands Cancer Institute, Amsterdam, The Netherlands
  - GROW Research Institute for Oncology and Developmental Biology, Maastricht University, Maastricht, The Netherlands
  - Institute of Regional Health Research, University of Southern Denmark, Odense, Denmark

3. Canero FM, Rodriguez-Galiano V, Aragones D. Machine Learning and Feature Selection for soil spectroscopy. An evaluation of Random Forest wrappers to predict soil organic matter, clay, and carbonates. Heliyon 2024; 10:e30228. [PMID: 38707402; PMCID: PMC11066688; DOI: 10.1016/j.heliyon.2024.e30228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Revised: 04/19/2024] [Accepted: 04/22/2024] [Indexed: 05/07/2024] Open
Abstract
Soil spectroscopy estimates soil properties using the absorption features in soil spectra. However, modelling soil properties with soil spectroscopy is challenging due to the high dimensionality of spectral data. Feature selection wrapper methods are promising approaches to reduce the dimensionality but are barely used in soil spectroscopy. The aim of this study is to evaluate the performance of two feature selection wrapper methods, Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS), built using the Random Forest (RF) algorithm, for dimensionality reduction of spectral data and predictive modelling of soil organic matter (SOM), clay and carbonates. The reflectance of 100 soil samples, acquired from Sierra de las Nieves (Spain), was measured under laboratory conditions using ASD FieldSpec Pro JR. Four different datasets were obtained after applying two spectral preprocessing methods to raw spectra: raw spectra, Continuum Removal (CR), Multiplicative Scatter Correction (MSC), and a so-called "Global" dataset composed of raw, CR and MSC features. The performance of RF models built with feature selection methods was compared to that of Partial Least Squares Regression (PLSR) and RF (alone). RF models built with SFS and SFFS outperformed PLSR and RF alone models: the best RF models with feature selection had a respective ratio of performance to interquartile distance of 1.93, 0.38 and 2.56. PLSR models had an accuracy of 1.41, 0.29 and 1.81 for SOM, carbonates, and clay, respectively. RF alone had a respective performance of 1.29, 0.29 and 1.81. The application of feature selection wrapper methods reduced the number of features to less than 1 % of the starting features. Features were selected across all spectra for SOM and clay, and around 900 nm, 1900 nm, and 2350 nm for carbonates. However, feature selection highlighted features around 1100 nm in SOM modelling, as well as other features around 2200 nm, which is considered a main absorption feature of clay. The application of feature selection with Random Forest was very important in improving modelling accuracy, reducing the redundant features and avoiding the curse of dimensionality or Hughes effect. Thus, this research showed an alternative to dimensionality reduction approaches that have been applied to date to model soil properties with spectroscopy and paves the way for further scientific investigation based on feature selection methods and machine learning.
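Sequential Forward Selection, the simpler of the two wrappers named above, is a greedy loop: add whichever remaining feature improves the score most, and stop when nothing helps. The sketch below shows only that loop. The `score` callback is a stand-in for whatever the wrapper evaluates (in the study, a Random Forest model's performance); the band names and weights are invented for the toy example, and the floating variant's backward step is not shown.

```python
def sequential_forward_selection(features, score, max_features):
    """Greedy SFS: repeatedly add the feature that raises `score` the most."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining and len(selected) < max_features:
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score <= best:  # no candidate improves the model: stop
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


# Hypothetical per-band usefulness; a real wrapper would refit a model here.
weights = {"b900": 3, "b1900": 2, "b2350": 1, "noise": -1}


def toy_score(subset):
    return sum(weights[f] for f in subset)


print(sequential_forward_selection(weights, toy_score, max_features=4))
# (['b900', 'b1900', 'b2350'], 6)
```

Because every candidate subset triggers a model fit, the cost grows with the number of features times the number of selection steps, which is one reason wrappers are expensive on full spectra.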
Affiliation(s)
- Francisco M. Canero
  - Department of Physical Geography and Regional Geographic Analysis, Universidad de Sevilla, 41004, Seville, Spain
- Victor Rodriguez-Galiano
  - Department of Physical Geography and Regional Geographic Analysis, Universidad de Sevilla, 41004, Seville, Spain
- David Aragones
  - Remote Sensing and Geographic Information Systems Lab (LAST-EBD), Doñana Biological Station, C.S.I.C., 41092, Seville, Spain

4. Andishgar A, Bazmi S, Tabrizi R, Rismani M, Keshavarzian O, Pezeshki B, Ahmadizar F. Machine learning-based models to predict the conversion of normal blood pressure to hypertension within 5-year follow-up. PLoS One 2024; 19:e0300201. [PMID: 38483860; PMCID: PMC10939282; DOI: 10.1371/journal.pone.0300201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Accepted: 02/23/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND Factors contributing to the development of hypertension exhibit significant variations across countries and regions. Our objective was to predict individuals at risk of developing hypertension within a 5-year period in a rural Middle Eastern area.
METHODS This longitudinal study utilized data from the Fasa Adults Cohort Study (FACS). The study initially included 10,118 participants aged 35-70 years in rural districts of Fasa, Iran, with a follow-up of 3,000 participants after 5 years using random sampling. A total of 160 variables were included in the machine learning (ML) models, and feature scaling and one-hot encoding were employed for data processing. Ten supervised ML algorithms were utilized, namely logistic regression (LR), support vector machine (SVM), random forest (RF), Gaussian naive Bayes (GNB), linear discriminant analysis (LDA), k-nearest neighbors (KNN), gradient boosting machine (GBM), extreme gradient boosting (XGB), CatBoost (CAT), and light gradient boosting machine (LGBM). Hyperparameter tuning was performed using various combinations of hyperparameters to identify the optimal model. Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the training data, and feature selection was conducted using SHapley Additive exPlanations (SHAP).
RESULTS Out of 2,288 participants who met the criteria, 251 individuals (10.9%) were diagnosed with new hypertension. The LGBM model (determined to be the optimal model) with the top 30 features achieved an AUC of 0.67, an F1-score of 0.23, and an AUC-PR of 0.26. The top three predictors of hypertension were baseline systolic blood pressure (SBP), gender, and waist-to-hip ratio (WHR), with AUCs of 0.66, 0.58, and 0.63, respectively. Hematuria in urine tests and family history of hypertension ranked fourth and fifth.
CONCLUSION ML models have the potential to be valuable decision-making tools in evaluating the need for early lifestyle modification or medical intervention in individuals at risk of developing hypertension.
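SMOTE, used above to balance the training data, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal sketch of that core idea in plain Python, not the implementation the study used; the function name, the default `k`, and the toy points are assumptions, and at least two minority samples are required.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x), key=lambda p: math.dist(x, p)
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority, n_new=3):
    print(point)
```

As in the abstract, oversampling should be applied to the training split only; synthesising points before the split would leak information into the evaluation data.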
Affiliation(s)
- Aref Andishgar
  - USERN Office, Fasa University of Medical Sciences, Fasa, Iran
- Sina Bazmi
  - Student Research Committee, Fasa University of Medical Sciences, Fasa, Iran
- Reza Tabrizi
  - Noncommunicable Diseases Research Center, Fasa University of Medical Science, Fasa, Iran
- Maziyar Rismani
  - Student Research Committee, Fasa University of Medical Sciences, Fasa, Iran
- Omid Keshavarzian
  - School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
- Babak Pezeshki
  - Clinical Research Development Unit, Valiasr Hospital, Fasa University of Medical Sciences, Fasa, Iran
- Fariba Ahmadizar
  - Department of Data Science and Biostatistics, Julius Global Health, University Medical Center Utrecht, Utrecht, The Netherlands

5. Chowa SS, Azam S, Montaha S, Payel IJ, Bhuiyan MRI, Hasan MZ, Jonkman M. Graph neural network-based breast cancer diagnosis using ultrasound images with optimized graph construction integrating the medically significant features. J Cancer Res Clin Oncol 2023; 149:18039-18064. [PMID: 37982829; PMCID: PMC10725367; DOI: 10.1007/s00432-023-05464-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 10/06/2023] [Indexed: 11/21/2023]
Abstract
PURPOSE An automated computerized approach can aid radiologists in the early diagnosis of breast cancer. In this study, a novel method is proposed for classifying breast tumors as benign or malignant from ultrasound images, using a Graph Neural Network (GNN) model with clinically significant features.
METHOD Ten informative features are extracted from the region of interest (ROI), based on the radiologists' diagnosis markers. The significance of the features is evaluated using density plots and the t-test. A feature table is generated in which each row represents an individual image, considered as a node, and the edges between the nodes are determined by calculating the Spearman correlation coefficient. A graph dataset is generated and fed into the GNN model. The model is configured through an ablation study and Bayesian optimization. The optimized model is then evaluated with different correlation thresholds to obtain the highest performance with a shallow graph. The performance consistency is validated with k-fold cross validation. The impact of utilizing ROIs and handcrafted features for breast tumor classification is evaluated by comparing the model's performance with Histogram of Oriented Gradients (HOG) descriptor features from the entire ultrasound image. Lastly, a clustering-based analysis is performed to generate a new filtered graph, considering weak and strong relationships of the nodes, based on the similarities.
RESULTS The results indicate that with a threshold value of 0.95, the GNN model achieves the highest test accuracy of 99.48%, precision and recall of 100%, and F1 score of 99.28%, reducing the number of edges by 85.5%. The GNN model's performance is 86.91%, considering no threshold value for the graph generated from HOG descriptor features. Different threshold values for the Spearman's correlation score are experimented with and the performance is compared. No significant differences are observed between the previous graph and the filtered graph.
CONCLUSION The proposed approach might aid radiologists in diagnosing breast cancer effectively and in learning its tumor patterns.
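The graph construction step described in the METHOD section, where images are nodes and edges come from Spearman correlation between their feature rows, can be sketched as follows. This is an illustration under assumptions rather than the authors' code: it links two nodes when their correlation is at or above the threshold (whether the paper thresholds the signed or the absolute value is not stated in the abstract), and the helper names and toy rows are hypothetical. Rows with constant values would cause a division by zero and are not handled.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


def build_edges(rows, threshold):
    """Connect node pairs whose feature rows correlate at or above `threshold`."""
    return [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if spearman(rows[i], rows[j]) >= threshold
    ]


rows = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1]]  # toy feature rows, one per image
print(build_edges(rows, threshold=0.95))  # [(0, 1)]
```

Raising the threshold prunes edges, which matches the abstract's observation that a 0.95 cut-off removed most edges while keeping accuracy high.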
Affiliation(s)
- Sadia Sultana Chowa
  - Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, 0909, Australia
- Sami Azam
  - Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, 0909, Australia
- Sidratul Montaha
  - Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, 0909, Australia
- Israt Jahan Payel
  - Health Informatics Research Laboratory (HIRL), Department of Computer Science and Engineering, Daffodil International University, Dhaka, 1216, Bangladesh
- Md Rahad Islam Bhuiyan
  - Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, 0909, Australia
- Md Zahid Hasan
  - Health Informatics Research Laboratory (HIRL), Department of Computer Science and Engineering, Daffodil International University, Dhaka, 1216, Bangladesh
- Mirjam Jonkman
  - Faculty of Science and Technology, Charles Darwin University, Casuarina, NT, 0909, Australia

6. Kopitar L, Stiglic G. Using heterogeneous sources of data and interpretability of prediction models to explain the characteristics of careless respondents in survey data. Sci Rep 2023; 13:13417. [PMID: 37591974; PMCID: PMC10435557; DOI: 10.1038/s41598-023-40209-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 08/07/2023] [Indexed: 08/19/2023] Open
Abstract
Prior to further processing, completed questionnaires must be screened for the presence of careless respondents. Different people will respond to surveys in different ways. Some take the easy path and fill out the survey carelessly. The proportion of careless respondents determines the survey's quality. As a result, identifying careless respondents is critical for the quality of obtained results. This study aims to explore the characteristics of careless respondents in survey data and evaluate the predictive power and interpretability of different types of data and indices of careless responding. The research question focuses on understanding the behavior of careless respondents and determining the effectiveness of various data sources in predicting their responses. Data from a three-month web-based survey on participants' personality traits such as honesty-humility, emotionality, extraversion, agreeableness, conscientiousness and openness to experience was used in this study. Data for this study was taken from Schroeders et al. The gradient boosting machine-based prediction model uses data from the answers, time spent answering, demographic information on the respondents, as well as some indices of careless responding from all three types of data. Prediction models were evaluated with tenfold cross-validation repeated a hundred times. Prediction models were compared based on balanced accuracy. Models' explanations were provided with Shapley values. Compared with existing work, data fusion from multiple types of information had no noticeable effect on the performance of the gradient boosting machine model. Variables such as "I would never take a bribe, even if it was a lot", average longstring, and total intra-individual response variability were found to be useful in distinguishing careless respondents. However, variables like "I would be tempted to use counterfeit money if I could get away with it" and intra-individual response variability of the first section of a survey showed limited effectiveness. Additionally, this study indicated that, whereas the psychometric synonym score has an immediate effect and is designed with the goal of identifying careless respondents when combined with other variables, it is not necessarily the optimal choice for fitting a gradient boosting machine model.
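Two of the careless-responding indices named above have simple definitions: longstring measures runs of identical consecutive answers, and intra-individual response variability (IRV) is the standard deviation of one respondent's answers. The sketch below computes them for a single respondent. It assumes the sample standard deviation for IRV, which the abstract does not specify, and the function names and example answers are made up.

```python
from itertools import groupby
from statistics import mean, stdev


def run_lengths(responses):
    """Lengths of consecutive runs of identical answers."""
    return [len(list(group)) for _, group in groupby(responses)]


def max_longstring(responses):
    """Longest run of identical consecutive answers."""
    return max(run_lengths(responses))


def average_longstring(responses):
    """Mean length of runs of identical consecutive answers."""
    return mean(run_lengths(responses))


def irv(responses):
    """Intra-individual response variability: SD of one respondent's answers."""
    return stdev(responses)


answers = [3, 3, 3, 3, 2, 5, 5, 1]  # hypothetical Likert answers from one respondent
print(max_longstring(answers))      # 4
print(average_longstring(answers))  # 2 (runs of length 4, 1, 2, 1)
print(irv([3, 3, 3, 3]))            # 0.0, a straight-lined section
```

High longstring values and near-zero IRV both point to straight-lining, which is why such indices are natural inputs for a model that flags careless respondents.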
Affiliation(s)
- Leon Kopitar
  - Faculty of Health Sciences, University of Maribor, Maribor, Slovenia
  - Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
- Gregor Stiglic
  - Faculty of Health Sciences, University of Maribor, Maribor, Slovenia
  - Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
  - Usher Institute, University of Edinburgh, Edinburgh, UK

7. Lee CY, Yang SH. Graph Spatio-Temporal Networks for Manufacturing Sales Forecast and Prevention Policies in Pandemic Era. Comput Ind Eng 2023; 182:109413. [PMID: 38620105; PMCID: PMC10299845; DOI: 10.1016/j.cie.2023.109413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Revised: 06/23/2023] [Accepted: 06/26/2023] [Indexed: 04/17/2024]
Abstract
Manufacturing industries worldwide have been significantly affected by the COVID-19 pandemic because of production characteristics such as low-cost country sourcing, globalization, and inventory levels. For analyzing correlated time series, spatio-temporal models have become more attractive, and the graph convolution network (GCN) is commonly used to provide more information to each node and its neighbors in the graph. Recently, the attention-adjusted graph spatio-temporal network (AGSTN) was proposed to address the problem of the pre-defined graph in GCN by combining multi-graph convolution and attention adjustment to learn spatial and temporal correlations over time. However, AGSTN may run into problems, particularly convergence issues, when only a small amount of non-sensor data is available. This study proposes several variants of AGSTN and applies them to non-sensor data. We suggest data augmentation and regularization techniques, such as edge selection, time series decomposition, and prevention policies, to improve AGSTN. An empirical study of worldwide manufacturing industries in the pandemic era was conducted to validate the proposed variants. The results show that the proposed variants significantly improve prediction performance, by at least around 20% in mean squared error (MSE), and alleviate the convergence problem.
Affiliation(s)
- Chia-Yen Lee
  - Department of Information Management, National Taiwan University, Taipei 106, Taiwan
- Shu-Huei Yang
  - Institute of Manufacturing Information and Systems, National Cheng Kung University, Tainan City 701, Taiwan

8. Mostafaei S, Hoang MT, Jurado PG, Xu H, Zacarias-Pons L, Eriksdotter M, Chatterjee S, Garcia-Ptacek S. Machine learning algorithms for identifying predictive variables of mortality risk following dementia diagnosis: a longitudinal cohort study. Sci Rep 2023; 13:9480. [PMID: 37301891; PMCID: PMC10257644; DOI: 10.1038/s41598-023-36362-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Accepted: 06/02/2023] [Indexed: 06/12/2023] Open
Abstract
Machine learning (ML) could have advantages over traditional statistical models in identifying risk factors. Using ML algorithms, our objective was to identify the most important variables associated with mortality after dementia diagnosis in the Swedish Registry for Cognitive/Dementia Disorders (SveDem). From SveDem, a longitudinal cohort of 28,023 dementia-diagnosed patients was selected for this study. Sixty variables were considered as potential predictors of mortality risk, such as age at dementia diagnosis, dementia type, sex, body mass index (BMI), mini-mental state examination (MMSE) score, time from referral to initiation of work-up, time from initiation of work-up to diagnosis, dementia medications, comorbidities, and some specific medications for chronic comorbidities (e.g., cardiovascular disease). We applied sparsity-inducing penalties for three ML algorithms and identified twenty important variables for the binary classification task in mortality risk prediction and fifteen variables to predict time to death. The area under the ROC curve (AUC) was used to evaluate the classification algorithms. Then, an unsupervised clustering algorithm was applied to the set of twenty selected variables to find two main clusters, which accurately matched the surviving and dead patient clusters. A support vector machine with an appropriate sparsity penalty provided the classification of mortality risk with accuracy = 0.7077, AUROC = 0.7375, sensitivity = 0.6436, and specificity = 0.740. Across three ML algorithms, the majority of the identified twenty variables were compatible with the literature and with our previous studies on SveDem. We also found new variables which were not previously reported in the literature as associated with mortality in dementia. Performance of basic dementia diagnostic work-up, time from referral to initiation of work-up, and time from initiation of work-up to diagnosis were found to be elements of the diagnostic process identified by the ML algorithms. The median follow-up time was 1053 (IQR = 516-1771) days in surviving and 1125 (IQR = 605-1770) days in dead patients. For prediction of time to death, the CoxBoost model identified 15 variables and classified them in order of importance. These highly important variables were age at diagnosis, MMSE score, sex, BMI, and Charlson Comorbidity Index with selection scores of 23%, 15%, 14%, 12% and 10%, respectively. This study demonstrates the potential of sparsity-inducing ML algorithms in improving our understanding of mortality risk factors in dementia patients and their application in clinical settings. Moreover, ML methods can be used as a complement to traditional statistical methods.
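To show how a sparsity-inducing penalty performs variable selection, the sketch below fits an L1-penalised logistic regression with proximal gradient descent. This is a swapped-in technique chosen for brevity: the study applied sparsity penalties to a support vector machine and other algorithms, which are not reproduced here. The function names, step size, penalty strength, and toy data are all assumptions.

```python
import math


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def l1_logistic(X, y, lam=0.1, lr=0.1, n_iter=500):
    """Minimise mean logistic loss + lam * ||w||_1 by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


# Toy data: the first column tracks the label, the second is unrelated to it.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0, 0]
w = l1_logistic(X, y)
print(w)  # the second weight stays at zero, so that variable is dropped
```

Variables whose weights are driven exactly to zero are excluded from the model, which is the mechanism that lets a penalised classifier double as a variable selector.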
Affiliation(s)
- Shayan Mostafaei
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
- Minh Tuan Hoang
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
- Pol Grau Jurado
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
- Hong Xu
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
- Lluis Zacarias-Pons
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
  - Vascular Health Research Group of Girona (ISV-Girona), Institut Universitari d'Investigació en Atenció Primària Jordi Gol i Gurina (IDIAP Jordi Gol), Girona, Spain
  - Network for Research on Chronicity, Primary Care, and Health Promotion (RICAPPS), Tenerife, Spain
- Maria Eriksdotter
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
  - Aging and Inflammation Theme, Karolinska University Hospital, Stockholm, Sweden
- Saikat Chatterjee
  - Division of Information Science and Engineering, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
- Sara Garcia-Ptacek
  - Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institute, Stockholm, Sweden
  - Aging and Inflammation Theme, Karolinska University Hospital, Stockholm, Sweden

9. Sun Y, Wang X, Ren N, Liu Y, You S. Improved Machine Learning Models by Data Processing for Predicting Life-Cycle Environmental Impacts of Chemicals. Environ Sci Technol 2023; 57:3434-3444. [PMID: 36537350; DOI: 10.1021/acs.est.2c04945] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Indexed: 06/17/2023]
Abstract
Machine learning (ML) provides an efficient means for rapid prediction of the life-cycle environmental impacts of chemicals, but challenges remain due to low prediction accuracy and poor interpretability of the models. To address these issues, we focused on data processing by using a mutual information-permutation importance (MI-PI) feature selection method to filter out irrelevant molecular descriptors from the input data, which improved the model interpretability by preserving the physicochemical meanings of the original molecular descriptors without generating new variables. We also applied a weighted Euclidean distance method to mine the data most relevant to the predicted targets by quantifying the contribution of each feature, thereby improving the prediction accuracy. On the basis of this data processing, we developed artificial neural network (ANN) models for predicting the life-cycle environmental impacts of chemicals with R² values of 0.81, 0.81, 0.84, 0.75, 0.73, and 0.86 for global warming, human health, metal depletion, freshwater ecotoxicity, particulate matter formation, and terrestrial acidification, respectively. The ML models were interpreted using the Shapley additive explanation method by quantifying the contribution of each input molecular descriptor to environmental impact categories. This work suggests that the combination of feature selection by MI-PI and source data selection based on weighted Euclidean distance has promising potential to improve the accuracy and interpretability of the models for predicting the life-cycle environmental impacts of chemicals.
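The weighted Euclidean distance used above for source data selection scales each squared feature difference by a per-feature weight before summing, so features that contribute more to the prediction dominate the notion of "nearest". A minimal sketch follows, assuming the common form d(x, y) = sqrt(sum_j w_j (x_j - y_j)^2); the abstract does not give the exact formula or how the weights are derived, and the function names and numbers here are illustrative.

```python
import math


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - y_j)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def nearest_samples(target, samples, weights, k):
    """Indices of the k training samples closest to `target` under the weights."""
    order = sorted(
        range(len(samples)),
        key=lambda i: weighted_distance(target, samples[i], weights),
    )
    return order[:k]


print(weighted_distance([0, 0], [3, 4], [1, 1]))  # 5.0, the ordinary Euclidean case
print(weighted_distance([0, 0], [3, 4], [1, 0]))  # 3.0, second feature ignored

samples = [[3, 0], [0, 1], [0, 5]]  # hypothetical descriptor vectors
print(nearest_samples([0, 0], samples, [1, 0], k=2))  # [1, 2]
```

Setting a weight to zero removes that feature from the comparison entirely, which is how feature-importance scores can steer which training chemicals are treated as most relevant to a prediction target.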
Affiliation(s)
- Ye Sun
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
- Xiuheng Wang
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
- Nanqi Ren
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
- Yanbiao Liu
- College of Environmental Science and Engineering, Textile Pollution Controlling Engineering Center of the Ministry of Ecology and Environment, Donghua University, Shanghai 201620, China
- Shijie You
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin 150090, P. R. China
|
10
|
A new ranking-based stability measure for feature selection algorithms. Soft Comput 2023. [DOI: 10.1007/s00500-022-07767-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2023]
|
11
|
Yan Y, Bao X, Chen B, Li Y, Yin J, Zhu G, Li Q. Interpretable machine learning framework reveals microbiome features of oral disease. Microbiol Res 2022; 265:127198. [PMID: 36126491 DOI: 10.1016/j.micres.2022.127198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/02/2022] [Revised: 08/25/2022] [Accepted: 09/13/2022] [Indexed: 11/16/2022]
Abstract
BACKGROUND: Although the oral microbiome plays an important role in the progression of oral diseases, the microbes closely related to these diseases remain largely uncharacterized. RESULTS: We collected saliva samples from 140 individuals and performed 16S amplicon sequencing. An interpretable machine learning framework for imbalanced high-dimensional big data of clinical microbial samples was developed to identify 14 oral microbiome features associated with oral diseases. Microbiome risk scores (MRSs) with the identified features were constructed with SHapley Additive exPlanations (SHAP). Correlations of the MRSs with individual physiological indicators and lifestyle habits were calculated. CONCLUSION: Our results reveal a set of oral microbiome features associated with oral diseases. Our study demonstrates the feasibility of preventing oral disease through lifestyle interventions and provides a reference method for the era of precision medicine aimed at individualized medicine.
Affiliation(s)
- Yueyang Yan
- Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China
- Xin Bao
- Hospital of Stomatology, Jilin University, 1500 Qinghua Road, Changchun 130021, China
- Bohua Chen
- Department of Stomatology, The Fifth Affiliated Hospital of Sun Yat-sen University, 52 Meihua East Road, Xiangzhou District, Zhuhai City, Guangdong Province, China
- Ying Li
- Key Laboratory of Symbol Computation and Knowledge Engineering, Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Jigang Yin
- Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China
- Guan Zhu
- Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China
- Qiushi Li
- Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China; Department of Stomatology, The Fifth Affiliated Hospital of Sun Yat-sen University, 52 Meihua East Road, Xiangzhou District, Zhuhai City, Guangdong Province, China
|
12
|
Evaluation of Feature Selection Methods for Classification of Epileptic Seizure EEG Signals. SENSORS 2022; 22:s22083066. [PMID: 35459052 PMCID: PMC9031940 DOI: 10.3390/s22083066] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 01/14/2022] [Revised: 04/07/2022] [Accepted: 04/13/2022] [Indexed: 02/01/2023]
Abstract
Epilepsy is a disease that decreases the quality of life of patients; it is also among the most common neurological diseases. Several studies have approached the classification and prediction of seizures by using electroencephalographic data and machine learning techniques. A large diversity of features has been extracted from electroencephalograms to perform classification tasks; therefore, it is important to use feature selection methods to select those that best support pattern recognition. In this study, the performance of a set of feature selection methods was compared across different classification models; the classification task consisted of the detection of ictal activity from the CHB-MIT and Siena Scalp EEG databases. The comparison was carried out for different feature sets and numbers of features. Furthermore, the similarity between selected feature subsets across classification models was evaluated. The best F1-score (0.90) was obtained by the K-nearest neighbor classifier on the CHB-MIT dataset. Results showed that none of the feature selection methods clearly outperformed the others, as the performance was notably affected by the classifier, dataset, and feature set. Two of the combinations (classifier/feature selection method) reporting the best results were K-nearest neighbor/support vector machine and random forest/embedded random forest.
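The abstract above mentions evaluating the similarity between feature subsets selected under different classification models, but does not name the measure used. One common choice is the Jaccard index; the sketch below assumes it purely for illustration, and the model names and feature names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature subsets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical feature subsets selected under three classifiers.
subsets = {
    "knn": ["mean", "variance", "entropy"],
    "svm": ["mean", "entropy", "kurtosis"],
    "rf": ["variance", "skewness"],
}

# Pairwise similarity between every pair of classifiers.
pairwise = {
    (m1, m2): jaccard(subsets[m1], subsets[m2])
    for m1, m2 in combinations(subsets, 2)
}
for pair, value in pairwise.items():
    print(pair, value)
```

A value of 1.0 means two models led to identical subsets and 0.0 means no shared features, so low pairwise values would be consistent with selection depending strongly on the classifier.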
|
13
|
Liu Y, Shen Y, Wang H, Zhang Y, Zhu X. m5Cpred-XS: A New Method for Predicting RNA m5C Sites Based on XGBoost and SHAP. Front Genet 2022; 13:853258. [PMID: 35432446 PMCID: PMC9005994 DOI: 10.3389/fgene.2022.853258] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/12/2022] [Accepted: 02/16/2022] [Indexed: 11/13/2022] Open
Abstract
As one of the most important post-transcriptional modifications of RNA, 5-cytosine-methylation (m5C) is reported to closely relate to many chemical reactions and biological functions in cells. Recently, several computational methods have been proposed for identifying m5C sites. However, the accuracy and efficiency are still not satisfactory. In this study, we proposed a new method, m5Cpred-XS, for predicting m5C sites of H. sapiens, M. musculus, and A. thaliana. First, the powerful SHAP method was used to select the optimal feature subset from seven different kinds of sequence-based features. Second, different machine learning algorithms were used to train the models. The results of five-fold cross-validation indicate that the model based on XGBoost achieved the highest prediction accuracy. Finally, our model was compared with other state-of-the-art models, which indicates that m5Cpred-XS is superior to other methods. Moreover, we deployed the model on a web server that can be accessed through http://m5cpred-xs.zhulab.org.cn/, and m5Cpred-XS is expected to be a useful tool for studying m5C sites.
Affiliation(s)
- Yong Zhang
- *Correspondence: Xiaolei Zhu; Yong Zhang
|
14
|
Mochammad S, Noh Y, Kang YJ, Park S, Lee J, Chin S. Multi-Filter Clustering Fusion for Feature Selection in Rotating Machinery Fault Classification. SENSORS 2022; 22:s22062192. [PMID: 35336363 PMCID: PMC8950067 DOI: 10.3390/s22062192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2022] [Revised: 03/08/2022] [Accepted: 03/09/2022] [Indexed: 02/04/2023]
Abstract
In the fault classification process, filter methods that sequentially remove unnecessary features have long been studied. However, the existing filter methods do not have guidelines on which, and how many, features are needed. This study developed a multi-filter clustering fusion (MFCF) technique, to effectively and efficiently select features. In the MFCF process, a multi-filter method combining existing filter methods is first applied for feature clustering; then, key features are automatically selected. The union of key features is utilized to find all potentially important features, and an exhaustive search is used to obtain the best combination of selected features to maximize the accuracy of the classification model. In the rotating machinery examples, fault classification models using MFCF were generated to classify normal and abnormal conditions of rotational machinery. The obtained results demonstrated that classification models using MFCF provide good accuracy, efficiency, and robustness in the fault classification of rotational machinery.
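The last stage described in this abstract, taking the union of key features and then searching exhaustively for the best combination, can be sketched as follows. This is only an illustration under stated assumptions: the per-filter picks, the feature names, and `toy_score` are hypothetical stand-ins, and in the study the score would be the accuracy of a trained classification model rather than a hand-written function.

```python
from itertools import combinations


def exhaustive_search(features, score):
    """Evaluate every non-empty subset of `features` and return the
    best-scoring subset together with its score."""
    best_subset, best_score = (), float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score


# Hypothetical key features picked by three different filter methods.
filter_picks = [{"rms", "kurtosis"}, {"rms", "crest_factor"}, {"kurtosis", "peak"}]
candidates = sorted(set().union(*filter_picks))  # union of all key features


def toy_score(subset):
    """Stand-in for model accuracy: rewards two 'useful' features and
    lightly penalises subset size."""
    useful = {"rms", "kurtosis"}
    return len(useful & set(subset)) - 0.1 * len(subset)


best_subset, best_score = exhaustive_search(candidates, toy_score)
print(best_subset, best_score)
```

Exhaustive search evaluates 2^n - 1 subsets for n candidate features, so it is only practical when the preceding filtering step leaves a small candidate set.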
Affiliation(s)
- Solichin Mochammad
- School of Mechanical Engineering, Pusan National University, Busan 46241, Korea
- Department of Mechanical Engineering, Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia
- Yoojeong Noh
- School of Mechanical Engineering, Pusan National University, Busan 46241, Korea
- Corresponding author
- Young-Jin Kang
- Research Institute of Mechanical Technology, Pusan National University, Busan 46241, Korea
- Sunhwa Park
- H&A Research Center, LG Electronics, Changwon 51554, Korea
- Jangwoo Lee
- H&A Research Center, LG Electronics, Changwon 51554, Korea
- Simon Chin
- H&A Research Center, LG Electronics, Changwon 51554, Korea
|
15
|
Khan S, Khan MA, Alhaisoni M, Tariq U, Yong HS, Armghan A, Alenezi F. Human Action Recognition: A Paradigm of Best Deep Learning Features Selection and Serial Based Extended Fusion. SENSORS (BASEL, SWITZERLAND) 2021; 21:7941. [PMID: 34883944 PMCID: PMC8659437 DOI: 10.3390/s21237941] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/04/2021] [Revised: 11/23/2021] [Accepted: 11/25/2021] [Indexed: 01/11/2023]
Abstract
Human action recognition (HAR) has gained significant attention recently, as it can be adopted for smart surveillance systems in multimedia. However, HAR is a challenging task because of the variety of human actions in daily life. Various solutions based on computer vision (CV) have been proposed in the literature, but they did not prove successful because of the large video sequences that need to be processed in surveillance systems. The problem is exacerbated in the presence of multi-view cameras. Recently, deep learning (DL)-based systems have shown significant success for HAR, even for multi-view camera systems. In this research work, a DL-based design is proposed for HAR. The proposed design consists of multiple steps, including feature mapping, feature fusion and feature selection. For the initial feature mapping step, two pre-trained models, DenseNet201 and InceptionV3, are considered. The extracted deep features are then fused using the Serial based Extended (SbE) approach, and the best features are selected using Kurtosis-controlled Weighted KNN. The selected features are classified using several supervised learning algorithms. To show the efficacy of the proposed design, we used the KTH, IXMAS, WVU, and Hollywood datasets. Experimental results showed that the proposed design achieved accuracies of 99.3%, 97.4%, 99.8%, and 99.9%, respectively, on these datasets. Furthermore, the feature selection step performed better in terms of computational time compared with the state-of-the-art.
Affiliation(s)
- Seemab Khan
- Department of Computer Science, HITEC University Taxila, Taxila 47080, Pakistan
- Majed Alhaisoni
- College of Computer Science and Engineering, University of Ha’il, Ha’il 55211, Saudi Arabia
- Usman Tariq
- College of Computer Engineering and Science, Prince Sattam Bin Abdulaziz University, Al-Kharaj 11942, Saudi Arabia
- Hwan-Seung Yong
- Department of Computer Science & Engineering, Ewha Womans University, Seoul 120-750, Korea
- Ammar Armghan
- Department of Electrical Engineering, College of Engineering, Jouf University, Sakakah 72311, Saudi Arabia
- Fayadh Alenezi
- Department of Electrical Engineering, College of Engineering, Jouf University, Sakakah 72311, Saudi Arabia
|
16
|
Ravindran SM, Bhaskaran SKM, Ambat SKN. A Deep Neural Network Architecture to Model Reference Evapotranspiration Using a Single Input Meteorological Parameter. ENVIRONMENTAL PROCESSES 2021; 8:1567-1599. [PMCID: PMC8486967 DOI: 10.1007/s40710-021-00543-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/29/2021] [Accepted: 09/22/2021] [Indexed: 06/02/2023]
Abstract
Hydro-agrological research considers the reference evapotranspiration (ETo), driven by meteorological variables, crucial for achieving precise irrigation in precision agriculture. ETo modelling based on a single meteorological parameter would be beneficial in places where the collection of climatic parameters is challenging. The aim of this research is to develop a deep neural network (DNN) architecture that predicts daily ETo with a single input parameter selected based on the feature importance (FI) score generated by two machine learning techniques, random forest (RF) and extreme gradient boosting (XGBoost). This study also investigated the potential of SHapley Additive exPlanations to interpret and validate the outcomes of the feature selection methods by assessing the contribution of each feature to the ETo prediction. These methods recommended solar radiation as a significant parameter in the datasets of three California Irrigation Management Information System (CIMIS) weather stations located in distinct ETo zones. Three ETo models (DNN-Ret, XGB-Ret, and RF-Ret) were built using solar radiation as the sole input and CIMIS ETo as the output. The performance evaluation of the developed models showed that DNN-Ret outperformed XGB-Ret and RF-Ret regardless of the dataset, with coefficients of determination (R²) ranging from 0.914 to 0.954 in the local scenario, with an average decrease of 8–9.5% in mean absolute error and root mean squared error, an improvement of 2.6–2.9% in Nash–Sutcliffe efficiency, and a 1.7–2% increase in R². The overall result analysis highlighted the efficiency of DNN-Ret in single-input-parameter ETo modelling in diverse climatic zones.
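For readers unfamiliar with the evaluation metrics named in this abstract, the sketch below computes mean absolute error, root mean squared error, Nash–Sutcliffe efficiency and R² on a toy series. The observed and simulated values are made up, and R² is computed here as the squared Pearson correlation, which is an assumption: the abstract does not state which definition of the coefficient of determination the authors used.

```python
import math


def mae(obs, sim):
    """Mean absolute error."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)


def rmse(obs, sim):
    """Root mean squared error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations.
    1.0 is a perfect fit; 0.0 is no better than predicting the mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / spread


def r2(obs, sim):
    """Squared Pearson correlation between observed and simulated values."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)


# Hypothetical daily ETo values (observed vs. simulated), in mm/day.
observed = [2.0, 4.0, 6.0, 8.0]
simulated = [2.5, 3.5, 6.5, 7.5]
print(mae(observed, simulated), rmse(observed, simulated))
print(nse(observed, simulated), r2(observed, simulated))
```

Unlike the squared correlation, Nash–Sutcliffe efficiency penalises systematic bias, which is one reason hydrological studies often report both.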
Affiliation(s)
- Sowmya Mangalath Ravindran
- Department of Computer Applications, Cochin University of Science and Technology, Kochi, Kerala 682022 India
|