1. Identifying influence factors and thresholds of the next day's pollen concentration in different seasons using interpretable machine learning. Sci Total Environ 2024; 935:173430. [PMID: 38782273] [DOI: 10.1016/j.scitotenv.2024.173430] [Received: 12/14/2023] [Revised: 05/19/2024] [Accepted: 05/19/2024] [Indexed: 05/25/2024]
Abstract
The prevalence of pollen allergies is a pressing global issue, with the World Health Organization (WHO) estimating that half of the world's population will be affected by 2050. Accurately forecasting pollen allergy risks requires identifying the key factors, and their thresholds, governing airborne pollen. To address this, we developed a technical framework combining advanced machine learning with SHapley Additive exPlanations (SHAP), focusing on Beijing. By analyzing meteorological data and vegetation phenology, we identified the factors influencing the next day's pollen concentration (NDP) in Beijing and their thresholds. Our results highlight vegetation phenology data from Synthetic Aperture Radar (SAR), temperature, wind speed, and atmospheric pressure as crucial factors in spring. In contrast, the Normalized Difference Vegetation Index (NDVI), air temperature, and wind speed are significant in autumn. Leveraging SHAP, we established season-specific thresholds for these factors. Our study not only confirms previous research but also unveils seasonal variations in the relationship between radar-derived vegetation phenology data and NDP. Additionally, we observe seasonal fluctuations in the influence patterns and threshold values of daily air temperature on NDP. These insights are pivotal for improving pollen concentration prediction accuracy and managing allergy risks effectively.
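SHAP's per-feature attributions are Shapley values from cooperative game theory, and for a handful of features they can be computed exactly. A minimal stdlib-Python sketch over a hypothetical three-feature pollen model (the feature names and coefficients below are illustrative, not the paper's):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, features):
    """Exact Shapley attributions for `predict` over subsets of features.

    predict maps a frozenset of feature names to a model output;
    each feature is credited its weighted average marginal contribution.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (predict(s | {f}) - predict(s))
        phi[f] = total
    return phi

# Hypothetical pollen model: additive effects plus a temp-wind interaction.
effects = {"temp": 3.0, "wind": 1.0, "pressure": 0.5}

def predict(subset):
    out = sum(effects[f] for f in subset)
    if {"temp", "wind"} <= subset:
        out += 2.0  # interaction is split evenly between temp and wind
    return out

phi = shapley_values(predict, list(effects))
# phi["temp"] == 4.0, phi["wind"] == 2.0, phi["pressure"] == 0.5
```

Practical SHAP implementations approximate this sum (e.g. TreeSHAP for tree ensembles), but the attribution being estimated is exactly the quantity above, and the attributions sum to the model output, which is what makes per-feature threshold reading possible.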
2. Effects of spatial variability in vegetation phenology, climate, landcover, biodiversity, topography, and soil property on soil respiration across a coastal ecosystem. Heliyon 2024; 10:e30470. [PMID: 38726202] [PMCID: PMC11079102] [DOI: 10.1016/j.heliyon.2024.e30470] [Received: 10/31/2023] [Revised: 03/21/2024] [Accepted: 04/26/2024] [Indexed: 05/12/2024] Open access.
Abstract
Coastal terrestrial-aquatic interfaces (TAIs) are crucial contributors to global biogeochemical cycles and carbon exchange. Soil carbon dioxide (CO2) efflux in these transition zones is, however, poorly understood because of the high spatiotemporal dynamics of TAIs, whose various sub-ecosystems are compressed and expanded by the complex influences of tides, changes in river levels, climate, and land use. We focus on the Chesapeake Bay region to (i) investigate the spatial heterogeneity of the coastal ecosystem and identify spatial zones with similar environmental characteristics based on spatial data layers, including vegetation phenology, climate, landcover, biodiversity, topography, soil property, and relative tidal elevation; and (ii) understand the primary driving factors affecting soil respiration within sub-ecosystems of the coastal ecosystem. Specifically, we employed hierarchical clustering analysis to identify spatial regions with distinct environmental characteristics, then determined the main driving factors using Random Forest regression and SHapley Additive exPlanations. Maximum and minimum temperature are the main drivers common to all sub-ecosystems, while each region also has additional unique major drivers that differentiate it from the others. Precipitation exerts an influence on vegetated lands, while soil pH is important specifically in forested lands. In croplands characterized by high clay content and low sand content, bulk density plays the most significant role. Wetlands demonstrate the importance of both elevation and sand content, with clay content being more relevant in non-inundated than in inundated wetlands. The topographic wetness index contributes significantly in mixed vegetation areas, including shrub, grass, pasture, and forest. Additionally, our research reveals that dense vegetation land covers and urban/developed areas exhibit distinct soil-property drivers.
Overall, our research demonstrates an efficient method of employing various open-source remote sensing and GIS datasets to understand the spatial variability and soil respiration mechanisms of coastal TAIs. There is no one-size-fits-all approach to modeling carbon fluxes released by soil respiration in coastal TAIs, and our study highlights the importance of further research and monitoring to improve our understanding of carbon dynamics and promote the sustainable management of coastal TAIs.
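The zoning step described here, grouping locations with similar environmental characteristics before per-zone driver analysis, can be illustrated with a minimal single-linkage agglomerative clustering sketch (synthetic 2-D points stand in for the paper's multi-layer spatial data):

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(c1, c2):  # single linkage: closest pair of members
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is unaffected
    return clusters

# Synthetic "grid cells": two tight groups and one outlier location.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)]
zones = single_linkage(cells, 3)
# -> three zones of sizes 3, 2, and 1
```

In the paper's workflow each point would carry many standardized environmental attributes rather than two coordinates, but the merge logic is the same; a per-zone regression and SHAP analysis then runs on each resulting cluster.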
3. Interpretable and explainable hybrid model for daily streamflow prediction based on multi-factor drivers. Environ Sci Pollut Res Int 2024. [PMID: 38710844] [DOI: 10.1007/s11356-024-33594-2] [Received: 01/30/2024] [Accepted: 05/02/2024] [Indexed: 05/08/2024]
Abstract
Streamflow time series typically exhibit nonlinear and nonstationary characteristics that complicate precise estimation. Recently, multifactorial machine learning (ML) models have been developed to enhance the performance of streamflow predictions, but their lack of interpretability raises concerns about their inner workings and reliability. This paper introduces an innovative hybrid architecture, the TCN-LSTM-Multihead-Attention model, which combines two layers of temporal convolutional networks (TCN) followed by one layer of long short-term memory (LSTM) units, integrated with a multihead attention mechanism, to predict streamflow from streamflow causation-driven prediction samples (RCDP); local and global interpretability are studied through Shapley values and partial dependence analysis. The find_peaks method was used to identify peak flow events in the test dataset, validating the model's generality and uncovering the physical causative patterns of streamflow. The results show that (1) compared with an LSTM model with the same hyperparameter settings, the proposed TCN-LSTM-Multihead-Attention hybrid model increased R2 by 52.9%, 2.5%, 43.1%, and 10.7%, respectively, at four stations in the test-set predictions using RCDP samples. Moreover, comparing the hybrid model's predictions across sample types at Hengshan station, R2 for RCDP increased by 5.06% and 1.22% relative to streamflow autoregressive prediction samples (RAP) and meteorological-soil volumetric water content coupled autoregressive prediction samples (MCSAP), respectively. (2) Historical streamflow from the preceding 3 days predominantly influences predictions owing to strong autocorrelation, with flow quantity (Q) typically emerging as the most significant feature alongside precipitation (P), surface soil moisture (SSM), and adjacent-station flow data.
(3) During low- and normal-flow periods, historical data remain the most crucial factor; during flood periods, however, the roles of upstream inflow and precipitation become significantly more pronounced. The model facilitates the identification and quantification of various hydrodynamic impacts on flow predictions, including upstream flood propagation, precipitation, and soil moisture conditions, and it elucidates the model's nonlinear relationships and threshold responses, thereby enhancing the interpretability and reliability of streamflow predictions.
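The peak-flow identification step can be sketched as a local-maximum scan with a height threshold. This is a simplified stand-in for scipy.signal.find_peaks (which also supports prominence and spacing constraints); the flow values are hypothetical:

```python
def find_flow_peaks(series, min_height=0.0):
    """Indices of strict local maxima at or above min_height.

    A simplified stand-in for scipy.signal.find_peaks, used here to
    flag flood events in a daily streamflow record.
    """
    peaks = []
    for i in range(1, len(series) - 1):
        if series[i - 1] < series[i] > series[i + 1] and series[i] >= min_height:
            peaks.append(i)
    return peaks

# Hypothetical daily streamflow (m^3/s) containing two flood pulses.
flow = [12, 15, 40, 120, 300, 180, 60, 30, 25, 90, 210, 140, 50]
flood_events = find_flow_peaks(flow, min_height=100)
# -> [4, 10]: the days of the two flood crests
```

Evaluating model error and feature attributions separately at these peak indices versus the remaining days is what supports the abstract's flood-period versus normal-period comparison.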
4. Breast cancer molecular subtype prediction: Improving interpretability of complex machine-learning models based on multiparametric-MRI features using SHapley Additive exPlanations (SHAP) methodology. Diagn Interv Imaging 2024; 105:161-162. [PMID: 38365542] [DOI: 10.1016/j.diii.2024.01.008] [Received: 01/26/2024] [Accepted: 01/29/2024] [Indexed: 02/18/2024]
5. Game-theoretic optimization of landslide susceptibility mapping: a comparative study between Bayesian-optimized basic neural network and new generation neural network models. Environ Sci Pollut Res Int 2024; 31:29811-29835. [PMID: 38592629] [DOI: 10.1007/s11356-024-33128-w] [Received: 09/09/2023] [Accepted: 03/25/2024] [Indexed: 04/10/2024]
Abstract
Landslide susceptibility mapping (LSM) is essential for reducing landslide risk and ensuring the safety of people and infrastructure in landslide-prone areas. However, little research has been done on well-optimized Elman neural networks (ENN), deep neural networks (DNN), and artificial neural networks (ANN) for robust LSM, and there is a research gap regarding the use of Bayesian optimization and the derivation of SHapley Additive exPlanations (SHAP) values from optimized models. This study therefore optimizes DNN, ENN, and ANN models using Bayesian optimization for LSM and derives SHAP values from the optimized models. The LSM models were validated using the receiver operating characteristic curve, the confusion matrix, and twelve other error metrics. The study used six machine learning-based feature selection techniques to identify the most important variables for predicting landslide susceptibility. The decision tree, random forest, and bagging feature selection models showed that slope, elevation, DFR, annual rainfall, LD, DD, RD, and LULC are influential variables, while geology and soil texture have less influence. The DNN model outperformed the other two, mapping 7839.54 km2 as very low landslide susceptibility and 3613.44 km2 as very high landslide susceptibility, and it is better suited for generating landslide susceptibility maps because it classifies areas with higher accuracy. The model identified several key factors that contribute to the initiation of landslides, including high elevation, built-up and agricultural land use, sparse vegetation, north- and northwest-facing aspects, soil depth less than 140 cm, high rainfall, high lineament density, and short distance from roads.
The study's findings can help stakeholders make informed decisions to reduce landslide risk and ensure the safety of people and infrastructure in landslide-prone areas.
6. Who Benefited Most from the Internet-Based Conversational Engagement RCT (I-CONECT)? Application of the Personalized Medicine Approach to a Behavioral Intervention Study. J Prev Alzheimers Dis 2024; 11:639-648. [PMID: 38706280] [PMCID: PMC11061034] [DOI: 10.14283/jpad.2024.41] [Received: 09/27/2023] [Accepted: 01/09/2024] [Indexed: 05/07/2024]
Abstract
BACKGROUND Many Alzheimer's disease (AD) clinical trials have failed to demonstrate treatment efficacy on cognition. It is conceivable that a complex disease like AD may not show a uniform treatment effect, given the many heterogeneities of disease processes and individual traits. OBJECTIVES We employed an individual-level treatment response (ITR) approach to determine the characteristics of treatment responders and estimated the time saved in cognitive decline, using the Internet-based Conversational Engagement Clinical Trial (I-CONECT) behavioral intervention study as a model. DESIGN AND SETTING I-CONECT is a multi-site, single-blind, randomized controlled trial aiming to improve cognitive function through frequent conversational interactions via internet/webcam. The experimental group engaged in video chats with study staff 4 times/week for 6 months; the control group received weekly 10-minute check-in phone calls. PARTICIPANTS Of 186 randomized participants, the current study used the 139 with complete information at both baseline and 6-month follow-up (73 with mild cognitive impairment (MCI), 66 with normal cognition; 64 in the experimental group and 75 in the control group). MEASUREMENTS ITR scores were generated for the Montreal Cognitive Assessment (MoCA) (global cognition, primary outcome) and Category Fluency Animals (CFA) (semantic fluency, secondary outcome), which showed significant efficacy in the trial. ITR scores were generated through 300 iterations of 3-fold cross-validated random forest models. The average treatment difference (ATD) curve and the area between the curves (ABC) were estimated to measure the heterogeneity of treatment responses. Responder traits were identified using SHapley Additive exPlanations (SHAP) and decision tree models. The time saved in cognitive decline was explored to gauge clinical meaningfulness.
RESULTS ABC statistics showed substantial heterogeneity in treatment response for MoCA but modest heterogeneity for CFA. Age, cognitive status, time spent with family and friends, education, and personality were important characteristics influencing treatment responses. Intervention-group participants in the upper 30% of ITR scores demonstrated potential delays of 3 months in semantic fluency (CFA) and 6 months in global cognition (MoCA), assuming a 5-fold faster natural cognitive decline compared with the control group during the post-treatment period. CONCLUSIONS ITR-based analyses are valuable for profiling treatment responders for features that can inform future trial design and clinical practice. Reliably measuring the time saved in cognitive decline is an area of ongoing research to gain insight into the clinical meaningfulness of treatment.
7. High-resolution mapping of regional VOCs using the enhanced space-time extreme gradient boosting machine (XGBoost) in Shanghai. Sci Total Environ 2023; 905:167054. [PMID: 37714357] [DOI: 10.1016/j.scitotenv.2023.167054] [Received: 06/29/2023] [Revised: 09/10/2023] [Accepted: 09/11/2023] [Indexed: 09/17/2023]
Abstract
Accurate estimation of volatile organic compounds (VOCs) at high spatiotemporal resolution is of great significance for establishing advanced early-warning systems and regulating air pollution control. However, such high-resolution VOC estimates remain incomplete. Here, the space-time extreme gradient boosting model (STXGB) was enhanced by integrating spatiotemporal information to improve the spatial resolution and overall accuracy of VOC estimates. To this end, meteorological, topographical, and pollutant-emission data were input to the STXGB model, and regional hourly 300 m VOC maps for 2020 in Shanghai were produced. Our results show that the STXGB model achieves good hourly VOC estimation performance (R2 = 0.73). A further SHapley Additive exPlanations (SHAP) analysis indicates that local interpretations of the STXGB model demonstrate the strong contribution of emissions to the mapped VOC estimates, while acknowledging the important contribution of the space and time terms. The proposed approach outperforms many traditional machine learning models with a lower computational burden in terms of speed and memory.
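Gradient boosting of the XGBoost family fits each new tree to the residuals of the current ensemble. A minimal sketch with one-feature regression stumps under squared loss (toy data, not the STXGB model itself, which adds regularization, deeper trees, and space-time features):

```python
def fit_stump(x, residual):
    """Best single-split regression stump under squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # a split above max(x) is useless
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=50, lr=0.3):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residual)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

# Toy one-feature regression: a step function the ensemble must recover.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1, 1, 1, 1, 5, 5, 5, 5]
model = gradient_boost(x, y)
```

The learning rate shrinks each stump's contribution, so the residuals decay geometrically; after 50 rounds the toy model reproduces the step to within a fraction of a percent.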
8. Machine Learning Models Using SHapley Additive exPlanation for Fire Risk Assessment Mode and Effects Analysis of Stadiums. Sensors (Basel) 2023; 23:2151. [PMID: 36850757] [PMCID: PMC9964004] [DOI: 10.3390/s23042151] [Received: 01/14/2023] [Revised: 02/10/2023] [Accepted: 02/13/2023] [Indexed: 06/18/2023]
Abstract
Machine learning methods can establish complex nonlinear relationships between input and response variables for stadium fire risk assessment. However, the output of machine learning models is very difficult to interpret because of their complex "black box" structure, which hinders their application in stadium fire risk assessment. The SHapley Additive exPlanations (SHAP) method makes a local approximation to the predictions of any regression or classification model that is faithful and interpretable, and assigns an importance value (the SHAP value) to each input variable for a given prediction. In this study, we designed indicator attribute threshold intervals to classify and quantify data for different fire risk categories, and then used a random forest model combined with the SHAP strategy to establish a stadium fire risk assessment model. The main objective is to analyze the impact of each risk characteristic on four different risk assessment modes, so as to uncover the complex nonlinear relationships between risk characteristics and stadium fire risk. This helps managers make appropriate, targeted fire-safety management decisions before an incident occurs, reducing the incidence of fires. The experimental results show that the established interpretable random forest model provides 83% accuracy, 86% precision, and 85% recall on the stadium fire risk test dataset. The study also shows that the small amount of data makes it difficult to identify the decision boundaries for the Critical and Hazardous modes.
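The reported test metrics follow directly from the confusion-matrix counts; a small sketch of how accuracy, precision, and recall are derived (the labels below are hypothetical, not the study's data):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall from paired label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of flagged, how many real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real, how many flagged
    return accuracy, precision, recall

# Hypothetical test labels (1 = high fire risk, 0 = low).
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
acc, prec, rec = classification_metrics(y_true, y_pred)
# -> 0.8, 0.8, 0.8
```

For the study's four risk modes these metrics would be computed per class (one-vs-rest), which is why sparsely populated classes such as Critical and Hazardous are the hardest to bound.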
9. Identification of factors influencing net primary productivity of terrestrial ecosystems based on interpretable machine learning: evidence from the county-level administrative districts in China. J Environ Manage 2023; 326:116798. [PMID: 36435139] [DOI: 10.1016/j.jenvman.2022.116798] [Received: 06/05/2022] [Revised: 11/10/2022] [Accepted: 11/13/2022] [Indexed: 06/16/2023]
Abstract
Global climate change is rooted in the imbalance between carbon sources and sinks, and achieving net-zero greenhouse gas emissions should focus not only on source-side drivers but also on sink-side influencing factors. Taking county-level administrative districts in China as the sample, this study uses machine learning models to fit the relationship between socioeconomic development (SED) and the net primary productivity (NPP) of terrestrial ecosystems, and identifies the key influencing factors and their effects with the SHapley Additive exPlanations (SHAP) algorithm. The results show that districts with low terrestrial NPP are spatially clustered. The eight key factors, in order, are: agricultural development level, latitude, population size, longitude, animal husbandry development level, economic scale, time trend, and industrialization level. Via SHAP interaction plots, we found that the effects of population, economic growth, and industrialization on terrestrial NPP are regionally heterogeneous; via cluster analysis, we found that the mode by which SED affects terrestrial NPP varies with development stage. Conservation of terrestrial NPP therefore needs to account for these stage changes in SED, as well as inter-regional differences, to develop a regionally coordinated and time-coherent ecological carbon sink conservation plan.
10. XML-CIMT: Explainable Machine Learning (XML) Model for Predicting Chemical-Induced Mitochondrial Toxicity. Int J Mol Sci 2022; 23:15655. [PMID: 36555297] [PMCID: PMC9779353] [DOI: 10.3390/ijms232415655] [Received: 11/04/2022] [Revised: 12/06/2022] [Accepted: 12/06/2022] [Indexed: 12/14/2022] Open access.
Abstract
Organ toxicity caused by chemicals is a serious problem in the creation and use of chemicals such as medications, insecticides, chemical products, and cosmetics. In recent decades, the initiation and development of chemical-induced organ damage have been related to mitochondrial dysfunction, among other adverse effects. Recently, many drugs, for example troglitazone, have been removed from the marketplace because of significant mitochondrial toxicity, so there is an urgent need for in silico models that can reliably anticipate chemical-induced mitochondrial toxicity. In this paper, we propose an explainable machine-learning model to classify mitochondrially toxic and non-toxic compounds. After several experiments, the Mordred feature descriptor was shortlisted for use after feature selection. The selected features, used with the CatBoost learning algorithm, achieved a prediction accuracy of 85% in 10-fold cross-validation and 87.1% in independent testing, an improvement over the existing state-of-the-art method in the literature. The proposed tree-based ensemble model, along with its global model explanation, will aid pharmaceutical chemists in better understanding the prediction of mitochondrial toxicity.
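The 10-fold cross-validation protocol behind the accuracy figure can be sketched as index-splitting logic (the model-training step, CatBoost in the paper, is omitted here):

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    folds = [idx[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_splits(20, 10))
# 10 folds: each sample appears in exactly one test fold
```

Each fold serves as the held-out test set exactly once, and the reported cross-validated accuracy is the average of the k per-fold accuracies; the independent-test figure comes from a separate set never touched by any fold.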
11. Utilization of model-agnostic explainable artificial intelligence frameworks in oncology: a narrative review. Transl Cancer Res 2022; 11:3853-3868. [PMID: 36388027] [PMCID: PMC9641128] [DOI: 10.21037/tcr-22-1626] [Received: 06/10/2022] [Accepted: 09/07/2022] [Indexed: 11/25/2022]
Abstract
Background and Objective: Machine learning (ML) models are increasingly being utilized in oncology research for use in the clinic. However, while more complicated models may provide improvements in predictive or prognostic power, a hurdle to their adoption is limited model interpretability: the inner workings can be perceived as a "black box". Explainable artificial intelligence (XAI) frameworks, including Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP), are novel, model-agnostic approaches that aim to provide insight into the inner workings of the "black box" by producing quantitative visualizations of how model predictions are calculated. In doing so, XAI can transform complicated ML models into easily understandable charts and interpretable sets of rules, giving providers an intuitive understanding of the knowledge generated and thus facilitating the deployment of such models in routine clinical workflows. Methods: We performed a comprehensive, non-systematic review of the latest literature to define use cases of model-agnostic XAI frameworks in oncologic research. The examined database was PubMed/MEDLINE; the last search was run on May 1, 2022. Key Content and Findings: We identified several fields of oncology research in which ML models and XAI were utilized to improve interpretability, including prognostication, diagnosis, radiomics, pathology, treatment selection, radiation treatment workflows, and epidemiology. Within these fields, XAI facilitates determination of feature importance in the overall model, visualization of relationships and/or interactions, evaluation of how individual predictions are produced, feature selection, identification of prognostic and/or predictive thresholds, and overall confidence in the models, among other benefits.
These examples provide a basis for future work to expand on, which can facilitate adoption in the clinic when the complexity of such modeling would otherwise be prohibitive. Conclusions: Model-agnostic XAI frameworks offer an intuitive and effective means of describing oncology ML models, with applications including prognostication and determination of optimal treatment regimens. Using such frameworks presents an opportunity to improve understanding of ML models, a critical step toward their adoption in the clinic.
12. Explainable Machine Learning Model for Predicting First-Time Acute Exacerbation in Patients with Chronic Obstructive Pulmonary Disease. J Pers Med 2022; 12:228. [PMID: 35207716] [PMCID: PMC8879653] [DOI: 10.3390/jpm12020228] [Received: 01/14/2022] [Revised: 02/02/2022] [Accepted: 02/03/2022] [Indexed: 12/15/2022] Open access.
Abstract
Background: The study developed accurate, explainable machine learning (ML) models for predicting first-time acute exacerbation of chronic obstructive pulmonary disease (COPD, AECOPD) at an individual level. Methods: We conducted a retrospective case–control study. A total of 606 patients with COPD were screened for eligibility using registry data from the COPD Pay-for-Performance Program (COPD P4P program) database at Changhua Christian Hospital between January 2017 and December 2019. Recursive feature elimination was used to select the optimal subset of features for predicting the occurrence of AECOPD. We developed four ML models to predict first-time AECOPD, and the highest-performing model was applied. Finally, an explainable ML approach combining SHapley Additive exPlanations (SHAP) with a local explanation method was used to evaluate the risk of AECOPD and to generate individual explanations of the model's decisions. Results: The gradient boosting machine (GBM) and support vector machine (SVM) models exhibited superior discrimination ability (area under curve [AUC] = 0.833 [95% confidence interval (CI) 0.745–0.921] and AUC = 0.836 [95% CI 0.757–0.915], respectively). The decision curve analysis indicated that the GBM model exhibited a higher net benefit in distinguishing patients at high risk for AECOPD when the threshold probability was <0.55. The COPD Assessment Test (CAT) and the symptom of wheezing were the two most important features and exhibited the highest SHAP values, followed by monocyte count and white blood cell (WBC) count, coughing, red blood cell (RBC) count, breathing rate, oral long-acting bronchodilator use, chronic pulmonary disease (CPD), systolic blood pressure (SBP), and others. Higher CAT score; monocyte, WBC, and RBC counts; BMI; diastolic blood pressure (DBP); neutrophil-to-lymphocyte ratio; and eosinophil and lymphocyte counts were associated with AECOPD.
The presence of symptoms (wheezing, dyspnea, coughing), chronic disease (CPD, congestive heart failure [CHF], sleep disorders, and pneumonia), and use of COPD medications (triple-therapy long-acting bronchodilators, short-acting bronchodilators, oral long-acting bronchodilators, and antibiotics) were also positively associated with AECOPD. A high breathing rate, heart rate, or systolic blood pressure and methylxanthine use were negatively correlated with AECOPD. Conclusions: The ML model was able to accurately assess the risk of AECOPD. The ML model combined with SHAP and the local explanation method was able to provide interpretable, visual explanations of individualized risk predictions, which may assist clinical physicians in understanding the effects of key features in the model and the model's decision-making process.
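The reported AUCs equal the probability that a randomly chosen exacerbation case is ranked above a randomly chosen non-case (ties count half); a minimal sketch with hypothetical predicted risks:

```python
def auc(scores_pos, scores_neg):
    """AUC as P(positive score > negative score), counting ties as 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            wins += 1.0 if sp > sn else 0.5 if sp == sn else 0.0
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted risks for patients with and without AECOPD.
pos = [0.9, 0.8, 0.7, 0.6]   # exacerbated
neg = [0.5, 0.4, 0.8, 0.2]   # did not exacerbate
value = auc(pos, neg)
# -> 0.84375 (13.5 concordant half-counts out of 16 pairs)
```

This pairwise-concordance form is equivalent to the area under the ROC curve, which is why AUC is insensitive to any monotone rescaling of the model's risk scores.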
13. Classification and Explanation for Intrusion Detection System Based on Ensemble Trees and SHAP Method. Sensors (Basel) 2022; 22:1154. [PMID: 35161899] [PMCID: PMC8840013] [DOI: 10.3390/s22031154] [Received: 01/03/2022] [Revised: 01/28/2022] [Accepted: 01/28/2022] [Indexed: 01/27/2023]
Abstract
In recent years, many methods for intrusion detection systems (IDS) have been designed and developed in the research community and have achieved a perfect detection rate on IDS datasets. Deep neural networks (DNNs) are representative examples widely applied in IDS. However, DNN model architectures are becoming increasingly complex, with high computing-resource requirements, and it is difficult for humans to obtain explanations for the decisions these models make on large IoT-based IDS datasets. Many proposed IDS methods have not been applied in practical deployments because they give cybersecurity experts no explanations with which to optimize and evaluate their decisions against the models' judgments. This paper aims to enhance attack detection performance on big IoT-based IDS datasets while also providing explanations of the machine learning (ML) model predictions. The proposed ML-based IDS method follows an ensemble-trees approach, using decision tree (DT) and random forest (RF) classifiers, which do not require extensive computing resources for training. Two big datasets, NF-BoT-IoT-v2 and NF-ToN-IoT-v2 (NetFlow-based versions of the original BoT-IoT and ToN-IoT datasets), are used for the experimental evaluation, along with the IoTDS20 dataset. Furthermore, SHapley Additive exPlanations (SHAP), an eXplainable AI (XAI) methodology, is applied to explain and interpret the classification decisions of the DT and RF models; this is not only effective in interpreting the final decision of the ensemble-tree approach but also supports cybersecurity experts in quickly optimizing and evaluating the correctness of their judgments based on the explanations of the results.
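The ensemble-trees decision reduces to a majority vote over individual tree predictions. A minimal sketch, where hand-written rules stand in for trained DT/RF members and the flow features are hypothetical:

```python
def majority_vote(trees, sample):
    """Ensemble decision: each tree votes and the most common label wins."""
    votes = [tree(sample) for tree in trees]
    return max(set(votes), key=votes.count)

# Hypothetical hand-written rules standing in for trained tree members;
# features: packets per second, mean packet size, distinct destination ports.
trees = [
    lambda s: "attack" if s["pps"] > 1000 else "benign",
    lambda s: "attack" if s["ports"] > 50 else "benign",
    lambda s: "attack" if s["pps"] > 800 and s["size"] < 100 else "benign",
]
flow = {"pps": 1200, "size": 60, "ports": 10}
label = majority_vote(trees, flow)
# -> "attack" (two of three trees vote attack)
```

SHAP then attributes each such decision back to the input features (here pps, size, ports), which is what lets an analyst check whether the vote rests on features they consider trustworthy.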