1
Zhang W, Zhao Y, Zhang F, Shi X, Zeng C, Maerker M. Understanding the mechanism of gully erosion in the alpine region through an interpretable machine learning approach. The Science of the Total Environment 2024; 949:174949. [PMID: 39067585 DOI: 10.1016/j.scitotenv.2024.174949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Revised: 07/19/2024] [Accepted: 07/20/2024] [Indexed: 07/30/2024]
Abstract
In the alpine region, climate warming has led to the retreat of glaciers, snow cover, and permafrost. This has intensified water cycling and soil erosion and increased the occurrence of natural disasters. This study investigated the Lhasa River Basin in the southern Tibetan Plateau as a representative alpine basin, with a specific focus on gully erosion. Based on field investigations and the interpretation of high-resolution satellite remote sensing images, the Random Forest (RF) algorithm was applied to evaluate gully erosion susceptibility at the watershed level. The Shapley Additive Interpretation method was then used to interpret the RF model and gain deeper insights into the variables influencing gully erosion. The RF model achieved an area under the receiver operating characteristic curve (AUC) of 0.99 and 0.98 for the training and testing datasets, respectively, indicating outstanding model performance. The resulting susceptibility map shows that areas with moderate or higher gully erosion susceptibility cover 50% of the basin. The model interpretation indicated that elevation, slope, permafrost, rainstorm, silt loam topsoil, human activity, stream power, and vegetation were the explanatory variables of highest importance for gully erosion occurrence. Individual variables show specific thresholds or conditions that promote gully erosion: i) elevations higher than 4950 m, ii) slopes steeper than 13.5°, iii) more than 11 days of extreme rainstorms per year, iv) silt loam topsoil, v) presence of permafrost, vi) a stream power index higher than 1.2, and vii) a normalized difference vegetation index (NDVI) lower than 0.25. Our findings provide a scientific basis for improving soil erosion control in such highly vulnerable alpine areas.
Affiliation(s)
- Wenjie Zhang
  - ECMI Team, State Key Laboratory of Tibetan Plateau Earth System Science, Resources and Environment (TPESRE), Institute of Tibetan Plateau Research, Chinese Academy of Sciences (CAS), Beijing, China; University of Chinese Academy of Sciences, Beijing, China.
- Yang Zhao
  - ECMI Team, State Key Laboratory of Tibetan Plateau Earth System Science, Resources and Environment (TPESRE), Institute of Tibetan Plateau Research, Chinese Academy of Sciences (CAS), Beijing, China.
- Fan Zhang
  - ECMI Team, State Key Laboratory of Tibetan Plateau Earth System Science, Resources and Environment (TPESRE), Institute of Tibetan Plateau Research, Chinese Academy of Sciences (CAS), Beijing, China; University of Chinese Academy of Sciences, Beijing, China.
- Xiaonan Shi
  - ECMI Team, State Key Laboratory of Tibetan Plateau Earth System Science, Resources and Environment (TPESRE), Institute of Tibetan Plateau Research, Chinese Academy of Sciences (CAS), Beijing, China.
- Chen Zeng
  - ECMI Team, State Key Laboratory of Tibetan Plateau Earth System Science, Resources and Environment (TPESRE), Institute of Tibetan Plateau Research, Chinese Academy of Sciences (CAS), Beijing, China.
- Michael Maerker
  - Leibniz Centre for Agricultural Landscape Research, Working Group on Soil Erosion and Feedbacks, Germany; University of Pavia, Department of Earth and Environmental Sciences, Italy.
2
Nong X, Lai C, Chen L, Wei J. A novel coupling interpretable machine learning framework for water quality prediction and environmental effect understanding in different flow discharge regulations of hydro-projects. The Science of the Total Environment 2024; 950:175281. [PMID: 39117235 DOI: 10.1016/j.scitotenv.2024.175281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2024] [Revised: 08/01/2024] [Accepted: 08/02/2024] [Indexed: 08/10/2024]
Abstract
Machine learning models (MLMs) have been increasingly used to forecast water pollution. However, their "black box" nature, which obscures the underlying mechanisms, still limits the applicability of MLMs to water quality management in hydro-projects under complex and frequent artificial regulation. This study proposes an interpretable machine learning framework for water quality prediction that couples a hydrodynamic (flow discharge) scenario-based Random Forest (RF) model with multiple model-agnostic techniques and quantifies global, local, and joint interpretations (i.e., partial dependence, individual conditional expectation, and accumulated local effects) of environmental factor implications. The framework was applied and verified by predicting the permanganate index (CODMn) under different flow discharge regulation scenarios in the Middle Route of the South-to-North Water Diversion Project of China (MRSNWDPC). Data matrices comprising 4664 sampling cases, including water quality, meteorological, and hydrological indicators from eight national stations along the main canal of the MRSNWDPC, were collected from May 2019 to December 2020. The results showed that the RF models were effective in forecasting CODMn in all flow discharge scenarios, with a mean square error, coefficient of determination, and mean absolute error of 0.006-0.026, 0.481-0.792, and 0.069-0.104, respectively, in the testing dataset. A global interpretation indicated that dissolved oxygen, flow discharge, and surface pressure are the three most important variables for CODMn. Local and joint interpretations indicated that the RF-based prediction model provides a basic understanding of the physical mechanisms of environmental systems. The proposed framework can effectively learn the fundamental environmental implications of water quality variations and provide reliable prediction performance, highlighting the importance of model interpretability for trustworthy machine learning applications in water management projects. This study provides scientific references for applying advanced data-driven MLMs to water quality forecasting and a reliable methodological framework for water quality management in this and similar hydro-projects.
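For readers unfamiliar with the three scores quoted in this abstract (mean square error, coefficient of determination, and mean absolute error), the following is a minimal sketch of how they are conventionally computed. The function name `regression_metrics` and the sample values are illustrative assumptions; they are not taken from the paper, whose exact computation is not given in the abstract.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, R^2, MAE) for paired observed and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n        # mean square error
    mae = sum(abs(r) for r in residuals) / n       # mean absolute error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Coefficient of determination: 1 - SS_res / SS_tot.
    # Undefined (division by zero) if all observed values are identical.
    r2 = 1.0 - (mse * n) / ss_tot
    return mse, r2, mae


# Hypothetical observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]
mse, r2, mae = regression_metrics(observed, predicted)
print(mse, r2, mae)
```

Lower MSE and MAE and a higher coefficient of determination indicate a closer fit, which is how the ranges reported above should be read.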
Affiliation(s)
- Xizhi Nong
  - College of Civil Engineering and Architecture, Guangxi University, Nanning 530004, China; State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China; Centre for Urban Sustainability and Resilience, Department of Civil, Environmental and Geomatic Engineering, University College London, London WC1E 6BT, UK; School of Computing and Engineering, University of West London, London W5 5RF, UK
- Cheng Lai
  - College of Civil Engineering and Architecture, Guangxi University, Nanning 530004, China
- Lihua Chen
  - College of Civil Engineering and Architecture, Guangxi University, Nanning 530004, China.
- Jiahua Wei
  - State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China
3
Guo Y, Zhang S, Ren L, Tian X, Tang S, Xian Y, Wu X, Zhang Z. Prediction of Chinese suitable habitats of Panax notoginseng under climate change based on MaxEnt and chemometric methods. Sci Rep 2024; 14:16434. [PMID: 39014061 PMCID: PMC11252130 DOI: 10.1038/s41598-024-67178-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 07/09/2024] [Indexed: 07/18/2024] Open
Abstract
Notoginseng saponin R1; ginsenosides Rg1, Re, Rb1, and Rd; the sum of the five saponins; and underground-part fresh weight (UPFW) of single plants were used as quality evaluation indices for Panax notoginseng (Burk.) F. H. Chen (P. notoginseng). Comprehensive evaluation of P. notoginseng samples from 30 production areas was performed using the MaxEnt model. Spatial pattern changes in suitable P. notoginseng habitats were predicted for current and future periods (2050s, 2070s, and 2090s) under the SSP126 and SSP585 scenarios. The results revealed that temperature, precipitation, and solar radiation were important environmental variables. Suitable habitats were located mainly in Yunnan, Guizhou, and Sichuan Provinces. The distribution core of P. notoginseng is predicted to shift southeast in the future. The saponin content decreased from the southeast to the northwest of Yunnan Province, which was contrary to the UPFW trend. This study provides the necessary information for the protection and sustainable utilization of P. notoginseng resources, and a theoretical reference for its application in the quality evaluation of Chinese medicinal products.
Affiliation(s)
- Yixin Guo
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China
- Shiyan Zhang
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China
- Linghui Ren
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China
- Xin Tian
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China
- Shicheng Tang
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China
- Yisha Xian
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China
- Xinjia Wu
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China
- Zilong Zhang
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, 102488, China.
  - Meteorological Administration Key Open Laboratory of Transforming Climate Resource to Economy, Chongqing, 401147, China.
4
Delaney JT, Larson DM. Using explainable machine learning methods to evaluate vulnerability and restoration potential of ecosystem state transitions. Conservation Biology: The Journal of the Society for Conservation Biology 2024; 38:e14203. [PMID: 37817744 DOI: 10.1111/cobi.14203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 09/27/2023] [Accepted: 10/05/2023] [Indexed: 10/12/2023]
Abstract
Ecosystem state transitions can be ecologically devastating or be a restoration success. State transitions are common within aquatic systems worldwide, especially considering human-mediated changes to land use and water use. We created a transferable conceptual framework to enable multiscale assessments of state resilience and early warnings of state transitions that can inform strategic restorations and avoid ecosystem collapse. The conceptual framework integrated machine learning predictions with ecosystem state concepts (e.g., state classification, gradients of vulnerability, and recovery potential leading to state transitions) and was devised to investigate possible environmental drivers. As an application of the framework, we generated prediction probabilities of submersed aquatic vegetation (SAV) presence at nearly 10,000 sites in the Upper Mississippi River (United States). Then, we used an interpretability method to explain model predictions to gain insights into possible environmental drivers and thresholds or linear responses of SAV presence and absence. Model accuracy was 89% without spatial bias. Average water depth, suspended solids, substrate, and distance to nearest SAV were the best predictors and likely environmental drivers of SAV habitat suitability. These environmental drivers exhibited nonlinear, threshold-type responses for SAV. All the results are also presented in an online dashboard to explore results at many spatial scales. The habitat suitability model outputs and prediction explanations from many spatial scales (4 m to 400 km of river reach) can inform research and restoration planning.
5
Tseng KY, Hsieh YT, Lin HC. Machine learning prediction on wetland succession and the impact of artificial structures from a decade of field data. The Science of the Total Environment 2024; 937:173426. [PMID: 38796015 DOI: 10.1016/j.scitotenv.2024.173426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Revised: 05/19/2024] [Accepted: 05/19/2024] [Indexed: 05/28/2024]
Abstract
Artificial structures can influence wetland topology and sediment properties, thereby shaping plant distribution and composition. Macrobenthos composition is correlated with plant cover. Previous studies on the impact of artificial structures on plant distribution have rarely incorporated time-series data or extended field surveys. In this study, a machine-learning-based species distribution model built on a decade of observations was analyzed to investigate the correlation between the shift in the distribution of B. planiculmis, artificial structure-induced elevation changes and the expansion of other plants, as well as their connection to soil properties and crab composition dynamics under plants in Gaomei Wetland. A long short-term memory (LSTM) model with Shapley additive explanations (SHAP) was employed to predict the distribution of B. planiculmis and explain feature importance. The results indicated that wetland topology was influenced by both artificial structures and plants. Areas initially colonized by B. planiculmis were replaced by other species. Soil properties differed significantly among plant patches; however, principal component analysis (PCA) of sediment properties and niche similarity analysis showed that the niches of the plants overlapped. Crab composition differed under different plants. The presence probability of B. planiculmis near woody paths decreased according to LSTM and field survey data. SHAP analysis suggested that the distribution of other plants, the historical distribution of B. planiculmis, and sediment properties contributed significantly to the presence probability of B. planiculmis. A sharp decrease in SHAP values with increasing NDVI at suitable elevations, overlap in the PCA of sediment properties, and niche similarity indicated potential competition among plants. This decade-long time-series field survey revealed the joint effects of artificial structures and vegetation on the dynamics of topology and soil properties. These changes influenced plant distribution through potential plant competition. LSTM with SHAP provided valuable insights into the mechanisms underlying the effects of artificial structures on the plant zonation process.
Affiliation(s)
- Kuang-Yu Tseng
  - Department of Life Science, Tunghai University, Taichung 407, Taiwan
- Yun-Ting Hsieh
  - Department of Life Science, Tunghai University, Taichung 407, Taiwan
- Hui-Chen Lin
  - Department of Life Science, Tunghai University, Taichung 407, Taiwan; Center for Ecology and Environment, Tunghai University, Taiwan.
6
Talukdar S, Shahfahad, Bera S, Naikoo MW, Ramana GV, Mallik S, Kumar PA, Rahman A. Optimisation and interpretation of machine and deep learning models for improved water quality management in Lake Loktak. Journal of Environmental Management 2024; 351:119866. [PMID: 38147770 DOI: 10.1016/j.jenvman.2023.119866] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 11/28/2023] [Accepted: 12/13/2023] [Indexed: 12/28/2023]
Abstract
Loktak Lake, one of the largest freshwater lakes in Manipur, India, is critical for the eco-hydrology and economy of the region, but faces deteriorating water quality due to urbanisation, anthropogenic activities, and domestic sewage. Addressing the urgent need for effective pollution management, this study aims to assess the lake's water quality status using the water quality index (WQI) and to develop advanced machine learning (ML) tools for WQI assessment and ML model interpretation to improve pollution management decision making. The WQI was assessed using entropy-based weighting arithmetic, and three ML models (Gradient Boosting Machine (GBM), Random Forest (RF) and Deep Neural Network (DNN)) were optimised using a grid search algorithm in the H2O Application Programming Interface (API). These models were validated by various metrics and interpreted globally and locally via Partial Dependence Plot (PDP), Accumulated Local Effect (ALE) and SHapley Additive exPlanations (SHAP). The results show a WQI range of 72.38-100, with 52.7% of samples categorised as very poor. The RF model outperformed GBM and DNN and showed the highest accuracy and generalisation ability, which is reflected in the superior R2 values (0.97 in training, 0.9 in test) and the lower root mean square error (RMSE). RF's minimal margin of error and reliable feature interpretation contrasted with DNN's larger margin of error and inconsistency, which affected its usefulness for decision making. Turbidity was found to be a critical predictive feature in all models, significantly influencing WQI, with other variables such as pH and temperature also playing an important role. SHAP dependence plots illustrated the direct relationship between key water quality parameters such as turbidity and WQI predictions. The novelty of this study lies in its comprehensive approach to the evaluation and interpretation of ML models for WQI estimation, which provides a nuanced understanding of water quality dynamics in Loktak Lake. By identifying the most effective ML models and key predictive features, this study provides insights for water quality management and paves the way for targeted strategies to monitor and improve water quality in this vital freshwater ecosystem.
Affiliation(s)
- Swapan Talukdar
  - Department of Geography, Faculty of Natural Sciences, Jamia Millia Islamia, New Delhi, 110025, India.
- Shahfahad
  - Department of Geography, Faculty of Natural Sciences, Jamia Millia Islamia, New Delhi, 110025, India.
- Somnath Bera
  - Department of Geography, Central University of South Bihar, Gaya, Bihar, 823001, India.
- Mohd Waseem Naikoo
  - Department of Geography & Disaster Management, University of Kashmir, Srinagar, Jammu & Kashmir, 190006, India.
- G V Ramana
  - Department of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi, 110016, India.
- Santanu Mallik
  - Department of Civil Engineering, National Institution of Technology, Agaratala, Tripura, 799046, India.
- Potsangbam Albino Kumar
  - Department of Civil Engineering, National Institution of Technology, Imphal, Manipur, 795004, India.
- Atiqur Rahman
  - Department of Geography, Faculty of Natural Sciences, Jamia Millia Islamia, New Delhi, 110025, India.
7
Zhang H, Guo W, Wang W. The dimensionality reductions of environmental variables have a significant effect on the performance of species distribution models. Ecol Evol 2023; 13:e10747. [PMID: 38020673 PMCID: PMC10659948 DOI: 10.1002/ece3.10747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2023] [Revised: 10/29/2023] [Accepted: 11/06/2023] [Indexed: 12/01/2023] Open
Abstract
How to effectively obtain species-related low-dimensional data from massive environmental variables has become an urgent problem for species distribution models (SDMs). In this study, we explore whether dimensionality reduction of environmental variables can improve the predictive performance of SDMs. We first used two linear (i.e., principal component analysis (PCA) and independent components analysis) and two nonlinear (i.e., kernel principal component analysis (KPCA) and uniform manifold approximation and projection) dimensionality reduction techniques (DRTs) to reduce the dimensionality of high-dimensional environmental data. Then, we established five SDMs based on the dimensionality-reduced environmental variables for 23 real plant species and nine virtual species, and compared their predictive performance with that of SDMs based on environmental variables selected using Pearson's correlation coefficient (PCC). In addition, we studied the effects of DRTs, model complexity, and sample size on the predictive performance of SDMs. The predictive performance of SDMs under DRTs other than KPCA is better than that under PCC-based selection. Moreover, the predictive performance of SDMs using linear DRTs is better than that of SDMs using nonlinear DRTs. In addition, using DRTs to process environmental variables has no less impact on the predictive performance of SDMs than model complexity and sample size. When the model complexity is at the complex level, PCA improves the predictive performance of SDMs the most, by 2.55% compared with PCC. At the middle level of sample size, PCA improved the predictive performance of SDMs by 2.68% compared with PCC. Our study demonstrates that DRTs have a significant effect on the predictive performance of SDMs. Specifically, linear DRTs, especially PCA, are more effective at improving model predictive performance at relatively high model complexity or with large sample sizes.
Affiliation(s)
- Hao‐Tian Zhang
  - School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou, China
- Wen‐Yong Guo
  - Research Center for Global Change and Complex Ecosystems, School of Ecological and Environmental Sciences, East China Normal University, Shanghai, China
  - Zhejiang Tiantong Forest Ecosystem National Observation and Research Station, School of Ecological and Environmental Sciences, East China Normal University, Shanghai, China
- Wen‐Ting Wang
  - School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou, China
8
Aryal K, Maraseni T, Apan A. Preference, perceived change, and professed relationship among ecosystem services in the Himalayas. Journal of Environmental Management 2023; 344:118522. [PMID: 37390580 DOI: 10.1016/j.jenvman.2023.118522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 06/20/2023] [Accepted: 06/24/2023] [Indexed: 07/02/2023]
Abstract
The demand side of ecosystem service (ES), especially preference and perception of supply and interactions among ES, is an important yet underexplored research area for landscape planning and management in human-dominated landscapes. Taking the case of a multifunctional landscape in the Hindu-Kush Himalayan region, we carried out a social survey of ES, focusing on preference, perceived change, and observed relationship among six major ES from the local people's perspective. Using a semi-structured questionnaire, data were collected from 300 households in 10 categories of human settlements, based on watershed and land cover types. Garrett mean score (GMS), ordinal logistic regression estimates, and the Chi-square test were used for quantitative data, while an inductive approach was adopted for qualitative data analysis. The results show that at the landscape level, local people ranked water yield (GMS = 70) and crop production (GMS = 66) as the most preferred ES, whereas habitat quality (GMS = 37) and carbon sequestration (GMS = 35) were among the least preferred ES. More than 70% of the respondents believed that the supply of crop production has decreased over the last two decades; however, the supply of other provisioning and non-provisioning ES has increased, as observed by the majority of the respondents. Among the 15 pairs of ES, local people believe that co-occurrence of ES is possible. The majority of the respondents said that a synergistic relationship exists among 13 pairs of ES, except crop production, which is negatively related to timber production and carbon sequestration. Among the identified trade-offs in ES, the majority of local people believed that direct trade-offs (i.e., linear inverse relationships) are dominant, as observed in 8 pairs of ES, followed by concave and convex trade-offs. Based on our analysis, we argue that the preference and perceived change of ES are more dependent on the spatial heterogeneity of communities (i.e., watershed type, municipal category, and land cover type of residence) than on socio-economic determinants. Further, we discuss and suggest a few policy and management measures, including place-based spatial assessment of social demand and preference, embracing agroforestry practices in ecosystem management programs, mainstreaming non-local ES in local decision making through incentives, and optimizing the supply of desired ES through integrated biophysical and socio-economic assessment of the landscape.
Affiliation(s)
- Kishor Aryal
  - University of Southern Queensland, Toowoomba, 4350, Queensland, Australia; Ministry of Industry, Tourism, Forests, and Environment, Sudoorpaschim Province, Dhangadhi, Nepal
- Tek Maraseni
  - University of Southern Queensland, Toowoomba, 4350, Queensland, Australia; Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou, 730000, China.
- Armando Apan
  - University of Southern Queensland, Toowoomba, 4350, Queensland, Australia; Institute of Environmental Science and Meteorology, University of the Philippines Diliman, Quezon City, Philippines
9
Sotomayor G, Romero J, Ballari D, Vázquez RF, Ramírez-Morales I, Hampel H, Galarza X, Montesinos B, Forio MAE, Goethals PLM. Occurrence Prediction of Riffle Beetles (Coleoptera: Elmidae) in a Tropical Andean Basin of Ecuador Using Species Distribution Models. Biology 2023; 12:473. [PMID: 36979164 PMCID: PMC10045380 DOI: 10.3390/biology12030473] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Revised: 03/16/2023] [Accepted: 03/16/2023] [Indexed: 03/30/2023]
Abstract
Genera and species of Elmidae (riffle beetles) are sensitive to water pollution; however, in tropical freshwater ecosystems, their requirements regarding environmental factors need to be investigated. Species distribution models (SDMs) were established for five elmid genera in the Paute river basin (southern Ecuador) using the Random Forest (RF) algorithm considering environmental variables, i.e., meteorology, land use, hydrology, and topography. Each RF-based model was trained and optimised using cross-validation. Environmental variables that explained most of the Elmidae spatial variability were land use (i.e., riparian vegetation alteration and presence/absence of canopy), precipitation, and topography, mainly elevation and slope. The highest probability of occurrence for elmid genera was predicted in streams located within well-preserved zones. Moreover, specific ecological niches were spatially predicted for each genus. Macrelmis was predicted in the lower and forested areas, with high precipitation levels, towards the Amazon basin. Austrelmis was predicted to be in the upper parts of the basin, i.e., páramo ecosystems, with an excellent level of conservation of their riparian ecosystems. Austrolimnius and Heterelmis were also predicted in the upper parts of the basin but in more widespread elevation ranges, in the Heterelmis case, and even in some areas with a medium level of anthropisation. Neoelmis was predicted to be in the mid-region of the study basin in high altitudinal streams with a high degree of meandering. The main findings of this research are likely to contribute significantly to local conservation and restoration efforts being implemented in the study basin and could be extrapolated to similar eco-hydrological systems.
Affiliation(s)
- Gonzalo Sotomayor
  - Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000 Ghent, Belgium
  - Departamento de Ingeniería Civil, Facultad de Ingeniería, Universidad de Cuenca, Av. 12 de abril S/N, Cuenca, Azuay 010203, Ecuador
- Jorge Romero
  - Instituto de Estudios del Régimen Seccional del Ecuador (IERSE), Facultad de Ciencia y Tecnología, Universidad del Azuay, Cuenca 010204, Ecuador
- Daniela Ballari
  - Instituto de Estudios del Régimen Seccional del Ecuador (IERSE), Facultad de Ciencia y Tecnología, Universidad del Azuay, Cuenca 010204, Ecuador
- Raúl F Vázquez
  - Departamento de Ingeniería Civil, Facultad de Ingeniería, Universidad de Cuenca, Av. 12 de abril S/N, Cuenca, Azuay 010203, Ecuador
  - Laboratorio de Ecología Acuática (LEA), Facultad de Ciencias Químicas, Universidad de Cuenca, Av. 12 de abril S/N, Cuenca 010203, Ecuador
- Henrietta Hampel
  - Laboratorio de Ecología Acuática (LEA), Facultad de Ciencias Químicas, Universidad de Cuenca, Av. 12 de abril S/N, Cuenca 010203, Ecuador
- Xavier Galarza
  - Instituto de Estudios del Régimen Seccional del Ecuador (IERSE), Facultad de Ciencia y Tecnología, Universidad del Azuay, Cuenca 010204, Ecuador
- Bolívar Montesinos
  - Ministerio del Ambiente, Agua y Transición Ecológica, Dirección Zonal 6, Cuenca 010104, Ecuador
- Marie Anne Eurie Forio
  - Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000 Ghent, Belgium
- Peter L M Goethals
  - Department of Animal Sciences and Aquatic Ecology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000 Ghent, Belgium
10
Lim SJ, Son M, Ki SJ, Suh SI, Chung J. Opportunities and challenges of machine learning in bioprocesses: Categorization from different perspectives and future direction. Bioresource Technology 2023; 370:128518. [PMID: 36565818 DOI: 10.1016/j.biortech.2022.128518] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/31/2022] [Revised: 12/15/2022] [Accepted: 12/17/2022] [Indexed: 06/17/2023]
Abstract
Recent advances in machine learning (ML) have revolutionized an extensive range of research and industry fields by successfully addressing intricate problems that cannot be resolved with conventional approaches. However, low interpretability and incompatibility make it challenging to apply ML to complicated bioprocesses, which rely on the delicate metabolic interplay among living cells. This overview delineates ML applications to bioprocesses from different perspectives and then discusses their inherent limitations (i.e., uncertainties in prediction), together with attempts to supplement the ML models. A clear classification can be made depending on the purpose of the ML (supervised vs unsupervised) per application, as well as on their system boundaries (engineered vs natural). Although a limited number of hybrid approaches with meaningful outcomes (e.g., improved accuracy) are available, there is still a need to further enhance the interpretability, compatibility, and user-friendliness of ML models.
Collapse
Affiliation(s)
- Seung Ji Lim
- Water Cycle Research Center, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
- Moon Son
- Water Cycle Research Center, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea; Division of Energy and Environmental Technology, KIST School, Korea University of Science and Technology (UST), Seoul 02792, Republic of Korea
- Seo Jin Ki
- Department of Environmental Engineering, Gyeongsang National University, Jinju 52725, Republic of Korea
- Sang-Ik Suh
- Department of Energy System Engineering, Gyeongsang National University, Jinju 52725, Republic of Korea
- Jaeshik Chung
- Water Cycle Research Center, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea; Division of Energy and Environmental Technology, KIST School, Korea University of Science and Technology (UST), Seoul 02792, Republic of Korea.
11
Wikle CK, Datta A, Hari BV, Boone EL, Sahoo I, Kavila I, Castruccio S, Simmons SJ, Burr WS, Chang W. An illustration of model agnostic explainability methods applied to environmental data. ENVIRONMETRICS 2023; 34:e2772. [PMID: 37200542 PMCID: PMC10187774 DOI: 10.1002/env.2772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Accepted: 09/20/2022] [Indexed: 05/20/2023]
Abstract
Historically, the two primary criticisms statisticians have had of machine learning and deep neural models are their lack of uncertainty quantification and their inability to do inference (i.e., to explain what inputs are important). Explainable AI has developed in the last few years as a sub-discipline of computer science and machine learning to mitigate these concerns (as well as concerns of fairness and transparency in deep modeling). In this article, our focus is on explaining which inputs are important in models for predicting environmental data. In particular, we focus on three general methods for explainability that are model agnostic and thus applicable across a breadth of models without internal explainability: "feature shuffling", "interpretable local surrogates", and "occlusion analysis". We describe particular implementations of each of these and illustrate their use with a variety of models, all applied to the problem of long-lead forecasting of monthly soil moisture in the North American corn belt given sea surface temperature anomalies in the Pacific Ocean.
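As a rough illustration of the "feature shuffling" idea named in this abstract, the sketch below shuffles one input column at a time and records how much the prediction error grows. The toy model, the synthetic data, and the names `mse` and `permutation_importance` are assumptions made for illustration; this is not the authors' implementation.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a callable model over a list of input rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(model, shuffled, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical model: strong use of feature 0, weak use of 1, none of 2."""
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
rows = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
targets = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, targets)
print(importances)
```

Because the approach only calls the model as a black box, the same function could in principle be pointed at any fitted predictor; the error metric and the number of repeats are design choices rather than fixed parts of the method.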
Affiliation(s)
- Abhirup Datta
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland, USA
- Edward L. Boone
- Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, Virginia, USA
- Indranil Sahoo
- Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, Virginia, USA
- Indulekha Kavila
- School of Pure and Applied Physics, Mahatma Gandhi University, Athirampuzha, Kerala, India
- Stefano Castruccio
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, Indiana, USA
- Susan J. Simmons
- Institute for Advanced Analytics, North Carolina State University, Raleigh, North Carolina, USA
- Wesley S. Burr
- Department of Mathematics, Trent University, Peterborough, Ontario, Canada
- Won Chang
- Department of Mathematical Sciences, University of Cincinnati, Cincinnati, Ohio, USA
12
Lee DS, Lee DY, Park YS. Interpretable machine learning approach to analyze the effects of landscape and meteorological factors on mosquito occurrences in Seoul, South Korea. ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2023; 30:532-546. [PMID: 35900627 PMCID: PMC9813121 DOI: 10.1007/s11356-022-22099-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2022] [Accepted: 07/14/2022] [Indexed: 06/15/2023]
Abstract
Mosquitoes are the underlying cause of various public health and economic problems. In this study, patterns of mosquito occurrence were analyzed based on landscape and meteorological factors in the metropolitan city of Seoul. We evaluated the influence of environmental factors on mosquito occurrence through the interpretation of prediction models with a machine learning algorithm. Through hierarchical cluster analysis, the study areas were classified into waterside and non-waterside areas, according to the landscape patterns. The mosquito occurrence was higher in the waterside area, and mosquito abundance was negatively affected by rainfall at the waterside. The mosquito occurrence was predicted in each cluster area based on the landscape and cumulative meteorological variables using a random forest algorithm. Both models exhibited good performance (both accuracy and AUROC > 0.8) in predicting the level of mosquito occurrence. The embedded relationship between the mosquito occurrence and the environmental factors in the models was explained using the Shapley additive explanation method. According to the variable importance and the partial dependence plots for each model, the waterside area was more influenced by the meteorological and land cover variables than the non-waterside area. Therefore, mosquito control strategies should consider the effects of landscape and meteorological conditions, including the temperature, rainfall, and the landscape heterogeneity. The present findings can contribute to the development of mosquito forecasting systems in metropolitan cities for the promotion of public health.
Affiliation(s)
- Dae-Seong Lee
- Department of Biology, Kyung Hee University, Seoul, 02447, Republic of Korea
- Da-Yeong Lee
- Department of Biology, Kyung Hee University, Seoul, 02447, Republic of Korea
- Young-Seuk Park
- Department of Biology, Kyung Hee University, Seoul, 02447, Republic of Korea.
13
Bifarin OO. Interpretable machine learning with tree-based shapley additive explanations: Application to metabolomics datasets for binary classification. PLoS One 2023; 18:e0284315. [PMID: 37141218 PMCID: PMC10159207 DOI: 10.1371/journal.pone.0284315] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/29/2022] [Accepted: 03/28/2023] [Indexed: 05/05/2023]
Abstract
Machine learning (ML) models are used in clinical metabolomics studies, most notably for biomarker discovery, to identify metabolites that discriminate between a case and a control group. To improve understanding of the underlying biomedical problem and to bolster confidence in these discoveries, model interpretability is germane. In metabolomics, partial least squares discriminant analysis (PLS-DA) and its variants are widely used, partly due to the model's interpretability with the Variable Influence in Projection (VIP) scores, a global interpretable method. Herein, Tree-based Shapley Additive explanations (SHAP), an interpretable ML method grounded in game theory, was used to explain ML models with local explanation properties. In this study, ML experiments (binary classification) were conducted for three published metabolomics datasets using PLS-DA, random forests, gradient boosting, and extreme gradient boosting (XGBoost). Using one of the datasets, the PLS-DA model was explained using VIP scores, while one of the best-performing models, a random forest model, was interpreted using Tree SHAP. The results show that SHAP offers greater explanatory depth than PLS-DA's VIP, making it a powerful method for rationalizing machine learning predictions from metabolomics studies.
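To make the game-theoretic basis mentioned in this abstract concrete, the sketch below computes exact Shapley values for a tiny hypothetical model by averaging each feature's marginal contribution over every ordering of the features. This is brute-force enumeration, not the Tree SHAP algorithm used in the paper, and the model, the instance, the all-zero background, and the function names are assumptions chosen for illustration.

```python
from itertools import permutations


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all feature orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        coalition = frozenset()
        for j in order:
            with_j = coalition | {j}
            # Marginal contribution of feature j when it joins the coalition.
            totals[j] += value_fn(with_j) - value_fn(coalition)
            coalition = with_j
    return [t / len(orderings) for t in totals]


def model(x):
    """Hypothetical model with one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]    # the prediction being explained (assumed)
background = [0.0, 0.0, 0.0]  # reference values for "absent" features (assumed)


def value_fn(coalition):
    """Model output when only the features in `coalition` take instance values."""
    x = [instance[i] if i in coalition else background[i] for i in range(3)]
    return model(x)


phi = shapley_values(value_fn, 3)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the background, and the interaction term is split evenly between the two features involved. Enumeration scales factorially with the number of features, which is why specialised algorithms such as Tree SHAP are used in practice.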
Affiliation(s)
- Olatomiwa O Bifarin
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, Georgia, United States of America
14
Maloney KO, Buchanan C, Jepsen RD, Krause KP, Cashman MJ, Gressler BP, Young JA, Schmid M. Explainable machine learning improves interpretability in the predictive modeling of biological stream conditions in the Chesapeake Bay Watershed, USA. JOURNAL OF ENVIRONMENTAL MANAGEMENT 2022; 322:116068. [PMID: 36058075 DOI: 10.1016/j.jenvman.2022.116068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/17/2022] [Revised: 08/03/2022] [Accepted: 08/19/2022] [Indexed: 06/15/2023]
Abstract
Anthropogenic alterations have resulted in widespread degradation of stream conditions. To aid in stream restoration and management, baseline estimates of conditions and improved explanation of factors driving their degradation are needed. We used random forests to model biological conditions using a benthic macroinvertebrate index of biotic integrity for small, non-tidal streams (upstream area ≤200 km²) in the Chesapeake Bay watershed (CBW) of the mid-Atlantic coast of North America. We utilized several global and local model interpretation tools to improve average and site-specific model inferences, respectively. The model was used to predict condition for 95,867 individual catchments for eight periods (2001, 2004, 2006, 2008, 2011, 2013, 2016, 2019). Predicted conditions were classified as Poor, FairGood, or Uncertain to align with management needs, and individual reach lengths and catchment areas were summed by condition class for the CBW for each period. Global permutation and local Shapley importance values indicated that the percentages of forest, development, and agriculture in upstream catchments had strong impacts on predictions. Development and agriculture negatively influenced stream condition at the model-average (partial dependence [PD] and accumulated local effect [ALE] plots) and local (individual condition expectation and Shapley value plots) levels. Friedman's H-statistic indicated large overall interactions for these three land covers, and bivariate global plots (PD and ALE) supported interactions among agriculture and development. Total stream length and catchment area predicted in FairGood conditions decreased then increased over the 19 years (length/area: 66.6/65.4% in 2001, 66.3/65.2% in 2011, and 66.6/65.4% in 2019). Examination of individual catchment predictions between 2001 and 2019 showed that those predicted to have the largest decreases in condition had large increases in development, whereas catchments predicted to exhibit the largest increases in condition showed moderate increases in forest cover. Use of global and local interpretative methods together with watershed-wide and individual catchment predictions supports conservation practitioners who need to identify widespread and localized patterns, especially acknowledging that management actions typically take place at individual-reach scales.
Affiliation(s)
- Kelly O Maloney
- U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, West Virginia, USA 25430.
- Claire Buchanan
- Interstate Commission on the Potomac River Basin (ICPRB), 30 West Gude Drive, Suite 450, Rockville, MD, 20850, USA.
- Rikke D Jepsen
- Interstate Commission on the Potomac River Basin (ICPRB), 30 West Gude Drive, Suite 450, Rockville, MD, 20850, USA.
- Kevin P Krause
- U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, West Virginia, USA 25430.
- Matthew J Cashman
- U.S. Geological Survey, Maryland-Delaware-District of Columbia Water Science Center, Baltimore, MD, USA, 21228.
- Benjamin P Gressler
- U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, West Virginia, USA 25430.
- John A Young
- U.S. Geological Survey, Eastern Ecological Science Center, Kearneysville, West Virginia, USA 25430.
- Matthias Schmid
- Department of Medical Biometry, Informatics and Epidemiology, Medical Faculty, University of Bonn, Venusberg-Campus 1, 53127, Bonn, Germany.
15
An C, Yang H, Yu X, Han ZY, Cheng Z, Liu F, Dou J, Li B, Li Y, Li Y, Yu J, Liang P. A Machine Learning Model Based on Health Records for Predicting Recurrence After Microwave Ablation of Hepatocellular Carcinoma. J Hepatocell Carcinoma 2022; 9:671-684. [PMID: 35923613 PMCID: PMC9342890 DOI: 10.2147/jhc.s358197] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/19/2022] [Accepted: 07/08/2022] [Indexed: 12/24/2022]
Abstract
Background and Aim: Early recurrence (ER) presents a challenge for the survival prognosis of patients with hepatocellular carcinoma (HCC). The aim of this study was to investigate machine learning (ML) models using clinical data for predicting ER after microwave ablation (MWA). Methods: Between August 2005 and December 2019, 1574 patients with early-stage HCC who underwent MWA at four hospitals were reviewed. Then, 36 clinical data points per patient were collected, and the patients were assigned to the training, internal validation, and external validation sets. Apart from traditional logistic regression (LR), three ML models (random forest, support vector machine, and eXtreme Gradient Boosting [XGBoost]) were built and validated for their predictive ability with the area under the ROC curve (AUC). Algorithms such as SHapley Additive exPlanations (SHAP) and local interpretable model-agnostic explanations (LIME) were used to realize their interpretability. Results: The three ML models all outperformed LR (P < 0.001 for all) in predictive ability. When nine variables (tumor number, platelet, α-fetoprotein, comorbidity score, white blood cell, cholinesterase, prothrombin time, neutrophils, and etiology) were extracted simultaneously using recursive feature elimination with cross-validation, the XGBoost model achieved the best discrimination among all models, with an AUC value of 0.75 (95% CI [confidence interval]: 0.72–0.78) in the training set, 0.74 (95% CI: 0.69–0.80) in the internal validation set, and 0.76 (95% CI: 0.70–0.82) in the external validation set, and it was interpreted through the visualization of risk factors by the SHAP and LIME algorithms. The predictive system for post-ablation recurrence risk stratification, based on the XGBoost analysis, is provided online (http://114.251.235.51:8001/). Conclusion: The XGBoost model based on clinical data can effectively predict ER risk after MWA, which can contribute to surveillance, prevention, and treatment strategies for HCC.
Affiliation(s)
- Chao An
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Hongcai Yang
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- School of Medicine, Nankai University, Tianjin, People’s Republic of China
- Xiaoling Yu
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Zhi-Yu Han
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Zhigang Cheng
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Fangyi Liu
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Jianping Dou
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Bing Li
- National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, People’s Republic of China
- Yansheng Li
- DHC Mediway Technology CO, Ltd, Beijing, People’s Republic of China
- Yichao Li
- DHC Mediway Technology CO, Ltd, Beijing, People’s Republic of China
- Jie Yu
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Ping Liang
- Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China
- Correspondence: Ping Liang; Jie Yu, Department of Ultrasound, PLA Medical College & 5th Medical Center of Chinese PLA General Hospital, Beijing, 100853, People’s Republic of China, Tel +86-10-66939530, Fax +86-10-68161218
16
Bellin N, Tesi G, Marchesani N, Rossi V. Species distribution modeling and machine learning in assessing the potential distribution of freshwater zooplankton in Northern Italy. ECOL INFORM 2022. [DOI: 10.1016/j.ecoinf.2022.101682] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
17
Kim T, Lee D, Shin J, Kim Y, Cha Y. Learning hierarchical Bayesian networks to assess the interaction effects of controlling factors on spatiotemporal patterns of fecal pollution in streams. THE SCIENCE OF THE TOTAL ENVIRONMENT 2022; 812:152520. [PMID: 34953848 DOI: 10.1016/j.scitotenv.2021.152520] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2021] [Revised: 11/28/2021] [Accepted: 12/14/2021] [Indexed: 06/14/2023]
Abstract
The dynamics of fecal indicator bacteria, such as fecal coliforms (FC) in streams, are influenced by the interactions of a myriad of factors. To predict complex spatiotemporal patterns of FC in streams and assess the relative importance of numerous controlling factors, the adoption of a hierarchical Bayesian network (HBN) was proposed in this study. By introducing latent variables correlated to the observed variables into a Bayesian network, the HBN can represent causal relationships among a large set of variables with a multilevel hierarchy. The study area encompasses 215 sites across the watersheds of the four major rivers in South Korea. The monitoring data collected during the 2012-2019 period included 32 input variables pertaining to meteorology, geography, soil characteristics, land cover, urbanization index, livestock density, and point sources. As model endpoints, the exceedance probability of the FC standard concentration and two pollution characteristics (i.e., pollution degree and type) derived from FC load duration curves were used. The probability of exceeding an FC threshold value (200 CFU/100 mL) showed spatiotemporal variations, whereas pollution degree and type showed spatial variations that represent long-term severity and relative dominance of nonpoint and point source fecal pollution, respectively. The conceptual model was validated using structural equation modeling to develop the HBN. The results demonstrate that the HBN effectively simplified the model structure, while showing strong model performance (AUC = 0.81, accuracy = 0.74). The results of the sensitivity analysis indicate that land cover is the most important factor in predicting the probability of exceedance and pollution degree, whereas the urbanization index explains most of the variability in pollution type. Furthermore, the results of the scenario analysis suggest that the HBN provides an interpretable framework in which the interaction of controlling factors has causal relationships at different levels that can be identified and visualized.
Affiliation(s)
- TaeHo Kim
- School of Environment Engineering, University of Seoul, 163, Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
- DoYeon Lee
- School of Environment Engineering, University of Seoul, 163, Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
- Jihoon Shin
- School of Environment Engineering, University of Seoul, 163, Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
- YoungWoo Kim
- School of Environment Engineering, University of Seoul, 163, Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
- YoonKyung Cha
- School of Environment Engineering, University of Seoul, 163, Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea.
18
An Interpretable Machine Learning Model for Daily Global Solar Radiation Prediction. ENERGIES 2021. [DOI: 10.3390/en14217367] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/18/2023]
Abstract
Machine learning (ML) models are commonly used in solar modeling due to their high predictive accuracy. However, the predictions of these models are difficult to explain and trust. This paper aims to demonstrate the utility of two interpretation techniques to explain and improve the predictions of ML models. We first compared the predictive performance of Light Gradient Boosting (LightGBM) with three benchmark models, including multilayer perceptron (MLP), multiple linear regression (MLR), and support-vector regression (SVR), for estimating the global solar radiation (H) in the city of Fez, Morocco. Then, the predictions of the most accurate model were explained by two model-agnostic explanation techniques: permutation feature importance (PFI) and Shapley additive explanations (SHAP). The results indicated that LightGBM (R² = 0.9377, RMSE = 0.4827 kWh/m², MAE = 0.3614 kWh/m²) provided predictive accuracy similar to that of SVR and outperformed MLP and MLR in the testing stage. Both PFI and SHAP methods showed that extraterrestrial solar radiation (H0) and sunshine duration fraction (SF) are the two most important parameters that affect H estimation. Moreover, the SHAP method established how each feature influences the LightGBM estimations. The predictive accuracy of the LightGBM model was further improved slightly after re-examination of features, where the model combining H0, SF, and RH was better than the model with all features.