1
Abdi B, Kolo K, Shahabi H. Assessment of land degradation susceptibility within the Shaqlawa subregion of Northern Iraq-Kurdistan Region via synergistic application of remotely acquired datasets and advanced predictive models. ENVIRONMENTAL MONITORING AND ASSESSMENT 2024; 196:1103. [PMID: 39453413 DOI: 10.1007/s10661-024-13284-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Accepted: 10/16/2024] [Indexed: 10/26/2024]
Abstract
Land degradation (LD) is the decline in a land's functional capacity and productive potential, driven by various anthropogenic and natural factors. This study focuses on three primary manifestations of LD, namely soil erosion, landslides, and rockfalls, which are the most prevalent in the Shaqlawa district. A set of 22 LD conditioning factors, encompassing curvature, lithology, aspect, river density, soil type, lineament density, river distance, elevation, road distance, length slope (LS), land use land cover (LULC), stream power index (SPI), valley depth, profile curvature, slope, solar radiation, road density, lineament distance, rainfall, topographic wetness index (TWI), plan curvature, and normalized difference vegetation index (NDVI), were integrated into the analysis. Variance inflation factors (VIF) and tolerance (TOL) values from linear regression indicate that most LD factors have acceptable levels of multicollinearity. The Information Gain Ratio (IGR) identified TWI, NDVI, and lithology as pivotal factors for predicting LD. Additionally, the study evaluated degradation factors using various machine learning (ML) algorithms, including random forest (RF), Naive Bayes, logistic regression, rotation forest, forest penalized attributes (FPA), and Fisher's linear discriminant analysis (FLDA). This facilitated categorizing the study area into five susceptibility categories. The FLDA model assigned the largest share of the area, 26.72%, to the very high degradation risk category, emphasizing the varied insights each algorithm brought to characterizing the degradation risk. Additionally, receiver operating characteristic (ROC) curves were employed for model validation, identifying RF as the most successful model on the training dataset with an area under the curve (AUC) of 0.882, while FLDA performed best on the testing dataset with an AUC of 0.883.
The identified LD-prone areas will help land-use planners and emergency management officials apply effective mitigation strategies for similar terrains.
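The multicollinearity check mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions VIF = 1 / (1 - R^2) and TOL = 1 / VIF, and it uses only two hypothetical conditioning factors, so R^2 reduces to the squared Pearson correlation. The numbers are invented for illustration and are not data from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_and_tolerance(r_squared):
    """Return (VIF, TOL), where TOL = 1 - R^2 and VIF = 1 / TOL."""
    tol = 1.0 - r_squared
    return 1.0 / tol, tol

# Hypothetical values for two conditioning factors (not from the paper).
slope = [5.0, 12.0, 18.0, 25.0, 31.0, 40.0]
twi = [9.1, 8.0, 7.4, 6.1, 5.8, 4.2]

# With a single other predictor, R^2 of the auxiliary regression equals r^2.
r2 = pearson_r(slope, twi) ** 2
vif, tol = vif_and_tolerance(r2)
print(f"R^2={r2:.3f}  VIF={vif:.2f}  TOL={tol:.3f}")
```

A common rule of thumb treats a VIF above roughly 5 to 10 (tolerance below 0.2 to 0.1) as a sign of problematic collinearity; the abstract does not state which threshold the authors applied.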
Affiliation(s)
- Badeea Abdi
- Department of Petroleum Geoscience, Faculty of Science, Soran University, Soran, Erbil, Iraq.
- Kamal Kolo
- Department of Biogeosciences, Scientific Research Center, Soran University, Soran, Iraq
- Himan Shahabi
- Department of Geomorphology, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran
- Division of Geochronology and Environmental Isotopes, Institute of Physics, Silesian University of Technology, 44-100, Gliwice, Poland

2
Su J, Zhou P. Quantitative physics-physiology relationship modeling of human emotional response to Shu music. Front Psychol 2024; 15:1351058. [PMID: 39439756 PMCID: PMC11493695 DOI: 10.3389/fpsyg.2024.1351058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Accepted: 08/27/2024] [Indexed: 10/25/2024] Open
Abstract
Music perception is one of the most complex human neurophysiological phenomena evoked by sensory stimuli; it infers an internal representation of the structured events present in a piece of music and then forms long-term echoic memory for the music. An intrinsic relationship between the basic acoustic properties (physics) of music and the human emotional response (physiology) to the music is suggested, which can be statistically modeled and explained using a novel notion termed the quantitative physics-physiology relationship (QPPR). Here, we systematically analyzed the complex response profile of people to traditional/ancient music in the Shu area, a geographical region in Southwest China and one of the three major origins of the Chinese nation. Chill was used as an indicator to characterize the response strength of 18 subjects to an in-house compiled repertoire of 86 music samples, creating a systematic subject-to-sample response (SSTSR) profile consisting of 1,548 (18 × 86) paired chill elements. The multivariate statistical correlation of measured chill values with acoustic features and personal attributes was modeled using random forest (RF) regression in a supervised manner and compared with linear partial least squares (PLS) and non-linear support vector machine (SVM) models. The RF model possessed strong fitting ability ($r_F^2 = 0.857$), good generalization capability ($r_P^2 = 0.712$), and out-of-bag (OOB) predictability ($r_O^2 = 0.731$) compared to SVM and, particularly, PLS, suggesting that the RF-based QPPR approach is able to explain and predict the emotional change upon musical arousal. The results suggest an underlying relationship between the acoustic physical properties of music and the physiological reaction of the audience listening to it, in which rhythm contributes significantly to emotional response relative to timbre and pitch.
In addition, individual differences, characterized by personal attributes, are also responsible for the response, with gender and age being the most important.
Affiliation(s)
- Jun Su
- College of Music, Chengdu Normal University, Chengdu, China
- Peng Zhou
- Center for Informational Biology, Key Laboratory for NeuroInformation of Ministry of Education, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China

3
Latifi M, Beig Zali R, Javadi AA, Farmani R. Customised-sampling approach for pipe failure prediction in water distribution networks. Sci Rep 2024; 14:18224. [PMID: 39107389 PMCID: PMC11303377 DOI: 10.1038/s41598-024-69109-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2024] [Accepted: 07/31/2024] [Indexed: 08/10/2024] Open
Abstract
This paper presents a new methodology for addressing imbalanced class data for failure prediction in Water Distribution Networks (WDNs). The proposed methodology relies on existing approaches, including under-sampling, over-sampling, and class weighting, as primary strategies. These techniques aim to treat imbalanced datasets by adjusting the representation of minority and majority classes. Under-sampling reduces data in the majority class, over-sampling adds data to the minority class, and class weighting assigns unequal weights based on class counts to balance the influence of each class during machine learning (ML) model training. In this paper, these approaches were used at levels other than the "balance point" to construct pipe failure prediction models for a WDN with highly imbalanced data. F1-score and AUC-ROC were selected to evaluate model performance. Results revealed that under-sampling above the balance point yields the highest F1-score, while over-sampling below the balance point achieves optimal results. Employing class weights during training and prediction shows that weights lower than those at the balance point are more effective. Combining under-sampling and over-sampling to the same ratio for both majority and minority classes showed limited improvement. However, a more effective predictive model emerged when over-sampling the minority class and under-sampling the majority class to different ratios, followed by applying class weights to balance the data.
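The three strategies named in the abstract can be sketched in a few lines. This is a generic illustration under stated assumptions, not the authors' implementation: `resample` and `balanced_class_weights` are hypothetical helper names, the data are synthetic, and the weights use the common inverse-frequency formula n / (k * count). The example deliberately resamples to a 1:3 ratio rather than the balance point and then lets the class weights absorb the remaining imbalance.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def resample(samples, minority, n_minority, n_majority, seed=0):
    """Over-sample the minority class with replacement and under-sample the
    majority class without replacement.

    samples: list of (features, label) pairs for a binary problem.
    n_minority should be at least the current minority count, and
    n_majority at most the current majority count.
    """
    rng = random.Random(seed)
    minority_s = [s for s in samples if s[1] == minority]
    majority_s = [s for s in samples if s[1] != minority]
    n_extra = max(0, n_minority - len(minority_s))
    extra = [rng.choice(minority_s) for _ in range(n_extra)]
    kept = rng.sample(majority_s, n_majority)
    return minority_s + extra + kept

# Synthetic, highly imbalanced data: 95 intact pipes (0), 5 failures (1).
pipes = [((i,), 0) for i in range(95)] + [((i,), 1) for i in range(95, 100)]

resampled = resample(pipes, minority=1, n_minority=20, n_majority=60)
labels = [y for _, y in resampled]
weights = balanced_class_weights(labels)
print(Counter(labels), weights)
```

The resulting weights could be passed to any learner that accepts per-class weights; which ratios and weights work best is an empirical question that the paper investigates.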
Affiliation(s)
- Milad Latifi
- Centre for Water Systems, University of Exeter, Exeter, UK.
- Akbar A Javadi
- Centre for Water Systems, University of Exeter, Exeter, UK

4
Khan Z, Ali A, Khan DM, Aldahmani S. Regularized ensemble learning for prediction and risk factors assessment of students at risk in the post-COVID era. Sci Rep 2024; 14:16200. [PMID: 39003293 PMCID: PMC11246502 DOI: 10.1038/s41598-024-66894-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Accepted: 07/05/2024] [Indexed: 07/15/2024] Open
Abstract
The COVID-19 pandemic has had a significant impact on students' academic performance. The effects of the pandemic have varied among students, but some general trends have emerged. One of the primary challenges for students during the pandemic has been the disruption of their study habits. Students who became used to online learning routines might find it even more challenging to perform well in face-to-face learning. Therefore, assessing the potential risk factors associated with students' low performance, and predicting that performance, is important for early intervention. As students' performance data encompass diverse behaviors, standard machine learning methods struggle to extract insights that are useful for practical decision making and early interventions. Therefore, this research explores regularized ensemble learning methods for effectively analyzing students' performance data and reaching valid conclusions. To this end, three pruning strategies are implemented for the random forest method. These methods are based on out-of-bag sampling, sub-sampling, and sub-bagging. The pruning strategies discard trees that are adversely affected by unusual patterns in the students' data, forming forests of accurate and diverse trees. The methods are illustrated on an example dataset collected from university students currently studying on campus in a face-to-face modality, who studied through online learning during the COVID-19 pandemic. The suggested methods outperform all the other methods considered in this paper for predicting students at risk of academic failure. Moreover, various factors such as class attendance, student interaction, internet connectivity, and pre-requisite course(s) during the restrictions are identified as the most significant features.
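The idea of pruning a forest by out-of-bag performance can be sketched as follows. This is a simplified illustration that assumes each tree is scored only by its misclassification rate on the observations left out of its bootstrap sample; it covers the accuracy criterion only, not the diversity criterion or the sub-sampling and sub-bagging variants mentioned in the abstract. The function names and error values are hypothetical.

```python
def oob_error(oob_predictions, oob_truths):
    """Misclassification rate of one tree on its out-of-bag observations."""
    wrong = sum(p != t for p, t in zip(oob_predictions, oob_truths))
    return wrong / len(oob_truths)

def keep_best_trees(oob_errors, n_keep):
    """Indices of the n_keep trees with the lowest out-of-bag error."""
    ranked = sorted(range(len(oob_errors)), key=lambda i: oob_errors[i])
    return sorted(ranked[:n_keep])

# Hypothetical per-tree out-of-bag errors for a forest of ten trees.
errors = [0.30, 0.10, 0.25, 0.05, 0.40, 0.15, 0.20, 0.35, 0.12, 0.22]
kept = keep_best_trees(errors, n_keep=3)
print(kept)  # [1, 3, 8]
```

The pruned ensemble would then predict by majority vote over the kept trees only.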
Affiliation(s)
- Zardad Khan
- Department of Statistics and Business Analytics, United Arab Emirates University, Al Ain, UAE.
- Amjad Ali
- Department of Statistics and Business Analytics, United Arab Emirates University, Al Ain, UAE
- Dost Muhammad Khan
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Saeed Aldahmani
- Department of Statistics and Business Analytics, United Arab Emirates University, Al Ain, UAE.

5
Zamani MG, Nikoo MR, Al-Rawas G, Nazari R, Rastad D, Gandomi AH. Hybrid WT-CNN-GRU-based model for the estimation of reservoir water quality variables considering spatio-temporal features. JOURNAL OF ENVIRONMENTAL MANAGEMENT 2024; 358:120756. [PMID: 38599080 DOI: 10.1016/j.jenvman.2024.120756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 03/09/2024] [Accepted: 03/22/2024] [Indexed: 04/12/2024]
Abstract
Water quality indicators (WQIs), such as chlorophyll-a (Chl-a) and dissolved oxygen (DO), are crucial for understanding and assessing the health of aquatic ecosystems. Precise prediction of these indicators is fundamental for the efficient management of rivers, lakes, and reservoirs. This research used two deep learning (DL) algorithms, namely convolutional neural networks (CNNs) and gated recurrent units (GRUs), alongside their combination, CNN-GRU, to estimate the concentration of these indicators within a reservoir. Moreover, to optimize the outcomes of the developed hybrid model, we considered the impact of a decomposition technique, specifically the wavelet transform (WT). In addition, we implemented two machine learning (ML) algorithms, namely random forest (RF) and support vector regression (SVR), to demonstrate the superior performance of deep learning algorithms over individual ML ones. To achieve this, we first gathered WQIs from diverse locations and depths within the reservoir using an AAQ-RINKO device. It is important to highlight that, despite the use of diverse data-driven models in water quality estimation, a significant gap persists in the existing literature regarding the implementation of a comprehensive hybrid algorithm that integrates the wavelet transform, convolutional neural network (CNN), and gated recurrent unit (GRU) methodologies to estimate WQIs accurately within a spatiotemporal framework. Subsequently, the effectiveness of the developed models was assessed using various statistical metrics, encompassing the correlation coefficient (r), root mean square error (RMSE), mean absolute error (MAE), and Nash-Sutcliffe efficiency (NSE), throughout both the training and testing phases.
The findings demonstrated that the WT-CNN-GRU model performed better than the other algorithms by 13% (SVR), 13% (RF), 9% (CNN), and 8% (GRU) when R-squared was the evaluation index and DO the WQI considered.
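The four evaluation metrics named in the abstract have standard definitions, sketched here in plain Python. The observed and predicted values are invented for illustration and do not come from the study.

```python
import math

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical dissolved-oxygen values in mg/L (not from the paper).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print(rmse(observed, predicted), mae(observed, predicted))
print(nse(observed, predicted), pearson_r(observed, predicted))
```

An NSE of 1 indicates a perfect match, while values at or below 0 mean the model is no better than predicting the observed mean.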
Affiliation(s)
- Mohammad G Zamani
- Department of Civil and Architectural Engineering, Sultan Qaboos University, Muscat, Oman.
- Mohammad Reza Nikoo
- Department of Civil and Architectural Engineering, Sultan Qaboos University, Muscat, Oman.
- Ghazi Al-Rawas
- Department of Civil and Architectural Engineering, Sultan Qaboos University, Muscat, Oman.
- Rouzbeh Nazari
- Department of Civil, Construction, and Environmental Engineering, The University of Alabama, Alabama, USA.
- Dana Rastad
- Department of Civil and Environmental Engineering, Amirkabir University of Technology, Tehran, Iran.
- Amir H Gandomi
- Department of Engineering and I.T., University of Technology Sydney, Ultimo, NSW, 2007, Australia; University Research and Innovation Center (EKIK), Óbuda University, 1034, Budapest, Hungary.

6
Laabs BH, Westenberger A, König IR. Identification of representative trees in random forests based on a new tree-based distance measure. ADV DATA ANAL CLASSI 2023. [DOI: 10.1007/s11634-023-00537-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2023]
Abstract
In life sciences, random forests are often used to train predictive models. However, gaining any explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests into clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. Thus, we developed a new tree-based distance measure, which incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting variable approach. Our real data application revealed that representative trees are not only able to replicate the results from a recent genome-wide association study, but also can give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).
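The selection rule described in the abstract, choosing the tree with the minimal distance to all other trees, amounts to finding a medoid. The sketch below is written in Python for illustration only and does not reproduce the timbR package or the authors' weighted splitting variable measure. As a stand-in distance, it uses the mean squared difference between two trees' predictions on the same observations, and the prediction vectors are hypothetical.

```python
def prediction_distance(preds_a, preds_b):
    """Mean squared difference between two trees' predictions (stand-in metric)."""
    return sum((a - b) ** 2 for a, b in zip(preds_a, preds_b)) / len(preds_a)

def distance_matrix(tree_predictions):
    """Pairwise distances between all trees."""
    return [[prediction_distance(a, b) for b in tree_predictions]
            for a in tree_predictions]

def most_representative(distances):
    """Index of the tree with the smallest total distance to all other trees."""
    totals = [sum(row) for row in distances]
    return min(range(len(totals)), key=totals.__getitem__)

# Hypothetical predictions of three trees on the same three observations.
tree_predictions = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
    [3.0, 2.0, 5.0],
]

distances = distance_matrix(tree_predictions)
print(most_representative(distances))  # 1
```

Swapping in a structure-aware distance only requires replacing `prediction_distance`; the medoid step stays the same.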

7
Ay Ş, Ekinci E, Garip Z. A comparative analysis of meta-heuristic optimization algorithms for feature selection on ML-based classification of heart-related diseases. THE JOURNAL OF SUPERCOMPUTING 2023; 79:11797-11826. [PMID: 37304052 PMCID: PMC9983547 DOI: 10.1007/s11227-023-05132-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Accepted: 02/21/2023] [Indexed: 06/13/2023]
Abstract
This study aims to use a machine learning (ML)-based enhanced diagnosis and survival model to predict heart disease and survival in heart failure by combining the cuckoo search (CS), flower pollination algorithm (FPA), whale optimization algorithm (WOA), and Harris hawks optimization (HHO) algorithms, which are meta-heuristic feature selection algorithms. To achieve this, experiments are conducted on the Cleveland heart disease dataset and the heart failure dataset collected from the Faisalabad Institute of Cardiology, both published at UCI. The CS, FPA, WOA, and HHO algorithms for feature selection are applied with different population sizes and compared on the basis of their best fitness values. For the original heart disease dataset, the maximum prediction F-score of 88% is obtained using K-nearest neighbour (KNN) when compared to logistic regression (LR), support vector machine (SVM), Gaussian Naive Bayes (GNB), and random forest (RF). With the proposed approach, a heart disease prediction F-score of 99.72% is obtained using KNN with FPA for a population size of 60 by selecting eight features. For the original heart failure dataset, the maximum prediction F-score of 70% is obtained using LR and RF compared to SVM, GNB, and KNN. With the proposed approach, a heart failure prediction F-score of 97.45% is obtained using KNN with HHO for a population size of 10 by selecting five features. Experimental findings show that combining the applied meta-heuristic algorithms with ML algorithms significantly improves prediction performance compared to that obtained on the original datasets. The motivation of this paper is to select the most critical and informative feature subset through meta-heuristic algorithms to improve classification accuracy.
Affiliation(s)
- Şevket Ay
- Computer Engineering Department, Faculty of Technology, Sakarya University of Applied Sciences, Sakarya, 54187 Turkey
- Ekin Ekinci
- Computer Engineering Department, Faculty of Technology, Sakarya University of Applied Sciences, Sakarya, 54187 Turkey
- Zeynep Garip
- Computer Engineering Department, Faculty of Technology, Sakarya University of Applied Sciences, Sakarya, 54187 Turkey

8
Carino-Escobar RI, Alonso-Silverio GA, Alarcón-Paredes A, Cantillo-Negrete J. Feature-ranked self-growing forest: a tree ensemble based on structure diversity for classification and regression. Neural Comput Appl 2023. [DOI: 10.1007/s00521-023-08202-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]

9
Manzali Y, Akhiat Y, Chahhou M, Elmohajir M, Zinedine A. Reducing the number of trees in a forest using noisy features. EVOLVING SYSTEMS 2022. [DOI: 10.1007/s12530-022-09441-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]

10
Elsten T, de Rooij M. SUBiNN: a stacked uni- and bivariate kNN sparse ensemble. ADV DATA ANAL CLASSI 2021. [DOI: 10.1007/s11634-021-00462-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Nearest Neighbor classification is an intuitive distance-based classification method. It has, however, two drawbacks: (1) it is sensitive to the number of features, and (2) it does not give information about the importance of single features or pairs of features. In stacking, a set of base-learners is combined in one overall ensemble classifier by means of a meta-learner. In this manuscript we combine univariate and bivariate nearest neighbor classifiers that are by themselves easily interpretable. Furthermore, we combine these classifiers by a Lasso method that results in a sparse ensemble of nonlinear main and pairwise interaction effects. We christened the new method SUBiNN: Stacked Uni- and Bivariate Nearest Neighbors. SUBiNN overcomes the two drawbacks of simple nearest neighbor methods. In extensive simulations and using benchmark data sets, we evaluate the predictive performance of SUBiNN and compare it to other nearest neighbor ensemble methods as well as Random Forests and Support Vector Machines. Results indicate that SUBiNN often outperforms other nearest neighbor methods, that SUBiNN is well capable of identifying noise features, but that Random Forests is often, but not always, the best classifier.

11
Ali MH, Khan DM, Jamal K, Ahmad Z, Manzoor S, Khan Z. Prediction of Multidrug-Resistant Tuberculosis Using Machine Learning Algorithms in SWAT, Pakistan. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:2567080. [PMID: 34512933 PMCID: PMC8426057 DOI: 10.1155/2021/2567080] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/26/2021] [Accepted: 08/18/2021] [Indexed: 11/20/2022]
Abstract
In this paper, we have focused on machine learning (ML) feature selection (FS) algorithms for identifying and diagnosing multidrug-resistant (MDR) tuberculosis (TB). MDR-TB is a universal public health problem, and its early detection has been one of the burning issues. The present study has been conducted in the Malakand Division of Khyber Pakhtunkhwa, Pakistan, to further add to the knowledge on the disease and to deal with the issues of identification and early detection of MDR-TB by ML algorithms. These models also identify the most important factors causing MDR-TB infection, whose study gives additional insights into the matter. ML algorithms such as random forest, k-nearest neighbors, support vector machine, logistic regression, least absolute shrinkage and selection operator (LASSO), artificial neural networks (ANNs), and decision trees are applied to analyse the case-control dataset. This study reveals that close contacts of MDR-TB patients, smoking, depression, previous TB history, improper treatment, and interruption in first-line TB treatment have a great impact on the status of MDR. Accordingly, weight loss, chest pain, hemoptysis, and fatigue are important symptoms. Based on accuracy, sensitivity, and specificity, SVM and RF are the suggested models to be used for patient classification.
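The three criteria used to compare the models follow directly from the confusion matrix of a binary classifier. The sketch below uses invented labels, with 1 standing for an MDR-TB case and 0 for a control; it is not the study's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def evaluate(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical labels: 4 cases and 6 controls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

scores = evaluate(y_true, y_pred)
print(scores)
```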
Affiliation(s)
- Mian Haider Ali
- Department of Statistics, Abdul Wali Khan University, Mardan, Pakistan
- Programmatic Management of Drug-Resistant Tuberculosis, Saidu Teaching Hospital, Swat, Pakistan
- Khalid Jamal
- Programmatic Management of Drug-Resistant Tuberculosis, Saidu Teaching Hospital, Swat, Pakistan
- Zubair Ahmad
- Department of Statistics, Yazd University, P.O. Box 89175-741, Yazd, Iran
- Sadaf Manzoor
- Department of Statistics, Islamia College Peshawar, Peshawar, Pakistan
- Zardad Khan
- Department of Statistics, Abdul Wali Khan University, Mardan, Pakistan

12
Hamraz M, Gul N, Raza M, Khan DM, Khalil U, Zubair S, Khan Z. Robust proportional overlapping analysis for feature selection in binary classification within functional genomic experiments. PeerJ Comput Sci 2021; 7:e562. [PMID: 34141889 PMCID: PMC8176540 DOI: 10.7717/peerj-cs.562] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/10/2020] [Accepted: 05/04/2021] [Indexed: 05/10/2023]
Abstract
This paper proposes a novel feature selection method for microarray gene expression datasets, called Robust Proportional Overlapping Score (RPOS), which uses a robust measure of dispersion, the median absolute deviation (MAD). The method robustly identifies the most discriminative genes by considering the overlapping scores of the gene expression values for binary class problems. Genes with a high degree of overlap between classes are discarded and the ones that discriminate between the classes are selected. The results of the proposed method are compared with five state-of-the-art gene selection methods based on classification error, Brier score, and sensitivity, using eleven gene expression datasets. Observations are classified for different sets of genes selected by the proposed method using three classifiers: random forest, k-nearest neighbors (k-NN), and support vector machine (SVM). Box-plots and stability scores of the results are also shown in this paper. The results reveal that in most cases the proposed method outperforms the other methods.
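To show why a MAD-based interval is robust, the sketch below computes the median absolute deviation for one gene in two classes and measures how much the intervals median ± MAD overlap. This is an illustration of the underlying idea only, not the RPOS definition from the paper; the interval construction, the helper names, and the expression values are all assumptions.

```python
from statistics import median

def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    m = median(values)
    return median([abs(v - m) for v in values])

def robust_interval(values):
    """Interval median - MAD to median + MAD, insensitive to a few outliers."""
    m, d = median(values), mad(values)
    return m - d, m + d

def interval_overlap(a, b):
    """Length of the overlap between two closed intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

# Hypothetical expression values of one gene; class A contains an outlier.
class_a = [1.0, 2.0, 3.0, 4.0, 100.0]
class_b = [3.5, 4.5, 5.5, 6.5, 7.5]

ia, ib = robust_interval(class_a), robust_interval(class_b)
print(ia, ib, interval_overlap(ia, ib))
```

The outlier 100.0 leaves the class A interval at (2.0, 4.0), whereas an interval built from the raw range or the standard deviation would stretch far into class B and make the gene look uninformative.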
Affiliation(s)
- Muhammad Hamraz
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Naz Gul
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Mushtaq Raza
- Department of Computer Sciences, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Dost Muhammad Khan
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Umair Khalil
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Seema Zubair
- Department of Mathematics, Statistics and Computer Science, University of Agriculture Peshawar, Peshawar, Pakistan
- Zardad Khan
- Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan

13
Identification of the Framingham Risk Score by an Entropy-Based Rule Model for Cardiovascular Disease. ENTROPY 2020; 22:e22121406. [PMID: 33322122 PMCID: PMC7764435 DOI: 10.3390/e22121406] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/24/2020] [Revised: 11/30/2020] [Accepted: 12/11/2020] [Indexed: 12/12/2022]
Abstract
Since 2001, cardiovascular disease (CVD) has had the second-highest mortality rate in Taiwan, about 15,700 people per year. It has thus imposed a substantial burden on medical resources. This study was triggered by the following three factors. First, the CVD problem is an urgent issue. A high priority has been placed on long-term therapy and prevention to reduce the wastage of medical resources, particularly in developed countries. Second, from the perspective of preventive medicine, popular data-mining methods have been well studied and have shown excellent performance in medical fields. Thus, identification of the risk factors of CVD using these popular techniques is a prime concern. Third, the Framingham risk score is a core indicator that can be used to establish an effective prediction model to accurately diagnose CVD. Thus, this study proposes an integrated predictive model to organize five notable classifiers: the rough set (RS), decision tree (DT), random forest (RF), multilayer perceptron (MLP), and support vector machine (SVM), with a novel use of the Framingham risk score for attribute selection (i.e., F-attributes first identified in this study) to determine the key features for identifying CVD. Verification experiments were conducted with three evaluation criteria (accuracy, sensitivity, and specificity) based on 1190 instances of a CVD dataset available from a Taiwan teaching hospital and 2019 examples from a public Framingham dataset. Given the empirical results, the SVM showed the best performance in terms of accuracy (99.67%), sensitivity (99.93%), and specificity (99.71%) in all F-attributes in the CVD dataset compared to the other listed classifiers. The RS showed the highest performance in terms of accuracy (85.11%), sensitivity (86.06%), and specificity (85.19%) in most of the F-attributes in the Framingham dataset.
The above study results support novel evidence that no classifier or model is suitable for all practical datasets of medical applications. Thus, identifying an appropriate classifier to address specific medical data is important. Significantly, this study is novel in its calculation and identification of the use of key Framingham risk attributes integrated with the DT technique to produce entropy-based decision rules of knowledge sets, which has not been undertaken in previous research. This study conclusively yielded meaningful entropy-based knowledgeable rules in tree structures and contributed to the differentiation of classifiers from the two datasets with three useful research findings and three helpful management implications for subsequent medical research. In particular, these rules provide reasonable solutions to simplify processes of preventive medicine by standardizing the formats and codes used in medical data to address CVD problems. The specificity of these rules is thus significant compared to those of past research.
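Entropy-based decision rules of the kind mentioned in the abstract rest on Shannon entropy and the information gain of a split. The sketch below shows these two standard quantities on invented labels, where 1 marks a CVD case and the split is by a hypothetical binary attribute; it is not the study's rule model or data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction obtained by splitting labels into the given groups."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Hypothetical outcome labels, split by a binary risk attribute.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
attribute_present = [1, 1, 1, 0]
attribute_absent = [0, 0, 0, 0]

gain = information_gain(labels, [attribute_present, attribute_absent])
print(f"parent entropy={entropy(labels):.3f}  gain={gain:.3f}")
```

A decision tree learner of this family would pick, at each node, the attribute with the largest gain (or gain ratio) and read rules off the resulting paths.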