1
Mishra M, Acharjya DP. A hybridized red deer and rough set clinical information retrieval system for hepatitis B diagnosis. Sci Rep 2024; 14:3815. [PMID: 38360918 PMCID: PMC10869783 DOI: 10.1038/s41598-024-53170-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 01/29/2024] [Indexed: 02/17/2024] Open
Abstract
Healthcare is a major concern for a rapidly growing population. Many approaches to improving health are in use, such as early disease identification, treatment, and prevention. Knowledge acquisition is therefore essential at different stages of decision-making. One technique for addressing this problem is to infer knowledge from an information system, which requires multiple steps to extract useful information. Handling uncertainty throughout data analysis is another challenging task. Computational intelligence is a step forward to this end for feature selection, classification, clustering, and the development of clinical information retrieval systems. According to recent studies, swarm optimization is a useful technique for discovering key features when solving real-world problems. However, it is ineffective at managing uncertainty. Conversely, a rough set helps a decision system generate decision rules, and it does so without any additional information. To assess real-world information systems while managing uncertainty, this research presents a hybrid strategy that combines a rough set with the red deer algorithm. Within the red deer optimization algorithm, the suggested method selects the optimal features according to the rough set degree of dependency. A rough set is then used to derive the decision rules. The efficiency of the suggested model is compared with that of the decision tree algorithm and the conventional rough set. An empirical study on hepatitis disease illustrates the viability of the proposed research compared with the decision tree and the crisp rough set. The proposed hybridization of rough set and red deer algorithm achieves an accuracy of 91.7%, whereas the decision tree and rough set methods achieve 82.9% and 88.9%, respectively. This suggests that the proposed research is viable.
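The abstract does not spell out how the rough set degree of dependency is computed. As a minimal sketch, assuming the standard Pawlak definition (the share of objects whose condition-attribute equivalence class maps to a single decision value), it could look like the following. The attribute names and the toy decision table are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def dependency_degree(rows, condition_attrs, decision_attr):
    """Rough set dependency: |positive region| / |universe|.

    An object belongs to the positive region when every object that is
    indiscernible from it on `condition_attrs` has the same decision value.
    """
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        classes[key].append(row[decision_attr])
    positive = sum(len(d) for d in classes.values() if len(set(d)) == 1)
    return positive / len(rows)


# Hypothetical toy decision table (illustration only).
rows = [
    {"fatigue": "yes", "alt": "high", "hbv": "pos"},
    {"fatigue": "yes", "alt": "high", "hbv": "pos"},
    {"fatigue": "no", "alt": "high", "hbv": "pos"},
    {"fatigue": "no", "alt": "low", "hbv": "neg"},
    {"fatigue": "yes", "alt": "low", "hbv": "neg"},
    {"fatigue": "yes", "alt": "low", "hbv": "pos"},
]

for subset in (["fatigue"], ["alt"], ["fatigue", "alt"]):
    print(subset, dependency_degree(rows, subset, "hbv"))
```

A feature-selection search, whether the red deer algorithm or any other metaheuristic, could use a score of this kind as its fitness and prefer small attribute subsets with high dependency.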
Affiliation(s)
- Madhusmita Mishra
- Vellore Institute of Technology, School of Computer Science and Engineering, Vellore, 632014, India
- D P Acharjya
- Vellore Institute of Technology, School of Computer Science and Engineering, Vellore, 632014, India
2
Vallée R, Vallée JN, Guillevin C, Lallouette A, Thomas C, Rittano G, Wager M, Guillevin R, Vallée A. Machine learning decision tree models for multiclass classification of common malignant brain tumors using perfusion and spectroscopy MRI data. Front Oncol 2023; 13:1089998. [PMID: 37614505 PMCID: PMC10442801 DOI: 10.3389/fonc.2023.1089998] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Accepted: 07/17/2023] [Indexed: 08/25/2023] Open
Abstract
Background: To investigate the contribution of machine learning decision tree models applied to perfusion and spectroscopy MRI for multiclass classification of lymphomas, glioblastomas, and metastases, and then to bring out the underlying key pathophysiological processes involved in the hierarchization of the decision-making algorithms of the models. Methods: From 2013 to 2020, 180 consecutive patients with histopathologically proven lymphomas (n = 77), glioblastomas (n = 45), and metastases (n = 58) were included in machine learning analysis after undergoing MRI. The perfusion parameters (rCBVmax, PSRmax) and spectroscopic concentration ratios (lac/Cr, Cho/NAA, Cho/Cr, and lip/Cr) were used to construct Classification and Regression Tree (CART) models for multiclass classification of these brain tumors. A 5-fold random cross validation was performed on the dataset. Results: The decision tree model thus constructed successfully classified all 3 tumor types with a performance (AUC) of 0.98 for PCNSLs, 0.98 for GBM and 1.00 for METs. The model accuracy was 0.96 with an RSquare of 0.887. Five rules of classifier combinations were extracted with a predicted probability from 0.907 to 0.989 for the end nodes of the decision tree for tumor multiclass classification. In hierarchical order of importance, the root node (Cho/NAA) in the decision tree algorithm was primarily based on the proliferative, infiltrative, and neuronal destructive characteristics of the tumor, the internal node (PSRmax) on tumor tissue capillary permeability characteristics, and the end node (Lac/Cr or Cho/Cr) on tumor energy glycolytic metabolism (Warburg effect) or on membrane lipid tumor metabolism. Conclusion: Our study shows potential implementation of machine learning decision tree model algorithms based on a hierarchical, convenient, and personalized use of perfusion and spectroscopy MRI data for multiclass classification of these brain tumors.
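The abstract does not state which splitting criterion the CART models used. As a rough illustration of how a CART-style tree could pick a threshold on one continuous feature, the sketch below assumes the Gini impurity commonly used for classification trees. The Cho/NAA values and labels are hypothetical and are not data from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split value <= t."""
    best_t, best_score = None, float("inf")
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(ordered, ordered[1:]):
        t = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical Cho/NAA ratios and tumor labels (illustration only).
cho_naa = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
tumor = ["MET", "MET", "MET", "GBM", "GBM", "GBM"]
print(best_threshold(cho_naa, tumor))
```

Applying the same search recursively to each child node, over all candidate features, yields the kind of root, internal, and end node hierarchy the abstract describes.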
Affiliation(s)
- Rodolphe Vallée
- Interdisciplinary Laboratory in Neurosciences, Physiology and Psychology (LINP2), Université Paris Lumière (UPL), Paris Nanterre University, Nanterre, France
- Laboratory of Mathematics and Applications (LMA) Centre National de la Recherche Scientifique - Unité Mixte de Recherche (CNRS UMR)7348, i3M-DACTIM-MIH (Data Analysis and Computations Through Imaging Modeling - Mathematics, Image, Health), Poitiers University, Poitiers, France
- Glaucoma Research Center, Swiss Visio Network, Lausanne, Switzerland
- Jean-Noël Vallée
- Laboratory of Mathematics and Applications (LMA) Centre National de la Recherche Scientifique - Unité Mixte de Recherche (CNRS UMR)7348, i3M-DACTIM-MIH (Data Analysis and Computations Through Imaging Modeling - Mathematics, Image, Health), Poitiers University, Poitiers, France
- Diagnostic and Functional Neuroradiology and Brain stimulation Department, 15-20 National Vision Hospital of Paris - Paris University Hospital Center, University of PARIS-SACLAY - UVSQ, Paris, France
- Carole Guillevin
- Laboratory of Mathematics and Applications (LMA) Centre National de la Recherche Scientifique - Unité Mixte de Recherche (CNRS UMR)7348, i3M-DACTIM-MIH (Data Analysis and Computations Through Imaging Modeling - Mathematics, Image, Health), Poitiers University, Poitiers, France
- Radiology Department, Poitiers University Hospital, Poitiers University, Poitiers, France
- Clément Thomas
- Laboratory of Mathematics and Applications (LMA) Centre National de la Recherche Scientifique - Unité Mixte de Recherche (CNRS UMR)7348, i3M-DACTIM-MIH (Data Analysis and Computations Through Imaging Modeling - Mathematics, Image, Health), Poitiers University, Poitiers, France
- Diagnostic and Functional Neuroradiology and Brain stimulation Department, 15-20 National Vision Hospital of Paris - Paris University Hospital Center, University of PARIS-SACLAY - UVSQ, Paris, France
- Michel Wager
- Neurosurgery Department, Poitiers University Hospital, Poitiers University, Poitiers, France
- Rémy Guillevin
- Laboratory of Mathematics and Applications (LMA) Centre National de la Recherche Scientifique - Unité Mixte de Recherche (CNRS UMR)7348, i3M-DACTIM-MIH (Data Analysis and Computations Through Imaging Modeling - Mathematics, Image, Health), Poitiers University, Poitiers, France
- Radiology Department, Poitiers University Hospital, Poitiers University, Poitiers, France
- Alexandre Vallée
- Department of Epidemiology and Public Health, Foch Hospital, Suresnes, France
3
Vallée A. Arterial stiffness and biological parameters: A decision tree machine learning application in hypertensive participants. PLoS One 2023; 18:e0288298. [PMID: 37418473 DOI: 10.1371/journal.pone.0288298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Accepted: 06/23/2023] [Indexed: 07/09/2023] Open
Abstract
Arterial stiffness, measured by arterial stiffness index (ASI), could be considered a main denominator in target organ damage among hypertensive subjects. Currently, no normal reference values for ASI have been reported. Arterial stiffness is evaluated by calculating a stiffness index. Predicted ASI can be estimated with regard to age, sex, mean blood pressure, and heart rate, to compose an individual stiffness index [(measured ASI - predicted ASI)/predicted ASI]. A stiffness index greater than zero defines arterial stiffness. Thus, the purpose of this study was 1) to identify the determinants of stiffness index, 2) to establish threshold values to discriminate stiffness index, and then 3) to determine hierarchical associations of the determinants by building a decision tree model among hypertensive participants without CV diseases. A study was conducted on 53,363 healthy participants in the UK Biobank survey to determine predicted ASI. Stiffness index was applied to 49,452 hypertensives without CV diseases to discriminate determinants of positive stiffness index (N = 22,453) from negative index (N = 26,999). The input variables for the models were clinical and biological parameters. The independent classifiers were ranked from the most sensitive (HDL cholesterol ≤ 1.425 mmol/L, smoking ≥ 9.2 pack-years, phosphate ≥ 1.172 mmol/L) to the most specific (cystatin c ≤ 0.901 mg/L, triglycerides ≥ 1.487 mmol/L, urate ≥ 291.9 μmol/L, ALT ≥ 22.13 U/L, AST ≤ 32.5 U/L, albumin ≤ 45.92 g/L, testosterone ≥ 5.181 nmol/L). A decision tree model was built to determine rules that highlight the hierarchization of and interactions between these classifiers, with a higher performance than multiple logistic regression (p<0.001). The stiffness index could be an integrator of CV risk factors and contribute to future CV risk management evaluations for preventive strategies. Decision trees can provide accurate and useful classification for clinicians.
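The individual stiffness index defined in this abstract is a relative deviation of measured from predicted ASI. A minimal sketch of that arithmetic follows; the function names and the example values are hypothetical, and the model that produces the predicted ASI is not reproduced here.

```python
def stiffness_index(measured_asi, predicted_asi):
    """(measured ASI - predicted ASI) / predicted ASI, as stated in the abstract."""
    if predicted_asi == 0:
        raise ValueError("predicted ASI must be non-zero")
    return (measured_asi - predicted_asi) / predicted_asi


def has_arterial_stiffness(measured_asi, predicted_asi):
    """The abstract defines arterial stiffness as a stiffness index greater than zero."""
    return stiffness_index(measured_asi, predicted_asi) > 0


# Hypothetical values in m/s (illustration only).
print(stiffness_index(11.0, 10.0), has_arterial_stiffness(11.0, 10.0))
print(stiffness_index(9.0, 10.0), has_arterial_stiffness(9.0, 10.0))
```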
Affiliation(s)
- Alexandre Vallée
- Department of Epidemiology and Public Health, Foch hospital, Suresnes, France
4
Chen N, Fan F, Geng J, Yang Y, Gao Y, Jin H, Chu Q, Yu D, Wang Z, Shi J. Evaluating the risk of hypertension in residents in primary care in Shanghai, China with machine learning algorithms. Front Public Health 2022; 10:984621. [PMID: 36267989 PMCID: PMC9577109 DOI: 10.3389/fpubh.2022.984621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 09/12/2022] [Indexed: 01/25/2023] Open
Abstract
Objective: The prevention of hypertension in primary care requires an effective and suitable hypertension risk assessment model. The aim of this study was to develop and compare the performances of three machine learning algorithms in predicting the risk of hypertension for residents in primary care in Shanghai, China. Methods: A dataset of 40,261 subjects over the age of 35 years was extracted from Electronic Healthcare Records of 47 community health centers from 2017 to 2019 in the Pudong district of Shanghai. Embedded methods were applied for feature selection. Machine learning algorithms, XGBoost, random forest, and logistic regression analyses were adopted in the process of model construction. The performance of models was evaluated by calculating the area under the receiver operating characteristic curve, sensitivity, specificity, positive predictive value, negative predictive value, accuracy and F1-score. Results: The XGBoost model outperformed the other two models and achieved an AUC of 0.765 in the testing set. Twenty features were selected to construct the model, including age, diabetes status, urinary protein level, BMI, elderly health self-assessment, creatinine level, systolic blood pressure measured on the upper right arm, waist circumference, smoking status, low-density lipoprotein cholesterol level, high-density lipoprotein cholesterol level, frequency of drinking, glucose level, urea nitrogen level, total cholesterol level, diastolic blood pressure measured on the upper right arm, exercise frequency, time spent engaged in exercise, high salt consumption, and triglyceride level. Conclusions: XGBoost outperformed random forest and logistic regression in predicting the risk of hypertension in primary care. The integration of this risk assessment model into primary care facilities may improve the prevention and management of hypertension in residents.
Affiliation(s)
- Ning Chen
- School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Feng Fan
- School of Medicine, Tongji University, Shanghai, China
- Jinsong Geng
- School of Medicine, Nantong University, Nantong, China
- Yan Yang
- School of Economics and Management, Tongji University, Shanghai, China
- Ya Gao
- School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Hua Jin
- Department of General Practice, Yangpu Hospital, Tongji University School of Medicine, Shanghai, China; Shanghai General Practice and Community Health Development Research Center, Shanghai, China; Academic Department of General Practice, Tongji University School of Medicine, Shanghai, China; Clinical Research Center for General Practice, Tongji University, Shanghai, China
- Qiao Chu
- School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Dehua Yu
- Department of General Practice, Yangpu Hospital, Tongji University School of Medicine, Shanghai, China; Shanghai General Practice and Community Health Development Research Center, Shanghai, China; Academic Department of General Practice, Tongji University School of Medicine, Shanghai, China; Clinical Research Center for General Practice, Tongji University, Shanghai, China
- *Correspondence: Dehua Yu
- Zhaoxin Wang
- The First Affiliated Hospital of Hainan Medical University, Haikou, China; Department of Social Medicine and Health Management, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China; School of Management, Hainan Medical University, Haikou, China
- Jianwei Shi
- Department of General Practice, Yangpu Hospital, Tongji University School of Medicine, Shanghai, China; Shanghai General Practice and Community Health Development Research Center, Shanghai, China; Department of Social Medicine and Health Management, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
5
Constructing Explainable Classifiers from the Start—Enabling Human-in-the Loop Machine Learning. INFORMATION 2022. [DOI: 10.3390/info13100464] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
Abstract
Interactive machine learning (IML) enables the incorporation of human expertise because the human participates in the construction of the learned model. Moreover, with human-in-the-loop machine learning (HITL-ML), the human experts drive the learning, and they can steer the learning objective not only for accuracy but perhaps for characterisation and discrimination rules, where separating one class from others is the primary objective. This interaction also enables humans to explore and gain insights into the dataset as well as validate the learned models. Validation requires transparency and interpretable classifiers. The huge relevance of understandable classification has been recently emphasised for many applications under the banner of explainable artificial intelligence (XAI). We use parallel coordinates to deploy an IML system that enables not only the visualisation of decision tree classifiers but also the generation of interpretable splits beyond parallel axis splits. Moreover, we show that characterisation and discrimination rules are also well communicated using parallel coordinates. In particular, we report results from the largest usability study of an IML system, confirming the merits of our approach.
6
Vallée A. Association between serum uric acid and arterial stiffness in a large-aged 40-70 years old population. J Clin Hypertens (Greenwich) 2022; 24:885-897. [PMID: 35748644 PMCID: PMC9278596 DOI: 10.1111/jch.14527] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 04/25/2022] [Revised: 05/25/2022] [Accepted: 05/28/2022] [Indexed: 12/24/2022]
Abstract
Arterial stiffness (AS), measured by arterial stiffness index (ASI), is a determinant in cardiovascular (CV) diseases. A high serum uric acid (SUA) level is a known risk factor for CV disease. The authors investigated the relationship between SUA and ASI in the middle-aged UK Biobank population study. AS was defined as ASI > 10 m/s. A cross-sectional study was conducted on 126 663 participants. Participants were divided into four quartiles according to SUA levels and sex. Multivariate analyses were performed by sex with adjustment for confounding factors. The average ASI for overall participants was 9.3 m/s (SD: 2.9); 9.9 m/s (SD: 2.8) for men and 8.7 m/s (SD: 2.9) for women (P < .001). Men presented higher SUA levels (351.3 mmol/L (SD:67.9)) than women (270.7 mmol/L (SD:64.4)), P < .001. In the multivariate analysis for men, SUA remained a determinant of AS, with an increase in the strength of the association between the quartiles, Q4 versus Q1, OR = 1.10 [1.05-1.16], P < .001, Q3 versus Q1, OR = 1.09 [1.04-1.14], P < .001 but not between Q2 and Q1 (P = .136). In women, SUA remained significant for AS, with an increase in the strength of the association between the quartiles, Q4 versus Q1, OR = 1.22 [1.15-1.30], P < .001, Q3 versus Q1, OR = 1.13 [1.07-1.19], P < .001 and no difference between Q2 and Q1 (P = .101). When applying continuous SUA values in the multivariate analysis, SUA remained significant (P < .001), with a Youden index value for men = 338.3 mmol/L and for women = 267.3 mmol/L. High SUA levels were associated with AS, suggesting that SUA could be used as a predictor of atherosclerosis.
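The SUA cut-offs above are reported as Youden index values. As a minimal sketch of how such a cut-off could be located, the code below scans candidate thresholds and keeps the one maximising Youden's J = sensitivity + specificity - 1. The marker values and outcome labels are hypothetical, and the rule "value >= threshold predicts a positive" is an assumption made for illustration.

```python
def youden_threshold(values, labels):
    """Return (threshold, J) maximising sensitivity + specificity - 1.

    `labels` holds 1 for cases and 0 for non-cases; a value >= threshold
    is treated as a positive prediction. Ties keep the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    for t in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if v >= t and y == 1)
        tn = sum(1 for v, y in zip(values, labels) if v < t and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical marker values and outcomes (illustration only).
marker = [250, 280, 300, 320, 340, 360, 380, 400]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
print(youden_threshold(marker, outcome))
```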
Affiliation(s)
- Alexandre Vallée
- Department of Epidemiology-Data-Biostatistics, Delegation of Clinical Research and Innovation (DRCI), Foch hospital, Suresnes, France
7
Vallée A. Arterial Stiffness Determinants for Primary Cardiovascular Prevention among Healthy Participants. J Clin Med 2022; 11:jcm11092512. [PMID: 35566636 PMCID: PMC9105622 DOI: 10.3390/jcm11092512] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/31/2022] [Revised: 04/13/2022] [Accepted: 04/27/2022] [Indexed: 12/27/2022] Open
Abstract
Background: Arterial stiffness (AS), measured by arterial stiffness index (ASI), can be considered as a major denominator in cardiovascular (CV) diseases. Thus, it remains essential to highlight the risk factors influencing its increase among healthy participants. Methods: According to European consensus, AS is defined as ASI > 10 m/s. The purpose of this study was to investigate the determinants of arterial stiffness (ASI > 10 m/s) among UK Biobank normotensive and healthy participants without comorbidities and previous CV diseases. Thus, a cross-sectional study was conducted on 22,452 healthy participants. Results: Participants were divided into two groups, i.e., ASI > 10 m/s (n = 5782, 25.8%) and ASI < 10 m/s (n = 16,670, 74.2%). All the significant univariate covariables were included in the multivariate analysis. The remaining independent factors associated with AS were age (OR = 1.063, threshold = 53.0 years, p < 0.001), BMI (OR = 1.0450, threshold = 24.9 kg/m2, p < 0.001), cystatin c (OR = 1.384, threshold = 0.85 mg/L, p = 0.011), phosphate (OR = 2.225, threshold = 1.21 mmol/L, p < 0.001), triglycerides (OR = 1.281, threshold = 1.09 mmol/L, p < 0.001), mean BP (OR = 1.028, threshold = 91.2 mmHg, p < 0.001), HR (OR = 1.007, threshold = 55 bpm, p < 0.001), alkaline phosphatase (OR = 1.002, threshold = 67.9 U/L, p = 0.004), albumin (OR = 0.973, threshold = 46.0 g/L, p < 0.001), gender (male, OR = 1.657, p < 0.001) and tobacco use (current, OR = 1.871, p < 0.001). Conclusion: AS is associated with multiple parameters which should be investigated in future prospective studies. Determining the markers of increased ASI among healthy participants contributes to the management of future CV risk for preventive strategies.
Affiliation(s)
- Alexandre Vallée
- Department of Epidemiology-Data-Biostatistics, Delegation of Clinical Research and Innovation (DRCI), Foch Hospital, 92150 Suresnes, France
8
Ji W, Xue M, Zhang Y, Yao H, Wang Y. A Machine Learning Based Framework to Identify and Classify Non-alcoholic Fatty Liver Disease in a Large-Scale Population. Front Public Health 2022; 10:846118. [PMID: 35444985 PMCID: PMC9013842 DOI: 10.3389/fpubh.2022.846118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/30/2021] [Accepted: 02/23/2022] [Indexed: 12/12/2022] Open
Abstract
Non-alcoholic fatty liver disease (NAFLD) is a common and serious health problem worldwide that lacks efficient medical treatment. We aimed to develop and validate machine learning (ML) models that could be used for accurate screening of large numbers of people. This paper included 304,145 adults who took part in the national physical examination and used their questionnaire and physical measurement parameters as the models' candidate covariates. The least absolute shrinkage and selection operator (LASSO) was used for feature selection from the candidate covariates; four ML algorithms were then used to build the screening model for NAFLD, and the classifier with the best performance was used to output the importance score of each covariate in NAFLD. Among the four ML algorithms, XGBoost had the best performance (accuracy = 0.880, precision = 0.801, recall = 0.894, F-1 = 0.882, and AUC = 0.951), and the covariates ranked by importance were BMI, age, waist circumference, gender, type 2 diabetes, gallbladder disease, smoking, hypertension, dietary status, physical activity, oil-loving and salt-loving. ML classifiers could help medical agencies achieve early identification and classification of NAFLD, which is particularly useful for economically disadvantaged areas, and the importance ranking of the covariates will be helpful for the prevention and treatment of NAFLD.
Affiliation(s)
- Weidong Ji
- Department of Medical Information, Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, China
- Mingyue Xue
- Hospital of Traditional Chinese Medicine Affiliated to the Fourth Clinical Medical College of Xinjiang Medical University, Urumqi, China
- Yushan Zhang
- Department of Maternal and Child Health, School of Public Health, Sun Yat-sen University, Guangzhou, China
- Hua Yao
- Center of Health Management, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
- Yushan Wang
- Center of Health Management, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, China
- *Correspondence: Yushan Wang
9
Haouassi H, Mahdaoui R, Chouhal O, Bakhouche A. An efficient classification rule generation for coronary artery disease diagnosis using a novel discrete equilibrium optimizer algorithm. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2022. [DOI: 10.3233/jifs-213257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Many machine learning-based methods have been widely applied to Coronary Artery Disease (CAD) and are achieving high accuracy. However, they are black-box methods that are unable to explain the reasons behind the diagnosis. The trade-off between accuracy and interpretability of diagnosis models is important, especially for human disease. This work proposes an approach for generating rule-based models for CAD diagnosis. Classification rule generation is modeled as a combinatorial optimization problem that can be solved by means of metaheuristic algorithms. Swarm intelligence algorithms like the Equilibrium Optimizer Algorithm (EOA) have demonstrated great performance in solving different optimization problems. Our present study introduces a Novel Discrete Equilibrium Optimizer Algorithm (NDEOA) for classification rule generation from a training CAD dataset. The proposed NDEOA is a discrete version of EOA, which uses a discrete encoding of a particle to represent a classification rule; new discrete operators are also defined for the particle's position update equation to adapt real operators to discrete space. To evaluate the proposed approach, the real-world Z-Alizadeh Sani dataset has been employed. The proposed approach generates a diagnosis model composed of 17 rules: five rules for the class "Normal" and 12 rules for the class "CAD". In comparison to nine black-box and eight white-box state-of-the-art approaches, the results show that the diagnosis model generated by the proposed approach is more accurate and more interpretable than all white-box models and is competitive with the black-box models. It achieved an overall accuracy, sensitivity and specificity of 93.54%, 80% and 100% respectively, which shows that the proposed approach can be successfully utilized to generate efficient rule-based CAD diagnosis models.
Affiliation(s)
- Hichem Haouassi
- Department of Mathematics and Computer Science, ICOSI Lab, University Abbas Laghrour, Khenchela, Algeria
- Rafik Mahdaoui
- Department of Mathematics and Computer Science, ICOSI Lab, University Abbas Laghrour, Khenchela, Algeria
- Ouahiba Chouhal
- Department of Mathematics and Computer Science, ICOSI Lab, University Abbas Laghrour, Khenchela, Algeria
- Abdelali Bakhouche
- Department of Mathematics and Computer Science, ICOSI Lab, University Abbas Laghrour, Khenchela, Algeria
10
k-relevance vectors: Considering relevancy beside nearness. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
11
Kolose S, Stewart T, Hume P, Tomkinson GR. Prediction of military combat clothing size using decision trees and 3D body scan data. APPLIED ERGONOMICS 2021; 95:103435. [PMID: 33932688 DOI: 10.1016/j.apergo.2021.103435] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/13/2020] [Revised: 04/06/2021] [Accepted: 04/12/2021] [Indexed: 06/12/2023]
Abstract
AIM: To determine how well decision tree models can predict tailor-assigned uniform sizes using anthropometry data from the New Zealand Defence Force Anthropometry Survey (NZDFAS). This information may inform automatic sizing systems for military personnel. METHODS: Anthropometric data from two separate samples of the New Zealand Defence Force military were used. Data on Army personnel from the NZDFAS (n = 583) were used to develop a series of shirt- and trouser-size prediction models based on decision trees. Different combinations of physical, automatic, and post-processed measurements (the latter two derived from a 3D body scan) were trialled, and the models with the highest cross-validation accuracy were retained. The accuracy of these models was then tested on an independent sample of Army recruits (n = 154). RESULTS: The automated measurements (those derived automatically by the body scanner software) were the best predictors of shirt size (58.1% accuracy) and trouser size (61.7%), with body weight and waist girth being the strongest predictors. Clothing sizes that were incorrectly predicted by the model were generally one size above or below the tailor-predicted size. CONCLUSIONS: Anthropometry measurements, when used with decision tree models, show promise for classifying clothing size. Methodological changes such as fitting gender-specific models, using additional anthropometry variables, and testing other data mining techniques are avenues for future work. More research is required before fully automated body scanning is a viable option for obtaining fast and accurate clothing sizes for military clothing and logistics departments.
Affiliation(s)
- Stephven Kolose
- Sport Performance Research Institute New Zealand, Auckland University of Technology, Auckland, New Zealand.
- Tom Stewart
- Sport Performance Research Institute New Zealand, Auckland University of Technology, Auckland, New Zealand; Human Potential Centre, School of Sport and Recreation, Auckland University of Technology, Auckland, New Zealand.
- Patria Hume
- Sport Performance Research Institute New Zealand, Auckland University of Technology, Auckland, New Zealand.
- Grant R Tomkinson
- Department of Education, Health and Behavior Studies, University of North Dakota, Grand Forks, ND, USA; Alliance for Research in Exercise, Nutrition and Activity (ARENA), School of Health Sciences, University of South Australia, Adelaide, SA, Australia.
12
A Noninvasive Prediction Model for Hepatitis B Virus Disease in Patients with HIV: Based on the Population of Jiangsu, China. BIOMED RESEARCH INTERNATIONAL 2021; 2021:6696041. [PMID: 33860053 PMCID: PMC8024075 DOI: 10.1155/2021/6696041] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/27/2020] [Accepted: 03/17/2021] [Indexed: 02/07/2023]
Abstract
Objective To establish a machine learning model for identifying patients coinfected with hepatitis B virus (HBV) and human immunodeficiency virus (HIV) through two sexual transmission routes in Jiangsu, China. Methods A total of 14197 HIV cases transmitted by homosexual and heterosexual routes were recruited. After data processing, 12469 cases (HIV and HBV, 1033; HIV, 11436) were left for further analysis, including 7849 cases with homosexual transmission and 4620 cases with heterosexual transmission. Univariate logistic regression was used to select variables with significant P values and odds ratios for multivariable analysis. In the homosexual transmission and heterosexual transmission groups, 10 and 6 variables were selected, respectively. For identifying HIV individuals coinfected with HBV, a machine learning model was constructed with four algorithms: Decision Tree, Random Forest, AdaBoost with decision tree (AdaBoost), and extreme gradient boosting decision tree (XGBoost). The detective value of each variable was calculated using the optimal machine learning algorithm. Results The AdaBoost algorithm showed the highest efficiency in both transmission groups (homosexual transmission group: accuracy = 0.928, precision = 0.915, recall = 0.944, F1 = 0.930, and AUC = 0.96; heterosexual transmission group: accuracy = 0.892, precision = 0.881, recall = 0.905, F1 = 0.893, and AUC = 0.98). Calculated by the AdaBoost algorithm, the detective value of PLA was the highest in the homosexual transmission group, followed by CR, AST, HB, ALT, TBIL, leucocyte, age, marital status, and treatment condition; in the heterosexual transmission group, the detective value of PLA was also the highest (consistent with the homosexual group), followed by ALT, AST, TBIL, leucocyte, and symptom severity. Conclusions Univariate logistic regression combined with the AdaBoost algorithm could accurately screen the risk factors of HBV in HIV coinfection without invasive testing. Further studies are needed to evaluate the utility and feasibility of this model in various settings.
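The F-measure values reported above can be related to the stated precision and recall, since the F1 score is their harmonic mean. The following is a minimal sketch that uses only the rounded figures quoted in the abstract; the function name `f1_score` and the variable names are illustrative and not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall pairs quoted in the abstract (AdaBoost results).
f1_homosexual = f1_score(0.915, 0.944)
f1_heterosexual = f1_score(0.881, 0.905)

print(f"{f1_homosexual:.3f} {f1_heterosexual:.3f}")
```

Recomputing from the rounded inputs gives values within rounding distance of the reported 0.930 and 0.893; a small gap in the third decimal is expected because the published precision and recall are themselves rounded.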
|
13
|
de Barros ADMC, Silva AFR, Zibordi M, Spagnolo JD, Corrêa RR, Belli CB, de Camargo MM. Equine simplified acute physiology score: Personalised medicine for the equine emergency patient. Vet Rec 2021; 189:e136. [PMID: 33729604 DOI: 10.1002/vetr.136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2020] [Revised: 12/11/2020] [Accepted: 01/26/2021] [Indexed: 11/12/2022]
Abstract
BACKGROUND Scoring models are useful tools that guide the attending clinician in gauging the severity of disease evolution and in evaluating the efficacy of treatment. There are few tools available with this purpose for the non-human patient, including horses. We aimed (i) to adapt the simplified acute physiology score 3 (SAPS-3) model for the equine species, reaching a margin of accuracy greater than 75% in the calculation of the probability of survival/death and (ii) to build a decision tree that helps the attending veterinarian in assessment of the clinical evolution of the equine patient. METHODS From an initial pool of 5568 medical records from University-based Veterinary Hospitals, a final cohort of 1000 was further mined manually for data extraction. A set of 19 variables were evaluated and tested by five machine learning data mining algorithms. RESULTS The final scoring model, named EqSAPS for equine simplified acute physiology score, reached 91.83% of correct estimates (post hoc) for probability of death within 24 hours upon hospitalization. The area under receiver operating characteristic curve for outcome 'death' was 0.742, while for 'survival' was 0.652. The final decision tree was able to refine prognosis of patients whose EqSAPS score suggested 'death'. CONCLUSION EqSAPS is a useful tool to gauge the severity of the clinical presentation of the equine patient.
Affiliation(s)
| | - Ana Flávia Rocha Silva
- School of Zootechnics and Food Engineering, University of São Paulo, Pirassununga, Brazil
| | - Miriam Zibordi
- School of Veterinary Medicine, University of São Paulo, São Paulo, Brazil
| | - Julio David Spagnolo
- Veterinary Hospital, Large Animals Surgery Section, School of Veterinary Medicine, University of São Paulo, São Paulo, Brazil
| | - Rodrigo Romero Corrêa
- Department of Surgery, School of Veterinary Medicine, University of São Paulo, São Paulo, Brazil
| | - Carla B Belli
- Department of Clinics, School of Veterinary Medicine, University of São Paulo, São Paulo, Brazil
|
14
|
Prediction of Important Factors for Bleeding in Liver Cirrhosis Disease Using Ensemble Data Mining Approach. MATHEMATICS 2020. [DOI: 10.3390/math8111887] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
The main motivation for this study was that improved solutions for predicting the risk of bleeding, and thus faster and more accurate diagnosis of complications in cirrhotic patients, led to a fall in the mortality of cirrhosis patients from variceal bleeding at the turn of the 21st century. For this reason, additional research in this field is needed. The objective of this paper is to develop a prediction model that determines the most important factors for bleeding in liver cirrhosis, which is useful for the diagnosis and future treatment of patients. To achieve this goal, the authors propose an ensemble data mining methodology, as the most modern approach in the field of prediction, that integrates in a new way the two most commonly used prediction techniques: classification preceded by attribute number reduction, and multiple logistic regression for calibration. The method was evaluated in a study that analyzed the occurrence of variceal bleeding in 96 patients from the Clinical Center of Nis, Serbia, using 29 data items ranging from clinical findings to color Doppler. The results showed that the proposed method, with such a large number and variety of data types, demonstrates better characteristics than the individual techniques integrated into it.
|
15
|
AlKaabi LA, Ahmed LS, Al Attiyah MF, Abdel-Rahman ME. Predicting hypertension using machine learning: Findings from Qatar Biobank Study. PLoS One 2020; 15:e0240370. [PMID: 33064740 PMCID: PMC7567367 DOI: 10.1371/journal.pone.0240370] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2020] [Accepted: 09/08/2020] [Indexed: 12/14/2022] Open
Abstract
Background and objective Hypertension, a global burden, is associated with several risk factors and can be treated by lifestyle modifications and medications. Prediction and early diagnosis are important to prevent related health complications. The objective is to construct and compare predictive models to identify individuals at high risk of developing hypertension without the need for invasive clinical procedures. Methods This is a cross-sectional study using 987 records of Qataris and long-term residents aged 18+ years from Qatar Biobank. Percentages were used to summarize data and chi-square tests to assess associations. Predictive models of hypertension were constructed and compared using three supervised machine learning algorithms: decision tree, random forest, and logistic regression, using 5-fold cross-validation. The performance of the algorithms was assessed using accuracy, positive predictive value (PPV), sensitivity, F-measure, and area under the receiver operating characteristic curve (AUC). Stata and Weka were used for analysis. Results Age, gender, education level, employment, tobacco use, physical activity, adequate consumption of fruits and vegetables, abdominal obesity, history of diabetes, history of high cholesterol, and mother’s history of high blood pressure were important predictors of hypertension. All algorithms showed more or less similar performances: random forest (accuracy = 82.1%, PPV = 81.4%, sensitivity = 82.1%), logistic regression (accuracy = 81.1%, PPV = 80.1%, sensitivity = 81.1%) and decision tree (accuracy = 82.1%, PPV = 81.2%, sensitivity = 82.1%). In terms of AUC, compared to logistic regression, while random forest performed similarly, decision tree had a significantly lower discrimination ability (p-value<0.05) with AUCs equal to 85.0, 86.9, and 79.9, respectively. Conclusions Machine learning provides the chance of having a rapid predictive model using non-invasive predictors to screen for hypertension. Future research should consider improving the predictive accuracy of models in larger general populations, including more important predictors and using a variety of algorithms.
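The 5-fold cross-validation named in the Methods partitions the records into five folds, each serving once as the held-out test set. The sketch below only illustrates that partitioning for the 987 records mentioned in the abstract. It is an assumption-laden illustration and not the study's procedure: the study used Weka, the abstract does not say whether folds were shuffled or stratified, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 987 records and 5 folds, as stated in the abstract.
folds = list(k_fold_indices(987, 5))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Each record appears in exactly one test fold, so every model is evaluated on data it was not trained on; in practice the indices would normally be shuffled first.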
Affiliation(s)
- Latifa A. AlKaabi
- Department of Public Health, College of Health Science, QU Health, Qatar University, Doha, Qatar
| | - Lina S. Ahmed
- Department of Public Health, College of Health Science, QU Health, Qatar University, Doha, Qatar
| | - Maryam F. Al Attiyah
- Department of Public Health, College of Health Science, QU Health, Qatar University, Doha, Qatar
| | - Manar E. Abdel-Rahman
- Department of Public Health, College of Health Science, QU Health, Qatar University, Doha, Qatar
|
16
|
Design of an integrated model for diagnosis and classification of pediatric acute leukemia using machine learning. Proc Inst Mech Eng H 2020; 234:1051-1069. [DOI: 10.1177/0954411920938567] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Applying artificial intelligence techniques for diagnosing diseases in hospitals often provides advanced medical services to patients, such as the diagnosis of leukemia. On the other hand, surgery and bone marrow sampling, especially in the diagnosis of childhood leukemia, are even more complex and difficult, resulting in increased human error and procedure time, decreased patient satisfaction, and increased costs. This study investigates the use of neuro-fuzzy and group method of data handling for the diagnosis of acute leukemia in children based on the complete blood count test. Furthermore, a principal component analysis is applied to increase the accuracy of the diagnosis. The results show that distinguishing between patient and non-patient individuals can easily be done with an adaptive neuro-fuzzy inference system, whereas for classifying between the types of disease themselves, more pre-processing operations such as reduction of features may be needed. The proposed approach may help to distinguish between two types of leukemia: acute lymphoblastic leukemia and acute myeloid leukemia. Based on the sensitivity of the diagnosis, experts can use the proposed algorithm to help identify the disease earlier and lessen the cost.
|
17
|
Benmouna Y, Mezmaz MS, Mahmoudi S, Chikh MA. Parallel cycle-based branch-and-bound method for Bayesian network learning. Pattern Anal Appl 2020. [DOI: 10.1007/s10044-019-00815-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
18
|
Geldof T, Van Damme N, Huys I, Van Dyck W. Patient-Level Effectiveness Prediction Modeling for Glioblastoma Using Classification Trees. Front Pharmacol 2020; 10:1665. [PMID: 32116674 PMCID: PMC7025482 DOI: 10.3389/fphar.2019.01665] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2019] [Accepted: 12/19/2019] [Indexed: 12/18/2022] Open
Abstract
Objectives Little research has been done in pharmacoepidemiology on the use of machine learning for exploring medicinal treatment effectiveness in oncology. Therefore, the aim of this study was to explore the added value of machine learning methods to investigate individual treatment responses for glioblastoma patients treated with temozolomide. Methods Based on a retrospective observational registry covering 3090 patients with glioblastoma treated with temozolomide, we proposed the use of a two-step iterative exploratory learning process consisting of an initialization phase and a machine learning phase. For initialization, we defined a binary response variable as the target label using one-by-one nearest neighbor propensity score matching. Secondly, a classification tree algorithm was trained and validated for dividing individual patients into treatment response and non-response groups. Theorizing about treatment response was then done by evaluating the tree performance. Results The classification tree model has an area under the curve (AUC) classification performance of 67% corresponding to a sensitivity of 0.69 and a specificity of 0.51. This result in predicting patient-level response was slightly better than the logistic regression model featuring an AUC of 64% (0.63 sensitivity and 0.54 specificity). The tree confirms confounding by age and discovers further age-related stratification with chemotherapy-treatment dependency, both not revealed in preceding clinical studies. The model lacked genetic information confounding treatment response. Conclusions A classification tree was found to be suitable for understanding patient-level effectiveness for this glioblastoma–temozolomide case because of its high interpretability and capability to deal with covariate interdependencies, essential in a real-world environment. 
Possible improvements in the model’s classification can be achieved by including genetic information and collecting primary data on treatment response. The model can be valuable in clinical practice for predicting personal treatment pathways.
Affiliation(s)
- Tine Geldof
- Healthcare Management Centre, Vlerick Business School, Ghent, Belgium.,Department of Pharmaceutical and Pharmacological Sciences, Research Centre for Pharmaceutical Care and Pharmaco-economics, KU Leuven, Leuven, Belgium
| | | | - Isabelle Huys
- Department of Pharmaceutical and Pharmacological Sciences, Research Centre for Pharmaceutical Care and Pharmaco-economics, KU Leuven, Leuven, Belgium
| | - Walter Van Dyck
- Healthcare Management Centre, Vlerick Business School, Ghent, Belgium.,Department of Pharmaceutical and Pharmacological Sciences, Research Centre for Pharmaceutical Care and Pharmaco-economics, KU Leuven, Leuven, Belgium
|
19
|
|
20
|
AlMuhaideb S, Alswailem O, Alsubaie N, Ferwana I, Alnajem A. Prediction of hospital no-show appointments through artificial intelligence algorithms. Ann Saudi Med 2019; 39:373-381. [PMID: 31804138 PMCID: PMC6894458 DOI: 10.5144/0256-4947.2019.373] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND No-shows, a major issue for healthcare centers, can be quite costly and disruptive. Capacity is wasted and expensive resources are underutilized. Numerous studies have shown that reducing uncancelled missed appointments can have a tremendous impact, improving efficiency, reducing costs and improving patient outcomes. Strategies involving machine learning and artificial intelligence could provide a solution. OBJECTIVE Use artificial intelligence to build a model that predicts no-shows for individual appointments. DESIGN Predictive modeling. SETTING Major tertiary care center. PATIENTS AND METHODS All historic outpatient clinic scheduling data in the electronic medical record for a one-year period between 01 January 2014 and 31 December 2014 were used to independently build predictive models with JRip and Hoeffding tree algorithms. MAIN OUTCOME MEASURES No-show appointments. SAMPLE SIZE 1 087 979 outpatient clinic appointments. RESULTS The no-show rate was 11.3% (123 299). The most important features for predicting no-shows, ranked by information gain in descending order, were history of no-shows (0.3596), appointment location (0.0323), and specialty (0.025). The following had very low information-gain ranking: age, day of the week, slot description, time of appointment, gender and nationality. Both JRip and Hoeffding algorithms yielded reasonable degrees of accuracy, 76.44% and 77.13%, respectively, with area under the curve indices of 0.776 for JRip (acceptable discrimination power) and 0.861 for Hoeffding trees (excellent discrimination). CONCLUSION Appointments having high risk of no-shows can be predicted in real-time to set appropriate proactive interventions that reduce the negative impact of no-shows. LIMITATIONS Single center. Only one year of data. CONFLICT OF INTEREST None.
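The feature ranking above is by information gain, that is, the reduction in the entropy of the outcome once a feature's value is known. The sketch below shows the calculation on a tiny invented dataset; the data, the feature names `prior_noshow` and `weekday_flag`, and the helper functions are assumptions for illustration and do not reproduce the study's records or its tooling.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy data, invented for illustration only (not the study's records).
outcome      = ["show"] * 6 + ["no_show"] * 2
prior_noshow = [0, 0, 0, 0, 0, 0, 1, 1]   # matches the outcome exactly
weekday_flag = [0, 1, 0, 1, 0, 1, 0, 1]   # same outcome mix in both groups

gain_history = information_gain(prior_noshow, outcome)
gain_weekday = information_gain(weekday_flag, outcome)
print(gain_history > gain_weekday)
```

In this toy case the history feature recovers the full entropy of the outcome while the weekday flag contributes essentially nothing, which mirrors the qualitative pattern the abstract reports (history of no-shows ranked far above day of the week).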
Affiliation(s)
- Sarab AlMuhaideb
- From the Department of Computer Science, Prince Sultan University, Riyadh, Saudi Arabia
| | - Osama Alswailem
- From the Health Informatics and Telecommunication Affairs, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia
| | - Nayef Alsubaie
- From the Health Informatics and Telecommunication Affairs, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia
| | - Ibtihal Ferwana
- From the Department of Computer Science, Prince Sultan University, Riyadh, Saudi Arabia
| | - Afnan Alnajem
- From the Department of Biostatistics, Epidemiology & Scientific Computing, Princess Nora bint Abdulrahman University, Riyadh, Saudi Arabia
|
21
|
Di Noia A, Martino A, Montanari P, Rizzi A. Supervised machine learning techniques and genetic optimization for occupational diseases risk prediction. Soft comput 2019. [DOI: 10.1007/s00500-019-04200-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
22
|
Itani S, Rossignol M, Lecron F, Fortemps P. Towards interpretable machine learning models for diagnosis aid: A case study on attention deficit/hyperactivity disorder. PLoS One 2019; 14:e0215720. [PMID: 31022245 PMCID: PMC6483231 DOI: 10.1371/journal.pone.0215720] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2018] [Accepted: 04/09/2019] [Indexed: 12/31/2022] Open
Abstract
Attention Deficit/Hyperactivity Disorder (ADHD) is a neurodevelopmental disorder that has heavy consequences on a child's wellbeing, especially in the academic, psychological and relational planes. The current evaluation of the disorder is supported by clinical assessment and written tests. A definitive diagnosis is usually made based on the DSM-V criteria. There is a lot of ongoing research on ADHD, in order to determine the neurophysiological basis of the disorder and to reach a more objective diagnosis. The advent of Machine Learning (ML) opens up promising prospects for the development of systems able to predict a diagnosis from phenotypic and neuroimaging data. This was the reason why the ADHD-200 contest was launched a few years ago. Based on the publicly available ADHD-200 collection, participants were challenged to predict ADHD with the best possible predictive accuracy. In the present work, we propose instead a ML methodology which primarily places importance on the explanatory power of a model. Such an approach is intended to achieve a fair trade-off between the needs of performance and interpretability expected from medical diagnosis aid systems. We applied our methodology on a data sample extracted from the ADHD-200 collection, through the development of decision trees which are valued for their readability. Our analysis indicates the relevance of the limbic system for the diagnosis of the disorder. Moreover, while providing explanations that make sense, the resulting decision tree performs favorably given the recent results reported in the literature.
Affiliation(s)
- Sarah Itani
- Fund for Scientific Research - FNRS (F.R.S.- FNRS), Brussels, Belgium
- Department of Mathematics and Operations Research, Faculty of Engineering, University of Mons, Mons, Belgium
| | - Mandy Rossignol
- Department of Cognitive Psychology and Neuropsychology, Faculty of Psychology and Education, University of Mons, Mons, Belgium
| | - Fabian Lecron
- Department of Engineering Innovation Management, Faculty of Engineering, University of Mons, Mons, Belgium
| | - Philippe Fortemps
- Department of Engineering Innovation Management, Faculty of Engineering, University of Mons, Mons, Belgium
|
23
|
Pei D, Gong Y, Kang H, Zhang C, Guo Q. Accurate and rapid screening model for potential diabetes mellitus. BMC Med Inform Decis Mak 2019; 19:41. [PMID: 30866905 PMCID: PMC6416888 DOI: 10.1186/s12911-019-0790-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2018] [Accepted: 03/03/2019] [Indexed: 11/26/2022] Open
Abstract
Background Prediction or early diagnosis of diabetes is crucial for populations with high risk of diabetes. Methods In this study, we assessed the ability of five popular classifiers (J48, AdaboostM1, SMO, Bayes Net, and Naïve Bayes) to identify individuals with diabetes based on nine non-invasive and easily obtained clinical features, including age, gender, body mass index (BMI), hypertension, history of cardiovascular disease or stroke, family history of diabetes, physical activity, work stress, and salty food preference. A total of 4205 data entries were obtained from annual physical examination reports for adults in the Shengjing Hospital of China Medical University during January–April 2017. Weka data mining software was used to identify the best algorithm for diabetes classification. Results The results indicate that decision tree classifier J48 has the best performance (accuracy = 0.9503, precision = 0.950, recall = 0.950, F-measure = 0.948, and AUC = 0.964). The decision tree structure shows that age is the most significant feature, followed by family history of diabetes, work stress, BMI, salty food preference, physical activity, hypertension, gender, and history of cardiovascular disease or stroke. Conclusions Our study shows that decision tree analyses can be applied to screen individuals for early diabetes risk without the need for invasive tests. This procedure will be particularly useful in developing regions with high epidemiological risk and poor socioeconomic status, and enable clinical practitioners to rapidly screen patients for increased risk of diabetes. The key features in the tree structure could further facilitate diabetes prevention through targeted community interventions, which can potentially improve early diabetes diagnosis and reduce burdens on the healthcare system.
Affiliation(s)
- Dongmei Pei
- Department of Family Medicine, Shengjing Hospital, China Medical University, Shenyang, Liaoning, China
| | - Yang Gong
- University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Hong Kang
- University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Chengpu Zhang
- Department of Family Medicine, Shengjing Hospital, China Medical University, Shenyang, Liaoning, China
| | - Qiyong Guo
- Department of radiology, Shengjing Hospital, China Medical University, Shenyang, Liaoning, China.
|
24
|
A rule-based semantic approach for data integration, standardization and dimensionality reduction utilizing the UMLS: Application to predicting bariatric surgery outcomes. Comput Biol Med 2019; 106:84-90. [DOI: 10.1016/j.compbiomed.2019.01.019] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2018] [Revised: 01/21/2019] [Accepted: 01/21/2019] [Indexed: 11/24/2022]
|
25
|
Parrales Bravo F, Del Barrio García AA, Gallego MM, Gago Veiga AB, Ruiz M, Guerrero Peral A, Ayala JL. Prediction of patient's response to OnabotulinumtoxinA treatment for migraine. Heliyon 2019; 5:e01043. [PMID: 30886915 PMCID: PMC6401533 DOI: 10.1016/j.heliyon.2018.e01043] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2018] [Revised: 05/15/2018] [Accepted: 12/10/2018] [Indexed: 01/03/2023] Open
Abstract
Migraine affects the daily life of millions of people around the world. The most well-known disabling symptom associated with this illness is the intense headache. Nowadays, there are treatments that can diminish the level of pain. OnabotulinumtoxinA (BoNT-A) has become a very popular medication for treating migraine headaches in those cases in which other medication is not working, typically in chronic migraines. Currently, the positive response to Botox treatment is not clearly understood, yet understanding the mechanisms that determine the effectiveness of the treatment could help with the development of more effective treatments. To solve this problem, this paper sets up a realistic scenario of electronic medical records of migraineurs under BoNT-A treatment where some clinical features from real patients are labeled by doctors. Medical registers have been preprocessed. A label encoding method based on simulated annealing has been proposed. Two methodologies for predicting the results of the first and the second infiltration of the BoNT-A based treatment are contemplated. Firstly, a strategy based on the medical HIT6 metric is described, which achieves an accuracy of over 91%. Secondly, when this value is not available, several classifiers and clustering methods have been applied in order to predict the reduction and adverse effects, obtaining an accuracy of 85%. Some clinical features, such as the greater occipital nerves (GON), chronic migraine time evolution and others, have been detected as relevant features when examining the prediction models. The GON and the retroocular component have also been described as important features according to doctors.
Affiliation(s)
- Franklin Parrales Bravo
- Department of Computer Architecture and Automation, Complutense University of Madrid, Madrid 28040, Spain.,Carrera de Ingeniería en Sistemas Computacionales, Facultad Ciencias Matemáticas y Física, Universidad de Guayaquil, Guayaquil, Ecuador
| | | | - María Mercedes Gallego
- Neurology Department, "La Princesa" University Hospital, Calle de Diego Leon, 62, 28006 Madrid, Spain
| | - Ana Beatriz Gago Veiga
- Neurology Department, "La Princesa" University Hospital, Calle de Diego Leon, 62, 28006 Madrid, Spain
| | - Marina Ruiz
- Headache Unit, Department of Neurology, Hospital Clínico Universitario de Valladolid, Valladolid, Spain
| | - Angel Guerrero Peral
- Headache Unit, Department of Neurology, Hospital Clínico Universitario de Valladolid, Valladolid, Spain
| | - José L Ayala
- Department of Computer Architecture and Automation, Complutense University of Madrid, Madrid 28040, Spain.,CCS-Center for Computational Simulation, Campus de Montegancedo UPM, Boadilla del Monte 28660, Spain
|
26
|
Mining Compact Predictive Pattern Sets Using Classification Model. Artif Intell Med 2019; 11526:386-396. [DOI: 10.1007/978-3-030-21642-9_49] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
27
|
Vijayakumar R, Cheung MWL. Replicability of Machine Learning Models in the Social Sciences. ZEITSCHRIFT FUR PSYCHOLOGIE-JOURNAL OF PSYCHOLOGY 2018. [DOI: 10.1027/2151-2604/a000344] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
Abstract
Abstract. Machine learning tools are increasingly used in social sciences and policy fields due to their increase in predictive accuracy. However, little research has been done on how well the models of machine learning methods replicate across samples. We compare machine learning methods with regression on the replicability of variable selection, along with predictive accuracy, using an empirical dataset as well as simulated data with additive, interaction, and non-linear squared terms added as predictors. Methods analyzed include support vector machines (SVM), random forests (RF), multivariate adaptive regression splines (MARS), and the regularized regression variants, least absolute shrinkage and selection operator (LASSO), and elastic net. In simulations with additive and linear interactions, machine learning methods performed similarly to regression in replicating predictors; they also performed mostly equal or below regression on measures of predictive accuracy. In simulations with square terms, machine learning methods SVM, RF, and MARS improved predictive accuracy and replicated predictors better than regression. Thus, in simulated datasets, the gap between machine learning methods and regression on predictive measures foreshadowed the gap in variable selection. In replications on the empirical dataset, however, improved prediction by machine learning methods was not accompanied by a visible improvement in replicability in variable selection. This disparity is explained by the overall explanatory power of the models. When predictors have small effects and noise predominates, improved global measures of prediction in a sample by machine learning methods may not lead to the robust selection of predictors; thus, in the presence of weak predictors and noise, regression remains a useful tool for model building and replication.
Affiliation(s)
| | - Mike W.-L. Cheung
- Department of Psychology, National University of Singapore, Singapore
|
28
|
Evaluating of associated risk factors of metabolic syndrome by using decision tree. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/s00580-017-2580-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]
|
29
|
K G, C R. Heuristic Classifier for Observe Accuracy of Cancer Polyp Using Video Capsule Endoscopy. Asian Pac J Cancer Prev 2017; 18:1681-1688. [PMID: 28670889 PMCID: PMC6373793 DOI: 10.22034/apjcp.2017.18.6.1681] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
Methods: Colonoscopy is a technique for examining colon cancer and polyps. In endoscopy, the video capsule is a universally used mechanism for finding gastrointestinal stages. Both mechanisms are used to find colon cancer or colorectal polyps. The Automatic Polyp Detection sub-challenge was conducted as part of the Endoscopic Vision Challenge (http://endovis.grand-challenge.org). Method: Colonoscopy may be the primary way of improving the ability to detect colon cancer, especially flat lesions, which may otherwise be difficult to detect. Recently, automatic polyp detection algorithms have been proposed with various degrees of success. Though polyp detection in images from colonoscopy and other traditional endoscopy procedures is becoming a mature field, detecting polyps automatically in colonoscopy is a hard problem because of its unique imaging characteristics. The proposed video capsule camera therefore supports accurate diagnosis of polyps and easy identification of their pattern. Existing methodology has mainly concentrated on high accuracy and low time consumption, and uses many different types of data mining techniques. To analyse these high-resolution video-scale images, we segment the image into a pixel-level binary pattern with the help of a mid-pass filter and the relative gray level of neighbours. This work consists of three major steps to improve the accuracy of video capsule endoscopy: missing data imputation, high-dimensionality reduction or feature selection, and classification. These steps are performed using an endoscopy polyp disease dataset with 500 patients. Our binary classification algorithm relieves human analysis of the video frames. SVM has made a major contribution to processing the dataset. Results: The key aspect of the proposed results is segmentation and a binary pattern approach with a Genetic Fuzzy based Improved Kernel Support Vector Machine (GF-IKSVM) classifier. The segmented images are mostly round in shape. The result is refined via smooth filtering, computer vision methods and thresholding steps. Conclusion: Our experimental result shows 94.4% accuracy for the proposed fuzzy system and genetic fuzzy approach, which is higher than the methods used in the literature. The GF-IKSVM classifier is well-organized and provides good accuracy results for patched VCE polyp disease diagnosis.
Affiliation(s)
- Geetha K
- Department of Information Technology, Excel Engineering College, India.
|
30
|
Olivera AR, Roesler V, Iochpe C, Schmidt MI, Vigo Á, Barreto SM, Duncan BB. Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study. SAO PAULO MED J 2017; 135:234-246. [PMID: 28746659 PMCID: PMC10019841 DOI: 10.1590/1516-3180.2016.0309010217] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/19/2017] [Accepted: 02/01/2017] [Indexed: 01/23/2023] Open
Abstract
CONTEXT AND OBJECTIVE: Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task. DESIGN AND SETTING: Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil. METHODS: After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest. RESULTS: The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step. CONCLUSION: Most of the predictive models produced similar results, and demonstrated the feasibility of identifying individuals with highest probability of having undiagnosed diabetes, through easily-obtained clinical data.
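The "tenfold cross-validation, repeated three times" scheme in the abstract above can be illustrated with a small index-splitting sketch. This is not the authors' code: the function name `repeated_kfold`, the seed, and the toy sample count are assumptions made purely for illustration, and no model is trained here.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: each repeat reshuffles the sample indices and
    splits them into k disjoint folds; every fold is the test set once.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Striding by k gives fold sizes that differ by at most one.
        folds = [indices[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train_idx, test_idx


# Toy run: 25 samples, tenfold CV repeated three times gives 30 splits.
splits = list(repeated_kfold(n_samples=25, k=10, repeats=3))
```

Each split would then be used to fit a model on `train_idx` and score it on `test_idx`; averaging the scores over all splits gives the kind of mean estimate the abstract reports.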
Affiliation(s)
- André Rodrigues Olivera
- MSc. IT Analyst, Postgraduate Computing Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
- Valter Roesler
- PhD. Professor, Postgraduate Computing Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
- Cirano Iochpe
- PhD. Professor, Postgraduate Computing Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
- Maria Inês Schmidt
- PhD. Professor, Postgraduate Epidemiology Program and Hospital de Clínicas, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
- Álvaro Vigo
- PhD. Professor, Postgraduate Epidemiology Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
- Sandhi Maria Barreto
- PhD. Professor, Department of Social and Preventive Medicine & Postgraduate Program in Public Health, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte (MG), Brazil.
- Bruce Bartholow Duncan
- PhD. Professor, Postgraduate Epidemiology Program and Hospital de Clínicas, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.
|
31
|
Arabasadi Z, Alizadehsani R, Roshanzamir M, Moosaei H, Yarifard AA. Computer aided decision making for heart disease detection using hybrid neural network-Genetic algorithm. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2017; 141:19-26. [PMID: 28241964 DOI: 10.1016/j.cmpb.2017.01.004] [Citation(s) in RCA: 139] [Impact Index Per Article: 19.9] [Received: 09/11/2016] [Revised: 12/18/2016] [Accepted: 01/12/2017] [Indexed: 05/28/2023]
Abstract
Cardiovascular disease is one of the most rampant causes of death around the world and was deemed as a major illness in Middle and Old ages. Coronary artery disease, in particular, is a widespread cardiovascular malady entailing high mortality rates. Angiography is, more often than not, regarded as the best method for the diagnosis of coronary artery disease; on the other hand, it is associated with high costs and major side effects. Much research has, therefore, been conducted using machine learning and data mining so as to seek alternative modalities. Accordingly, we herein propose a highly accurate hybrid method for the diagnosis of coronary artery disease. As a matter of fact, the proposed method is able to increase the performance of neural network by approximately 10% through enhancing its initial weights using genetic algorithm which suggests better weights for neural network. Making use of such methodology, we achieved accuracy, sensitivity and specificity rates of 93.85%, 97% and 92% respectively, on Z-Alizadeh Sani dataset.
Affiliation(s)
- Zeinab Arabasadi
- Department of Computer Engineering, University of Bojnord, Bojnord, Iran
- Roohallah Alizadehsani
- Department of Computer Engineering, Sharif University of Technology, Azadi Ave, Tehran, Iran.
- Mohamad Roshanzamir
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
- Hossein Moosaei
- Department of Mathematics, Faculty of Science, University of Bojnord, Iran
|
32
|
Alanazi HO, Abdullah AH, Qureshi KN. A Critical Review for Developing Accurate and Dynamic Predictive Models Using Machine Learning Methods in Medicine and Health Care. J Med Syst 2017; 41:69. [PMID: 28285459 DOI: 10.1007/s10916-017-0715-6] [Citation(s) in RCA: 63] [Impact Index Per Article: 9.0] [Received: 12/18/2016] [Accepted: 02/26/2017] [Indexed: 10/20/2022]
Abstract
Recently, Artificial Intelligence (AI) has been used widely in the medicine and health care sector. In machine learning, classification or prediction is a major field of AI. Today, the study of existing predictive models based on machine learning methods is extremely active. Doctors need accurate predictions for the outcomes of their patients' diseases. In addition, for accurate predictions, timing is another significant factor that influences treatment decisions. In this paper, existing predictive models in medicine and health care have been critically reviewed. Furthermore, the most famous machine learning methods have been explained, and the confusion between a statistical approach and machine learning has been clarified. A review of related literature reveals that the predictions of existing predictive models differ even when the same dataset is used. Therefore, existing predictive models are essential, and current methods must be improved.
Affiliation(s)
- Hamdan O Alanazi
- Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia.,Department of Medical Science Technology, Faculty of Applied Medical Science, Majmaah University, Al Majmaah, Kingdom of Saudi Arabia
|
33
|
Tayefi M, Esmaeili H, Saberi Karimian M, Amirabadi Zadeh A, Ebrahimi M, Safarian M, Nematy M, Parizadeh SMR, Ferns GA, Ghayour-Mobarhan M. The application of a decision tree to establish the parameters associated with hypertension. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2017; 139:83-91. [PMID: 28187897 DOI: 10.1016/j.cmpb.2016.10.020] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 04/17/2016] [Revised: 09/13/2016] [Accepted: 10/18/2016] [Indexed: 06/06/2023]
Abstract
INTRODUCTION Hypertension is an important risk factor for cardiovascular disease (CVD). The goal of this study was to establish the factors associated with hypertension by using a decision-tree algorithm as a supervised classification method of data mining. METHODS Data from a cross-sectional study were used in this study. A total of 9078 subjects who met the inclusion criteria were recruited. 70% of these subjects (6358 cases) were randomly allocated to the training dataset for constructing the decision tree. The remaining 30% (2720 cases) were used as the testing dataset to evaluate the performance of the decision tree. Two models were evaluated in this study. In model I, age, gender, body mass index, marital status, level of education, occupation status, depression and anxiety status, physical activity level, smoking status, LDL, TG, TC, FBG, uric acid and hs-CRP were considered as input variables and in model II, age, gender, WBC, RBC, HGB, HCT, MCV, MCH, PLT, RDW and PDW were considered as input variables. The validation of the model was assessed by constructing a receiver operating characteristic (ROC) curve. RESULTS The prevalence rate of hypertension was 32% in our population. For the decision-tree model I, the accuracy, sensitivity, specificity and area under the ROC curve (AUC) value for identifying the related risk factors of hypertension were 73%, 63%, 77% and 0.72, respectively. The corresponding values for model II were 70%, 61%, 74% and 0.68, respectively. CONCLUSION We have developed a decision tree model to identify the risk factors associated with hypertension that may be used to develop programs for hypertension management.
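The accuracy, sensitivity and specificity figures quoted in the abstract above are linked by a standard identity: accuracy equals prevalence times sensitivity plus (1 - prevalence) times specificity. The sketch below checks the rounded figures against that identity. It assumes the 32% prevalence reported for the whole population also holds in the testing dataset, which the abstract does not state; the function name is hypothetical.

```python
def accuracy_from_rates(prevalence, sensitivity, specificity):
    """Overall accuracy implied by prevalence, sensitivity and specificity.

    Positives are classified correctly at the sensitivity rate and
    negatives at the specificity rate, weighted by their share of cases.
    """
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Rounded values taken from the abstract; prevalence in the test set is assumed.
model_1 = accuracy_from_rates(prevalence=0.32, sensitivity=0.63, specificity=0.77)
model_2 = accuracy_from_rates(prevalence=0.32, sensitivity=0.61, specificity=0.74)
print(round(model_1, 3), round(model_2, 3))
```

Under that assumption the implied accuracies are roughly 0.73 and 0.70, in line with the reported 73% and 70%.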
Affiliation(s)
- Maryam Tayefi
- Biochemistry and Nutrition Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Habibollah Esmaeili
- Biochemistry and Nutrition Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Biostatistics, School of Health, Mashhad University of Medical Sciences, Mashhad, Iran
- Maryam Saberi Karimian
- Student Research Committee, Department of Modern Sciences and Technologies, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Alireza Amirabadi Zadeh
- Department of Biostatistics, School of Health, Mashhad University of Medical Sciences, Mashhad, Iran
- Mahmoud Ebrahimi
- Cardiovascular Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Mohammad Safarian
- Department of Nutrition Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Mohsen Nematy
- Department of Nutrition Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Seyed Mohammad Reza Parizadeh
- Biochemistry and Nutrition Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Gordon A Ferns
- Brighton & Sussex Medical School, Division of Medical Education, Falmer, Brighton, Sussex BN1 9PH, UK
- Majid Ghayour-Mobarhan
- Biochemistry and Nutrition Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Cardiovascular Research Center, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.
|
34
|
Bamidis PD, Psarouli E, Stilou S. Using modern IT tools to assess the awareness of MDs on radiation issues and plan a continuous education programme. Health Informatics J 2016. [DOI: 10.1177/146045820100700307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Radiation is significantly involved in a number of common medical tests and procedures. Knowledge of issues concerned with ionizing radiation is, therefore, mandatory in clinical practice. Is there a way to assess whether university-level educational programmes concerning medical physics suffice in covering the practical requirements or not? One of the key issues involved in the assessment of the above procedure, however, is how to make use of available technology in order to conduct the assessment as efficiently and as effectively as possible. In previous work, we devised a sample manual questionnaire aimed at retrieving information regarding the status of radiation awareness, but also investigated the feasibility of applying a robust methodology in the analysis of the results. In this paper, we extend the above-mentioned methodology to other areas of MD activity and utilize the question ‘how familiar are MDs with radiation?’ as an example, in order to demonstrate the importance of health information management in the modern era of e-health. Three stages or aspects of information management are considered: (i) efficient online information collection, (ii) effective and robust analysis based on Web databases, and finally (iii) extraction of rules and knowledge using data mining techniques and home-made software that facilitates the retrieval of associations. The latter stage is then further exploited in order to design an improved and realistic plan for the process of continuous education of MDs and health professionals in general.
Affiliation(s)
- P. D. Bamidis
- Department of Computer Science, CITY Liberal Studies, Affiliated Institution of the University of Sheffield and Laboratory of Medical Informatics, Medical School, University of Thessaloniki, Greece, Tsimiski 13, 54624 Thessaloniki, Greece,
- E. Psarouli
- Department of Nuclear Medicine, Ippocration General Hospital, Thessaloniki, Greece
- S. Stilou
- Laboratory of Medical Informatics, Medical School University of Thessaloniki, Thessaloniki, Greece
|
35
|
Umut İ. PSGMiner: A modular software for polysomnographic analysis. Comput Biol Med 2016; 73:1-9. [DOI: 10.1016/j.compbiomed.2016.03.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2016] [Revised: 03/26/2016] [Accepted: 03/28/2016] [Indexed: 10/22/2022]
|
36
|
Molina ME, Perez A, Valente JP. Classification of auditory brainstem responses through symbolic pattern discovery. Artif Intell Med 2016; 70:12-30. [PMID: 27431034 DOI: 10.1016/j.artmed.2016.05.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 10/21/2015] [Accepted: 05/09/2016] [Indexed: 01/20/2023]
Abstract
INTRODUCTION Numeric time series are present in a very wide range of domains, including many branches of medicine. Data mining techniques have proved to be useful for knowledge discovery in this type of data and for supporting decision-making processes. OBJECTIVES The overall objective is to classify time series based on the discovery of frequent patterns. These patterns will be discovered in symbolic sequences obtained from the time series data by means of a temporal abstraction process. METHODS Firstly, we transform numeric time series into symbolic time sequences, where the symbols aim to represent the relevant domain concepts. These symbols can be defined using either public or expert domain knowledge. Then we apply a symbolic pattern discovery technique to the output symbolic sequences. This technique identifies the subsequences frequently found in a population group. These subsequences (patterns) are representative of population groups. Finally, we employ a classification technique based on the identified patterns in order to classify new individuals. Thanks to the inclusion of domain knowledge, the classification results can be explained using domain terminology. This makes the results easier to interpret for the domain specialist (physician). RESULTS This method has been applied to brainstem auditory evoked potentials (BAEPs) time series. Preliminary experiments were carried out to analyse several aspects of the method including the best configuration of the pattern discovery technique parameters. We then applied the method to the BAEPs of 83 individuals belonging to four classes (healthy, conductive hearing loss, vestibular schwannoma-brainstem involvement and vestibular schwannoma-8th-nerve involvement). According to the results of the cross-validation, overall accuracy was 99.4%, sensitivity (recall) was 97.6% and specificity was 100% (no false positives). CONCLUSION The proposed method effectively reduces dimensionality. 
Additionally, if the symbolic transformation includes the right domain knowledge, the method arguably outputs a data representation that denotes the relevant domain concepts more clearly. The method is capable of finding patterns in BAEPs time series and is very accurate at correctly predicting whether or not new patients have an auditory-related disorder.
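The pipeline described in the abstract above (numeric series, then symbolic sequences, then frequent subsequences per population group) can be sketched in a few lines. This is only an illustration of the general idea, not the authors' method: the three-symbol alphabet, the two thresholds, the restriction to contiguous subsequences, and both function names are assumptions.

```python
from collections import Counter


def symbolize(series, low, high):
    """Map each value to 'L', 'N' or 'H' using two hypothetical thresholds."""
    return "".join("L" if v < low else "H" if v > high else "N" for v in series)


def frequent_patterns(sequences, length, min_support):
    """Return contiguous subsequences of the given length that occur in
    at least min_support of the symbolic sequences, with their support."""
    support = Counter()
    for seq in sequences:
        # Count each pattern at most once per sequence.
        seen = {seq[i:i + length] for i in range(len(seq) - length + 1)}
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}


# Toy group of three numeric series (made-up values).
group = [[0.1, 0.5, 0.9, 0.5], [0.2, 0.6, 0.95, 0.1], [0.5, 0.5, 0.5, 0.5]]
symbols = [symbolize(s, low=0.3, high=0.8) for s in group]
patterns = frequent_patterns(symbols, length=3, min_support=2)
```

Here the low-normal-high pattern `"LNH"` appears in two of the three toy series, so it is the only pattern kept; a classifier could then use the presence of such group-specific patterns as features.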
Affiliation(s)
- Marco E Molina
- Department of Languages, Information Systems and Software Engineering, School of Computer Engineering, Technical University of Madrid, Campus de Montegancedo, s/n, Boadilla del Monte, Madrid 28660, Spain.
- Aurora Perez
- Department of Languages, Information Systems and Software Engineering, School of Computer Engineering, Technical University of Madrid, Campus de Montegancedo, s/n, Boadilla del Monte, Madrid 28660, Spain.
- Juan P Valente
- Department of Languages, Information Systems and Software Engineering, School of Computer Engineering, Technical University of Madrid, Campus de Montegancedo, s/n, Boadilla del Monte, Madrid 28660, Spain.
|
37
|
Umut İ, Çentik G. Detection of Periodic Leg Movements by Machine Learning Methods Using Polysomnographic Parameters Other Than Leg Electromyography. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2016; 2016:2041467. [PMID: 27213008 PMCID: PMC4860221 DOI: 10.1155/2016/2041467] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/29/2016] [Revised: 04/02/2016] [Accepted: 04/04/2016] [Indexed: 11/18/2022]
Abstract
The number of channels used for polysomnographic recording frequently causes difficulties for patients because of the many cables connected. Also, it increases the risk of having troubles during recording process and increases the storage volume. In this study, it is intended to detect periodic leg movement (PLM) in sleep with the use of the channels except leg electromyography (EMG) by analysing polysomnography (PSG) data with digital signal processing (DSP) and machine learning methods. PSG records of 153 patients of different ages and genders with PLM disorder diagnosis were examined retrospectively. A novel software was developed for the analysis of PSG records. The software utilizes the machine learning algorithms, statistical methods, and DSP methods. In order to classify PLM, popular machine learning methods (multilayer perceptron, K-nearest neighbour, and random forests) and logistic regression were used. Comparison of classified results showed that while K-nearest neighbour classification algorithm had higher average classification rate (91.87%) and lower average classification error value (RMSE = 0.2850), multilayer perceptron algorithm had the lowest average classification rate (83.29%) and the highest average classification error value (RMSE = 0.3705). Results showed that PLM can be classified with high accuracy (91.87%) without leg EMG record being present.
Affiliation(s)
- İlhan Umut
- Department of Computer Engineering, Faculty of Engineering, Trakya University, 22030 Edirne, Turkey
- Güven Çentik
- Department of Computer Engineering, Faculty of Engineering, Trakya University, 22030 Edirne, Turkey
|
38
|
Prediction of solubility of some statin drugs in supercritical carbon dioxide using classification and regression tree analysis and adaptive neuro-fuzzy inference systems. Russ Chem Bull 2016. [DOI: 10.1007/s11172-016-1424-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
39
|
Gutiérrez S, Tardaguila J, Fernández-Novales J, Diago MP. Data Mining and NIR Spectroscopy in Viticulture: Applications for Plant Phenotyping under Field Conditions. SENSORS 2016; 16:236. [PMID: 26891304 PMCID: PMC4801612 DOI: 10.3390/s16020236] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 12/02/2015] [Revised: 01/29/2016] [Accepted: 02/04/2016] [Indexed: 11/16/2022]
Abstract
Plant phenotyping is a very important topic in agriculture. In this context, data mining strategies may be applied to agricultural data retrieved with new non-invasive devices, with the aim of yielding useful, reliable and objective information. This work presents some applications of machine learning algorithms along with in-field acquired NIR spectral data for plant phenotyping in viticulture, specifically for grapevine variety discrimination and assessment of plant water status. Support vector machine (SVM), rotation forests and M5 trees models were built using NIR spectra acquired in the field directly on the adaxial side of grapevine leaves, with a non-invasive portable spectrophotometer working in the spectral range between 1600 and 2400 nm. The ν-SVM algorithm was used for the training of a model for varietal classification. The classifiers’ performance for the 10 varieties reached, for cross- and external validations, the 88.7% and 92.5% marks, respectively. For water stress assessment, the models developed using the absorbance spectra of six varieties yielded the same determination coefficient for both cross- and external validations (R2 = 0.84; RMSEs of 0.164 and 0.165 MPa, respectively). Furthermore, a variety-specific model trained only with samples of Tempranillo from two different vintages yielded R2 = 0.76 and RMSE of 0.16 MPa for cross-validation and R2 = 0.79, RMSE of 0.17 MPa for external validation. These results show the power of the combined use of data mining and non-invasive NIR sensing for in-field grapevine phenotyping and their usefulness for the wine industry and precision viticulture implementations.
Affiliation(s)
- Salvador Gutiérrez
- Instituto de Ciencias de la Vid y del Vino (University of La Rioja, CSIC, Gobierno de La Rioja) Ctra. De Burgos Km, 6, 26007 Logroño, Spain.
- Javier Tardaguila
- Instituto de Ciencias de la Vid y del Vino (University of La Rioja, CSIC, Gobierno de La Rioja) Ctra. De Burgos Km, 6, 26007 Logroño, Spain.
- Juan Fernández-Novales
- Instituto de Ciencias de la Vid y del Vino (University of La Rioja, CSIC, Gobierno de La Rioja) Ctra. De Burgos Km, 6, 26007 Logroño, Spain.
- Maria P Diago
- Instituto de Ciencias de la Vid y del Vino (University of La Rioja, CSIC, Gobierno de La Rioja) Ctra. De Burgos Km, 6, 26007 Logroño, Spain.
|
40
|
|
41
|
Early-Stage Event Prediction for Longitudinal Data. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING 2016. [DOI: 10.1007/978-3-319-31753-3_12] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/07/2023]
|
42
|
High-dimensional feature selection via feature grouping: A Variable Neighborhood Search approach. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.07.041] [Citation(s) in RCA: 76] [Impact Index Per Article: 9.5] [Indexed: 11/19/2022]
|
43
|
Bashir S, Qamar U, Khan FH. IntelliHealth: A medical decision support application using a novel weighted multi-layer classifier ensemble framework. J Biomed Inform 2015; 59:185-200. [PMID: 26703093 DOI: 10.1016/j.jbi.2015.12.001] [Citation(s) in RCA: 82] [Impact Index Per Article: 9.1] [Received: 08/05/2015] [Revised: 11/01/2015] [Accepted: 12/06/2015] [Indexed: 11/30/2022]
Abstract
Accuracy plays a vital role in the medical field as it concerns the life of an individual. Extensive research has been conducted on disease classification and prediction using machine learning techniques. However, there is no agreement on which classifier produces the best results. A specific classifier may be better than others for a specific dataset, but another classifier could perform better for some other dataset. An ensemble of classifiers has been proved to be an effective way to improve classification accuracy. In this research we present an ensemble framework with multi-layer classification using enhanced bagging and optimized weighting. The proposed model, called "HM-BagMoov", overcomes the limitations of conventional performance bottlenecks by utilizing an ensemble of seven heterogeneous classifiers. The framework is evaluated on five different heart disease datasets, four breast cancer datasets, two diabetes datasets, two liver disease datasets and one hepatitis dataset obtained from public repositories. The analysis of the results shows that the ensemble framework achieved the highest accuracy, sensitivity and F-Measure when compared with individual classifiers for all the diseases. In addition, the ensemble framework also achieved the highest accuracy when compared with state-of-the-art techniques. An application named "IntelliHealth" has also been developed based on the proposed model and may be used by hospitals and doctors for diagnostic advice.
Affiliation(s)
- Saba Bashir
- Computer Engineering Department, College of Electrical and Mechanical Engineering, National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan.
- Usman Qamar
- Computer Engineering Department, College of Electrical and Mechanical Engineering, National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan.
- Farhan Hassan Khan
- Computer Engineering Department, College of Electrical and Mechanical Engineering, National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan.
|
44
|
Shahraki AD, Safdari R, Gahfarokhi HH, Tahmasebian S. The Usage of Association Rule Mining to Identify Influencing Factors on Deafness After Birth. Acta Inform Med 2015; 23:356-9. [PMID: 26862245 PMCID: PMC4720831 DOI: 10.5455/aim.2015.23.356-359] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/01/2015] [Accepted: 11/17/2015] [Indexed: 11/03/2022] Open
Abstract
BACKGROUND Providing complete and high-quality health care services plays a very important role in enabling people to understand the factors related to personal and social health and to make decisions about choosing suitable healthy behaviors in order to achieve a healthy life. For this reason, demographic and clinical data of individuals are collected; this huge volume of data is a valuable resource for analyzing, exploring and discovering valuable information and communication. This study used association rule techniques in data mining to identify the factors affecting hearing loss after birth in Iran. MATERIALS AND METHODS The survey is a data-oriented study. The study population consisted of questionnaires from several provinces of the country. First, all questionnaire data were stored as an information table in SQL Server, with data entry performed using software written in C# .NET; then the Association algorithm in SQL Server Data Tools and Clementine software was applied to determine the rules and hidden patterns in the gathered data. FINDINGS Two factors, the number of deaf brothers and the degree of consanguinity of the parents, have a significant impact on the severity of deafness of individuals. Also, when the severity of hearing loss is greater than or equal to moderately severe hearing loss, people use hearing aids, and men are less interested in the use of hearing aids. CONCLUSION In families where the parents are in a consanguineous marriage between first-degree or 2nd-degree relatives (girl/boy cousins), and especially first-degree relatives, the number of people with severe hearing loss or deafness is higher; in the use of hearing aids, the gender of the patient is more important than the severity of the hearing loss.
Affiliation(s)
- Reza Safdari
- Isfahan University of Medical Sciences, Isfahan, Iran
|
45
|
Diciolla M, Binetti G, Di Noia T, Pesce F, Schena FP, Vågane AM, Bjørneklett R, Suzuki H, Tomino Y, Naso D. Patient classification and outcome prediction in IgA nephropathy. Comput Biol Med 2015; 66:278-86. [PMID: 26453758 DOI: 10.1016/j.compbiomed.2015.09.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 06/24/2015] [Revised: 08/08/2015] [Accepted: 09/02/2015] [Indexed: 10/23/2022]
Abstract
OBJECTIVE IgA Nephropathy (IgAN) is a common kidney disease which may entail renal failure, known as End Stage Kidney Disease (ESKD). One of the major difficulties dealing with this disease is to predict the time of the long-term prognosis for a patient at the time of diagnosis. In fact, the progression of IgAN to ESKD depends on an intricate interrelationship between clinical and laboratory findings. Therefore, the objective of this work has been the selection of the best data mining tool to build a model able to predict (I) if a patient with a biopsy proven IgAN will reach ESKD and (II) if a patient will reach the ESKD before or after 5 years. MATERIAL AND METHODS The largest available cohort study worldwide on IgAN has been used to design and compare several data-driven models. The complete dataset was composed of 1174 records collected from Italian, Norwegian, and Japanese IgAN patients, in the last 30 years. The data mining tools considered in this work were artificial neural networks (ANNs), neuro fuzzy systems (NFSs), support vector machines (SVMs), and decision trees (DTs). A 10-fold cross validation was used to evaluate unbiased performances for all the models. RESULTS An extensive model comparison based on accuracy, precision, recall, and f-measure was provided. Overall, the results indicate that ANNs can provide superior performance compared to the other models. The ANN for time-to-ESKD prediction is characterized by accuracy, precision, recall, and f-measure greater than 90%. The ANN for ESKD prediction has accuracy greater than 90% as well as precision, recall, and f-measure for the class of patients not reaching ESKD, while precision, recall, and f-measure for the class of patients reaching ESKD are slightly lower. The obtained model has been implemented in a Web-based decision support system (DSS). 
CONCLUSIONS The extraction of novel knowledge from clinical data and the definition of predictive models to support diagnosis, prognosis, and therapy is becoming an essential tool for researchers and clinical practitioners in medicine. The proposed comparative study of several data mining models for the outcome prediction in IgAN patients, using a large dataset of clinical records from three different countries, provides an insight into the relative prediction ability of the considered methods applied to such a disease.
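The abstract above reports precision, recall and f-measure separately for patients reaching and not reaching ESKD. A minimal sketch of how such per-class figures are computed is given below; the labels and the six toy predictions are invented for illustration and have no connection to the study's data.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall and F-measure for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up labels: two patients reach ESKD, four do not.
y_true = ["ESKD", "ESKD", "no", "no", "no", "no"]
y_pred = ["ESKD", "no", "no", "no", "no", "ESKD"]
eskd_scores = per_class_prf(y_true, y_pred, "ESKD")
no_eskd_scores = per_class_prf(y_true, y_pred, "no")
```

In this toy case the minority class scores lower on all three measures than the majority class even though each class has exactly one error, which mirrors the pattern the abstract describes for the ESKD class.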
Affiliation(s)
- M Diciolla
- Department of Electrical and Information Engineering, Polytechnic University of Bari, Bari, Italy
| | - G Binetti
- Department of Electrical and Information Engineering, Polytechnic University of Bari, Bari, Italy
| | - T Di Noia
- Department of Electrical and Information Engineering, Polytechnic University of Bari, Bari, Italy.
| | - F Pesce
- Cardiovascular Genetics and Genomics, National Heart & Lung Institute, Royal Brompton Hospital, Imperial College London, UK; Department of Emergency and Organ Transplantation, University of Bari, Bari, Italy
| | - F P Schena
- Department of Emergency and Organ Transplantation, University of Bari, Bari, Italy; C.A.R.S.O. Consortium, Valenzano-Casamassima, Italy
| | - A M Vågane
- Department of Clinical Medicine, Renal Research Group, University of Bergen, Bergen, Norway; Department of Medicine, Haukeland University Hospital, Bergen, Norway
| | - R Bjørneklett
- Department of Clinical Medicine, Renal Research Group, University of Bergen, Bergen, Norway; Department of Medicine, Haukeland University Hospital, Bergen, Norway
| | - H Suzuki
- Division of Nephrology, Department of Internal Medicine, Juntendo University, Faculty of Medicine, Tokyo, Japan
| | - Y Tomino
- Division of Nephrology, Department of Internal Medicine, Juntendo University, Faculty of Medicine, Tokyo, Japan
| | - D Naso
- Department of Electrical and Information Engineering, Polytechnic University of Bari, Bari, Italy
|
46
|
Huang H, Fava A, Guhr T, Cimbro R, Rosen A, Boin F, Ellis H. A methodology for exploring biomarker-phenotype associations: application to flow cytometry data and systemic sclerosis clinical manifestations. BMC Bioinformatics 2015; 16:293. [PMID: 26373409 PMCID: PMC4571079 DOI: 10.1186/s12859-015-0722-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 03/09/2015] [Accepted: 08/26/2015] [Indexed: 01/19/2023] Open
Abstract
Background This work seeks to develop a methodology for identifying reliable biomarkers of disease activity, progression and outcome through the identification of significant associations between high-throughput flow cytometry (FC) data and interstitial lung disease (ILD), a systemic sclerosis (SSc, or scleroderma) clinical phenotype that is the leading cause of morbidity and mortality in SSc. A specific aim of the work involves developing a clinically useful screening tool that could yield accurate assessments of disease state such as the risk or presence of SSc-ILD, the activity of lung involvement and the likelihood of responding to therapeutic intervention. Ultimately this instrument could facilitate a refined stratification of SSc patients into clinically relevant subsets at the time of diagnosis and subsequently during the course of the disease, and thus help in preventing bad outcomes from disease progression or unnecessary treatment side effects. The methods utilized in the work involve: (1) clinical and peripheral blood flow cytometry data (Immune Response In Scleroderma, IRIS) from consented patients followed at the Johns Hopkins Scleroderma Center; (2) machine learning (Conditional Random Forests - CRF) coupled with Gene Set Enrichment Analysis (GSEA) to identify subsets of FC variables that are highly effective in classifying ILD patients; and (3) stochastic simulation to design, train and validate ILD risk screening tools. Results Our hybrid analysis approach (CRF-GSEA) predicted SSc patient ILD status with a high degree of success (>82% correct classification in validation; 79 patients in the training data set, 40 patients in the validation data set). Conclusions IRIS flow cytometry data provide useful information in assessing the ILD status of SSc patients.
Our new approach combining Conditional Random Forests and Gene Set Enrichment Analysis was successful in identifying a subset of flow cytometry variables to create a screening tool that proved effective in correctly identifying ILD patients in the training and validation data sets. From a somewhat broader perspective, the identification of subsets of flow cytometry variables that exhibit coordinated movement (i.e., multi-variable up or down regulation) may lead to insights into possible effector pathways and thereby improve the state of knowledge of systemic sclerosis pathogenesis. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0722-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hongtai Huang
- Department of Geography and Environmental Engineering, GWC Whiting School of Engineering, The Johns Hopkins University, Baltimore, MD, USA.
| | - Andrea Fava
- Division of Rheumatology, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
| | - Tara Guhr
- Division of Rheumatology, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
| | - Raffaello Cimbro
- Division of Rheumatology, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
| | - Antony Rosen
- Division of Rheumatology, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
| | - Francesco Boin
- Division of Rheumatology, Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA. Present address: Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA, USA.
| | - Hugh Ellis
- Department of Geography and Environmental Engineering, GWC Whiting School of Engineering, The Johns Hopkins University, Baltimore, MD, USA.
|
47
|
Tomczak JM, Zięba M. Probabilistic combination of classification rules and its application to medical diagnosis. Mach Learn 2015. [DOI: 10.1007/s10994-015-5508-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 11/25/2022]
|
48
|
Al-Hyari AY, Al-Taee AM, Al-Taee MA. Diagnosis and Classification of Chronic Renal Failure Utilising Intelligent Data Mining Classifiers. International Journal of Information Technology and Web Engineering 2014. [DOI: 10.4018/ijitwe.2014100101] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Indexed: 11/09/2022]
Abstract
This paper presents a new clinical decision support system for diagnosing patients with Chronic Renal Failure (CRF) which is not yet thoroughly explored in literature. This paper aims at improving performance of a previously reported CRF diagnosis system which was based on Artificial Neural Network (ANN), Decision Tree (DT) and Naïve Bayes (NB) classifying algorithms. This is achieved by utilizing more efficient data mining classifiers, Support Vector Machine (SVM) and Logistic Regression (LR), in order to: (i) diagnose patients with CRF and (ii) determine the rate at which the disease is progressing. A clinical dataset of more than 100 instances is used in this study. Performance of the developed decision support system is assessed in terms of diagnostic accuracy, sensitivity, specificity and decisions made by consultant specialist physicians. The open source Waikato Environment for Knowledge Analysis library is used in this study to build and evaluate performance of the developed data mining classifiers. The obtained results showed SVM to be the most accurate (93.14%) when compared to LR as well as other classifiers reported in the previous study. A complete system prototype has been developed and tested successfully with the aid of NHS collaborators to support both diagnosis and long-term management of the disease.
|
49
|
Ramezankhani A, Pournik O, Shahrabi J, Khalili D, Azizi F, Hadaegh F. Applying decision tree for identification of a low risk population for type 2 diabetes. Tehran Lipid and Glucose Study. Diabetes Res Clin Pract 2014; 105:391-8. [PMID: 25085758 DOI: 10.1016/j.diabres.2014.07.003] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 06/30/2013] [Revised: 04/15/2014] [Accepted: 07/05/2014] [Indexed: 01/06/2023]
Abstract
AIMS The aim of this study was to create a prediction model using a data mining approach to identify individuals at low risk of incident type 2 diabetes, using the Tehran Lipid and Glucose Study (TLGS) database. METHODS For a population of 6647 individuals without diabetes, aged ≥20 years and followed for 12 years, a prediction model was developed using classification by the decision tree technique. Seven hundred and twenty-nine (11%) diabetes cases occurred during the follow-up. Predictor variables were selected from demographic characteristics, smoking status, medical and drug history and laboratory measures. RESULTS We developed the predictive models by decision tree using 60 input variables and one output variable. The overall classification accuracy was 90.5%, with 31.1% sensitivity and 97.9% specificity; for the subjects without diabetes, precision and f-measure were 92% and 0.95, respectively. The identified variables included fasting plasma glucose, body mass index, triglycerides, mean arterial blood pressure, family history of diabetes, educational level and job status. CONCLUSIONS In conclusion, decision tree analysis, using routine demographic, clinical, anthropometric and laboratory measurements, created a simple tool to predict individuals at low risk for type 2 diabetes.
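The reported figures can be related to one another through a confusion matrix. The sketch below assumes, purely for illustration, that the sensitivity and specificity apply to the whole cohort of 6647 people with 729 cases; the abstract does not say which evaluation set the rates were measured on, so the counts are an approximation and the function names are hypothetical.

```python
def confusion_from_rates(n_total, n_positive, sensitivity, specificity):
    """Approximate confusion-matrix counts from cohort size and reported rates."""
    n_negative = n_total - n_positive
    tp = round(sensitivity * n_positive)   # diabetes cases correctly flagged
    tn = round(specificity * n_negative)   # non-cases correctly cleared
    fn = n_positive - tp
    fp = n_negative - tn
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Accuracy, plus precision and f-measure for the 'no diabetes' class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    neg_precision = tn / (tn + fn)   # share of 'no diabetes' predictions that are right
    neg_recall = tn / (tn + fp)      # identical to specificity
    neg_f = 2 * neg_precision * neg_recall / (neg_precision + neg_recall)
    return accuracy, neg_precision, neg_f


# Assumed inputs: cohort size and case count from METHODS, rates from RESULTS.
tp, fp, fn, tn = confusion_from_rates(6647, 729, 0.311, 0.979)
accuracy, neg_precision, neg_f = summarize(tp, fp, fn, tn)
print(tp, fp, fn, tn)
print(round(accuracy, 3), round(neg_precision, 3), round(neg_f, 3))
```

Under this assumption the accuracy comes out at about 0.906, with precision near 0.92 and f-measure near 0.95 for the class without diabetes, which is close to the reported values. Because only 11% of the cohort developed diabetes, overall accuracy is dominated by the majority class and is best read together with the 31.1% sensitivity.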
Affiliation(s)
- Azra Ramezankhani
- Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Science, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Omid Pournik
- Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran; Medical Informatics Research Center, Faculty of Medicine, Mashhad, Iran
| | - Jamal Shahrabi
- Industrial Engineering Department, Amirkabir University of Technology, Tehran, Iran
| | - Davood Khalili
- Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Science, Shahid Beheshti University of Medical Sciences, Tehran, Iran; Department of Epidemiology, School of Public Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Fereidoun Azizi
- Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Farzad Hadaegh
- Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Science, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
|
50
|
Ho JC, Ghosh J, Steinhubl SR, Stewart WF, Denny JC, Malin BA, Sun J. Limestone: high-throughput candidate phenotype generation via tensor factorization. J Biomed Inform 2014; 52:199-211. [PMID: 25038555 DOI: 10.1016/j.jbi.2014.07.001] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Received: 11/13/2013] [Revised: 05/14/2014] [Accepted: 07/02/2014] [Indexed: 12/22/2022]
Abstract
The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts; however, most of these approaches require labor-intensive supervision from experienced clinical professionals. Furthermore, existing approaches are often disease-centric and specialized to the idiosyncrasies of the information technology and/or business practices of a single healthcare organization. In this paper, we propose Limestone, a nonnegative tensor factorization method to derive phenotype candidates with virtually no human supervision. Limestone represents the data source interactions naturally using tensors (a generalization of matrices). In particular, we investigate the interaction of diagnoses and medications among patients. The resulting tensor factors are reported as phenotype candidates that automatically reveal patient clusters on specific diagnoses and medications. Using the proposed method, multiple phenotypes can be identified simultaneously from data. We demonstrate the capability of Limestone on a cohort of 31,815 patient records from the Geisinger Health System. The dataset spans 7 years of longitudinal patient records and was initially constructed for a heart failure onset prediction study. Our experiments demonstrate the robustness, stability, and the conciseness of Limestone-derived phenotypes. Our results show that using only 40 phenotypes, we can outperform the original 640 features (169 diagnosis categories and 471 medication types) to achieve an area under the receiver operating characteristic curve (AUC) of 0.720 (95% CI 0.715 to 0.725).
Moreover, in consultation with a medical expert, we confirmed that 82% of the top 50 candidates automatically extracted by Limestone are clinically meaningful.
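The tensor representation described in the abstract can be illustrated with a small sketch. It builds a sparse patient x diagnosis x medication count tensor from co-occurrence records and shows how a sum of nonnegative rank-one components would reconstruct one entry. The record layout, the function names, and the CP-style (sum of rank-one terms) model form are assumptions made here for illustration; the abstract states only that the method is a nonnegative tensor factorization, and this is not the Limestone implementation.

```python
from collections import Counter


def build_count_tensor(records):
    """Build a sparse 3-mode count tensor from (patient, diagnosis, medication)
    co-occurrence records. Returns the counts keyed by index triple and the
    label list for each mode."""
    patients = sorted({r[0] for r in records})
    diagnoses = sorted({r[1] for r in records})
    medications = sorted({r[2] for r in records})
    p_idx = {name: i for i, name in enumerate(patients)}
    d_idx = {name: i for i, name in enumerate(diagnoses)}
    m_idx = {name: i for i, name in enumerate(medications)}
    tensor = Counter()
    for patient, diagnosis, medication in records:
        tensor[(p_idx[patient], d_idx[diagnosis], m_idx[medication])] += 1
    return tensor, (patients, diagnoses, medications)


def reconstruct_entry(weights, A, B, C, i, j, k):
    """Value of entry (i, j, k) under a sum of rank-one components.
    A[r], B[r], C[r] are the nonnegative patient, diagnosis, and medication
    vectors of component r; under this assumed model each component plays
    the role of one phenotype candidate."""
    return sum(w * a[i] * b[j] * c[k] for w, a, b, c in zip(weights, A, B, C))


# Hypothetical records, invented for this example only.
records = [
    ("p1", "hypertension", "lisinopril"),
    ("p1", "hypertension", "lisinopril"),
    ("p2", "diabetes", "metformin"),
]
tensor, (patients, diagnoses, medications) = build_count_tensor(records)
print(dict(tensor))
```

Keeping only the nonzero entries in a dictionary reflects the fact that most patient, diagnosis, and medication combinations never occur, so a dense array would be mostly zeros.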
Affiliation(s)
- Joyce C Ho
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States.
| | - Joydeep Ghosh
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States
| | - Steve R Steinhubl
- Scripps Translational Science Institute, Scripps Health, La Jolla, CA 92037, United States
| | - Walter F Stewart
- Sutter Health Research, Development, and Dissemination Team, Sutter Health, Walnut Creek, CA 94598, United States
| | - Joshua C Denny
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, United States; Department of Medicine, Vanderbilt University, Nashville, TN 37232, United States
| | - Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, United States; Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37232, United States
| | - Jimeng Sun
- School of Computational Science and Engineering at College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, United States
|