1
|
Gregg RW, Karoleski CM, Silverman EK, Sciurba FC, DeMeo DL, Benos PV. Identification of factors directly linked to incident chronic obstructive pulmonary disease: A causal graph modeling study. PLoS Med 2024; 21:e1004444. [PMID: 39137208 PMCID: PMC11349214 DOI: 10.1371/journal.pmed.1004444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 08/27/2024] [Accepted: 07/18/2024] [Indexed: 08/15/2024] Open
Abstract
BACKGROUND Beyond exposure to cigarette smoking and aging, the factors that influence lung function decline to incident chronic obstructive pulmonary disease (COPD) remain unclear. Advancements have been made in categorizing COPD into emphysema and airway predominant disease subtypes; however, predicting which healthy individuals will progress to COPD is difficult because they can exhibit profoundly different disease trajectories despite similar initial risk factors. This study aimed to identify clinical, genetic, and radiological features that are directly linked to, and subsequently predict, abnormal lung function. METHODS AND FINDINGS We employed graph modeling on 2,643 COPDGene participants (aged 45 to 80 years, 51.25% female, 35.1% African Americans; enrollment 11/2007-4/2011) with smoking history but normal spirometry at study enrollment to identify variables that are directly linked to future lung function abnormalities. We developed logistic regression and random forest predictive models for distinguishing individuals who maintain lung function from those who decline. Of the 131 variables analyzed, 6 were identified as informative to future lung function abnormalities, namely forced expiratory flow in the middle range (FEF25-75%), average lung wall thickness in a 10 mm radius (Pi10), severe emphysema, age, sex, and height. We investigated whether these features predict individuals leaving GOLD 0 status (normal spirometry according to Global Initiative for Chronic Obstructive Lung Disease (GOLD) criteria). Linear models, trained with these features, were quite predictive (area under the receiver operating characteristic curve, or AUROC = 0.75). Random forest predictors performed similarly to logistic regression (AUROC = 0.7), indicating that no significant nonlinear effects were present. The results were externally validated on 150 participants from the Specialized Center for Clinically Oriented Research (SCCOR) cohort (aged 45 to 80 years, 52.7% female, 4.7% African Americans; enrollment: 7/2007-12/2012) (AUROC = 0.89). The main limitation of longitudinal studies with 5- and 10-year follow-up is the introduction of mortality bias that disproportionately affects the more severe cases. However, our study focused on spirometrically normal individuals, who have a lower mortality rate. Another limitation is the use of strict criteria to define spirometrically normal individuals, which was unavoidable when studying factors associated with changes in normalized forced expiratory volume in 1 s (FEV1%predicted) or the ratio of FEV1/FVC (forced vital capacity). CONCLUSIONS This study took an agnostic approach to identify which baseline measurements differentiate and predict the early stages of lung function decline in individuals with previous smoking history. Our analysis suggests that emphysema affects obstruction onset, while airway predominant pathology may play a more important role in future FEV1 (%predicted) decline without obstruction, and FEF25-75% may affect both.
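The AUROC values quoted in this abstract can be read as the probability that a randomly chosen individual who declines receives a higher predicted score than a randomly chosen individual who does not. A minimal sketch of that computation in plain Python follows; the function name `auroc` and the toy scores are illustrative assumptions, not the authors' code or data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    scores: predicted risk per individual (higher = more likely positive).
    labels: 1 for positive (e.g. left GOLD 0), 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs that are ranked correctly; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores, for illustration only.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a rank-based implementation would be the usual choice for larger cohorts.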
Affiliation(s)
- Robert W. Gregg
  - Department of Epidemiology, University of Florida, Gainesville, Florida, United States of America
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Chad M. Karoleski
  - University of Pittsburgh Medical Center, Department of Medicine, Department of Pulmonary Allergy and Critical Care Medicine, Pittsburgh, Pennsylvania, United States of America
- Edwin K. Silverman
  - Channing Division of Network Medicine and the Division of Pulmonary and Critical Care Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Frank C. Sciurba
  - University of Pittsburgh Medical Center, Department of Medicine, Department of Pulmonary Allergy and Critical Care Medicine, Pittsburgh, Pennsylvania, United States of America
- Dawn L. DeMeo
  - Channing Division of Network Medicine and the Division of Pulmonary and Critical Care Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts, United States of America
- Panayiotis V. Benos
  - Department of Epidemiology, University of Florida, Gainesville, Florida, United States of America
|
2
|
Pálková M, Uhlík O, Apeltauer T. Calibration of pedestrian ingress model based on CCTV surveillance data using machine learning methods. PLoS One 2024; 19:e0293679. [PMID: 38236901 PMCID: PMC10795986 DOI: 10.1371/journal.pone.0293679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 10/17/2023] [Indexed: 01/22/2024] Open
Abstract
Machine learning methods and agent-based models enable the optimization of the operation of high-capacity facilities. In this paper, we propose a method for automatically extracting and cleaning pedestrian traffic detector data for subsequent calibration of the ingress pedestrian model. The data was obtained from the waiting room traffic of a vaccination center. Walking speed distribution, the number of stops, the distribution of waiting times, and the locations of waiting points were extracted. Of the 9 machine learning algorithms, the random forest model achieved the highest accuracy in classifying valid data and noise. The proposed microscopic calibration allows for more accurate capacity assessment testing, procedural changes testing, and geometric modifications testing in parts of the facility adjacent to the calibrated parts. The results show that the proposed method achieves state-of-the-art performance on a violent-flows dataset. The proposed method has the potential to significantly improve the accuracy and efficiency of input model predictions and optimize the operation of high-capacity facilities.
Affiliation(s)
- Martina Pálková
  - Faculty of Civil Engineering, Brno University of Technology, Brno, Czech Republic
- Ondřej Uhlík
  - Faculty of Civil Engineering, Brno University of Technology, Brno, Czech Republic
- Tomáš Apeltauer
  - Faculty of Civil Engineering, Brno University of Technology, Brno, Czech Republic
|
3
|
Sutradhar A, Al Rafi M, Shamrat FMJM, Ghosh P, Das S, Islam MA, Ahmed K, Zhou X, Azad AKM, Alyami SA, Moni MA. BOO-ST and CBCEC: two novel hybrid machine learning methods aim to reduce the mortality of heart failure patients. Sci Rep 2023; 13:22874. [PMID: 38129433 PMCID: PMC10739972 DOI: 10.1038/s41598-023-48486-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 11/27/2023] [Indexed: 12/23/2023] Open
Abstract
Heart failure (HF) is a leading cause of mortality worldwide. Machine learning (ML) approaches have shown potential as an early detection tool for improving patient outcomes. Enhancing the effectiveness and clinical applicability of an ML model requires training an efficient classifier on a diverse set of high-quality datasets. Hence, we proposed two novel hybrid ML methods, (a) one consisting of Boosting, SMOTE, and Tomek links (BOO-ST) and (b) one combining the best-performing conventional classifier with ensemble classifiers (CBCEC), to serve as an efficient early warning system for HF mortality. BOO-ST was introduced to tackle the challenge of class imbalance, while CBCEC was responsible for training on the processed features selected by the Feature Importance (FI) and Information Gain (IG) feature selection techniques. We also conducted an explicit and intuitive analysis to explore the impact of characteristics that correlate with fatal cases of HF. The experimental results demonstrated that the proposed CBCEC classifier achieves an accuracy of 93.67% in early forecasting of HF mortality. These results suggest that the proposed methods (BOO-ST and CBCEC) can play a crucial role in reducing the death rate of HF and reducing stress in the healthcare sector.
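The two resampling ingredients named in the abstract, SMOTE and Tomek links, can be illustrated with a small plain-Python sketch. This is not the authors' BOO-ST pipeline: the function names, the use of squared Euclidean distance, and the toy data are assumptions made for illustration, and the boosting step is omitted.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(i, X):
    """Index of the nearest neighbour of X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: sqdist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) of opposite-class points that are mutual nearest neighbours."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if i < j and y[i] != y[j] and nearest(j, X) == i:
            links.append((i, j))
    return links


def smote_point(a, b, rng):
    """One SMOTE-style synthetic sample on the segment between a and b.

    In SMOTE, b would be one of the k nearest minority neighbours of a;
    here the caller supplies the pair directly.
    """
    t = rng.random()
    return [x + t * (z - x) for x, z in zip(a, b)]


# Hypothetical one-dimensional toy data, for illustration only.
X_toy = [[0.0], [1.0], [1.2], [3.0], [3.1]]
y_toy = [0, 0, 1, 1, 1]
links = tomek_links(X_toy, y_toy)
synthetic = smote_point(X_toy[3], X_toy[4], random.Random(0))
```

Removing one or both members of each Tomek link cleans the class boundary, while the interpolated points add minority examples; combining the two is the usual motivation for SMOTE-plus-Tomek resampling.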
Affiliation(s)
- Ananda Sutradhar
  - Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- Mustahsin Al Rafi
  - Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City (DSC), Birulia, Savar, Dhaka, 1216, Bangladesh
- F M Javed Mehedi Shamrat
  - Department of Computer System and Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia
- Pronab Ghosh
  - Department of Computer Science, Lakehead University, 955 Oliver Rd, Thunder Bay, ON, P7B 5E1, Canada
- Subrata Das
  - Department of Computer Science, Lakehead University, 955 Oliver Rd, Thunder Bay, ON, P7B 5E1, Canada
- Md Anaytul Islam
  - Department of Computer Science, Lakehead University, 955 Oliver Rd, Thunder Bay, ON, P7B 5E1, Canada
- Kawsar Ahmed
  - Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
  - Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
  - Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh
- Xujuan Zhou
  - School of Business, University of Southern Queensland, Toowoomba, Australia
- A K M Azad
  - Department of Mathematics and Statistics, Faculty of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), 13318, Riyadh, Saudi Arabia
- Salem A Alyami
  - Department of Mathematics and Statistics, Faculty of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), 13318, Riyadh, Saudi Arabia
- Mohammad Ali Moni
  - Centre for AI & Digital Health Technology, Artificial Intelligence & Cyber Future Institute, Charles Sturt University, Bathurst, NSW, 2795, Australia
|
4
|
Biswas A, Chen C, Dobson KG, Prince SA, Shahidi FV, Smith PM, Fuller D. Identifying the sociodemographic and work-related factors related to workers' daily physical activity using a decision tree approach. BMC Public Health 2023; 23:1853. [PMID: 37741965 PMCID: PMC10517528 DOI: 10.1186/s12889-023-16747-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2023] [Accepted: 09/12/2023] [Indexed: 09/25/2023] Open
Abstract
BACKGROUND The social and behavioural factors related to physical activity among adults are well known. Despite the overlapping nature of these factors, few studies have examined how multiple predictors of physical activity interact. This study aimed to identify the relative importance of multiple interacting sociodemographic and work-related factors associated with the daily physical activity patterns of a population-based sample of workers. METHODS Sociodemographic, work, screen time, and health variables were obtained from five repeated cross-sectional cohorts of workers from the Canadian Health Measures Survey (2007 to 2017). Classification and Regression Tree (CART) modelling was used to identify the discriminators associated with six daily physical activity patterns. The performance of the CART approach was compared to a stepwise multinomial logistic regression model. RESULTS Among the 8,909 workers analysed, the most important CART discriminators of daily physical activity patterns were age, job skill, and physical strength requirements of the job. Other important factors included participants' sex, educational attainment, fruit/vegetable intake, industry, work hours, marital status, having a child living at home, computer time, and household income. The CART tree had moderate classification accuracy and performed marginally better than the stepwise multinomial logistic regression model. CONCLUSION Age and work-related factors, particularly job skill and physical strength requirements at work, appeared as the most important factors related to physical activity attainment, and differed based on sex, work hours, and industry. Delineating the hierarchy of factors associated with daily physical activity may assist in targeting preventive strategies aimed at promoting physical activity in workers.
Affiliation(s)
- Aviroop Biswas
  - Institute for Work & Health, 400 University Avenue, Suite 1800, Toronto, ON, M5G1S5, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Cynthia Chen
  - Institute for Work & Health, 400 University Avenue, Suite 1800, Toronto, ON, M5G1S5, Canada
- Kathleen G Dobson
  - Institute for Work & Health, 400 University Avenue, Suite 1800, Toronto, ON, M5G1S5, Canada
- Stephanie A Prince
  - Centre for Surveillance and Applied Research, Public Health Agency of Canada, Ottawa, ON, Canada
  - School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
- Faraz Vahid Shahidi
  - Institute for Work & Health, 400 University Avenue, Suite 1800, Toronto, ON, M5G1S5, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Peter M Smith
  - Institute for Work & Health, 400 University Avenue, Suite 1800, Toronto, ON, M5G1S5, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
  - Department of Epidemiology and Preventive Medicine, Monash University, VIC, Melbourne, Australia
- Daniel Fuller
  - Department of Community Health and Epidemiology, University of Saskatchewan, Saskatoon, SK, Canada
|
5
|
Vos G, Trinh K, Sarnyai Z, Rahimi Azghadi M. Generalizable machine learning for stress monitoring from wearable devices: A systematic literature review. Int J Med Inform 2023; 173:105026. [PMID: 36893657 DOI: 10.1016/j.ijmedinf.2023.105026] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 11/07/2022] [Revised: 02/21/2023] [Accepted: 02/23/2023] [Indexed: 03/06/2023]
Abstract
INTRODUCTION Wearable sensors have shown promise as a non-intrusive method for collecting biomarkers that may correlate with levels of elevated stress. Stressors cause a variety of biological responses, and these physiological reactions can be measured using biomarkers including Heart Rate Variability (HRV), Electrodermal Activity (EDA) and Heart Rate (HR) that represent the stress response from the Hypothalamic-Pituitary-Adrenal (HPA) axis, the Autonomic Nervous System (ANS), and the immune system. While cortisol response magnitude remains the gold standard indicator for stress assessment [1], recent advances in wearable technologies have resulted in the availability of a number of consumer devices capable of recording HRV, EDA and HR sensor biomarkers, amongst other signals. At the same time, researchers have been applying machine learning techniques to the recorded biomarkers in order to build models that may be able to predict elevated levels of stress. OBJECTIVE The aim of this review is to provide an overview of machine learning techniques utilized in prior research, with a specific focus on model generalization when using public datasets as training data. We also shed light on the challenges and opportunities that machine learning-enabled stress monitoring and detection face. METHODS This study reviewed published works contributing and/or using public datasets designed for detecting stress and their associated machine learning methods. The electronic databases of Google Scholar, Crossref, DOAJ and PubMed were searched for relevant articles and a total of 33 articles were identified and included in the final analysis. The reviewed works were synthesized into three categories: publicly available stress datasets, machine learning techniques applied using those datasets, and future research directions. For the machine learning studies reviewed, we provide an analysis of their approach to results validation and model generalization. The quality assessment of the included studies was conducted in accordance with the IJMEDI checklist [2]. RESULTS A number of public datasets were identified that are labeled for stress detection. These datasets were most commonly produced from sensor biomarker data recorded using the Empatica E4 device, a well-studied, medical-grade wrist-worn wearable that provides the sensor biomarkers most notably correlated with elevated levels of stress. Most of the reviewed datasets contain less than twenty-four hours of data, and the varied experimental conditions and labeling methodologies potentially limit their ability to generalize to unseen data. In addition, we discuss how previous works show shortcomings in areas such as their labeling protocols, lack of statistical power, validity of stress biomarkers, and model generalization ability. CONCLUSION Health tracking and monitoring using wearable devices is growing in popularity, while the generalization of existing machine learning models still requires further study, and research in this area will continue to provide improvements as newer and more substantial datasets become available.
Affiliation(s)
- Gideon Vos
  - College of Science and Engineering, James Cook University, James Cook Dr, Townsville, 4811, QLD, Australia
- Kelly Trinh
  - College of Science and Engineering, James Cook University, James Cook Dr, Townsville, 4811, QLD, Australia
- Zoltan Sarnyai
  - College of Public Health, Medical, and Vet Sciences, James Cook University, James Cook Dr, Townsville, 4811, QLD, Australia
- Mostafa Rahimi Azghadi
  - College of Science and Engineering, James Cook University, James Cook Dr, Townsville, 4811, QLD, Australia
|
6
|
Syakiylla Sayed Daud SN, Sudirman R, Wee Shing T. Safe-level SMOTE method for handling the class imbalanced problem in electroencephalography dataset of adult anxious state. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2023.104649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/10/2023]
|
7
|
Kim C, Jeong J, Choi J. Effects of Class Imbalance and Data Scarcity on the Performance of Binary Classification Machine Learning Models Developed Based on ToxCast/Tox21 Assay Data. Chem Res Toxicol 2022; 35:2219-2226. [PMID: 36475638 DOI: 10.1021/acs.chemrestox.2c00189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
Abstract
The development of toxicity classification models using the ToxCast database has been extensively studied. Machine learning approaches are effective in identifying the bioactivity of untested chemicals. However, ToxCast assays differ in the amount of data and degree of class imbalance (CI). Therefore, the resampling algorithm employed should vary depending on the data distribution to achieve optimal classification performance. In this study, the effects of CI and data scarcity (DS) on the performance of binary classification models were investigated using ToxCast bioassay data. An assay matrix based on CI and DS was prepared for 335 assays with biologically intended target information, and 28 CI assays and 3 DS assays were selected. Thirty models established by combining five molecular fingerprints (i.e., Morgan, MACCS, RDKit, Pattern, and Layered) and six algorithms [i.e., gradient boosting tree, random forest (RF), multi-layered perceptron, k-nearest neighbor, logistic regression, and naive Bayes] were trained using the selected assay data set. Of the 30 trained models, MACCS-RF showed the best performance and thus was selected for analyses of the effects of CI and DS. Results showed that recall and F1 were significantly lower when training with the CI assays than with the DS assays. In addition, hyperparameter tuning of the RF algorithm significantly improved F1 on CI assays. This study provided a basis for developing a toxicity classification model with improved performance by evaluating the effects of data set characteristics. This study also emphasized the importance of using appropriate evaluation metrics and tuning hyperparameters in model development.
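Two points in this abstract lend themselves to a short illustration: the 30 candidate models arise from pairing five fingerprints with six algorithms, and recall and F1 expose weaknesses that accuracy hides when active chemicals are rare. The sketch below uses only the names listed in the abstract for the grid; the helper `binary_metrics` and the confusion counts are hypothetical and are not taken from the study.

```python
from itertools import product

FINGERPRINTS = ["Morgan", "MACCS", "RDKit", "Pattern", "Layered"]
ALGORITHMS = [
    "gradient boosting tree", "random forest", "multi-layered perceptron",
    "k-nearest neighbor", "logistic regression", "naive Bayes",
]

# Every fingerprint paired with every algorithm: 5 x 6 = 30 candidate models.
MODEL_GRID = list(product(FINGERPRINTS, ALGORITHMS))


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced assay: 50 active chemicals out of 1000,
# of which the model finds only 10.
imbalanced = binary_metrics(tp=10, fp=5, fn=40, tn=945)
```

With these made-up counts the accuracy is above 0.95 while recall is 0.2, which is the kind of gap that motivates reporting recall and F1 on imbalanced assays.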
Affiliation(s)
- Changhun Kim
  - Chemical Bigdata Research Center, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
  - School of Environmental Engineering, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
- Jaeseong Jeong
  - Chemical Bigdata Research Center, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
  - School of Environmental Engineering, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
- Jinhee Choi
  - Chemical Bigdata Research Center, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
  - School of Environmental Engineering, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
|
8
|
Islam MT, Mustafa HA. Multi-Layer Hybrid (MLH) balancing technique: A combined approach to remove data imbalance. DATA KNOWL ENG 2022. [DOI: 10.1016/j.datak.2022.102105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
9
|
Junaid M, Ali S, Siddiqui IF, Nam C, Qureshi NMF, Kim J, Shin DR. Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem. WIRELESS PERSONAL COMMUNICATIONS 2022; 126:2403-2423. [PMID: 36033548 PMCID: PMC9396610 DOI: 10.1007/s11277-021-09362-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/04/2021] [Indexed: 06/15/2023]
Abstract
Artificial intelligence, specifically machine learning, has been applied in a variety of ways by the research community to transform data sources into valuable facts and understanding, allowing for superior pattern identification. Running machine learning algorithms on huge and complicated data sets is, however, computationally expensive, and processing requires hardware and logical resources such as space, CPU, and memory. As the amount of data created daily reaches quintillions of bytes, a complex big data infrastructure becomes more and more relevant. The Apache Spark machine learning library (ML-lib) is a well-known platform for big data analysis; it includes several useful features for machine learning applications, including regression, classification, and dimension reduction, as well as clustering and feature extraction. In this contribution, we consider Apache Spark ML-lib as a computationally independent machine learning library that is open-source, distributed, scalable, and platform independent. We have evaluated and compared several ML algorithms to analyze the platform's qualities, and compared Apache Spark ML-lib against Rapid Miner and Sklearn, which are two additional big data and machine learning processing platforms. Logistic Classifier (LC), Decision Tree Classifier (DTc), Random Forest Classifier (RFC), and Gradient Boosted Tree Classifier (GBTC) are the four machine learning algorithms compared across platforms. In addition, we have tested general regression methods such as Linear Regressor (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosted Tree Regressor (GBTR) on the SUSY and HIGGS datasets. Moreover, we have evaluated unsupervised learning methods such as K-means and Gaussian Mixture Models on the SUSY and HEPMASS datasets to determine the robustness of PySpark in comparison with the classification and regression models. We used the "SUSY," "HIGGS," "BANK," and "HEPMASS" datasets from the UCI data repository. We also discuss recent developments in the research into Big Data machines and provide future research directions.
Affiliation(s)
- Muhammad Junaid
  - Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, South Korea
- Sajid Ali
  - Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, South Korea
- Isma Farah Siddiqui
  - Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
- Choonsung Nam
  - Department of Software Convergence Engineering, Inha University, Incheon, South Korea
- Jaehyoun Kim
  - Department of Computer Education, Sungkyunkwan University, Seoul, South Korea
- Dong Ryeol Shin
  - Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, South Korea
|
10
|
A Highly Adaptive Oversampling Approach to Address the Issue of Data Imbalance. COMPUTERS 2022. [DOI: 10.3390/computers11050073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
Data imbalance is a serious problem in machine learning that can be alleviated at the data level by balancing the class distribution with sampling. In the last decade, several sampling methods have been published to address the shortcomings of the initial ones, such as noise sensitivity and incorrect neighbor selection. Based on the review of the literature, it has become clear to us that the algorithms achieve varying performance on different data sets. In this paper, we present a new oversampler that has been developed based on the key steps and sampling strategies identified by analyzing dozens of existing methods and that can be fitted to various data sets through an optimization process. Experiments were performed on a number of data sets, which show that the proposed method had a similar or better effect on the performance of SVM, DTree, kNN and MLP classifiers compared with other well-known samplers found in the literature. The results were also confirmed by statistical tests.
|
11
|
Machine Learning in Prediction of Bladder Cancer on Clinical Laboratory Data. Diagnostics (Basel) 2022; 12:203. [PMID: 35054370 PMCID: PMC8774436 DOI: 10.3390/diagnostics12010203] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 12/23/2021] [Revised: 01/09/2022] [Accepted: 01/13/2022] [Indexed: 12/19/2022] Open
Abstract
Bladder cancer has been increasing globally. Urinary cytology is considered a major screening method for bladder cancer, but it has poor sensitivity. This study aimed to utilize clinical laboratory data and machine learning methods to build predictive models of bladder cancer. A total of 1336 patients with cystitis, bladder cancer, kidney cancer, uterus cancer, and prostate cancer were enrolled in this study. Two-step feature selection combining WEKA and forward selection was performed. Furthermore, five machine learning models, including decision tree, random forest, support vector machine, extreme gradient boosting (XGBoost), and light gradient boosting machine (lightGBM), were applied. Features including calcium, alkaline phosphatase (ALP), albumin, urine ketone, urine occult blood, creatinine, alanine aminotransferase (ALT), and diabetes were selected. The lightGBM model obtained an accuracy of 84.8% to 86.9%, a sensitivity of 84% to 87.8%, a specificity of 82.9% to 86.7%, and an area under the curve (AUC) of 0.88 to 0.92 in discriminating bladder cancer from cystitis and other cancers. Our study provides a demonstration of utilizing clinical laboratory data to predict bladder cancer.
|
12
|
RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets. ELECTRONICS 2022. [DOI: 10.3390/electronics11020228] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/22/2022]
Abstract
Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
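The partition-then-vote idea described in this abstract can be sketched in plain Python. The abstract does not specify how the balanced subdatasets are formed, so the scheme below (disjoint random chunks of the majority class, each paired with an equally sized random draw of minority examples) is an assumption, as are the nearest-centroid base classifier, all function names, and the toy data.

```python
import random
from collections import Counter


def balanced_partitions(X, y, minority=1, seed=0):
    """Split (X, y) into class-balanced subdatasets (assumed scheme, see lead-in)."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    if not minority_idx:
        raise ValueError("no minority examples to balance against")
    rng.shuffle(majority_idx)
    size = len(minority_idx)
    parts = []
    for start in range(0, len(majority_idx), size):
        chunk = majority_idx[start:start + size]
        # Draw as many minority examples as the chunk holds, so a short
        # final chunk still yields a balanced subdataset.
        idx = rng.sample(minority_idx, len(chunk)) + chunk
        parts.append(([X[i] for i in idx], [y[i] for i in idx]))
    return parts


def fit_centroids(X, y):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    sums, counts = {}, Counter()
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(centroids, row):
    """Label of the centroid closest to row (squared Euclidean distance)."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(row, centroids[label])))


def vote_predict(models, row):
    """Majority vote over the per-partition models."""
    votes = Counter(predict_centroid(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 12 majority (label 0) and 3 minority (label 1) points.
X_toy = [[0.1 * i, 0.0] for i in range(12)] + [[5.0, 5.0 + 0.1 * i] for i in range(3)]
y_toy = [0] * 12 + [1] * 3
parts = balanced_partitions(X_toy, y_toy)
models = [fit_centroids(px, py) for px, py in parts]
```

Each partition is visited once and each example is copied a bounded number of times, which is consistent with the linear-time claim in the abstract; any base classifier could replace the nearest-centroid stand-in.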
|
13
|
Jeong W, Gaggioli CA, Gagliardi L. Active Learning Configuration Interaction for Excited-State Calculations of Polycyclic Aromatic Hydrocarbons. J Chem Theory Comput 2021; 17:7518-7530. [PMID: 34787422 PMCID: PMC8675132 DOI: 10.1021/acs.jctc.1c00769] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/31/2021] [Indexed: 11/30/2022]
Abstract
We present the active learning configuration interaction (ALCI) method for multiconfigurational calculations based on large active spaces. ALCI leverages the use of an active learning procedure to find important electronic configurations among the full configurational space generated within an active space. We tested it for the calculation of singlet-singlet excited states of acenes and pyrene using different machine learning algorithms. The ALCI method yields excitation energies within 0.2-0.3 eV from those obtained by traditional complete active-space configuration interaction (CASCI) calculations (affordable for active spaces up to 16 electrons in 16 orbitals) by including only a small fraction of the CASCI configuration space in the calculations. For larger active spaces (we tested up to 26 electrons in 26 orbitals), not affordable with traditional CI methods, ALCI captures the trends of experimental excitation energies. Overall, ALCI provides satisfactory approximations to large active-space wave functions with up to 10 orders of magnitude fewer determinants for the systems presented here. These ALCI wave functions are promising and affordable starting points for the subsequent second-order perturbation theory or pair-density functional theory calculations.
Affiliation(s)
- WooSeok Jeong
  - Department of Chemistry, Nanoporous Materials Genome Center, Chemical Theory Center, and Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, Minnesota 55455, United States
- Carlo Alberto Gaggioli
  - Department of Chemistry, Pritzker School of Molecular Engineering, James Franck Institute, Chicago Center for Theoretical Chemistry, University of Chicago, Chicago, Illinois 60637, United States
- Laura Gagliardi
  - Department of Chemistry, Pritzker School of Molecular Engineering, James Franck Institute, Chicago Center for Theoretical Chemistry, University of Chicago, Chicago, Illinois 60637, United States
  - Argonne National Laboratory, Lemont, Illinois 60439, United States
|
14
|
Performance Improvement of Decision Tree: A Robust Classifier Using Tabu Search Algorithm. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11156728] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/31/2022]
Abstract
Classification and regression are the major applications of machine learning algorithms, which are widely used to solve problems in numerous domains of engineering and computer science. Different classifiers based on the optimization of the decision tree have been proposed; however, this area is still evolving. This paper presents a novel and robust classifier based on decision tree and tabu search algorithms. To improve performance, our proposed algorithm constructs multiple decision trees while employing a tabu search algorithm to consistently monitor the leaf and decision nodes in the corresponding decision trees. Additionally, the tabu search algorithm is responsible for balancing the entropy of the corresponding decision trees. For training the model, we used the clinical data of COVID-19 patients to predict whether a patient is suffering. The experimental results were obtained using our proposed classifier, built on the scikit-learn library in Python. An extensive performance comparison against conventional supervised machine learning algorithms is presented using Big O and statistical analysis. Moreover, a performance comparison to optimized state-of-the-art classifiers is also presented. The achieved accuracy of 98%, the required execution time of 55.6 ms, and the area under the receiver operating characteristic curve (AUROC) of 0.95 for the proposed method reveal that the proposed classifier algorithm is convenient for large datasets.
|
15
|
Ullah Z, Saleem F, Jamjoom M, Fakieh B. Reliable Prediction Models Based on Enriched Data for Identifying the Mode of Childbirth by Using Machine Learning Methods: Development Study. J Med Internet Res 2021; 23:e28856. [PMID: 34085938 PMCID: PMC8214183 DOI: 10.2196/28856] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 03/16/2021] [Revised: 03/30/2021] [Accepted: 04/30/2021] [Indexed: 11/30/2022] Open
Abstract
Background The use of artificial intelligence has revolutionized every area of life such as business and trade, social and electronic media, education and learning, manufacturing industries, medicine and sciences, and every other sector. The new reforms and advanced technologies of artificial intelligence have enabled data analysts to transmute raw data generated by these sectors into meaningful insights for an effective decision-making process. Health care is one of the integral sectors where a large amount of data is generated daily, and making effective decisions based on these data is therefore a challenge. In this study, cases related to childbirth either by the traditional method of vaginal delivery or cesarean delivery were investigated. Cesarean delivery is performed to save both the mother and the fetus when complications related to vaginal birth arise. Objective The aim of this study was to develop reliable prediction models for a maternity care decision support system to predict the mode of delivery before childbirth. Methods This study was conducted in 2 parts for identifying the mode of childbirth: first, the existing data set was enriched and second, previous medical records about the mode of delivery were investigated using machine learning algorithms and by extracting meaningful insights from unseen cases. Several prediction models were trained to achieve this objective, such as decision tree, random forest, AdaBoostM1, bagging, and k-nearest neighbor, based on original and enriched data sets. Results The prediction models based on enriched data performed well in terms of accuracy, sensitivity, specificity, F-measure, and receiver operating characteristic curves in the outcomes. Specifically, the accuracy of k-nearest neighbor was 84.38%, that of bagging was 83.75%, that of random forest was 83.13%, that of decision tree was 81.25%, and that of AdaBoostM1 was 80.63%. 
Enrichment of the data set had a good impact on improving the accuracy of the prediction process, which supports maternity care practitioners in making decisions in critical cases. Conclusions Our study shows that enriching the data set improves the accuracy of the prediction process, thereby supporting maternity care practitioners in making informed decisions in critical cases. The enriched data set used in this study yields good results, but this data set can become even better if the records are increased with real clinical data.
Affiliation(s)
- Zahid Ullah
  - Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Farrukh Saleem
  - Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Mona Jamjoom
  - Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, Saudi Arabia
- Bahjat Fakieh
  - Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
|