1
|
Bombina P, Tally D, Abrams ZB, Coombes KR. SillyPutty: Improved clustering by optimizing the silhouette width. PLoS One 2024; 19:e0300358. [PMID: 38848330 PMCID: PMC11161052 DOI: 10.1371/journal.pone.0300358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2023] [Accepted: 02/26/2024] [Indexed: 06/09/2024] Open
Abstract
Clustering is an important task in biomedical science, and it is widely believed that different data sets are best clustered using different algorithms. When choosing between clustering algorithms on the same data set, researchers typically rely on global measures of quality, such as the mean silhouette width, and overlook the fine details of clustering. However, the silhouette width actually computes scores that describe how well each individual element is clustered. Inspired by this observation, we developed a novel clustering method, called SillyPutty. Unlike existing methods, SillyPutty uses the silhouette width for individual elements as a tool to optimize the mean silhouette width. This shift in perspective allows for a more granular evaluation of clustering quality, potentially addressing limitations in current methodologies. To test the SillyPutty algorithm, we first simulated a series of data sets using the Umpire R package and then used real-world data from The Cancer Genome Atlas. Using these data sets, we compared SillyPutty to several existing algorithms using multiple metrics (Silhouette Width, Adjusted Rand Index, Entropy, Normalized Within-group Sum of Square errors, and Perfect Classification Count). Our findings revealed that SillyPutty is a valid standalone clustering method, comparable in accuracy to the best existing methods. We also found that the combination of hierarchical clustering followed by SillyPutty has the best overall performance in terms of both accuracy and speed. Availability: The SillyPutty R package can be downloaded from the Comprehensive R Archive Network (CRAN).
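The idea of using per-element silhouette widths to improve a clustering can be illustrated with a small sketch. The code below is not the SillyPutty R package and makes no claim to match its implementation; it assumes Euclidean distance, and the function names (mean_distance, silhouette_widths, reassign_worst) and the max_steps parameter are hypothetical. It computes s(i) = (b - a) / max(a, b) for each element and repeatedly moves the element with the most negative width to its nearest cluster.

```python
from math import dist


def mean_distance(points, labels, i, cluster):
    """Mean Euclidean distance from point i to the other members of `cluster`."""
    members = [q for j, q in enumerate(points) if labels[j] == cluster and j != i]
    if not members:
        return None  # point i is the only member of that cluster
    return sum(dist(points[i], q) for q in members) / len(members)


def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for every point."""
    widths = []
    for i in range(len(points)):
        a = mean_distance(points, labels, i, labels[i])
        others = [mean_distance(points, labels, i, c)
                  for c in set(labels) if c != labels[i]]
        if a is None or not others or max(a, min(others)) == 0:
            widths.append(0.0)  # convention for singletons and degenerate cases
            continue
        b = min(others)
        widths.append((b - a) / max(a, b))
    return widths


def reassign_worst(points, labels, max_steps=100):
    """Greedy sketch: move the worst-clustered element until none is negative."""
    labels = list(labels)
    for _ in range(max_steps):
        widths = silhouette_widths(points, labels)
        worst = min(range(len(points)), key=widths.__getitem__)
        if widths[worst] >= 0:
            break  # no element is closer to another cluster than to its own
        # Move the worst element to the cluster with the smallest mean distance.
        candidates = {c: mean_distance(points, labels, worst, c) for c in set(labels)}
        labels[worst] = min((c for c, d in candidates.items() if d is not None),
                            key=candidates.get)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [0, 0, 1, 1, 1, 1]  # the point (1, 0) starts in the wrong cluster
print(reassign_worst(points, start))
```

On this toy data the only element with a negative silhouette width is the mislabeled point (1, 0), so a single reassignment is enough. The max_steps cap is a guard against cycling in less clean data.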
Affiliation(s)
- Polina Bombina
- Department of Biostatistics, Data Science and Epidemiology, Georgia Cancer Center at Augusta University, Augusta, GA, United States of America
| | - Dwayne Tally
- Department of Informatics, Indiana University, United States of America
| | - Zachary B. Abrams
- Division of Data Science and Biostatistics, Institute for Informatics, Washington University School of Medicine, Saint Louis, MO, United States of America
| | - Kevin R. Coombes
- Department of Biostatistics, Data Science and Epidemiology, Georgia Cancer Center at Augusta University, Augusta, GA, United States of America
| |
|
2
|
Bombina P, Tally D, Abrams ZB, Coombes KR. SillyPutty: Improved clustering by optimizing the silhouette width. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.07.566055. [PMID: 37986817 PMCID: PMC10659363 DOI: 10.1101/2023.11.07.566055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
Unsupervised clustering is an important task in biomedical science. We developed a new clustering method, called SillyPutty, for unsupervised clustering. As test data, we generated a series of datasets using the Umpire R package. Using these datasets, we compared SillyPutty to several existing algorithms using multiple metrics (Silhouette Width, Adjusted Rand Index, Entropy, Normalized Within-group Sum of Square errors, and Perfect Classification Count). Our findings revealed that SillyPutty is a valid standalone clustering method, comparable in accuracy to the best existing methods. We also found that the combination of hierarchical clustering followed by SillyPutty has the best overall performance in terms of both accuracy and speed.
Affiliation(s)
- Polina Bombina
- Department of Biostatistics, Data Science, and Epidemiology, Georgia Cancer Center at Augusta University, Augusta, GA, USA
| | - Dwayne Tally
- Department of Informatics, Indiana University, USA
| | - Zachary B. Abrams
- Institute for Informatics, Division of Data Science and Biostatistics, Washington University School of Medicine, Saint Louis, MO, USA
| | - Kevin R. Coombes
- Department of Biostatistics, Data Science, and Epidemiology, Georgia Cancer Center at Augusta University, Augusta, GA, USA
| |
|
3
|
Chaunzwa TL, del Rey MQ, Bitterman DS. Clinical Informatics Approaches to Understand and Address Cancer Disparities. Yearb Med Inform 2022; 31:121-130. [PMID: 36463869 PMCID: PMC9719762 DOI: 10.1055/s-0042-1742511] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/07/2022] Open
Abstract
OBJECTIVES Disparities in cancer incidence and outcomes across race, ethnicity, gender, socioeconomic status, and geography are well-documented, but their etiologies are often poorly understood and multifactorial. Clinical informatics can provide tools to better understand and address these disparities by enabling high-throughput analysis of multiple types of data. Here, we review recent efforts in clinical informatics to study and measure disparities in cancer. METHODS We carried out a narrative review of clinical informatics studies related to cancer disparities and bias published from 2018-2021, with a focus on domains such as real-world data (RWD) analysis, natural language processing (NLP), radiomics, genomics, proteomics, metabolomics, and metagenomics. RESULTS Clinical informatics studies that investigated cancer disparities across race, ethnicity, gender, and age were identified. Most cancer disparities work within clinical informatics used RWD analysis, NLP, radiomics, and genomics. Emerging applications of clinical informatics to understand cancer disparities, including proteomics, metabolomics, and metagenomics, were less well represented in the literature but are promising future research avenues. Algorithmic bias was identified as an important consideration when developing and implementing cancer clinical informatics techniques, and efforts to address this bias were reviewed. CONCLUSIONS In recent years, clinical informatics has been used to probe a range of data sources to understand cancer disparities across different populations. As informatics tools become integrated into clinical decision-making, attention will need to be paid to ensure that algorithmic bias does not amplify existing disparities. In our increasingly interconnected medical systems, clinical informatics is poised to untap the full potential of multi-platform health data to address cancer disparities.
Affiliation(s)
- Tafadzwa L. Chaunzwa
- Department of Radiation Oncology, Dana-Farber Brigham Cancer Center, Harvard Medical School, Boston, MA, USA; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA
| | - Maria Quiles del Rey
- Department of Radiation Oncology, Dana-Farber Brigham Cancer Center, Harvard Medical School, Boston, MA, USA
| | - Danielle S. Bitterman
- Department of Radiation Oncology, Dana-Farber Brigham Cancer Center, Harvard Medical School, Boston, MA, USA; Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA. Correspondence to: Dr. Danielle S. Bitterman, Department of Radiation Oncology, Dana-Farber Cancer Institute/Brigham and Women's Hospital, 75 Francis Street, Boston, MA 02115, USA; +1 857 215 1489; +1 617 975 0985
| |
|
4
|
Zohdi H, Natale L, Scholkmann F, Wolf U. Intersubject Variability in Cerebrovascular Hemodynamics and Systemic Physiology during a Verbal Fluency Task under Colored Light Exposure: Clustering of Subjects by Unsupervised Machine Learning. Brain Sci 2022; 12:1449. [PMID: 36358375 PMCID: PMC9688708 DOI: 10.3390/brainsci12111449] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/12/2022] [Revised: 10/19/2022] [Accepted: 10/21/2022] [Indexed: 10/18/2023] Open
Abstract
There is large intersubject variability in cerebrovascular hemodynamic and systemic physiological responses induced by a verbal fluency task (VFT) under colored light exposure (CLE). We hypothesized that machine learning would enable us to classify the response patterns and provide new insights into the common response patterns between subjects. In total, 32 healthy subjects (15 men and 17 women, age: 25.5 ± 4.3 years) were exposed to two different light colors (red vs. blue) in a randomized cross-over study design for 9 min while performing a VFT. We used the systemic physiology augmented functional near-infrared spectroscopy (SPA-fNIRS) approach to measure cerebrovascular hemodynamics and oxygenation at the prefrontal cortex (PFC) and visual cortex (VC) concurrently with systemic physiological parameters. We found that subjects were suitably classified by unsupervised machine learning into different groups according to the changes in the following parameters: end-tidal carbon dioxide, arterial oxygen saturation, skin conductance, oxygenated hemoglobin in the VC, and deoxygenated hemoglobin in the PFC. With hard clustering methods, three and five different groups of subjects were found for the blue and red light exposure, respectively. Our results highlight the fact that humans show specific reactivity types to the CLE-VFT experimental paradigm.
Affiliation(s)
- Hamoon Zohdi
- Institute of Complementary and Integrative Medicine, University of Bern, 3012 Bern, Switzerland
| | - Luciano Natale
- Institute of Complementary and Integrative Medicine, University of Bern, 3012 Bern, Switzerland
| | - Felix Scholkmann
- Institute of Complementary and Integrative Medicine, University of Bern, 3012 Bern, Switzerland
- Biomedical Optics Research Laboratory, Neonatology Research, Department of Neonatology, University Hospital Zurich, University of Zurich, 8091 Zurich, Switzerland
| | - Ursula Wolf
- Institute of Complementary and Integrative Medicine, University of Bern, 3012 Bern, Switzerland
| |
|
5
|
Shanbehzadeh M, Afrash MR, Mirani N, Kazemi-Arpanahi H. Comparing machine learning algorithms to predict 5-year survival in patients with chronic myeloid leukemia. BMC Med Inform Decis Mak 2022; 22:236. [PMID: 36068539 PMCID: PMC9450320 DOI: 10.1186/s12911-022-01980-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Accepted: 08/30/2022] [Indexed: 12/03/2022] Open
Abstract
Introduction Chronic myeloid leukemia (CML) is a myeloproliferative disorder resulting from the translocation of chromosomes 9 and 22. CML includes 15–20% of all cases of leukemia. Although bone marrow transplant and, more recently, tyrosine kinase inhibitors (TKIs) as a first-line treatment have significantly prolonged survival in CML patients, accurate prediction using available patient-level factors can be challenging. We intended to predict 5-year survival among CML patients via eight machine learning (ML) algorithms and compare their performance.
Methods The data of 837 CML patients were retrospectively extracted and randomly split into training and test segments (70:30 ratio). The outcome variable was 5-year survival with potential values of alive or deceased. The dataset for the full features and important features selected by minimal redundancy maximal relevance (mRMR) feature selection were fed into eight ML techniques, including eXtreme gradient boosting (XGBoost), multilayer perceptron (MLP), pattern recognition network, k-nearest neighborhood (KNN), probabilistic neural network, support vector machine (SVM) (kernel = linear), SVM (kernel = RBF), and J-48. The scikit-learn library in Python was used to implement the models. Finally, the performance of the developed models was measured using some evaluation criteria with 95% confidence intervals (CI).
Results Spleen palpable, age, and unexplained hemorrhage were identified as the top three effective features affecting CML 5-year survival. The performance of ML models using the selected-features was superior to that of the full-features dataset. Among the eight ML algorithms, SVM (kernel = RBF) had the best performance in tenfold cross-validation with an accuracy of 85.7%, specificity of 85%, sensitivity of 86%, F-measure of 87%, kappa statistic of 86.1%, and area under the curve (AUC) of 85% for the selected-features. Using the full-features dataset yielded an accuracy of 69.7%, specificity of 69.1%, sensitivity of 71.3%, F-measure of 72%, kappa statistic of 75.2%, and AUC of 70.1%.
Conclusions Accurate prediction of the survival likelihood of CML patients can inform caregivers to promote patient prognostication and choose the best possible treatment path. While external validation is required, our developed models will offer customized treatment and may guide the prescription of personalized medicine for CML patients.
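For readers unfamiliar with the evaluation criteria named above, the following sketch shows how accuracy, sensitivity, specificity, F-measure, and Cohen's kappa are commonly derived from a binary confusion matrix. It is a generic illustration and not the authors' code; the function name binary_metrics and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Common evaluation criteria computed from binary confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: agreement beyond what the marginal totals predict by chance.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts chosen only to make the arithmetic easy to follow.
for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```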
Affiliation(s)
- Mostafa Shanbehzadeh
- Department of Health Information Technology, Faculty of Paramedical, Ilam University of Medical Sciences, Ilam, Iran
| | - Mohammad Reza Afrash
- Department of Health Information Technology and Management, School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| | - Nader Mirani
- Department of Treatment, Head of the Medical Tourism, Zanjan University of Medical Sciences, Zanjan, Iran
| | - Hadi Kazemi-Arpanahi
- Department of Health Information Technology, Abadan University of Medical Sciences, Abadan, Iran; Department of Student Research Committee, Abadan University of Medical Sciences, Abadan, Iran
| |
|
6
|
El Alaoui Y, Elomri A, Qaraqe M, Padmanabhan R, Yasin Taha R, El Omri H, El Omri A, Aboumarzouk O. A Review of Artificial Intelligence Applications in Hematology Management: Current Practices and Future Prospects. J Med Internet Res 2022; 24:e36490. [PMID: 35819826 PMCID: PMC9328784 DOI: 10.2196/36490] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/16/2022] [Revised: 05/14/2022] [Accepted: 05/29/2022] [Indexed: 12/23/2022] Open
Abstract
Background Machine learning (ML) and deep learning (DL) methods have recently garnered a great deal of attention in the field of cancer research by making a noticeable contribution to the growth of predictive medicine and modern oncological practices. Considerable focus has been particularly directed toward hematologic malignancies because of the complexity in detecting early symptoms. Many patients with blood cancer do not get properly diagnosed until their cancer has reached an advanced stage with limited treatment prospects. Hence, the state-of-the-art revolves around the latest artificial intelligence (AI) applications in hematology management. Objective This comprehensive review provides an in-depth analysis of the current AI practices in the field of hematology. Our objective is to explore the ML and DL applications in blood cancer research, with a special focus on the type of hematologic malignancies and the patient’s cancer stage to determine future research directions in blood cancer. Methods We searched a set of recognized databases (Scopus, Springer, and Web of Science) using a selected number of keywords. We included studies written in English and published between 2015 and 2021. For each study, we identified the ML and DL techniques used and highlighted the performance of each model. Results Using the aforementioned inclusion criteria, the search resulted in 567 papers, of which 144 were selected for review. Conclusions The current literature suggests that the application of AI in the field of hematology has generated impressive results in the screening, diagnosis, and treatment stages. Nevertheless, optimizing the patient’s pathway to treatment requires a prior prediction of the malignancy based on the patient’s symptoms or blood records, which is an area that has still not been properly investigated.
Affiliation(s)
- Yousra El Alaoui
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Adel Elomri
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Marwa Qaraqe
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Regina Padmanabhan
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
| | - Ruba Yasin Taha
- National Center for Cancer Care and Research, Hamad Medical Corporation, Doha, Qatar
| | - Halima El Omri
- National Center for Cancer Care and Research, Hamad Medical Corporation, Doha, Qatar
| | - Abdelfatteh El Omri
- Surgical Research Section, Department of Surgery, Hamad Medical Corporation, Doha, Qatar
| | - Omar Aboumarzouk
- Surgical Research Section, Department of Surgery, Hamad Medical Corporation, Doha, Qatar; College of Medicine, Qatar University, Doha, Qatar; College of Medicine, University of Glasgow, Glasgow, United Kingdom
| |
|
7
|
Liu J, Yuan R, Li Y, Zhou L, Zhang Z, Yang J, Xiao L. A deep learning method and device for bone marrow imaging cell detection. ANNALS OF TRANSLATIONAL MEDICINE 2022; 10:208. [PMID: 35280370 PMCID: PMC8908139 DOI: 10.21037/atm-22-486] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/05/2022] [Accepted: 02/18/2022] [Indexed: 11/06/2022]
Abstract
BACKGROUND Morphological analysis of bone marrow cells is considered the gold standard for the diagnosis of leukemia. However, due to the diverse morphology of bone marrow cells, extensive experience and patience are needed for morphological examination. An automatic diagnosis system based on the comprehensive application of image analysis and pattern recognition technology is urgently needed to reduce work intensity and error probability and to improve work efficiency. METHODS In this article, we establish a new morphological diagnosis system for bone marrow cell detection based on the deep learning object detection framework. The model is based on the Faster Region-Convolutional Neural Network (R-CNN), a classical object detection model. The system automatically detects bone marrow cells and determines their types. As specimens have severe long-tail distribution, i.e., the frequency of different types of cells varies dramatically, we proposed a general score ranking loss to solve such a problem. The general score ranking loss considers the ranking relationship between positive and negative samples and optimizes the positive sample with a higher classification probability value. RESULTS We verified this system with 70 bone marrow specimens of leukemia patients, which proved that it can realize intelligent recognition with high efficiency. The software is finally integrated into the microscope system to build an augmented reality system. CONCLUSIONS Clinical tests show that the response speed of the newly developed diagnostic system is faster than that of trained diagnostic experts.
Affiliation(s)
- Jie Liu
- Department of Laboratory, The Seventh Medical Center of Chinese PLA General Hospital, Beijing, China
| | - Ruize Yuan
- School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
| | - Yinhao Li
- School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
| | - Lin Zhou
- Department of Laboratory, The Seventh Medical Center of Chinese PLA General Hospital, Beijing, China
| | | | - Jidong Yang
- Hanyuan Pharmaceutical Co., Ltd., Beijing, China
| | - Li Xiao
- School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
- Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Ningbo Huamei Hospital, University of Chinese Academy of Sciences, Ningbo, China
| |
|
8
|
The importance of genomic predictors for clinical outcome of hematological malignancies. BLOOD SCIENCE 2021; 3:93-95. [PMID: 35402837 PMCID: PMC8974908 DOI: 10.1097/bs9.0000000000000075] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/21/2021] [Accepted: 04/21/2021] [Indexed: 12/17/2022] Open
|
9
|
Coombes CE, Liu X, Abrams ZB, Coombes KR, Brock G. Simulation-derived best practices for clustering clinical data. J Biomed Inform 2021; 118:103788. [PMID: 33862229 DOI: 10.1016/j.jbi.2021.103788] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/10/2020] [Revised: 03/23/2021] [Accepted: 04/11/2021] [Indexed: 11/18/2022]
Abstract
INTRODUCTION Clustering analyses in clinical contexts hold promise to improve the understanding of patient phenotype and disease course in chronic and acute clinical medicine. However, work remains to ensure that solutions are rigorous, valid, and reproducible. In this paper, we evaluate best practices for dissimilarity matrix calculation and clustering on mixed-type, clinical data. METHODS We simulate clinical data to represent problems in clinical trials, cohort studies, and EHR data, including single-type datasets (binary, continuous, categorical) and 4 data mixtures. We test 5 single distance metrics (Jaccard, Hamming, Gower, Manhattan, Euclidean) and 3 mixed distance metrics (DAISY, Supersom, and Mercator) with 3 clustering algorithms (hierarchical (HC), k-medoids, self-organizing maps (SOM)). We quantitatively and visually validate by Adjusted Rand Index (ARI) and silhouette width (SW). We applied our best methods to two real-world data sets: (1) 21 features collected on 247 patients with chronic lymphocytic leukemia, and (2) 40 features collected on 6000 patients admitted to an intensive care unit. RESULTS HC outperformed k-medoids and SOM by ARI across data types. DAISY produced the highest mean ARI for mixed data types for all mixtures except unbalanced mixtures dominated by continuous data. Compared to other methods, DAISY with HC uncovered superior, separable clusters in both real-world data sets. DISCUSSION Selecting an appropriate mixed-type metric allows the investigator to obtain optimal separation of patient clusters and get maximum use of their data. Superior metrics for mixed-type data handle multiple data types using multiple, type-focused distances. Better subclassification of disease opens avenues for targeted treatments, precision medicine, clinical decision support, and improved patient outcomes.
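The Adjusted Rand Index used for validation above compares a clustering against known labels while correcting for chance agreement. A minimal sketch of the standard pair-counting formula follows; it is an illustration rather than the authors' code, the function name adjusted_rand_index is our own, and returning 1.0 for degenerate inputs is an assumed convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(truth)
    if n < 2:
        return 1.0  # assumed convention: nothing to disagree about
    # Pairs of items that share a cell, a true class, or a predicted cluster.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(truth, pred)).values())
    sum_truth = sum(comb(v, 2) for v in Counter(truth).values())
    sum_pred = sum(comb(v, 2) for v in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_truth + sum_pred) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial (one cluster or all singletons)
    return (sum_cells - expected) / (maximum - expected)


# Cluster names do not matter, only the grouping.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# A grouping that splits every true class evenly scores below zero.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

An ARI of 1 indicates identical partitions, values near 0 indicate chance-level agreement, and negative values indicate less agreement than expected by chance.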
Affiliation(s)
- Caitlin E Coombes
- The Ohio State University College of Medicine, 370 W 9th Ave, Columbus, OH 43210, USA.
| | - Xin Liu
- Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Dr, Columbus, OH 43210, USA.
| | - Zachary B Abrams
- Institute for Informatics, Washington University in St. Louis, 444 Forest Park Ave., St. Louis, MO 63108, USA.
| | - Kevin R Coombes
- Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Dr, Columbus, OH 43210, USA.
| | - Guy Brock
- Department of Biomedical Informatics, The Ohio State University, 1800 Cannon Dr, Columbus, OH 43210, USA.
| |
|
10
|
Coombes CE, Coombes KR, Fareed N. A novel model to label delirium in an intensive care unit from clinician actions. BMC Med Inform Decis Mak 2021; 21:97. [PMID: 33750375 PMCID: PMC7941123 DOI: 10.1186/s12911-021-01461-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/02/2020] [Accepted: 03/02/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the intensive care unit (ICU), delirium is a common, acute, confusional state associated with high risk for short- and long-term morbidity and mortality. Machine learning (ML) has promise to address research priorities and improve delirium outcomes. However, due to clinical and billing conventions, delirium is often inconsistently or incompletely labeled in electronic health record (EHR) datasets. Here, we identify clinical actions abstracted from clinical guidelines in EHR data that indicate risk of delirium among ICU patients. We develop a novel prediction model to label patients with delirium based on a large data set and assess model performance. METHODS EHR data on 48,451 admissions from 2001 to 2012, available through Medical Information Mart for Intensive Care-III database (MIMIC-III), was used to identify features to develop our prediction models. Five binary ML classification models (Logistic Regression; Classification and Regression Trees; Random Forests; Naïve Bayes; and Support Vector Machines) were fit and ranked by Area Under the Curve (AUC) scores. We compared our best model with two models previously proposed in the literature for goodness of fit, precision, and through biological validation. RESULTS Our best performing model with threshold reclassification for predicting delirium was based on a multiple logistic regression using the 31 clinical actions (AUC 0.83). Our model outperformed other proposed models by biological validation on clinically meaningful, delirium-associated outcomes. CONCLUSIONS Hurdles in identifying accurate labels in large-scale datasets limit clinical applications of ML in delirium. We developed a novel labeling model for delirium in the ICU using a large, public data set. By using guideline-directed clinical actions independent from risk factors, treatments, and outcomes as model predictors, our classifier could be used as a delirium label for future clinically targeted models.
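Ranking models by AUC, as described above, amounts to asking how often a model scores a positive case above a negative one. The sketch below computes AUC through that pairwise interpretation; it is a generic illustration and not the study's code, the function name auc is our own, and the scores and labels are made-up values.

```python
def auc(scores, labels):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities and true labels (1 = positive case).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auc(scores, labels))
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; rank-based formulas give the same value more efficiently on large data sets.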
Affiliation(s)
- Caitlin E Coombes
- College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Kevin R Coombes
- Department of Biomedical Informatics, The Ohio State University College of Medicine, 460 Medical Center Dr., 512 Institute of Behavioral Medicine Research, Columbus, OH, 43210, USA
| | - Naleef Fareed
- Department of Biomedical Informatics, The Ohio State University College of Medicine, 460 Medical Center Dr., 512 Institute of Behavioral Medicine Research, Columbus, OH, 43210, USA.
- Center for the Advancement of Team Science, Analytics, and Systems Thinking, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA.
| |
|
11
|
Coombes CE, Abrams ZB, Nakayiza S, Brock G, Coombes KR. Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning. F1000Res 2021. [DOI: 10.12688/f1000research.25877.2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022] Open
Abstract
The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.
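The general workflow described above (known subgroups, additive noise, discretization to mixed types) can be pictured with a toy sketch. This is not the Umpire API and does not reproduce its models; the function simulate_mixed, its parameters, and the cut points are all hypothetical choices made for illustration.

```python
import random


def simulate_mixed(n_per_group=50, n_groups=3, noise_sd=0.5, seed=0):
    """Toy simulation: known subgroups, additive noise, mixed feature types."""
    rng = random.Random(seed)
    rows, truth = [], []
    for group in range(n_groups):
        center = 3.0 * group  # each subgroup gets its own feature mean
        for _ in range(n_per_group):
            # Three latent signals, each with additive measurement noise.
            x1, x2, x3 = (rng.gauss(center, 1.0) + rng.gauss(0.0, noise_sd)
                          for _ in range(3))
            rows.append({
                "continuous": x1,                      # kept as measured
                "binary": int(x2 > 3.0),               # dichotomized at a cut point
                "category": ("low" if x3 < 1.5 else    # binned into three levels
                             "mid" if x3 < 4.5 else "high"),
            })
            truth.append(group)  # ground-truth subgroup identity
    return rows, truth


rows, truth = simulate_mixed()
print(len(rows), sorted(set(truth)))
print(rows[0])
```

Because the subgroup labels are returned alongside the data, a clustering of the rows can be scored against a known ground truth, which is the purpose of simulation described in the abstract.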
|