1
Brosula R, Corbin CK, Chen JH. Pathophysiological Features in Electronic Medical Records Sustain Model Performance under Temporal Dataset Shift. AMIA Jt Summits Transl Sci Proc 2024;2024:95-104. PMID: 38827052; PMCID: PMC11141811.
Abstract
Access to real-world data streams like electronic medical records (EMRs) has accelerated the development of supervised machine learning (ML) models for clinical applications. However, few studies investigate the differential impact of particular features in the EMR on model performance under temporal dataset shift. To explain how features in the EMR impact models over time, this study aggregates features into feature groups by their source (e.g. medication orders, diagnosis codes and lab results) and feature categories based on their reflection of patient pathophysiology or healthcare processes. We adapt Shapley values to explain feature groups' and feature categories' marginal contribution to initial and sustained model performance. We investigate three standard clinical prediction tasks and find that while feature contributions to initial performance differ across tasks, pathophysiological features help mitigate temporal discrimination deterioration. These results provide interpretable insights on how specific feature groups contribute to model performance and robustness to temporal dataset shift.
Affiliation(s)
- Raphael Brosula
- Genomic Center for Infectious Diseases, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Conor K Corbin
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Jonathan H Chen
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
2
Davis SE, Embí PJ, Matheny ME. Sustainable deployment of clinical prediction tools-a 360° approach to model maintenance. J Am Med Inform Assoc 2024;31:1195-1198. PMID: 38422379; PMCID: PMC11031208; DOI: 10.1093/jamia/ocae036.
Abstract
BACKGROUND As the enthusiasm for integrating artificial intelligence (AI) into clinical care grows, so has our understanding of the challenges associated with deploying impactful and sustainable clinical AI models. Complex dataset shifts resulting from evolving clinical environments strain the longevity of AI models as predictive accuracy and associated utility deteriorate over time. OBJECTIVE Responsible practice thus necessitates the lifecycle of AI models be extended to include ongoing monitoring and maintenance strategies within health system algorithmovigilance programs. We describe a framework encompassing a 360° continuum of preventive, preemptive, responsive, and reactive approaches to address model monitoring and maintenance from critically different angles. DISCUSSION We describe the complementary advantages and limitations of these four approaches and highlight the importance of such a coordinated strategy to help ensure the promise of clinical AI is not short-lived.
Affiliation(s)
- Sharon E Davis
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Peter J Embí
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States
- Michael E Matheny
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Geriatric Research, Education, and Clinical Care, Tennessee Valley Healthcare System VA Medical Center, Veterans Health Administration, Nashville, TN 37212, United States
3
Ahmad FS, Hu TL, Adler ED, Petito LC, Wehbe RM, Wilcox JE, Mutharasan RK, Nardone B, Tadel M, Greenberg B, Yagil A, Campagnari C. Performance of risk models to predict mortality risk for patients with heart failure: evaluation in an integrated health system. Clin Res Cardiol 2024. PMID: 38565710; DOI: 10.1007/s00392-024-02433-2.
Abstract
BACKGROUND Referral of patients with heart failure (HF) who are at high mortality risk for specialist evaluation is recommended. Yet, most tools for identifying such patients are difficult to implement in electronic health record (EHR) systems. OBJECTIVE To assess the performance and ease of implementation of Machine learning Assessment of RisK and EaRly mortality in Heart Failure (MARKER-HF), a machine-learning model that uses structured data readily available in the EHR, and to compare it with two commonly used risk scores: the Seattle Heart Failure Model (SHFM) and the Meta-Analysis Global Group in Chronic (MAGGIC) Heart Failure Risk Score. DESIGN Retrospective cohort study. PARTICIPANTS Data from 6764 adults with HF were abstracted from EHRs at a large integrated health system from 1/1/10 to 12/31/19. MAIN MEASURES One-year survival from the time of the first cardiology or primary care visit was estimated using MARKER-HF, SHFM, and MAGGIC. Discrimination was measured by the area under the receiver operating characteristic curve (AUC); calibration was assessed graphically. KEY RESULTS Compared to MARKER-HF, both SHFM and MAGGIC required considerably more data engineering and imputation to generate risk score estimates. MARKER-HF, SHFM, and MAGGIC exhibited similar discrimination, with AUCs of 0.70 (95% CI 0.69-0.73), 0.71 (0.69-0.72), and 0.71 (0.70-0.73), respectively. All three scores showed good calibration across the full risk spectrum. CONCLUSIONS These findings suggest that MARKER-HF, which uses clinical and lab measurements readily available in the EHR and required less imputation and data engineering than SHFM or MAGGIC, is an easier tool for identifying high-risk patients in ambulatory clinics who could benefit from referral to an HF specialist.
Affiliation(s)
- Faraz S Ahmad
- Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, 676 North Saint Clair Street, Suite 600, Chicago, IL, 60611, USA.
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL, USA.
- Institute for Augmented Intelligence in Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.
- Ted Ling Hu
- Institute for Augmented Intelligence in Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Eric D Adler
- Division of Cardiology, Department of Medicine, UC San Diego School of Medicine, La Jolla, CA, USA
- Lucia C Petito
- Division of Biostatistics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Ramsey M Wehbe
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL, USA
- Division of Cardiology, Department of Medicine, Medical University of South Carolina, Charleston, SC, USA
- Jane E Wilcox
- Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, 676 North Saint Clair Street, Suite 600, Chicago, IL, 60611, USA
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL, USA
- R Kannan Mutharasan
- Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, 676 North Saint Clair Street, Suite 600, Chicago, IL, 60611, USA
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL, USA
- Beatrice Nardone
- Institute for Augmented Intelligence in Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Division of General Internal Medicine, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Matevz Tadel
- Physics Department, UC San Diego, La Jolla, CA, USA
- Barry Greenberg
- Division of Cardiology, Department of Medicine, UC San Diego School of Medicine, La Jolla, CA, USA
- Avi Yagil
- Physics Department, UC San Diego, La Jolla, CA, USA
4
Andersen ES, Birk-Korch JB, Röttger R, Brasen CL, Brandslund I, Madsen JS. Monitoring performance of clinical artificial intelligence: a scoping review protocol. JBI Evid Synth 2024;22:453-460. PMID: 38328955; DOI: 10.11124/jbies-23-00390.
Abstract
OBJECTIVE The objective of this scoping review is to describe the scope and nature of research on the monitoring of clinical artificial intelligence (AI) systems. The review will identify the various methodologies used to monitor clinical AI, while also mapping the factors that influence the selection of monitoring approaches. INTRODUCTION AI is being used in clinical decision-making at an increasing rate. While much attention has been directed toward the development and validation of AI for clinical applications, the practical implementation aspects, notably the establishment of rational monitoring/quality assurance systems, have received comparatively limited scientific interest. Given the scarcity of evidence and the heterogeneity of methodologies used in this domain, there is a compelling rationale for conducting a scoping review on this subject. INCLUSION CRITERIA This scoping review will include any publications that describe systematic, continuous, or repeated initiatives that evaluate or predict the clinical performance of AI models with direct implications for the management of patients in any segment of the health care system. METHODS Publications will be identified through searches of the MEDLINE (Ovid), Embase (Ovid), and Scopus databases. Additionally, backward and forward citation searches, as well as a thorough investigation of gray literature, will be conducted. Title and abstract screening, full-text evaluation, and data extraction will be performed by 2 or more independent reviewers. Data will be extracted using a tool developed by the authors. The results will be presented graphically and narratively. REVIEW REGISTRATION Open Science Framework https://osf.io/afkrn.
Affiliation(s)
- Eline Sandvig Andersen
- Department of Biochemistry and Immunology, Lillebaelt Hospital, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Vejle, Denmark
- Johan Baden Birk-Korch
- Department of Biochemistry and Immunology, Lillebaelt Hospital, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Vejle, Denmark
- Richard Röttger
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Claus Lohman Brasen
- Department of Biochemistry and Immunology, Lillebaelt Hospital, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Vejle, Denmark
- Ivan Brandslund
- Department of Biochemistry and Immunology, Lillebaelt Hospital, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Vejle, Denmark
- Jonna Skov Madsen
- Department of Biochemistry and Immunology, Lillebaelt Hospital, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Vejle, Denmark
5
Guo LL, Morse KE, Aftandilian C, Steinberg E, Fries J, Posada J, Fleming SL, Lemmon J, Jessa K, Shah N, Sung L. Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare. BMC Med Inform Decis Mak 2024;24:51. PMID: 38355486; PMCID: PMC10868117; DOI: 10.1186/s12911-024-02449-8.
Abstract
BACKGROUND Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for 7 selected outcomes at three institutions. Secondary objectives were to describe the agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. METHODS This study included three cohorts: SickKids from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia, and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate, and severe) based on the test result and one diagnosis-based label. The proportion of admissions with a positive label was presented for each outcome, stratified by cohort. Using lab-based labels as the gold standard, agreement (Cohen's Kappa), sensitivity, and specificity were calculated for each lab-based severity level. RESULTS The number of admissions included were: SickKids (n = 59,298), StanfordPeds (n = 24,639), and StanfordAdults (n = 159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with odds ratios (99.9% confidence intervals) for an abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more similar by institution. When using lab-based labels as the gold standard, Cohen's Kappa and sensitivity were lower at SickKids for all severity levels compared to StanfordPeds.
CONCLUSIONS Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.
Affiliation(s)
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Keith E Morse
- Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, USA
- Catherine Aftandilian
- Division of Hematology/Oncology, Department of Pediatrics, Stanford University, Palo Alto, CA, USA
- Ethan Steinberg
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Jason Fries
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Jose Posada
- Universidad del Norte, Barranquilla, Colombia
- Scott Lanyon Fleming
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Joshua Lemmon
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Karim Jessa
- Information Services, The Hospital for Sick Children, Toronto, ON, Canada
- Nigam Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada.
- Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, M5G1X8, Toronto, ON, Canada.
6
Bhaskhar N, Ip W, Chen JH, Rubin DL. Clinical outcome prediction using observational supervision with electronic health records and audit logs. J Biomed Inform 2023;147:104522. PMID: 37827476; DOI: 10.1016/j.jbi.2023.104522.
Abstract
OBJECTIVE Audit logs in electronic health record (EHR) systems capture interactions of providers with clinical data. We determine if machine learning (ML) models trained using audit logs in conjunction with clinical data ("observational supervision") outperform ML models trained using clinical data alone in clinical outcome prediction tasks, and whether they are more robust to temporal distribution shifts in the data. MATERIALS AND METHODS Using clinical and audit log data from Stanford Healthcare, we trained and evaluated various ML models, including logistic regression, support vector machine (SVM) classifiers, neural networks, random forests, and gradient boosted machines (GBMs), on clinical EHR data, with and without audit logs, for two clinical outcome prediction tasks: major adverse kidney events within 120 days of ICU admission (MAKE-120) in acute kidney injury (AKI) patients and 30-day readmission in acute stroke patients. We further tested the best-performing models using patient data acquired during different time intervals to evaluate the impact of temporal distribution shifts on model performance. RESULTS Performance generally improved for all models when trained with clinical EHR data and audit log data compared with those trained with only clinical EHR data, with GBMs tending to have the overall best performance. GBMs trained with clinical EHR data and audit logs outperformed GBMs trained without audit logs in both clinical outcome prediction tasks: AUROC 0.88 (95% CI: 0.85-0.91) vs. 0.79 (95% CI: 0.77-0.81) for MAKE-120 prediction in AKI patients, and AUROC 0.74 (95% CI: 0.71-0.77) vs. 0.63 (95% CI: 0.62-0.64) for 30-day readmission prediction in acute stroke patients. The performance of GBM models trained using audit log and clinical data degraded less in later time intervals than that of models trained using only clinical data.
CONCLUSION Observational supervision with audit logs improved the performance of ML models trained to predict important clinical outcomes in patients with AKI and acute stroke, and improved robustness to temporal distribution shifts.
Affiliation(s)
- Nandita Bhaskhar
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
- Wui Ip
- Department of Pediatrics, Stanford School of Medicine, Palo Alto, CA 94305, USA
- Jonathan H Chen
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA; Division of Hospital Medicine, Stanford School of Medicine, Palo Alto, CA 94305, USA; Clinical Excellence Research Center, Stanford School of Medicine, Palo Alto, CA 94305, USA
- Daniel L Rubin
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA; Department of Radiology, Stanford University, Stanford, CA 94305, USA; Department of Medicine, Stanford School of Medicine, Palo Alto, CA 94305, USA
7
Zeng Z, Wang L, Wu Y, Hu Z, Evans J, Zhu X, Ye G, He S. Utilizing Mixed Training and Multi-Head Attention to Address Data Shift in AI-Based Electromagnetic Solvers for Nano-Structured Metamaterials. Nanomaterials (Basel) 2023;13:2778. PMID: 37887929; PMCID: PMC10609168; DOI: 10.3390/nano13202778.
Abstract
When designing nano-structured metamaterials with an iterative optimization method, a fast deep learning solver is desirable to replace a time-consuming numerical solver, and the related issue of data shift is a subtle yet easily overlooked challenge. In this work, we explore the data shift challenge in an AI-based electromagnetic solver and present innovative solutions. Using a one-dimensional grating coupler as a case study, we demonstrate the presence of data shift through the probability density method and principal component analysis, and show the degradation of neural network performance through experiments dealing with data affected by data shift. We propose three effective strategies to mitigate the effects of data shift: mixed training, adding multi-head attention, and a comprehensive approach that combines both. The experimental results validate the efficacy of these approaches in addressing data shift. Specifically, the combination of mixed training and multi-head attention significantly reduces the mean absolute error, by approximately 36%, when applied to data affected by data shift. Our work provides crucial insights and guidance for AI-based electromagnetic solvers in the optimal design of nano-structured metamaterials.
Affiliation(s)
- Zhenjia Zeng
- National Engineering Research Center for Optical Instruments, Centre for Optical and Electromagnetic Research, Zhejiang University, Hangzhou 310058, China
- Lei Wang
- National Engineering Research Center for Optical Instruments, Centre for Optical and Electromagnetic Research, Zhejiang University, Hangzhou 310058, China
- Yiran Wu
- National Engineering Research Center for Optical Instruments, Centre for Optical and Electromagnetic Research, Zhejiang University, Hangzhou 310058, China
- Zhipeng Hu
- National Engineering Research Center for Optical Instruments, Centre for Optical and Electromagnetic Research, Zhejiang University, Hangzhou 310058, China
- Julian Evans
- National Engineering Research Center for Optical Instruments, Centre for Optical and Electromagnetic Research, Zhejiang University, Hangzhou 310058, China
- Xinhua Zhu
- Shanghai Institute for Advanced Study, Zhejiang University, Shanghai 201203, China
- Gaoao Ye
- Taizhou Research Institute, Zhejiang University, Taizhou 317700, China
- Sailing He
- National Engineering Research Center for Optical Instruments, Centre for Optical and Electromagnetic Research, Zhejiang University, Hangzhou 310058, China
- Taizhou Research Institute, Zhejiang University, Taizhou 317700, China
- Department of Electrical Engineering, Royal Institute of Technology, 100 44 Stockholm, Sweden
8
Sahiner B, Chen W, Samala RK, Petrick N. Data drift in medical machine learning: implications and potential remedies. Br J Radiol 2023;96:20220878. PMID: 36971405; PMCID: PMC10546450; DOI: 10.1259/bjr.20220878.
Abstract
Data drift refers to differences between the data used in training a machine learning (ML) model and that applied to the model in real-world operation. Medical ML systems can be exposed to various forms of data drift, including differences between the data sampled for training and used in clinical operation, differences between medical practices or context of use between training and clinical use, and time-related changes in patient populations, disease patterns, and data acquisition, to name a few. In this article, we first review the terminology used in ML literature related to data drift, define distinct types of drift, and discuss in detail potential causes within the context of medical applications with an emphasis on medical imaging. We then review the recent literature regarding the effects of data drift on medical ML systems, which overwhelmingly show that data drift can be a major cause for performance deterioration. We then discuss methods for monitoring data drift and mitigating its effects with an emphasis on pre- and post-deployment techniques. Some of the potential methods for drift detection and issues around model retraining when drift is detected are included. Based on our review, we find that data drift is a major concern in medical ML deployment and that more research is needed so that ML models can identify drift early, incorporate effective mitigation strategies and resist performance decay.
Affiliation(s)
- Berkman Sahiner
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002
- Weijie Chen
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002
- Ravi K. Samala
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002
- Nicholas Petrick
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993-0002
9
Oikonomou EK, Khera R. Machine learning in precision diabetes care and cardiovascular risk prediction. Cardiovasc Diabetol 2023;22:259. PMID: 37749579; PMCID: PMC10521578; DOI: 10.1186/s12933-023-01985-3.
Abstract
Artificial intelligence and machine learning are driving a paradigm shift in medicine, promising data-driven, personalized solutions for managing diabetes and the excess cardiovascular risk it poses. In this comprehensive review of machine learning applications in the care of patients with diabetes at increased cardiovascular risk, we offer a broad overview of various data-driven methods and how they may be leveraged in developing predictive models for personalized care. We review existing as well as expected artificial intelligence solutions in the context of diagnosis, prognostication, phenotyping, and treatment of diabetes and its cardiovascular complications. In addition to discussing the key properties of such models that enable their successful application in complex risk prediction, we define challenges that arise from their misuse and the role of methodological standards in overcoming these limitations. We also identify key issues in equity and bias mitigation in healthcare and discuss how the current regulatory framework should ensure the efficacy and safety of medical artificial intelligence products in transforming cardiovascular care and outcomes in diabetes.
Affiliation(s)
- Evangelos K Oikonomou
- Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
- Rohan Khera
- Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA.
- Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.
- Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA.
- Center for Outcomes Research and Evaluation, Yale-New Haven Hospital, 195 Church St, 6th floor, New Haven, CT, 06510, USA.
10
Ekemeyong Awong LE, Zielinska T. Comparative Analysis of the Clustering Quality in Self-Organizing Maps for Human Posture Classification. Sensors (Basel) 2023;23:7925. PMID: 37765983; PMCID: PMC10538130; DOI: 10.3390/s23187925.
Abstract
The objective of this article is to develop a methodology for selecting the appropriate number of clusters to group and identify human postures using neural networks with unsupervised self-organizing maps. Although unsupervised clustering algorithms have proven effective in recognizing human postures, many works are limited to testing which data are correctly or incorrectly recognized. They often neglect the task of selecting the appropriate number of groups (where the number of clusters corresponds to the number of output neurons, i.e., the number of postures) using clustering quality assessments. The use of quality scores to determine the number of clusters frees the expert from making subjective decisions about the number of postures, enabling the use of unsupervised learning. Due to high dimensionality and data variability, expert decisions (referred to as data labeling) can be difficult and time-consuming; in our case, there is no manual labeling step. We introduce a new clustering quality score: the discriminant score (DS). We describe the process of selecting the most suitable number of postures using human activity records captured by RGB-D cameras. Comparative studies on the usefulness of popular clustering quality scores (the silhouette coefficient, Dunn index, Calinski-Harabasz index, Davies-Bouldin index, and DS) for posture classification tasks are presented, along with graphical illustrations of the results produced by DS. The findings show that DS offers good quality in posture recognition, effectively following postural transitions and similarities.
Affiliation(s)
- Lisiane Esther Ekemeyong Awong
- Faculty of Power and Aeronautical Engineering, Division of Theory of Machines and Robots, Warsaw University of Technology, 00-665 Warszawa, Poland
- Teresa Zielinska
- Faculty of Power and Aeronautical Engineering, Division of Theory of Machines and Robots, Warsaw University of Technology, 00-665 Warszawa, Poland
11
Corbin CK, Maclay R, Acharya A, Mony S, Punnathanam S, Thapa R, Kotecha N, Shah NH, Chen JH. DEPLOYR: a technical framework for deploying custom real-time machine learning models into the electronic medical record. J Am Med Inform Assoc 2023;30:1532-1542. PMID: 37369008; PMCID: PMC10436147; DOI: 10.1093/jamia/ocad114.
Abstract
OBJECTIVE Healthcare institutions are establishing frameworks to govern and promote the implementation of accurate, actionable, and reliable machine learning models that integrate with clinical workflows. Such governance frameworks require an accompanying technical framework to deploy models in a resource-efficient, safe, and high-quality manner. Here we present DEPLOYR, a technical framework for enabling real-time deployment and monitoring of researcher-created models into a widely used electronic medical record system. MATERIALS AND METHODS We discuss core functionality and design decisions, including mechanisms to trigger inference based on actions within electronic medical record software, modules that collect real-time data to make inferences, mechanisms that close the loop by displaying inferences back to end-users within their workflow, monitoring modules that track performance of deployed models over time, silent deployment capabilities, and mechanisms to prospectively evaluate a deployed model's impact. RESULTS We demonstrate the use of DEPLOYR by silently deploying and prospectively evaluating 12 machine learning models trained using electronic medical record data that predict laboratory diagnostic results, triggered by clinician button-clicks in Stanford Health Care's electronic medical record. DISCUSSION Our study highlights the need and feasibility for such silent deployment, because prospectively measured performance varies from retrospective estimates. When possible, we recommend using prospectively estimated performance measures during silent trials to make final go decisions for model deployment. CONCLUSION Machine learning applications in healthcare are extensively researched, but successful translations to the bedside are rare. By describing DEPLOYR, we aim to inform machine learning deployment best practices and help bridge the model implementation gap.
Collapse
Affiliation(s)
- Conor K Corbin
- Department of Biomedical Data Science, Stanford, California, USA
| | - Rob Maclay
- Stanford Children’s Health, Palo Alto, California, USA
| | | | | | | | - Rahul Thapa
- Stanford Health Care, Palo Alto, California, USA
| | | | - Nigam H Shah
- Center for Biomedical Informatics Research, Division of Hospital Medicine, Department of Medicine, Stanford University, School of Medicine, Stanford, California, USA
| | - Jonathan H Chen
- Center for Biomedical Informatics Research, Division of Hospital Medicine, Department of Medicine, Stanford University, School of Medicine, Stanford, California, USA
| |
Collapse
|
12
|
Chen RJ, Wang JJ, Williamson DFK, Chen TY, Lipkova J, Lu MY, Sahai S, Mahmood F. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng 2023; 7:719-742. [PMID: 37380750 PMCID: PMC10632090 DOI: 10.1038/s41551-023-01056-8] [Citation(s) in RCA: 30] [Impact Index Per Article: 30.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2021] [Accepted: 04/13/2023] [Indexed: 06/30/2023]
Abstract
In healthcare, the development and deployment of insufficiently fair systems of artificial intelligence (AI) can undermine the delivery of equitable care. Assessments of AI models stratified across subpopulations have revealed inequalities in how patients are diagnosed, treated and billed. In this Perspective, we outline fairness in machine learning through the lens of healthcare, and discuss how algorithmic biases (in data acquisition, genetic variation and intra-observer labelling variability, in particular) arise in clinical workflows and the resulting healthcare disparities. We also review emerging technology for mitigating biases via disentanglement, federated learning and model explainability, and their role in the development of AI-based software as a medical device.
Collapse
Affiliation(s)
- Richard J Chen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Judy J Wang
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Boston University School of Medicine, Boston, MA, USA
| | - Drew F K Williamson
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Tiffany Y Chen
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Jana Lipkova
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Ming Y Lu
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
- Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Sharifa Sahai
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Faisal Mahmood
- Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA.
- Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA.
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
- Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA.
| |
Collapse
|
13
|
Yamga E, Mullie L, Durand M, Cadrin-Chenevert A, Tang A, Montagnon E, Chartrand-Lefebvre C, Chassé M. Interpretable clinical phenotypes among patients hospitalized with COVID-19 using cluster analysis. Front Digit Health 2023; 5:1142822. [PMID: 37114183 PMCID: PMC10128042 DOI: 10.3389/fdgth.2023.1142822] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2023] [Accepted: 03/13/2023] [Indexed: 04/29/2023] Open
Abstract
Background Multiple clinical phenotypes have been proposed for coronavirus disease (COVID-19), but few have used multimodal data. Using clinical and imaging data, we aimed to identify distinct clinical phenotypes in patients admitted with COVID-19 and to assess their clinical outcomes. Our secondary objective was to demonstrate the clinical applicability of this method by developing an interpretable model for phenotype assignment. Methods We analyzed data from 547 patients hospitalized with COVID-19 at a Canadian academic hospital. We processed the data by applying a factor analysis of mixed data (FAMD) and compared four clustering algorithms: k-means, partitioning around medoids (PAM), divisive hierarchical clustering, and agglomerative hierarchical clustering. We used imaging data and 34 clinical variables collected within the first 24 h of admission to train our algorithm. We conducted a survival analysis to compare the clinical outcomes across phenotypes. With the data split into training and validation sets (75/25 ratio), we developed a decision-tree-based model to facilitate the interpretation and assignment of the observed phenotypes. Results Agglomerative hierarchical clustering was the most robust algorithm. We identified three clinical phenotypes: 79 patients (14%) in Cluster 1, 275 patients (50%) in Cluster 2, and 203 patients (37%) in Cluster 3. Cluster 2 and Cluster 3 were both characterized by a low-risk respiratory and inflammatory profile but differed in terms of demographics. Compared with Cluster 3, Cluster 2 comprised older patients with more comorbidities. Cluster 1 represented the group with the most severe clinical presentation, as inferred by the highest rate of hypoxemia and the highest radiological burden. Intensive care unit (ICU) admission and mechanical ventilation risks were the highest in Cluster 1.
Using only two to four decision rules, the classification and regression tree (CART) phenotype assignment model achieved an AUC of 84% (95% CI: 81.5%-86.5%) on the validation set. Conclusions We conducted a multidimensional phenotypic analysis of adult inpatients with COVID-19 and identified three distinct phenotypes associated with different clinical outcomes. We also demonstrated the clinical usability of this approach, as phenotypes can be accurately assigned using a simple decision tree. Further research is still needed to properly incorporate these phenotypes in the management of patients with COVID-19.
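The phenotyping pipeline above (dimensionality reduction, agglomerative hierarchical clustering, then a shallow CART for interpretable phenotype assignment) can be sketched as follows. Synthetic blobs stand in for the clinical and imaging variables, and PCA substitutes for FAMD since the toy features are purely numeric:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 34 admission variables with 3 latent phenotypes.
X, _ = make_blobs(n_samples=547, centers=3, n_features=34, cluster_std=2.0, random_state=42)

# FAMD reduces mixed categorical/numeric variables; with purely numeric toy
# data, PCA plays the same role here.
X_reduced = PCA(n_components=5, random_state=42).fit_transform(X)

# Agglomerative hierarchical clustering discovers the phenotypes.
phenotype = AgglomerativeClustering(n_clusters=3).fit_predict(X_reduced)

# Interpretable assignment model: a shallow CART trained on the raw variables
# with a 75/25 train/validation split, mirroring the study design.
X_tr, X_val, y_tr, y_val = train_test_split(X, phenotype, test_size=0.25, random_state=42)
cart = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
accuracy = cart.score(X_val, y_val)
print(f"held-out phenotype-assignment accuracy: {accuracy:.2f}")
```

The shallow tree's rules can then be read off directly, which is what makes this kind of assignment model clinically usable.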
Collapse
Affiliation(s)
- Eric Yamga
- Department of Medicine, Centre Hospitalier de l’Université de Montréal (CHUM), Montréal, QC, Canada
| | - Louis Mullie
- Department of Medicine, Centre Hospitalier de l’Université de Montréal (CHUM), Montréal, QC, Canada
| | - Madeleine Durand
- Department of Medicine, Centre Hospitalier de l’Université de Montréal (CHUM), Montréal, QC, Canada
- Centre de Recherche du Centre Hospitalier de l'Université de Montréal (CRCHUM), Montréal, QC, Canada
| | | | - An Tang
- Centre de Recherche du Centre Hospitalier de l'Université de Montréal (CRCHUM), Montréal, QC, Canada
- Department of Radiology and Nuclear Medicine, Centre Hospitalier de l’Université de Montréal (CHUM), Montréal, QC, Canada
| | - Emmanuel Montagnon
- Centre de Recherche du Centre Hospitalier de l'Université de Montréal (CRCHUM), Montréal, QC, Canada
| | - Carl Chartrand-Lefebvre
- Centre de Recherche du Centre Hospitalier de l'Université de Montréal (CRCHUM), Montréal, QC, Canada
- Department of Radiology and Nuclear Medicine, Centre Hospitalier de l’Université de Montréal (CHUM), Montréal, QC, Canada
| | - Michaël Chassé
- Department of Medicine, Centre Hospitalier de l’Université de Montréal (CHUM), Montréal, QC, Canada
- Centre de Recherche du Centre Hospitalier de l'Université de Montréal (CRCHUM), Montréal, QC, Canada
| |
Collapse
|
14
|
Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, Rajpurkar P. Foundation models for generalist medical artificial intelligence. Nature 2023; 616:259-265. [PMID: 37045921 DOI: 10.1038/s41586-023-05881-4] [Citation(s) in RCA: 191] [Impact Index Per Article: 191.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2022] [Accepted: 02/22/2023] [Indexed: 04/14/2023]
Abstract
The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI) models is likely to usher in newfound capabilities in medicine. We propose a new paradigm for medical AI, which we refer to as generalist medical AI (GMAI). GMAI models will be capable of carrying out a diverse set of tasks using very little or no task-specific labelled data. Built through self-supervision on large, diverse datasets, GMAI will flexibly interpret different combinations of medical modalities, including data from imaging, electronic health records, laboratory results, genomics, graphs or medical text. Models will in turn produce expressive outputs such as free-text explanations, spoken recommendations or image annotations that demonstrate advanced medical reasoning abilities. Here we identify a set of high-impact potential applications for GMAI and lay out specific technical capabilities and training datasets necessary to enable them. We expect that GMAI-enabled applications will challenge current strategies for regulating and validating AI devices for medicine and will shift practices associated with the collection of large medical datasets.
Collapse
Affiliation(s)
- Michael Moor
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Oishi Banerjee
- Department of Biomedical Informatics, Harvard University, Cambridge, MA, USA
| | - Zahra Shakeri Hossein Abad
- Institute of Health Policy, Management and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
| | - Harlan M Krumholz
- Yale University School of Medicine, Center for Outcomes Research and Evaluation, Yale New Haven Hospital, New Haven, CT, USA
| | - Jure Leskovec
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Eric J Topol
- Scripps Research Translational Institute, La Jolla, CA, USA.
| | - Pranav Rajpurkar
- Department of Biomedical Informatics, Harvard University, Cambridge, MA, USA.
| |
Collapse
|
15
|
Guo LL, Steinberg E, Fleming SL, Posada J, Lemmon J, Pfohl SR, Shah N, Fries J, Sung L. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci Rep 2023; 13:3767. [PMID: 36882576 PMCID: PMC9992466 DOI: 10.1038/s41598-023-30820-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2022] [Accepted: 03/02/2023] [Indexed: 03/09/2023] Open
Abstract
Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on EHR of up to 1.8 M patients (382 M coded events) collected within pre-determined year groups (e.g., 2009-2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve, and absolute calibration error. Both transformer- and recurrent-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there was observable degradation of discrimination performance (average AUROC decay of 3% for the transformer-based foundation model vs. 7% for count-LR after 5-9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.
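The count-based baseline (count-LR) that the foundation models are compared against can be sketched on toy coded-event data: each patient's EHR is reduced to a bag-of-codes count vector and fed to logistic regression. The code vocabulary, risk model, and cohort sizes below are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
codes = [f"c{i}" for i in range(50)]

def sample_patient(high_risk):
    """Draw a toy sequence of EHR codes; high-risk patients favour the last 10 codes."""
    n_events = rng.integers(5, 20)
    p = np.ones(50)
    p[-10:] += 5 * high_risk
    p /= p.sum()
    return " ".join(rng.choice(codes, size=n_events, p=p))

y = rng.integers(0, 2, size=800)
docs = [sample_patient(label) for label in y]

# Count-based representation: bag-of-codes counts + logistic regression (count-LR).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs[:600])
X_test = vectorizer.transform(docs[600:])
count_lr = LogisticRegression(max_iter=1000).fit(X_train, y[:600])
auc = roc_auc_score(y[600:], count_lr.predict_proba(X_test)[:, 1])
print(f"count-LR AUROC on held-out patients: {auc:.3f}")
```

A pretrained foundation model would replace the count vectors with learned patient representations; the study's claim is that those representations decay less than the counts do as the year groups drift.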
Collapse
Affiliation(s)
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
| | - Ethan Steinberg
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
| | - Scott Lanyon Fleming
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
| | - Jose Posada
- Universidad del Norte, Barranquilla, Colombia
| | - Joshua Lemmon
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
| | - Stephen R Pfohl
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
| | - Nigam Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
| | - Jason Fries
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
| | - Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, Toronto, ON, M5G 1X8, Canada
| |
Collapse
|
16
|
Lemmon J, Guo LL, Posada J, Pfohl SR, Fries J, Fleming SL, Aftandilian C, Shah N, Sung L. Evaluation of Feature Selection Methods for Preserving Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine. Methods Inf Med 2023; 62:60-70. [PMID: 36812932 DOI: 10.1055/s-0043-1762904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/24/2023]
Abstract
BACKGROUND Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. METHODS Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups (2008-2010, 2011-2013, 2014-2016, and 2017-2019). We trained baseline models using L2-regularized logistic regression on 2008-2010 to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression (L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether a feature selection method could maintain ID performance (2008-2010) and improve OOD performance (2017-2019). We also assessed whether parsimonious models retrained on OOD data performed as well as oracle models trained on all features in the OOD year group. RESULTS The baseline model showed significantly worse OOD performance with the long LOS and sepsis tasks when compared with the ID performance. L1 and ROAR retained 3.7 to 12.6% of all features, whereas causal feature selection generally retained fewer features. Models produced by L1 and ROAR exhibited similar ID and OOD performance as the baseline models. The retraining of these models on 2017-2019 data using features selected from training on 2008-2010 data generally reached parity with oracle models trained directly on 2017-2019 data using all available features. Causal feature selection led to heterogeneous results with the superset maintaining ID performance while improving OOD calibration only on the long LOS task. 
CONCLUSIONS While model retraining can mitigate the impact of temporal dataset shift on parsimonious models produced by L1 and ROAR, new methods are required to proactively improve temporal robustness.
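A minimal sketch of the L1 feature-selection workflow evaluated above: select a parsimonious feature set on the ID years, then retrain on OOD data using only those features and compare against an oracle model given all OOD features. The synthetic cohorts and regularization strength are assumptions, and evaluation is in-sample for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic ID cohort (e.g. 2008-2010) and a perturbed OOD cohort (e.g. 2017-2019).
X_id, y_id = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=0)
rng = np.random.default_rng(0)
X_ood = X_id + rng.normal(scale=0.5, size=X_id.shape)  # crude stand-in for temporal shift
y_ood = y_id

# Step 1: L1-regularized logistic regression selects a parsimonious feature set on ID data.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_id, y_id)
selected = np.flatnonzero(l1.coef_[0])
print(f"retained {selected.size} of {X_id.shape[1]} features")

# Step 2: retrain the parsimonious model on OOD data with the ID-selected features,
# and compare against an oracle given all OOD features.
parsimonious = LogisticRegression(max_iter=1000).fit(X_ood[:, selected], y_ood)
oracle = LogisticRegression(max_iter=1000).fit(X_ood, y_ood)
auc_pars = roc_auc_score(y_ood, parsimonious.predict_proba(X_ood[:, selected])[:, 1])
auc_oracle = roc_auc_score(y_ood, oracle.predict_proba(X_ood)[:, 1])
print(f"parsimonious AUROC={auc_pars:.3f}, oracle AUROC={auc_oracle:.3f}")
```

The study's parity finding corresponds to `auc_pars` landing close to `auc_oracle` after retraining, despite the parsimonious model seeing only the ID-selected features.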
Collapse
Affiliation(s)
- Joshua Lemmon
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Jose Posada
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Department of Systems Engineering, Universidad del Norte, Barranquilla, Atlantico, Colombia
| | - Stephen R Pfohl
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
| | - Jason Fries
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
| | - Scott Lanyon Fleming
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
| | - Catherine Aftandilian
- Division of Pediatric Hematology/Oncology, Stanford University, Palo Alto, California, United States
| | - Nigam Shah
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
| | - Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada
- Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, Ontario, Canada
| |
Collapse
|
17
|
Artificial intelligence in bronchopulmonary dysplasia- current research and unexplored frontiers. Pediatr Res 2023; 93:287-290. [PMID: 36385519 DOI: 10.1038/s41390-022-02387-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/15/2022] [Revised: 10/21/2022] [Accepted: 10/30/2022] [Indexed: 11/17/2022]
Abstract
We provide an overview of bronchopulmonary dysplasia, its definitions, and their shortcomings, and explore areas where machine learning may be used to further our understanding of the disease.
Collapse
|
18
|
Parimbelli E, Buonocore TM, Nicora G, Michalowski W, Wilk S, Bellazzi R. Why did AI get this one wrong? - Tree-based explanations of machine learning model predictions. Artif Intell Med 2023; 135:102471. [PMID: 36628785 DOI: 10.1016/j.artmed.2022.102471] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2022] [Revised: 11/25/2022] [Accepted: 11/28/2022] [Indexed: 12/02/2022]
Abstract
Increasingly complex learning methods such as boosting, bagging and deep learning have made ML models more accurate, but harder to interpret and explain, culminating in black-box machine learning models. Model developers and users alike are often presented with a trade-off between performance and intelligibility, especially in high-stakes applications like medicine. In the present article we propose a novel methodological approach for generating explanations for the predictions of a generic machine learning model, given a specific instance for which the prediction has been made. The method, named AraucanaXAI, is based on surrogate, locally-fitted classification and regression trees that are used to provide post-hoc explanations of the prediction of a generic machine learning model. Advantages of the proposed XAI approach include superior fidelity to the original model, ability to deal with non-linear decision boundaries, and native support to both classification and regression problems. We provide a packaged, open-source implementation of the AraucanaXAI method and evaluate its behaviour in a number of different settings that are commonly encountered in medical applications of AI. These include potential disagreement between the model prediction and physician's expert opinion and low reliability of the prediction due to data scarcity.
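A local surrogate tree in the spirit of AraucanaXAI can be sketched as follows: perturb the instance to build a neighbourhood, label it with the black-box model, and fit a shallow classification tree whose rules serve as the explanation. (AraucanaXAI itself additionally uses real neighbours and oversampling; the neighbourhood radius and tree depth here are illustrative choices.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# A black-box model whose individual predictions we want to explain.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
black_box = RandomForestClassifier(random_state=1).fit(X, y)

def local_tree_explanation(instance, n_samples=300, radius=0.5, max_depth=3):
    """Fit a shallow surrogate tree to black-box predictions around one instance."""
    rng = np.random.default_rng(1)
    neighbourhood = instance + rng.normal(scale=radius, size=(n_samples, instance.size))
    surrogate_labels = black_box.predict(neighbourhood)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=1)
    tree.fit(neighbourhood, surrogate_labels)
    fidelity = tree.score(neighbourhood, surrogate_labels)  # agreement with the black box
    return tree, fidelity

tree, fidelity = local_tree_explanation(X[0])
print(export_text(tree, feature_names=[f"x{i}" for i in range(6)]))
print(f"local fidelity to the black box: {fidelity:.2f}")
```

The fidelity score quantifies the fidelity-to-the-original-model property the abstract highlights: a surrogate whose rules disagree with the black box in the neighbourhood would not be a trustworthy explanation.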
Collapse
Affiliation(s)
- Enea Parimbelli
- Department of Electric, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy; Telfer School of Management, University of Ottawa, Ottawa, Ontario, Canada
| | - Tommaso Mario Buonocore
- Department of Electric, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
| | - Giovanna Nicora
- Department of Electric, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy; enGenome srl, Pavia, Italy
| | - Wojtek Michalowski
- Telfer School of Management, University of Ottawa, Ottawa, Ontario, Canada
| | - Szymon Wilk
- Division of Intelligent Decision Support Systems, Institute of Computing Science, Poznan University of Technology, Poznan, Poland
| | - Riccardo Bellazzi
- Department of Electric, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
| |
Collapse
|
19
|
Sperrin M, Riley RD, Collins GS, Martin GP. Targeted validation: validating clinical prediction models in their intended population and setting. Diagn Progn Res 2022; 6:24. [PMID: 36550534 PMCID: PMC9773429 DOI: 10.1186/s41512-022-00136-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/24/2022] [Accepted: 11/14/2022] [Indexed: 12/24/2022] Open
Abstract
Clinical prediction models must be appropriately validated before they can be used. While validation studies are sometimes carefully designed to match an intended population/setting of the model, it is common for validation studies to take place with arbitrary datasets, chosen for convenience rather than relevance. We call estimating how well a model performs within the intended population/setting "targeted validation". Use of this term sharpens the focus on the intended use of a model, which may increase the applicability of developed models, avoid misleading conclusions, and reduce research waste. It also exposes that external validation may not be required when the intended population for the model matches the population used to develop the model; here, a robust internal validation may be sufficient, especially if the development dataset was large.
Collapse
Affiliation(s)
- Matthew Sperrin
- Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK.
| | - Richard D Riley
- Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
| | - Gary S Collins
- Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK
| | - Glen P Martin
- Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
| |
Collapse
|
20
|
Nelson AE, Arbeeva L. Narrative Review of Machine Learning in Rheumatic and Musculoskeletal Diseases for Clinicians and Researchers: Biases, Goals, and Future Directions. J Rheumatol 2022; 49:1191-1200. [PMID: 35840150 PMCID: PMC9633365 DOI: 10.3899/jrheum.220326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/21/2022] [Indexed: 11/22/2022]
Abstract
There has been rapid growth in the use of artificial intelligence (AI) analytics in medicine in recent years, including in rheumatic and musculoskeletal diseases (RMDs). Such methods represent a challenge to clinicians, patients, and researchers, given the "black box" nature of most algorithms, the unfamiliarity of the terms, and the lack of awareness of potential issues around these analyses. Therefore, this review aims to introduce this subject area in a way that is relevant and meaningful to clinicians and researchers. We hope to provide some insights into relevant strengths and limitations, reporting guidelines, as well as recent examples of such analyses in key areas, with a focus on lessons learned and future directions in diagnosis, phenotyping, prognosis, and precision medicine in RMDs.
Collapse
Affiliation(s)
- Amanda E Nelson
- A.E. Nelson, MD, MSCR, Department of Medicine, Division of Rheumatology, Allergy, and Immunology, University of North Carolina at Chapel Hill;
| | - Liubov Arbeeva
- L. Arbeeva, MS, Thurston Arthritis Research Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
| |
Collapse
|
21
|
Davis SE, Walsh CG, Matheny ME. Open questions and research gaps for monitoring and updating AI-enabled tools in clinical settings. Front Digit Health 2022; 4:958284. [PMID: 36120717 PMCID: PMC9478183 DOI: 10.3389/fdgth.2022.958284] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Accepted: 08/11/2022] [Indexed: 11/15/2022] Open
Abstract
As the implementation of artificial intelligence (AI)-enabled tools is realized across diverse clinical environments, there is a growing understanding of the need for ongoing monitoring and updating of prediction models. Dataset shift-temporal changes in clinical practice, patient populations, and information systems-is now well-documented as a source of deteriorating model accuracy and a challenge to the sustainability of AI-enabled tools in clinical care. While best practices are well-established for training and validating new models, there has been limited work developing best practices for prospective validation and model maintenance. In this paper, we highlight the need for updating clinical prediction models and discuss open questions regarding this critical aspect of the AI modeling lifecycle in three focus areas: model maintenance policies, performance monitoring perspectives, and model updating strategies. With the increasing adoption of AI-enabled tools, the need for such best practices must be addressed and incorporated into new and existing implementations. This commentary aims to encourage conversation and motivate additional research across clinical and data science stakeholders.
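One of the maintenance policies discussed above, flagging a deployed model for updating when monitored discrimination falls below a floor, can be sketched on simulated monthly batches. The drift process, batch size, and AUROC floor are all illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical monthly batches of (label, score) pairs from a deployed model,
# with gradually weakening discrimination to mimic dataset shift.
rng = np.random.default_rng(3)
batches = []
for month in range(12):
    separation = 1.5 - 0.1 * month
    y = rng.integers(0, 2, size=500)
    scores = rng.normal(loc=y * separation, scale=1.0)
    batches.append((y, scores))

# A simple maintenance policy: flag the model for updating the first time the
# monitored AUROC falls below a fixed floor (the floor value is illustrative).
AUROC_FLOOR = 0.75
flagged_month = None
for month, (y, scores) in enumerate(batches):
    auc = roc_auc_score(y, scores)
    if auc < AUROC_FLOOR:
        flagged_month = month
        break
print(f"model flagged for updating in month {flagged_month}")
```

The open questions the paper raises sit exactly here: how to choose the floor, the monitoring window, and the updating strategy once the flag fires.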
Collapse
Affiliation(s)
- Sharon E. Davis
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Colin G. Walsh
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Psychiatry, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Michael E. Matheny
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
- Tennessee Valley Healthcare System VA Medical Center, Veterans Health Administration, Nashville, TN, United States
| |
Collapse
|
22
|
Ahmad FS, Luo Y, Wehbe RM, Thomas JD, Shah SJ. Advances in Machine Learning Approaches to Heart Failure with Preserved Ejection Fraction. Heart Fail Clin 2022; 18:287-300. [PMID: 35341541 PMCID: PMC8983114 DOI: 10.1016/j.hfc.2021.12.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Heart failure with preserved ejection fraction (HFpEF) represents a prototypical cardiovascular condition in which machine learning may improve targeted therapies and mechanistic understanding of pathogenesis. Machine learning, which involves algorithms that learn from data, has the potential to guide precision medicine approaches for complex clinical syndromes such as HFpEF. It is therefore important to understand the potential utility and common pitfalls of machine learning so that it can be applied and interpreted appropriately. Although machine learning holds considerable promise for HFpEF, it is subject to several potential pitfalls, which are important factors to consider when interpreting machine learning studies.
Collapse
Affiliation(s)
- Faraz S. Ahmad
- Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL
| | - Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL
| | - Ramsey M. Wehbe
- Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL
| | - James D. Thomas
- Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL
| | - Sanjiv J. Shah
- Division of Cardiology, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, IL
| |
Collapse
|
23
|
Guo LL, Pfohl SR, Fries J, Johnson AEW, Posada J, Aftandilian C, Shah N, Sung L. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Sci Rep 2022; 12:2726. [PMID: 35177653 PMCID: PMC8854561 DOI: 10.1038/s41598-022-06484-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2021] [Accepted: 01/31/2022] [Indexed: 11/24/2022] Open
Abstract
Temporal dataset shift associated with changes in healthcare over time is a barrier to deploying machine learning-based clinical decision support systems. Algorithms that learn robust models by estimating invariant properties across time periods for domain generalization (DG) and unsupervised domain adaptation (UDA) might be suitable to proactively mitigate dataset shift. The objective was to characterize the impact of temporal dataset shift on clinical prediction models and benchmark DG and UDA algorithms on improving model robustness. In this cohort study, intensive care unit patients from the MIMIC-IV database were categorized by year groups (2008–2010, 2011–2013, 2014–2016 and 2017–2019). Tasks were predicting mortality, long length of stay, sepsis and invasive ventilation. Feedforward neural networks were used as prediction models. The baseline experiment trained models using empirical risk minimization (ERM) on 2008–2010 (ERM[08–10]) and evaluated them on subsequent year groups. The DG experiment trained models using algorithms that estimated invariant properties using 2008–2016 and evaluated them on 2017–2019. The UDA experiment leveraged unlabelled samples from 2017 to 2019 for unsupervised distribution matching. DG and UDA models were compared to ERM[08–16] models trained using 2008–2016. Main performance measures were area-under-the-receiver-operating-characteristic curve (AUROC), area-under-the-precision-recall curve and absolute calibration error. Threshold-based metrics including false-positives and false-negatives were used to assess the clinical impact of temporal dataset shift and its mitigation strategies. In the baseline experiments, dataset shift was most evident for sepsis prediction (maximum AUROC drop, 0.090; 95% confidence interval (CI), 0.080–0.101). Considering a scenario of 100 consecutively admitted patients showed that ERM[08–10] applied to 2017–2019 was associated with one additional false-negative among 11 patients with sepsis, when compared to the model applied to 2008–2010. When compared with ERM[08–16], DG and UDA experiments failed to produce more robust models (range of AUROC difference, −0.003 to 0.050). In conclusion, DG and UDA failed to produce more robust models compared to ERM in the setting of temporal dataset shift. Alternate approaches are required to preserve model performance over time in clinical medicine.
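The baseline experiment described in this abstract (train a model on an early year group, then measure how much discrimination it loses on later year groups) can be sketched in a few lines. The cohorts, risk scores, and rank-based AUROC helper below are illustrative assumptions for demonstration only, not the study's actual data, models, or code:

```python
def auroc(y_true, scores):
    """Rank-based AUROC (Mann-Whitney U); tied scores receive the average rank."""
    pairs = sorted(zip(scores, y_true))
    rank_sum_pos, idx, n = 0.0, 0, len(pairs)
    while idx < n:
        j = idx
        while j < n and pairs[j][0] == pairs[idx][0]:
            j += 1
        avg_rank = (idx + 1 + j) / 2.0  # 1-based ranks idx+1..j share their average
        rank_sum_pos += sum(avg_rank for k in range(idx, j) if pairs[k][1] == 1)
        idx = j
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical held-out cohorts: outcome labels and the frozen model's risk
# scores for each year group (the model was fit only on the early period).
cohorts = {
    "2008-2010": ([1, 1, 1, 0, 0, 0, 0], [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]),
    "2017-2019": ([1, 1, 1, 0, 0, 0, 0], [0.6, 0.4, 0.8, 0.5, 0.3, 0.7, 0.2]),
}
results = {period: auroc(y, s) for period, (y, s) in cohorts.items()}
auroc_drop = results["2008-2010"] - results["2017-2019"]  # discrimination lost over time
```

In the study, this AUROC drop (up to 0.090 for sepsis prediction) is the quantity the DG and UDA algorithms were benchmarked against mitigating.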
Collapse
Affiliation(s)
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
| | - Stephen R Pfohl
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
| | - Jason Fries
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
| | - Alistair E W Johnson
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
| | - Jose Posada
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
| | - Nigam Shah
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
| | - Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, Toronto, ON, M5G1X8, Canada
| |
Collapse
|
24
|
Machine Learning Approaches to Investigate Clostridioides difficile Infection and Outcomes: A Systematic Review. Int J Med Inform 2022; 160:104706. [DOI: 10.1016/j.ijmedinf.2022.104706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2021] [Revised: 12/21/2021] [Accepted: 01/22/2022] [Indexed: 11/20/2022]
|