51
Zhang A, Wu Z, Wu E, Wu M, Snyder MP, Zou J, Wu JC. Leveraging physiology and artificial intelligence to deliver advancements in health care. Physiol Rev 2023; 103:2423-2450. [PMID: 37104717; PMCID: PMC10390055; DOI: 10.1152/physrev.00033.2022]
Abstract
Artificial intelligence in health care has experienced remarkable innovation and progress in the last decade. Significant advancements can be attributed to the utilization of artificial intelligence to transform physiology data to advance health care. In this review, we explore how past work has shaped the field and defined future challenges and directions. In particular, we focus on three areas of development. First, we give an overview of artificial intelligence, with special attention to the most relevant artificial intelligence models. We then detail how physiology data have been harnessed by artificial intelligence to advance the main areas of health care: automating existing health care tasks, increasing access to care, and augmenting health care capabilities. Finally, we discuss emerging concerns surrounding the use of individual physiology data and detail an increasingly important consideration for the field, namely the challenges of deploying artificial intelligence models to achieve meaningful clinical impact.
Affiliation(s)
- Angela Zhang
- Stanford Cardiovascular Institute, School of Medicine, Stanford University, Stanford, California, United States
- Department of Genetics, School of Medicine, Stanford University, Stanford, California, United States
- Greenstone Biosciences, Palo Alto, California, United States
- Zhenqin Wu
- Department of Chemistry, Stanford University, Stanford, California, United States
- Eric Wu
- Department of Electrical Engineering, Stanford University, Stanford, California, United States
- Matthew Wu
- Greenstone Biosciences, Palo Alto, California, United States
- Michael P Snyder
- Department of Genetics, School of Medicine, Stanford University, Stanford, California, United States
- James Zou
- Department of Biomedical Informatics, School of Medicine, Stanford University, Stanford, California, United States
- Department of Computer Science, Stanford University, Stanford, California, United States
- Joseph C Wu
- Stanford Cardiovascular Institute, School of Medicine, Stanford University, Stanford, California, United States
- Greenstone Biosciences, Palo Alto, California, United States
- Division of Cardiovascular Medicine, Department of Medicine, Stanford University, Stanford, California, United States
- Department of Radiology, School of Medicine, Stanford University, Stanford, California, United States
52
Herington J, McCradden MD, Creel K, Boellaard R, Jones EC, Jha AK, Rahmim A, Scott PJH, Sunderland JJ, Wahl RL, Zuehlsdorff S, Saboury B. Ethical Considerations for Artificial Intelligence in Medical Imaging: Deployment and Governance. J Nucl Med 2023; 64:1509-1515. [PMID: 37620051; DOI: 10.2967/jnumed.123.266110]
Abstract
The deployment of artificial intelligence (AI) has the potential to make nuclear medicine and medical imaging faster, cheaper, and both more effective and more accessible. This is possible, however, only if clinicians and patients feel that these AI medical devices (AIMDs) are trustworthy. Highlighting the need to ensure health justice by fairly distributing benefits and burdens while respecting individual patients' rights, the AI Task Force of the Society of Nuclear Medicine and Molecular Imaging has identified 4 major ethical risks that arise during the deployment of AIMD: autonomy of patients and clinicians, transparency of clinical performance and limitations, fairness toward marginalized populations, and accountability of physicians and developers. We provide preliminary recommendations for governing these ethical risks to realize the promise of AIMD for patients and populations.
Affiliation(s)
- Jonathan Herington
- Department of Health Humanities and Bioethics and Department of Philosophy, University of Rochester, Rochester, New York
- Melissa D McCradden
- Department of Bioethics, Hospital for Sick Children, and Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Kathleen Creel
- Department of Philosophy and Religion and Khoury College of Computer Sciences, Northeastern University, Boston, Massachusetts
- Ronald Boellaard
- Department of Radiology and Nuclear Medicine, Cancer Centre Amsterdam, Amsterdam University Medical Centres, Amsterdam, The Netherlands
- Elizabeth C Jones
- Department of Radiology and Imaging Sciences, Clinical Center, National Institutes of Health, Bethesda, Maryland
- Abhinav K Jha
- Department of Biomedical Engineering and Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri
- Arman Rahmim
- Departments of Radiology and Physics, University of British Columbia, Vancouver, British Columbia, Canada
- Peter J H Scott
- Department of Radiology, University of Michigan Medical School, Ann Arbor, Michigan
- John J Sunderland
- Departments of Radiology and Physics, University of Iowa, Iowa City, Iowa
- Richard L Wahl
- Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri
- Babak Saboury
- Department of Radiology and Imaging Sciences, Clinical Center, National Institutes of Health, Bethesda, Maryland
53
Stegmann JU, Littlebury R, Trengove M, Goetz L, Bate A, Branson KM. Trustworthy AI for safe medicines. Nat Rev Drug Discov 2023; 22:855-856. [PMID: 37550364; DOI: 10.1038/s41573-023-00769-4]
Affiliation(s)
- Markus Trengove
- Artificial Intelligence and Machine Learning, GSK, London, UK
- Lea Goetz
- Artificial Intelligence and Machine Learning, GSK, London, UK
- Kim M Branson
- Artificial Intelligence and Machine Learning, GSK, San Francisco, USA
54
Wang SM, Hogg HDJ, Sangvai D, Patel MR, Weissler EH, Kellogg KC, Ratliff W, Balu S, Sendak M. Development and Integration of Machine Learning Algorithm to Identify Peripheral Arterial Disease: Multistakeholder Qualitative Study. JMIR Form Res 2023; 7:e43963. [PMID: 37733427; PMCID: PMC10557008; DOI: 10.2196/43963]
Abstract
BACKGROUND Machine learning (ML)-driven clinical decision support (CDS) continues to draw wide interest and investment as a means of improving care quality and value, despite mixed real-world implementation outcomes. OBJECTIVE This study aimed to explore the factors that influence the integration of a peripheral arterial disease (PAD) identification algorithm to implement timely guideline-based care. METHODS A total of 12 semistructured interviews were conducted with individuals from 3 stakeholder groups during the first 4 weeks of integration of an ML-driven CDS. The stakeholder groups included technical, administrative, and clinical members of the team interacting with the ML-driven CDS. The ML-driven CDS identified patients with a high probability of having PAD, and these patients were then reviewed by an interdisciplinary team that developed a recommended action plan and sent recommendations to the patient's primary care provider. Pseudonymized transcripts were coded, and thematic analysis was conducted by a multidisciplinary research team. RESULTS Three themes were identified: positive factors translating in silico performance to real-world efficacy, organizational factors and data structure factors affecting clinical impact, and potential challenges to advancing equity. Our study found that the factors that led to successful translation of in silico algorithm performance to real-world impact were largely nontechnical, given adequate efficacy in retrospective validation, including strong clinical leadership, trustworthy workflows, early consideration of end-user needs, and ensuring that the CDS addresses an actionable problem. Negative factors of integration included failure to incorporate the on-the-ground context, the lack of feedback loops, and data silos limiting the ML-driven CDS. The success criteria for each stakeholder group were also characterized to better understand how teams work together to integrate ML-driven CDS and to understand the varying needs across stakeholder groups. CONCLUSIONS Longitudinal and multidisciplinary stakeholder engagement in the development and integration of ML-driven CDS underpins its effective translation into real-world care. Although previous studies have focused on the technical elements of ML-driven CDS, our study demonstrates the importance of including administrative and operational leaders as well as an early consideration of clinicians' needs. Taking this more holistic perspective also permits more effective detection of context-driven health care inequities, which are uncovered or exacerbated through the structural and organizational challenges of ML-driven CDS integration. Many of the solutions to these inequities lie outside the scope of ML and require coordinated systematic solutions for mitigation to help reduce disparities in the care of patients with PAD.
Affiliation(s)
- Sabrina M Wang
- Duke University School of Medicine, Durham, NC, United States
- H D Jeffry Hogg
- Population Health Science Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, United Kingdom
- Newcastle Eye Centre, Royal Victoria Infirmary, Newcastle upon Tyne, United Kingdom
- Devdutta Sangvai
- Population Health Management, Duke Health, Durham, NC, United States
- Manesh R Patel
- Department of Cardiology, Duke University, Durham, NC, United States
- E Hope Weissler
- Department of Vascular Surgery, Duke University, Durham, NC, United States
- William Ratliff
- Duke Institute for Health Innovation, Durham, NC, United States
- Suresh Balu
- Duke Institute for Health Innovation, Durham, NC, United States
- Mark Sendak
- Duke Institute for Health Innovation, Durham, NC, United States
55
Kwong JCC, Khondker A, Lajkosz K, McDermott MBA, Frigola XB, McCradden MD, Mamdani M, Kulkarni GS, Johnson AEW. APPRAISE-AI Tool for Quantitative Evaluation of AI Studies for Clinical Decision Support. JAMA Netw Open 2023; 6:e2335377. [PMID: 37747733; PMCID: PMC10520738; DOI: 10.1001/jamanetworkopen.2023.35377]
Abstract
Importance Artificial intelligence (AI) has gained considerable attention in health care, yet concerns have been raised around appropriate methods and fairness. Current AI reporting guidelines do not provide a means of quantifying overall quality of AI research, limiting their ability to compare models addressing the same clinical question. Objective To develop a tool (APPRAISE-AI) to evaluate the methodological and reporting quality of AI prediction models for clinical decision support. Design, Setting, and Participants This quality improvement study evaluated AI studies in the model development, silent, and clinical trial phases using the APPRAISE-AI tool, a quantitative method for evaluating quality of AI studies across 6 domains: clinical relevance, data quality, methodological conduct, robustness of results, reporting quality, and reproducibility. These domains included 24 items with a maximum overall score of 100 points. Points were assigned to each item, with higher points indicating stronger methodological or reporting quality. The tool was applied to a systematic review on machine learning to estimate sepsis that included articles published until September 13, 2019. Data analysis was performed from September to December 2022. Main Outcomes and Measures The primary outcomes were interrater and intrarater reliability and the correlation between APPRAISE-AI scores and expert scores, 3-year citation rate, number of Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) low risk-of-bias domains, and overall adherence to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement. Results A total of 28 studies were included. Overall APPRAISE-AI scores ranged from 33 (low quality) to 67 (high quality). Most studies were moderate quality. The 5 lowest scoring items included source of data, sample size calculation, bias assessment, error analysis, and transparency. Overall APPRAISE-AI scores were associated with expert scores (Spearman ρ, 0.82; 95% CI, 0.64-0.91; P < .001), 3-year citation rate (Spearman ρ, 0.69; 95% CI, 0.43-0.85; P < .001), number of QUADAS-2 low risk-of-bias domains (Spearman ρ, 0.56; 95% CI, 0.24-0.77; P = .002), and adherence to the TRIPOD statement (Spearman ρ, 0.87; 95% CI, 0.73-0.94; P < .001). Intraclass correlation coefficient ranges for interrater and intrarater reliability were 0.74 to 1.00 for individual items, 0.81 to 0.99 for individual domains, and 0.91 to 0.98 for overall scores. Conclusions and Relevance In this quality improvement study, APPRAISE-AI demonstrated strong interrater and intrarater reliability and correlated well with several study quality measures. This tool may provide a quantitative approach for investigators, reviewers, editors, and funding organizations to compare the research quality across AI studies for clinical decision support.
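The validation statistics reported above (Spearman ρ with 95% CIs) follow a standard recipe. A minimal sketch, assuming SciPy and using made-up score pairs rather than the study's data:

```python
# Minimal sketch (not the APPRAISE-AI authors' code) of a rank correlation
# between tool scores and expert scores, with an approximate 95% CI from the
# Fisher z-transform. The score arrays below are hypothetical.
import numpy as np
from scipy import stats

appraise_scores = np.array([33, 41, 48, 52, 55, 58, 61, 64, 67])  # hypothetical
expert_scores = np.array([30, 45, 44, 50, 60, 57, 65, 63, 70])    # hypothetical

rho, p_value = stats.spearmanr(appraise_scores, expert_scores)

n = len(appraise_scores)
z = np.arctanh(rho)                 # Fisher z-transform of rho
se = 1.0 / np.sqrt(n - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"Spearman rho = {rho:.2f} (95% CI {lo:.2f}-{hi:.2f}), P = {p_value:.3f}")
```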
Affiliation(s)
- Jethro C. C. Kwong
- Division of Urology, Department of Surgery, University of Toronto, Toronto, Ontario, Canada
- Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, Ontario, Canada
- Adree Khondker
- Division of Urology, Department of Surgery, University of Toronto, Toronto, Ontario, Canada
- Katherine Lajkosz
- Division of Urology, Department of Surgery, University of Toronto, Toronto, Ontario, Canada
- Department of Biostatistics, University Health Network, University of Toronto, Toronto, Ontario, Canada
- Xavier Borrat Frigola
- Laboratory for Computational Physiology, Harvard–Massachusetts Institute of Technology Division of Health Sciences and Technology, Cambridge
- Anesthesiology and Critical Care Department, Hospital Clinic de Barcelona, Barcelona, Spain
- Melissa D. McCradden
- Department of Bioethics, The Hospital for Sick Children, Toronto, Ontario, Canada
- Genetics & Genome Biology Research Program, Peter Gilgan Centre for Research and Learning, Toronto, Ontario, Canada
- Division of Clinical and Public Health, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Muhammad Mamdani
- Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, Ontario, Canada
- Data Science and Advanced Analytics, Unity Health Toronto, Toronto, Ontario, Canada
- Girish S. Kulkarni
- Division of Urology, Department of Surgery, University of Toronto, Toronto, Ontario, Canada
- Princess Margaret Cancer Centre, University Health Network, University of Toronto, Toronto, Ontario, Canada
- Alistair E. W. Johnson
- Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, Ontario, Canada
- Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Child Health Evaluative Sciences, The Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada
56
Balch JA, Loftus TJ. Actionable artificial intelligence: Overcoming barriers to adoption of prediction tools. Surgery 2023; 174:730-732. [PMID: 37198040; DOI: 10.1016/j.surg.2023.03.019]
Abstract
Clinical prediction models based on artificial intelligence algorithms can potentially improve patient care, reduce errors, and add value to the health care system. However, their adoption is hindered by legitimate economic, practical, professional, and intellectual concerns. This article explores these barriers and highlights well-studied instruments that can be used to overcome them. Adopting actionable predictive models will require the purposeful incorporation of patient, clinical, technical, and administrative perspectives. Model developers must articulate a priori clinical needs, ensure explainability and low error frequency and severity, and promote safety and fairness. Models themselves require ongoing validation and monitoring to address variations in health care settings and must comply with an evolving regulatory environment. Through these principles, surgeons and health care providers can leverage artificial intelligence to optimize patient care.
Affiliation(s)
- Jeremy A Balch
- Department of Surgery, University of Florida Health, Gainesville, FL; Intelligent Critical Care Center (IC3), University of Florida, Gainesville, FL. https://twitter.com/balchja
- Tyler J Loftus
- Department of Surgery, University of Florida Health, Gainesville, FL; Intelligent Critical Care Center (IC3), University of Florida, Gainesville, FL.
57
Wang C, Liu S, Yang H, Guo J, Wu Y, Liu J. Ethical Considerations of Using ChatGPT in Health Care. J Med Internet Res 2023; 25:e48009. [PMID: 37566454; PMCID: PMC10457697; DOI: 10.2196/48009]
Abstract
ChatGPT has promising applications in health care, but potential ethical issues need to be addressed proactively to prevent harm. ChatGPT presents potential ethical challenges from legal, humanistic, algorithmic, and informational perspectives. Legal ethics concerns arise from the unclear allocation of responsibility when patient harm occurs and from potential breaches of patient privacy due to data collection. Clear rules and legal boundaries are needed to properly allocate liability and protect users. Humanistic ethics concerns arise from the potential disruption of the physician-patient relationship, humanistic care, and issues of integrity. Overreliance on artificial intelligence (AI) can undermine compassion and erode trust. Transparency and disclosure of AI-generated content are critical to maintaining integrity. Algorithmic ethics raise concerns about algorithmic bias, responsibility, transparency and explainability, as well as validation and evaluation. Information ethics include data bias, validity, and effectiveness. Biased training data can lead to biased output, and overreliance on ChatGPT can reduce patient adherence and encourage self-diagnosis. Ensuring the accuracy, reliability, and validity of ChatGPT-generated content requires rigorous validation and ongoing updates based on clinical practice. To navigate the evolving ethical landscape of AI, AI in health care must adhere to the strictest ethical standards. Through comprehensive ethical guidelines, health care professionals can ensure the responsible use of ChatGPT, promote accurate and reliable information exchange, protect patient privacy, and empower patients to make informed decisions about their health care.
Affiliation(s)
- Changyu Wang
- Department of Medical Informatics, West China Medical School, Sichuan University, Chengdu, China
- West China College of Stomatology, Sichuan University, Chengdu, China
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Hao Yang
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Jiulin Guo
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Yuxuan Wu
- Department of Medical Informatics, West China Medical School, Sichuan University, Chengdu, China
- Jialin Liu
- Department of Medical Informatics, West China Medical School, Sichuan University, Chengdu, China
- Information Center, West China Hospital, Sichuan University, Chengdu, China
- Department of Otolaryngology-Head and Neck Surgery, West China Hospital, Sichuan University, Chengdu, China
58
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera Y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Large language models encode clinical knowledge. Nature 2023; 620:172-180. [PMID: 37438534; PMCID: PMC10396962; DOI: 10.1038/s41586-023-06291-2]
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate the Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
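The multiple-choice accuracies quoted above (e.g., 67.6% on MedQA) boil down to exact-match scoring over keyed options. A toy sketch of that scoring loop, with placeholder data and a placeholder `answer_fn` rather than the paper's MultiMedQA pipeline:

```python
# Illustrative sketch of multiple-choice benchmark scoring; the dataset rows
# and answer_fn below are hypothetical stand-ins, not MultiMedQA itself.
from typing import Callable

def mcq_accuracy(answer_fn: Callable[[str, list], str], dataset: list) -> float:
    """Fraction of questions where the model's chosen option matches the key."""
    correct = sum(answer_fn(q["question"], q["options"]) == q["answer"]
                  for q in dataset)
    return correct / len(dataset)

toy = [
    {"question": "Which drug ...?", "options": ["A", "B", "C", "D"], "answer": "B"},
    {"question": "Most likely diagnosis ...?", "options": ["A", "B", "C", "D"], "answer": "A"},
]
naive = lambda question, options: options[0]  # trivial always-pick-A baseline
print(mcq_accuracy(naive, toy))               # 0.5 on this toy set
```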
Affiliation(s)
- Tao Tu
- Google Research, Mountain View, CA, USA
- Jason Wei
- Google Research, Mountain View, CA, USA
- Yun Liu
- Google Research, Mountain View, CA, USA
59
van Leeuwen K, Becks M, Grob D, de Lange F, Rutten J, Schalekamp S, Rutten M, van Ginneken B, de Rooij M, Meijer F. AI-support for the detection of intracranial large vessel occlusions: One-year prospective evaluation. Heliyon 2023; 9:e19065. [PMID: 37636476; PMCID: PMC10458691; DOI: 10.1016/j.heliyon.2023.e19065]
Abstract
Purpose Few studies have evaluated real-world performance of radiological AI-tools in clinical practice. Over one year, we prospectively evaluated the use of AI software to support the detection of intracranial large vessel occlusions (LVO) on CT angiography (CTA). Method Quantitative measures (user log-in attempts, AI standalone performance) and qualitative data (user surveys) were reviewed by a key-user group at three timepoints. A total of 491 CTA studies of 460 patients were included for analysis. Results The overall accuracy of the AI-tool for LVO detection and localization was 87.6%, sensitivity 69.1% and specificity 91.2%. Out of 81 LVOs, 31 of 34 (91%) M1 occlusions were detected correctly, 19 of 38 (50%) M2 occlusions, and 6 of 9 (67%) ICA occlusions. The product was considered user-friendly. The diagnostic confidence of the users for LVO detection remained the same over the year. The last measured net promoter score was -56%. The use of the AI-tool fluctuated over the year with a declining trend. Conclusions Our pragmatic approach of evaluating the AI-tool used in clinical practice helped us to monitor the usage, to estimate the perceived added value by the users of the AI-tool, and to make an informed decision about the continued use of the AI-tool.
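The headline metrics above come from a standard binary confusion matrix. A minimal sketch, with counts back-calculated from the reported percentages (so treat them as approximate reconstructions, not published cell counts):

```python
# Sketch (not the authors' code) of accuracy/sensitivity/specificity from a
# 2x2 confusion matrix. Counts are back-calculated from the abstract's
# percentages (81 LVOs among 491 studies) and are approximate.
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # detection rate on LVO-positive studies
    specificity = tn / (tn + fp)   # correct negatives on LVO-free studies
    return accuracy, sensitivity, specificity

acc, sens, spec = diagnostic_metrics(tp=56, fp=36, tn=374, fn=25)
print(f"accuracy {acc:.1%}, sensitivity {sens:.1%}, specificity {spec:.1%}")
# -> accuracy 87.6%, sensitivity 69.1%, specificity 91.2%
```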
Affiliation(s)
- K.G. van Leeuwen
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- M.J. Becks
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- D. Grob
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- F. de Lange
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- J.H.E. Rutten
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- S. Schalekamp
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- M.J.C.M. Rutten
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- Department of Radiology, Jeroen Bosch Hospital, ‘s-Hertogenbosch, the Netherlands
- B. van Ginneken
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- M. de Rooij
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
- F.J.A. Meijer
- Department of Medical Imaging, Radboud University Medical Center, Nijmegen, the Netherlands
60
Stutchfield BM, Attia A, Rowe IA, Harrison EM, Gordon-Walker T. UK liver transplantation allocation algorithm: transplant benefit score - Authors' reply. Lancet 2023; 402:371-372. [PMID: 37516542; DOI: 10.1016/s0140-6736(23)01307-7]
Affiliation(s)
- Ben M Stutchfield
- Department of Clinical and Surgical Sciences, University of Edinburgh, Edinburgh EH14 4SA, UK; Edinburgh Transplant Centre, Royal Infirmary of Edinburgh, Edinburgh, UK.
- Antony Attia
- School of Medicine, University of Edinburgh, Edinburgh EH14 4SA, UK
- Ian A Rowe
- Leeds Institute for Medical Research, University of Leeds, Leeds, UK
- Ewen M Harrison
- Department of Clinical and Surgical Sciences, University of Edinburgh, Edinburgh EH14 4SA, UK; Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
- Tim Gordon-Walker
- Edinburgh Transplant Centre, Royal Infirmary of Edinburgh, Edinburgh, UK
61
Banda JM, Shah NH, Periyakoil VS. Characterizing subgroup performance of probabilistic phenotype algorithms within older adults: a case study for dementia, mild cognitive impairment, and Alzheimer's and Parkinson's diseases. JAMIA Open 2023; 6:ooad043. [PMID: 37397506; PMCID: PMC10307941; DOI: 10.1093/jamiaopen/ooad043]
Abstract
Objective Biases within probabilistic electronic phenotyping algorithms are largely unexplored. In this work, we characterize differences in subgroup performance of phenotyping algorithms for Alzheimer's disease and related dementias (ADRD) in older adults. Materials and methods We created an experimental framework to characterize the performance of probabilistic phenotyping algorithms under different racial distributions allowing us to identify which algorithms may have differential performance, by how much, and under what conditions. We relied on rule-based phenotype definitions as reference to evaluate probabilistic phenotype algorithms created using the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation framework. Results We demonstrate that some algorithms have performance variations anywhere from 3% to 30% for different populations, even when not using race as an input variable. We show that while performance differences in subgroups are not present for all phenotypes, they do affect some phenotypes and groups more disproportionately than others. Discussion Our analysis establishes the need for a robust evaluation framework for subgroup differences. The underlying patient populations for the algorithms showing subgroup performance differences have great variance between model features when compared with the phenotypes with little to no differences. Conclusion We have created a framework to identify systematic differences in the performance of probabilistic phenotyping algorithms specifically in the context of ADRD as a use case. Differences in subgroup performance of probabilistic phenotyping algorithms are not widespread nor do they occur consistently. This highlights the great need for careful ongoing monitoring to evaluate, measure, and try to mitigate such differences.
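The core measurement behind such an analysis is per-subgroup model performance and the gap between subgroups. A minimal sketch, assuming scikit-learn, a fitted binary phenotype classifier `clf`, and placeholder `X`, `y`, and `group` inputs (none of which are the study's framework or data):

```python
# Illustrative sketch of subgroup performance evaluation for a probabilistic
# phenotyping model; `clf`, X, y, and the group column are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(clf, X: pd.DataFrame, y: pd.Series, group: pd.Series) -> pd.Series:
    """AUROC of the model computed separately within each subgroup."""
    scores = {}
    for g in group.unique():
        mask = group == g
        scores[g] = roc_auc_score(y[mask], clf.predict_proba(X[mask])[:, 1])
    return pd.Series(scores)

# aucs = subgroup_auroc(clf, X, y, df["race"])
# print(aucs.max() - aucs.min())  # the kind of 3%-30% gap reported above
```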
Affiliation(s)
- Juan M Banda
- Department of Computer Science, College of Arts and Sciences, Georgia State University, 25 Park Place, Suite 752, Atlanta, GA 30303, USA
- Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, USA
- Vyjeyanthi S Periyakoil
- Stanford Department of Medicine, Palo Alto, California, USA
- VA Palo Alto Health Care System, Palo Alto, California, USA
62
Kwong JCC, Khondker A, Meng E, Taylor N, Kuk C, Perlis N, Kulkarni GS, Hamilton RJ, Fleshner NE, Finelli A, van der Kwast TH, Ali A, Jamal M, Papanikolaou F, Short T, Srigley JR, Colinet V, Peltier A, Diamand R, Lefebvre Y, Mandoorah Q, Sanchez-Salas R, Macek P, Cathelineau X, Eklund M, Johnson AEW, Feifer A, Zlotta AR. Development, multi-institutional external validation, and algorithmic audit of an artificial intelligence-based Side-specific Extra-Prostatic Extension Risk Assessment tool (SEPERA) for patients undergoing radical prostatectomy: a retrospective cohort study. Lancet Digit Health 2023; 5:e435-e445. [PMID: 37211455; DOI: 10.1016/s2589-7500(23)00067-5]
Abstract
BACKGROUND Accurate prediction of side-specific extraprostatic extension (ssEPE) is essential for performing nerve-sparing surgery to mitigate treatment-related side-effects such as impotence and incontinence in patients with localised prostate cancer. Artificial intelligence (AI) might provide robust and personalised ssEPE predictions to better inform nerve-sparing strategy during radical prostatectomy. We aimed to develop, externally validate, and perform an algorithmic audit of an AI-based Side-specific Extra-Prostatic Extension Risk Assessment tool (SEPERA). METHODS Each prostatic lobe was treated as an individual case such that each patient contributed two cases to the overall cohort. SEPERA was trained on 1022 cases from a community hospital network (Trillium Health Partners; Mississauga, ON, Canada) between 2010 and 2020. Subsequently, SEPERA was externally validated on 3914 cases across three academic centres: Princess Margaret Cancer Centre (Toronto, ON, Canada) from 2008 to 2020; L'Institut Mutualiste Montsouris (Paris, France) from 2010 to 2020; and Jules Bordet Institute (Brussels, Belgium) from 2015 to 2020. Model performance was characterised by area under the receiver operating characteristic curve (AUROC), area under the precision recall curve (AUPRC), calibration, and net benefit. SEPERA was compared against contemporary nomograms (ie, Sayyid nomogram, Soeterik nomogram [non-MRI and MRI]), as well as a separate logistic regression model using the same variables included in SEPERA. An algorithmic audit was performed to assess model bias and identify common patient characteristics among predictive errors. FINDINGS Overall, 2468 patients comprising 4936 cases (ie, prostatic lobes) were included in this study. SEPERA was well calibrated and had the best performance across all validation cohorts (pooled AUROC of 0·77 [95% CI 0·75-0·78] and pooled AUPRC of 0·61 [0·58-0·63]). In patients with pathological ssEPE despite benign ipsilateral biopsies, SEPERA correctly predicted ssEPE in 72 (68%) of 106 cases compared with the other models (47 [44%] in the logistic regression model, none in the Sayyid model, 13 [12%] in the Soeterik non-MRI model, and five [5%] in the Soeterik MRI model). SEPERA had higher net benefit than the other models to predict ssEPE, enabling more patients to safely undergo nerve-sparing. In the algorithmic audit, no evidence of model bias was observed, with no significant difference in AUROC when stratified by race, biopsy year, age, biopsy type (systematic only vs systematic and MRI-targeted biopsy), biopsy location (academic vs community), and D'Amico risk group. According to the audit, the most common errors were false positives, particularly for older patients with high-risk disease. No aggressive tumours (ie, grade >2 or high-risk disease) were found among false negatives. INTERPRETATION We demonstrated the accuracy, safety, and generalisability of using SEPERA to personalise nerve-sparing approaches during radical prostatectomy. FUNDING None.
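The "net benefit" comparison above comes from decision-curve analysis: true positives are credited and false positives penalised at an exchange rate set by the threshold probability. A minimal sketch of the quantity, assuming NumPy arrays of labels and predicted risks (placeholders, not the SEPERA data):

```python
# Sketch (not the SEPERA code) of net benefit at threshold probability pt,
# as used in decision-curve analysis: NB = TP/n - (FP/n) * pt/(1-pt).
import numpy as np

def net_benefit(y_true: np.ndarray, y_prob: np.ndarray, pt: float) -> float:
    n = len(y_true)
    pred_pos = y_prob >= pt                  # treat as "has ssEPE" above pt
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    return tp / n - (fp / n) * (pt / (1 - pt))

# Sweeping pt over a clinically plausible range and comparing curves for
# SEPERA-style scores vs nomograms is how "higher net benefit" is shown.
```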
Affiliation(s)
- Jethro C C Kwong
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Division of Urology, Department of Surgery, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada; Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, ON, Canada
- Adree Khondker
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Eric Meng
- Faculty of Medicine, Queen's University, Kingston, ON, Canada
- Nicholas Taylor
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Cynthia Kuk
- Division of Urology, Department of Surgery, Mount Sinai Hospital, Sinai Health System, Toronto, ON, Canada
- Nathan Perlis
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Division of Urology, Department of Surgery, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Girish S Kulkarni
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Division of Urology, Department of Surgery, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada; Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, ON, Canada
- Robert J Hamilton
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Division of Urology, Department of Surgery, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Neil E Fleshner
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Division of Urology, Department of Surgery, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Antonio Finelli
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Division of Urology, Department of Surgery, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Theodorus H van der Kwast
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada; Laboratory Medicine Program, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Amna Ali
- Institute for Better Health, Trillium Health Partners, Mississauga, ON, Canada
- Munir Jamal
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada
- Frank Papanikolaou
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada
- Thomas Short
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada
- John R Srigley
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada
- Valentin Colinet
- Division of Urology, Department of Surgery, Jules Bordet Institute, Brussels, Belgium
- Alexandre Peltier
- Division of Urology, Department of Surgery, Jules Bordet Institute, Brussels, Belgium
- Romain Diamand
- Division of Urology, Department of Surgery, Jules Bordet Institute, Brussels, Belgium
- Yolene Lefebvre
- Department of Medical Imagery, Jules Bordet Institute, Brussels, Belgium
- Qusay Mandoorah
- Division of Urology, Department of Surgery, L'Institut Mutualiste Montsouris, Paris, France
- Rafael Sanchez-Salas
- Division of Urology, Department of Surgery, L'Institut Mutualiste Montsouris, Paris, France
- Petr Macek
- Division of Urology, Department of Surgery, L'Institut Mutualiste Montsouris, Paris, France
- Xavier Cathelineau
- Division of Urology, Department of Surgery, L'Institut Mutualiste Montsouris, Paris, France
- Martin Eklund
- Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
- Alistair E W Johnson
- Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, ON, Canada; Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Vector Institute, Toronto, ON, Canada
- Andrew Feifer
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Institute for Better Health, Trillium Health Partners, Mississauga, ON, Canada
- Alexandre R Zlotta
- Division of Urology, Department of Surgery, University of Toronto, Toronto, ON, Canada; Division of Urology, Department of Surgery, Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada; Division of Urology, Department of Surgery, Mount Sinai Hospital, Sinai Health System, Toronto, ON, Canada.
63
Lyell D, Wang Y, Coiera E, Magrabi F. More than algorithms: an analysis of safety events involving ML-enabled medical devices reported to the FDA. J Am Med Inform Assoc 2023; 30:1227-1236. [PMID: 37071804; PMCID: PMC10280342; DOI: 10.1093/jamia/ocad065]
Abstract
OBJECTIVE To examine the real-world safety problems involving machine learning (ML)-enabled medical devices. MATERIALS AND METHODS We analyzed 266 safety events involving approved ML medical devices reported to the US FDA's MAUDE program between 2015 and October 2021. Events were reviewed against an existing framework for safety problems with Health IT to identify whether a reported problem was due to the ML device (device problem) or its use, and key contributors to the problem. Consequences of events were also classified. RESULTS Events described hazards with potential to harm (66%), actual harm (16%), consequences for healthcare delivery (9%), near misses that would have led to harm if not for intervention (4%), no harm or consequences (3%), and complaints (2%). While most events involved device problems (93%), use problems (7%) were 4 times more likely to harm (relative risk 4.2; 95% CI 2.5-7). Problems with data input to ML devices were the top contributor to events (82%). DISCUSSION Much of what is known about ML safety comes from case studies and the theoretical limitations of ML. We contribute a systematic analysis of ML safety problems captured as part of the FDA's routine post-market surveillance. Most problems involved devices and concerned the acquisition of data for processing by algorithms. However, problems with the use of devices were more likely to harm. CONCLUSIONS Safety problems with ML devices involve more than algorithms, highlighting the need for a whole-of-system approach to safe implementation with a special focus on how users interact with devices.
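The relative-risk figure above is a ratio of harm probabilities between use problems and device problems. A minimal sketch of the calculation, with hypothetical event counts chosen only to roughly reproduce the reported RR (the paper's exact cell counts are not given in this abstract):

```python
# Sketch (not the authors' code) of a relative risk with a log-normal 95% CI.
import numpy as np

def relative_risk(harm_a: int, n_a: int, harm_b: int, n_b: int):
    rr = (harm_a / n_a) / (harm_b / n_b)
    se_log = np.sqrt(1/harm_a - 1/n_a + 1/harm_b - 1/n_b)  # SE of log(RR)
    lo = np.exp(np.log(rr) - 1.96 * se_log)
    hi = np.exp(np.log(rr) + 1.96 * se_log)
    return rr, (lo, hi)

# Hypothetical: 11 of 19 use-problem events harmed vs 33 of 247
# device-problem events -- illustrative numbers only.
rr, ci = relative_risk(11, 19, 33, 247)
print(f"RR {rr:.1f} (95% CI {ci[0]:.1f}-{ci[1]:.1f})")
```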
Affiliation(s)
- David Lyell
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Ying Wang
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Enrico Coiera
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Farah Magrabi
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
64
González C, Ranem A, Pinto Dos Santos D, Othman A, Mukhopadhyay A. Lifelong nnU-Net: a framework for standardized medical continual learning. Sci Rep 2023; 13:9381. [PMID: 37296233; PMCID: PMC10256748; DOI: 10.1038/s41598-023-34484-2]
Abstract
As the enthusiasm surrounding Deep Learning grows, both medical practitioners and regulatory bodies are exploring ways to safely introduce image segmentation in clinical practice. One frontier to overcome when translating promising research into the clinical open world is the shift from static to continual learning. Continual learning, the practice of training models throughout their lifecycle, is seeing growing interest but is still in its infancy in healthcare. We present Lifelong nnU-Net, a standardized framework that places continual segmentation in the hands of researchers and clinicians. Built on top of the nnU-Net (widely regarded as the best-performing segmenter for multiple medical applications) and equipped with all necessary modules for training and testing models sequentially, we ensure broad applicability and lower the barrier to evaluating new methods in a continual fashion. Our benchmark results across three medical segmentation use cases and five continual learning methods give a comprehensive outlook on the current state of the field and signify a first reproducible benchmark.
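The evaluation pattern such a framework standardizes is: train on tasks in sequence, and after each stage re-test on everything seen so far to expose forgetting. A schematic sketch of that loop, where `method.train_on`, `method.evaluate`, and the task objects are duck-typed placeholders rather than the Lifelong nnU-Net API:

```python
# Schematic sketch of sequential training with cross-task evaluation, the
# shape of experiment a continual-learning benchmark runs. All names here
# are placeholder assumptions, not the framework's actual interface.
def continual_run(model, tasks, method):
    """Train on tasks in order; after each stage, evaluate on all tasks seen
    so far, so backward transfer (forgetting) becomes visible."""
    results = {}
    for i, task in enumerate(tasks):
        model = method.train_on(model, task)      # e.g. rehearsal, EWC, ...
        for seen in tasks[: i + 1]:
            results[(task.name, seen.name)] = method.evaluate(model, seen)
    return results  # matrix of scores per (training stage, test task)
```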
Affiliation(s)
- Camila González
- Technical University of Darmstadt, Karolinenpl. 5, 64289, Darmstadt, Germany.
- Amin Ranem
- Technical University of Darmstadt, Karolinenpl. 5, 64289, Darmstadt, Germany
- Daniel Pinto Dos Santos
- University Hospital Cologne, Kerpener Str. 62, 50937, Cologne, Germany
- University Hospital Frankfurt, Theodor-Stern-Kai 7, 60590, Frankfurt, Germany
- Ahmed Othman
- University Medical Center Mainz, Langenbeckstraße 1, 55131, Mainz, Germany
65
Kann BH, Likitlersuang J, Bontempi D, Ye Z, Aneja S, Bakst R, Kelly HR, Juliano AF, Payabvash S, Guenette JP, Uppaluri R, Margalit DN, Schoenfeld JD, Tishler RB, Haddad R, Aerts HJWL, Garcia JJ, Flamand Y, Subramaniam RM, Burtness BA, Ferris RL. Screening for extranodal extension in HPV-associated oropharyngeal carcinoma: evaluation of a CT-based deep learning algorithm in patient data from a multicentre, randomised de-escalation trial. Lancet Digit Health 2023; 5:e360-e369. [PMID: 37087370; PMCID: PMC10245380; DOI: 10.1016/s2589-7500(23)00046-8]
Abstract
BACKGROUND Pretreatment identification of pathological extranodal extension (ENE) would guide therapy de-escalation strategies for in human papillomavirus (HPV)-associated oropharyngeal carcinoma but is diagnostically challenging. ECOG-ACRIN Cancer Research Group E3311 was a multicentre trial wherein patients with HPV-associated oropharyngeal carcinoma were treated surgically and assigned to a pathological risk-based adjuvant strategy of observation, radiation, or concurrent chemoradiation. Despite protocol exclusion of patients with overt radiographic ENE, more than 30% had pathological ENE and required postoperative chemoradiation. We aimed to evaluate a CT-based deep learning algorithm for prediction of ENE in E3311, a diagnostically challenging cohort wherein algorithm use would be impactful in guiding decision-making. METHODS For this retrospective evaluation of deep learning algorithm performance, we obtained pretreatment CTs and corresponding surgical pathology reports from the multicentre, randomised de-escalation trial E3311. All enrolled patients on E3311 required pretreatment and diagnostic head and neck imaging; patients with radiographically overt ENE were excluded per study protocol. The lymph node with largest short-axis diameter and up to two additional nodes were segmented on each scan and annotated for ENE per pathology reports. Deep learning algorithm performance for ENE prediction was compared with four board-certified head and neck radiologists. The primary endpoint was the area under the curve (AUC) of the receiver operating characteristic. FINDINGS From 178 collected scans, 313 nodes were annotated: 71 (23%) with ENE in general, 39 (13%) with ENE larger than 1 mm ENE. The deep learning algorithm AUC for ENE classification was 0·86 (95% CI 0·82-0·90), outperforming all readers (p<0·0001 for each). Among radiologists, there was high variability in specificity (43-86%) and sensitivity (45-96%) with poor inter-reader agreement (κ 0·32). Matching the algorithm specificity to that of the reader with highest AUC (R2, false positive rate 22%) yielded improved sensitivity to 75% (+ 13%). Setting the algorithm false positive rate to 30% yielded 90% sensitivity. The algorithm showed improved performance compared with radiologists for ENE larger than 1 mm (p<0·0001) and in nodes with short-axis diameter 1 cm or larger. INTERPRETATION The deep learning algorithm outperformed experts in predicting pathological ENE on a challenging cohort of patients with HPV-associated oropharyngeal carcinoma from a randomised clinical trial. Deep learning algorithms should be evaluated prospectively as a treatment selection tool. FUNDING ECOG-ACRIN Cancer Research Group and the National Cancer Institute of the US National Institutes of Health.
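Operating-point statements like "setting the algorithm false positive rate to 30% yielded 90% sensitivity" are read off the ROC curve. A minimal sketch, assuming scikit-learn and placeholder `y_true`/`y_score` arrays (not the trial data):

```python
# Sketch (not the study's code) of picking a ROC operating point at a fixed
# false-positive rate and reading off the corresponding sensitivity.
import numpy as np
from sklearn.metrics import roc_curve

def sensitivity_at_fpr(y_true, y_score, target_fpr: float):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # index of the largest achievable FPR that does not exceed the target
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1
    return tpr[idx], thresholds[idx]

# sens, thr = sensitivity_at_fpr(y_true, y_score, target_fpr=0.30)
# AUC itself would come from sklearn.metrics.roc_auc_score(y_true, y_score).
```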
Affiliation(s)
- Benjamin H Kann
- Department of Radiation Oncology, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA.
- Jirapat Likitlersuang
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Dennis Bontempi
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Zezhong Ye
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Sanjay Aneja
- Department of Therapeutic Radiology, New Haven, CT, USA
- Richard Bakst
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Amy F Juliano
- Mass Eye and Ear, Mass General Hospital, Boston, MA, USA
- Jeffrey P Guenette
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Ravindra Uppaluri
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Danielle N Margalit
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Jonathan D Schoenfeld
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Roy B Tishler
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Robert Haddad
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Hugo J W L Aerts
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA; Department of Radiology, Maastricht University, Maastricht, Netherlands
- Yael Flamand
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, ECOG-ACRIN Biostatistics Center, Boston, MA, USA
- Rathan M Subramaniam
- Department of Radiology and Nuclear Medicine, University of Notre Dame Australia, Sydney, NSW, Australia; Department of Radiology, Duke University, Durham, NC, USA
- Robert L Ferris
- Department of Otolaryngology, University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA
66
Liefgreen A, Weinstein N, Wachter S, Mittelstadt B. Beyond ideals: why the (medical) AI industry needs to motivate behavioural change in line with fairness and transparency values, and how it can do it. AI & Society 2023; 39:2183-2199. [PMID: 39309255; PMCID: PMC11415467; DOI: 10.1007/s00146-023-01684-3]
Abstract
Artificial intelligence (AI) is increasingly relied upon by clinicians for making diagnostic and treatment decisions, playing an important role in imaging, diagnosis, risk analysis, lifestyle monitoring, and health information management. While research has identified biases in healthcare AI systems and proposed technical solutions to address these, we argue that effective solutions require human engagement. Furthermore, there is a lack of research on how to motivate the adoption of these solutions and promote investment in designing AI systems that align with values such as transparency and fairness from the outset. Drawing on insights from psychological theories, we assert the need to understand the values that underlie decisions made by individuals involved in creating and deploying AI systems. We describe how this understanding can be leveraged to increase engagement with de-biasing and fairness-enhancing practices within the AI healthcare industry, ultimately leading to sustained behavioral change via autonomy-supportive communication strategies rooted in motivational and social psychology theories. In developing these pathways to engagement, we consider the norms and needs that govern the AI healthcare domain, and we evaluate incentives for maintaining the status quo against economic, legal, and social incentives for behavior change in line with transparency and fairness values.
Affiliation(s)
- Alice Liefgreen
- Hillary Rodham Clinton School of Law, University of Swansea, Swansea, SA2 8PP UK
- School of Psychology and Clinical Language Sciences, University of Reading, Whiteknights Road, Reading, RG6 6AL UK
- Netta Weinstein
- School of Psychology and Clinical Language Sciences, University of Reading, Whiteknights Road, Reading, RG6 6AL UK
- Sandra Wachter
- Oxford Internet Institute, University of Oxford, 1 St. Giles, Oxford, OX1 3JS UK
- Brent Mittelstadt
- Oxford Internet Institute, University of Oxford, 1 St. Giles, Oxford, OX1 3JS UK
67
Zbrzezny AM, Grzybowski AE. Deceptive Tricks in Artificial Intelligence: Adversarial Attacks in Ophthalmology. J Clin Med 2023; 12:3266. [PMID: 37176706; PMCID: PMC10179065; DOI: 10.3390/jcm12093266]
Abstract
The artificial intelligence (AI) systems used for diagnosing ophthalmic diseases have significantly progressed in recent years. The diagnosis of difficult eye conditions, such as cataracts, diabetic retinopathy, age-related macular degeneration, glaucoma, and retinopathy of prematurity, has become significantly less complicated as a result of the development of AI algorithms, which are currently on par with ophthalmologists in terms of their level of effectiveness. However, in the context of building AI systems for medical applications such as identifying eye diseases, addressing the challenges of safety and trustworthiness is paramount, including the emerging threat of adversarial attacks. Research has increasingly focused on understanding and mitigating these attacks, with numerous articles discussing this topic in recent years. As a starting point for our discussion, we used the paper by Ma et al. "Understanding Adversarial Attacks on Deep Learning Based Medical Image Analysis Systems". A literature review was performed for this study, which included a thorough search of open-access research papers using online sources (PubMed and Google). The research provides examples of unique attack strategies for medical images. Unfortunately, unique algorithms for attacks on the various ophthalmic image types have yet to be developed. It is a task that needs to be performed. As a result, it is necessary to build algorithms that validate the computation and explain the findings of artificial intelligence models. In this article, we focus on adversarial attacks, one of the most well-known attack methods, which provide evidence (i.e., adversarial examples) of the lack of resilience of decision models that do not include provable guarantees. Adversarial attacks have the potential to provide inaccurate findings in deep learning systems and can have catastrophic effects in the healthcare industry, such as healthcare financing fraud and wrong diagnosis.
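As a concrete example of the attack family the review surveys, the fast gradient sign method (FGSM) perturbs an input image in the direction that increases the classifier's loss. A minimal PyTorch sketch, offered as a generic illustration rather than code from any system cited here:

```python
# Minimal FGSM sketch in PyTorch; `model`, x, and y are placeholders.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon: float = 0.01):
    """Return an adversarially perturbed copy of image batch x."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()                          # gradient of loss w.r.t. pixels
    x_adv = x + epsilon * x.grad.sign()      # step in the gradient-sign direction
    return x_adv.clamp(0, 1).detach()        # keep pixel values in valid range
```

A perturbation this small is typically imperceptible to a reader of the image yet can flip a deep model's prediction, which is exactly the fragility the abstract describes for ophthalmic diagnosis systems.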
Affiliation(s)
- Agnieszka M Zbrzezny
- Faculty of Mathematics and Computer Science, University of Warmia and Mazury, 10-710 Olsztyn, Poland
- Faculty of Design, SWPS University of Social Sciences and Humanities, Chodakowska 19/31, 03-815 Warsaw, Poland
| | - Andrzej E Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, 60-836 Poznan, Poland
| |
Collapse
|
68
|
de Vries CF, Colosimo SJ, Staff RT, Dymiter JA, Yearsley J, Dinneen D, Boyle M, Harrison DJ, Anderson LA, Lip G. Impact of Different Mammography Systems on Artificial Intelligence Performance in Breast Cancer Screening. Radiol Artif Intell 2023; 5:e220146. [PMID: 37293340 PMCID: PMC10245180 DOI: 10.1148/ryai.220146] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2022] [Revised: 02/14/2023] [Accepted: 03/02/2023] [Indexed: 06/10/2023]
Abstract
Artificial intelligence (AI) tools may assist breast screening mammography programs, but limited evidence supports their generalizability to new settings. This retrospective study used a 3-year dataset (April 1, 2016-March 31, 2019) from a U.K. regional screening program. The performance of a commercially available breast screening AI algorithm was assessed with a prespecified and a site-specific decision threshold to evaluate whether its performance was transferable to a new clinical site. The dataset consisted of women (aged approximately 50-70 years) who attended routine screening, excluding self-referrals, those with complex physical requirements, those who had undergone a previous mastectomy, and screening episodes with technical recalls or without the four standard image views. In total, 55 916 screening attendees (mean age, 60 years ± 6 [SD]) met the inclusion criteria. The prespecified threshold resulted in a high recall rate (48.3%, 21 929 of 45 444), which fell to 13.0% (5896 of 45 444) after threshold calibration, closer to the observed service level (5.0%, 2774 of 55 916). Recall rates also increased approximately threefold following a software upgrade on the mammography equipment, requiring per-software-version thresholds. Using software-specific thresholds, the AI algorithm would have recalled 277 of 303 (91.4%) screen-detected cancers and 47 of 138 (34.1%) interval cancers. AI performance and thresholds should be validated for new clinical settings before deployment, and quality assurance systems should monitor AI performance for consistency. Keywords: Breast, Screening, Mammography, Computer Applications-Detection/Diagnosis, Neoplasms-Primary, Technology Assessment. Supplemental material is available for this article. © RSNA, 2023.
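The site-specific threshold calibration described here can be approximated by choosing the score cut-off whose exceedance fraction matches the intended recall-to-assessment rate. A minimal sketch, assuming a held-out set of AI suspicion scores; the score distribution below is synthetic, not from the study:

```python
import numpy as np

def calibrate_threshold(scores, target_recall_rate):
    """Return the cut-off above which roughly `target_recall_rate` of
    screening attendees would be flagged for assessment."""
    return float(np.quantile(scores, 1.0 - target_recall_rate))

rng = np.random.default_rng(0)
scores = rng.beta(2, 8, size=50_000)  # synthetic stand-in for one software version
threshold = calibrate_threshold(scores, target_recall_rate=0.05)
print(f"threshold={threshold:.3f}, flagged={(scores >= threshold).mean():.1%}")
```

Because the study found recall rates roughly tripling after a mammography software upgrade, such a threshold would need to be recomputed per software version and monitored thereafter.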
Collapse
|
69
|
Pham N, Hill V, Rauschecker A, Lui Y, Niogi S, Fillipi CG, Chang P, Zaharchuk G, Wintermark M. Critical Appraisal of Artificial Intelligence-Enabled Imaging Tools Using the Levels of Evidence System. AJNR Am J Neuroradiol 2023; 44:E21-E28. [PMID: 37080722 PMCID: PMC10171388 DOI: 10.3174/ajnr.a7850] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2022] [Accepted: 03/16/2023] [Indexed: 04/22/2023]
Abstract
Clinical adoption of an artificial intelligence-enabled imaging tool requires critical appraisal of its life cycle from development to implementation by using a systematic, standardized, and objective approach that can verify both its technical and clinical efficacy. Toward this concerted effort, the ASFNR/ASNR Artificial Intelligence Workshop Technology Working Group is proposing a hierarchical evaluation system based on the quality, type, and amount of scientific evidence that the artificial intelligence-enabled tool can demonstrate for each component of its life cycle. The current proposal is modeled after the levels of evidence in medicine, with the uppermost level of the hierarchy showing the strongest evidence for potential impact on patient care and health care outcomes. The intended goal of establishing an evidence-based evaluation system is to encourage transparency, to foster an understanding of how artificial intelligence tools are created and how they make decisions, and to report the relevant data on the efficacy of the artificial intelligence tools that are developed. The proposed system is an essential step in working toward a more formalized, clinically validated, and regulated framework for the safe and effective deployment of artificial intelligence imaging applications that will be used in clinical practice.
Collapse
Affiliation(s)
- N Pham
- From the Department of Radiology (N.P., G.Z.), Stanford School of Medicine, Palo Alto, California
| | - V Hill
- Department of Radiology (V.H.), Northwestern University Feinberg School of Medicine, Chicago, Illinois
| | - A Rauschecker
- Department of Radiology (A.R.), University of California, San Francisco, San Francisco, California
| | - Y Lui
- Department of Radiology (Y.L.), NYU Grossman School of Medicine, New York, New York
| | - S Niogi
- Department of Radiology (S.N.), Weill Cornell Medicine, New York, New York
| | - C G Fillipi
- Department of Radiology (C.G.F.), Tufts University School of Medicine, Boston, Massachusetts
| | - P Chang
- Department of Radiology (P.C.), University of California, Irvine, Irvine, California
| | - G Zaharchuk
- From the Department of Radiology (N.P., G.Z.), Stanford School of Medicine, Palo Alto, California
| | - M Wintermark
- Department of Neuroradiology (M.W.), The University of Texas MD Anderson Cancer Center, Houston, Texas
| |
Collapse
|
70
|
Steele L, Tan XL, Olabi B, Gao JM, Tanaka RJ, Williams HC. Determining the clinical applicability of machine learning models through assessment of reporting across skin phototypes and rarer skin cancer types: A systematic review. J Eur Acad Dermatol Venereol 2023; 37:657-665. [PMID: 36514990 DOI: 10.1111/jdv.18814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2022] [Accepted: 11/09/2022] [Indexed: 12/15/2022]
Abstract
Machine learning (ML) models for skin cancer recognition may have variable performance across different skin phototypes and skin cancer types. Overall performance metrics alone are insufficient to detect poor subgroup performance. We aimed (1) to assess whether studies of ML models reported results separately for different skin phototypes and rarer skin cancers, and (2) to graphically represent the skin cancer training datasets used by current ML models. In this systematic review, we searched PubMed, Embase and CENTRAL. We included all studies in medical journals assessing an ML technique for skin cancer diagnosis that used clinical or dermoscopic images from 1 January 2012 to 22 September 2021. No language restrictions were applied. We considered rarer skin cancers to be skin cancers other than pigmented melanoma, basal cell carcinoma and squamous cell carcinoma. We identified 114 studies for inclusion. Rarer skin cancers were included by 8/114 studies (7.0%), and results for a rarer skin cancer were reported separately in 1/114 studies (0.9%). Performance was reported across all skin phototypes in 1/114 studies (0.9%), but remained uncertain for skin phototypes I and VI owing to their minimal representation in the test dataset (9/3756 and 1/3756, respectively). For training datasets, public datasets were the most frequently used, the most widely used being the International Skin Imaging Collaboration (ISIC) archive (65/114 studies, 57.0%), but the largest datasets were private. Our review identified that most ML models did not report performance separately for rarer skin cancers and different skin phototypes. A degree of variability in ML model performance across subgroups is expected, but the current lack of transparency is not justifiable and risks models being used inappropriately in populations in whom accuracy is low.
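A hedged sketch of the subgroup reporting this review calls for: computing sensitivity separately per skin phototype, together with the supporting case counts, so that near-empty strata (such as 9 or 1 of 3756 test images) are flagged rather than averaged away. The labels and group codes below are illustrative only.

```python
import numpy as np

def sensitivity_by_subgroup(y_true, y_pred, groups):
    """Per-subgroup sensitivity plus the number of positive cases it rests on."""
    report = {}
    for g in np.unique(groups):
        mask = (groups == g) & (y_true == 1)
        n_pos = int(mask.sum())
        sens = float(y_pred[mask].mean()) if n_pos else float("nan")
        report[g] = (sens, n_pos)
    return report

# Illustrative data: phototype per image, 1 = malignant lesion present/predicted.
y_true = np.array([1, 1, 1, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1])
groups = np.array(["I", "I", "III", "III", "VI", "VI"])
print(sensitivity_by_subgroup(y_true, y_pred, groups))  # note n_pos = 1 for "VI"
```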
Collapse
Affiliation(s)
- Lloyd Steele
- Department of Dermatology, The Royal London Hospital, London, UK
- Centre for Cell Biology and Cutaneous Research, Blizard Institute, Queen Mary University of London, London, UK
| | - Xiang Li Tan
- St George's University Hospitals NHS Foundation Trust, London, UK
| | - Bayanne Olabi
- Biosciences Institute, Newcastle University, Newcastle, UK
| | - Jing Mia Gao
- Department of Dermatology, The Royal London Hospital, London, UK
| | - Reiko J Tanaka
- Department of Bioengineering, Imperial College London, London, UK
| | - Hywel C Williams
- Centre of Evidence-Based Dermatology, School of Medicine, University of Nottingham, Nottingham, UK
| |
Collapse
|
71
|
Lundström C, Lindvall M. Mapping the Landscape of Care Providers' Quality Assurance Approaches for AI in Diagnostic Imaging. J Digit Imaging 2023; 36:379-387. [PMID: 36352164 PMCID: PMC10039170 DOI: 10.1007/s10278-022-00731-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2022] [Revised: 10/26/2022] [Accepted: 10/28/2022] [Indexed: 11/10/2022] Open
Abstract
The discussion on artificial intelligence (AI) solutions in diagnostic imaging has matured in recent years. The potential value of AI adoption is well established, as are the associated risks. Much focus has, rightfully, been on regulatory certification of AI products, strongly incentivized because certification is an enabling step for commercial actors. It is, however, becoming evident that regulatory approval is not enough to ensure safe and effective AI usage in the local setting. In other words, care providers need to develop and implement quality assurance (QA) approaches for AI solutions in diagnostic imaging. The domain of AI-specific QA is still in an early development phase. We contribute to this development by describing the current landscape of QA-for-AI approaches in medical imaging, focusing on radiology and pathology. We map the potential quality threats and review the existing QA approaches in relation to those threats. We propose a practical categorization of QA approaches based on key characteristics corresponding to means, situation, and purpose. The review highlights the heterogeneity of methods and practices relevant to this domain and points to targets for future research efforts.
Collapse
Affiliation(s)
- Claes Lundström
- Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden.
- Sectra AB, Linköping, Sweden.
| | | |
Collapse
|
72
|
Redrup Hill E, Mitchell C, Brigden T, Hall A. Ethical and legal considerations influencing human involvement in the implementation of artificial intelligence in a clinical pathway: A multi-stakeholder perspective. Front Digit Health 2023; 5:1139210. [PMID: 36999168 PMCID: PMC10043985 DOI: 10.3389/fdgth.2023.1139210] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2023] [Accepted: 02/23/2023] [Indexed: 03/18/2023] Open
Abstract
INTRODUCTION Ethical and legal factors will have an important bearing on when and whether automation is appropriate in healthcare. There is a developing literature on the ethics of artificial intelligence (AI) in health, including specific legal or regulatory questions such as whether there is a right to an explanation of AI decision-making. However, there has been limited consideration of the specific ethical and legal factors that influence when, and in what form, human involvement may be required in the implementation of AI in a clinical pathway, and the views of the wide range of stakeholders involved. To address this question, we chose the exemplar of the pathway for the early detection of Barrett's Oesophagus (BE) and oesophageal adenocarcinoma, where Gehrung and colleagues have developed a “semi-automated”, deep-learning system to analyse samples from the Cytosponge™ TFF3 test (a minimally invasive alternative to endoscopy), where AI promises to mitigate increasing demands for pathologists' time and input. METHODS We gathered a multidisciplinary group of stakeholders, including developers, patients, healthcare professionals and regulators, to obtain their perspectives on the ethical and legal issues that may arise using this exemplar. RESULTS The findings are grouped under six general themes: risk and potential harms; impacts on human experts; equity and bias; transparency and oversight; patient information and choice; accountability, moral responsibility and liability for error. Within these themes, a range of subtle and context-specific elements emerged, highlighting the importance of pre-implementation, interdisciplinary discussions and appreciation of pathway-specific considerations. DISCUSSION To evaluate these findings, we draw on the well-established principles of biomedical ethics identified by Beauchamp and Childress as a lens through which to view these results and their implications for personalised medicine. Our findings are not only relevant to this context but have implications for AI in digital pathology and healthcare more broadly.
Collapse
|
73
|
Glocker B, Jones C, Bernhardt M, Winzeck S. Algorithmic encoding of protected characteristics in chest X-ray disease detection models. EBioMedicine 2023; 89:104467. [PMID: 36791660 PMCID: PMC10025760 DOI: 10.1016/j.ebiom.2023.104467] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2022] [Revised: 01/23/2023] [Accepted: 01/24/2023] [Indexed: 02/16/2023] Open
Abstract
BACKGROUND It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics, and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring methodology for subgroup analysis in image-based disease detection models. METHODS We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups. FINDINGS We confirm subgroup disparities in terms of shifted true and false positive rates which are partially removed after correcting for population and prevalence shifts in the test sets. We find that transfer learning alone is insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable insights about the way protected characteristics are encoded in the feature representations of deep neural networks. INTERPRETATION Subgroup analysis is key for identifying performance disparities of AI models, but statistical differences across subgroups need to be taken into account when analyzing potential biases in disease detection. The proposed methodology provides a comprehensive framework for subgroup analysis enabling further research into the underlying causes of disparities. FUNDING European Research Council Horizon 2020, UK Research and Innovation.
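One ingredient of the proposed methodology, test-set resampling, can be sketched as follows: each subgroup's test set is redrawn at a common disease prevalence so that true- and false-positive-rate comparisons are not confounded by prevalence shift. This is a simplified reading of the approach, not the authors' released code, and the function names are invented for illustration.

```python
import numpy as np

def resample_to_prevalence(y_true, scores, prevalence, n, rng):
    """Bootstrap a test set of size `n` with a fixed disease prevalence."""
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == 0)
    n_pos = int(round(n * prevalence))
    idx = np.concatenate([rng.choice(pos, n_pos, replace=True),
                          rng.choice(neg, n - n_pos, replace=True)])
    return y_true[idx], scores[idx]

def tpr_fpr(y_true, scores, threshold=0.5):
    """True- and false-positive rates at a shared operating threshold."""
    pred = scores >= threshold
    return pred[y_true == 1].mean(), pred[y_true == 0].mean()
```

Applying `tpr_fpr` to each subgroup's resampled set at a shared threshold then isolates genuine performance disparities from differences in case mix.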
Collapse
Affiliation(s)
- Ben Glocker
- Department of Computing, Imperial College London, London, SW7 2AZ, UK.
| | - Charles Jones
- Department of Computing, Imperial College London, London, SW7 2AZ, UK
| | - Mélanie Bernhardt
- Department of Computing, Imperial College London, London, SW7 2AZ, UK
| | - Stefan Winzeck
- Department of Computing, Imperial College London, London, SW7 2AZ, UK
| |
Collapse
|
74
|
Taribagil P, Hogg HDJ, Balaskas K, Keane PA. Integrating artificial intelligence into an ophthalmologist’s workflow: obstacles and opportunities. EXPERT REVIEW OF OPHTHALMOLOGY 2023. [DOI: 10.1080/17469899.2023.2175672] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 02/11/2023]
Affiliation(s)
- Priyal Taribagil
- Medical Retina Department, Moorfields Eye Hospital NHS Foundation Trust, London, UK
| | - HD Jeffry Hogg
- Medical Retina Department, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Department of Population Health Science, Population Health Science Institute, Newcastle University, Newcastle upon Tyne, UK
- Department of Ophthalmology, Newcastle upon Tyne Hospitals NHS Foundation Trust, Freeman Road, Newcastle upon Tyne, UK
| | - Konstantinos Balaskas
- NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Medical Retina, Institute of Ophthalmology, University College of London Institute of Ophthalmology, London, UK
| | - Pearse A Keane
- NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Medical Retina, Institute of Ophthalmology, University College of London Institute of Ophthalmology, London, UK
| |
Collapse
|
75
|
Beyond the AJR: Validation and Algorithmic Audit of a Deep Learning System to Detect Hip Fractures Radiographically. AJR Am J Roentgenol 2023; 220:150. [PMID: 35674349 DOI: 10.2214/ajr.22.28053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]
|
76
|
Sapey E, Gallier S, Evison F, McNulty D, Reeves K, Ball S. Variability and performance of NHS England's 'reason to reside' criteria in predicting hospital discharge in acute hospitals in England: a retrospective, observational cohort study. BMJ Open 2022; 12:e065862. [PMID: 36572492 PMCID: PMC9805825 DOI: 10.1136/bmjopen-2022-065862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/23/2022] [Accepted: 12/08/2022] [Indexed: 12/27/2022] Open
Abstract
OBJECTIVES NHS England (NHSE) advocates 'reason to reside' (R2R) criteria to support discharge planning. The proportion of patients without R2R and their rate of discharge are reported daily by acute hospitals in England. R2R has no interoperable standardised data model (SDM), and its performance has not been validated. We aimed to understand the degree of intercentre and intracentre variation in R2R-related metrics reported to NHSE, define an SDM implemented within a single centre Electronic Health Record to generate an electronic R2R (eR2R) and evaluate its performance in predicting subsequent discharge. DESIGN Retrospective observational cohort study using routinely collected health data. SETTING 122 NHS Trusts in England for national reporting and an acute hospital in England for local reporting. PARTICIPANTS 6 602 706 patient-days were analysed using 3 months of national data, and 1 039 592 patient-days using 3 years of single-centre data. MAIN OUTCOME MEASURES Variability in R2R-related metrics reported to NHSE. Performance of eR2R in predicting discharge within 24 hours. RESULTS There were high levels of intracentre and intercentre variability in R2R-related metrics (p<0.0001) but not in eR2R. Informedness of eR2R for discharge within 24 hours was low (J-statistic 0.09-0.12 across three consecutive years). In those remaining in hospital without eR2R, 61.2% met eR2R criteria on subsequent days (76% within 24 hours), most commonly due to increased NEWS2 (21.9%) or intravenous therapy administration (32.8%). CONCLUSIONS Reported R2R metrics are highly variable between and within acute Trusts in England. Although case-mix or community care provision may account for some variability, the absence of an SDM prevents standardised reporting. Following the development of an SDM in one acute Trust, the variability reduced. However, the performance of eR2R was poor, prone to change even when negative, and unable to meaningfully contribute to discharge planning.
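The informedness measure used as the main outcome here is Youden's J statistic. A small sketch with made-up counts shows how a J of about 0.1, as reported, corresponds to prediction barely better than chance:

```python
def informedness(tp, fn, tn, fp):
    """Youden's J = sensitivity + specificity - 1; 0 is chance, 1 is perfect."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

# Made-up daily counts: positive = discharged within 24 h, flag = eR2R-negative.
print(informedness(tp=1200, fn=1800, tn=4900, fp=2100))  # -> ~0.10
```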
Collapse
Affiliation(s)
- Elizabeth Sapey
- PIONEER Data Hub, University of Birmingham, Birmingham, UK
- Department of Acute Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
| | - Suzy Gallier
- PIONEER Data Hub, University of Birmingham, Birmingham, UK
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
| | - Felicity Evison
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
| | - David McNulty
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
| | - Katherine Reeves
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
| | - Simon Ball
- Renal Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, West Midlands, UK
- Better Care Programme and Midlands Site, HDR UK, Birmingham, West Midlands, UK
| |
Collapse
|
77
|
Müller L, Kloeckner R, Mildenberger P, Pinto Dos Santos D. [Validation and implementation of artificial intelligence in radiology : Quo vadis in 2022?]. RADIOLOGIE (HEIDELBERG, GERMANY) 2022; 63:381-386. [PMID: 36510007 DOI: 10.1007/s00117-022-01097-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 11/17/2022] [Indexed: 12/14/2022]
Abstract
BACKGROUND The hype around artificial intelligence (AI) in radiology continues and the number of approved AI tools is growing steadily. Despite the great potential, integration into clinical routine in radiology remains limited. In addition, the large number of individual applications poses a challenge for clinical routine, as individual applications have to be selected for different questions and organ systems, which increases the complexity and time required. OBJECTIVES This review will discuss the current status of validation and implementation of AI tools in clinical routine, and identify possible approaches for an improved assessment of the generalizability of results of AI tools. MATERIALS AND METHODS A literature search in various literature and product databases as well as publications, position papers, and reports from various stakeholders was conducted for this review. RESULTS Scientific evidence and independent validation studies are available for only a few commercial AI tools, and the generalizability of the results often remains questionable. CONCLUSIONS One challenge is the multitude of offerings for individual, specific application areas by a large number of manufacturers, making integration into the existing site-specific IT infrastructure more difficult. Furthermore, remuneration by health insurance companies in Germany for the use of AI tools in clinical routine is lacking. However, for reimbursement to be granted, the clinical utility of new applications must first be proven, and such proof is lacking for most applications.
Collapse
Affiliation(s)
- Lukas Müller
- Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, Universitätsmedizin Mainz, Langenbeckstr. 1, 55131, Mainz, Germany.
| | - Roman Kloeckner
- Institut für Interventionelle Radiologie, Universitätsklinikum Schleswig-Holstein - Campus Lübeck, Lübeck, Germany
| | - Peter Mildenberger
- Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, Universitätsmedizin Mainz, Langenbeckstr. 1, 55131, Mainz, Germany
| | - Daniel Pinto Dos Santos
- Institut für Diagnostische und Interventionelle Radiologie, Uniklinik Köln, Köln, Germany
- Institut für Diagnostische und Interventionelle Radiologie, Universitätsklinikum Frankfurt, Frankfurt am Main, Germany
| |
Collapse
|
78
|
van de Sande D, van Genderen ME, Braaf H, Gommers D, van Bommel J. Moving towards clinical use of artificial intelligence in intensive care medicine: business as usual? Intensive Care Med 2022; 48:1815-1817. [PMID: 36269330 DOI: 10.1007/s00134-022-06910-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/07/2022] [Indexed: 11/05/2022]
Affiliation(s)
- Davy van de Sande
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD, Rotterdam, The Netherlands
| | - Michel E van Genderen
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD, Rotterdam, The Netherlands.
| | - Heleen Braaf
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD, Rotterdam, The Netherlands
| | - Diederik Gommers
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD, Rotterdam, The Netherlands
| | - Jasper van Bommel
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD, Rotterdam, The Netherlands
| |
Collapse
|
79
|
Developing robust benchmarks for driving forward AI innovation in healthcare. NAT MACH INTELL 2022. [DOI: 10.1038/s42256-022-00559-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
80
|
Monteith S, Glenn T, Geddes J, Whybrow PC, Achtyes E, Bauer M. Expectations for Artificial Intelligence (AI) in Psychiatry. Curr Psychiatry Rep 2022; 24:709-721. [PMID: 36214931 PMCID: PMC9549456 DOI: 10.1007/s11920-022-01378-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 09/15/2022] [Indexed: 01/29/2023]
Abstract
PURPOSE OF REVIEW Artificial intelligence (AI) is often presented as a transformative technology for clinical medicine even though the current technology maturity of AI is low. The purpose of this narrative review is to describe the complex reasons for the low technology maturity and set realistic expectations for the safe, routine use of AI in clinical medicine. RECENT FINDINGS For AI to be productive in clinical medicine, many diverse factors that contribute to the low maturity level need to be addressed. These include technical problems such as data quality, dataset shift, black-box opacity, validation and regulatory challenges, and human factors such as a lack of education in AI, workflow changes, automation bias, and deskilling. There will also be new and unanticipated safety risks with the introduction of AI. The solutions to these issues are complex and will take time to discover, develop, validate, and implement. However, addressing the many problems in a methodical manner will expedite the safe and beneficial use of AI to augment medical decision making in psychiatry.
Collapse
Affiliation(s)
- Scott Monteith
- Michigan State University College of Human Medicine, Traverse City Campus, Traverse City, MI, 49684, USA.
| | - Tasha Glenn
- ChronoRecord Association, Fullerton, CA, USA
| | - John Geddes
- Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford, UK
| | - Peter C Whybrow
- Department of Psychiatry and Biobehavioral Sciences, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles (UCLA), Los Angeles, CA, USA
| | - Eric Achtyes
- Michigan State University College of Human Medicine, Grand Rapids, MI, 49684, USA
- Network180, Grand Rapids, MI, USA
| | - Michael Bauer
- Department of Psychiatry and Psychotherapy, University Hospital Carl Gustav Carus Medical Faculty, Technische Universität Dresden, Dresden, Germany
| |
Collapse
|
81
|
Mascagni P, Alapatt D, Sestini L, Altieri MS, Madani A, Watanabe Y, Alseidi A, Redan JA, Alfieri S, Costamagna G, Boškoski I, Padoy N, Hashimoto DA. Computer vision in surgery: from potential to clinical value. NPJ Digit Med 2022; 5:163. [PMID: 36307544 PMCID: PMC9616906 DOI: 10.1038/s41746-022-00707-5] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2022] [Accepted: 10/10/2022] [Indexed: 11/09/2022] Open
Abstract
Hundreds of millions of operations are performed worldwide each year, and the rising uptake in minimally invasive surgery has enabled fiber optic cameras and robots to become both important tools to conduct surgery and sensors from which to capture information about surgery. Computer vision (CV), the application of algorithms to analyze and interpret visual data, has become a critical technology through which to study the intraoperative phase of care with the goals of augmenting surgeons' decision-making processes, supporting safer surgery, and expanding access to surgical care. While much work has been performed on potential use cases, there are currently no CV tools widely used for diagnostic or therapeutic applications in surgery. Using laparoscopic cholecystectomy as an example, we review current CV techniques that have been applied to minimally invasive surgery and their clinical applications. Finally, we discuss the challenges and obstacles that remain to be overcome for broader implementation and adoption of CV in surgery.
Collapse
Affiliation(s)
- Pietro Mascagni
- Gemelli Hospital, Catholic University of the Sacred Heart, Rome, Italy.
- IHU-Strasbourg, Institute of Image-Guided Surgery, Strasbourg, France.
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada.
| | - Deepak Alapatt
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
| | - Luca Sestini
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
| | - Maria S Altieri
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Amin Madani
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University Health Network, Toronto, ON, Canada
| | - Yusuke Watanabe
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Hokkaido, Hokkaido, Japan
| | - Adnan Alseidi
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of California San Francisco, San Francisco, CA, USA
| | - Jay A Redan
- Department of Surgery, AdventHealth-Celebration Health, Celebration, FL, USA
| | - Sergio Alfieri
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
| | - Guido Costamagna
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
| | - Ivo Boškoski
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
| | - Nicolas Padoy
- IHU-Strasbourg, Institute of Image-Guided Surgery, Strasbourg, France
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
| | - Daniel A Hashimoto
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| |
Collapse
|
82
|
Garrucho L, Kushibar K, Jouide S, Diaz O, Igual L, Lekadir K. Domain generalization in deep learning based mass detection in mammography: A large-scale multi-center study. Artif Intell Med 2022; 132:102386. [PMID: 36207090 DOI: 10.1016/j.artmed.2022.102386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2022] [Revised: 08/07/2022] [Accepted: 08/19/2022] [Indexed: 11/02/2022]
Abstract
Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization of artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this study, we explored the domain generalization of deep learning methods for mass detection in digital mammography and analyzed in depth the sources of domain shift in a large-scale multi-center setting. To this end, we compared the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline was designed to improve the domain generalization without requiring images from the new domain. The results show that our workflow generalized better than state-of-the-art transfer-learning-based approaches in four out of five domains while reducing the domain shift caused by the different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis was performed to identify the covariate shifts with the greatest effects on detection performance, such as those due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning based breast cancer detection.
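The core experimental design, training in a single source domain and testing on unseen domains, can be miniaturized as below. The toy "domains" differ only in noise scale; real mammography domains differ by scanner, protocol, and population, but the per-domain reporting pattern is the same. This is an illustration of the evaluation design under invented data, not the paper's detection pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_domain(noise, n=2000):
    """Toy stand-in for one acquisition site: class overlap grows with noise."""
    y = rng.integers(0, 2, n)
    X = rng.normal(loc=y[:, None], scale=noise, size=(n, 4))
    return X, y

X_src, y_src = make_domain(noise=1.0)           # single source domain
model = LogisticRegression().fit(X_src, y_src)

# Reporting AUC per unseen domain exposes the generalization gap
# that a single pooled score would hide.
for name, noise in [("source", 1.0), ("site_B", 2.0), ("site_C", 3.0)]:
    X, y = make_domain(noise)
    print(name, round(roc_auc_score(y, model.predict_proba(X)[:, 1]), 3))
```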
Collapse
Affiliation(s)
- Lidia Garrucho
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain.
| | - Kaisar Kushibar
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
| | - Socayna Jouide
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
| | - Oliver Diaz
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
| | - Laura Igual
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
| | - Karim Lekadir
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
| |
Collapse
|
83
|
Fehr J, Jaramillo-Gutierrez G, Oala L, Gröschel MI, Bierwirth M, Balachandran P, Werneck-Leite A, Lippert C. Piloting a Survey-Based Assessment of Transparency and Trustworthiness with Three Medical AI Tools. Healthcare (Basel) 2022; 10:1923. [PMID: 36292369 PMCID: PMC9601535 DOI: 10.3390/healthcare10101923] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2022] [Revised: 09/18/2022] [Accepted: 09/21/2022] [Indexed: 11/04/2022] Open
Abstract
Artificial intelligence (AI) offers the potential to support healthcare delivery, but poorly trained or validated algorithms bear risks of harm. Ethical guidelines identify transparency about model development and validation as a requirement for trustworthy AI. Abundant guidance exists for providing transparency through reporting, yet poorly reported medical AI tools remain common. To close this transparency gap, we developed and piloted a framework to quantify the transparency of medical AI tools using three use cases. Our framework comprises a survey covering the intended use, training and validation data and processes, ethical considerations, and deployment recommendations. The transparency of each response was scored as 0, 0.5, or 1 to reflect whether the requested information was not, partially, or fully provided. Additionally, we assessed on an analogous three-point scale whether the provided responses fulfilled the transparency requirement for a set of trustworthiness criteria from ethical guidelines. The degree of transparency and trustworthiness was calculated on a scale from 0% to 100%. Our assessment of the three medical AI use cases pinpointed reporting gaps, yielding transparency scores of 67% for two use cases and 59% for the third. We report anecdotal evidence that business constraints and limited information from external datasets were major obstacles to providing transparency in the three use cases. The observed transparency gaps also lowered the degree of trustworthiness, indicating compliance gaps with ethical guidelines. All three pilot use cases faced challenges in providing transparency about medical AI tools, but more studies are needed to investigate these across the wider medical AI sector. Applying this framework for an external assessment of transparency may be infeasible if business constraints prevent the disclosure of information. New strategies may be necessary to enable audits of medical AI tools while preserving business secrets.
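The scoring rule described, 0, 0.5, or 1 per reported item, averaged to a 0-100% scale, reduces to a few lines. The item names below are invented for illustration; the framework's actual survey is more extensive.

```python
def transparency_score(item_scores):
    """Mean of per-item scores (0 = not, 0.5 = partially, 1 = fully reported),
    expressed as a percentage."""
    return 100.0 * sum(item_scores.values()) / len(item_scores)

# Invented example items for one medical AI tool.
item_scores = {
    "intended_use": 1.0,
    "training_data": 0.5,
    "validation_process": 1.0,
    "ethical_considerations": 0.5,
    "deployment_recommendations": 0.0,
}
print(f"transparency: {transparency_score(item_scores):.0f}%")  # -> 60%
```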
Collapse
Affiliation(s)
- Jana Fehr
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Digital Health & Machine Learning, Hasso Plattner Institute, 14482 Potsdam, Germany
| | | | - Luis Oala
- Department of Artificial Intelligence, Fraunhofer HHI, 10587 Berlin, Germany
| | - Matthias I. Gröschel
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
| | - Manuel Bierwirth
- ITU/WHO Focus Group AI4H, 1211 Geneva, Switzerland
- Alumnus Goethe Frankfurt University, 60323 Frankfurt am Main, Germany
| | - Pradeep Balachandran
- ITU/WHO Focus Group AI4H, 1211 Geneva, Switzerland
- Technical Consultant (Digital Health), Thiruvananthapuram 695010, India
| | | | - Christoph Lippert
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Digital Health & Machine Learning, Hasso Plattner Institute, 14482 Potsdam, Germany
| |
Collapse
|
84
|
Denniston AK, Kale AU, Lee WH, Mollan SP, Keane PA. Building trust in real-world data: lessons from INSIGHT, the UK's health data research hub for eye health and oculomics. Curr Opin Ophthalmol 2022; 33:399-406. [PMID: 35916569 DOI: 10.1097/icu.0000000000000887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
PURPOSE OF REVIEW In this review, we consider the challenges of creating a trusted resource for real-world data in ophthalmology, based on our experience of establishing INSIGHT, the UK's Health Data Research Hub for Eye Health and Oculomics. RECENT FINDINGS The INSIGHT Health Data Research Hub maximizes the benefits and impact of historical, patient-level UK National Health Service (NHS) electronic health record data, including images, by making the data research-ready through curation and anonymisation. It is built around a shared 'north star' of enabling research for patient benefit. INSIGHT has worked to establish patient and public trust in its concept and delivery, with efficient and robust governance processes that support safe and secure access to data for researchers. By linking to systemic data, there is an opportunity for discovery of novel ophthalmic biomarkers of systemic diseases ('oculomics'). Datasets that represent the whole population are an important tool to address the increasingly recognized threat of health data poverty. SUMMARY Enabling efficient, safe access to routinely collected clinical data is a substantial undertaking, especially when this includes imaging modalities, but it provides an exceptional resource for research. Research and innovation built on inclusive real-world data are an important tool for ensuring that the discoveries and technologies of the future not only favour selected groups but work for all patients.
Collapse
Affiliation(s)
- Alastair K Denniston
- INSIGHT Health Data Research hub for Eye Health
- Academic Unit of Ophthalmology, Institute of Inflammation & Ageing, College of Medical and Dental Sciences, University of Birmingham
- Ophthalmology Department, University Hospitals Birmingham NHS Foundation Trust, Birmingham
| | - Aditya U Kale
- INSIGHT Health Data Research hub for Eye Health
- Academic Unit of Ophthalmology, Institute of Inflammation & Ageing, College of Medical and Dental Sciences, University of Birmingham
| | - Wen Hwa Lee
- INSIGHT Health Data Research hub for Eye Health
- Action Against Age-Related Macular Degeneration, London
| | - Susan P Mollan
- INSIGHT Health Data Research hub for Eye Health
- Ophthalmology Department, University Hospitals Birmingham NHS Foundation Trust, Birmingham
- Institute of Metabolism and Systems Research, College of Medical and Dental Sciences, University of Birmingham
| | - Pearse A Keane
- INSIGHT Health Data Research hub for Eye Health
- NIHR Biomedical Research Centre At Moorfields Eye Hospital NHS Foundation Trust, UCL Institute of Ophthalmology, London, UK
| |
Collapse
|
85
|
Albert K, Delano M. Sex trouble: Sex/gender slippage, sex confusion, and sex obsession in machine learning using electronic health records. PATTERNS (NEW YORK, N.Y.) 2022; 3:100534. [PMID: 36033589 PMCID: PMC9403398 DOI: 10.1016/j.patter.2022.100534] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
False assumptions that sex and gender are binary, static, and concordant are deeply embedded in the medical system. As machine learning researchers use medical data to build tools to solve novel problems, understanding how existing systems represent sex/gender incorrectly is necessary to avoid perpetuating harm. In this perspective, we identify and discuss three factors to consider when working with sex/gender in research: "sex/gender slippage," the frequent substitution of sex and sex-related terms for gender and vice versa; "sex confusion," the fact that any given sex variable holds many different potential meanings; and "sex obsession," the idea that the relevant variable for most inquiries related to sex/gender is sex assigned at birth. We then explore how these phenomena show up in medical machine learning research using electronic health records, with a specific focus on HIV risk prediction. Finally, we offer recommendations about how machine learning researchers can engage more carefully with questions of sex/gender.
Collapse
Affiliation(s)
- Kendra Albert
- Cyberlaw Clinic, Harvard Law School, Cambridge, MA 02138, USA
| | - Maggie Delano
- Engineering Department, Swarthmore College, Swarthmore, PA 19146, USA
| |
Collapse
|
86
|
Arora A, Arora A. Generative adversarial networks and synthetic patient data: current challenges and future perspectives. Future Healthc J 2022; 9:190-193. [DOI: 10.7861/fhj.2022-0013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
87
|
Oakden-Rayner L, Gale W, Bonham TA, Lungren MP, Carneiro G, Bradley AP, Palmer LJ. Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study. Lancet Digit Health 2022; 4:e351-e358. [PMID: 35396184 DOI: 10.1016/s2589-7500(22)00004-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2021] [Revised: 11/02/2021] [Accepted: 01/12/2022] [Indexed: 02/01/2023]
Abstract
BACKGROUND Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can underestimate the risks of artificial intelligence-based diagnostic systems. METHODS We present a preclinical evaluation of a deep learning model intended to detect proximal femoral fractures in frontal x-ray films in emergency department patients, trained on films from the Royal Adelaide Hospital (Adelaide, SA, Australia). This evaluation included a reader study comparing the performance of the model against five radiologists (three musculoskeletal specialists and two general radiologists) on a dataset of 200 fracture cases and 200 non-fractures (also from the Royal Adelaide Hospital), an external validation study using a dataset obtained from Stanford University Medical Center, CA, USA, and an algorithmic audit to detect any unusual or unexpected model behaviour. FINDINGS In the reader study, the area under the receiver operating characteristic curve (AUC) for the performance of the deep learning model was 0·994 (95% CI 0·988-0·999) compared with an AUC of 0·969 (0·960-0·978) for the five radiologists. This strong model performance was maintained on external validation, with an AUC of 0·980 (0·931-1·000). However, the preclinical evaluation identified barriers to safe deployment, including a substantial shift in the model operating point on external validation and an increased error rate on cases with abnormal bones (eg, Paget's disease). INTERPRETATION The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behaviour even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions. FUNDING None.
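One audit finding, a shifted model operating point on external validation, can be reproduced in miniature: fix a threshold on internal data (for example, at 95% sensitivity), then re-measure sensitivity and specificity on an external cohort whose score distribution has drifted. All numbers below are synthetic and stand in for the audit pattern, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)

def cohort(mu_fracture, n=1000):
    """Synthetic cohort: fracture scores ~ N(mu_fracture, 1), others ~ N(0, 1)."""
    y = np.repeat([1, 0], n)
    scores = np.concatenate([rng.normal(mu_fracture, 1, n), rng.normal(0, 1, n)])
    return scores, y

def sens_spec(scores, y, threshold):
    pred = scores >= threshold
    return pred[y == 1].mean(), (~pred)[y == 0].mean()

s_int, y_int = cohort(mu_fracture=3.0)            # internal test data
thr = np.quantile(s_int[y_int == 1], 0.05)        # ~95% sensitivity internally
s_ext, y_ext = cohort(mu_fracture=2.2)            # external site: drifted scores
print("internal:", sens_spec(s_int, y_int, thr))
print("external:", sens_spec(s_ext, y_ext, thr))  # sensitivity drops at same cut-off
```

Even when the area under the curve holds up externally, the fixed threshold no longer delivers the intended sensitivity, which is why the authors argue for algorithmic audits beyond headline discrimination metrics.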
Collapse
Affiliation(s)
- Lauren Oakden-Rayner
- School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia.
| | - William Gale
- Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia; School of Computer Science, University of Adelaide, Adelaide, SA, Australia
| | - Thomas A Bonham
- Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA
| | - Matthew P Lungren
- Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA; Stanford Artificial Intelligence in Medicine and Imaging Center, Stanford University, Stanford, CA, USA
| | - Gustavo Carneiro
- Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia
| | - Andrew P Bradley
- Science and Engineering Faculty, Queensland University of Technology, Brisbane, QLD, Australia
| | - Lyle J Palmer
- School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia
| |
Collapse
|