51
Lyell D, Wang Y, Coiera E, Magrabi F. More than algorithms: an analysis of safety events involving ML-enabled medical devices reported to the FDA. J Am Med Inform Assoc 2023;30:1227-1236. [PMID: 37071804; PMCID: PMC10280342; DOI: 10.1093/jamia/ocad065]
Abstract
OBJECTIVE To examine the real-world safety problems involving machine learning (ML)-enabled medical devices. MATERIALS AND METHODS We analyzed 266 safety events involving approved ML medical devices reported to the US FDA's MAUDE database between 2015 and October 2021. Events were reviewed against an existing framework for safety problems with Health IT to identify whether a reported problem was due to the ML device (device problem) or its use (use problem), and to identify key contributors to the problem. Consequences of events were also classified. RESULTS Events described hazards with potential to harm (66%), actual harm (16%), consequences for healthcare delivery (9%), near misses that would have led to harm if not for intervention (4%), no harm or consequences (3%), and complaints (2%). While most events involved device problems (93%), use problems (7%) were 4 times more likely to result in harm (relative risk 4.2; 95% CI 2.5-7). Problems with data input to ML devices were the top contributor to events (82%). DISCUSSION Much of what is known about ML safety comes from case studies and the theoretical limitations of ML. We contribute a systematic analysis of ML safety problems captured as part of the FDA's routine post-market surveillance. Most problems involved devices and concerned the acquisition of data for processing by algorithms. However, problems with the use of devices were more likely to cause harm. CONCLUSIONS Safety problems with ML devices involve more than algorithms, highlighting the need for a whole-of-system approach to safe implementation, with a special focus on how users interact with devices.
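As a back-of-the-envelope check, a relative risk of the kind reported above can be reproduced from a 2×2 table of harm by problem type. A minimal sketch with hypothetical counts chosen only to illustrate the standard log(RR) confidence interval; the actual event counts are in the paper:

```python
import math

# Hypothetical 2x2 table (illustrative only, not the paper's data):
harmed_use, total_use = 10, 19         # use problems
harmed_device, total_device = 31, 247  # device problems

p_use = harmed_use / total_use
p_device = harmed_device / total_device
rr = p_use / p_device

# 95% CI via the usual normal approximation on log(RR):
# SE = sqrt((1-p1)/a + (1-p2)/c) for a, c harmed counts.
se = math.sqrt((1 - p_use) / harmed_use + (1 - p_device) / harmed_device)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(f"RR = {rr:.1f} (95% CI {lo:.1f}-{hi:.1f})")
```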
Affiliation(s)
- David Lyell
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Ying Wang
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Enrico Coiera
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Farah Magrabi
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
52
González C, Ranem A, Pinto Dos Santos D, Othman A, Mukhopadhyay A. Lifelong nnU-Net: a framework for standardized medical continual learning. Sci Rep 2023;13:9381. [PMID: 37296233; PMCID: PMC10256748; DOI: 10.1038/s41598-023-34484-2]
Abstract
As the enthusiasm surrounding Deep Learning grows, both medical practitioners and regulatory bodies are exploring ways to safely introduce image segmentation in clinical practice. One frontier to overcome when translating promising research into the clinical open world is the shift from static to continual learning. Continual learning, the practice of training models throughout their lifecycle, is seeing growing interest but is still in its infancy in healthcare. We present Lifelong nnU-Net, a standardized framework that puts continual segmentation in the hands of researchers and clinicians. Built on top of nnU-Net, widely regarded as the best-performing segmenter for multiple medical applications, and equipped with all necessary modules for training and testing models sequentially, the framework ensures broad applicability and lowers the barrier to evaluating new methods in a continual fashion. Our benchmark results across three medical segmentation use cases and five continual learning methods give a comprehensive outlook on the current state of the field and establish a first reproducible benchmark.
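The sequential training-and-evaluation loop that such a framework standardizes can be illustrated generically. A minimal PyTorch sketch of naive sequential fine-tuning with re-evaluation on earlier tasks to expose catastrophic forgetting; the toy tensors and model are assumptions for illustration, not the Lifelong nnU-Net API:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins for a sequence of tasks (e.g., the same anatomy acquired on
# different scanners): 16 features, binary labels.
tasks = [(torch.randn(64, 16), torch.randint(0, 2, (64,))) for _ in range(3)]

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(1) == y).float().mean().item()

# Naive sequential fine-tuning: after each stage, re-evaluate every task
# seen so far, so forgetting on earlier tasks becomes visible.
for stage, (x, y) in enumerate(tasks):
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    seen = [round(accuracy(xs, ys), 2) for xs, ys in tasks[: stage + 1]]
    print(f"after task {stage}: accuracy on tasks seen so far = {seen}")
```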
Affiliation(s)
- Camila González
- Technical University of Darmstadt, Karolinenpl. 5, 64289 Darmstadt, Germany
- Amin Ranem
- Technical University of Darmstadt, Karolinenpl. 5, 64289 Darmstadt, Germany
- Daniel Pinto Dos Santos
- University Hospital Cologne, Kerpener Str. 62, 50937 Cologne, Germany
- University Hospital Frankfurt, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
- Ahmed Othman
- University Medical Center Mainz, Langenbeckstraße 1, 55131 Mainz, Germany
53
Kann BH, Likitlersuang J, Bontempi D, Ye Z, Aneja S, Bakst R, Kelly HR, Juliano AF, Payabvash S, Guenette JP, Uppaluri R, Margalit DN, Schoenfeld JD, Tishler RB, Haddad R, Aerts HJWL, Garcia JJ, Flamand Y, Subramaniam RM, Burtness BA, Ferris RL. Screening for extranodal extension in HPV-associated oropharyngeal carcinoma: evaluation of a CT-based deep learning algorithm in patient data from a multicentre, randomised de-escalation trial. Lancet Digit Health 2023;5:e360-e369. [PMID: 37087370; PMCID: PMC10245380; DOI: 10.1016/S2589-7500(23)00046-8]
Abstract
BACKGROUND Pretreatment identification of pathological extranodal extension (ENE) would guide therapy de-escalation strategies in human papillomavirus (HPV)-associated oropharyngeal carcinoma but is diagnostically challenging. ECOG-ACRIN Cancer Research Group E3311 was a multicentre trial wherein patients with HPV-associated oropharyngeal carcinoma were treated surgically and assigned to a pathological risk-based adjuvant strategy of observation, radiation, or concurrent chemoradiation. Despite protocol exclusion of patients with overt radiographic ENE, more than 30% had pathological ENE and required postoperative chemoradiation. We aimed to evaluate a CT-based deep learning algorithm for prediction of ENE in E3311, a diagnostically challenging cohort wherein algorithm use would be impactful in guiding decision-making. METHODS For this retrospective evaluation of deep learning algorithm performance, we obtained pretreatment CTs and corresponding surgical pathology reports from the multicentre, randomised de-escalation trial E3311. All patients enrolled on E3311 required pretreatment diagnostic head and neck imaging; patients with radiographically overt ENE were excluded per study protocol. The lymph node with the largest short-axis diameter and up to two additional nodes were segmented on each scan and annotated for ENE per pathology reports. Deep learning algorithm performance for ENE prediction was compared with that of four board-certified head and neck radiologists. The primary endpoint was the area under the receiver operating characteristic curve (AUC). FINDINGS From 178 collected scans, 313 nodes were annotated: 71 (23%) with ENE, including 39 (13%) with ENE larger than 1 mm. The deep learning algorithm AUC for ENE classification was 0·86 (95% CI 0·82-0·90), outperforming all readers (p<0·0001 for each). Among radiologists, there was high variability in specificity (43-86%) and sensitivity (45-96%) with poor inter-reader agreement (κ 0·32). Matching the algorithm specificity to that of the reader with the highest AUC (R2, false positive rate 22%) improved sensitivity to 75% (+13%). Setting the algorithm false positive rate to 30% yielded 90% sensitivity. The algorithm showed improved performance compared with radiologists for ENE larger than 1 mm (p<0·0001) and in nodes with short-axis diameter 1 cm or larger. INTERPRETATION The deep learning algorithm outperformed experts in predicting pathological ENE in a challenging cohort of patients with HPV-associated oropharyngeal carcinoma from a randomised clinical trial. Deep learning algorithms should be evaluated prospectively as a treatment selection tool. FUNDING ECOG-ACRIN Cancer Research Group and the National Cancer Institute of the US National Institutes of Health.
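Operating points such as "false positive rate 30% yields 90% sensitivity" come from sliding a decision threshold along the ROC curve. A minimal sketch using scikit-learn; all data here is simulated, not the trial's:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic node-level labels (1 = pathological ENE) and model scores:
# positives score higher on average, mimicking a useful classifier.
y = rng.integers(0, 2, 313)
scores = y * rng.normal(1.2, 1.0, 313) + (1 - y) * rng.normal(0.0, 1.0, 313)

print("AUC:", round(roc_auc_score(y, scores), 3))

fpr, tpr, thresholds = roc_curve(y, scores)
# Pick the operating point whose false positive rate is closest to 30%.
i = np.argmin(np.abs(fpr - 0.30))
print(f"threshold={thresholds[i]:.2f}  FPR={fpr[i]:.2f}  sensitivity={tpr[i]:.2f}")
```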
Affiliation(s)
- Benjamin H Kann
- Department of Radiation Oncology, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Jirapat Likitlersuang
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Dennis Bontempi
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Zezhong Ye
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Sanjay Aneja
- Department of Therapeutic Radiology, New Haven, CT, USA
- Richard Bakst
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Amy F Juliano
- Mass Eye and Ear, Mass General Hospital, Boston, MA, USA
- Jeffrey P Guenette
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Ravindra Uppaluri
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Danielle N Margalit
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Jonathan D Schoenfeld
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Roy B Tishler
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Robert Haddad
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Hugo J W L Aerts
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA; Department of Radiology, Maastricht University, Maastricht, Netherlands
- Yael Flamand
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, ECOG-ACRIN Biostatistics Center, Boston, MA, USA
- Rathan M Subramaniam
- Department of Radiology and Nuclear Medicine, University of Notre Dame Australia, Sydney, NSW, Australia; Department of Radiology, Duke University, Durham, NC, USA
- Robert L Ferris
- Department of Otolaryngology, University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA
54
Liefgreen A, Weinstein N, Wachter S, Mittelstadt B. Beyond ideals: why the (medical) AI industry needs to motivate behavioural change in line with fairness and transparency values, and how it can do it. AI & Society 2023;39:2183-2199. [PMID: 39309255; PMCID: PMC11415467; DOI: 10.1007/s00146-023-01684-3]
Abstract
Artificial intelligence (AI) is increasingly relied upon by clinicians for making diagnostic and treatment decisions, playing an important role in imaging, diagnosis, risk analysis, lifestyle monitoring, and health information management. While research has identified biases in healthcare AI systems and proposed technical solutions to address these, we argue that effective solutions require human engagement. Furthermore, there is a lack of research on how to motivate the adoption of these solutions and promote investment in designing AI systems that align with values such as transparency and fairness from the outset. Drawing on insights from psychological theories, we assert the need to understand the values that underlie decisions made by individuals involved in creating and deploying AI systems. We describe how this understanding can be leveraged to increase engagement with de-biasing and fairness-enhancing practices within the AI healthcare industry, ultimately leading to sustained behavioral change via autonomy-supportive communication strategies rooted in motivational and social psychology theories. In developing these pathways to engagement, we consider the norms and needs that govern the AI healthcare domain, and we evaluate incentives for maintaining the status quo against economic, legal, and social incentives for behavior change in line with transparency and fairness values.
Affiliation(s)
- Alice Liefgreen
- Hillary Rodham Clinton School of Law, University of Swansea, Swansea SA2 8PP, UK
- School of Psychology and Clinical Language Sciences, University of Reading, Whiteknights Road, Reading RG6 6AL, UK
- Netta Weinstein
- School of Psychology and Clinical Language Sciences, University of Reading, Whiteknights Road, Reading RG6 6AL, UK
- Sandra Wachter
- Oxford Internet Institute, University of Oxford, 1 St. Giles, Oxford OX1 3JS, UK
- Brent Mittelstadt
- Oxford Internet Institute, University of Oxford, 1 St. Giles, Oxford OX1 3JS, UK
55
Zbrzezny AM, Grzybowski AE. Deceptive tricks in artificial intelligence: adversarial attacks in ophthalmology. J Clin Med 2023;12:3266. [PMID: 37176706; PMCID: PMC10179065; DOI: 10.3390/jcm12093266]
Abstract
The artificial intelligence (AI) systems used for diagnosing ophthalmic diseases have progressed significantly in recent years. The diagnosis of difficult eye conditions, such as cataracts, diabetic retinopathy, age-related macular degeneration, glaucoma, and retinopathy of prematurity, has become significantly less complicated as a result of AI algorithms, which are now on par with ophthalmologists in effectiveness. However, when building AI systems for medical applications such as identifying eye diseases, addressing the challenges of safety and trustworthiness is paramount, including the emerging threat of adversarial attacks. Research has increasingly focused on understanding and mitigating these attacks, with numerous articles discussing the topic in recent years. As a starting point for our discussion, we used the paper by Ma et al., "Understanding Adversarial Attacks on Deep Learning Based Medical Image Analysis Systems". A literature review was performed for this study, including a thorough search of open-access research papers using online sources (PubMed and Google). The research provides examples of unique attack strategies for medical images. Unfortunately, dedicated attack algorithms for the various ophthalmic image types have yet to be developed; this remains an open task. As a result, it is necessary to build algorithms that validate the computations of AI models and explain their findings. In this article, we focus on adversarial attacks, one of the most well-known attack methods, which provide evidence (i.e., adversarial examples) of the lack of resilience of decision models that do not include provable guarantees. Adversarial attacks can produce inaccurate findings in deep learning systems and can have catastrophic effects in the healthcare industry, such as misdiagnosis and healthcare financing fraud.
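The fast gradient sign method (FGSM) is the canonical adversarial attack in this literature: each input pixel is perturbed by a small step in the direction that increases the model's loss. A minimal PyTorch sketch on a toy classifier; the model and image are stand-ins, not an ophthalmic system:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 2))  # toy stand-in
image = torch.rand(1, 1, 32, 32, requires_grad=True)        # toy grayscale image
label = torch.tensor([0])

# FGSM: x_adv = x + eps * sign(dL/dx), clamped back to the valid pixel range.
loss = nn.functional.cross_entropy(model(image), label)
loss.backward()
eps = 0.01
adv_image = (image + eps * image.grad.sign()).clamp(0.0, 1.0).detach()

print("clean prediction:      ", model(image).argmax(1).item())
print("adversarial prediction:", model(adv_image).argmax(1).item())
```

The perturbation is small enough to be imperceptible, which is what makes such attacks a safety concern for deployed diagnostic models.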
Affiliation(s)
- Agnieszka M Zbrzezny
- Faculty of Mathematics and Computer Science, University of Warmia and Mazury, 10-710 Olsztyn, Poland
- Faculty of Design, SWPS University of Social Sciences and Humanities, Chodakowska 19/31, 03-815 Warsaw, Poland
- Andrzej E Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, 60-836 Poznan, Poland
56
de Vries CF, Colosimo SJ, Staff RT, Dymiter JA, Yearsley J, Dinneen D, Boyle M, Harrison DJ, Anderson LA, Lip G. Impact of different mammography systems on artificial intelligence performance in breast cancer screening. Radiol Artif Intell 2023;5:e220146. [PMID: 37293340; PMCID: PMC10245180; DOI: 10.1148/ryai.220146]
Abstract
Artificial intelligence (AI) tools may assist breast screening mammography programs, but limited evidence supports their generalizability to new settings. This retrospective study used a 3-year dataset (April 1, 2016-March 31, 2019) from a U.K. regional screening program. The performance of a commercially available breast screening AI algorithm was assessed with a prespecified and a site-specific decision threshold to evaluate whether its performance was transferable to a new clinical site. The dataset consisted of women (aged approximately 50-70 years) who attended routine screening, excluding self-referrals, those with complex physical requirements, those who had undergone a previous mastectomy, and those whose screening involved technical recalls or lacked the four standard image views. In total, 55 916 screening attendees (mean age, 60 years ± 6 [SD]) met the inclusion criteria. The prespecified threshold resulted in a high recall rate (48.3%, 21 929 of 45 444), which reduced to 13.0% (5896 of 45 444) following threshold calibration, closer to the observed service level (5.0%, 2774 of 55 916). Recall rates also increased approximately threefold following a software upgrade on the mammography equipment, requiring per-software-version thresholds. Using software-specific thresholds, the AI algorithm would have recalled 277 of 303 (91.4%) screen-detected cancers and 47 of 138 (34.1%) interval cancers. AI performance and thresholds should be validated for new clinical settings before deployment, and quality assurance systems should monitor AI performance for consistency.
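Threshold calibration of the kind described reduces to picking the score cut-off that produces the desired flagging rate on local data. A minimal sketch with synthetic scores, assuming only that higher scores mean recall; the 13% target mirrors the figure above:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.beta(2, 5, 45_444)  # synthetic AI suspicion scores for one site

target_recall_rate = 0.13
# Threshold at the (1 - target) quantile so ~13% of attendees score above it.
threshold = np.quantile(scores, 1 - target_recall_rate)

flagged = (scores >= threshold).mean()
print(f"threshold={threshold:.3f}, flagged fraction={flagged:.3f}")
```

In practice a separate threshold would be stored per mammography system and software version, as the study's findings suggest.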
57
Pham N, Hill V, Rauschecker A, Lui Y, Niogi S, Filippi CG, Chang P, Zaharchuk G, Wintermark M. Critical appraisal of artificial intelligence-enabled imaging tools using the levels of evidence system. AJNR Am J Neuroradiol 2023;44:E21-E28. [PMID: 37080722; PMCID: PMC10171388; DOI: 10.3174/ajnr.A7850]
Abstract
Clinical adoption of an artificial intelligence-enabled imaging tool requires critical appraisal of its life cycle from development to implementation using a systematic, standardized, and objective approach that can verify both its technical and clinical efficacy. Toward this concerted effort, the ASFNR/ASNR Artificial Intelligence Workshop Technology Working Group is proposing a hierarchical evaluation system based on the quality, type, and amount of scientific evidence that the artificial intelligence-enabled tool can demonstrate for each component of its life cycle. The current proposal is modeled after the levels of evidence in medicine, with the uppermost level of the hierarchy showing the strongest evidence for potential impact on patient care and health care outcomes. The intended goal of establishing an evidence-based evaluation system is to encourage transparency, foster an understanding of how artificial intelligence tools are created and how they make decisions, and promote reporting of the relevant data on the efficacy of the artificial intelligence tools that are developed. The proposed system is an essential step toward a more formalized, clinically validated, and regulated framework for the safe and effective deployment of artificial intelligence imaging applications in clinical practice.
Affiliation(s)
- N Pham
- Department of Radiology (N.P., G.Z.), Stanford School of Medicine, Palo Alto, California
- V Hill
- Department of Radiology (V.H.), Northwestern University Feinberg School of Medicine, Chicago, Illinois
- A Rauschecker
- Department of Radiology (A.R.), University of California, San Francisco, San Francisco, California
- Y Lui
- Department of Radiology (Y.L.), NYU Grossman School of Medicine, New York, New York
- S Niogi
- Department of Radiology (S.N.), Weill Cornell Medicine, New York, New York
- C G Filippi
- Department of Radiology (C.G.F.), Tufts University School of Medicine, Boston, Massachusetts
- P Chang
- Department of Radiology (P.C.), University of California, Irvine, Irvine, California
- G Zaharchuk
- Department of Radiology (N.P., G.Z.), Stanford School of Medicine, Palo Alto, California
- M Wintermark
- Department of Neuroradiology (M.W.), The University of Texas MD Anderson Cancer Center, Houston, Texas
58
Steele L, Tan XL, Olabi B, Gao JM, Tanaka RJ, Williams HC. Determining the clinical applicability of machine learning models through assessment of reporting across skin phototypes and rarer skin cancer types: a systematic review. J Eur Acad Dermatol Venereol 2023;37:657-665. [PMID: 36514990; DOI: 10.1111/jdv.18814]
Abstract
Machine learning (ML) models for skin cancer recognition may have variable performance across different skin phototypes and skin cancer types. Overall performance metrics alone are insufficient to detect poor subgroup performance. We aimed (1) to assess whether studies of ML models reported results separately for different skin phototypes and rarer skin cancers, and (2) to graphically represent the skin cancer training datasets used by current ML models. In this systematic review, we searched PubMed, Embase and CENTRAL. We included all studies in medical journals assessing an ML technique for skin cancer diagnosis that used clinical or dermoscopic images from 1 January 2012 to 22 September 2021. No language restrictions were applied. We considered rarer skin cancers to be skin cancers other than pigmented melanoma, basal cell carcinoma and squamous cell carcinoma. We identified 114 studies for inclusion. Rarer skin cancers were included by 8/114 studies (7.0%), and results for a rarer skin cancer were reported separately in 1/114 studies (0.9%). Performance was reported across all skin phototypes in 1/114 studies (0.9%), but performance in skin phototypes I and VI was uncertain owing to minimal representation of these phototypes in the test dataset (9/3756 and 1/3756 images, respectively). For training datasets, although public datasets were used most frequently, the most widely used being the International Skin Imaging Collaboration (ISIC) archive (65/114 studies, 57.0%), the largest datasets were private. Our review identified that most ML models did not report performance separately for rarer skin cancers and different skin phototypes. A degree of variability in ML model performance across subgroups is expected, but the current lack of transparency is not justifiable and risks models being used inappropriately in populations in whom accuracy is low.
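The methodological point that overall metrics can mask poor subgroup performance is easy to make concrete: stratify the test set and report sensitivity per stratum alongside stratum size. A sketch with synthetic data; the column names and counts are assumptions echoing the imbalance described above:

```python
import pandas as pd

# Synthetic melanoma test cases with a skin-phototype column; the extreme
# imbalance (9 and 1 cases) mirrors the counts reported above.
df = pd.DataFrame({
    "phototype": ["I"] * 9 + ["III"] * 2000 + ["VI"] * 1,
    "label":     [1] * 2010,  # all true positives in this toy example
    "pred":      [0] * 5 + [1] * 4 + [1] * 1800 + [0] * 200 + [0] * 1,
})

# Per-stratum sensitivity (pred is 0/1, so its mean is the sensitivity),
# together with the stratum size that qualifies it.
by_group = df[df.label == 1].groupby("phototype")["pred"].agg(["mean", "size"])
print(by_group)  # trustworthy where n is large, meaningless where n is 9 or 1
```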
Affiliation(s)
- Lloyd Steele
- Department of Dermatology, The Royal London Hospital, London, UK
- Centre for Cell Biology and Cutaneous Research, Blizard Institute, Queen Mary University of London, London, UK
- Xiang Li Tan
- St George's University Hospitals NHS Foundation Trust, London, UK
- Bayanne Olabi
- Biosciences Institute, Newcastle University, Newcastle, UK
- Jing Mia Gao
- Department of Dermatology, The Royal London Hospital, London, UK
- Reiko J Tanaka
- Department of Bioengineering, Imperial College London, London, UK
- Hywel C Williams
- Centre of Evidence-Based Dermatology, School of Medicine, University of Nottingham, Nottingham, UK
59
Lundström C, Lindvall M. Mapping the landscape of care providers' quality assurance approaches for AI in diagnostic imaging. J Digit Imaging 2023;36:379-387. [PMID: 36352164; PMCID: PMC10039170; DOI: 10.1007/s10278-022-00731-7]
Abstract
The discussion on artificial intelligence (AI) solutions in diagnostic imaging has matured in recent years. The potential value of AI adoption is well established, as are the associated risks. Much focus has, rightfully, been on regulatory certification of AI products, with the strong incentive that it is an enabling step for commercial actors. It is, however, becoming evident that regulatory approval is not enough to ensure safe and effective AI usage in the local setting. In other words, care providers need to develop and implement quality assurance (QA) approaches for AI solutions in diagnostic imaging. The domain of AI-specific QA is still in an early development phase. We contribute to this development by describing the current landscape of QA-for-AI approaches in medical imaging, with a focus on radiology and pathology. We map the potential quality threats and review the existing QA approaches in relation to those threats. We propose a practical categorization of QA approaches based on key characteristics corresponding to means, situation, and purpose. The review highlights the heterogeneity of methods and practices relevant for this domain and points to targets for future research efforts.
Affiliation(s)
- Claes Lundström
- Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden
- Sectra AB, Linköping, Sweden
60
Redrup Hill E, Mitchell C, Brigden T, Hall A. Ethical and legal considerations influencing human involvement in the implementation of artificial intelligence in a clinical pathway: a multi-stakeholder perspective. Front Digit Health 2023;5:1139210. [PMID: 36999168; PMCID: PMC10043985; DOI: 10.3389/fdgth.2023.1139210]
Abstract
INTRODUCTION Ethical and legal factors will have an important bearing on when and whether automation is appropriate in healthcare. There is a developing literature on the ethics of artificial intelligence (AI) in health, including specific legal or regulatory questions such as whether there is a right to an explanation of AI decision-making. However, there has been limited consideration of the specific ethical and legal factors that influence when, and in what form, human involvement may be required in the implementation of AI in a clinical pathway, and of the views of the wide range of stakeholders involved. To address this question, we chose the exemplar of the pathway for the early detection of Barrett's Oesophagus (BE) and oesophageal adenocarcinoma, where Gehrung and colleagues have developed a "semi-automated", deep-learning system to analyse samples from the Cytosponge™-TFF3 test (a minimally invasive alternative to endoscopy), and where AI promises to mitigate increasing demands on pathologists' time and input. METHODS We gathered a multidisciplinary group of stakeholders, including developers, patients, healthcare professionals and regulators, to obtain their perspectives on the ethical and legal issues that may arise using this exemplar. RESULTS The findings are grouped under six general themes: risk and potential harms; impacts on human experts; equity and bias; transparency and oversight; patient information and choice; and accountability, moral responsibility and liability for error. Within these themes, a range of subtle and context-specific elements emerged, highlighting the importance of pre-implementation, interdisciplinary discussions and appreciation of pathway-specific considerations. DISCUSSION To evaluate these findings, we draw on the well-established principles of biomedical ethics identified by Beauchamp and Childress as a lens through which to view these results and their implications for personalised medicine. Our findings are not only relevant to this context but have implications for AI in digital pathology and healthcare more broadly.
61
Glocker B, Jones C, Bernhardt M, Winzeck S. Algorithmic encoding of protected characteristics in chest X-ray disease detection models. EBioMedicine 2023;89:104467. [PMID: 36791660; PMCID: PMC10025760; DOI: 10.1016/j.ebiom.2023.104467]
Abstract
BACKGROUND It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring methodology for subgroup analysis in image-based disease detection models. METHODS We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups. FINDINGS We confirm subgroup disparities in terms of shifted true and false positive rates, which are partially removed after correcting for population and prevalence shifts in the test sets. We find that transfer learning alone is insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable insights about the way protected characteristics are encoded in the feature representations of deep neural networks. INTERPRETATION Subgroup analysis is key for identifying performance disparities of AI models, but statistical differences across subgroups need to be taken into account when analyzing potential biases in disease detection. The proposed methodology provides a comprehensive framework for subgroup analysis enabling further research into the underlying causes of disparities. FUNDING European Research Council Horizon 2020, UK Research and Innovation.
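Of the techniques listed, test-set resampling is the most self-contained: draw a resampled test set in which every subgroup has the same size and disease prevalence, so that true and false positive rates can be compared without population and prevalence shifts. A minimal sketch on synthetic data; the column names and counts are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], n, p=[0.7, 0.3]),
    "label": rng.integers(0, 2, n),
    "pred":  rng.integers(0, 2, n),
})

def resample(group_df, n_pos=500, n_neg=500):
    """Fix subgroup size and prevalence by sampling positives and negatives."""
    pos = group_df[group_df.label == 1].sample(n_pos, replace=True, random_state=0)
    neg = group_df[group_df.label == 0].sample(n_neg, replace=True, random_state=0)
    return pd.concat([pos, neg])

for g, gdf in df.groupby("group"):
    bal = resample(gdf)
    tpr = (bal[bal.label == 1].pred == 1).mean()
    fpr = (bal[bal.label == 0].pred == 1).mean()
    print(f"group {g}: TPR={tpr:.2f} FPR={fpr:.2f}")
```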
Affiliation(s)
- Ben Glocker
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Charles Jones
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Mélanie Bernhardt
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Stefan Winzeck
- Department of Computing, Imperial College London, London SW7 2AZ, UK
62
Taribagil P, Hogg HDJ, Balaskas K, Keane PA. Integrating artificial intelligence into an ophthalmologist's workflow: obstacles and opportunities. Expert Rev Ophthalmol 2023. [DOI: 10.1080/17469899.2023.2175672]
Affiliation(s)
- Priyal Taribagil
- Medical Retina Department, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- HD Jeffry Hogg
- Medical Retina Department, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Department of Population Health Science, Population Health Science Institute, Newcastle University, Newcastle upon Tyne, UK
- Department of Ophthalmology, Newcastle upon Tyne Hospitals NHS Foundation Trust, Freeman Road, Newcastle upon Tyne, UK
- Konstantinos Balaskas
- NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Medical Retina, University College London Institute of Ophthalmology, London, UK
- Pearse A Keane
- NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Medical Retina, University College London Institute of Ophthalmology, London, UK
63
Beyond the AJR: Validation and Algorithmic Audit of a Deep Learning System to Detect Hip Fractures Radiographically. AJR Am J Roentgenol 2023;220:150. [PMID: 35674349; DOI: 10.2214/AJR.22.28053]
64
Sapey E, Gallier S, Evison F, McNulty D, Reeves K, Ball S. Variability and performance of NHS England's 'reason to reside' criteria in predicting hospital discharge in acute hospitals in England: a retrospective, observational cohort study. BMJ Open 2022;12:e065862. [PMID: 36572492; PMCID: PMC9805825; DOI: 10.1136/bmjopen-2022-065862]
Abstract
OBJECTIVES NHS England (NHSE) advocates 'reason to reside' (R2R) criteria to support discharge planning. The proportion of patients without R2R and their rate of discharge are reported daily by acute hospitals in England. R2R has no interoperable standardised data model (SDM), and its performance has not been validated. We aimed to understand the degree of intercentre and intracentre variation in R2R-related metrics reported to NHSE, define an SDM implemented within a single-centre Electronic Health Record to generate an electronic R2R (eR2R) and evaluate its performance in predicting subsequent discharge. DESIGN Retrospective observational cohort study using routinely collected health data. SETTING 122 NHS Trusts in England for national reporting and an acute hospital in England for local reporting. PARTICIPANTS 6 602 706 patient-days were analysed using 3 months of national data, and 1 039 592 patient-days using 3 years of single-centre data. MAIN OUTCOME MEASURES Variability in R2R-related metrics reported to NHSE. Performance of eR2R in predicting discharge within 24 hours. RESULTS There were high levels of intracentre and intercentre variability in R2R-related metrics (p<0.0001) but not in eR2R. Informedness of eR2R for discharge within 24 hours was low (J-statistic 0.09-0.12 across three consecutive years). Of those remaining in hospital without eR2R, 61.2% met eR2R criteria on subsequent days (76% within 24 hours), most commonly due to intravenous therapy administration (32.8%) or an increased NEWS2 score (21.9%). CONCLUSIONS Reported R2R metrics are highly variable between and within acute Trusts in England. Although case-mix or community care provision may account for some variability, the absence of an SDM prevents standardised reporting. Following the development of an SDM in one acute Trust, the variability reduced. However, the performance of eR2R was poor: prone to change even when negative, and unable to contribute meaningfully to discharge planning.
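Informedness (Youden's J) is sensitivity + specificity − 1, so the reported 0.09-0.12 means eR2R barely outperformed chance. A worked sketch with hypothetical counts chosen to land near that range; the paper's actual counts differ:

```python
# Hypothetical daily eR2R assessments vs discharge within 24 h (illustrative).
tp, fn = 7_000, 13_000   # discharged: correctly flagged "no reason to reside" / missed
tn, fp = 60_000, 20_000  # not discharged: correctly flagged "reason to reside" / not

sensitivity = tp / (tp + fn)   # 0.35
specificity = tn / (tn + fp)   # 0.75
j = sensitivity + specificity - 1
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} J={j:.2f}")
# J = 0.10: only marginally better than the J = 0 of random guessing.
```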
Affiliation(s)
- Elizabeth Sapey
- PIONEER Data Hub, University of Birmingham, Birmingham, UK
- Department of Acute Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Suzy Gallier
- PIONEER Data Hub, University of Birmingham, Birmingham, UK
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Felicity Evison
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- David McNulty
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Katherine Reeves
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Simon Ball
- Renal Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, West Midlands, UK
- Better Care Programme and Midlands Site, HDR UK, Birmingham, West Midlands, UK
65
Müller L, Kloeckner R, Mildenberger P, Pinto Dos Santos D. [Validation and implementation of artificial intelligence in radiology: quo vadis in 2022?]. Radiologie (Heidelb) 2022;63:381-386. [PMID: 36510007; DOI: 10.1007/s00117-022-01097-1]
Abstract
BACKGROUND The hype around artificial intelligence (AI) in radiology continues and the number of approved AI tools is growing steadily. Despite this great potential, integration into clinical routine in radiology remains limited. Moreover, the large number of individual applications poses a challenge for clinical routine, as separate applications must be selected for different questions and organ systems, which increases complexity and the time required. OBJECTIVES This review discusses the current status of validation and implementation of AI tools in clinical routine and identifies possible approaches for better assessing the generalizability of AI tool results. MATERIALS AND METHODS A search of literature and product databases, as well as of publications, position papers, and reports from various stakeholders, was conducted for this review. RESULTS Scientific evidence and independent validation studies are available for only a few commercial AI tools, and the generalizability of the results often remains questionable. CONCLUSIONS One challenge is the multitude of offerings for individual, specific application areas from a large number of manufacturers, which complicates integration into the existing site-specific IT infrastructure. Furthermore, reimbursement by health insurance companies in Germany for the use of AI tools in clinical routine is lacking; for reimbursement to be granted, the clinical utility of new applications must first be proven, and such proof is lacking for most applications.
Affiliation(s)
- Lukas Müller
- Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, Universitätsmedizin Mainz, Langenbeckstr. 1, 55131 Mainz, Germany
- Roman Kloeckner
- Institut für Interventionelle Radiologie, Universitätsklinikum Schleswig-Holstein - Campus Lübeck, Lübeck, Germany
- Peter Mildenberger
- Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, Universitätsmedizin Mainz, Langenbeckstr. 1, 55131 Mainz, Germany
- Daniel Pinto Dos Santos
- Institut für Diagnostische und Interventionelle Radiologie, Uniklinik Köln, Köln, Germany
- Institut für Diagnostische und Interventionelle Radiologie, Universitätsklinikum Frankfurt, Frankfurt am Main, Germany
66
van de Sande D, van Genderen ME, Braaf H, Gommers D, van Bommel J. Moving towards clinical use of artificial intelligence in intensive care medicine: business as usual? Intensive Care Med 2022;48:1815-1817. [PMID: 36269330; DOI: 10.1007/s00134-022-06910-y]
Affiliation(s)
- Davy van de Sande
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Michel E van Genderen
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Heleen Braaf
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Diederik Gommers
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Jasper van Bommel
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
67
Developing robust benchmarks for driving forward AI innovation in healthcare. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00559-4]
68
Monteith S, Glenn T, Geddes J, Whybrow PC, Achtyes E, Bauer M. Expectations for artificial intelligence (AI) in psychiatry. Curr Psychiatry Rep 2022;24:709-721. [PMID: 36214931; PMCID: PMC9549456; DOI: 10.1007/s11920-022-01378-5]
Abstract
PURPOSE OF REVIEW Artificial intelligence (AI) is often presented as a transformative technology for clinical medicine even though the current technology maturity of AI is low. The purpose of this narrative review is to describe the complex reasons for the low technology maturity and set realistic expectations for the safe, routine use of AI in clinical medicine. RECENT FINDINGS For AI to be productive in clinical medicine, many diverse factors that contribute to the low maturity level need to be addressed. These include technical problems such as data quality, dataset shift, black-box opacity, validation and regulatory challenges, and human factors such as a lack of education in AI, workflow changes, automation bias, and deskilling. There will also be new and unanticipated safety risks with the introduction of AI. The solutions to these issues are complex and will take time to discover, develop, validate, and implement. However, addressing the many problems in a methodical manner will expedite the safe and beneficial use of AI to augment medical decision making in psychiatry.
Affiliation(s)
- Scott Monteith
- Michigan State University College of Human Medicine, Traverse City Campus, Traverse City, MI 49684, USA
- Tasha Glenn
- ChronoRecord Association, Fullerton, CA, USA
- John Geddes
- Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford, UK
- Peter C Whybrow
- Department of Psychiatry and Biobehavioral Sciences, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles (UCLA), Los Angeles, CA, USA
- Eric Achtyes
- Michigan State University College of Human Medicine, Grand Rapids, MI, USA
- Network180, Grand Rapids, MI, USA
- Michael Bauer
- Department of Psychiatry and Psychotherapy, University Hospital Carl Gustav Carus Medical Faculty, Technische Universität Dresden, Dresden, Germany
69
Mascagni P, Alapatt D, Sestini L, Altieri MS, Madani A, Watanabe Y, Alseidi A, Redan JA, Alfieri S, Costamagna G, Boškoski I, Padoy N, Hashimoto DA. Computer vision in surgery: from potential to clinical value. NPJ Digit Med 2022;5:163. [PMID: 36307544; PMCID: PMC9616906; DOI: 10.1038/s41746-022-00707-5]
Abstract
Hundreds of millions of operations are performed worldwide each year, and the rising uptake of minimally invasive surgery has enabled fiber-optic cameras and robots to become both important tools for conducting surgery and sensors from which to capture information about surgery. Computer vision (CV), the application of algorithms to analyze and interpret visual data, has become a critical technology through which to study the intraoperative phase of care, with the goals of augmenting surgeons' decision-making processes, supporting safer surgery, and expanding access to surgical care. While much work has been performed on potential use cases, there are currently no CV tools widely used for diagnostic or therapeutic applications in surgery. Using laparoscopic cholecystectomy as an example, we review current CV techniques that have been applied to minimally invasive surgery and their clinical applications. Finally, we discuss the challenges and obstacles that remain to be overcome for broader implementation and adoption of CV in surgery.
Affiliation(s)
- Pietro Mascagni
- Gemelli Hospital, Catholic University of the Sacred Heart, Rome, Italy
- IHU-Strasbourg, Institute of Image-Guided Surgery, Strasbourg, France
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Deepak Alapatt
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
- Luca Sestini
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
- Maria S Altieri
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Amin Madani
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University Health Network, Toronto, ON, Canada
- Yusuke Watanabe
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Hokkaido, Hokkaido, Japan
- Adnan Alseidi
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of California San Francisco, San Francisco, CA, USA
- Jay A Redan
- Department of Surgery, AdventHealth-Celebration Health, Celebration, FL, USA
- Sergio Alfieri
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Guido Costamagna
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Ivo Boškoski
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Nicolas Padoy
- IHU-Strasbourg, Institute of Image-Guided Surgery, Strasbourg, France
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
- Daniel A Hashimoto
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
70
Garrucho L, Kushibar K, Jouide S, Diaz O, Igual L, Lekadir K. Domain generalization in deep learning based mass detection in mammography: a large-scale multi-center study. Artif Intell Med 2022;132:102386. [PMID: 36207090; DOI: 10.1016/j.artmed.2022.102386]
Abstract
Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization in artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this study, we explored the domain generalization of deep learning methods for mass detection in digital mammography and analyzed in depth the sources of domain shift in a large-scale multi-center setting. To this end, we compared the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline was designed to improve domain generalization without requiring images from the new domain. The results show that our workflow generalized better than state-of-the-art transfer learning-based approaches in four out of five domains, while reducing the domain shift caused by different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis was performed to identify the covariate shifts with the greatest effects on detection performance, such as those due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning-based breast cancer detection.
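The single-source, multi-domain evaluation design can be sketched generically: train on one domain and score on all others, so acquisition shift shows up as off-diagonal performance drops. A toy sketch with simulated "scanners"; the data and the logistic model are stand-ins for the detection networks studied:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_domain(shift, n=400):
    """Toy stand-in for one scanner: same task, shifted feature distribution."""
    x = rng.normal(shift, 1.0, (n, 8))
    y = (x[:, 0] - shift + rng.normal(0, 0.5, n) > 0).astype(int)
    return x, y

domains = {name: make_domain(s) for name, s in [("A", 0.0), ("B", 1.0), ("C", 2.0)]}

# Single-source training, multi-domain testing: accuracy drops off-diagonal
# reveal the domain shift.
for src, (xs, ys) in domains.items():
    clf = LogisticRegression().fit(xs, ys)
    row = {tgt: round(clf.score(xt, yt), 2) for tgt, (xt, yt) in domains.items()}
    print(f"trained on {src}: {row}")
```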
Affiliation(s)
- Lidia Garrucho
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Kaisar Kushibar
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Socayna Jouide
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Oliver Diaz
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Laura Igual
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Karim Lekadir
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
71
Fehr J, Jaramillo-Gutierrez G, Oala L, Gröschel MI, Bierwirth M, Balachandran P, Werneck-Leite A, Lippert C. Piloting a survey-based assessment of transparency and trustworthiness with three medical AI tools. Healthcare (Basel) 2022;10:1923. [PMID: 36292369; PMCID: PMC9601535; DOI: 10.3390/healthcare10101923]
Abstract
Artificial intelligence (AI) offers the potential to support healthcare delivery, but poorly trained or validated algorithms bear risks of harm. Ethical guidelines state transparency about model development and validation as a requirement for trustworthy AI. Abundant guidance exists on providing transparency through reporting, yet poorly reported medical AI tools are common. To close this transparency gap, we developed and piloted a framework to quantify the transparency of medical AI tools with three use cases. Our framework comprises a survey to report on the intended use, training and validation data and processes, ethical considerations, and deployment recommendations. The transparency of each response was scored 0, 0.5, or 1 to reflect whether the requested information was not, partially, or fully provided. Additionally, we assessed on an analogous three-point scale whether the provided responses fulfilled the transparency requirement for a set of trustworthiness criteria from ethical guidelines. The degree of transparency and trustworthiness was calculated on a scale from 0% to 100%. Our assessment of three medical AI use cases pinpointed reporting gaps and resulted in transparency scores of 67% for two use cases and 59% for the third. We report anecdotal evidence that business constraints and limited information from external datasets were major obstacles to providing transparency for the three use cases. The observed transparency gaps also lowered the degree of trustworthiness, indicating compliance gaps with ethical guidelines. All three pilot use cases faced challenges in providing transparency about medical AI tools, and more studies are needed to investigate these challenges in the wider medical AI sector. Applying this framework for an external assessment of transparency may be infeasible if business constraints prevent the disclosure of information. New strategies may be necessary to enable audits of medical AI tools while preserving business secrets.
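The scoring scheme reduces to averaging per-item scores and expressing the mean as a percentage. A minimal sketch; the item names are assumptions, not the framework's actual survey items:

```python
# Illustrative item scores (0 = not, 0.5 = partially, 1 = fully reported).
responses = {
    "intended_use": 1.0,
    "training_data": 0.5,
    "validation_process": 0.5,
    "ethical_considerations": 1.0,
    "deployment_recommendations": 0.0,
}
transparency = 100 * sum(responses.values()) / len(responses)
print(f"transparency score: {transparency:.0f}%")  # -> 60%
```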
Affiliation(s)
- Jana Fehr
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Digital Health & Machine Learning, Hasso Plattner Institute, 14482 Potsdam, Germany
- Luis Oala
- Department of Artificial Intelligence, Fraunhofer HHI, 10587 Berlin, Germany
- Matthias I. Gröschel
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Manuel Bierwirth
- ITU/WHO Focus Group AI4H, 1211 Geneva, Switzerland
- Alumnus, Goethe University Frankfurt, 60323 Frankfurt am Main, Germany
- Pradeep Balachandran
- ITU/WHO Focus Group AI4H, 1211 Geneva, Switzerland
- Technical Consultant (Digital Health), Thiruvananthapuram 695010, India
- Christoph Lippert
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Digital Health & Machine Learning, Hasso Plattner Institute, 14482 Potsdam, Germany
72
Denniston AK, Kale AU, Lee WH, Mollan SP, Keane PA. Building trust in real-world data: lessons from INSIGHT, the UK's health data research hub for eye health and oculomics. Curr Opin Ophthalmol 2022;33:399-406. [PMID: 35916569; DOI: 10.1097/ICU.0000000000000887]
Abstract
PURPOSE OF REVIEW In this review, we consider the challenges of creating a trusted resource for real-world data in ophthalmology, based on our experience of establishing INSIGHT, the UK's Health Data Research Hub for Eye Health and Oculomics. RECENT FINDINGS The INSIGHT Health Data Research Hub maximizes the benefits and impact of historical, patient-level UK National Health Service (NHS) electronic health record data, including images, by making the data research-ready through curation and anonymisation. It is built around a shared 'north star' of enabling research for patient benefit. INSIGHT has worked to establish patient and public trust in the concept and delivery of INSIGHT, with efficient and robust governance processes that support safe and secure access to data for researchers. By linking to systemic data, there is an opportunity for discovery of novel ophthalmic biomarkers of systemic diseases ('oculomics'). Datasets that represent the whole population are an important tool to address the increasingly recognized threat of health data poverty. SUMMARY Enabling efficient, safe access to routinely collected clinical data is a substantial undertaking, especially when this includes imaging modalities, but it provides an exceptional resource for research. Research and innovation built on inclusive real-world data is an important tool in ensuring that the discoveries and technologies of the future do not favour selected groups alone but work for all patients.
Affiliation(s)
- Alastair K Denniston: INSIGHT Health Data Research Hub for Eye Health; Academic Unit of Ophthalmology, Institute of Inflammation & Ageing, College of Medical and Dental Sciences, University of Birmingham; Ophthalmology Department, University Hospitals Birmingham NHS Foundation Trust, Birmingham
- Aditya U Kale: INSIGHT Health Data Research Hub for Eye Health; Academic Unit of Ophthalmology, Institute of Inflammation & Ageing, College of Medical and Dental Sciences, University of Birmingham
- Wen Hwa Lee: INSIGHT Health Data Research Hub for Eye Health; Action Against Age-Related Macular Degeneration, London
- Susan P Mollan: INSIGHT Health Data Research Hub for Eye Health; Ophthalmology Department, University Hospitals Birmingham NHS Foundation Trust, Birmingham; Institute of Metabolism and Systems Research, College of Medical and Dental Sciences, University of Birmingham
- Pearse A Keane: INSIGHT Health Data Research Hub for Eye Health; NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust, UCL Institute of Ophthalmology, London, UK
73
Albert K, Delano M. Sex trouble: Sex/gender slippage, sex confusion, and sex obsession in machine learning using electronic health records. Patterns (N Y) 2022; 3:100534. [PMID: 36033589 PMCID: PMC9403398 DOI: 10.1016/j.patter.2022.100534]
Abstract
False assumptions that sex and gender are binary, static, and concordant are deeply embedded in the medical system. As machine learning researchers use medical data to build tools to solve novel problems, understanding how existing systems represent sex/gender incorrectly is necessary to avoid perpetuating harm. In this perspective, we identify and discuss three factors to consider when working with sex/gender in research: "sex/gender slippage," the frequent substitution of sex and sex-related terms for gender and vice versa; "sex confusion," the fact that any given sex variable holds many different potential meanings; and "sex obsession," the idea that the relevant variable for most inquiries related to sex/gender is sex assigned at birth. We then explore how these phenomena show up in medical machine learning research using electronic health records, with a specific focus on HIV risk prediction. Finally, we offer recommendations about how machine learning researchers can engage more carefully with questions of sex/gender.
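One recommendation along the lines the authors discuss, recording each sex/gender construct as its own explicit field rather than a single ambiguous "sex" column, can be made concrete in code. The sketch below is a hypothetical illustration; the field names and the screening helper are invented for this example, not drawn from the paper:

```python
# Hypothetical schema (not from the paper) that avoids "sex confusion" by
# separating the constructs a single "sex" column typically conflates.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PatientDemographics:
    administrative_gender: Optional[str]   # what the EHR registration recorded
    sex_assigned_at_birth: Optional[str]   # only if documented, never inferred
    gender_identity: Optional[str]         # self-reported, if collected
    has_cervix: Optional[bool] = None      # anatomy is often the relevant variable

def eligible_for_cervical_screening(p: PatientDemographics) -> Optional[bool]:
    """A model that needs anatomy should ask for anatomy, not for 'sex'.
    Returns None when the relevant field is undocumented."""
    return p.has_cervix

p = PatientDemographics(administrative_gender="F",
                        sex_assigned_at_birth=None,
                        gender_identity="non-binary",
                        has_cervix=True)
print(eligible_for_cervical_screening(p))  # True
```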
Affiliation(s)
- Kendra Albert: Cyberlaw Clinic, Harvard Law School, Cambridge, MA 02138, USA
- Maggie Delano: Engineering Department, Swarthmore College, Swarthmore, PA 19146, USA
74
Arora A, Arora A. Generative adversarial networks and synthetic patient data: current challenges and future perspectives. Future Healthc J 2022; 9:190-193. [DOI: 10.7861/fhj.2022-0013]
75
Oakden-Rayner L, Gale W, Bonham TA, Lungren MP, Carneiro G, Bradley AP, Palmer LJ. Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study. Lancet Digit Health 2022; 4:e351-e358. [PMID: 35396184 DOI: 10.1016/s2589-7500(22)00004-8]
Abstract
BACKGROUND Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can underestimate the risks of artificial intelligence-based diagnostic systems. METHODS We present a preclinical evaluation of a deep learning model intended to detect proximal femoral fractures in frontal x-ray films in emergency department patients, trained on films from the Royal Adelaide Hospital (Adelaide, SA, Australia). This evaluation included a reader study comparing the performance of the model against five radiologists (three musculoskeletal specialists and two general radiologists) on a dataset of 200 fracture cases and 200 non-fractures (also from the Royal Adelaide Hospital), an external validation study using a dataset obtained from Stanford University Medical Center, CA, USA, and an algorithmic audit to detect any unusual or unexpected model behaviour. FINDINGS In the reader study, the area under the receiver operating characteristic curve (AUC) for the performance of the deep learning model was 0·994 (95% CI 0·988-0·999) compared with an AUC of 0·969 (0·960-0·978) for the five radiologists. This strong model performance was maintained on external validation, with an AUC of 0·980 (0·931-1·000). However, the preclinical evaluation identified barriers to safe deployment, including a substantial shift in the model operating point on external validation and an increased error rate on cases with abnormal bones (eg, Paget's disease). INTERPRETATION The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behaviour even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions. FUNDING None.
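The operating-point shift that the audit flagged can be reproduced in miniature. The sketch below uses synthetic scores rather than the study's data (the `fake_scores` helper and all numbers are invented) to show how a threshold chosen on internal data can lose sensitivity on an external set even while AUC stays high:

```python
# Sketch (not the authors' audit code) of one external-validation check:
# does the threshold chosen on internal data keep its intended sensitivity
# and specificity on an external dataset?
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

def fake_scores(n, separation):
    """Stand-in for real model outputs: labels plus noisy scores."""
    y = rng.integers(0, 2, n)
    return y, y * separation + rng.normal(0, 0.5, n)

y_int, s_int = fake_scores(400, 2.0)  # internal (development) set
y_ext, s_ext = fake_scores(400, 1.2)  # external set with distribution shift

# Choose the operating point on internal data: first point with ~95% sensitivity.
fpr, tpr, thresholds = roc_curve(y_int, s_int)
threshold = thresholds[np.argmax(tpr >= 0.95)]

def sens_spec(y, s, t):
    pred = s >= t
    sens = (pred & (y == 1)).sum() / (y == 1).sum()
    spec = (~pred & (y == 0)).sum() / (y == 0).sum()
    return round(sens, 3), round(spec, 3)

print("internal AUC:", round(roc_auc_score(y_int, s_int), 3))
print("external AUC:", round(roc_auc_score(y_ext, s_ext), 3))
print("internal sens/spec at fixed threshold:", sens_spec(y_int, s_int, threshold))
print("external sens/spec at fixed threshold:", sens_spec(y_ext, s_ext, threshold))
# A still-high external AUC can coexist with a large drop in sensitivity at
# the fixed threshold: that is the operating-point shift the audit detected.
```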
Affiliation(s)
- Lauren Oakden-Rayner: School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia
- William Gale: Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia; School of Computer Science, University of Adelaide, Adelaide, SA, Australia
- Thomas A Bonham: Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA
- Matthew P Lungren: Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA; Stanford Artificial Intelligence in Medicine and Imaging Center, Stanford University, Stanford, CA, USA
- Gustavo Carneiro: Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia
- Andrew P Bradley: Science and Engineering Faculty, Queensland University of Technology, Brisbane, QLD, Australia
- Lyle J Palmer: School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia