1
Yang Y, Zhang H, Gichoya JW, Katabi D, Ghassemi M. The limits of fair medical imaging AI in real-world generalization. Nat Med 2024. [PMID: 38942996 DOI: 10.1038/s41591-024-03113-4]
Abstract
As artificial intelligence (AI) rapidly approaches human-level performance in medical imaging, it is crucial that it does not exacerbate or propagate healthcare disparities. Previous research established AI's capacity to infer demographic data from chest X-rays, leading to a key concern: do models using demographic shortcuts have unfair predictions across subpopulations? In this study, we conducted a thorough investigation into the extent to which medical AI uses demographic encodings, focusing on potential fairness discrepancies within both in-distribution training sets and external test sets. Our analysis covers three key medical imaging disciplines-radiology, dermatology and ophthalmology-and incorporates data from six global chest X-ray datasets. We confirm that medical imaging AI leverages demographic shortcuts in disease classification. Although correcting shortcuts algorithmically effectively addresses fairness gaps to create 'locally optimal' models within the original data distribution, this optimality is not true in new test settings. Surprisingly, we found that models with less encoding of demographic attributes are often most 'globally optimal', exhibiting better fairness during model evaluation in new test environments. Our work establishes best practices for medical imaging models that maintain their performance and fairness in deployments beyond their initial training contexts, underscoring critical considerations for AI clinical deployments across populations and sites.
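The core evaluation in this abstract — comparing subgroup performance gaps within the training distribution and again in external test settings — can be illustrated with a small metric sketch. The snippet below is illustrative only (variable names such as `race_id` are placeholders, not the study's data or code) and uses the false-negative-rate gap as an example fairness measure.

```python
# Illustrative sketch (not the authors' code): contrast a model's per-group
# fairness gap on in-distribution vs. external test data.
import numpy as np

def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP); returns NaN if the group has no positive cases."""
    positives = y_true == 1
    if positives.sum() == 0:
        return float("nan")
    return np.mean(y_pred[positives] == 0)

def fairness_gap(y_true, y_score, groups, threshold=0.5):
    """Largest difference in FNR between any two demographic subgroups."""
    y_pred = (y_score >= threshold).astype(int)
    rates = {g: false_negative_rate(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)}
    values = [v for v in rates.values() if not np.isnan(v)]
    return max(values) - min(values), rates

# gap_id, _  = fairness_gap(y_true_id,  score_id,  race_id)   # original distribution
# gap_ext, _ = fairness_gap(y_true_ext, score_ext, race_ext)  # new deployment site
# A model that looks "locally optimal" (small gap_id) may still show a large gap_ext.
```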
Affiliation(s)
- Yuzhe Yang
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Haoran Zhang
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Judy W Gichoya
  - Department of Radiology, Emory University School of Medicine, Atlanta, GA, USA
- Dina Katabi
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Marzyeh Ghassemi
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
2
Kale AU, Hogg HDJ, Pearson R, Glocker B, Golder S, Coombe A, Waring J, Liu X, Moore DJ, Denniston AK. Detecting Algorithmic Errors and Patient Harms for AI-Enabled Medical Devices in Randomized Controlled Trials: Protocol for a Systematic Review. JMIR Res Protoc 2024; 13:e51614. [PMID: 38941147 PMCID: PMC11245650 DOI: 10.2196/51614]
Abstract
BACKGROUND Artificial intelligence (AI) medical devices have the potential to transform existing clinical workflows and ultimately improve patient outcomes. AI medical devices have shown potential for a range of clinical tasks such as diagnostics, prognostics, and therapeutic decision-making such as drug dosing. There is, however, an urgent need to ensure that these technologies remain safe for all populations. Recent literature demonstrates the need for rigorous performance error analysis to identify issues such as algorithmic encoding of spurious correlations (eg, protected characteristics) or specific failure modes that may lead to patient harm. Guidelines for reporting on studies that evaluate AI medical devices require the mention of performance error analysis; however, there is still a lack of understanding around how performance errors should be analyzed in clinical studies, and what harms authors should aim to detect and report. OBJECTIVE This systematic review will assess the frequency and severity of AI errors and adverse events (AEs) in randomized controlled trials (RCTs) investigating AI medical devices as interventions in clinical settings. The review will also explore how performance errors are analyzed including whether the analysis includes the investigation of subgroup-level outcomes. METHODS This systematic review will identify and select RCTs assessing AI medical devices. Search strategies will be deployed in MEDLINE (Ovid), Embase (Ovid), Cochrane CENTRAL, and clinical trial registries to identify relevant papers. RCTs identified in bibliographic databases will be cross-referenced with clinical trial registries. The primary outcomes of interest are the frequency and severity of AI errors, patient harms, and reported AEs. Quality assessment of RCTs will be based on version 2 of the Cochrane risk-of-bias tool (RoB2). Data analysis will include a comparison of error rates and patient harms between study arms, and a meta-analysis of the rates of patient harm in control versus intervention arms will be conducted if appropriate. RESULTS The project was registered on PROSPERO in February 2023. Preliminary searches have been completed and the search strategy has been designed in consultation with an information specialist and methodologist. Title and abstract screening started in September 2023. Full-text screening is ongoing and data collection and analysis began in April 2024. CONCLUSIONS Evaluations of AI medical devices have shown promising results; however, reporting of studies has been variable. Detection, analysis, and reporting of performance errors and patient harms is vital to robustly assess the safety of AI medical devices in RCTs. Scoping searches have illustrated that the reporting of harms is variable, often with no mention of AEs. The findings of this systematic review will identify the frequency and severity of AI performance errors and patient harms and generate insights into how errors should be analyzed to account for both overall and subgroup performance. TRIAL REGISTRATION PROSPERO CRD42023387747; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=387747. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) PRR1-10.2196/51614.
Affiliation(s)
- Aditya U Kale
  - Institute of Inflammation and Ageing, University of Birmingham, Birmingham, United Kingdom
  - University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom
  - NIHR Birmingham Biomedical Research Centre, Birmingham, United Kingdom
  - NIHR Incubator for AI and Digital Health Research, Birmingham, United Kingdom
- Henry David Jeffry Hogg
  - Population Health Science Institute, Faculty of Medical Sciences, Newcastle University, Newcastle upon Tyne, United Kingdom
- Russell Pearson
  - Medicines and Healthcare Products Regulatory Agency, London, United Kingdom
- Ben Glocker
  - Kheiron Medical Technologies, London, United Kingdom
  - Department of Computing, Imperial College London, London, United Kingdom
- Su Golder
  - Department of Health Sciences, University of York, York, United Kingdom
- April Coombe
  - Institute of Applied Health Research, University of Birmingham, Birmingham, United Kingdom
- Justin Waring
  - Health Services Management Centre, University of Birmingham, Birmingham, United Kingdom
- Xiaoxuan Liu
  - Institute of Inflammation and Ageing, University of Birmingham, Birmingham, United Kingdom
  - University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom
  - NIHR Birmingham Biomedical Research Centre, Birmingham, United Kingdom
  - NIHR Incubator for AI and Digital Health Research, Birmingham, United Kingdom
- David J Moore
  - Institute of Applied Health Research, University of Birmingham, Birmingham, United Kingdom
- Alastair K Denniston
  - Institute of Inflammation and Ageing, University of Birmingham, Birmingham, United Kingdom
  - University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom
  - NIHR Birmingham Biomedical Research Centre, Birmingham, United Kingdom
  - NIHR Incubator for AI and Digital Health Research, Birmingham, United Kingdom
3
Stanley EAM, Souza R, Winder AJ, Gulve V, Amador K, Wilms M, Forkert ND. Towards objective and systematic evaluation of bias in artificial intelligence for medical imaging. J Am Med Inform Assoc 2024. [PMID: 38942737 DOI: 10.1093/jamia/ocae165]
Abstract
OBJECTIVE Artificial intelligence (AI) models trained using medical images for clinical tasks often exhibit bias in the form of subgroup performance disparities. However, since not all sources of bias in real-world medical imaging data are easily identifiable, it is challenging to comprehensively assess their impacts. In this article, we introduce an analysis framework for systematically and objectively investigating the impact of biases in medical images on AI models. MATERIALS AND METHODS Our framework utilizes synthetic neuroimages with known disease effects and sources of bias. We evaluated the impact of bias effects and the efficacy of 3 bias mitigation strategies in counterfactual data scenarios on a convolutional neural network (CNN) classifier. RESULTS The analysis revealed that training a CNN model on the datasets containing bias effects resulted in expected subgroup performance disparities. Moreover, reweighing was the most successful bias mitigation strategy for this setup. Finally, we demonstrated that explainable AI methods can aid in investigating the manifestation of bias in the model using this framework. DISCUSSION The value of this framework is showcased in our findings on the impact of bias scenarios and efficacy of bias mitigation in a deep learning model pipeline. This systematic analysis can be easily expanded to conduct further controlled in silico trials in other investigations of bias in medical imaging AI. CONCLUSION Our novel methodology for objectively studying bias in medical imaging AI can help support the development of clinical decision-support tools that are robust and responsible.
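For readers unfamiliar with the "reweighing" mitigation that the authors found most effective in this setup, a minimal sketch of the standard reweighing scheme (Kamiran and Calders style) is given below; the column names are assumptions, and this is not the authors' implementation.

```python
# Minimal sketch of reweighing: per-sample weights that make subgroup membership
# and the class label statistically independent under the weighted distribution.
import numpy as np
import pandas as pd

def reweighing_weights(df, group_col="subgroup", label_col="label"):
    """Weight = P(group) * P(label) / P(group, label) for each sample's cell."""
    n = len(df)
    p_group = df[group_col].value_counts(normalize=True)
    p_label = df[label_col].value_counts(normalize=True)
    p_joint = df.groupby([group_col, label_col]).size() / n
    weights = df.apply(
        lambda r: p_group[r[group_col]] * p_label[r[label_col]]
        / p_joint[(r[group_col], r[label_col])],
        axis=1,
    )
    return weights.values

# The resulting weights can be passed as per-sample weights to the training
# loss (e.g. weighted cross-entropy) of the CNN classifier.
```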
Affiliation(s)
- Emma A M Stanley
  - Biomedical Engineering Graduate Program, University of Calgary, Calgary, Alberta, T2N 1N4, Canada
  - Department of Radiology, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
- Raissa Souza
  - Biomedical Engineering Graduate Program, University of Calgary, Calgary, Alberta, T2N 1N4, Canada
  - Department of Radiology, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
- Anthony J Winder
  - Department of Radiology, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
- Vedant Gulve
  - Department of Radiology, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
- Kimberly Amador
  - Biomedical Engineering Graduate Program, University of Calgary, Calgary, Alberta, T2N 1N4, Canada
  - Department of Radiology, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
- Matthias Wilms
  - Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Department of Pediatrics, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Department of Community Health Sciences, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
- Nils D Forkert
  - Department of Radiology, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Hotchkiss Brain Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Alberta Children's Hospital Research Institute, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Department of Community Health Sciences, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Department of Clinical Neuroscience, University of Calgary, Calgary, Alberta, T2N 4N1, Canada
  - Department of Electrical and Software Engineering, University of Calgary, Calgary, Alberta, T2N 1N4, Canada
4
Meerwijk EL, McElfresh DC, Martins S, Tamang SR. Evaluating accuracy and fairness of clinical decision support algorithms when health care resources are limited. J Biomed Inform 2024; 156:104664. [PMID: 38851413 DOI: 10.1016/j.jbi.2024.104664]
Abstract
OBJECTIVE Guidance on how to evaluate accuracy and algorithmic fairness across subgroups is missing for clinical models that flag patients for an intervention when health care resources to administer that intervention are limited. We aimed to propose a framework of metrics that would fit this specific use case. METHODS We evaluated the following metrics and applied them to a Veterans Health Administration clinical model that flags patients for intervention who are at risk of overdose or a suicidal event among outpatients who were prescribed opioids (N = 405,817): receiver-operating characteristic curve and area under the curve, precision-recall curve, calibration-reliability curve, false positive rate, false negative rate, and false omission rate. In addition, we developed a new approach to visualize false positives and false negatives that we named 'per true positive bars.' We demonstrate the utility of these metrics for our use case for three cohorts of patients at the highest risk (top 0.5%, 1.0%, and 5.0%) by evaluating algorithmic fairness across the following age groups: ≤30, 31-50, 51-65, and >65 years old. RESULTS Metrics that allowed us to assess group differences more clearly were the false positive rate, false negative rate, false omission rate, and the new 'per true positive bars'. Metrics with limited utility for our use case were the receiver-operating characteristic curve and area under the curve, the calibration-reliability curve, and the precision-recall curve. CONCLUSION There is no "one size fits all" approach to model performance monitoring and bias analysis. Our work informs future researchers and clinicians who seek to evaluate accuracy and fairness of predictive models that identify patients to intervene on in the context of limited health care resources. In terms of ease of interpretation and utility for our use case, the new 'per true positive bars' may be the most intuitive to a range of stakeholders and facilitates choosing a threshold that allows weighing false positives against false negatives, which is especially important when predicting severe adverse events.
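The capacity-limited evaluation described here — flagging only the highest-risk fraction of patients and then comparing error rates across age groups — can be sketched as follows; the function, the example flag fraction, and the variable names are illustrative assumptions, not the study's code.

```python
# Sketch of the error metrics the authors found most informative (FPR, FNR,
# false omission rate), computed per age group at a capacity-limited cut-off,
# e.g. only the top 1% riskiest patients can receive the intervention.
import numpy as np

def capacity_limited_errors(y_true, risk_score, groups, flag_fraction=0.01):
    cutoff = np.quantile(risk_score, 1.0 - flag_fraction)  # flag top fraction
    flagged = risk_score >= cutoff
    out = {}
    for g in np.unique(groups):
        m = groups == g
        tp = np.sum(flagged[m] & (y_true[m] == 1))
        fp = np.sum(flagged[m] & (y_true[m] == 0))
        fn = np.sum(~flagged[m] & (y_true[m] == 1))
        tn = np.sum(~flagged[m] & (y_true[m] == 0))
        out[g] = {
            "FPR": fp / (fp + tn) if (fp + tn) else float("nan"),
            "FNR": fn / (fn + tp) if (fn + tp) else float("nan"),
            # false omission rate: share of un-flagged patients who had the event
            "FOR": fn / (fn + tn) if (fn + tn) else float("nan"),
        }
    return out
```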
Affiliation(s)
- Esther L Meerwijk
  - Program Evaluation and Resource Center, Office of Mental Health and Suicide Prevention, Department of Veterans Affairs, Menlo Park, CA, USA
  - VA Health Systems Research, Center for Innovation to Implementation (Ci2i), VA Palo Alto Health Care System, Menlo Park, CA, USA
- Duncan C McElfresh
  - Program Evaluation and Resource Center, Office of Mental Health and Suicide Prevention, Department of Veterans Affairs, Menlo Park, CA, USA
- Susana Martins
  - Program Evaluation and Resource Center, Office of Mental Health and Suicide Prevention, Department of Veterans Affairs, Menlo Park, CA, USA
- Suzanne R Tamang
  - Program Evaluation and Resource Center, Office of Mental Health and Suicide Prevention, Department of Veterans Affairs, Menlo Park, CA, USA
  - VA Health Systems Research, Center for Innovation to Implementation (Ci2i), VA Palo Alto Health Care System, Menlo Park, CA, USA
  - Department of Medicine, Stanford University, Stanford, CA, USA
5
Kotter E, Pinto Dos Santos D. [Ethics and artificial intelligence]. Radiologie (Heidelberg, Germany) 2024; 64:498-502. [PMID: 38499692 DOI: 10.1007/s00117-024-01286-0]
Abstract
The introduction of artificial intelligence (AI) into radiology promises to enhance efficiency and improve diagnostic accuracy, yet it also raises manifold ethical questions. These include data protection issues, the future role of radiologists, liability when using AI systems, and the avoidance of bias. To prevent data bias, the datasets need to be compiled carefully and to be representative of the target population. Accordingly, the upcoming European Union AI act sets particularly high requirements for the datasets used in training medical AI systems. Cognitive bias occurs when radiologists place too much trust in the results provided by AI systems (overreliance). So far, diagnostic AI systems are used almost exclusively as "second look" systems. If diagnostic AI systems are to be used in the future as "first look" systems or even as autonomous AI systems in order to enhance efficiency in radiology, the question of liability needs to be addressed, comparable to liability for autonomous driving. Such use of AI would also significantly change the role of radiologists.
Affiliation(s)
- Elmar Kotter
  - Klinik für Diagnostische und Interventionelle Radiologie, Universitätsklinikum Freiburg, Hugstetterstr. 55, 79106 Freiburg, Germany
- Daniel Pinto Dos Santos
  - Institut für Diagnostische und Interventionelle Radiologie, Uniklinik Köln, Kerpener Str. 62, 50937 Köln, Germany
  - Institut für Diagnostische und Interventionelle Radiologie, Universitätsklinik Frankfurt, Theodor-Stern-Kai 7, 60596 Frankfurt am Main, Germany
6
Restrepo D, Wu C, Vásquez-Venegas C, Nakayama LF, Celi LA, López DM. DF-DM: A foundational process model for multimodal data fusion in the artificial intelligence era. Research Square 2024 (preprint). [PMID: 38746100 PMCID: PMC11092829 DOI: 10.21203/rs.3.rs-4277992/v1]
Abstract
In the big data era, integrating diverse data modalities poses significant challenges, particularly in complex fields like healthcare. This paper introduces a new process model for multimodal Data Fusion for Data Mining, integrating embeddings and the Cross-Industry Standard Process for Data Mining with the existing Data Fusion Information Group model. Our model aims to decrease computational costs, complexity, and bias while improving efficiency and reliability. We also propose "disentangled dense fusion," a novel embedding fusion method designed to optimize mutual information and facilitate dense inter-modality feature interaction, thereby minimizing redundant information. We demonstrate the model's efficacy through three use cases: predicting diabetic retinopathy using retinal images and patient metadata, domestic violence prediction employing satellite imagery, internet, and census data, and identifying clinical and demographic features from radiography images and clinical notes. The model achieved a Macro F1 score of 0.92 in diabetic retinopathy prediction, an R-squared of 0.854 and sMAPE of 24.868 in domestic violence prediction, and a macro AUC of 0.92 and 0.99 for disease prediction and sex classification, respectively, in radiological analysis. These results underscore the Data Fusion for Data Mining model's potential to significantly impact multimodal data processing, promoting its adoption in diverse, resource-constrained settings.
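As a point of reference for the embedding-based fusion discussed above, the sketch below shows a generic late-fusion baseline: concatenating per-patient embeddings from two modalities and fitting a classifier. It is deliberately simple and is not the paper's "disentangled dense fusion" method; all variable names are placeholders.

```python
# Generic late-fusion baseline over modality embeddings (e.g. retinal image
# embeddings plus patient-metadata embeddings). Illustration only; this is not
# the DF-DM fusion scheme proposed by the authors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def late_fusion_classifier(img_emb, meta_emb, labels):
    """Standardise each modality's embedding, concatenate, and fit a classifier."""
    fused = np.concatenate(
        [StandardScaler().fit_transform(img_emb),
         StandardScaler().fit_transform(meta_emb)],
        axis=1,
    )
    return LogisticRegression(max_iter=1000).fit(fused, labels)
```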
Affiliation(s)
- David Restrepo
  - Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
  - Departamento de Telemática, Universidad del Cauca, Popayán, Cauca, Colombia
- Chenwei Wu
  - Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, United States of America
- Luis Filipe Nakayama
  - Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
  - Department of Ophthalmology, São Paulo Federal University, São Paulo, São Paulo, Brazil
- Leo Anthony Celi
  - Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
  - Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, United States of America
  - Department of Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
- Diego M López
  - Departamento de Telemática, Universidad del Cauca, Popayán, Cauca, Colombia
7
Vaidya A, Chen RJ, Williamson DFK, Song AH, Jaume G, Yang Y, Hartvigsen T, Dyer EC, Lu MY, Lipkova J, Shaban M, Chen TY, Mahmood F. Demographic bias in misdiagnosis by computational pathology models. Nat Med 2024; 30:1174-1190. [PMID: 38641744 DOI: 10.1038/s41591-024-02885-z]
Abstract
Despite increasing numbers of regulatory approvals, deep learning-based computational pathology systems often overlook the impact of demographic factors on performance, potentially leading to biases. This concern is all the more important as computational pathology has leveraged large public datasets that underrepresent certain demographic groups. Using publicly available data from The Cancer Genome Atlas and the EBRAINS brain tumor atlas, as well as internal patient data, we show that whole-slide image classification models display marked performance disparities across different demographic groups when used to subtype breast and lung carcinomas and to predict IDH1 mutations in gliomas. For example, when using common modeling approaches, we observed performance gaps (in area under the receiver operating characteristic curve) between white and Black patients of 3.0% for breast cancer subtyping, 10.9% for lung cancer subtyping and 16.0% for IDH1 mutation prediction in gliomas. We found that richer feature representations obtained from self-supervised vision foundation models reduce performance variations between groups. These representations provide improvements upon weaker models even when those weaker models are combined with state-of-the-art bias mitigation strategies and modeling choices. Nevertheless, self-supervised vision foundation models do not fully eliminate these discrepancies, highlighting the continuing need for bias mitigation efforts in computational pathology. Finally, we demonstrate that our results extend to other demographic factors beyond patient race. Given these findings, we encourage regulatory and policy agencies to integrate demographic-stratified evaluation into their assessment guidelines.
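The demographic-stratified evaluation the authors recommend amounts to reporting per-group performance and its gap from the overall figure. A hedged sketch is shown below; variable names are placeholders, not the study's data or code.

```python
# Sketch of demographic-stratified evaluation: per-race AUROC and the gap
# relative to the overall AUROC of a slide-level classifier.
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auc_gaps(y_true, y_score, race):
    overall = roc_auc_score(y_true, y_score)
    per_group, gaps = {}, {}
    for g in np.unique(race):
        m = race == g
        if len(np.unique(y_true[m])) < 2:   # AUROC undefined with one class only
            continue
        per_group[g] = roc_auc_score(y_true[m], y_score[m])
        gaps[g] = overall - per_group[g]
    return overall, per_group, gaps
```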
Affiliation(s)
- Anurag Vaidya
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
  - Health Sciences and Technology, Harvard-MIT, Cambridge, MA, USA
- Richard J Chen
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Drew F K Williamson
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology and Laboratory Medicine, Emory University School of Medicine, Atlanta, GA, USA
- Andrew H Song
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
- Guillaume Jaume
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
- Yuzhe Yang
  - Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
- Thomas Hartvigsen
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Emma C Dyer
  - T.H. Chan School of Public Health, Harvard University, Cambridge, MA, USA
- Ming Y Lu
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
  - Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
- Jana Lipkova
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
- Muhammad Shaban
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
- Tiffany Y Chen
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
- Faisal Mahmood
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
  - Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA
8
Wang R, Kuo PC, Chen LC, Seastedt KP, Gichoya JW, Celi LA. Drop the shortcuts: image augmentation improves fairness and decreases AI detection of race and other demographics from medical images. EBioMedicine 2024; 102:105047. [PMID: 38471396 PMCID: PMC10945176 DOI: 10.1016/j.ebiom.2024.105047]
Abstract
BACKGROUND It has been shown that AI models can learn race from medical images, leading to algorithmic bias. Our aim in this study was to enhance the fairness of medical image models by eliminating bias related to race, age, and sex. We hypothesise that models may be learning demographics via shortcut learning, and we combat this using image augmentation. METHODS This study included 44,953 patients who identified as Asian, Black, or White (mean age, 60.68 years ±18.21; 23,499 women) for a total of 194,359 chest X-rays (CXRs) from the MIMIC-CXR database. The CheXpert dataset, comprising 45,095 patients (mean age, 63.10 years ±18.14; 20,437 women) for a total of 134,300 CXRs, was used for external validation. We also collected 1195 3D brain magnetic resonance imaging (MRI) scans from the ADNI database, covering 273 participants with an average age of 76.97 years ±14.22, of whom 142 were female. DL models were trained on either non-augmented or augmented images and assessed using disparity metrics. The features learned by the models were analysed using task transfer experiments and model visualisation techniques. FINDINGS In the detection of radiological findings, training a model using augmented CXR images was shown to reduce disparities in error rate among racial groups (-5.45%), age groups (-13.94%), and sex (-22.22%). For Alzheimer's disease detection, the model trained with augmented MRI images showed 53.11% and 31.01% reductions of disparities in error rate among age and sex groups, respectively. Image augmentation led to a reduction in the model's ability to identify demographic attributes and resulted in the model trained for clinical purposes incorporating fewer demographic features. INTERPRETATION The model trained using the augmented images was less likely to be influenced by demographic information in detecting image labels. These results demonstrate that the proposed augmentation scheme could enhance the fairness of interpretations by DL models when dealing with data from patients with different demographic backgrounds. FUNDING National Science and Technology Council (Taiwan), National Institutes of Health.
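A rough idea of the kind of training-time image augmentation described above can be conveyed with a torchvision-style pipeline; the specific transforms and parameters below are assumptions for illustration rather than the authors' exact scheme.

```python
# Illustrative augmentation pipeline for training-set CXR images. Perturbing
# low-level cues during training can make it harder for the model to rely on
# demographic shortcuts, while the disease label is preserved.
from torchvision import transforms

train_augmentations = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # crop / zoom jitter
    transforms.RandomRotation(degrees=10),                  # small rotations
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # intensity perturbation
    transforms.ToTensor(),
])
```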
Affiliation(s)
- Ryan Wang
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
- Po-Chih Kuo
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
- Li-Ching Chen
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
- Kenneth Patrick Seastedt
  - Department of Surgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
  - Department of Thoracic Surgery, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
- Leo Anthony Celi
  - Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Division of Pulmonary Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
9
Meissen F, Breuer S, Knolle M, Buyx A, Müller R, Kaissis G, Wiestler B, Rückert D. (Predictable) performance bias in unsupervised anomaly detection. EBioMedicine 2024; 101:105002. [PMID: 38335791 PMCID: PMC10873649 DOI: 10.1016/j.ebiom.2024.105002]
Abstract
BACKGROUND With the ever-increasing amount of medical imaging data, the demand for algorithms to assist clinicians has amplified. Unsupervised anomaly detection (UAD) models promise to aid in the crucial first step of disease detection. While previous studies have thoroughly explored fairness in supervised models in healthcare, for UAD, this has so far been unexplored. METHODS In this study, we evaluated how dataset composition regarding subgroups manifests in disparate performance of UAD models along multiple protected variables on three large-scale publicly available chest X-ray datasets. Our experiments were validated using two state-of-the-art UAD models for medical images. Finally, we introduced subgroup-AUROC (sAUROC), which aids in quantifying fairness in machine learning. FINDINGS Our experiments revealed empirical "fairness laws" (similar to "scaling laws" for Transformers) for training-dataset composition: Linear relationships between anomaly detection performance within a subpopulation and its representation in the training data. Our study further revealed performance disparities, even in the case of balanced training data, and compound effects that exacerbate the drop in performance for subjects associated with multiple adversely affected groups. INTERPRETATION Our study quantified the disparate performance of UAD models against certain demographic subgroups. Importantly, we showed that this unfairness cannot be mitigated by balanced representation alone. Instead, the representation of some subgroups seems harder to learn by UAD models than that of others. The empirical "fairness laws" discovered in our study make disparate performance in UAD models easier to estimate and aid in determining the most desirable dataset composition. FUNDING European Research Council Deep4MI.
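Subgroup-stratified AUROC of an anomaly detector can be sketched as below; this is one plausible formalisation for illustration, and readers should consult the paper for the precise sAUROC definition. Variable names are placeholders.

```python
# Sketch of a subgroup-AUROC-style metric: anomaly-detection AUROC evaluated
# within each protected subgroup of the test set.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(anomaly_score, is_anomalous, subgroup):
    scores = {}
    for g in np.unique(subgroup):
        m = subgroup == g
        if len(np.unique(is_anomalous[m])) < 2:
            continue  # need both normal and anomalous cases within the subgroup
        scores[g] = roc_auc_score(is_anomalous[m], anomaly_score[m])
    return scores

# Plotting these per-subgroup scores against each subgroup's share of the
# training data is one way to visualise the "fairness laws" described above.
```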
Affiliation(s)
- Felix Meissen
  - Chair for AI in Healthcare and Medicine, Klinikum rechts der Isar der Technischen Universität München, Einsteinstr. 25, Munich, 81675, Germany
- Svenja Breuer
  - Department of Science, Technology and Society, School of Social Sciences and Technology, Technical University of Munich, Arcisstr. 21, Munich, 80333, Germany
  - Department of Economics and Policy, School of Management, Technical University of Munich, Arcisstraße 21, 80333, Munich, Germany
- Moritz Knolle
  - Chair for AI in Healthcare and Medicine, Klinikum rechts der Isar der Technischen Universität München, Einsteinstr. 25, Munich, 81675, Germany
  - Konrad Zuse School of Excellence in Reliable AI, Munich Data Science Institute (MDSI), Walther-von-Dyck-Str. 10, Garching, 85748, Germany
- Alena Buyx
  - Department of Science, Technology and Society, School of Social Sciences and Technology, Technical University of Munich, Arcisstr. 21, Munich, 80333, Germany
  - Institute for History and Ethics of Medicine, School of Medicine, Technical University of Munich, Prinzregentenstraße 68, Munich, 81675, Germany
- Ruth Müller
  - Department of Science, Technology and Society, School of Social Sciences and Technology, Technical University of Munich, Arcisstr. 21, Munich, 80333, Germany
  - Department of Economics and Policy, School of Management, Technical University of Munich, Arcisstraße 21, 80333, Munich, Germany
- Georgios Kaissis
  - Chair for AI in Healthcare and Medicine, Klinikum rechts der Isar der Technischen Universität München, Einsteinstr. 25, Munich, 81675, Germany
  - Institute for Machine Learning in Biomedical Imaging, Helmholtz Munich, Ingolstädter Landstraße 1, 85764, Neuherberg, Germany
  - Department of Computing, Imperial College London, London, SW7 2AZ, UK
- Benedikt Wiestler
  - Department of Diagnostic and Interventional Neuroradiology, Klinikum rechts der Isar, Ismaninger Str. 22, Munich, 81675, Germany
  - TranslaTUM, Center for Translational Cancer Research, Technical University of Munich, Ismaninger Str. 22, Munich, 81675, Germany
- Daniel Rückert
  - Chair for AI in Healthcare and Medicine, Klinikum rechts der Isar der Technischen Universität München, Einsteinstr. 25, Munich, 81675, Germany
  - Department of Computing, Imperial College London, London, SW7 2AZ, UK
10
Khara G, Trivedi H, Newell MS, Patel R, Rijken T, Kecskemethy P, Glocker B. Generalisable deep learning method for mammographic density prediction across imaging techniques and self-reported race. Communications Medicine 2024; 4:21. [PMID: 38374436 PMCID: PMC10876691 DOI: 10.1038/s43856-024-00446-6]
Abstract
BACKGROUND Breast density is an important risk factor for breast cancer complemented by a higher risk of cancers being missed during screening of dense breasts due to reduced sensitivity of mammography. Automated, deep learning-based prediction of breast density could provide subject-specific risk assessment and flag difficult cases during screening. However, there is a lack of evidence for generalisability across imaging techniques and, importantly, across race. METHODS This study used a large, racially diverse dataset with 69,697 mammographic studies comprising 451,642 individual images from 23,057 female participants. A deep learning model was developed for four-class BI-RADS density prediction. A comprehensive performance evaluation assessed the generalisability across two imaging techniques, full-field digital mammography (FFDM) and two-dimensional synthetic (2DS) mammography. A detailed subgroup performance and bias analysis assessed the generalisability across participants' race. RESULTS Here we show that a model trained on FFDM-only achieves a 4-class BI-RADS classification accuracy of 80.5% (79.7-81.4) on FFDM and 79.4% (78.5-80.2) on unseen 2DS data. When trained on both FFDM and 2DS images, the performance increases to 82.3% (81.4-83.0) and 82.3% (81.3-83.1). Racial subgroup analysis shows unbiased performance across Black, White, and Asian participants, despite a separate analysis confirming that race can be predicted from the images with a high accuracy of 86.7% (86.0-87.4). CONCLUSIONS Deep learning-based breast density prediction generalises across imaging techniques and race. No substantial disparities are found for any subgroup, including races that were never seen during model development, suggesting that density predictions are unbiased.
Affiliation(s)
- Hari Trivedi
  - Winship Cancer Institute, Emory University, Atlanta, GA, USA
- Mary S Newell
  - Winship Cancer Institute, Emory University, Atlanta, GA, USA
- Ravi Patel
  - Kheiron Medical Technologies, London, UK
- Ben Glocker
  - Kheiron Medical Technologies, London, UK
  - Department of Computing, Imperial College London, London, UK
11
Weng WH, Sellergen A, Kiraly AP, D'Amour A, Park J, Pilgrim R, Pfohl S, Lau C, Natarajan V, Azizi S, Karthikesalingam A, Cole-Lewis H, Matias Y, Corrado GS, Webster DR, Shetty S, Prabhakara S, Eswaran K, Celi LAG, Liu Y. An intentional approach to managing bias in general purpose embedding models. Lancet Digit Health 2024; 6:e126-e130. [PMID: 38278614 DOI: 10.1016/s2589-7500(23)00227-3]
Abstract
Advances in machine learning for health care have brought concerns about bias from the research community; specifically, the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises a question about how to best design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components-GPPEs-from learning sensitive attributes can have unintended consequences on the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. We present reasons, by building on previously published data, to support the reasoning that GPPEs should ideally contain as much information as the original data contain, and highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.
Affiliation(s)
- Leo A G Celi
  - Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Yun Liu
  - Google, Mountain View, CA, USA
12
Glocker B, Jones C, Roschewitz M, Winzeck S. Risk of Bias in Chest Radiography Deep Learning Foundation Models. Radiol Artif Intell 2023; 5:e230060. [PMID: 38074789 PMCID: PMC10698597 DOI: 10.1148/ryai.230060]
Abstract
PURPOSE To analyze a recently published chest radiography foundation model for the presence of biases that could lead to subgroup performance disparities across biologic sex and race. MATERIALS AND METHODS This Health Insurance Portability and Accountability Act-compliant retrospective study used 127 118 chest radiographs from 42 884 patients (mean age, 63 years ± 17 [SD]; 23 623 male, 19 261 female) from the CheXpert dataset that were collected between October 2002 and July 2017. To determine the presence of bias in features generated by a chest radiography foundation model and baseline deep learning model, dimensionality reduction methods together with two-sample Kolmogorov-Smirnov tests were used to detect distribution shifts across sex and race. A comprehensive disease detection performance analysis was then performed to associate any biases in the features to specific disparities in classification performance across patient subgroups. RESULTS Ten of 12 pairwise comparisons across biologic sex and race showed statistically significant differences in the studied foundation model, compared with four significant tests in the baseline model. Significant differences were found between male and female (P < .001) and Asian and Black (P < .001) patients in the feature projections that primarily capture disease. Compared with average model performance across all subgroups, classification performance on the "no finding" label decreased between 6.8% and 7.8% for female patients, and performance in detecting "pleural effusion" decreased between 10.7% and 11.6% for Black patients. CONCLUSION The studied chest radiography foundation model demonstrated racial and sex-related bias, which led to disparate performance across patient subgroups; thus, this model may be unsafe for clinical applications.
Keywords: Conventional Radiography, Computer Application-Detection/Diagnosis, Chest Radiography, Bias, Foundation Models. Supplemental material is available for this article. Published under a CC BY 4.0 license. See also commentary by Czum and Parr in this issue.
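The feature-level bias check described in the methods — dimensionality reduction followed by two-sample Kolmogorov-Smirnov tests across subgroups — can be approximated with the following sketch; PCA stands in for the paper's dimensionality-reduction step, and all variable names are placeholders.

```python
# Rough sketch: reduce foundation-model features with PCA and test whether
# their distribution differs between two subgroups using two-sample KS tests.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.decomposition import PCA

def feature_shift_tests(features, groups, group_a, group_b, n_components=4):
    z = PCA(n_components=n_components).fit_transform(features)
    results = {}
    for c in range(n_components):
        stat, p = ks_2samp(z[groups == group_a, c], z[groups == group_b, c])
        results[f"PC{c + 1}"] = (stat, p)
    return results  # small p-values suggest a distribution shift in that component

# e.g. feature_shift_tests(embeddings, sex, "Male", "Female")
```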
Affiliation(s)
- Ben Glocker
  - Department of Computing, Imperial College London, South Kensington Campus, London SW7 2AZ, United Kingdom
- Charles Jones
  - Department of Computing, Imperial College London, South Kensington Campus, London SW7 2AZ, United Kingdom
- Mélanie Roschewitz
  - Department of Computing, Imperial College London, South Kensington Campus, London SW7 2AZ, United Kingdom
- Stefan Winzeck
  - Department of Computing, Imperial College London, South Kensington Campus, London SW7 2AZ, United Kingdom
13
Baughan N, Whitney HM, Drukker K, Sahiner B, Hu T, Kim GH, McNitt-Gray M, Myers KJ, Giger ML. Sequestration of imaging studies in MIDRC: stratified sampling to balance demographic characteristics of patients in a multi-institutional data commons. J Med Imaging (Bellingham) 2023; 10:064501. [PMID: 38074627 PMCID: PMC10704184 DOI: 10.1117/1.jmi.10.6.064501]
Abstract
Purpose The Medical Imaging and Data Resource Center (MIDRC) is a multi-institutional effort to accelerate medical imaging machine intelligence research and create a publicly available image repository/commons as well as a sequestered commons for performance evaluation and benchmarking of algorithms. After de-identification, approximately 80% of the medical images and associated metadata become part of the open commons and 20% are sequestered from the open commons. To ensure that both commons are representative of the population available, we introduced a stratified sampling method to balance the demographic characteristics across the two datasets. Approach Our method uses multi-dimensional stratified sampling where several demographic variables of interest are sequentially used to separate the data into individual strata, each representing a unique combination of variables. Within each resulting stratum, patients are assigned to the open or sequestered commons. This algorithm was used on an example dataset containing 5000 patients using the variables of race, age, sex at birth, ethnicity, COVID-19 status, and image modality and compared resulting demographic distributions to naïve random sampling of the dataset over 2000 independent trials. Results Resulting prevalence of each demographic variable matched the prevalence from the input dataset within one standard deviation. Mann-Whitney U test results supported the hypothesis that sequestration by stratified sampling provided more balanced subsets than naïve randomization, except for demographic subcategories with very low prevalence. Conclusions The developed multi-dimensional stratified sampling algorithm can partition a large dataset while maintaining balance across several variables, superior to the balance achieved from naïve randomization.
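The multi-dimensional stratified sampling described in the approach can be sketched in a few lines of pandas; the column names and random seed below are illustrative assumptions, while the approximately 80/20 split follows the abstract.

```python
# Simplified sketch of multi-dimensional stratified sequestration: form strata
# from the joint demographic variables and split each stratum ~80/20 into the
# open and sequestered commons.
import numpy as np
import pandas as pd

def stratified_sequestration(patients: pd.DataFrame,
                             strata_cols=("race", "sex", "age_bin",
                                          "ethnicity", "covid_status", "modality"),
                             open_fraction=0.8, seed=0):
    rng = np.random.default_rng(seed)
    assignments = pd.Series(index=patients.index, dtype=object)
    for _, stratum in patients.groupby(list(strata_cols), dropna=False):
        idx = stratum.index.to_numpy()
        rng.shuffle(idx)                      # randomise order within the stratum
        n_open = int(round(open_fraction * len(idx)))
        assignments.loc[idx[:n_open]] = "open"
        assignments.loc[idx[n_open:]] = "sequestered"
    return assignments
```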
Affiliation(s)
- Natalie Baughan
  - University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Heather M. Whitney
  - University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Karen Drukker
  - University of Chicago, Department of Radiology, Chicago, Illinois, United States
- Berkman Sahiner
  - US Food and Drug Administration, Bethesda, Maryland, United States
- Tingting Hu
  - US Food and Drug Administration, Bethesda, Maryland, United States
- Grace Hyun Kim
  - University of California, Los Angeles, Los Angeles, California, United States
- Michael McNitt-Gray
  - University of California, Los Angeles, Los Angeles, California, United States
- Maryellen L. Giger
  - University of Chicago, Department of Radiology, Chicago, Illinois, United States
14
Brown A, Tomasev N, Freyberg J, Liu Y, Karthikesalingam A, Schrouff J. Detecting shortcut learning for fair medical AI using shortcut testing. Nat Commun 2023; 14:4314. [PMID: 37463884 DOI: 10.1038/s41467-023-39902-7]
Abstract
Machine learning (ML) holds great promise for improving healthcare, but it is critical to ensure that its use will not propagate or amplify health disparities. An important step is to characterize the (un)fairness of ML models-their tendency to perform differently across subgroups of the population-and to understand its underlying mechanisms. One potential driver of algorithmic unfairness, shortcut learning, arises when ML models base predictions on improper correlations in the training data. Diagnosing this phenomenon is difficult as sensitive attributes may be causally linked with disease. Using multitask learning, we propose a method to directly test for the presence of shortcut learning in clinical ML systems and demonstrate its application to clinical tasks in radiology and dermatology. Finally, our approach reveals instances when shortcutting is not responsible for unfairness, highlighting the need for a holistic approach to fairness mitigation in medical AI.
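A schematic of the multitask setup that shortcut testing builds on — a shared encoder with a clinical head and an auxiliary demographic head whose influence can be varied — is sketched below. This is a conceptual illustration under stated assumptions, not the authors' implementation; the paper describes modulating how strongly the encoder represents the sensitive attribute, and here a simple loss-weight sweep stands in for that idea.

```python
# Conceptual PyTorch sketch: shared encoder, disease head, and an auxiliary
# head predicting a sensitive attribute, with a tunable auxiliary loss weight.
import torch
import torch.nn as nn

class MultitaskModel(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, n_attr_classes: int):
        super().__init__()
        self.encoder = encoder
        self.disease_head = nn.Linear(feat_dim, 1)
        self.attribute_head = nn.Linear(feat_dim, n_attr_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.disease_head(z), self.attribute_head(z)

def multitask_loss(disease_logit, attr_logit, disease_y, attr_y, attr_weight=0.1):
    bce = nn.functional.binary_cross_entropy_with_logits(
        disease_logit.squeeze(-1), disease_y.float())
    ce = nn.functional.cross_entropy(attr_logit, attr_y)
    # Sweeping attr_weight while tracking the fairness gap of the disease head
    # indicates whether encoding of the attribute (a shortcut) drives unfairness.
    return bce + attr_weight * ce
```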
Affiliation(s)
- Yuan Liu
  - Google Research, Palo Alto, CA, USA
15
Petersen E, Holm S, Ganz M, Feragen A. The path toward equal performance in medical machine learning. Patterns (New York, N.Y.) 2023; 4:100790. [PMID: 37521051 PMCID: PMC10382979 DOI: 10.1016/j.patter.2023.100790]
Abstract
To ensure equitable quality of care, differences in machine learning model performance between patient groups must be addressed. Here, we argue that two separate mechanisms can cause performance differences between groups. First, model performance may be worse than theoretically achievable in a given group. This can occur due to a combination of group underrepresentation, modeling choices, and the characteristics of the prediction task at hand. We examine scenarios in which underrepresentation leads to underperformance, scenarios in which it does not, and the differences between them. Second, the optimal achievable performance may also differ between groups due to differences in the intrinsic difficulty of the prediction task. We discuss several possible causes of such differences in task difficulty. In addition, challenges such as label biases and selection biases may confound both learning and performance evaluation. We highlight consequences for the path toward equal performance, and we emphasize that leveling up model performance may require gathering not only more data from underperforming groups but also better data. Throughout, we ground our discussion in real-world medical phenomena and case studies while also referencing relevant statistical theory.
Affiliation(s)
- Eike Petersen
  - DTU Compute, Technical University of Denmark, Richard Petersens Plads, 2800 Kgs. Lyngby, Denmark
  - Pioneer Centre for AI, Øster Voldgade 3, 1350 Copenhagen, Denmark
- Sune Holm
  - Pioneer Centre for AI, Øster Voldgade 3, 1350 Copenhagen, Denmark
  - Department of Food and Resource Economics, University of Copenhagen, Rolighedsvej 23, 1958 Frederiksberg C., Denmark
- Melanie Ganz
  - Pioneer Centre for AI, Øster Voldgade 3, 1350 Copenhagen, Denmark
  - Department of Computer Science, University of Copenhagen, Universitetsparken 1, 2100 Copenhagen, Denmark
  - Neurobiology Research Unit, Rigshospitalet, Inge Lehmanns Vej 6–8, 2100 Copenhagen, Denmark
- Aasa Feragen
  - DTU Compute, Technical University of Denmark, Richard Petersens Plads, 2800 Kgs. Lyngby, Denmark
  - Pioneer Centre for AI, Øster Voldgade 3, 1350 Copenhagen, Denmark
16
Chen RJ, Wang JJ, Williamson DFK, Chen TY, Lipkova J, Lu MY, Sahai S, Mahmood F. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng 2023; 7:719-742. [PMID: 37380750 PMCID: PMC10632090 DOI: 10.1038/s41551-023-01056-8]
Abstract
In healthcare, the development and deployment of insufficiently fair systems of artificial intelligence (AI) can undermine the delivery of equitable care. Assessments of AI models stratified across subpopulations have revealed inequalities in how patients are diagnosed, treated and billed. In this Perspective, we outline fairness in machine learning through the lens of healthcare, and discuss how algorithmic biases (in data acquisition, genetic variation and intra-observer labelling variability, in particular) arise in clinical workflows and the resulting healthcare disparities. We also review emerging technology for mitigating biases via disentanglement, federated learning and model explainability, and their role in the development of AI-based software as a medical device.
Affiliation(s)
- Richard J Chen
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
- Judy J Wang
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Boston University School of Medicine, Boston, MA, USA
- Drew F K Williamson
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
- Tiffany Y Chen
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
- Jana Lipkova
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
- Ming Y Lu
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
  - Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Sharifa Sahai
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Faisal Mahmood
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA
  - Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
  - Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA
17
Fairness metrics for health AI: we have a long way to go. EBioMedicine 2023; 90:104525. [PMID: 36924621 PMCID: PMC10114188 DOI: 10.1016/j.ebiom.2023.104525]