51
Lyell D, Wang Y, Coiera E, Magrabi F. More than algorithms: an analysis of safety events involving ML-enabled medical devices reported to the FDA. J Am Med Inform Assoc 2023;30:1227-1236. [PMID: 37071804; PMCID: PMC10280342; DOI: 10.1093/jamia/ocad065]
Abstract
OBJECTIVE To examine the real-world safety problems involving machine learning (ML)-enabled medical devices. MATERIALS AND METHODS We analyzed 266 safety events involving approved ML medical devices reported to the US FDA's MAUDE database between 2015 and October 2021. Events were reviewed against an existing framework for safety problems with Health IT to identify whether a reported problem was due to the ML device (device problem) or its use (use problem), and to identify key contributors to the problem. Consequences of events were also classified. RESULTS Events described hazards with potential to harm (66%), actual harm (16%), consequences for healthcare delivery (9%), near misses that would have led to harm if not for intervention (4%), no harm or consequences (3%), and complaints (2%). While most events involved device problems (93%), use problems (7%) were 4 times more likely to result in harm (relative risk 4.2; 95% CI 2.5-7). Problems with data input to ML devices were the top contributor to events (82%). DISCUSSION Much of what is known about ML safety comes from case studies and the theoretical limitations of ML. We contribute a systematic analysis of ML safety problems captured as part of the FDA's routine post-market surveillance. Most problems involved devices and concerned the acquisition of data for processing by algorithms. However, problems with the use of devices were more likely to cause harm. CONCLUSIONS Safety problems with ML devices involve more than algorithms, highlighting the need for a whole-of-system approach to safe implementation, with a special focus on how users interact with devices.
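As a back-of-the-envelope check, a relative risk of the kind reported above can be reproduced from a 2×2 table of harm by problem type. A minimal sketch with hypothetical counts chosen only to illustrate the standard log(RR) confidence interval; the actual event counts are in the paper:

```python
import math

# Hypothetical 2x2 table (illustrative only, not the paper's data):
harmed_use, total_use = 10, 19         # use problems
harmed_device, total_device = 31, 247  # device problems

p_use = harmed_use / total_use
p_device = harmed_device / total_device
rr = p_use / p_device

# 95% CI via the usual normal approximation on log(RR):
# SE = sqrt((1-p1)/a + (1-p2)/c) for a, c harmed counts.
se = math.sqrt((1 - p_use) / harmed_use + (1 - p_device) / harmed_device)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(f"RR = {rr:.1f} (95% CI {lo:.1f}-{hi:.1f})")
```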
Affiliation(s)
- David Lyell
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Ying Wang
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Enrico Coiera
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Farah Magrabi
- Centre for Health Informatics, Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
52
González C, Ranem A, Pinto Dos Santos D, Othman A, Mukhopadhyay A. Lifelong nnU-Net: a framework for standardized medical continual learning. Sci Rep 2023;13:9381. [PMID: 37296233; PMCID: PMC10256748; DOI: 10.1038/s41598-023-34484-2]
Abstract
As the enthusiasm surrounding Deep Learning grows, both medical practitioners and regulatory bodies are exploring ways to safely introduce image segmentation in clinical practice. One frontier to overcome when translating promising research into the clinical open world is the shift from static to continual learning. Continual learning, the practice of training models throughout their lifecycle, is seeing growing interest but is still in its infancy in healthcare. We present Lifelong nnU-Net, a standardized framework that puts continual segmentation in the hands of researchers and clinicians. Built on top of nnU-Net, widely regarded as the best-performing segmenter for multiple medical applications, and equipped with all necessary modules for training and testing models sequentially, the framework ensures broad applicability and lowers the barrier to evaluating new methods in a continual fashion. Our benchmark results across three medical segmentation use cases and five continual learning methods give a comprehensive outlook on the current state of the field and establish a first reproducible benchmark.
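The sequential training-and-evaluation loop that such a framework standardizes can be illustrated generically. A minimal PyTorch sketch of naive sequential fine-tuning with re-evaluation on earlier tasks to expose catastrophic forgetting; the toy tensors and model are assumptions for illustration, not the Lifelong nnU-Net API:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins for a sequence of tasks (e.g., the same anatomy acquired on
# different scanners): 16 features, binary labels.
tasks = [(torch.randn(64, 16), torch.randint(0, 2, (64,))) for _ in range(3)]

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(1) == y).float().mean().item()

# Naive sequential fine-tuning: after each stage, re-evaluate every task
# seen so far, so forgetting on earlier tasks becomes visible.
for stage, (x, y) in enumerate(tasks):
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    seen = [round(accuracy(xs, ys), 2) for xs, ys in tasks[: stage + 1]]
    print(f"after task {stage}: accuracy on tasks seen so far = {seen}")
```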
Affiliation(s)
- Camila González
- Technical University of Darmstadt, Karolinenpl. 5, 64289 Darmstadt, Germany
- Amin Ranem
- Technical University of Darmstadt, Karolinenpl. 5, 64289 Darmstadt, Germany
- Daniel Pinto Dos Santos
- University Hospital Cologne, Kerpener Str. 62, 50937 Cologne, Germany
- University Hospital Frankfurt, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
- Ahmed Othman
- University Medical Center Mainz, Langenbeckstraße 1, 55131 Mainz, Germany
53
Kann BH, Likitlersuang J, Bontempi D, Ye Z, Aneja S, Bakst R, Kelly HR, Juliano AF, Payabvash S, Guenette JP, Uppaluri R, Margalit DN, Schoenfeld JD, Tishler RB, Haddad R, Aerts HJWL, Garcia JJ, Flamand Y, Subramaniam RM, Burtness BA, Ferris RL. Screening for extranodal extension in HPV-associated oropharyngeal carcinoma: evaluation of a CT-based deep learning algorithm in patient data from a multicentre, randomised de-escalation trial. Lancet Digit Health 2023;5:e360-e369. [PMID: 37087370; PMCID: PMC10245380; DOI: 10.1016/S2589-7500(23)00046-8]
Abstract
BACKGROUND Pretreatment identification of pathological extranodal extension (ENE) would guide therapy de-escalation strategies in human papillomavirus (HPV)-associated oropharyngeal carcinoma but is diagnostically challenging. ECOG-ACRIN Cancer Research Group E3311 was a multicentre trial wherein patients with HPV-associated oropharyngeal carcinoma were treated surgically and assigned to a pathological risk-based adjuvant strategy of observation, radiation, or concurrent chemoradiation. Despite protocol exclusion of patients with overt radiographic ENE, more than 30% had pathological ENE and required postoperative chemoradiation. We aimed to evaluate a CT-based deep learning algorithm for prediction of ENE in E3311, a diagnostically challenging cohort wherein algorithm use would be impactful in guiding decision-making. METHODS For this retrospective evaluation of deep learning algorithm performance, we obtained pretreatment CTs and corresponding surgical pathology reports from the multicentre, randomised de-escalation trial E3311. All patients enrolled on E3311 required pretreatment diagnostic head and neck imaging; patients with radiographically overt ENE were excluded per study protocol. The lymph node with the largest short-axis diameter and up to two additional nodes were segmented on each scan and annotated for ENE per pathology reports. Deep learning algorithm performance for ENE prediction was compared with that of four board-certified head and neck radiologists. The primary endpoint was the area under the receiver operating characteristic curve (AUC). FINDINGS From 178 collected scans, 313 nodes were annotated: 71 (23%) with ENE, including 39 (13%) with ENE larger than 1 mm. The deep learning algorithm AUC for ENE classification was 0·86 (95% CI 0·82-0·90), outperforming all readers (p<0·0001 for each). Among radiologists, there was high variability in specificity (43-86%) and sensitivity (45-96%) with poor inter-reader agreement (κ 0·32). Matching the algorithm specificity to that of the reader with the highest AUC (R2, false positive rate 22%) improved sensitivity to 75% (+13%). Setting the algorithm false positive rate to 30% yielded 90% sensitivity. The algorithm showed improved performance compared with radiologists for ENE larger than 1 mm (p<0·0001) and in nodes with short-axis diameter 1 cm or larger. INTERPRETATION The deep learning algorithm outperformed experts in predicting pathological ENE in a challenging cohort of patients with HPV-associated oropharyngeal carcinoma from a randomised clinical trial. Deep learning algorithms should be evaluated prospectively as a treatment selection tool. FUNDING ECOG-ACRIN Cancer Research Group and the National Cancer Institute of the US National Institutes of Health.
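Operating points such as "false positive rate 30% yields 90% sensitivity" come from sliding a decision threshold along the ROC curve. A minimal sketch using scikit-learn; all data here is simulated, not the trial's:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic node-level labels (1 = pathological ENE) and model scores:
# positives score higher on average, mimicking a useful classifier.
y = rng.integers(0, 2, 313)
scores = y * rng.normal(1.2, 1.0, 313) + (1 - y) * rng.normal(0.0, 1.0, 313)

print("AUC:", round(roc_auc_score(y, scores), 3))

fpr, tpr, thresholds = roc_curve(y, scores)
# Pick the operating point whose false positive rate is closest to 30%.
i = np.argmin(np.abs(fpr - 0.30))
print(f"threshold={thresholds[i]:.2f}  FPR={fpr[i]:.2f}  sensitivity={tpr[i]:.2f}")
```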
Affiliation(s)
- Benjamin H Kann
- Department of Radiation Oncology, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Jirapat Likitlersuang
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Dennis Bontempi
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Zezhong Ye
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA
- Sanjay Aneja
- Department of Therapeutic Radiology, New Haven, CT, USA
- Richard Bakst
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Amy F Juliano
- Mass Eye and Ear, Mass General Hospital, Boston, MA, USA
- Jeffrey P Guenette
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Ravindra Uppaluri
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Danielle N Margalit
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Jonathan D Schoenfeld
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Roy B Tishler
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Robert Haddad
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Hugo J W L Aerts
- Dana-Farber Cancer Institute/Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Mass General Brigham Artificial Intelligence in Medicine Program, Boston, MA, USA; Department of Radiology, Maastricht University, Maastricht, Netherlands
- Yael Flamand
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, ECOG-ACRIN Biostatistics Center, Boston, MA, USA
- Rathan M Subramaniam
- Department of Radiology and Nuclear Medicine, University of Notre Dame Australia, Sydney, NSW, Australia; Department of Radiology, Duke University, Durham, NC, USA
- Robert L Ferris
- Department of Otolaryngology, University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA
54
Liefgreen A, Weinstein N, Wachter S, Mittelstadt B. Beyond ideals: why the (medical) AI industry needs to motivate behavioural change in line with fairness and transparency values, and how it can do it. AI & Society 2023;39:2183-2199. [PMID: 39309255; PMCID: PMC11415467; DOI: 10.1007/s00146-023-01684-3]
Abstract
Artificial intelligence (AI) is increasingly relied upon by clinicians for making diagnostic and treatment decisions, playing an important role in imaging, diagnosis, risk analysis, lifestyle monitoring, and health information management. While research has identified biases in healthcare AI systems and proposed technical solutions to address these, we argue that effective solutions require human engagement. Furthermore, there is a lack of research on how to motivate the adoption of these solutions and promote investment in designing AI systems that align with values such as transparency and fairness from the outset. Drawing on insights from psychological theories, we assert the need to understand the values that underlie decisions made by individuals involved in creating and deploying AI systems. We describe how this understanding can be leveraged to increase engagement with de-biasing and fairness-enhancing practices within the AI healthcare industry, ultimately leading to sustained behavioral change via autonomy-supportive communication strategies rooted in motivational and social psychology theories. In developing these pathways to engagement, we consider the norms and needs that govern the AI healthcare domain, and we evaluate incentives for maintaining the status quo against economic, legal, and social incentives for behavior change in line with transparency and fairness values.
Affiliation(s)
- Alice Liefgreen
- Hillary Rodham Clinton School of Law, University of Swansea, Swansea SA2 8PP, UK
- School of Psychology and Clinical Language Sciences, University of Reading, Whiteknights Road, Reading RG6 6AL, UK
- Netta Weinstein
- School of Psychology and Clinical Language Sciences, University of Reading, Whiteknights Road, Reading RG6 6AL, UK
- Sandra Wachter
- Oxford Internet Institute, University of Oxford, 1 St. Giles, Oxford OX1 3JS, UK
- Brent Mittelstadt
- Oxford Internet Institute, University of Oxford, 1 St. Giles, Oxford OX1 3JS, UK
55
Zbrzezny AM, Grzybowski AE. Deceptive tricks in artificial intelligence: adversarial attacks in ophthalmology. J Clin Med 2023;12:3266. [PMID: 37176706; PMCID: PMC10179065; DOI: 10.3390/jcm12093266]
Abstract
The artificial intelligence (AI) systems used for diagnosing ophthalmic diseases have progressed significantly in recent years. The diagnosis of difficult eye conditions, such as cataracts, diabetic retinopathy, age-related macular degeneration, glaucoma, and retinopathy of prematurity, has become significantly less complicated as a result of AI algorithms, which are now on par with ophthalmologists in effectiveness. However, when building AI systems for medical applications such as identifying eye diseases, addressing the challenges of safety and trustworthiness is paramount, including the emerging threat of adversarial attacks. Research has increasingly focused on understanding and mitigating these attacks, with numerous articles discussing the topic in recent years. As a starting point for our discussion, we used the paper by Ma et al., "Understanding Adversarial Attacks on Deep Learning Based Medical Image Analysis Systems". A literature review was performed for this study, including a thorough search of open-access research papers using online sources (PubMed and Google). The research provides examples of unique attack strategies for medical images. Unfortunately, dedicated attack algorithms for the various ophthalmic image types have yet to be developed; this remains an open task. As a result, it is necessary to build algorithms that validate the computations of AI models and explain their findings. In this article, we focus on adversarial attacks, one of the most well-known attack methods, which provide evidence (i.e., adversarial examples) of the lack of resilience of decision models that do not include provable guarantees. Adversarial attacks can produce inaccurate findings in deep learning systems and can have catastrophic effects in the healthcare industry, such as misdiagnosis and healthcare financing fraud.
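The fast gradient sign method (FGSM) is the canonical adversarial attack in this literature: each input pixel is perturbed by a small step in the direction that increases the model's loss. A minimal PyTorch sketch on a toy classifier; the model and image are stand-ins, not an ophthalmic system:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 2))  # toy stand-in
image = torch.rand(1, 1, 32, 32, requires_grad=True)        # toy grayscale image
label = torch.tensor([0])

# FGSM: x_adv = x + eps * sign(dL/dx), clamped back to the valid pixel range.
loss = nn.functional.cross_entropy(model(image), label)
loss.backward()
eps = 0.01
adv_image = (image + eps * image.grad.sign()).clamp(0.0, 1.0).detach()

print("clean prediction:      ", model(image).argmax(1).item())
print("adversarial prediction:", model(adv_image).argmax(1).item())
```

The perturbation is small enough to be imperceptible, which is what makes such attacks a safety concern for deployed diagnostic models.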
Affiliation(s)
- Agnieszka M Zbrzezny
- Faculty of Mathematics and Computer Science, University of Warmia and Mazury, 10-710 Olsztyn, Poland
- Faculty of Design, SWPS University of Social Sciences and Humanities, Chodakowska 19/31, 03-815 Warsaw, Poland
- Andrzej E Grzybowski
- Institute for Research in Ophthalmology, Foundation for Ophthalmology Development, 60-836 Poznan, Poland
56
de Vries CF, Colosimo SJ, Staff RT, Dymiter JA, Yearsley J, Dinneen D, Boyle M, Harrison DJ, Anderson LA, Lip G. Impact of different mammography systems on artificial intelligence performance in breast cancer screening. Radiol Artif Intell 2023;5:e220146. [PMID: 37293340; PMCID: PMC10245180; DOI: 10.1148/ryai.220146]
Abstract
Artificial intelligence (AI) tools may assist breast screening mammography programs, but limited evidence supports their generalizability to new settings. This retrospective study used a 3-year dataset (April 1, 2016-March 31, 2019) from a U.K. regional screening program. The performance of a commercially available breast screening AI algorithm was assessed with a prespecified and a site-specific decision threshold to evaluate whether its performance was transferable to a new clinical site. The dataset consisted of women (aged approximately 50-70 years) who attended routine screening, excluding self-referrals, those with complex physical requirements, those who had undergone a previous mastectomy, and those whose screening involved technical recalls or lacked the four standard image views. In total, 55 916 screening attendees (mean age, 60 years ± 6 [SD]) met the inclusion criteria. The prespecified threshold resulted in a high recall rate (48.3%, 21 929 of 45 444), which reduced to 13.0% (5896 of 45 444) following threshold calibration, closer to the observed service level (5.0%, 2774 of 55 916). Recall rates also increased approximately threefold following a software upgrade on the mammography equipment, requiring per-software-version thresholds. Using software-specific thresholds, the AI algorithm would have recalled 277 of 303 (91.4%) screen-detected cancers and 47 of 138 (34.1%) interval cancers. AI performance and thresholds should be validated for new clinical settings before deployment, and quality assurance systems should monitor AI performance for consistency.
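Threshold calibration of the kind described reduces to picking the score cut-off that produces the desired flagging rate on local data. A minimal sketch with synthetic scores, assuming only that higher scores mean recall; the 13% target mirrors the figure above:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.beta(2, 5, 45_444)  # synthetic AI suspicion scores for one site

target_recall_rate = 0.13
# Threshold at the (1 - target) quantile so ~13% of attendees score above it.
threshold = np.quantile(scores, 1 - target_recall_rate)

flagged = (scores >= threshold).mean()
print(f"threshold={threshold:.3f}, flagged fraction={flagged:.3f}")
```

In practice a separate threshold would be stored per mammography system and software version, as the study's findings suggest.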
57
Pham N, Hill V, Rauschecker A, Lui Y, Niogi S, Filippi CG, Chang P, Zaharchuk G, Wintermark M. Critical appraisal of artificial intelligence-enabled imaging tools using the levels of evidence system. AJNR Am J Neuroradiol 2023;44:E21-E28. [PMID: 37080722; PMCID: PMC10171388; DOI: 10.3174/ajnr.A7850]
Abstract
Clinical adoption of an artificial intelligence-enabled imaging tool requires critical appraisal of its life cycle from development to implementation using a systematic, standardized, and objective approach that can verify both its technical and clinical efficacy. Toward this concerted effort, the ASFNR/ASNR Artificial Intelligence Workshop Technology Working Group is proposing a hierarchical evaluation system based on the quality, type, and amount of scientific evidence that the artificial intelligence-enabled tool can demonstrate for each component of its life cycle. The current proposal is modeled after the levels of evidence in medicine, with the uppermost level of the hierarchy showing the strongest evidence for potential impact on patient care and health care outcomes. The intended goal of establishing an evidence-based evaluation system is to encourage transparency, foster an understanding of how artificial intelligence tools are created and how they make decisions, and promote reporting of the relevant data on the efficacy of the artificial intelligence tools that are developed. The proposed system is an essential step toward a more formalized, clinically validated, and regulated framework for the safe and effective deployment of artificial intelligence imaging applications in clinical practice.
Affiliation(s)
- N Pham
- Department of Radiology (N.P., G.Z.), Stanford School of Medicine, Palo Alto, California
- V Hill
- Department of Radiology (V.H.), Northwestern University Feinberg School of Medicine, Chicago, Illinois
- A Rauschecker
- Department of Radiology (A.R.), University of California, San Francisco, San Francisco, California
- Y Lui
- Department of Radiology (Y.L.), NYU Grossman School of Medicine, New York, New York
- S Niogi
- Department of Radiology (S.N.), Weill Cornell Medicine, New York, New York
- C G Filippi
- Department of Radiology (C.G.F.), Tufts University School of Medicine, Boston, Massachusetts
- P Chang
- Department of Radiology (P.C.), University of California, Irvine, Irvine, California
- G Zaharchuk
- Department of Radiology (N.P., G.Z.), Stanford School of Medicine, Palo Alto, California
- M Wintermark
- Department of Neuroradiology (M.W.), The University of Texas MD Anderson Cancer Center, Houston, Texas
58
Steele L, Tan XL, Olabi B, Gao JM, Tanaka RJ, Williams HC. Determining the clinical applicability of machine learning models through assessment of reporting across skin phototypes and rarer skin cancer types: a systematic review. J Eur Acad Dermatol Venereol 2023;37:657-665. [PMID: 36514990; DOI: 10.1111/jdv.18814]
Abstract
Machine learning (ML) models for skin cancer recognition may have variable performance across different skin phototypes and skin cancer types. Overall performance metrics alone are insufficient to detect poor subgroup performance. We aimed (1) to assess whether studies of ML models reported results separately for different skin phototypes and rarer skin cancers, and (2) to graphically represent the skin cancer training datasets used by current ML models. In this systematic review, we searched PubMed, Embase and CENTRAL. We included all studies in medical journals assessing an ML technique for skin cancer diagnosis that used clinical or dermoscopic images from 1 January 2012 to 22 September 2021. No language restrictions were applied. We considered rarer skin cancers to be skin cancers other than pigmented melanoma, basal cell carcinoma and squamous cell carcinoma. We identified 114 studies for inclusion. Rarer skin cancers were included by 8/114 studies (7.0%), and results for a rarer skin cancer were reported separately in 1/114 studies (0.9%). Performance was reported across all skin phototypes in 1/114 studies (0.9%), but performance in skin phototypes I and VI was uncertain owing to minimal representation of these phototypes in the test dataset (9/3756 and 1/3756 images, respectively). For training datasets, although public datasets were used most frequently, the most widely used being the International Skin Imaging Collaboration (ISIC) archive (65/114 studies, 57.0%), the largest datasets were private. Our review identified that most ML models did not report performance separately for rarer skin cancers and different skin phototypes. A degree of variability in ML model performance across subgroups is expected, but the current lack of transparency is not justifiable and risks models being used inappropriately in populations in whom accuracy is low.
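The methodological point that overall metrics can mask poor subgroup performance is easy to make concrete: stratify the test set and report sensitivity per stratum alongside stratum size. A sketch with synthetic data; the column names and counts are assumptions echoing the imbalance described above:

```python
import pandas as pd

# Synthetic melanoma test cases with a skin-phototype column; the extreme
# imbalance (9 and 1 cases) mirrors the counts reported above.
df = pd.DataFrame({
    "phototype": ["I"] * 9 + ["III"] * 2000 + ["VI"] * 1,
    "label":     [1] * 2010,  # all true positives in this toy example
    "pred":      [0] * 5 + [1] * 4 + [1] * 1800 + [0] * 200 + [0] * 1,
})

# Per-stratum sensitivity (pred is 0/1, so its mean is the sensitivity),
# together with the stratum size that qualifies it.
by_group = df[df.label == 1].groupby("phototype")["pred"].agg(["mean", "size"])
print(by_group)  # trustworthy where n is large, meaningless where n is 9 or 1
```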
Affiliation(s)
- Lloyd Steele
- Department of Dermatology, The Royal London Hospital, London, UK
- Centre for Cell Biology and Cutaneous Research, Blizard Institute, Queen Mary University of London, London, UK
- Xiang Li Tan
- St George's University Hospitals NHS Foundation Trust, London, UK
- Bayanne Olabi
- Biosciences Institute, Newcastle University, Newcastle, UK
- Jing Mia Gao
- Department of Dermatology, The Royal London Hospital, London, UK
- Reiko J Tanaka
- Department of Bioengineering, Imperial College London, London, UK
- Hywel C Williams
- Centre of Evidence-Based Dermatology, School of Medicine, University of Nottingham, Nottingham, UK
59
Lundström C, Lindvall M. Mapping the landscape of care providers' quality assurance approaches for AI in diagnostic imaging. J Digit Imaging 2023;36:379-387. [PMID: 36352164; PMCID: PMC10039170; DOI: 10.1007/s10278-022-00731-7]
Abstract
The discussion on artificial intelligence (AI) solutions in diagnostic imaging has matured in recent years. The potential value of AI adoption is well established, as are the associated risks. Much focus has, rightfully, been on regulatory certification of AI products, with the strong incentive that it is an enabling step for commercial actors. It is, however, becoming evident that regulatory approval is not enough to ensure safe and effective AI usage in the local setting. In other words, care providers need to develop and implement quality assurance (QA) approaches for AI solutions in diagnostic imaging. The domain of AI-specific QA is still in an early development phase. We contribute to this development by describing the current landscape of QA-for-AI approaches in medical imaging, with a focus on radiology and pathology. We map the potential quality threats and review the existing QA approaches in relation to those threats. We propose a practical categorization of QA approaches based on key characteristics corresponding to means, situation, and purpose. The review highlights the heterogeneity of methods and practices relevant for this domain and points to targets for future research efforts.
Affiliation(s)
- Claes Lundström
- Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden
- Sectra AB, Linköping, Sweden
60
Redrup Hill E, Mitchell C, Brigden T, Hall A. Ethical and legal considerations influencing human involvement in the implementation of artificial intelligence in a clinical pathway: a multi-stakeholder perspective. Front Digit Health 2023;5:1139210. [PMID: 36999168; PMCID: PMC10043985; DOI: 10.3389/fdgth.2023.1139210]
Abstract
INTRODUCTION Ethical and legal factors will have an important bearing on when and whether automation is appropriate in healthcare. There is a developing literature on the ethics of artificial intelligence (AI) in health, including specific legal or regulatory questions such as whether there is a right to an explanation of AI decision-making. However, there has been limited consideration of the specific ethical and legal factors that influence when, and in what form, human involvement may be required in the implementation of AI in a clinical pathway, and of the views of the wide range of stakeholders involved. To address this question, we chose the exemplar of the pathway for the early detection of Barrett's Oesophagus (BE) and oesophageal adenocarcinoma, where Gehrung and colleagues have developed a "semi-automated", deep-learning system to analyse samples from the Cytosponge™-TFF3 test (a minimally invasive alternative to endoscopy), and where AI promises to mitigate increasing demands on pathologists' time and input. METHODS We gathered a multidisciplinary group of stakeholders, including developers, patients, healthcare professionals and regulators, to obtain their perspectives on the ethical and legal issues that may arise using this exemplar. RESULTS The findings are grouped under six general themes: risk and potential harms; impacts on human experts; equity and bias; transparency and oversight; patient information and choice; and accountability, moral responsibility and liability for error. Within these themes, a range of subtle and context-specific elements emerged, highlighting the importance of pre-implementation, interdisciplinary discussions and appreciation of pathway-specific considerations. DISCUSSION To evaluate these findings, we draw on the well-established principles of biomedical ethics identified by Beauchamp and Childress as a lens through which to view these results and their implications for personalised medicine. Our findings are not only relevant to this context but have implications for AI in digital pathology and healthcare more broadly.
61
Glocker B, Jones C, Bernhardt M, Winzeck S. Algorithmic encoding of protected characteristics in chest X-ray disease detection models. EBioMedicine 2023;89:104467. [PMID: 36791660; PMCID: PMC10025760; DOI: 10.1016/j.ebiom.2023.104467]
Abstract
BACKGROUND It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring methodology for subgroup analysis in image-based disease detection models. METHODS We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups. FINDINGS We confirm subgroup disparities in terms of shifted true and false positive rates, which are partially removed after correcting for population and prevalence shifts in the test sets. We find that transfer learning alone is insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable insights about the way protected characteristics are encoded in the feature representations of deep neural networks. INTERPRETATION Subgroup analysis is key for identifying performance disparities of AI models, but statistical differences across subgroups need to be taken into account when analyzing potential biases in disease detection. The proposed methodology provides a comprehensive framework for subgroup analysis enabling further research into the underlying causes of disparities. FUNDING European Research Council Horizon 2020, UK Research and Innovation.
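Of the techniques listed, test-set resampling is the most self-contained: draw a resampled test set in which every subgroup has the same size and disease prevalence, so that true and false positive rates can be compared without population and prevalence shifts. A minimal sketch on synthetic data; the column names and counts are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], n, p=[0.7, 0.3]),
    "label": rng.integers(0, 2, n),
    "pred":  rng.integers(0, 2, n),
})

def resample(group_df, n_pos=500, n_neg=500):
    """Fix subgroup size and prevalence by sampling positives and negatives."""
    pos = group_df[group_df.label == 1].sample(n_pos, replace=True, random_state=0)
    neg = group_df[group_df.label == 0].sample(n_neg, replace=True, random_state=0)
    return pd.concat([pos, neg])

for g, gdf in df.groupby("group"):
    bal = resample(gdf)
    tpr = (bal[bal.label == 1].pred == 1).mean()
    fpr = (bal[bal.label == 0].pred == 1).mean()
    print(f"group {g}: TPR={tpr:.2f} FPR={fpr:.2f}")
```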
Affiliation(s)
- Ben Glocker
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Charles Jones
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Mélanie Bernhardt
- Department of Computing, Imperial College London, London SW7 2AZ, UK
- Stefan Winzeck
- Department of Computing, Imperial College London, London SW7 2AZ, UK
62
Taribagil P, Hogg HDJ, Balaskas K, Keane PA. Integrating artificial intelligence into an ophthalmologist's workflow: obstacles and opportunities. Expert Rev Ophthalmol 2023. [DOI: 10.1080/17469899.2023.2175672]
Affiliation(s)
- Priyal Taribagil
- Medical Retina Department, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- HD Jeffry Hogg
- Medical Retina Department, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Department of Population Health Science, Population Health Science Institute, Newcastle University, Newcastle upon Tyne, UK
- Department of Ophthalmology, Newcastle upon Tyne Hospitals NHS Foundation Trust, Freeman Road, Newcastle upon Tyne, UK
- Konstantinos Balaskas
- NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Medical Retina, University College London Institute of Ophthalmology, London, UK
- Pearse A Keane
- NIHR Biomedical Research Centre, Moorfields Eye Hospital NHS Foundation Trust, London, UK
- Medical Retina, University College London Institute of Ophthalmology, London, UK
63
Beyond the AJR: Validation and Algorithmic Audit of a Deep Learning System to Detect Hip Fractures Radiographically. AJR Am J Roentgenol 2023;220:150. [PMID: 35674349; DOI: 10.2214/AJR.22.28053]
64
Sapey E, Gallier S, Evison F, McNulty D, Reeves K, Ball S. Variability and performance of NHS England's 'reason to reside' criteria in predicting hospital discharge in acute hospitals in England: a retrospective, observational cohort study. BMJ Open 2022;12:e065862. [PMID: 36572492; PMCID: PMC9805825; DOI: 10.1136/bmjopen-2022-065862]
Abstract
OBJECTIVES NHS England (NHSE) advocates 'reason to reside' (R2R) criteria to support discharge planning. The proportion of patients without R2R and their rate of discharge are reported daily by acute hospitals in England. R2R has no interoperable standardised data model (SDM), and its performance has not been validated. We aimed to understand the degree of intercentre and intracentre variation in R2R-related metrics reported to NHSE, define an SDM implemented within a single-centre Electronic Health Record to generate an electronic R2R (eR2R) and evaluate its performance in predicting subsequent discharge. DESIGN Retrospective observational cohort study using routinely collected health data. SETTING 122 NHS Trusts in England for national reporting and an acute hospital in England for local reporting. PARTICIPANTS 6 602 706 patient-days were analysed using 3 months of national data, and 1 039 592 patient-days using 3 years of single-centre data. MAIN OUTCOME MEASURES Variability in R2R-related metrics reported to NHSE. Performance of eR2R in predicting discharge within 24 hours. RESULTS There were high levels of intracentre and intercentre variability in R2R-related metrics (p<0.0001) but not in eR2R. Informedness of eR2R for discharge within 24 hours was low (J-statistic 0.09-0.12 across three consecutive years). Of those remaining in hospital without eR2R, 61.2% met eR2R criteria on subsequent days (76% within 24 hours), most commonly due to intravenous therapy administration (32.8%) or an increased NEWS2 score (21.9%). CONCLUSIONS Reported R2R metrics are highly variable between and within acute Trusts in England. Although case-mix or community care provision may account for some variability, the absence of an SDM prevents standardised reporting. Following the development of an SDM in one acute Trust, the variability reduced. However, the performance of eR2R was poor: prone to change even when negative, and unable to contribute meaningfully to discharge planning.
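Informedness (Youden's J) is sensitivity + specificity − 1, so the reported 0.09-0.12 means eR2R barely outperformed chance. A worked sketch with hypothetical counts chosen to land near that range; the paper's actual counts differ:

```python
# Hypothetical daily eR2R assessments vs discharge within 24 h (illustrative).
tp, fn = 7_000, 13_000   # discharged: correctly flagged "no reason to reside" / missed
tn, fp = 60_000, 20_000  # not discharged: correctly flagged "reason to reside" / not

sensitivity = tp / (tp + fn)   # 0.35
specificity = tn / (tn + fp)   # 0.75
j = sensitivity + specificity - 1
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} J={j:.2f}")
# J = 0.10: only marginally better than the J = 0 of random guessing.
```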
Affiliation(s)
- Elizabeth Sapey
- PIONEER Data Hub, University of Birmingham, Birmingham, UK
- Department of Acute Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Suzy Gallier
- PIONEER Data Hub, University of Birmingham, Birmingham, UK
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Felicity Evison
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- David McNulty
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Katherine Reeves
- Department of Research Informatics, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Simon Ball
- Renal Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, West Midlands, UK
- Better Care Programme and Midlands Site, HDR UK, Birmingham, West Midlands, UK
65
Müller L, Kloeckner R, Mildenberger P, Pinto Dos Santos D. [Validation and implementation of artificial intelligence in radiology: quo vadis in 2022?]. Radiologie (Heidelb) 2022;63:381-386. [PMID: 36510007; DOI: 10.1007/s00117-022-01097-1]
Abstract
BACKGROUND The hype around artificial intelligence (AI) in radiology continues and the number of approved AI tools is growing steadily. Despite this great potential, integration into clinical routine in radiology remains limited. Moreover, the large number of individual applications poses a challenge for clinical routine, as separate applications must be selected for different questions and organ systems, which increases complexity and the time required. OBJECTIVES This review discusses the current status of validation and implementation of AI tools in clinical routine and identifies possible approaches for better assessing the generalizability of AI tool results. MATERIALS AND METHODS A search of literature and product databases, as well as of publications, position papers, and reports from various stakeholders, was conducted for this review. RESULTS Scientific evidence and independent validation studies are available for only a few commercial AI tools, and the generalizability of the results often remains questionable. CONCLUSIONS One challenge is the multitude of offerings for individual, specific application areas from a large number of manufacturers, which complicates integration into the existing site-specific IT infrastructure. Furthermore, reimbursement by health insurance companies in Germany for the use of AI tools in clinical routine is lacking; for reimbursement to be granted, the clinical utility of new applications must first be proven, and such proof is lacking for most applications.
Affiliation(s)
- Lukas Müller
- Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, Universitätsmedizin Mainz, Langenbeckstr. 1, 55131 Mainz, Germany
- Roman Kloeckner
- Institut für Interventionelle Radiologie, Universitätsklinikum Schleswig-Holstein - Campus Lübeck, Lübeck, Germany
- Peter Mildenberger
- Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, Universitätsmedizin Mainz, Langenbeckstr. 1, 55131 Mainz, Germany
- Daniel Pinto Dos Santos
- Institut für Diagnostische und Interventionelle Radiologie, Uniklinik Köln, Köln, Germany
- Institut für Diagnostische und Interventionelle Radiologie, Universitätsklinikum Frankfurt, Frankfurt am Main, Germany
66
van de Sande D, van Genderen ME, Braaf H, Gommers D, van Bommel J. Moving towards clinical use of artificial intelligence in intensive care medicine: business as usual? Intensive Care Med 2022;48:1815-1817. [PMID: 36269330; DOI: 10.1007/s00134-022-06910-y]
Affiliation(s)
- Davy van de Sande
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Michel E van Genderen
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Heleen Braaf
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Diederik Gommers
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Jasper van Bommel
- Department of Adult Intensive Care, Erasmus University Medical Center, Room Ne-403, Doctor Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
67
Developing robust benchmarks for driving forward AI innovation in healthcare. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00559-4]
68
Monteith S, Glenn T, Geddes J, Whybrow PC, Achtyes E, Bauer M. Expectations for artificial intelligence (AI) in psychiatry. Curr Psychiatry Rep 2022;24:709-721. [PMID: 36214931; PMCID: PMC9549456; DOI: 10.1007/s11920-022-01378-5]
Abstract
PURPOSE OF REVIEW Artificial intelligence (AI) is often presented as a transformative technology for clinical medicine even though the current technology maturity of AI is low. The purpose of this narrative review is to describe the complex reasons for the low technology maturity and set realistic expectations for the safe, routine use of AI in clinical medicine. RECENT FINDINGS For AI to be productive in clinical medicine, many diverse factors that contribute to the low maturity level need to be addressed. These include technical problems such as data quality, dataset shift, black-box opacity, validation and regulatory challenges, and human factors such as a lack of education in AI, workflow changes, automation bias, and deskilling. There will also be new and unanticipated safety risks with the introduction of AI. The solutions to these issues are complex and will take time to discover, develop, validate, and implement. However, addressing the many problems in a methodical manner will expedite the safe and beneficial use of AI to augment medical decision making in psychiatry.
Affiliation(s)
- Scott Monteith
- Michigan State University College of Human Medicine, Traverse City Campus, Traverse City, MI 49684, USA
- Tasha Glenn
- ChronoRecord Association, Fullerton, CA, USA
- John Geddes
- Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford, UK
- Peter C Whybrow
- Department of Psychiatry and Biobehavioral Sciences, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles (UCLA), Los Angeles, CA, USA
- Eric Achtyes
- Michigan State University College of Human Medicine, Grand Rapids, MI, USA
- Network180, Grand Rapids, MI, USA
- Michael Bauer
- Department of Psychiatry and Psychotherapy, University Hospital Carl Gustav Carus Medical Faculty, Technische Universität Dresden, Dresden, Germany
69
Mascagni P, Alapatt D, Sestini L, Altieri MS, Madani A, Watanabe Y, Alseidi A, Redan JA, Alfieri S, Costamagna G, Boškoski I, Padoy N, Hashimoto DA. Computer vision in surgery: from potential to clinical value. NPJ Digit Med 2022;5:163. [PMID: 36307544; PMCID: PMC9616906; DOI: 10.1038/s41746-022-00707-5]
Abstract
Hundreds of millions of operations are performed worldwide each year, and the rising uptake of minimally invasive surgery has enabled fiber-optic cameras and robots to become both important tools for conducting surgery and sensors from which to capture information about surgery. Computer vision (CV), the application of algorithms to analyze and interpret visual data, has become a critical technology through which to study the intraoperative phase of care, with the goals of augmenting surgeons' decision-making processes, supporting safer surgery, and expanding access to surgical care. While much work has been performed on potential use cases, there are currently no CV tools widely used for diagnostic or therapeutic applications in surgery. Using laparoscopic cholecystectomy as an example, we review current CV techniques that have been applied to minimally invasive surgery and their clinical applications. Finally, we discuss the challenges and obstacles that remain to be overcome for broader implementation and adoption of CV in surgery.
Affiliation(s)
- Pietro Mascagni
- Gemelli Hospital, Catholic University of the Sacred Heart, Rome, Italy
- IHU-Strasbourg, Institute of Image-Guided Surgery, Strasbourg, France
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Deepak Alapatt
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
- Luca Sestini
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
- Maria S Altieri
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Amin Madani
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University Health Network, Toronto, ON, Canada
- Yusuke Watanabe
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Hokkaido, Hokkaido, Japan
- Adnan Alseidi
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of California San Francisco, San Francisco, CA, USA
- Jay A Redan
- Department of Surgery, AdventHealth-Celebration Health, Celebration, FL, USA
- Sergio Alfieri
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Guido Costamagna
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Ivo Boškoski
- Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- Nicolas Padoy
- IHU-Strasbourg, Institute of Image-Guided Surgery, Strasbourg, France
- ICube, University of Strasbourg, CNRS, IHU, Strasbourg, France
- Daniel A Hashimoto
- Global Surgical Artificial Intelligence Collaborative, Toronto, ON, Canada
- Department of Surgery, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
70
Garrucho L, Kushibar K, Jouide S, Diaz O, Igual L, Lekadir K. Domain generalization in deep learning based mass detection in mammography: a large-scale multi-center study. Artif Intell Med 2022;132:102386. [PMID: 36207090; DOI: 10.1016/j.artmed.2022.102386]
Abstract
Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization in artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this study, we explored the domain generalization of deep learning methods for mass detection in digital mammography and analyzed in depth the sources of domain shift in a large-scale multi-center setting. To this end, we compared the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline was designed to improve domain generalization without requiring images from the new domain. The results show that our workflow generalized better than state-of-the-art transfer learning-based approaches in four out of five domains, while reducing the domain shift caused by different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis was performed to identify the covariate shifts with the greatest effects on detection performance, such as those due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning-based breast cancer detection.
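The single-source, multi-domain evaluation design can be sketched generically: train on one domain and score on all others, so acquisition shift shows up as off-diagonal performance drops. A toy sketch with simulated "scanners"; the data and the logistic model are stand-ins for the detection networks studied:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_domain(shift, n=400):
    """Toy stand-in for one scanner: same task, shifted feature distribution."""
    x = rng.normal(shift, 1.0, (n, 8))
    y = (x[:, 0] - shift + rng.normal(0, 0.5, n) > 0).astype(int)
    return x, y

domains = {name: make_domain(s) for name, s in [("A", 0.0), ("B", 1.0), ("C", 2.0)]}

# Single-source training, multi-domain testing: accuracy drops off-diagonal
# reveal the domain shift.
for src, (xs, ys) in domains.items():
    clf = LogisticRegression().fit(xs, ys)
    row = {tgt: round(clf.score(xt, yt), 2) for tgt, (xt, yt) in domains.items()}
    print(f"trained on {src}: {row}")
```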
Affiliation(s)
- Lidia Garrucho
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Kaisar Kushibar
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Socayna Jouide
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Oliver Diaz
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Laura Igual
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
- Karim Lekadir
- Artificial Intelligence in Medicine Lab (BCN-AIM), Faculty of Mathematics and Computer Science, University of Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
71
Fehr J, Jaramillo-Gutierrez G, Oala L, Gröschel MI, Bierwirth M, Balachandran P, Werneck-Leite A, Lippert C. Piloting a survey-based assessment of transparency and trustworthiness with three medical AI tools. Healthcare (Basel) 2022;10:1923. [PMID: 36292369; PMCID: PMC9601535; DOI: 10.3390/healthcare10101923]
Abstract
Artificial intelligence (AI) offers the potential to support healthcare delivery, but poorly trained or validated algorithms bear risks of harm. Ethical guidelines state transparency about model development and validation as a requirement for trustworthy AI. Abundant guidance exists on providing transparency through reporting, yet poorly reported medical AI tools are common. To close this transparency gap, we developed and piloted a framework to quantify the transparency of medical AI tools with three use cases. Our framework comprises a survey to report on the intended use, training and validation data and processes, ethical considerations, and deployment recommendations. The transparency of each response was scored 0, 0.5, or 1 to reflect whether the requested information was not, partially, or fully provided. Additionally, we assessed on an analogous three-point scale whether the provided responses fulfilled the transparency requirement for a set of trustworthiness criteria from ethical guidelines. The degree of transparency and trustworthiness was calculated on a scale from 0% to 100%. Our assessment of three medical AI use cases pinpointed reporting gaps and resulted in transparency scores of 67% for two use cases and 59% for the third. We report anecdotal evidence that business constraints and limited information from external datasets were major obstacles to providing transparency for the three use cases. The observed transparency gaps also lowered the degree of trustworthiness, indicating compliance gaps with ethical guidelines. All three pilot use cases faced challenges in providing transparency about medical AI tools, and more studies are needed to investigate these challenges in the wider medical AI sector. Applying this framework for an external assessment of transparency may be infeasible if business constraints prevent the disclosure of information. New strategies may be necessary to enable audits of medical AI tools while preserving business secrets.
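The scoring scheme reduces to averaging per-item scores and expressing the mean as a percentage. A minimal sketch; the item names are assumptions, not the framework's actual survey items:

```python
# Illustrative item scores (0 = not, 0.5 = partially, 1 = fully reported).
responses = {
    "intended_use": 1.0,
    "training_data": 0.5,
    "validation_process": 0.5,
    "ethical_considerations": 1.0,
    "deployment_recommendations": 0.0,
}
transparency = 100 * sum(responses.values()) / len(responses)
print(f"transparency score: {transparency:.0f}%")  # -> 60%
```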
Affiliation(s)
- Jana Fehr
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Digital Health & Machine Learning, Hasso Plattner Institute, 14482 Potsdam, Germany
- Luis Oala
- Department of Artificial Intelligence, Fraunhofer HHI, 10587 Berlin, Germany
- Matthias I. Gröschel
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Manuel Bierwirth
- ITU/WHO Focus Group AI4H, 1211 Geneva, Switzerland
- Alumnus, Goethe University Frankfurt, 60323 Frankfurt am Main, Germany
- Pradeep Balachandran
- ITU/WHO Focus Group AI4H, 1211 Geneva, Switzerland
- Technical Consultant (Digital Health), Thiruvananthapuram 695010, India
- Christoph Lippert
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Germany
- Digital Health & Machine Learning, Hasso Plattner Institute, 14482 Potsdam, Germany
72
Denniston AK, Kale AU, Lee WH, Mollan SP, Keane PA. Building trust in real-world data: lessons from INSIGHT, the UK's health data research hub for eye health and oculomics. Curr Opin Ophthalmol 2022;33:399-406. [PMID: 35916569; DOI: 10.1097/ICU.0000000000000887]
Abstract
PURPOSE OF REVIEW In this review, we consider the challenges of creating a trusted resource for real-world data in ophthalmology, based on our experience of establishing INSIGHT, the UK's Health Data Research Hub for Eye Health and Oculomics. RECENT FINDINGS The INSIGHT Health Data Research Hub maximizes the benefits and impact of historical, patient-level UK National Health Service (NHS) electronic health record data, including images, by making the data research-ready through curation and anonymisation. It is built around a shared 'north star' of enabling research for patient benefit. INSIGHT has worked to establish patient and public trust in the concept and delivery of INSIGHT, with efficient and robust governance processes that support safe and secure access to data for researchers. By linking to systemic data, there is an opportunity for discovery of novel ophthalmic biomarkers of systemic diseases ('oculomics'). Datasets that represent the whole population are an important tool to address the increasingly recognized threat of health data poverty. SUMMARY Enabling efficient, safe access to routinely collected clinical data is a substantial undertaking, especially when this includes imaging modalities, but it provides an exceptional resource for research. Research and innovation built on inclusive real-world data is an important tool in ensuring that the discoveries and technologies of the future do not favour selected groups alone but work for all patients.
Affiliation(s)
- Alastair K Denniston: INSIGHT Health Data Research Hub for Eye Health; Academic Unit of Ophthalmology, Institute of Inflammation & Ageing, College of Medical and Dental Sciences, University of Birmingham; Ophthalmology Department, University Hospitals Birmingham NHS Foundation Trust, Birmingham
- Aditya U Kale: INSIGHT Health Data Research Hub for Eye Health; Academic Unit of Ophthalmology, Institute of Inflammation & Ageing, College of Medical and Dental Sciences, University of Birmingham
- Wen Hwa Lee: INSIGHT Health Data Research Hub for Eye Health; Action Against Age-Related Macular Degeneration, London
- Susan P Mollan: INSIGHT Health Data Research Hub for Eye Health; Ophthalmology Department, University Hospitals Birmingham NHS Foundation Trust, Birmingham; Institute of Metabolism and Systems Research, College of Medical and Dental Sciences, University of Birmingham
- Pearse A Keane: INSIGHT Health Data Research Hub for Eye Health; NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust, UCL Institute of Ophthalmology, London, UK
73
Albert K, Delano M. Sex trouble: Sex/gender slippage, sex confusion, and sex obsession in machine learning using electronic health records. Patterns (N Y) 2022; 3:100534. [PMID: 36033589 PMCID: PMC9403398 DOI: 10.1016/j.patter.2022.100534]
Abstract
False assumptions that sex and gender are binary, static, and concordant are deeply embedded in the medical system. As machine learning researchers use medical data to build tools to solve novel problems, understanding how existing systems represent sex/gender incorrectly is necessary to avoid perpetuating harm. In this perspective, we identify and discuss three factors to consider when working with sex/gender in research: "sex/gender slippage," the frequent substitution of sex and sex-related terms for gender and vice versa; "sex confusion," the fact that any given sex variable holds many different potential meanings; and "sex obsession," the idea that the relevant variable for most inquiries related to sex/gender is sex assigned at birth. We then explore how these phenomena show up in medical machine learning research using electronic health records, with a specific focus on HIV risk prediction. Finally, we offer recommendations about how machine learning researchers can engage more carefully with questions of sex/gender.
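One recommendation along the lines the authors discuss, recording each sex/gender construct as its own explicit field rather than a single ambiguous "sex" column, can be made concrete in code. The sketch below is a hypothetical illustration; the field names and the screening helper are invented for this example, not drawn from the paper:

```python
# Hypothetical schema (not from the paper) that avoids "sex confusion" by
# separating the constructs a single "sex" column typically conflates.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PatientDemographics:
    administrative_gender: Optional[str]   # what the EHR registration recorded
    sex_assigned_at_birth: Optional[str]   # only if documented, never inferred
    gender_identity: Optional[str]         # self-reported, if collected
    has_cervix: Optional[bool] = None      # anatomy is often the relevant variable

def eligible_for_cervical_screening(p: PatientDemographics) -> Optional[bool]:
    """A model that needs anatomy should ask for anatomy, not for 'sex'.
    Returns None when the relevant field is undocumented."""
    return p.has_cervix

p = PatientDemographics(administrative_gender="F",
                        sex_assigned_at_birth=None,
                        gender_identity="non-binary",
                        has_cervix=True)
print(eligible_for_cervical_screening(p))  # True
```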
Affiliation(s)
- Kendra Albert: Cyberlaw Clinic, Harvard Law School, Cambridge, MA 02138, USA
- Maggie Delano: Engineering Department, Swarthmore College, Swarthmore, PA 19146, USA
74
Arora A, Arora A. Generative adversarial networks and synthetic patient data: current challenges and future perspectives. Future Healthc J 2022; 9:190-193. [DOI: 10.7861/fhj.2022-0013]
75
Oakden-Rayner L, Gale W, Bonham TA, Lungren MP, Carneiro G, Bradley AP, Palmer LJ. Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study. Lancet Digit Health 2022; 4:e351-e358. [PMID: 35396184 DOI: 10.1016/s2589-7500(22)00004-8]
Abstract
BACKGROUND Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can underestimate the risks of artificial intelligence-based diagnostic systems. METHODS We present a preclinical evaluation of a deep learning model intended to detect proximal femoral fractures in frontal x-ray films in emergency department patients, trained on films from the Royal Adelaide Hospital (Adelaide, SA, Australia). This evaluation included a reader study comparing the performance of the model against five radiologists (three musculoskeletal specialists and two general radiologists) on a dataset of 200 fracture cases and 200 non-fractures (also from the Royal Adelaide Hospital), an external validation study using a dataset obtained from Stanford University Medical Center, CA, USA, and an algorithmic audit to detect any unusual or unexpected model behaviour. FINDINGS In the reader study, the area under the receiver operating characteristic curve (AUC) for the performance of the deep learning model was 0·994 (95% CI 0·988-0·999) compared with an AUC of 0·969 (0·960-0·978) for the five radiologists. This strong model performance was maintained on external validation, with an AUC of 0·980 (0·931-1·000). However, the preclinical evaluation identified barriers to safe deployment, including a substantial shift in the model operating point on external validation and an increased error rate on cases with abnormal bones (eg, Paget's disease). INTERPRETATION The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behaviour even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions. FUNDING None.
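The operating-point shift that the audit flagged can be reproduced in miniature. The sketch below uses synthetic scores rather than the study's data (the `fake_scores` helper and all numbers are invented) to show how a threshold chosen on internal data can lose sensitivity on an external set even while AUC stays high:

```python
# Sketch (not the authors' audit code) of one external-validation check:
# does the threshold chosen on internal data keep its intended sensitivity
# and specificity on an external dataset?
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

def fake_scores(n, separation):
    """Stand-in for real model outputs: labels plus noisy scores."""
    y = rng.integers(0, 2, n)
    return y, y * separation + rng.normal(0, 0.5, n)

y_int, s_int = fake_scores(400, 2.0)  # internal (development) set
y_ext, s_ext = fake_scores(400, 1.2)  # external set with distribution shift

# Choose the operating point on internal data: first point with ~95% sensitivity.
fpr, tpr, thresholds = roc_curve(y_int, s_int)
threshold = thresholds[np.argmax(tpr >= 0.95)]

def sens_spec(y, s, t):
    pred = s >= t
    sens = (pred & (y == 1)).sum() / (y == 1).sum()
    spec = (~pred & (y == 0)).sum() / (y == 0).sum()
    return round(sens, 3), round(spec, 3)

print("internal AUC:", round(roc_auc_score(y_int, s_int), 3))
print("external AUC:", round(roc_auc_score(y_ext, s_ext), 3))
print("internal sens/spec at fixed threshold:", sens_spec(y_int, s_int, threshold))
print("external sens/spec at fixed threshold:", sens_spec(y_ext, s_ext, threshold))
# A still-high external AUC can coexist with a large drop in sensitivity at
# the fixed threshold: that is the operating-point shift the audit detected.
```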
Affiliation(s)
- Lauren Oakden-Rayner: School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia
- William Gale: Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia; School of Computer Science, University of Adelaide, Adelaide, SA, Australia
- Thomas A Bonham: Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA
- Matthew P Lungren: Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA; Stanford Artificial Intelligence in Medicine and Imaging Center, Stanford University, Stanford, CA, USA
- Gustavo Carneiro: Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia
- Andrew P Bradley: Science and Engineering Faculty, Queensland University of Technology, Brisbane, QLD, Australia
- Lyle J Palmer: School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia