Oeding JF, Krych AJ, Pearle AD, Kelly BT, Kunze KN. Medical Imaging Applications Developed Using Artificial Intelligence Demonstrate High Internal Validity Yet Are Limited in Scope and Lack External Validation.
Arthroscopy 2024:S0749-8063(24)00099-9. [PMID: 38325497 DOI: 10.1016/j.arthro.2024.01.043]
[Received: 01/17/2024] [Revised: 01/21/2024] [Accepted: 01/29/2024] [Indexed: 02/09/2024]
Abstract
PURPOSE
To (1) review definitions and concepts necessary to interpret applications of deep learning (DL; a domain of artificial intelligence that leverages neural networks to make predictions on media inputs such as images) and (2) identify knowledge and translational gaps in the literature to provide insight into specific areas for improvement as adoption of this technology continues.
METHODS
A comprehensive search of the literature was performed in December 2023 for articles regarding the use of DL in sports medicine. For each study, the following information was recorded: the joint of focus, the specific anatomic structure or pathology to which DL was applied, the imaging modality utilized, the source of the images used for model training and testing, the data set size, model performance, and whether the DL model was externally validated. A numerical scale was used to rate each DL model's clinical impact, with 1 corresponding to proof-of-concept studies with little to no direct clinical impact and 5 corresponding to practice-changing clinical impact and readiness for clinical deployment.
RESULTS
Fifty-five studies were identified, all published within the past 5 years and 82% within the past 3 years. Of the DL models identified, 84% were developed for classification tasks, 9% for automated measurements, and 7% for segmentation. A total of 62% of studies utilized magnetic resonance imaging as the imaging modality, 25% radiographs, and 7% ultrasound, while 1 study each used computed tomography, arthroscopic images, or arthroscopic video. Sixty-five percent of studies focused on the detection of tears (anterior cruciate ligament [ACL], rotator cuff [RC], and meniscus). Diagnostic performance, as measured by the area under the receiver operating characteristic curve (AUROC), ranged from 0.81 to 0.99 for ACL tears (excellent to near perfect), from 0.83 to 0.94 for RC tears (excellent), and from 0.75 to 0.96 for meniscus tears (acceptable to excellent). In addition, the 3 studies that focused on detection of cartilage lesions reported AUROCs ranging from 0.90 to 0.92 (excellent performance). However, only 4 (7%) studies externally validated their models, suggesting that the remaining models may not be generalizable or may not perform well when applied to populations other than the one used to develop the model. Finally, the mean clinical impact score was 2 (range, 1-3) on a scale of 1 to 5, corresponding to limited clinical applicability.
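To make the AUROC figures and the internal versus external validation distinction concrete, the following is a minimal illustrative sketch and is not drawn from any of the reviewed studies. It computes AUROC as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The function name `auroc` and all labels and scores are hypothetical toy values chosen only to show how a model can rank cases perfectly on data resembling its training set yet rank them less reliably on data from another source.

```python
def auroc(labels, scores):
    """Return the AUROC for binary labels (1 = tear, 0 = no tear).

    Uses the pairwise (Mann-Whitney) formulation: the fraction of
    positive/negative pairs in which the positive case has the higher
    score, with tied pairs counted as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: model scores on an internal test set
# (same institution as the training images).
internal_labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# Hypothetical toy data: the same model scored on an external set
# (a different institution or scanner).
external_labels = [1, 1, 1, 0, 0, 0]
external_scores = [0.9, 0.4, 0.6, 0.5, 0.2, 0.7]

internal_auc = auroc(internal_labels, internal_scores)
external_auc = auroc(external_labels, external_scores)
print(f"internal AUROC: {internal_auc:.2f}")
print(f"external AUROC: {external_auc:.2f}")
```

In this toy case every positive outranks every negative on the internal set, while only 6 of the 9 positive/negative pairs are ordered correctly on the external set. A gap of this kind is what external validation is intended to expose.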
CONCLUSIONS
DL models in orthopaedic sports medicine show generally excellent performance (high internal validity) but require external validation to facilitate clinical deployment. In addition, current models have low clinical applicability and fail to advance the field due to a focus on routine tasks and a narrow conceptual framework.
LEVEL OF EVIDENCE
Level IV, scoping review of Level I to IV studies.