1
Fervers P, Hahnfeldt R, Kottlors J, Wagner A, Maintz D, Pinto dos Santos D, Lennartz S, Persigehl T. ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language. Frontiers in Radiology 2024; 4:1390774. [PMID: 39036542 PMCID: PMC11257913 DOI: 10.3389/fradi.2024.1390774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 06/13/2024] [Indexed: 07/23/2024]
Abstract
Background: To investigate the feasibility of using the large language model (LLM) ChatGPT to classify liver lesions according to the Liver Imaging Reporting and Data System (LI-RADS) based on MRI reports, and to compare classification performance on structured vs. unstructured reports.
Methods: LI-RADS-classifiable liver lesions were included from structured and unstructured MRI reports written in German; a report of size, location, and arterial phase contrast enhancement was the minimum inclusion requirement. The findings sections of the reports were passed to ChatGPT (GPT-3.5), which was instructed to determine a LI-RADS score for each classifiable liver lesion. Ground truth was established by two radiologists in consensus. Agreement between ground truth and ChatGPT was assessed with Cohen's kappa. Test-retest reliability was assessed with the intraclass correlation coefficient (ICC) by passing a subset of n = 50 lesions to ChatGPT five times.
Results: 205 MRIs from 150 patients were included. The accuracy of ChatGPT at determining LI-RADS categories was poor (53% and 44% on unstructured and structured reports, respectively). On free-text compared with structured reports, agreement with the ground truth was higher (κ = 0.51 vs. κ = 0.44), the mean absolute error in LI-RADS scores was lower (0.5 ± 0.5 vs. 0.6 ± 0.7, p < 0.05), and test-retest reliability was higher (ICC = 0.81 vs. 0.50), although structured reports contained the minimum required imaging features significantly more frequently (chi-square test, p < 0.05).
Conclusions: ChatGPT attained only low accuracy when asked to determine LI-RADS scores from liver imaging reports. The superior accuracy and consistency on free-text reports might relate to ChatGPT's training process.
Clinical relevance statement: Our study indicates both the need to optimize LLMs for structured clinical data input and the potential of LLMs for creating machine-readable labels from large free-text radiological databases.
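The agreement statistic named in the Methods can be illustrated with a minimal sketch. It assumes the unweighted form of Cohen's kappa (the abstract does not say whether a weighted variant was used), and the function name `cohens_kappa` and the eight example labels are hypothetical, not study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical LI-RADS categories for eight lesions (illustration only).
ground_truth = [3, 4, 5, 5, 2, 4, 3, 5]
model_output = [3, 4, 4, 5, 2, 5, 3, 5]
kappa = cohens_kappa(ground_truth, model_output)
print(f"kappa = {kappa:.3f}")
```

Because kappa subtracts the agreement expected by chance, it is lower than the raw proportion of matching labels whenever the two label distributions overlap.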
Affiliation(s)
- Philipp Fervers
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- Robert Hahnfeldt
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- Jonathan Kottlors
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- Anton Wagner
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- David Maintz
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- Daniel Pinto dos Santos
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
  - Department of Diagnostic and Interventional Radiology, Goethe University Frankfurt am Main, University Hospital Frankfurt, Frankfurt am Main, Germany
- Simon Lennartz
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
- Thorsten Persigehl
  - Department of Diagnostic and Interventional Radiology, University Cologne, Faculty of Medicine and University Hospital Cologne, Cologne, Germany
2
Puts S, Nobel M, Zegers C, Bermejo I, Robben S, Dekker A. How Natural Language Processing Can Aid With Pulmonary Oncology Tumor Node Metastasis Staging From Free-Text Radiology Reports: Algorithm Development and Validation. JMIR Form Res 2023; 7:e38125. [PMID: 36947118 PMCID: PMC10131747 DOI: 10.2196/38125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Revised: 09/25/2022] [Accepted: 12/22/2022] [Indexed: 03/23/2023]
Abstract
BACKGROUND: Natural language processing (NLP) is considered a promising way to extract concepts from free text and store them in a structured manner for data mining. This also applies to radiology reports, which still consist mostly of free text. Accurate and complete reports are very important for clinical decision support, for instance, in oncological staging. NLP can therefore serve as a tool to structure the content of the radiology report, thereby increasing the report's value.
OBJECTIVE: This study describes the implementation and validation of an N-stage classifier for pulmonary oncology based on free-text radiological chest computed tomography reports, following the tumor, node, and metastasis (TNM) classification. The N-stage classifier was added to an existing T-stage classifier to create a combined TN-stage classifier.
METHODS: SpaCy, PyContextNLP, and regular expressions were used for information extraction, after additional rules were defined to extract the N-stage accurately.
RESULTS: The overall TN-stage classifier accuracy scores were 0.84 and 0.85 for the training (N=95) and validation (N=97) sets, respectively. This is comparable to the outcomes of the T-stage classifier (0.87-0.92).
CONCLUSIONS: This study shows that NLP has potential for classifying pulmonary oncology findings from free-text radiological reports according to the TNM classification system, as both the T- and N-stages can be extracted with high accuracy.
Affiliation(s)
- Sander Puts
  - GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
  - Department of Radiation Oncology, Maastro, Maastricht, Netherlands
- Martijn Nobel
  - School of Health Professions Education, Maastricht University, Maastricht, Netherlands
  - Department of Radiology and Nuclear Medicine, Maastricht University Medical Center+, Maastricht, Netherlands
- Catharina Zegers
  - GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
  - Department of Radiation Oncology, Maastro, Maastricht, Netherlands
- Iñigo Bermejo
  - GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
- Simon Robben
  - School of Health Professions Education, Maastricht University, Maastricht, Netherlands
  - Department of Radiology and Nuclear Medicine, Maastricht University Medical Center+, Maastricht, Netherlands
- Andre Dekker
  - GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
  - Department of Radiation Oncology, Maastro, Maastricht, Netherlands
3
Jungmann F, Arnhold G, Kämpgen B, Jorg T, Düber C, Mildenberger P, Kloeckner R. A Hybrid Reporting Platform for Extended RadLex Coding Combining Structured Reporting Templates and Natural Language Processing. J Digit Imaging 2021; 33:1026-1033. [PMID: 32318897 DOI: 10.1007/s10278-020-00342-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/26/2023]
Abstract
Structured reporting is a favorable and sustainable form of reporting in radiology. Among its advantages are better presentation, clearer nomenclature, and higher quality. By using MRRT-compliant templates, the content of the categorized items (e.g., select fields) can be automatically stored in a database, which allows further research and quality analytics based on established ontologies like RadLex® linked to the items. Additionally, it is relevant to provide free-text input for descriptions of findings and impressions in complex imaging studies or for the information included with the clinical referral. So far, however, this unstructured content could not be categorized. We developed a solution to analyze and code these free-text parts of the templates in our MRRT-compliant reporting platform, using natural language processing (NLP) with RadLex® terms in addition to the already categorized items. The established hybrid reporting concept is working successfully. The NLP tool provides RadLex® codes with modifiers (affirmed, speculated, negated). Radiologists can confirm or reject codes provided by NLP before finalizing the structured report. Furthermore, users can suggest RadLex® codes for free text that NLP did not code correctly, or can suggest changing the modifier. Analyzing free-text fields took 1.23 s on average. Hybrid reporting enables coding of free-text information in our MRRT-compliant templates and thus increases the amount of categorized data that can be stored in the database. This enhances the possibilities for further analyses, such as correlating clinical information with radiological findings or storing high-quality structured information for machine-learning approaches.
Affiliation(s)
- Florian Jungmann
  - Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg University Mainz, Langenbeckst. 1, 55131, Mainz, Germany
- G Arnhold
  - Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg University Mainz, Langenbeckst. 1, 55131, Mainz, Germany
- B Kämpgen
  - Empolis Information Management GmbH, Kaiserslautern, Germany
- T Jorg
  - Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg University Mainz, Langenbeckst. 1, 55131, Mainz, Germany
- C Düber
  - Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg University Mainz, Langenbeckst. 1, 55131, Mainz, Germany
- P Mildenberger
  - Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg University Mainz, Langenbeckst. 1, 55131, Mainz, Germany
- R Kloeckner
  - Department of Diagnostic and Interventional Radiology, University Medical Center of the Johannes Gutenberg University Mainz, Langenbeckst. 1, 55131, Mainz, Germany
4
Maros ME, Cho CG, Junge AG, Kämpgen B, Saase V, Siegel F, Trinkmann F, Ganslandt T, Groden C, Wenz H. Comparative analysis of machine learning algorithms for computer-assisted reporting based on fully automated cross-lingual RadLex mappings. Sci Rep 2021; 11:5529. [PMID: 33750857 PMCID: PMC7970897 DOI: 10.1038/s41598-021-85016-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/23/2020] [Accepted: 02/23/2021] [Indexed: 02/03/2023]
Abstract
Computer-assisted reporting (CAR) tools were suggested to improve radiology report quality by context-sensitively recommending key imaging biomarkers. However, studies evaluating machine learning (ML) algorithms on cross-lingual ontological (RadLex) mappings for developing embedded CAR algorithms are lacking. Therefore, we compared ML algorithms developed on human expert-annotated features against those developed on fully automated cross-lingual (German to English) RadLex mappings using 206 CT reports of suspected stroke. The target label was whether the Alberta Stroke Programme Early CT Score (ASPECTS) should have been provided (yes/no: 154/52). We focused on probabilistic outputs of ML algorithms including tree-based methods, elastic net, support vector machines (SVMs) and fastText (linear classifier), which were evaluated in the same 5 × 5-fold nested cross-validation framework. This allowed for model stacking and classifier rankings. Performance was evaluated using calibration metrics (AUC, Brier score, log loss) and calibration plots. Contextual ML-based assistance recommending ASPECTS was feasible. SVMs showed the highest accuracies on both human-extracted (87%) and RadLex features (findings: 82.5%; impressions: 85.4%). FastText achieved the highest accuracy (89.3%) and AUC (92%) on impressions. Boosted trees fitted on findings had the best calibration profile. Our approach provides guidance for choosing ML classifiers for CAR tools in a fully automated and language-agnostic fashion using bag-of-RadLex terms on limited expert-labelled training data.
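Two of the metrics named in this abstract, the Brier score and log loss, both score predicted probabilities against binary labels. The sketch below shows the standard definitions in plain Python; the function names and the four example predictions are hypothetical and are not taken from the study.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical labels (1 = score should have been provided) and predicted probabilities.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.5]
brier = brier_score(labels, probs)
ll = log_loss(labels, probs)
print(f"Brier score = {brier:.3f}, log loss = {ll:.3f}")
```

Lower is better for both metrics. Log loss penalizes confident wrong predictions far more heavily than the Brier score does, which is one reason the two are often reported together.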
Affiliation(s)
- Máté E Maros
  - Department of Neuroradiology, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68137, Mannheim, Germany
  - Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health (CPD-BW), Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Chang Gyu Cho
  - Department of Neuroradiology, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68137, Mannheim, Germany
  - Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health (CPD-BW), Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Andreas G Junge
  - Department of Neuroradiology, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68137, Mannheim, Germany
- Victor Saase
  - Department of Neuroradiology, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68137, Mannheim, Germany
- Fabian Siegel
  - Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health (CPD-BW), Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Frederik Trinkmann
  - Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health (CPD-BW), Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Thomas Ganslandt
  - Department of Biomedical Informatics at the Center for Preventive Medicine and Digital Health (CPD-BW), Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Christoph Groden
  - Department of Neuroradiology, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68137, Mannheim, Germany
- Holger Wenz
  - Department of Neuroradiology, Medical Faculty Mannheim, Heidelberg University, Theodor-Kutzer-Ufer 1-3, 68137, Mannheim, Germany
5
König M, Sander A, Demuth I, Diekmann D, Steinhagen-Thiessen E. Knowledge-based best of breed approach for automated detection of clinical events based on German free text digital hospital discharge letters. PLoS One 2019; 14:e0224916. [PMID: 31774830 PMCID: PMC6881027 DOI: 10.1371/journal.pone.0224916] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/16/2019] [Accepted: 10/24/2019] [Indexed: 12/26/2022]
Abstract
Objectives: The secondary use of medical data contained in electronic medical records, such as hospital discharge letters, is a valuable resource for the improvement of clinical care (e.g. in terms of medication safety) or for research purposes. However, the automated processing and analysis of medical free text still poses a huge challenge to available natural language processing (NLP) systems. The aim of this study was to implement a knowledge-based best-of-breed approach, combining a terminology server with integrated ontology, an NLP pipeline and a rules engine.
Methods: We tested the performance of this approach in a use case. The clinical event of interest was the particular drug-disease interaction "proton-pump inhibitor [PPI] use and osteoporosis". Cases were to be identified using free-text digital discharge letters as the source of information. Automated detection was validated against a gold standard.
Results: Precision of recognition of osteoporosis was 94.19%, and recall was 97.45%. PPIs were detected with 100% precision and 97.97% recall. The F-score for the detection of the given drug-disease interaction was 96.13%.
Conclusion: We could show that our approach of combining an NLP pipeline, a terminology server, and a rules engine for the automated detection of clinical events such as drug-disease interactions from free-text digital hospital discharge letters was effective. There is huge potential for implementation in clinical and research contexts, as this approach enables analyses of very high numbers of medical free-text documents within a short time period.
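The relationship between the precision, recall, and F-score figures above can be sketched as follows. This assumes the balanced F1 score (the abstract does not state which F-measure was used); the function name `f_score` is hypothetical, and the per-entity values computed here are derived for illustration and are not reported in the abstract.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall (F1)."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was detected correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
osteoporosis_f1 = f_score(0.9419, 0.9745)
ppi_f1 = f_score(1.0, 0.9797)
print(f"osteoporosis F1 = {osteoporosis_f1:.4f}, PPI F1 = {ppi_f1:.4f}")
```

The reported 96.13% refers to detection of the combined drug-disease interaction, whose own precision and recall are not given in the abstract, so it cannot be reproduced from the per-entity figures alone.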
Affiliation(s)
- Maximilian König
  - Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Lipid Clinic at Interdisciplinary Metabolism Center, Berlin, Germany
  - Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Department of Nephrology and Internal Intensive Care Medicine, Berlin, Germany
- André Sander
  - ID Information und Dokumentation im Gesundheitswesen GmbH, Berlin, Germany
- Ilja Demuth
  - Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Lipid Clinic at Interdisciplinary Metabolism Center, Berlin, Germany
  - Charité - Universitätsmedizin Berlin, BCRT—Berlin Institute of Health Center for Regenerative Therapies, Berlin, Germany
- Daniel Diekmann
  - ID Information und Dokumentation im Gesundheitswesen GmbH, Berlin, Germany
- Elisabeth Steinhagen-Thiessen
  - Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Lipid Clinic at Interdisciplinary Metabolism Center, Berlin, Germany