1. Koga S. Advancing large language models in nephrology: bridging the gap in image interpretation. Clin Exp Nephrol 2024. PMID: 39465433. DOI: 10.1007/s10157-024-02581-9.
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA, 19104, USA.
2. Kim W, Kim BC, Yeom HG. Performance of Large Language Models on the Korean Dental Licensing Examination: A Comparative Study. Int Dent J 2024. PMID: 39370338. DOI: 10.1016/j.identj.2024.09.002.
Abstract
PURPOSE This study investigated the potential application of large language models (LLMs) in dental education and practice, with a focus on ChatGPT and Claude3-Opus. Using the Korean Dental Licensing Examination (KDLE) as a benchmark, we aimed to assess the capabilities of these models in the dental field. METHODS This study evaluated three LLMs: GPT-3.5, GPT-4 (version: March 2024), and Claude3-Opus (version: March 2024). We used the KDLE questions from 2019 to 2023 as inputs to the LLMs and took the models' outputs as the corresponding answers. Total scores for individual subjects were obtained and compared. We also compared the performance of the LLMs with that of the human examinees who took the exams. RESULTS Claude3-Opus performed best among the LLMs considered, except in 2019, when ChatGPT-4 performed best. Claude3-Opus and ChatGPT-4 surpassed the cut-off scores in all years considered, indicating that they passed the KDLE, whereas ChatGPT-3.5 did not. However, all LLMs considered performed worse than humans, represented here by dental students in Korea. On average, the best-performing LLM each year achieved 85.4% of human performance. CONCLUSION Using the KDLE as a benchmark, our study demonstrates that although LLMs have not yet reached human-level performance in overall scores, both Claude3-Opus and ChatGPT-4 exceed the cut-off scores and perform exceptionally well in specific subjects. CLINICAL RELEVANCE Our findings will aid in evaluating the feasibility of integrating LLMs into dentistry to improve the quality and availability of dental services by offering patient information that meets the basic competency standards of a dentist.
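The grading workflow described in this abstract, marking each LLM answer against the official answer key and comparing the total with a pass cut-off, can be sketched as follows. This is a hypothetical illustration, not the authors' code: the question IDs, subject names, and cut-off value are placeholders.

```python
# Minimal sketch (not the study's code): score LLM answers against an answer key
# and check the total against a hypothetical pass cut-off.
from collections import defaultdict

def score_exam(llm_answers: dict, answer_key: dict, subjects: dict) -> dict:
    """Return per-subject and total scores as percentages of correct answers."""
    correct, total = defaultdict(int), defaultdict(int)
    for qid, gold in answer_key.items():
        total[subjects[qid]] += 1
        if llm_answers.get(qid, "").strip().upper() == gold.upper():
            correct[subjects[qid]] += 1
    scores = {s: 100 * correct[s] / total[s] for s in total}
    scores["TOTAL"] = 100 * sum(correct.values()) / sum(total.values())
    return scores

# Illustrative use with invented data; real KDLE items are not reproduced here.
key = {"Q1": "B", "Q2": "D", "Q3": "A"}
subj = {"Q1": "oral pathology", "Q2": "oral pathology", "Q3": "prosthodontics"}
model_output = {"Q1": "B", "Q2": "C", "Q3": "A"}
scores = score_exam(model_output, key, subj)
print(scores, "PASS" if scores["TOTAL"] >= 60 else "FAIL")  # 60 is an illustrative cut-off
```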
Affiliation(s)
- Woojun Kim
- The Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Bong Chul Kim
- Department of Oral and Maxillofacial Surgery, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon, Korea
- Han-Gyeol Yeom
- Department of Oral and Maxillofacial Radiology, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon, Korea.
3. Ono D, Dickson DW, Koga S. Evaluating the efficacy of few-shot learning for GPT-4Vision in neurodegenerative disease histopathology: A comparative analysis with convolutional neural network model. Neuropathol Appl Neurobiol 2024;50:e12997. PMID: 39010256. DOI: 10.1111/nan.12997.
Abstract
AIMS Recent advances in artificial intelligence, particularly large language models such as GPT-4Vision (GPT-4V), a derivative feature of ChatGPT, have expanded the potential for medical image interpretation. This study evaluates the accuracy of GPT-4V in image classification tasks of histopathological images and compares its performance with a traditional convolutional neural network (CNN). METHODS We utilised 1520 images, including haematoxylin and eosin staining and tau immunohistochemistry, from patients with various neurodegenerative diseases, such as Alzheimer's disease (AD), progressive supranuclear palsy (PSP) and corticobasal degeneration (CBD). We assessed GPT-4V's performance using multi-step prompts to determine how textual context influences image interpretation. We also employed few-shot learning to improve GPT-4V's diagnostic performance in classifying three specific tau lesions (astrocytic plaques, neuritic plaques and tufted astrocytes) and compared the outcomes with the CNN model YOLOv8. RESULTS GPT-4V accurately recognised staining techniques and tissue origin but struggled with specific lesion identification. The interpretation of images was notably influenced by the provided textual context, which sometimes led to diagnostic inaccuracies. For instance, when presented with images of the motor cortex, the diagnosis shifted inappropriately from AD to CBD or PSP. However, few-shot learning markedly improved GPT-4V's diagnostic capabilities, enhancing accuracy from 40% with zero-shot learning to 90% with 20-shot learning, matching the performance of YOLOv8, which required 100-shot learning to achieve the same accuracy. CONCLUSIONS Although GPT-4V faces challenges in independently interpreting histopathological images, few-shot learning significantly improves its performance. This approach is especially promising for neuropathology, where acquiring extensive labelled datasets is often challenging.
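A few-shot prompt of the kind evaluated here interleaves a small number of labelled example images with the query image in a single multimodal request. The sketch below assumes the OpenAI Python client (openai >= 1.x) and base64-encoded photomicrographs; the prompt text, file names, labels, and model name are illustrative and do not reproduce the authors' exact protocol.

```python
# Hypothetical sketch of few-shot multimodal prompting; not the study's actual code.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def image_part(path: str) -> dict:
    """Encode a local photomicrograph as a data-URL image content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def classify(query_image: str, shots: list) -> str:
    """shots: (image_path, label) pairs, e.g. ('tufted_astrocyte_01.jpg', 'tufted astrocyte')."""
    content = [{"type": "text",
                "text": "Classify each tau-immunostained lesion as astrocytic plaque, "
                        "neuritic plaque, or tufted astrocyte. Labelled examples follow."}]
    for path, label in shots:        # the few-shot examples
        content.append(image_part(path))
        content.append({"type": "text", "text": f"Label: {label}"})
    content.append({"type": "text", "text": "Now classify this image. Answer with the label only."})
    content.append(image_part(query_image))
    resp = client.chat.completions.create(model="gpt-4o",  # placeholder model name
                                          messages=[{"role": "user", "content": content}])
    return resp.choices[0].message.content.strip()
```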
Affiliation(s)
- Daisuke Ono
- Department of Neuroscience, Mayo Clinic, Jacksonville, Florida, USA
- Dennis W Dickson
- Department of Neuroscience, Mayo Clinic, Jacksonville, Florida, USA
- Shunsuke Koga
- Department of Neuroscience, Mayo Clinic, Jacksonville, Florida, USA
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, Philadelphia, Pennsylvania, USA
4. Laohawetwanit T, Pinto DG, Bychkov A. A survey analysis of the adoption of large language models among pathologists. Am J Clin Pathol 2024. PMID: 39076014. DOI: 10.1093/ajcp/aqae093.
Abstract
OBJECTIVES We sought to investigate the adoption and perception of large language model (LLM) applications among pathologists. METHODS A cross-sectional survey was conducted, gathering data from pathologists on their usage and views concerning LLM tools. The survey, distributed globally through various digital platforms, included quantitative and qualitative questions. Patterns in the respondents' adoption and perspectives on these artificial intelligence tools were analyzed. RESULTS Of 215 respondents, 100 (46.5%) reported using LLMs, particularly ChatGPT (OpenAI), for professional purposes, predominantly for information retrieval, proofreading, academic writing, and drafting pathology reports, highlighting a significant time-saving benefit. Academic pathologists demonstrated a better level of understanding of LLMs than their peers. Although chatbots sometimes provided incorrect general domain information, they were considered moderately proficient concerning pathology-specific knowledge. The technology was mainly used for drafting educational materials and programming tasks. The most sought-after feature in LLMs was their image analysis capabilities. Participants expressed concerns about information accuracy, privacy, and the need for regulatory approval. CONCLUSIONS Large language model applications are gaining notable acceptance among pathologists, with nearly half of respondents indicating adoption less than a year after the tools' introduction to the market. They see the benefits but are also worried about these tools' reliability, ethical implications, and security.
Affiliation(s)
- Thiyaphat Laohawetwanit
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Daniel Gomes Pinto
- Department of Pathology, Hospital Garcia de Orta, Almada, Portugal
- Nova Medical School, Lisbon, Portugal
- Andrey Bychkov
- Department of Pathology, Kameda Medical Center, Kamogawa, Japan
5. Laohawetwanit T, Apornvirat S, Namboonlue C. Thinking like a pathologist: Morphologic approach to hepatobiliary tumors by ChatGPT. Am J Clin Pathol 2024. PMID: 39030695. DOI: 10.1093/ajcp/aqae087.
Abstract
OBJECTIVES This research aimed to evaluate the effectiveness of ChatGPT in accurately diagnosing hepatobiliary tumors using histopathologic images. METHODS The study compared the diagnostic accuracies of the GPT-4 model when given the same set of images with 2 different input prompts. The first prompt, the morphologic approach, was designed to mimic the way pathologists analyze tissue morphology; the second prompt functioned without this morphologic analysis feature. Diagnostic accuracy and consistency were analyzed. RESULTS A total of 120 photomicrographs, comprising 60 images each of hepatobiliary tumors and nonneoplastic liver tissue, were used. The findings revealed that the morphologic approach significantly enhanced the diagnostic accuracy and consistency of the artificial intelligence (AI). This approach was particularly more accurate in identifying hepatocellular carcinoma (mean accuracy: 62.0% vs 27.3%), bile duct adenoma (10.7% vs 3.3%), and cholangiocarcinoma (68.7% vs 16.0%), as well as in distinguishing nonneoplastic liver tissues (77.3% vs 37.5%) (Ps ≤ .01). It also demonstrated higher diagnostic consistency than the prompt without morphologic analysis (κ: 0.46 vs 0.27). CONCLUSIONS This research emphasizes the importance of incorporating pathologists' diagnostic approaches into AI to enhance accuracy and consistency in medical diagnostics. It also showcases the AI's promise in histopathology when it replicates expert diagnostic processes.
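The per-class accuracies and the run-to-run agreement (κ) reported here can be computed from repeated model outputs against the reference diagnoses. The sketch below uses scikit-learn with invented labels; it is only an illustration of the metrics, not the study's analysis code.

```python
# Illustrative computation of per-class accuracy and run-to-run agreement (Cohen's kappa).
import numpy as np
from sklearn.metrics import cohen_kappa_score

truth = np.array(["HCC", "CCA", "non-neoplastic", "HCC", "bile duct adenoma", "CCA"])
run_1 = np.array(["HCC", "CCA", "non-neoplastic", "CCA", "non-neoplastic", "CCA"])
run_2 = np.array(["HCC", "HCC", "non-neoplastic", "CCA", "non-neoplastic", "CCA"])

# Accuracy within each reference class (e.g. hepatocellular carcinoma, cholangiocarcinoma).
for cls in np.unique(truth):
    mask = truth == cls
    print(f"{cls}: accuracy {np.mean(run_1[mask] == cls):.0%}")

# Diagnostic consistency between two repeated runs of the same prompt, irrespective of the truth.
print("kappa between runs:", cohen_kappa_score(run_1, run_2))
```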
Affiliation(s)
- Thiyaphat Laohawetwanit
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Sompon Apornvirat
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
6. Apornvirat S, Thinpanja W, Damrongkiet K, Benjakul N, Laohawetwanit T. Comparing customized ChatGPT and pathology residents in histopathologic description and diagnosis of common diseases. Ann Diagn Pathol 2024;73:152359. PMID: 38972166. DOI: 10.1016/j.anndiagpath.2024.152359.
Abstract
This study aimed to evaluate and analyze the performance of a customized Chat Generative Pre-Trained Transformer (ChatGPT), known as GPT, against pathology residents in providing microscopic descriptions and diagnosing diseases from histopathological images. A dataset of representative photomicrographs from 70 diseases across 14 organ systems was analyzed by a customized version of ChatGPT-4 (GPT-4) and by pathology residents. Two pathologists independently evaluated the microscopic descriptions and diagnoses using a predefined scoring system (0-4 for microscopic descriptions and 0-2 for pathological diagnoses), with higher scores indicating greater accuracy. Microscopic descriptions that received perfect scores, which included all relevant keywords and findings, were then presented to the standard version of ChatGPT to assess its diagnostic capabilities based on these descriptions. GPT-4 showed consistency in microscopic description and diagnosis scores across five rounds, achieving median scores of 50% and 48.6%, respectively. However, its performance was still inferior to that of junior and senior pathology residents (description scores of 73.9% and 93.9%, and diagnosis scores of 63.9% and 87.9%, respectively). When the standard version of ChatGPT was given the microscopic descriptions provided by residents, it correctly diagnosed 35 cases (87.5%) from junior residents and 44 (68.8%) from senior residents, provided the initial descriptions contained the relevant keywords and findings. While GPT-4 can accurately interpret some histopathological images, its overall performance is currently inferior to that of pathology residents. However, ChatGPT's ability to accurately interpret and diagnose diseases from the descriptions provided by residents suggests that this technology could serve as a valuable support tool in pathology diagnostics.
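The percentage scores quoted above follow directly from the 0-4 (description) and 0-2 (diagnosis) rubric: a round's score is the sum of awarded points divided by the maximum obtainable. A small illustration with invented scores, not study data:

```python
# Illustrative rubric aggregation; the per-case scores below are invented.
description_scores = [4, 2, 3, 0, 1]   # 0-4 points per case
diagnosis_scores   = [2, 1, 2, 0, 1]   # 0-2 points per case

desc_pct = 100 * sum(description_scores) / (4 * len(description_scores))
diag_pct = 100 * sum(diagnosis_scores) / (2 * len(diagnosis_scores))
print(f"description {desc_pct:.1f}%, diagnosis {diag_pct:.1f}%")
```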
Affiliation(s)
- Sompon Apornvirat
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand; Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Warut Thinpanja
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
- Khampee Damrongkiet
- Department of Pathology, King Chulalongkorn Memorial Hospital, Bangkok, Thailand; Department of Anatomical Pathology, Faculty of Medicine Vajira Hospital, Navamindradhiraj University, Bangkok, Thailand
- Nontawat Benjakul
- Department of Anatomical Pathology, Faculty of Medicine Vajira Hospital, Navamindradhiraj University, Bangkok, Thailand; Vajira Pathology-Clinical-Correlation Target Research Interest Group, Faculty of Medicine Vajira Hospital, Navamindradhiraj University, Bangkok, Thailand
- Thiyaphat Laohawetwanit
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand; Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand.
7. Koga S, Du W. Integrating AI in medicine: Lessons from Chat-GPT's limitations in medical imaging. Dig Liver Dis 2024;56:1114-1115. PMID: 38429138. DOI: 10.1016/j.dld.2024.02.014.
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States.
- Wei Du
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States
8. Koga S. The double-edged nature of ChatGPT in self-diagnosis. Wien Klin Wochenschr 2024;136:243-244. PMID: 38504058. DOI: 10.1007/s00508-024-02343-3.
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, 19104, Philadelphia, PA, USA.
9. Ohta K, Ohta S. The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study. Cureus 2023;15:e50369. PMID: 38213361. PMCID: PMC10782219. DOI: 10.7759/cureus.50369.
Abstract
Purpose This study aims to evaluate the performance of three large language models (LLMs), the Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and assess their potential clinical applications in Japan. Methods A total of 185 questions from the 2023 JNDE were used. These questions were classified by question type and category. McNemar's test was used to compare correct response rates between pairs of LLMs, while Fisher's exact test evaluated the performance of the LLMs in each question category. Results The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5. GPT-4 showed a significantly higher correct response rate than Bard and GPT-3.5. In the category of essential questions, Bard achieved a correct response rate of 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-4 and GPT-3.5 fell short of this benchmark, with GPT-4 attaining 77.6% and GPT-3.5 only 52.5%. The scores of GPT-4 and Bard were significantly higher than that of GPT-3.5 (p<0.01). For general questions, the correct response rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5. GPT-4 outperformed GPT-3.5 and Bard (p<0.01). The correct response rates for professional dental questions were 51.6% for GPT-4, 45.3% for Bard, and 35.9% for GPT-3.5. The differences among the models were not statistically significant. All LLMs demonstrated significantly lower accuracy on dentistry questions than on other types of questions (p<0.01). Conclusions GPT-4 achieved the highest overall score on the JNDE, followed by Bard and GPT-3.5. However, only Bard surpassed the passing score for essential questions. To further understand the application of LLMs in clinical dentistry worldwide, more research on their performance in dental examinations across different languages is required.
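The paired comparison of two models on the same question set (McNemar's test) and the per-category comparison of correct/incorrect counts (Fisher's exact test) can be reproduced with standard libraries. The sketch below uses invented correctness vectors and counts, not the actual JNDE results.

```python
# Hedged illustration of the statistical comparisons; all data here are invented.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
gpt4_correct = rng.random(185) < 0.735   # invented per-question correctness for model A
bard_correct = rng.random(185) < 0.665   # invented per-question correctness for model B

# McNemar's test uses the 2x2 table of paired (dis)agreements on the same questions.
table = [[np.sum(gpt4_correct & bard_correct),  np.sum(gpt4_correct & ~bard_correct)],
         [np.sum(~gpt4_correct & bard_correct), np.sum(~gpt4_correct & ~bard_correct)]]
print(mcnemar(table, exact=True))

# Fisher's exact test compares correct/incorrect counts within one question category.
print(fisher_exact([[33, 31],    # model A: correct, incorrect in the category (invented)
                    [29, 35]]))  # model B: correct, incorrect in the category (invented)
```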
Affiliation(s)
- Satomi Ohta
- Dentistry, Dentist of Mama and Kodomo, Kobe, JPN