1. Ishida K, Arisaka N, Fujii K. Analysis of Responses of GPT-4V to the Japanese National Clinical Engineer Licensing Examination. J Med Syst 2024; 48:83. PMID: 39259341. DOI: 10.1007/s10916-024-02103-w.
Abstract
Chat Generative Pretrained Transformer (ChatGPT; OpenAI) is a state-of-the-art large language model that can simulate human-like conversations based on user input. We evaluated the performance of GPT-4V on the Japanese National Clinical Engineer Licensing Examination using 2,155 questions from 2012 to 2023. The average correct answer rate across all questions was 86.0%. In particular, clinical medicine, basic medicine, medical materials, biological properties, and mechanical engineering achieved correct response rates of ≥90%. Conversely, medical device safety management, electrical and electronic engineering, and extracorporeal circulation yielded low correct answer rates, ranging from 64.8% to 76.5%. The correct answer rates for questions that included figures/tables, questions requiring numerical calculation, questions combining both (figure/table ∩ calculation), and questions requiring knowledge of Japanese Industrial Standards were 55.2%, 85.8%, 64.2%, and 31.0%, respectively. These low rates reflect ChatGPT's limited image recognition and its lack of knowledge of standards and laws. This study concludes that careful attention is required when using ChatGPT because several of its explanations are incorrect.
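(Editor's note: the "∩" above denotes the intersection of two question subsets. A minimal sketch of this kind of subgroup accuracy analysis, with an invented toy data set since the study's per-question data are not reproduced here:)

```python
# Hypothetical sketch of the subgroup accuracy analysis described above:
# flag each question for figures/tables and numerical calculation, then
# report accuracy per subset, including the intersection (figure ∩ calc).
# The data frame contents are invented for illustration only.
import pandas as pd

questions = pd.DataFrame({
    "correct":    [1, 0, 1, 1, 0, 1, 0, 1],
    "has_figure": [1, 1, 0, 0, 1, 0, 0, 0],
    "needs_calc": [0, 1, 1, 0, 1, 0, 1, 0],
})

subsets = {
    "figure/table": questions["has_figure"] == 1,
    "calculation": questions["needs_calc"] == 1,
    "figure/table ∩ calculation": (questions["has_figure"] == 1)
                                  & (questions["needs_calc"] == 1),
}
for name, mask in subsets.items():
    rate = questions.loc[mask, "correct"].mean()
    print(f"{name}: {rate:.1%} correct ({mask.sum()} questions)")
```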
Affiliation(s)
- Kai Ishida: Department of Materials and Human Environmental Sciences, Faculty of Engineering, Shonan Institute of Technology, Fujisawa, Japan
- Naoya Arisaka: Department of Medical Informatics, School of Allied Health Science, Kitasato University, Sagamihara, Japan
- Kiyotaka Fujii: Department of Clinical Engineering, School of Allied Health Science, Kitasato University, Sagamihara, Japan
2. Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024; 103:e39250. PMID: 39121303. PMCID: PMC11315549. DOI: 10.1097/md.0000000000039250.
Abstract
BACKGROUND ChatGPT, a powerful AI language model, has gained increasing prominence in medicine, offering potential applications in healthcare, clinical decision support, patient communication, and medical research. This systematic review aims to comprehensively assess the applications of ChatGPT in healthcare education, research, writing, patient communication, and practice, while also delineating its limitations and areas for improvement. METHODS Our comprehensive database search retrieved relevant papers from PubMed, Medline, and Scopus. After screening, 83 studies met the inclusion criteria. This review includes original studies comprising case reports, analytical studies, and editorials with original findings. RESULTS ChatGPT is useful for scientific research and academic writing, assisting with grammar, clarity, and coherence. This helps non-English speakers and improves accessibility by breaking down linguistic barriers. However, its limitations include possible inaccuracy and ethical issues such as bias and plagiarism. ChatGPT streamlines workflows and offers diagnostic and educational potential in healthcare, but it exhibits biases and lacks emotional sensitivity. It is useful in patient communication, but it requires up-to-date data, and concerns remain about the accuracy of its information and hallucinatory responses. CONCLUSION Given the potential for ChatGPT to transform healthcare education, research, and practice, its adoption in these areas should be approached with caution due to its inherent limitations.
Affiliation(s)
- Afia Fatima: Department of Medicine, Jinnah Sindh Medical University, Karachi, Pakistan
- Khadija Alam: Department of Medicine, Liaquat National Medical College, Karachi, Pakistan
3. Ishida K, Hanada E. Potential of ChatGPT to Pass the Japanese Medical and Healthcare Professional National Licenses: A Literature Review. Cureus 2024; 16:e66324. PMID: 39247019. PMCID: PMC11377128. DOI: 10.7759/cureus.66324.
Abstract
This systematic review aimed to assess the academic potential of ChatGPT (GPT-3.5, GPT-4, and GPT-4V) on Japanese national medical and healthcare licensing examinations, taking into account its strengths and limitations. Electronic databases, including PubMed/Medline, Google Scholar, and ICHUSHI (a Japanese medical article database), were systematically searched for relevant articles, particularly those published between January 1, 2022, and April 30, 2024. A formal narrative analysis was conducted by systematically comparing the similarities and differences among individual research findings. After rigorous screening, we reviewed 22 articles. With one exception, all articles that evaluated GPT-4 showed that it could pass examinations consisting of text-only questions. However, some studies also reported that, despite passing, GPT-4 scored worse than actual examinees. Moreover, the newest model, GPT-4V, recognized images insufficiently and therefore gave inadequate answers to questions involving images and figures/tables. Its precision needs to be improved to obtain better results.
Affiliation(s)
- Kai Ishida: Faculty of Engineering, Shonan Institute of Technology, Fujisawa, JPN
- Eisuke Hanada: Faculty of Science and Engineering, Saga University, Saga, JPN
4. Miao Y, Luo Y, Zhao Y, Li J, Liu M, Wang H, Chen Y, Wu Y. Performance of GPT-4 on Chinese Nursing Examination: Potentials for AI-Assisted Nursing Education Using Large Language Models. Nurse Educ 2024:00006223-990000000-00488. PMID: 38981035. DOI: 10.1097/nne.0000000000001679.
Abstract
BACKGROUND The performance of GPT-4 in nursing examinations within the Chinese context has not yet been thoroughly evaluated. OBJECTIVE To assess the performance of GPT-4 on multiple-choice and open-ended questions derived from nursing examinations in the Chinese context. METHODS The data sets of the Chinese National Nursing Licensure Examination spanning 2021 to 2023 were used to evaluate the accuracy of GPT-4 on multiple-choice questions. The performance of GPT-4 on open-ended questions was examined using 18 case-based questions. RESULTS For multiple-choice questions, GPT-4 achieved an accuracy of 71.0% (511/720). For open-ended questions, the responses were evaluated for cosine similarity, logical consistency, and information quality, all of which were found to be at a moderate level. CONCLUSION GPT-4 performed well in addressing queries on basic knowledge but has notable limitations in answering open-ended questions. Nursing educators should weigh the benefits and challenges of integrating GPT-4 into nursing education.
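(Editor's note: the abstract does not specify how cosine similarity was computed; one common choice is cosine similarity over TF-IDF vectors. A minimal sketch under that assumption, using scikit-learn and invented sample texts:)

```python
# Hypothetical sketch: scoring an open-ended model answer against a reference
# answer by cosine similarity over TF-IDF vectors. The study does not state
# its vectorizer; TF-IDF via scikit-learn is one plausible choice, and the
# two answer strings below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference_answer = "Monitor vital signs and elevate the head of the bed."
model_answer = "Elevate the head of the bed and observe the patient's vital signs."

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([reference_answer, model_answer])
score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"cosine similarity: {score:.2f}")
```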
Affiliation(s)
- School of Nursing, Capital Medical University, Beijing, China (Drs Miao, Luo, Zhao, Li, Liu, Wang, and Wu); and School of Nursing, Johns Hopkins University, Baltimore, USA (Dr Chen)
5. Samaan JS, Rajeev N, Ng WH, Srinivasan N, Busam JA, Yeo YH, Samakar K. ChatGPT as a Source of Information for Bariatric Surgery Patients: A Comparative Analysis of Accuracy and Comprehensiveness Between GPT-4 and GPT-3.5. Obes Surg 2024; 34:1987-1989. PMID: 38564173. PMCID: PMC11031485. DOI: 10.1007/s11695-024-07212-6.
Affiliation(s)
- Jamil S Samaan: Karsh Division of Digestive and Liver Diseases, Department of Medicine, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Nithya Rajeev: Division of Upper GI and General Surgery, Department of Surgery, Keck School of Medicine of USC, Health Care Consultation Center, 1510 San Pablo St #514, Los Angeles, CA, 90033, USA
- Wee Han Ng: Bristol Medical School, University of Bristol, 5 Tyndall Ave, Bristol, BS8 1UD, UK
- Nitin Srinivasan: Division of Upper GI and General Surgery, Department of Surgery, Keck School of Medicine of USC, Health Care Consultation Center, 1510 San Pablo St #514, Los Angeles, CA, 90033, USA
- Jonathan A Busam: Karsh Division of Digestive and Liver Diseases, Department of Medicine, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Yee Hui Yeo: Karsh Division of Digestive and Liver Diseases, Department of Medicine, Cedars-Sinai Medical Center, 8700 Beverly Blvd, Los Angeles, CA, 90048, USA
- Kamran Samakar: Division of Upper GI and General Surgery, Department of Surgery, Keck School of Medicine of USC, Health Care Consultation Center, 1510 San Pablo St #514, Los Angeles, CA, 90033, USA
6. Noda M, Ueno T, Koshu R, Takaso Y, Shimada MD, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A, Yoshizaki T. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med Educ 2024; 10:e57054. PMID: 38546736. PMCID: PMC11009855. DOI: 10.2196/57054.
Abstract
BACKGROUND Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival those of human experts. However, challenges remain in the analysis of complex data containing images and diagrams. OBJECTIVE This study aims to assess the answering capability and accuracy of ChatGPT-4 Vision (GPT-4V) on a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. METHODS Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the effects of the presence of images, the clinical area of the questions, and variations in answer content were examined. RESULTS The accuracy rate for text-only input averaged 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all input types, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate, and for all content types, the addition of translation and prompts increased accuracy. For image-based questions, the average correct answer rate was 30.4% with text-only input and 41.3% with text-plus-image input (P=.02). CONCLUSIONS Examining artificial intelligence's answering capability on the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although accuracy improved with the addition of translation and prompts, the accuracy rate for image-based questions remained lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher correct answer rate on image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, methods for its safe use require further consideration.
Affiliation(s)
- Masao Noda: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan; Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Takayoshi Ueno: Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Ryota Koshu: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Yuji Takaso: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan; Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Mari Dias Shimada: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Chizu Saito: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Hisashi Sugimoto: Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Hiroaki Fushiki: Department of Otolaryngology, Mejiro University Ear Institute Clinic, Saitama, Japan
- Makoto Ito: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Akihiro Nomura: College of Transdisciplinary Sciences for Innovation, Kanazawa University, Kanazawa, Japan
- Tomokazu Yoshizaki: Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
7. Sato H, Ogasawara K. ChatGPT (GPT-4) passed the Japanese National License Examination for Pharmacists in 2022, answering all items including those with diagrams: a descriptive study. J Educ Eval Health Prof 2024; 21:4. PMID: 38413129. PMCID: PMC10948916. DOI: 10.3352/jeehp.2024.21.4.
Abstract
PURPOSE The objective of this study was to assess the performance of ChatGPT (GPT-4) on all items, including those with diagrams, in the Japanese National License Examination for Pharmacists (JNLEP) and to compare it with the performance of the previous model, GPT-3.5. METHODS This study targeted the 107th JNLEP, conducted in 2022; all 344 items were input into the GPT-4 model. Separately, 284 items, excluding those with diagrams, were entered into the GPT-3.5 model. The answers were categorized and analyzed to determine accuracy rates by category, subject, and the presence or absence of diagrams. The accuracy rates were compared with the main passing criterion (overall accuracy rate ≥62.9%). RESULTS The overall accuracy rate of GPT-4 for all items in the 107th JNLEP was 72.5%, successfully meeting all passing criteria. For the items without diagrams, the accuracy rate was 80.0%, significantly higher than that of the GPT-3.5 model (43.5%). The GPT-4 model achieved an accuracy rate of 36.1% for items that included diagrams. CONCLUSION Advancements that allow GPT-4 to process images have made it possible for large language models to answer all items in medical-related licensing examinations. This study's findings confirm that ChatGPT (GPT-4) possesses sufficient knowledge to meet the passing criteria.
Affiliation(s)
- Hiroyasu Sato: Department of Pharmacy, Abashiri Kosei General Hospital, Abashiri, Japan
- Katsuhiko Ogasawara: Graduate School of Health Sciences, Hokkaido University, Sapporo, Japan; Graduate School of Engineering, Muroran Institute of Technology, Muroran, Japan
8. Ohta K, Ohta S. The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study. Cureus 2023; 15:e50369. PMID: 38213361. PMCID: PMC10782219. DOI: 10.7759/cureus.50369.
Abstract
Purpose This study aims to evaluate the performance of three large language models (LLMs), Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and to assess their potential clinical applications in Japan. Methods A total of 185 questions from the 2023 JNDE were used and categorized by question type and category. McNemar's test was used to compare correct response rates between pairs of LLMs, and Fisher's exact test was used to evaluate performance within each question category. Results The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5; GPT-4's rate was significantly higher than those of Bard and GPT-3.5. In the category of essential questions, Bard achieved a correct response rate of 80.5%, surpassing the passing criterion of 80%. In contrast, both GPT-4 and GPT-3.5 fell short of this benchmark, with GPT-4 attaining 77.6% and GPT-3.5 only 52.5%. The scores of GPT-4 and Bard were significantly higher than that of GPT-3.5 (p<0.01). For general questions, the correct response rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5; GPT-4 outperformed both GPT-3.5 and Bard (p<0.01). The correct response rates for professional dental questions were 51.6% for GPT-4, 45.3% for Bard, and 35.9% for GPT-3.5, with no statistically significant differences among the models. All LLMs demonstrated significantly lower accuracy on dentistry questions than on other question types (p<0.01). Conclusions GPT-4 achieved the highest overall score on the JNDE, followed by Bard and GPT-3.5, but only Bard surpassed the passing score for essential questions. To better understand the application of LLMs in clinical dentistry worldwide, more research on their performance in dental examinations across different languages is required.
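(Editor's note: as a rough illustration of the statistics named above — McNemar's test compares two models' paired right/wrong outcomes on the same questions, while Fisher's exact test compares correct/incorrect counts across groups. A minimal sketch with invented data, assuming scipy and statsmodels:)

```python
# Hypothetical sketch of the paired comparison described above: McNemar's
# test on per-question correctness of two models answering the same items,
# plus Fisher's exact test on a category-level 2x2 table. All data invented.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 185
gpt4_correct = rng.random(n_questions) < 0.735   # placeholder ~73.5% accuracy
gpt35_correct = rng.random(n_questions) < 0.519  # placeholder ~51.9% accuracy

# 2x2 table of paired outcomes: rows = GPT-4 correct/incorrect, cols = GPT-3.5
table = np.array([
    [np.sum(gpt4_correct & gpt35_correct), np.sum(gpt4_correct & ~gpt35_correct)],
    [np.sum(~gpt4_correct & gpt35_correct), np.sum(~gpt4_correct & ~gpt35_correct)],
])
result = mcnemar(table, exact=True)
print(f"McNemar p = {result.pvalue:.4f}")

# Fisher's exact test on invented correct/incorrect counts for one category
odds_ratio, p_value = fisher_exact([[33, 31], [23, 41]])
print(f"Fisher's exact p = {p_value:.3f}")
```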
Affiliation(s)
- Satomi Ohta: Dentistry, Dentist of Mama and Kodomo, Kobe, JPN
9. Kaneda Y, Takita M, Hamaki T, Ozaki A, Tanimoto T. ChatGPT's Potential in Enhancing Physician Efficiency: A Japanese Case Study. Cureus 2023; 15:e48235. PMID: 38050503. PMCID: PMC10693924. DOI: 10.7759/cureus.48235.
Abstract
Artificial intelligence (AI), particularly ChatGPT, developed by OpenAI (San Francisco, CA, USA), is making significant strides in the medical field. In a simulated case study, a 66-year-old Japanese female patient's dialogue with a physician was transcribed and input into ChatGPT to assess its efficacy in drafting medical records, formulating differential diagnoses, and establishing treatment plans. The results showed a high similarity between the medical summaries generated by ChatGPT and those of the attending physician. This suggests that ChatGPT has the potential to assist physicians in clinical reasoning and reduce their administrative burden, allowing them to spend more time with patients. However, there are limitations, such as the system's reliance on linguistic data and occasional inaccuracies. Despite its potential, the ethical implications of using patient data and the risk of AI replacing clinicians underscore the need for continuous evaluation, rigorous oversight, and comprehensive guidelines. As AI continues to integrate into healthcare, physicians must ensure that technology complements, rather than replaces, human expertise, with the primary focus remaining on delivering high-quality patient care.
Affiliation(s)
- Yudai Kaneda: Epidemiology and Public Health, School of Medicine, Hokkaido University, Hokkaido, JPN
- Morihito Takita: Internal Medicine, Medical Governance Research Institute, Tokyo, JPN
- Tamae Hamaki: Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Shinjuku, Tokyo, JPN
- Akihiko Ozaki: Breast and Thyroid Surgery, Jyoban Hospital of Tokiwa Foundation, Fukushima, JPN
- Tetsuya Tanimoto: Internal Medicine, Accessible Rail Medical Services Tetsuikai, Navitas Clinic Kawasaki, Kanagawa, JPN
10. Yanagita Y, Yokokawa D, Uchida S, Tawara J, Ikusaka M. Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study. JMIR Form Res 2023; 7:e48023. PMID: 37831496. PMCID: PMC10612006. DOI: 10.2196/48023.
Abstract
BACKGROUND ChatGPT (OpenAI) has gained considerable attention because of its natural and intuitive responses. As OpenAI acknowledges as a limitation, ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers. However, considering that ChatGPT is an interactive AI trained to reduce unethical output, the reliability of its training data is high and the usefulness of its output is promising. In March 2023, a new version, GPT-4, was released; according to internal evaluations, it was expected to produce factual responses 40% more often than its predecessor, GPT-3.5. The usefulness of this version of ChatGPT in English is widely appreciated, and it is increasingly being evaluated as a system for obtaining medical information in other languages. Although it does not reach a passing score on the national medical examination in Chinese, its accuracy is expected to gradually improve. Evaluation of ChatGPT with Japanese input remains limited, although there have been reports on the accuracy of its answers to clinical questions regarding the Japanese Society of Hypertension guidelines and on its performance on the National Nursing Examination. OBJECTIVE The objective of this study is to evaluate whether ChatGPT can provide accurate diagnoses and medical knowledge for Japanese input. METHODS Questions from the National Medical Licensing Examination (NMLE) in Japan, administered by the Japanese Ministry of Health, Labour and Welfare in 2022, were used. All 400 questions were considered; questions containing figures and tables, which ChatGPT could not recognize, were excluded, and only text-based questions were extracted. We entered the Japanese questions as written into GPT-3.5 and GPT-4 and instructed each model to output the correct answer for each question. The output of ChatGPT was verified by 2 general practice physicians; discrepancies were checked by another physician to reach a final decision. Overall performance was evaluated by calculating the percentage of correct answers from GPT-3.5 and GPT-4. RESULTS Of the 400 questions, 292 were analyzed after the exclusion of chart-based questions, which ChatGPT does not support. The correct response rate for GPT-4 was 81.5% (237/292), significantly higher than that for GPT-3.5 at 42.8% (125/292). Moreover, GPT-4 surpassed the passing standard (>72%) for the NMLE, indicating its potential as a diagnostic and therapeutic decision aid for physicians. CONCLUSIONS GPT-4 reached the passing standard for the NMLE in Japan with questions entered in Japanese, although this was limited to text-based questions. As the accelerated progress of the past few months has shown, the performance of the AI will improve as large language models continue to learn, and it may well become a decision support system for medical professionals by providing more accurate information.
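(Editor's note: a worked version of the scoring arithmetic above, using the correct-answer counts and the >72% passing standard as reported in the abstract:)

```python
# Sketch of the pass/fail arithmetic reported above, using the counts from
# the abstract: 292 text-only questions remained after excluding chart-based
# questions. The passing standard is the >72% line cited in the abstract.
PASSING_STANDARD = 0.72

correct_counts = {"GPT-3.5": 125, "GPT-4": 237}
n_questions = 292

for model, n_correct in correct_counts.items():
    rate = n_correct / n_questions
    verdict = "meets" if rate > PASSING_STANDARD else "falls short of"
    print(f"{model}: {n_correct}/{n_questions} = {rate:.1%}, "
          f"{verdict} the >72% standard")
```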
Affiliation(s)
- Yasutaka Yanagita: Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Daiki Yokokawa: Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Shun Uchida: Department of General Medicine, Chiba University Hospital, Chiba, Japan
- Junsuke Tawara: Department of Internal Medicine, Sanmu Medical Center, Chiba, Japan
- Masatomi Ikusaka: Department of General Medicine, Chiba University Hospital, Chiba, Japan
11. Kaneda Y, Namba M, Kaneda U, Tanimoto T. Artificial Intelligence in Childcare: Assessing the Performance and Acceptance of ChatGPT Responses. Cureus 2023; 15:e44484. PMID: 37791148. PMCID: PMC10544433. DOI: 10.7759/cureus.44484.
Abstract
Purpose This study aimed to evaluate the performance and acceptance of responses generated by ChatGPT-3.5 and GPT-4 to Japanese childcare-related questions, assessing their potential applicability and limitations in the childcare field, with a specific focus on the accuracy, usefulness, and empathy of the generated answers. Methods We evaluated answers in Japanese generated by GPT-3.5 and GPT-4 for two types of childcare-related questions. (1) For the written questions of Japan's childcare worker national examination for fiscal year 2023, we calculated correct answer rates using the official answers. (2) We selected one question from each of the seven categories of child-rearing questions posted on the Japanese National Childcare Workers Association's website and had GPT-3.5 and GPT-4 generate answers, which were evaluated alongside the existing answers written by human childcare professionals. Five childcare workers then blindly selected what they considered the best of the three answers and rated each on a five-point scale for accuracy, usefulness, and empathy. Results On the written examination of 160 questions, both GPT-3.5 and GPT-4 produced responses to all 155 questions (four questions were omitted due to copyright concerns and one was deemed invalid due to inherent flaws in the question itself), with correct answer rates of 30.3% for GPT-3.5 and 47.7% for GPT-4 (p<0.01). For the child-rearing Q&A questions, the human professionals' answers were chosen as the best most frequently (45.7%), followed by GPT-3.5 (31.4%) and GPT-4 (22.9%). While GPT-3.5 received the highest average rating for accuracy (3.69 points), the human professionals' answers received the highest average ratings for usefulness and empathy (both 3.57 points). Conclusions Both GPT-3.5 and GPT-4 failed to meet the passing criteria of Japan's childcare worker national examination, and for the child-rearing questions, GPT-3.5 was rated higher in accuracy despite its lower correct answer rate on the examination. Over half of the childcare workers considered a ChatGPT-generated answer to be the best one, yet concerns about accuracy were observed, highlighting the potential risk of incorrect information in the Japanese context.
Affiliation(s)
- Yudai Kaneda: School of Medicine, Hokkaido University, Sapporo, JPN
- Mira Namba: School of Medicine, Keio University, Tokyo, JPN
- Uiri Kaneda: Faculty of Foreign Languages, Dokkyo University, Soka, JPN
- Tetsuya Tanimoto: Internal Medicine, Jyoban Hospital of Tokiwa Foundation, Iwaki, JPN