1
Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024;24:46-52. PMID: 38162955; PMCID: PMC10755495; DOI: 10.1016/j.csbj.2023.11.058.
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, each question repeated 30 times, yielding a total of 900 responses. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and the consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and verification of information by experts will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
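As a rough illustration of the grading workflow described above (not the authors' code; the grades below are hypothetical placeholders), per-response accuracy and inter-rater agreement on a three-point Likert scale might be computed as follows:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical grades for ten responses on the study's three-point scale:
# 2 = correct, 1 = partially correct/incomplete, 0 = incorrect
rater_a = [2, 2, 1, 0, 2, 1, 2, 2, 0, 2]
rater_b = [2, 2, 1, 1, 2, 1, 2, 2, 0, 2]
final   = [2, 2, 1, 0, 2, 1, 2, 2, 0, 2]  # grades after arbitration

accuracy = final.count(2) / len(final)  # share of responses graded "correct"
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")  # ordinal agreement
print(f"accuracy={accuracy:.1%}, weighted kappa={kappa:.2f}")
```

A linearly weighted kappa is one reasonable choice here because the grades are ordered; an unweighted kappa would treat "partially correct" and "incorrect" disagreements as equally severe.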
Affiliation(s)
- Ana Suárez: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez: Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- María Llorente de Pedro: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez: Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Víctor Díaz-Flores García: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Margarita Gómez Sánchez: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Yolanda Freire: Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
2
Elyoseph Z, Gur T, Haber Y, Simon T, Angert T, Navon Y, Tal A, Asman O. An Ethical Perspective on the Democratization of Mental Health With Generative AI. JMIR Ment Health 2024;11:e58011. PMID: 39417792; DOI: 10.2196/58011.
Abstract
Knowledge has become more open and accessible to a large audience with the "democratization of information" facilitated by technology. This paper provides a sociohistorical perspective for the theme issue "Responsible Design, Integration, and Use of Generative AI in Mental Health." It evaluates ethical considerations in using generative artificial intelligence (GenAI) for the democratization of mental health knowledge and practice. It explores the historical context of democratizing information, transitioning from restricted access to widespread availability due to the internet, open-source movements, and most recently, GenAI technologies such as large language models. The paper highlights why GenAI technologies represent a new phase in the democratization movement, offering unparalleled access to highly advanced technology as well as information. In the realm of mental health, this requires delicate and nuanced ethical deliberation. Including GenAI in mental health may allow, among other things, improved accessibility to mental health care, personalized responses, and conceptual flexibility, and could facilitate a flattening of traditional hierarchies between health care providers and patients. At the same time, it also entails significant risks and challenges that must be carefully addressed. To navigate these complexities, the paper proposes a strategic questionnaire for assessing artificial intelligence-based mental health applications. This tool evaluates both the benefits and the risks, emphasizing the need for a balanced and ethical approach to GenAI integration in mental health. The paper calls for a cautious yet positive approach to GenAI in mental health, advocating for the active engagement of mental health professionals in guiding GenAI development. It emphasizes the importance of ensuring that GenAI advancements are not only technologically sound but also ethically grounded and patient-centered.
Affiliation(s)
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College, Fulham Palace Rd, London, W6 8RF, United Kingdom; Faculty of Education, University of Haifa, Haifa, Israel
- Tamar Gur: The Adelson School of Entrepreneurship, Reichman University, Herzliya, Israel
- Yuval Haber: The PhD Program of Hermeneutics & Cultural Studies, Bar-Ilan University, Ramat Gan, Israel
- Tomer Simon: Microsoft Israel R&D Center, Tel Aviv, Israel
- Tal Angert: Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Yuval Navon: Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Amir Tal: Samueli Initiative for Responsible AI in Medicine, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Oren Asman: Samueli Initiative for Responsible AI in Medicine, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel; Department of Nursing, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
3
Hadar-Shoval D, Asraf K, Shinan-Altman S, Elyoseph Z, Levkovich I. Embedded values-like shape ethical reasoning of large language models on primary care ethical dilemmas. Heliyon 2024;10:e38056. PMID: 39381244; PMCID: PMC11458949; DOI: 10.1016/j.heliyon.2024.e38056.
Abstract
Objective This article uses the framework of Schwartz's values theory to examine whether the values-like profile embedded within large language models (LLMs) impacts ethical decision-making dilemmas faced in primary care. It specifically aims to evaluate whether each LLM exhibits a distinct values-like profile, assess its alignment with general population values, and determine whether latent values influence clinical recommendations. Methods The Portrait Values Questionnaire-Revised (PVQ-RR) was submitted to each LLM (Claude, Bard, GPT-3.5, and GPT-4) 20 times to ensure reliable and valid responses. Their responses were compared to a benchmark derived from an international sample of over 53,000 culturally diverse respondents who completed the PVQ-RR. Four vignettes depicting prototypical professional quandaries involving conflicts between competing values were presented to the LLMs. The option selected by each LLM and the strength of its recommendation were evaluated to determine whether the underlying values-like profile impacts output. Results Each LLM demonstrated a unique values-like profile. Universalism and self-direction were prioritized, while power and tradition were assigned less importance than in population benchmarks, suggesting potential Western-centric biases. Across the four clinical vignettes involving value conflicts, preliminary indications suggested that the embedded values-like profiles influence recommendations. Significant variances in confidence strength regarding chosen recommendations materialized between models, suggesting that further vetting is required before the LLMs can be relied on as judgment aids. However, the overall selection of preferences aligned with intrinsic value hierarchies. Conclusion The distinct intrinsic values-like profiles embedded within LLMs shape ethical decision-making, which carries implications for their integration in primary care settings serving diverse populations. For context-appropriate, equitable delivery of AI-assisted healthcare globally, it is essential that LLMs are tailored to align with cultural outlooks.
Affiliation(s)
- Dorit Hadar-Shoval: The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Israel
- Kfir Asraf: The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Israel
- Shiri Shinan-Altman: The Louis and Gabi Weisfeld School of Social Work, Bar-Ilan University, Ramat Gan, Israel
- Zohar Elyoseph: The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Israel; Department of Brain Sciences, Faculty of Medicine, Imperial College London, England; Department of Counseling and Human Development, Department of Education, University of Haifa, Israel
4
Lian C, Yuan X, Chokkakula S, Wang G, Song B, Wang Z, Fan G, Yin C. Assessing the ChatGPT aptitude: A competent and effective Dermatology doctor? Heliyon 2024;10:e37220. PMID: 39319150; PMCID: PMC11419909; DOI: 10.1016/j.heliyon.2024.e37220.
Abstract
Background The efficacy and adeptness of ChatGPT 3.5 and ChatGPT 4.0 in the precise diagnosis and management of conditions such as atopic dermatitis (AD) and autoimmune blistering skin diseases (AIBD) remain to be elucidated. This study therefore examined the accuracy and effectiveness of ChatGPT responses related to the understanding, therapies, and specific cases of these two conditions. Method First, the responses provided by the ChatGPTs to a set of 50 questionnaires were evaluated by five distinct dermatologists, with adjudication by a third-party reviewer. The comparative analysis included the evaluative efficacy of both ChatGPT 3.5 and ChatGPT 4.0 against the diagnostic abilities exhibited by three distinct cohorts of qualified clinical professionals. An examination was then conducted to assess the diagnostic proficiency of ChatGPT 3.5 and ChatGPT 4.0 in diagnosing specific instances of autoimmune blistering skin diseases. Results In assessing the proficiency of the ChatGPTs in generating responses related to fundamental knowledge about AD, it is noteworthy that both versions, despite their lack of specialized training on medical databases, exhibited a commendable capacity to yield solutions showing a substantial degree of concurrence with evidence-based medical information. We observed that the performance of ChatGPT-4.0 exceeded that of ChatGPT-3.5. However, it is crucial to emphasize that ChatGPT-4.0 did not show the ability to offer answers surpassing those provided by associate senior and senior medical professionals. In the assessment designed to determine the proficiency of the ChatGPTs in recognizing particular types of AIBD, both ChatGPT-4 and ChatGPT-3.5 demonstrated inadequacy in providing responses that were both precise and accurate for each individual occurrence of this skin condition. Conclusion Both ChatGPT-3.5 and ChatGPT-4.0 are satisfactory for addressing fundamental inquiries related to atopic dermatitis; however, they prove insufficient for diagnosing AIBD. ChatGPT still has a considerable journey ahead before achieving utility within the professional medical domain.
Affiliation(s)
- Chengxiang Lian: Department of Dermatology and Venereology, The First Affiliated Hospital of Guangxi Medical University, Nanning, 530021, China
- Xin Yuan: Department of Dermatology, GuiZhou Provincial People's Hospital, Guiyang, 550000, China
- Santosh Chokkakula: Department of Microbiology, Chungbuk National University College of Medicine and Medical Research Institute, Cheongju, Chungbuk, 28644, South Korea
- Guanqing Wang: Department of Dermatology, Shanghai General Hospital (South), Shanghai Jiao Tong University, No. 650, New Songjiang Road, Shanghai, 200000, China
- Biao Song: Zhihui Big Data Research Institute of Inner Mongolia, Inner Mongolia, 010020, China; Collaborative Innovation Center of Big Data Application Research of Inner Mongolia University of Finance and Economics, Inner Mongolia, 010020, China
- Zhe Wang: Zhihui Big Data Research Institute of Inner Mongolia, Inner Mongolia, 010020, China
- Ge Fan: Lightspeed & Quantum Studios, Tencent Inc., Shenzhen, 693388, China
- Chengliang Yin: Faculty of Medicine, Macau University of Science and Technology, Macau, 999078, China
5
Du X, Zhou Z, Wang Y, Chuang YW, Yang R, Zhang W, Wang X, Zhang R, Hong P, Bates DW, Zhou L. Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review. medRxiv [preprint] 2024:2024.08.11.24311828. PMID: 39228726; PMCID: PMC11370524; DOI: 10.1101/2024.08.11.24311828.
Abstract
Background Generative large language models (LLMs) represent a significant advancement in natural language processing, achieving state-of-the-art performance across various tasks. However, their application in clinical settings using real electronic health records (EHRs) is still rare and presents numerous challenges. Objective This study aims to systematically review the use of generative LLMs, and the effectiveness of relevant techniques, in patient care-related topics involving EHRs, summarize the challenges faced, and suggest future directions. Methods A Boolean search for peer-reviewed articles was conducted on May 19, 2024 using PubMed and Web of Science to include research articles published since 2023, which was one month after the release of ChatGPT. The search results were deduplicated. Multiple reviewers, including biomedical informaticians, computer scientists, and a physician, screened the publications for eligibility and conducted data extraction. Only studies utilizing generative LLMs to analyze real EHR data were included. We summarized the use of prompt engineering, fine-tuning, multimodal EHR data, and evaluation metrics. Additionally, we identified current challenges in applying LLMs in clinical settings as reported by the included studies and proposed future directions. Results The initial search identified 6,328 unique studies, with 76 studies included after eligibility screening. Of these, 67 studies (88.2%) employed zero-shot prompting; five of these reported 100% accuracy on five specific clinical tasks. Nine studies used advanced prompting strategies; four tested these strategies experimentally, finding that prompt engineering improved performance, with one study noting a non-linear relationship between the number of examples in a prompt and performance improvement. Eight studies explored fine-tuning generative LLMs; all reported performance improvements on specific tasks, but three of them noted potential performance degradation after fine-tuning on certain tasks. Only two studies utilized multimodal data, which improved LLM-based decision-making and enabled accurate rare disease diagnosis and prognosis. The studies employed 55 different evaluation metrics for 22 purposes, such as correctness, completeness, and conciseness. Two studies investigated LLM bias, with one detecting no bias and the other finding that male patients received more appropriate clinical decision-making suggestions. Six studies identified hallucinations, such as fabricating patient names in structured thyroid ultrasound reports. Additional challenges included but were not limited to the impersonal tone of LLM consultations, which made patients uncomfortable, and the difficulty patients had in understanding LLM responses. Conclusion Our review indicates that few studies have employed advanced computational techniques to enhance LLM performance. The diverse evaluation metrics used highlight the need for standardization. LLMs currently cannot replace physicians due to challenges such as bias, hallucinations, and impersonal responses.
Affiliation(s)
- Xinsong Du: Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115; Department of Medicine, Harvard Medical School, Boston, MA 02115; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115
- Zhengyang Zhou: Department of Computer Science, Brandeis University, Waltham, MA 02453
- Yifei Wang: Department of Computer Science, Brandeis University, Waltham, MA 02453
- Ya-Wen Chuang: Division of Nephrology, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan 407219; Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan 402202; School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan 404328
- Richard Yang: Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115; Department of Medicine, Harvard Medical School, Boston, MA 02115
- Wenyu Zhang: Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115; Department of Medicine, Harvard Medical School, Boston, MA 02115; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115
- Xinyi Wang: Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115; Department of Medicine, Harvard Medical School, Boston, MA 02115; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115
- Rui Zhang: Division of Computational Health Sciences, University of Minnesota, Minneapolis, MN 55455
- Pengyu Hong: Department of Computer Science, Brandeis University, Waltham, MA 02453
- David W. Bates: Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115; Department of Medicine, Harvard Medical School, Boston, MA 02115; Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Boston, MA 02115
- Li Zhou: Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA 02115; Department of Medicine, Harvard Medical School, Boston, MA 02115; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115
6
Bhugra D, Liebrenz M, Ventriglio A, Ng R, Javed A, Kar A, Chumakov E, Moura H, Tolentino E, Gupta S, Ruiz R, Okasha T, Chisolm MS, Castaldelli-Maia J, Torales J, Smith A. World Psychiatric Association-Asian Journal of Psychiatry Commission on Public Mental Health. Asian J Psychiatr 2024;98:104105. PMID: 38861790; DOI: 10.1016/j.ajp.2024.104105.
Abstract
Although there is considerable evidence showing that the prevention of mental illnesses and adverse outcomes and mental health promotion can help people lead better and more functional lives, public mental health remains overlooked in the broader contexts of psychiatry and public health. Likewise, in undergraduate and postgraduate medical curricula, prevention and mental health promotion have often been ignored. However, there has been a recent increase in interest in public mental health, including an emphasis on the prevention of psychiatric disorders and improving individual and community wellbeing to support life trajectories, from childhood through to adulthood and into older age. These lifespan approaches have significant potential to reduce the onset of mental illnesses and the related burdens for the individual and communities, as well as mitigating social, economic, and political costs. Informed by principles of social justice and respect for human rights, this may be especially important for addressing salient problems in communities with distinct vulnerabilities, where prominent disadvantages and barriers for care delivery exist. Therefore, this Commission aims to address these topics, providing a narrative overview of relevant literature and suggesting ways forward. Additionally, proposals for improving mental health and preventing mental illnesses and adverse outcomes are presented, particularly amongst at-risk populations.
Affiliation(s)
- Dinesh Bhugra: Institute of Psychiatry, Psychology and Neuroscience, King's College London, London SE5 8AF, United Kingdom
- Michael Liebrenz: Department of Forensic Psychiatry, University of Bern, Bern, Switzerland
- Roger Ng: World Psychiatric Association, Geneva, Switzerland
- Anindya Kar: Advanced Neuropsychiatry Institute, Kolkata, India
- Egor Chumakov: Department of Psychiatry & Addiction, St Petersburg State University, St Petersburg, Russia
- Susham Gupta: East London NHS Foundation Trust, London, United Kingdom
- Roxanna Ruiz: Universidad Francisco Marroquín, Guatemala City, Guatemala
- Alexander Smith: Department of Forensic Psychiatry, University of Bern, Bern, Switzerland
7
Banerjee S, Dunn P, Conard S, Ali A. Mental Health Applications of Generative AI and Large Language Modeling in the United States. Int J Environ Res Public Health 2024;21:910. PMID: 39063487; PMCID: PMC11276907; DOI: 10.3390/ijerph21070910.
Abstract
(1) Background: Artificial intelligence (AI) has flourished in recent years. More specifically, generative AI has had broad applications in many disciplines. While mental illness is on the rise, AI has proven valuable in aiding the diagnosis and treatment of mental disorders. However, there is little to no research about precisely how much interest there is in AI technology. (2) Methods: We performed a Google Trends search for "AI and mental health" and compared relative search volume (RSV) indices of "AI," "AI and depression," and "AI and anxiety." This time series study employed Box-Jenkins time series modeling to forecast long-term interest through the end of 2024. (3) Results: Within the United States, AI interest steadily increased throughout 2023, with some anomalies due to media reporting. Through predictive models, we found that this trend is predicted to increase 114% through the end of the year 2024, with public interest in AI applications being on the rise. (4) Conclusions: We found that awareness of AI increased drastically throughout 2023, especially in mental health. This demonstrates increasing public awareness of mental health and AI, making advocacy and education about AI technology of paramount importance.
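As a rough sketch of the Box-Jenkins workflow described above (the weekly RSV values and the ARIMA(1,1,1) order are illustrative assumptions, not the study's fitted model):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical weekly relative-search-volume (RSV) values for 2023
idx = pd.date_range("2023-01-01", periods=52, freq="W")
rsv = pd.Series(range(10, 62), index=idx, dtype=float)

fit = ARIMA(rsv, order=(1, 1, 1)).fit()  # Box-Jenkins: AR(1), one difference, MA(1)
forecast = fit.forecast(steps=52)        # project one further year
growth = (forecast.iloc[-1] - rsv.iloc[-1]) / rsv.iloc[-1]
print(f"projected change through the forecast horizon: {growth:.0%}")
```

In practice the (p, d, q) order would be chosen by inspecting autocorrelation plots or an information criterion rather than assumed.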
Affiliation(s)
- Sri Banerjee: School of Health Sciences and Public Policy, Walden University, Minneapolis, MN 55401, USA
- Pat Dunn: Center for Health Technology & Innovation, American Heart Association, Dallas, TX 75231, USA
- Asif Ali: McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
8
Yilmaz Muluk S, Olcucu N. The Role of Artificial Intelligence in the Primary Prevention of Common Musculoskeletal Diseases. Cureus 2024;16:e65372. PMID: 39184635; PMCID: PMC11344583; DOI: 10.7759/cureus.65372.
Abstract
BACKGROUND Musculoskeletal disorders (MSDs) are a leading cause of disability worldwide, with a growing burden across all demographics. With advancements in technology, conversational artificial intelligence (AI) platforms such as ChatGPT (OpenAI, San Francisco, CA) have become instrumental in disseminating health information. This study evaluated the effectiveness of ChatGPT versions 3.5 and 4 in delivering primary prevention information for common MSDs, emphasizing that the study is focused on prevention and not on diagnosis. METHODS This mixed-methods study employed the CLEAR tool to assess the quality of responses from ChatGPT versions in terms of completeness, lack of false information, evidence support, appropriateness, and relevance. Responses were evaluated independently by two expert raters in a blinded manner. Statistical analyses included Wilcoxon signed-rank tests and paired samples t-tests to compare performance across versions. RESULTS ChatGPT-3.5 and ChatGPT-4 effectively provided primary prevention information, with overall performance ranging from satisfactory to excellent. Responses for low back pain, fractures, knee osteoarthritis, neck pain, and gout received excellent scores from both versions. Additionally, ChatGPT-4 was better than ChatGPT-3.5 in terms of completeness (p = 0.015), appropriateness (p = 0.007), and relevance (p = 0.036), and ChatGPT-4 performed better across most medical conditions (p = 0.010). CONCLUSIONS ChatGPT versions 3.5 and 4 are effective tools for disseminating primary prevention information for common MSDs, with ChatGPT-4 showing superior performance. This study underscores the potential of AI in enhancing public health strategies through reliable and accessible health communication. Advanced models such as ChatGPT-4 can effectively contribute to the primary prevention of MSDs by delivering high-quality health information, highlighting the role of AI in addressing the global burden of chronic diseases. It is important to note that these AI tools are intended for preventive education purposes only and not for diagnostic use. Continuous improvements are necessary to fully harness the potential of AI in preventive medicine. Future studies should explore other AI platforms, languages, and secondary and tertiary prevention measures to maximize the utility of AI in global health contexts.
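A minimal sketch of the paired comparison reported above, using hypothetical CLEAR completeness scores rather than the study's data:

```python
from scipy.stats import wilcoxon

# Hypothetical paired CLEAR "completeness" scores (1-5) for the same
# set of conditions, one rating per ChatGPT version
gpt35 = [4, 3, 4, 4, 3, 4, 3, 4]
gpt4  = [5, 4, 5, 5, 4, 5, 4, 5]

stat, p = wilcoxon(gpt35, gpt4)  # paired, nonparametric comparison
print(f"Wilcoxon statistic={stat}, p={p:.3f}")
```

The signed-rank test fits this design because the same conditions are scored under both versions and the ordinal Likert-style ratings need not be normally distributed.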
Affiliation(s)
- Nazli Olcucu: Physical Medicine and Rehabilitation, Antalya Atatürk State Hospital, Antalya, Turkey
9
Omar M, Soffer S, Charney AW, Landi I, Nadkarni GN, Klang E. Applications of large language models in psychiatry: a systematic review. Front Psychiatry 2024;15:1422807. PMID: 38979501; PMCID: PMC11228775; DOI: 10.3389/fpsyt.2024.1422807.
Abstract
Background With their unmatched ability to interpret and engage with human language and context, large language models (LLMs) hint at the potential to bridge AI and human cognitive processes. This review explores the current application of LLMs, such as ChatGPT, in the field of psychiatry. Methods We followed PRISMA guidelines and searched through PubMed, Embase, Web of Science, and Scopus, up until March 2024. Results From 771 retrieved articles, we included 16 that directly examine LLMs' use in psychiatry. LLMs, particularly ChatGPT and GPT-4, showed diverse applications in clinical reasoning, social media, and education within psychiatry. They can assist in diagnosing mental health issues, managing depression, evaluating suicide risk, and supporting education in the field. However, our review also points out their limitations, such as difficulties with complex cases and potential underestimation of suicide risks. Conclusion Early research in psychiatry reveals LLMs' versatile applications, from diagnostic support to educational roles. Given the rapid pace of advancement, future investigations are poised to explore the extent to which these models might redefine traditional roles in mental health care.
Affiliation(s)
- Mahmud Omar: Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
- Shelly Soffer: Internal Medicine B, Assuta Medical Center, Ashdod, Israel; Ben-Gurion University of the Negev, Be'er Sheva, Israel
- Isotta Landi: Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Girish N Nadkarni: Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, United States
- Eyal Klang: Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, United States
10
Suwała S, Szulc P, Guzowski C, Kamińska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland's medical final examination-Is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med 2024;12:20503121241257777. PMID: 38895543; PMCID: PMC11185017; DOI: 10.1177/20503121241257777.
Abstract
Objectives ChatGPT is an advanced chatbot based on a large language model that has the ability to answer questions. Undoubtedly, ChatGPT is capable of transforming communication, education, and customer support; however, can it play the role of a doctor? In Poland, prior to obtaining a medical diploma, candidates must successfully pass the Medical Final Examination. Methods The purpose of this research was to determine how well ChatGPT performed on the Polish Medical Final Examination, which must be passed to become a doctor in Poland (an exam is considered passed if at least 56% of the tasks are answered correctly). A total of 2138 categorized Medical Final Examination questions (from 11 examination sessions held between 2013-2015 and 2021-2023) were presented to ChatGPT-3.5 from 19 to 26 May 2023. For further analysis, the questions were divided into quintiles based on difficulty and duration, as well as question types (simple A-type or complex K-type). The answers provided by ChatGPT were compared to the official answer key, reviewed for any changes resulting from the advancement of medical knowledge. Results ChatGPT correctly answered 53.4%-64.9% of questions. In 8 out of 11 exam sessions, ChatGPT achieved the scores required to successfully pass the examination (60%). The correlation between the efficacy of artificial intelligence and the complexity, difficulty, and length of a question was negative. AI outperformed humans in one category: psychiatry (77.18% vs. 70.25%, p = 0.081). Conclusions The performance of artificial intelligence is deemed satisfactory; however, it is markedly inferior to that of human graduates in the majority of instances. Despite its potential utility in many medical areas, ChatGPT is constrained by inherent limitations that prevent it from entirely supplanting human expertise and knowledge.
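The pass criterion itself is simple arithmetic; a minimal sketch with hypothetical counts for one session:

```python
# Hypothetical counts for a single examination session
questions_total = 196
answered_correctly = 120

score = answered_correctly / questions_total
print(f"score={score:.1%}, passed={score >= 0.56}")  # abstract states a 56% pass mark
```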
Affiliation(s)
- Szymon Suwała: Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
- Paulina Szulc, Cezary Guzowski, Barbara Kamińska, Jakub Dorobiała, Karolina Wojciechowska, Maria Berska, Olga Kubicka, Oliwia Kosturkiewicz, Bernadetta Kosztulska, Alicja Rajewska: Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
- Roman Junik: Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
11
Shinan-Altman S, Elyoseph Z, Levkovich I. The impact of history of depression and access to weapons on suicide risk assessment: a comparison of ChatGPT-3.5 and ChatGPT-4. PeerJ 2024;12:e17468. PMID: 38827287; PMCID: PMC11143969; DOI: 10.7717/peerj.17468.
Abstract
The aim of this study was to evaluate the effectiveness of ChatGPT-3.5 and ChatGPT-4 in incorporating critical risk factors, namely history of depression and access to weapons, into suicide risk assessments. Both models assessed suicide risk using scenarios that featured individuals with and without a history of depression and access to weapons. The models estimated the likelihood of suicidal thoughts, suicide attempts, serious suicide attempts, and suicide-related mortality on a Likert scale. A multivariate three-way ANOVA with Bonferroni post hoc tests was conducted to examine the impact of the aforementioned independent factors (history of depression and access to weapons) on these outcome variables. Both models identified history of depression as a significant suicide risk factor. ChatGPT-4 demonstrated a more nuanced understanding of the relationship between depression, access to weapons, and suicide risk. In contrast, ChatGPT-3.5 displayed limited insight into this complex relationship. ChatGPT-4 consistently assigned higher severity ratings to suicide-related variables than did ChatGPT-3.5. The study highlights the potential of these two models, particularly ChatGPT-4, to enhance suicide risk assessment by considering complex risk factors.
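A minimal sketch of a three-way factorial ANOVA of this kind, using hypothetical ratings and stand-in factor levels rather than the study's data:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical Likert-style risk ratings in a balanced 2x2x2 design
df = pd.DataFrame({
    "risk": [3, 4, 5, 6, 2, 3, 5, 7, 3, 5, 4, 6, 2, 4, 5, 6],
    "model": (["gpt35"] * 4 + ["gpt4"] * 4) * 2,
    "depression": ["no", "no", "yes", "yes"] * 4,
    "weapons": ["no", "yes"] * 8,
})

fit = ols("risk ~ C(model) * C(depression) * C(weapons)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))  # three-way ANOVA table with all interactions
```

Bonferroni-adjusted pairwise follow-ups would then be run only on factors the table flags as significant.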
Affiliation(s)
| | - Zohar Elyoseph
- Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, England, United Kingdom
- The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Emek Yezreel, Israel
| | - Inbar Levkovich
- Faculty of Graduate Studies, Oranim Academic College of Education, Kiryat Tiv’on, Israel
| |
12
Haber Y, Levkovich I, Hadar-Shoval D, Elyoseph Z. The Artificial Third: A Broad View of the Effects of Introducing Generative Artificial Intelligence on Psychotherapy. JMIR Ment Health 2024;11:e54781. PMID: 38787297; PMCID: PMC11137430; DOI: 10.2196/54781.
Abstract
Unlabelled This paper explores a significant shift in the field of mental health in general and psychotherapy in particular following generative artificial intelligence's new capabilities in processing and generating humanlike language. Following Freud, this lingo-technological development is conceptualized as the "fourth narcissistic blow" that science inflicts on humanity. We argue that this narcissistic blow has a potentially dramatic influence on perceptions of human society, interrelationships, and the self. We should, accordingly, expect dramatic changes in perceptions of the therapeutic act following the emergence of what we term the artificial third in the field of psychotherapy. The introduction of an artificial third marks a critical juncture, prompting us to ask the following important core questions that address two basic elements of critical thinking, namely, transparency and autonomy: (1) What is this new artificial presence in therapy relationships? (2) How does it reshape our perception of ourselves and our interpersonal dynamics? and (3) What remains of the irreplaceable human elements at the core of therapy? Given the ethical implications that arise from these questions, this paper proposes that the artificial third can be a valuable asset when applied with insight and ethical consideration, enhancing but not replacing the human touch in therapy.
Affiliation(s)
- Yuval Haber: The PhD Program of Hermeneutics and Cultural Studies, Interdisciplinary Studies Unit, Bar-Ilan University, Ramat Gan, Israel
- Dorit Hadar-Shoval: Department of Psychology and Educational Counseling, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; The Center for Psychobiological Research, Department of Psychology and Educational Counseling, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
13
Hadar-Shoval D, Asraf K, Mizrachi Y, Haber Y, Elyoseph Z. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values. JMIR Ment Health 2024;11:e55988. PMID: 38593424; DOI: 10.2196/55988.
Abstract
BACKGROUND Large language models (LLMs) hold potential for mental health applications. However, their opaque alignment processes may embed biases that shape problematic perspectives. Evaluating the values embedded within LLMs that guide their decision-making has ethical importance. Schwartz's theory of basic values (STBV) provides a framework for quantifying cultural value orientations and has shown utility for examining values in mental health contexts, including cultural, diagnostic, and therapist-client dynamics. OBJECTIVE This study aimed to (1) evaluate whether the STBV can measure value-like constructs within leading LLMs and (2) determine whether LLMs exhibit distinct value-like patterns from humans and each other. METHODS In total, 4 LLMs (Bard, Claude 2, Generative Pretrained Transformer [GPT]-3.5, GPT-4) were anthropomorphized and instructed to complete the Portrait Values Questionnaire-Revised (PVQ-RR) to assess value-like constructs. Their responses over 10 trials were analyzed for reliability and validity. To benchmark the LLMs' value profiles, their results were compared to published data from a diverse sample of 53,472 individuals across 49 nations who had completed the PVQ-RR. This allowed us to assess whether the LLMs diverged from established human value patterns across cultural groups. Value profiles were also compared between models via statistical tests. RESULTS The PVQ-RR showed good reliability and validity for quantifying value-like infrastructure within the LLMs. However, substantial divergence emerged between the LLMs' value profiles and population data. The models lacked consensus and exhibited distinct motivational biases, reflecting opaque alignment processes. For example, all models prioritized universalism and self-direction, while de-emphasizing achievement, power, and security relative to humans. Successful discriminant analysis differentiated the 4 LLMs' distinct value profiles. Further examination found the biased value profiles strongly predicted the LLMs' responses when presented with mental health dilemmas requiring choosing between opposing values. This provided further validation for the models embedding distinct motivational value-like constructs that shape their decision-making. CONCLUSIONS This study leveraged the STBV to map the motivational value-like infrastructure underpinning leading LLMs. Although the study demonstrated the STBV can effectively characterize value-like infrastructure within LLMs, substantial divergence from human values raises ethical concerns about aligning these models with mental health applications. The biases toward certain cultural value sets pose risks if integrated without proper safeguards. For example, prioritizing universalism could promote unconditional acceptance even when clinically unwise. Furthermore, the differences between the LLMs underscore the need to standardize alignment processes to capture true cultural diversity. Thus, any responsible integration of LLMs into mental health care must account for their embedded biases and motivation mismatches to ensure equitable delivery across diverse populations. Achieving this will require transparency and refinement of alignment techniques to instill comprehensive human values.
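The discriminant analysis step could look like the following sketch, where the per-trial value scores are randomly generated stand-ins rather than the study's data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
models = ["Bard", "Claude 2", "GPT-3.5", "GPT-4"]

# Hypothetical scores on four Schwartz value dimensions, 10 trials per model,
# with each model centered at a slightly different profile
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(10, 4)) for i in range(4)])
y = np.repeat(models, 10)

lda = LinearDiscriminantAnalysis().fit(X, y)
print("within-sample classification accuracy:", lda.score(X, y))
```

High classification accuracy here is what "successful discriminant analysis differentiated the 4 LLMs' distinct value profiles" would amount to in practice: the profiles are separable in value space.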
Affiliation(s)
- Dorit Hadar-Shoval: The Psychology Department, Max Stern Yezreel Valley College, Tel Adashim, Israel
- Kfir Asraf: The Psychology Department, Max Stern Yezreel Valley College, Tel Adashim, Israel
- Yonathan Mizrachi: The Jane Goodall Institute, Max Stern Yezreel Valley College, Tel Adashim, Israel; The Laboratory for AI, Machine Learning, Business & Data Analytics, Tel-Aviv University, Tel Aviv, Israel
- Yuval Haber: The PhD Program of Hermeneutics and Cultural Studies, Interdisciplinary Studies Unit, Bar-Ilan University, Ramat Gan, Israel
- Zohar Elyoseph: The Psychology Department, Center for Psychobiological Research, Max Stern Yezreel Valley College, Tel Adashim, Israel; Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
14
Moise A, Centomo-Bozzo A, Orishchak O, Alnoury MK, Daniel SJ. Can ChatGPT Replace an Otolaryngologist in Guiding Parents on Tonsillectomy? Ear Nose Throat J 2024:1455613241230841. PMID: 38563440; DOI: 10.1177/01455613241230841.
Abstract
Background: ChatGPT is an artificial intelligence tool that utilizes machine learning to analyze and generate human-like text. The user-friendly accessibility of this tool enables patients to conveniently access medical information without the challenge of intricate terminology. The objective of this study was to assess the accuracy of ChatGPT in providing insights into the indications and management of complications after tonsillectomy, a common pediatric otolaryngology procedure. Methods: The responses generated by ChatGPT were compared to the "Clinical practice guidelines: tonsillectomy in children-executive summary" developed by the American Academy of Otolaryngology-Head and Neck Surgery Foundation (AAO-HNSF). An assessment was carried out by presenting predetermined questions regarding indications and complications after tonsillectomy to ChatGPT, followed by a comparison of its responses with the established guideline by 2 otolaryngology experts. The responses of both parties were reviewed by the senior author. Results: A total of 16 responses generated by ChatGPT were assessed. After a comprehensive review, it was concluded that 15 out of 16 (93.8%) responses demonstrated a high degree of reliability and accuracy, closely adhering to the standard established by the AAO-HNSF guideline. Conclusion: The results validate the potential of using ChatGPT to enhance healthcare delivery by making guidelines more accessible to patients, while also emphasizing the importance of ensuring the provision of accurate and reliable medical advice to patients.
Affiliation(s)
- Alexander Moise: Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Adam Centomo-Bozzo: Faculty of Dental Medicine and Oral Health Sciences, McGill University, Montreal, QC, Canada
- Ostap Orishchak: Department of Pediatric Otolaryngology, Montreal Children's Hospital, Montreal, QC, Canada
- Mohammed K Alnoury: Department of Otolaryngology-Head and Neck Surgery, King Abdulaziz University, Jeddah, Saudi Arabia
- Sam J Daniel: Department of Pediatric Otolaryngology, Montreal Children's Hospital, Montreal, QC, Canada
15
Elyoseph Z, Levkovich I. Comparing the Perspectives of Generative AI, Mental Health Experts, and the General Public on Schizophrenia Recovery: Case Vignette Study. JMIR Ment Health 2024;11:e53043. PMID: 38533615; PMCID: PMC11004608; DOI: 10.2196/53043.
Abstract
Background The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by therapist trust in patient recovery potential and the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms where the possibility of recovery is a matter of debate. As artificial intelligence (AI) becomes integrated into the health care field, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia. Objective This study aimed to evaluate the ability of large language models (LLMs), in comparison to mental health professionals, to assess the prognosis of schizophrenia with and without professional treatment and the long-term positive and negative outcomes. Methods Vignettes were input into the LLM interfaces and assessed 10 times by 4 AI platforms: ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude. A total of 80 evaluations were collected and benchmarked against existing norms to analyze what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and the positive and negative long-term outcomes of schizophrenia interventions. Results For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs predicted that untreated schizophrenia would remain static or worsen. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4. Conclusions The finding that 3 out of the 4 LLMs aligned closely with the predictions of mental health professionals when considering the "with treatment" condition is a demonstration of the potential of this technology in providing professional clinical prognosis. The pessimistic assessment of ChatGPT-3.5 is a disturbing finding since it may reduce the motivation of patients to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise in augmenting health care, their application necessitates rigorous validation and a harmonious blend with human expertise.
Affiliation(s)
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Inbar Levkovich: Faculty of Graduate Studies, Oranim Academic College, Kiryat Tiv'on, Israel
16
Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact J Med Res 2024;13:e54704. PMID: 38276872; PMCID: PMC10905357; DOI: 10.2196/54704.
Abstract
BACKGROUND Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. OBJECTIVE This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. METHODS A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify the common pertinent themes and the possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used to evaluate interrater reliability. RESULTS The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). Interrater reliability was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). Classified per item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory). CONCLUSIONS The METRICS checklist can facilitate the design of studies, guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a preliminary helpful base to establish a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.
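Cohen κ for two raters can be computed from first principles; the sketch below uses hypothetical per-item judgments to make the observed-versus-expected agreement explicit:

```python
from collections import Counter

# Hypothetical item-level judgments from two independent raters
rater1 = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]

n = len(rater1)
p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # raw agreement
counts1, counts2 = Counter(rater1), Counter(rater2)
p_expected = sum(counts1[k] * counts2[k] for k in counts1) / n ** 2  # chance agreement
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"Cohen kappa = {kappa:.3f}")
```

Kappa discounts the agreement two raters would reach by chance alone, which is why it is preferred over raw percent agreement for checklist scoring.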
Affiliation(s)
- Malik Sallam: Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan; Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan; Department of Translational Medicine, Faculty of Medicine, Lund University, Malmo, Sweden
- Muna Barakat: Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
- Mohammed Sallam: Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
17
Elyoseph Z, Refoua E, Asraf K, Lvovsky M, Shimoni Y, Hadar-Shoval D. Capacity of Generative AI to Interpret Human Emotions From Visual and Textual Data: Pilot Evaluation Study. JMIR Ment Health 2024;11:e54369. PMID: 38319707; PMCID: PMC10879976; DOI: 10.2196/54369.
Abstract
BACKGROUND Mentalization, which is integral to human cognitive processes, pertains to the interpretation of one's own and others' mental states, including emotions, beliefs, and intentions. With the advent of artificial intelligence (AI) and the prominence of large language models in mental health applications, questions persist about their aptitude in emotional comprehension. The prior iteration of the large language model from OpenAI, ChatGPT-3.5, demonstrated an advanced capacity to interpret emotions from textual data, surpassing human benchmarks. Given the introduction of ChatGPT-4, with its enhanced visual processing capabilities, and considering Google Bard's existing visual functionalities, a rigorous assessment of their proficiency in visual mentalizing is warranted. OBJECTIVE The aim of the research was to critically evaluate the capabilities of ChatGPT-4 and Google Bard with regard to their competence in discerning visual mentalizing indicators as contrasted with their textual-based mentalizing abilities. METHODS The Reading the Mind in the Eyes Test developed by Baron-Cohen and colleagues was used to assess the models' proficiency in interpreting visual emotional indicators. Simultaneously, the Levels of Emotional Awareness Scale was used to evaluate the large language models' aptitude in textual mentalizing. Collating data from both tests provided a holistic view of the mentalizing capabilities of ChatGPT-4 and Bard. RESULTS ChatGPT-4, displaying a pronounced ability in emotion recognition, secured scores of 26 and 27 in 2 distinct evaluations, significantly deviating from a random response paradigm (P<.001). These scores align with established benchmarks from the broader human demographic. Notably, ChatGPT-4 exhibited consistent responses, with no discernible biases pertaining to the sex of the model or the nature of the emotion. In contrast, Google Bard's performance aligned with random response patterns, securing scores of 10 and 12 and rendering further detailed analysis redundant. In the domain of textual analysis, both ChatGPT and Bard surpassed established benchmarks from the general population, with their performances being remarkably congruent. CONCLUSIONS ChatGPT-4 proved its efficacy in the domain of visual mentalizing, aligning closely with human performance standards. Although both models displayed commendable acumen in textual emotion interpretation, Bard's capabilities in visual emotion interpretation necessitate further scrutiny and potential refinement. This study stresses the criticality of ethical AI development for emotional recognition, highlighting the need for inclusive data, collaboration with patients and mental health experts, and stringent governmental oversight to ensure transparency and protect patient privacy.
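The reported deviation from random responding can be checked with a binomial test; the sketch below assumes the standard 36-item, four-option RMET format (an assumption, as the abstract does not state the item count):

```python
from scipy.stats import binomtest

# 26 correct out of 36 items, chance level 1/4 with four answer options
result = binomtest(k=26, n=36, p=0.25, alternative="greater")
print(f"p-value versus random responding: {result.pvalue:.2e}")
```

Under these assumptions the p-value is far below .001, consistent with the significance level reported for ChatGPT-4's scores.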
Affiliation(s)
- Zohar Elyoseph
- Department of Educational Psychology, The Center for Psychobiological Research, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Imperial College London, London, United Kingdom
- Elad Refoua
- Department of Psychology, Bar-Ilan University, Ramat Gan, Israel
- Kfir Asraf
- Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Maya Lvovsky
- Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Yoav Shimoni
- Boston Children's Hospital, Boston, MA, United States
- Dorit Hadar-Shoval
- Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
18
Reis F, Lenz C. Performance of Artificial Intelligence (AI)-Powered Chatbots in the Assessment of Medical Case Reports: Qualitative Insights From Simulated Scenarios. Cureus 2024;16:e53899. [PMID: 38465163] [PMCID: PMC10925004] [DOI: 10.7759/cureus.53899]
Abstract
Introduction With the expanding awareness and use of AI-powered chatbots, it seems possible that an increasing number of people could use them to assess and evaluate their medical symptoms. If chatbots that have not previously undergone a thorough medical evaluation for this specific use are employed for this purpose, various risks might arise. The aim of this study is to analyze and compare the performance of popular chatbots in differentiating between severe and less critical medical symptoms described from a patient's perspective, and to examine the variations in substantive medical assessment accuracy and empathetic communication style among the chatbots' responses.

Materials and methods Our study compared three different AI-supported chatbots: OpenAI's ChatGPT 3.5, Microsoft's Bing Chat, and Inflection's Pi AI. Three exemplary case reports for medical emergencies, as well as three cases without an urgent reason for emergency medical admission, were constructed and analyzed. Each case report was accompanied by identical questions concerning the most likely suspected diagnosis and the urgency of an immediate medical evaluation. The chatbots' answers were qualitatively compared with each other regarding the medical accuracy of the differential diagnoses mentioned and the conclusions drawn, as well as regarding patient-oriented and empathetic language.

Results All examined chatbots were capable of providing medically plausible and probable diagnoses and of classifying situations as acute or less critical. However, their responses varied slightly in their urgency assessments. Clear differences emerged in the level of detail of the differential diagnoses, the overall length of the answers, and how each chatbot dealt with the challenge of being confronted with medical issues. All answers were comparable in terms of empathy and comprehensibility.

Conclusion Even AI chatbots that are not designed for medical applications already offer substantial guidance in assessing typical medical emergency indications, but their answers should always be accompanied by a disclaimer. In responding to medical queries, characteristic differences emerge among chatbots in the extent and style of their answers. Given the lack of medical supervision of many established chatbots, subsequent studies and experience are essential to clarify whether more extensive use of these chatbots for medical concerns will have a positive impact on healthcare or rather pose major medical risks.
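To make the study's setup concrete, a harness of the kind described, identical vignettes plus fixed questions sent to a chatbot with replies collected for qualitative comparison, could look roughly like the sketch below. It is a minimal sketch under stated assumptions: OpenAI's Python client stands in as the one backend with a public API (Bing Chat and Pi AI are typically accessed through their chat interfaces and are not covered here), and the vignette text is a hypothetical placeholder, not one of the study's case reports.

```python
# Minimal vignette-to-chatbot harness sketch (assumptions: openai>=1.0
# installed and OPENAI_API_KEY set; the vignette below is a placeholder,
# not one of the study's constructed case reports).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTIONS = (
    "What is the most likely suspected diagnosis, and how urgent is an "
    "immediate medical evaluation?"
)

def ask_chatbot(vignette: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one patient-perspective case report plus the fixed questions."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{vignette}\n\n{QUESTIONS}"}],
    )
    return response.choices[0].message.content

# Hypothetical emergency-type vignette, phrased from the patient's perspective:
print(ask_chatbot(
    "For the past hour I have had crushing chest pain radiating to my left "
    "arm, together with shortness of breath and cold sweat."
))
```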
Affiliation(s)
- Florian Reis
- Medical Affairs, Pfizer Pharma GmbH, Berlin, Germany
19
Elyoseph Z, Levkovich I, Shinan-Altman S. Assessing prognosis in depression: comparing perspectives of AI models, mental health professionals and the general public. Fam Med Community Health 2024;12:e002583. [PMID: 38199604] [PMCID: PMC10806564] [DOI: 10.1136/fmch-2023-002583]
Abstract
BACKGROUND Artificial intelligence (AI) has rapidly permeated various sectors, including healthcare, highlighting its potential to facilitate mental health assessments. This study explores the underexplored domain of AI's role in evaluating prognosis and long-term outcomes in depressive disorders, offering insights into how AI large language models (LLMs) compare with human perspectives.

METHODS Using case vignettes, we conducted a comparative analysis involving different LLMs (ChatGPT-3.5, ChatGPT-4, Claude and Bard), mental health professionals (general practitioners, psychiatrists, clinical psychologists and mental health nurses), and the general public, as reported previously. We evaluated the LLMs' ability to generate a prognosis, anticipated outcomes with and without professional intervention, and envisioned long-term positive and negative consequences for individuals with depression.

RESULTS In most of the examined cases, the four LLMs consistently identified depression as the primary diagnosis and recommended a combined treatment of psychotherapy and antidepressant medication. ChatGPT-3.5 exhibited a significantly pessimistic prognosis distinct from the other LLMs, the professionals and the public. ChatGPT-4, Claude and Bard aligned closely with the perspectives of mental health professionals and the general public, all of whom anticipated no improvement or worsening without professional help. Regarding long-term outcomes, ChatGPT-3.5, Claude and Bard consistently projected significantly fewer negative long-term consequences of treatment than ChatGPT-4.

CONCLUSIONS This study underscores the potential of AI to complement the expertise of mental health professionals and promote a collaborative paradigm in mental healthcare. The observation that three of the four LLMs closely mirrored the anticipations of mental health experts in scenarios involving treatment underscores the technology's prospective value in offering professional clinical forecasts. The pessimistic outlook presented by ChatGPT-3.5 is concerning, as it could potentially diminish patients' drive to initiate or continue depression therapy. In summary, although LLMs show potential in enhancing healthcare services, their utilisation requires thorough verification and seamless integration with human judgement and skills.
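The kind of group comparison described above can be illustrated with a small worked example. The sketch below is hypothetical throughout: the response counts are invented placeholders (the study's data are in the cited paper), and a chi-square test of independence merely stands in for whatever analysis the authors actually used.

```python
# Illustrative sketch only: comparing prognosis-with-treatment ratings
# across rater groups. All counts are invented placeholders,
# NOT data from the cited study.
import numpy as np
from scipy.stats import chi2_contingency

raters = ["ChatGPT-3.5", "ChatGPT-4", "Mental health professionals"]
# Columns: [improvement expected with treatment, no improvement or worsening]
counts = np.array([
    [ 3, 17],  # hypothetical: predominantly pessimistic
    [14,  6],  # hypothetical: predominantly optimistic
    [13,  7],  # hypothetical: predominantly optimistic
])

chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```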
Affiliation(s)
- Zohar Elyoseph
- Department of Psychology and Educational Counseling, The Center for Psychobiological Research, Max Stern Yezreel Valley College, Yezreel Valley, Israel
- Department of Brain Sciences, Imperial College London, London, UK
- Inbar Levkovich
- Faculty of Graduate Studies, Oranim Academic College, Tivon, Israel
- Shiri Shinan-Altman
- The Louis and Gabi Weisfeld School of Social Work, Bar-Ilan University, Ramat Gan, Israel
20
Elyoseph Z, Hadar Shoval D, Levkovich I. Beyond Personhood: Ethical Paradigms in the Generative Artificial Intelligence Era. The American Journal of Bioethics: AJOB 2024;24:57-59. [PMID: 38236857] [DOI: 10.1080/15265161.2023.2278546]
Affiliation(s)
- Zohar Elyoseph
- The Max Stern Yezreel Valley College
- Imperial College, London