1. Carnino JM, Chong NYK, Bayly H, Salvati LR, Tiwana HS, Levi JR. AI-generated text in otolaryngology publications: a comparative analysis before and after the release of ChatGPT. Eur Arch Otorhinolaryngol 2024; 281:6141-6146. [PMID: 39014250] [PMCID: PMC11513233] [DOI: 10.1007/s00405-024-08834-3]
Abstract
PURPOSE This study examines the broader implications of artificial intelligence (AI) text generation technologies, including large language models (LLMs) and chatbots, for the scientific literature of otolaryngology. By observing trends in AI-generated text within published otolaryngology studies, this investigation aims to contextualize the impact of AI-driven tools that are reshaping scientific writing and communication. METHODS Text from 143 original articles published in JAMA Otolaryngology - Head and Neck Surgery was collected, representing periods before and after ChatGPT's release in November 2022. The text from each article's abstract, introduction, methods, results, and discussion was entered into ZeroGPT.com to estimate the percentage of AI-generated content. Statistical analyses, including t-tests and Fligner-Killeen tests, were conducted using R. RESULTS A significant increase was observed in the mean percentage of AI-generated text after ChatGPT's release, especially in the abstract (from 34.36% to 46.53%, p = 0.004), introduction (from 32.43% to 45.08%, p = 0.010), and discussion (from 15.73% to 25.03%, p = 0.015) sections. Publications by authors from non-English-speaking countries demonstrated a higher percentage of AI-generated text. CONCLUSION This study found that the advent of ChatGPT has significantly impacted writing practices among researchers publishing in JAMA Otolaryngology - Head and Neck Surgery, raising concerns over the accuracy of AI-created content and potential misinformation risks. The manuscript highlights the evolving dynamics between AI technologies, scientific communication, and publication integrity, emphasizing the need for continued research in this rapidly changing field. The findings also suggest an increasing reliance on AI tools like ChatGPT, raising questions about their broader implications for scientific publishing.
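As a rough illustration of the before-versus-after comparison described in this abstract, the following is a minimal Python sketch; the study itself used R and ZeroGPT.com, the percentage values below are illustrative placeholders rather than study data, and the specific test variants (Welch's t-test, Fligner-Killeen on two groups) are assumptions based on the abstract.

```python
# Minimal sketch (not the authors' R code): comparing section-level
# AI-generated-text percentages before vs. after ChatGPT's release.
# All percentage values are illustrative placeholders, not study data.
from scipy import stats

pre_release_abstract = [30.1, 28.4, 41.0, 35.6, 36.7, 32.2]   # % AI-generated, pre-release articles
post_release_abstract = [44.9, 50.2, 39.8, 52.1, 47.0, 45.3]  # % AI-generated, post-release articles

# Welch's t-test (does not assume equal variances between the two periods)
t_stat, p_value = stats.ttest_ind(pre_release_abstract, post_release_abstract,
                                  equal_var=False)

# Fligner-Killeen test for a shift in variance between the two periods
fk_stat, fk_p = stats.fligner(pre_release_abstract, post_release_abstract)

print(f"Welch t = {t_stat:.2f}, p = {p_value:.3f}")
print(f"Fligner-Killeen statistic = {fk_stat:.2f}, p = {fk_p:.3f}")
```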
Affiliation(s)
- Jonathan M Carnino
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
- Nicholas Y K Chong
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Henry Bayly
- Boston University School of Public Health, Boston, MA, USA
- Hardeep S Tiwana
- Washington State University Elson S. Floyd College of Medicine, Spokane, WA, USA
- Jessica R Levi
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology - Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
2. Howard EC, Carnino JM, Chong NYK, Levi JR. Navigating ChatGPT's alignment with expert consensus on pediatric OSA management. Int J Pediatr Otorhinolaryngol 2024; 186:112131. [PMID: 39423592] [DOI: 10.1016/j.ijporl.2024.112131]
Abstract
OBJECTIVE This study aimed to evaluate the potential integration of artificial intelligence (AI), specifically ChatGPT, into healthcare decision-making, focusing on its alignment with expert consensus statements on the management of persistent pediatric obstructive sleep apnea (OSA). METHODS We analyzed ChatGPT's responses to 52 statements from the 2024 expert consensus statement (ECS) on the management of pediatric persistent OSA after adenotonsillectomy. Each statement was input into ChatGPT using a 9-point Likert scale format and entered three times to calculate mean scores and standard deviations. Statistical analysis was performed using Excel. RESULTS ChatGPT's responses were within 1.0 of the consensus mean score for 63% (33/52) of the statements. For 13% (7/52) of statements, the ChatGPT mean response differed from the ECS mean by 2.0 or more, the majority of which concerned surgical and medical management. These divergent statements highlight the risk of disseminating incorrect information on established medical topics, and the notable variation across repeated responses suggests inconsistencies in ChatGPT's reliability. CONCLUSION While ChatGPT demonstrated a promising ability to align with expert medical opinion in many cases, its inconsistencies and potential to propagate inaccuracies in contested areas raise important considerations for its application in clinical settings. The findings underscore the need for ongoing evaluation and refinement of AI tools in healthcare, emphasizing collaboration between AI developers, healthcare professionals, and regulatory bodies to ensure AI's safe and effective integration into medical decision-making.
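A minimal sketch of the workflow this abstract describes: each consensus statement is scored three times on a 9-point Likert scale, the mean and standard deviation are computed, and statements whose ChatGPT mean differs from the ECS mean by 2.0 or more are flagged. The statement IDs and scores below are illustrative placeholders, not data from the study (which used Excel).

```python
# Minimal sketch (assumed workflow, not the authors' spreadsheet): three ChatGPT
# ratings per statement are summarized and compared with the expert-consensus mean.
# Statement IDs, scores, and consensus means are illustrative placeholders.
from statistics import mean, stdev

ratings = {
    # statement id: (three ChatGPT scores on a 9-point scale, expert-consensus mean)
    "S01": ([8, 8, 7], 7.5),
    "S02": ([3, 4, 3], 6.8),
    "S03": ([9, 9, 9], 8.9),
}

for statement_id, (gpt_scores, consensus_mean) in ratings.items():
    gpt_mean = mean(gpt_scores)
    gpt_sd = stdev(gpt_scores)
    gap = abs(gpt_mean - consensus_mean)
    flag = "DIVERGENT (>= 2.0)" if gap >= 2.0 else "within 2.0"
    print(f"{statement_id}: ChatGPT {gpt_mean:.2f} ± {gpt_sd:.2f}, "
          f"consensus {consensus_mean:.2f}, gap {gap:.2f} -> {flag}")
```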
Affiliation(s)
- Eileen C Howard
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Jonathan M Carnino
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Nicholas Y K Chong
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Jessica R Levi
- Department of Otolaryngology - Head and Neck Surgery, Boston Medical Center, Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
3. Ahn S. The transformative impact of large language models on medical writing and publishing: current applications, challenges and future directions. Korean J Physiol Pharmacol 2024; 28:393-401. [PMID: 39198220] [PMCID: PMC11362003] [DOI: 10.4196/kjpp.2024.28.5.393]
Abstract
Large language models (LLMs) are rapidly transforming medical writing and publishing. This review focuses on experimental evidence to provide a comprehensive overview of the current applications, challenges, and future implications of LLMs at various stages of the academic research and publishing process. Global surveys reveal a high prevalence of LLM usage in scientific writing, with both potential benefits and challenges associated with its adoption. LLMs have been successfully applied to literature search, research design, writing assistance, quality assessment, citation generation, and data analysis. They have also been used in peer review and publication processes, including manuscript screening, generating review comments, and identifying potential biases. To ensure the integrity and quality of scholarly work in the era of LLM-assisted research, responsible artificial intelligence (AI) use is crucial. Researchers should prioritize verifying the accuracy and reliability of AI-generated content, maintain transparency in their use of LLMs, and develop collaborative human-AI workflows. Reviewers should focus on higher-order reviewing skills and be aware of the potential use of LLMs in manuscripts. Editorial offices should develop clear policies and guidelines on AI use and foster open dialogue within the academic community. Future directions include addressing the limitations and biases of current LLMs, exploring innovative applications, and continuously updating policies and practices in response to technological advancements. Collaborative efforts among stakeholders are necessary to harness the transformative potential of LLMs while maintaining the integrity of medical writing and publishing.
Affiliation(s)
- Sangzin Ahn
- Department of Pharmacology and PharmacoGenomics Research Center, Inje University College of Medicine, Busan 47392, Korea
- Center for Personalized Precision Medicine of Tuberculosis, Inje University College of Medicine, Busan 47392, Korea
4. Ng MK, Magruder ML, Heckmann ND, Delanois RE, Piuzzi NS, Krebs VE, Mont MA. How-To Create an Orthopaedic Systematic Review: A Step-by-Step Guide. Part III: Executing a Meta-Analysis. J Arthroplasty 2024; 39:2383-2388. [PMID: 38493965] [DOI: 10.1016/j.arth.2024.03.026]
Abstract
At the top of the evidence-based pyramid, systematic reviews stand out as the most powerful study design, synthesizing findings from numerous primary studies. A quantitative systematic review, known as a meta-analysis, combines results from multiple studies to address a specific research question. This review serves as a guide on how to: (1) design; (2) perform; and (3) publish an orthopaedic arthroplasty systematic review. In Part III, we focus on how to design and perform a meta-analysis. We delineate the advantages and disadvantages of meta-analyses compared with systematic reviews, acknowledging the challenges posed by time constraints, study heterogeneity, and data availability. Despite these obstacles, a well-executed meta-analysis contributes precision and heightened statistical power. The design of a meta-analysis closely mirrors that of a systematic review but additionally requires effect sizes, variability measures, sample sizes, outcome measures, and overall study characteristics. Effective data presentation involves forest plots, along with heterogeneity and subgroup analyses. Widely used software tools are common in this domain, and there is a growing trend toward incorporating artificial intelligence software. Ultimately, the intention is for these papers to act as foundational resources for individuals interested in conducting systematic reviews and meta-analyses in the context of orthopaedic arthroplasty.
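As a concrete illustration of the pooling and heterogeneity steps this guide describes, here is a minimal Python sketch of a DerSimonian-Laird random-effects meta-analysis with Cochran's Q and I². It is not the authors' workflow, which would typically use dedicated meta-analysis software, and the effect sizes and variances below are illustrative placeholders.

```python
# Minimal sketch of inverse-variance pooling with a DerSimonian-Laird
# random-effects model. Effect sizes and variances are illustrative placeholders.
import math

# Per-study effect sizes (e.g., mean differences) and their variances
effects = [0.42, 0.30, 0.55, 0.18, 0.47]
variances = [0.020, 0.035, 0.050, 0.015, 0.040]

w_fixed = [1.0 / v for v in variances]                     # inverse-variance weights
pooled_fixed = sum(w * y for w, y in zip(w_fixed, effects)) / sum(w_fixed)

# Heterogeneity: Cochran's Q and I^2
q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(w_fixed, effects))
df = len(effects) - 1
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

# Between-study variance (tau^2) and random-effects pooling
c = sum(w_fixed) - sum(w ** 2 for w in w_fixed) / sum(w_fixed)
tau_sq = max(0.0, (q - df) / c)
w_random = [1.0 / (v + tau_sq) for v in variances]
pooled_random = sum(w * y for w, y in zip(w_random, effects)) / sum(w_random)
se_random = math.sqrt(1.0 / sum(w_random))
ci = (pooled_random - 1.96 * se_random, pooled_random + 1.96 * se_random)

print(f"Q = {q:.2f} (df = {df}), I^2 = {i_squared:.1f}%, tau^2 = {tau_sq:.4f}")
print(f"Random-effects pooled effect = {pooled_random:.3f}, "
      f"95% CI {ci[0]:.3f} to {ci[1]:.3f}")
```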
Affiliation(s)
- Mitchell K Ng
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, New York
- Matthew L Magruder
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, New York
- Nathanael D Heckmann
- Department of Orthopaedic Surgery, Keck School of Medicine of USC, Los Angeles, California
- Ronald E Delanois
- Rubin Institute for Advanced Orthopedics, Sinai Hospital of Baltimore, Baltimore, Maryland
- Nicolas S Piuzzi
- Department of Orthopaedic Surgery, Cleveland Clinic Foundation, Cleveland, Ohio
- Viktor E Krebs
- Department of Orthopaedic Surgery, Cleveland Clinic Foundation, Cleveland, Ohio
- Michael A Mont
- Rubin Institute for Advanced Orthopedics, Sinai Hospital of Baltimore, Baltimore, Maryland; Northwell Health Orthopaedics, Lenox Hill Hospital, New York, New York
5. Guerra GA, Hofmann HL, Le JL, Wong AM, Fathi A, Mayfield CK, Petrigliano FA, Liu JN. ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents. Arthroscopy 2024:S0749-8063(24)00621-2. [PMID: 39209078] [DOI: 10.1016/j.arthro.2024.08.023]
Abstract
PURPOSE To assess the ability of ChatGPT, Bard, and Bing Chat to generate accurate orthopaedic diagnoses or corresponding treatments by comparing their performance on the Orthopaedic In-Training Examination (OITE) with that of orthopaedic trainees. METHODS OITE question sets from 2021 and 2022 were compiled to form a large set of 420 questions. ChatGPT (GPT-3.5), Bard, and Bing Chat were instructed to select one of the provided responses to each question. Accuracy on the composite question set was recorded and compared with that of human cohorts, including medical students and orthopaedic residents stratified by postgraduate year (PGY). RESULTS ChatGPT correctly answered 46.3% of composite questions, whereas Bing Chat correctly answered 52.4% and Bard 51.4% of questions on the OITE. When image-associated questions were excluded, the overall accuracies of ChatGPT, Bing Chat, and Bard improved to 49.1%, 53.5%, and 56.8%, respectively. Medical students correctly answered 30.8% of questions, and PGY-1, -2, -3, -4, and -5 orthopaedic residents correctly answered 53.1%, 60.4%, 66.6%, 70.0%, and 71.9%, respectively. CONCLUSIONS ChatGPT, Bard, and Bing Chat are artificial intelligence (AI) models that answered OITE questions with accuracy similar to that of first-year orthopaedic surgery residents, and they achieved this result without the images or other supplementary media that human test takers are provided. CLINICAL RELEVANCE Our comparative performance analysis of AI models on orthopaedic board-style questions highlights the clinical knowledge and proficiency of ChatGPT, Bing Chat, and Bard. It establishes a baseline of AI model proficiency in orthopaedics and provides a comparative marker for future, more advanced deep learning models. Although the technology is still in an elementary phase, future AI models may provide clinical support and serve as educational tools as their orthopaedic knowledge grows.
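A minimal sketch of the scoring step implied by the methods: each model's selected answers are checked against an examination key and per-model accuracy is reported alongside the resident benchmarks quoted in the abstract. The answer key and model responses below are illustrative placeholders, not OITE content or study data.

```python
# Minimal sketch (assumed scoring workflow, not the authors' pipeline): compare
# model-selected answers against an answer key and compute per-model accuracy.
# Keys and responses are illustrative placeholders, not OITE content.
answer_key = {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "C"}

model_responses = {
    "ChatGPT (GPT-3.5)": {"Q1": "B", "Q2": "A", "Q3": "A", "Q4": "C"},
    "Bard":              {"Q1": "B", "Q2": "D", "Q3": "C", "Q4": "C"},
    "Bing Chat":         {"Q1": "B", "Q2": "D", "Q3": "A", "Q4": "D"},
}

resident_benchmarks = {"PGY-1": 53.1, "PGY-5": 71.9}  # % correct, as reported in the abstract

for model, responses in model_responses.items():
    correct = sum(responses[q] == answer for q, answer in answer_key.items())
    accuracy = 100.0 * correct / len(answer_key)
    print(f"{model}: {correct}/{len(answer_key)} correct ({accuracy:.1f}%)")

for cohort, pct in resident_benchmarks.items():
    print(f"{cohort} residents: {pct:.1f}% (reported)")
```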
Affiliation(s)
- Gage A Guerra
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
- Hayden L Hofmann
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
- Jonathan L Le
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
- Alexander M Wong
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
- Amir Fathi
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
- Cory K Mayfield
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
- Frank A Petrigliano
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
- Joseph N Liu
- USC Epstein Family Center for Sports Medicine, Keck Medicine of USC, Los Angeles, California, U.S.A.
6. Carnino JM, Pellegrini WR, Willis M, Cohen MB, Paz-Lansberg M, Davis EM, Grillone GA, Levi JR. Assessing ChatGPT's Responses to Otolaryngology Patient Questions. Ann Otol Rhinol Laryngol 2024; 133:658-664. [PMID: 38676440] [DOI: 10.1177/00034894241249621]
Abstract
OBJECTIVE This study aims to evaluate ChatGPT's performance in addressing real-world otolaryngology patient questions, focusing on accuracy, comprehensiveness, and patient safety, to assess its suitability for integration into healthcare. METHODS A cross-sectional study was conducted using patient questions from Reddit's public r/AskDocs forum, where users seek medical advice from healthcare professionals. Patient questions were input into ChatGPT (GPT-3.5), and the responses were reviewed by 5 board-certified otolaryngologists. The evaluation criteria included difficulty, accuracy, comprehensiveness, and bedside manner/empathy. Statistical analysis explored the relationship between patient question characteristics and ChatGPT response scores. Potentially dangerous responses were also identified. RESULTS Patient questions averaged 224.93 words, while ChatGPT responses were longer at 414.93 words. ChatGPT responses scored 3.76/5 for accuracy, 3.59/5 for comprehensiveness, and 4.28/5 for bedside manner/empathy. Longer patient questions did not correlate with higher response ratings; however, longer ChatGPT responses scored higher in bedside manner/empathy, and higher question difficulty correlated with lower comprehensiveness. Five responses were flagged as potentially dangerous. CONCLUSION While ChatGPT exhibits promise in addressing otolaryngology patient questions, this study demonstrates its limitations, particularly in accuracy and comprehensiveness. The identification of potentially dangerous responses underscores the need for a cautious approach to AI-generated medical advice. Responsible integration of AI into healthcare necessitates thorough assessment of model performance and ethical considerations for patient safety.
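A minimal sketch of the correlation analysis implied by the results: rank correlations between question characteristics (word count, rated difficulty) and reviewer scores. The values below are illustrative placeholders, not study data, and the choice of Spearman correlation is an assumption rather than the authors' stated method.

```python
# Minimal sketch (assumed analysis, not the authors' code): test whether question
# length or rated difficulty tracks with reviewer scores for ChatGPT responses.
# All values are illustrative placeholders.
from scipy.stats import spearmanr

question_word_counts = [120, 310, 95, 540, 210, 180, 260]
difficulty_ratings   = [2,   4,   1,  5,   3,   2,   4]     # 1 = easy, 5 = hard
comprehensiveness    = [4.2, 3.1, 4.6, 2.8, 3.7, 4.0, 3.0]  # mean reviewer score out of 5

rho_len, p_len = spearmanr(question_word_counts, comprehensiveness)
rho_diff, p_diff = spearmanr(difficulty_ratings, comprehensiveness)

print(f"Question length vs. comprehensiveness: rho = {rho_len:.2f}, p = {p_len:.3f}")
print(f"Difficulty vs. comprehensiveness:      rho = {rho_diff:.2f}, p = {p_diff:.3f}")
```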
Affiliation(s)
- Jonathan M Carnino
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- William R Pellegrini
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Megan Willis
- Department of Biostatistics, Boston University, Boston, MA, USA
- Michael B Cohen
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Marianella Paz-Lansberg
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Elizabeth M Davis
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Gregory A Grillone
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Jessica R Levi
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
7. Rizk PA, Gonzalez MR, Galoaa BM, Girgis AG, Van Der Linden L, Chang CY, Lozano-Calderon SA. Machine Learning-Assisted Decision Making in Orthopaedic Oncology. JBJS Rev 2024; 12:01874474-202407000-00005. [PMID: 38991098] [DOI: 10.2106/jbjs.rvw.24.00057]
Abstract
» Artificial intelligence is an umbrella term for computational calculations that are designed to mimic human intelligence and problem-solving capabilities, although in the future, this may become an incomplete definition. Machine learning (ML) encompasses the development of algorithms or predictive models that generate outputs without explicit instructions, assisting in clinical predictions based on large data sets. Deep learning is a subset of ML that utilizes layers of networks that use various inter-relational connections to define and generalize data.
» ML algorithms can enhance radiomics techniques for improved image evaluation and diagnosis. While ML shows promise with the advent of radiomics, there are still obstacles to overcome.
» Several calculators leveraging ML algorithms have been developed to predict survival in primary sarcomas and metastatic bone disease utilizing patient-specific data. While these models often report exceptionally accurate performance, it is crucial to evaluate their robustness using standardized guidelines.
» While increased computing power suggests continuous improvement of ML algorithms, these advancements must be balanced against challenges such as diversifying data, addressing ethical concerns, and enhancing model interpretability.
Affiliation(s)
- Paul A Rizk
- Division of Orthopaedic Oncology, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Marcos R Gonzalez
- Division of Orthopaedic Oncology, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Bishoy M Galoaa
- Interdisciplinary Science & Engineering Complex (ISEC), Northeastern University, Boston, Massachusetts
- Andrew G Girgis
- Boston University Chobanian & Avedisian School of Medicine, Boston, Massachusetts
- Lotte Van Der Linden
- Division of Orthopaedic Oncology, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
- Connie Y Chang
- Musculoskeletal Imaging and Intervention, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts
- Santiago A Lozano-Calderon
- Division of Orthopaedic Oncology, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
8. Yao JJ, Aggarwal M, Lopez RD, Namdari S. Current Concepts Review: Large Language Models in Orthopaedics: Definitions, Uses, and Limitations. J Bone Joint Surg Am 2024:00004623-990000000-01136. [PMID: 38896652] [DOI: 10.2106/jbjs.23.01417]
Abstract
➤ Large language models are a subset of artificial intelligence. Large language models are powerful tools that excel in natural language text processing and generation.
➤ There are many potential clinical, research, and educational applications of large language models in orthopaedics, but the development of these applications needs to be focused on patient safety and the maintenance of high standards.
➤ There are numerous methodological, ethical, and regulatory concerns with regard to the use of large language models. Orthopaedic surgeons need to be aware of the controversies and advocate for an alignment of these models with patient and caregiver priorities.
Affiliation(s)
- Jie J Yao
- Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
- Ryan D Lopez
- Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
- Surena Namdari
- Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, Pennsylvania
9. Howard EC, Chong NYK, Carnino JM, Levi JR. Comparison of ChatGPT knowledge against 2020 consensus statement on ankyloglossia in children. Int J Pediatr Otorhinolaryngol 2024; 180:111957. [PMID: 38640573] [DOI: 10.1016/j.ijporl.2024.111957]
Abstract
OBJECTIVE This paper evaluates ChatGPT's accuracy and consistency in providing information on ankyloglossia, a congenital oral condition. By assessing alignment with expert consensus, the study explores potential implications for patients relying on AI for medical information. METHODS Statements from the 2020 clinical consensus statement (CCS) on ankyloglossia were presented to ChatGPT, and its responses were scored on a 9-point Likert scale. The mean and standard deviation of ChatGPT scores were analyzed for each statement, with statistical analysis conducted in Excel. RESULTS Among the 63 statements assessed, ChatGPT responses closely aligned with the expert consensus mean scores for 67% of statements. However, for 17% (11/63) of statements, the ChatGPT mean response differed from the CCS mean by 2.0 or more, raising concerns about ChatGPT's potential to disseminate uncertain or debated medical information. Variations in mean scores highlighted discrepancies, with some statements showing significant deviations from expert opinion. CONCLUSION While ChatGPT largely mirrored expert medical viewpoints on ankyloglossia, its alignment with non-consensus statements warrants caution in relying on it for medical advice. Future research should refine AI models, address inaccuracies, and explore diverse user queries for safe integration into medical decision-making. Despite potential benefits, ongoing examination of ChatGPT's capabilities and limitations is crucial, considering its impact on health equity and information access.
Affiliation(s)
- Eileen C Howard
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA.
- Nicholas Y K Chong
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Jonathan M Carnino
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Jessica R Levi
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA; Department of Otolaryngology - Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
10. Leopold SS. Editor's Spotlight/Take 5: How Does ChatGPT Use Source Information Compared With Google? A Text Network Analysis of Online Health Information. Clin Orthop Relat Res 2024; 482:574-577. [PMID: 38446430] [PMCID: PMC10936992] [DOI: 10.1097/corr.0000000000003006]
Affiliation(s)
- Seth S Leopold
- Editor-in-Chief, Clinical Orthopaedics and Related Research®, Park Ridge, IL, USA
11. Sánchez-Rosenberg G, Magnéli M, Barle N, Kontakis MG, Müller AM, Wittauer M, Gordon M, Brodén C. ChatGPT-4 generates orthopedic discharge documents faster than humans maintaining comparable quality: a pilot study of 6 cases. Acta Orthop 2024; 95:152-156. [PMID: 38597205] [PMCID: PMC10959013] [DOI: 10.2340/17453674.2024.40182]
Abstract
BACKGROUND AND PURPOSE Large language models such as ChatGPT-4 hold the potential to reduce the administrative burden by generating everyday clinical documents, allowing physicians to spend more time with patients. We aimed to assess both the quality and the efficiency of discharge documents generated by ChatGPT-4 in comparison with those produced by physicians. PATIENTS AND METHODS To emulate real-world situations, the health records of 6 fictional orthopedic cases were created. Discharge documents for each case were generated by a junior attending orthopedic surgeon and an advanced orthopedic resident. ChatGPT-4 was then prompted to generate the discharge documents using the same health record information. Quality was assessed by an expert panel (n = 15) blinded to the source of the documents. As a secondary outcome, the time required to generate the documents was compared by logging how long the physicians and ChatGPT-4 took to create them. RESULTS Overall, ChatGPT-4- and physician-generated notes were comparable in quality. Notably, ChatGPT-4 generated discharge documents 10 times faster than the traditional method. 4 hallucination events were found in the ChatGPT-4-generated content, compared with 6 in the physician-produced notes. CONCLUSION ChatGPT-4 creates orthopedic discharge notes faster than physicians, with comparable quality, suggesting great potential for streamlining document creation in orthopedic care and significantly reducing the administrative burden on healthcare professionals.
Affiliation(s)
- Martin Magnéli
- Karolinska Institute, Department of Clinical Sciences at Danderyd Hospital, Stockholm, Sweden
- Niklas Barle
- Karolinska Institute, Department of Clinical Sciences at Danderyd Hospital, Stockholm, Sweden
- Michael G Kontakis
- Department of Surgical Sciences, Orthopedics, Uppsala University Hospital, Uppsala, Sweden
- Andreas Marc Müller
- Department of Orthopedic and Trauma Surgery, University Hospital Basel, Switzerland
- Matthias Wittauer
- Department of Orthopedic and Trauma Surgery, University Hospital Basel, Switzerland
- Max Gordon
- Karolinska Institute, Department of Clinical Sciences at Danderyd Hospital, Stockholm, Sweden
- Cyrus Brodén
- Department of Surgical Sciences, Orthopedics, Uppsala University Hospital, Uppsala, Sweden
12. Lawrence KW, Habibi AA, Ward SA, Lajam CM, Schwarzkopf R, Rozell JC. Human versus artificial intelligence-generated arthroplasty literature: A single-blinded analysis of perceived communication, quality, and authorship source. Int J Med Robot 2024; 20:e2621. [PMID: 38348740] [DOI: 10.1002/rcs.2621]
Abstract
BACKGROUND Large language models (LLMs) have unknown implications for medical research. This study assessed whether LLM-generated abstracts are distinguishable from human-written abstracts and compared their perceived quality. METHODS The LLM ChatGPT was used to generate 20 arthroplasty abstracts (AI-generated) based on full-text manuscripts, which were compared with the originally published abstracts (human-written). Six blinded orthopaedic surgeons rated the abstracts on overall quality, communication, and confidence in the authorship source. Authorship-confidence scores were compared with a test value representing complete inability to discern authorship. RESULTS Modestly increased confidence in human authorship was observed for human-written abstracts compared with AI-generated abstracts (p = 0.028), though authorship-confidence scores for AI-generated abstracts were statistically consistent with an inability to discern authorship (p = 0.999). Overall abstract quality was higher for human-written abstracts (p = 0.019). CONCLUSIONS Absolute authorship-confidence ratings for AI-generated abstracts indicated difficulty in discerning authorship, but the abstracts did not achieve the perceived quality of human-written abstracts. Caution is warranted in implementing LLMs into scientific writing.
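A minimal sketch of the authorship-confidence comparison described above: a one-sample test of blinded confidence ratings against a fixed test value representing complete inability to discern authorship. The scale, the midpoint used as the test value, and the ratings below are assumptions and illustrative placeholders, not study data.

```python
# Minimal sketch (assumed analysis, not the authors' code): compare blinded
# authorship-confidence ratings for AI-generated abstracts against a fixed test
# value representing complete inability to discern authorship (assumed here to be
# the midpoint, 3, of a 1-5 scale). All ratings are illustrative placeholders.
from scipy import stats

cannot_discern_value = 3.0
ai_abstract_confidence = [3.2, 2.8, 3.0, 3.1, 2.9, 3.0, 3.3, 2.7]

t_stat, p_value = stats.ttest_1samp(ai_abstract_confidence, cannot_discern_value)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("Scores consistent with inability to discern authorship"
      if p_value > 0.05 else "Scores differ from the 'cannot discern' test value")
```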
Affiliation(s)
- Kyle W Lawrence
- Department of Orthopedic Surgery, NYU Langone Health, New York, New York, USA
- Akram A Habibi
- Department of Orthopedic Surgery, NYU Langone Health, New York, New York, USA
- Spencer A Ward
- Department of Orthopedic Surgery, NYU Langone Health, New York, New York, USA
- Claudette M Lajam
- Department of Orthopedic Surgery, NYU Langone Health, New York, New York, USA
- Ran Schwarzkopf
- Department of Orthopedic Surgery, NYU Langone Health, New York, New York, USA
- Joshua C Rozell
- Department of Orthopedic Surgery, NYU Langone Health, New York, New York, USA
13. Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large Language Models in Medicine: The Potentials and Pitfalls: A Narrative Review. Ann Intern Med 2024; 177:210-220. [PMID: 38285984] [DOI: 10.7326/m23-2772]
Abstract
Large language models (LLMs) are artificial intelligence models trained on vast text data to generate humanlike outputs. They have been applied to various tasks in health care, ranging from answering medical examination questions to generating clinical reports. With increasing institutional partnerships between companies producing LLMs and health systems, the real-world clinical application of these models is nearing realization. As these models gain traction, health care practitioners must understand what LLMs are, their development, their current and potential applications, and the associated pitfalls in a medical setting. This review, coupled with a tutorial, provides a comprehensive yet accessible overview of these areas with the aim of familiarizing health care professionals with the rapidly changing landscape of LLMs in medicine. Furthermore, the authors highlight active research areas in the field that promise to improve LLMs' usability in health care contexts.
Affiliation(s)
- Jesutofunmi A Omiye
- Department of Dermatology and Department of Biomedical Data Science, Stanford University, Stanford, California (J.A.O., R.D.)
- Haiwen Gui
- Department of Dermatology, Stanford University, Stanford, California (H.G., S.J.R.)
- Shawheen J Rezaei
- Department of Dermatology, Stanford University, Stanford, California (H.G., S.J.R.)
- James Zou
- Department of Biomedical Data Science, Stanford University, Stanford, California (J.Z.)
- Roxana Daneshjou
- Department of Dermatology and Department of Biomedical Data Science, Stanford University, Stanford, California (J.A.O., R.D.)