1. Kachman MM, Brennan I, Oskvarek JJ, Waseem T, Pines JM. How artificial intelligence could transform emergency care. Am J Emerg Med 2024;81:40-46. PMID: 38663302. DOI: 10.1016/j.ajem.2024.04.024.
Abstract
Artificial intelligence (AI) in healthcare is the ability of a computer to perform tasks typically associated with clinical care (e.g. medical decision-making and documentation). AI will soon be integrated into an increasing number of healthcare applications, including elements of emergency department (ED) care. Here, we describe the basics of AI, various categories of its functions (including machine learning and natural language processing), and review emerging and potential future use cases for emergency care. For example, AI-assisted symptom checkers could help direct patients to the appropriate setting, models could assist in assigning triage levels, and ambient AI systems could document clinical encounters. AI could also help provide focused summaries of charts, summarize encounters for hand-offs, and create discharge instructions with an appropriate language and reading level. Additional use cases include medical decision making for decision rules, real-time models that predict clinical deterioration or sepsis, and efficient extraction of unstructured data for coding, billing, research, and quality initiatives. We discuss the potential transformative benefits of AI, as well as the concerns regarding its use (e.g. privacy, data accuracy, and the potential for changing the doctor-patient relationship).
Affiliation(s)
- Marika M Kachman
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, Virginia Hospital Center, Arlington, VA, United States of America
- Irina Brennan
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, Inova Alexandria Hospital, Alexandria, VA, United States of America
- Jonathan J Oskvarek
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, Summa Health, Akron, OH, United States of America
- Tayab Waseem
- Department of Emergency Medicine, George Washington University, Washington, DC, United States of America
- Jesse M Pines
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, George Washington University, Washington, DC, United States of America
2. Meczner A, Cohen N, Qureshi A, Reza M, Sutaria S, Blount E, Bagyura Z, Malak T. Controlling Inputter Variability in Vignette Studies Assessing Web-Based Symptom Checkers: Evaluation of Current Practice and Recommendations for Isolated Accuracy Metrics. JMIR Form Res 2024;8:e49907. PMID: 38820578. PMCID: PMC11179013. DOI: 10.2196/49907.
Abstract
BACKGROUND The rapid growth of web-based symptom checkers (SCs) is not matched by advances in quality assurance. Currently, there are no widely accepted criteria assessing SCs' performance. Vignette studies are widely used to evaluate SCs, measuring the accuracy of outcome. Accuracy behaves as a composite metric as it is affected by a number of individual SC- and tester-dependent factors. In contrast to clinical studies, vignette studies have a small number of testers. Hence, measuring accuracy alone in vignette studies may not provide a reliable assessment of performance due to tester variability. OBJECTIVE This study aims to investigate the impact of tester variability on the accuracy of outcome of SCs, using clinical vignettes. It further aims to investigate the feasibility of measuring isolated aspects of performance. METHODS Healthily's SC was assessed using 114 vignettes by 3 groups of 3 testers who processed vignettes with different instructions: free interpretation of vignettes (free testers), specified chief complaints (partially free testers), and specified chief complaints with strict instruction for answering additional symptoms (restricted testers). κ statistics were calculated to assess agreement of top outcome condition and recommended triage. Crude and adjusted accuracy was measured against a gold standard. Adjusted accuracy was calculated using only results of consultations identical to the vignette, following a review and selection process. A feasibility study for assessing symptom comprehension of SCs was performed using different variations of 51 chief complaints across 3 SCs. RESULTS Intertester agreement of most likely condition and triage was, respectively, 0.49 and 0.51 for the free tester group, 0.66 and 0.66 for the partially free group, and 0.72 and 0.71 for the restricted group. For the restricted group, accuracy ranged from 43.9% to 57% for individual testers, averaging 50.6% (SD 5.35%). Adjusted accuracy was 56.1%. 
Assessing symptom comprehension was feasible for all 3 SCs. Comprehension scores ranged from 52.9% to 68%. CONCLUSIONS We demonstrated that improving standardization of the vignette testing process significantly improves the agreement of outcome between testers. However, significant variability remained due to uncontrollable tester-dependent factors, reflected by varying outcome accuracy. Tester-dependent factors, combined with a small number of testers, limit the reliability and generalizability of outcome accuracy when used as a composite measure in vignette studies. Measuring and reporting different aspects of SC performance in isolation provides a more reliable assessment. We developed an adjusted accuracy measure using a review and selection process to assess data algorithm quality. In addition, we demonstrated that symptom comprehension with different input methods can be feasibly compared. Future studies reporting accuracy need to apply vignette testing standardization and isolated metrics.
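The intertester agreement figures reported above are κ statistics. As an illustrative sketch (the labels below are hypothetical and not taken from the study), Cohen's κ for two testers' top-outcome labels compares observed agreement against the agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(ca) | set(cb)
    p_e = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# hypothetical top-condition labels from two testers on 8 vignettes
t1 = ["flu", "uti", "flu", "gerd", "uti", "flu", "gerd", "uti"]
t2 = ["flu", "uti", "gerd", "gerd", "uti", "flu", "uti", "uti"]
print(round(cohens_kappa(t1, t2), 3))  # → 0.619
```

A κ of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the 0.49-0.72 range reported above spans moderate to substantial agreement.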
Affiliation(s)
- András Meczner
- Healthily, London, United Kingdom
- Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
- Zsolt Bagyura
- Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
3. Petrella RJ. The AI Future of Emergency Medicine. Ann Emerg Med 2024:S0196-0644(24)00043-X. PMID: 38795081. DOI: 10.1016/j.annemergmed.2024.01.031.
Abstract
In the coming years, artificial intelligence (AI) and machine learning will likely give rise to profound changes in the field of emergency medicine, and medicine more broadly. This article discusses these anticipated changes in terms of 3 overlapping yet distinct stages of AI development. It reviews some fundamental concepts in AI and explores their relation to clinical practice, with a focus on emergency medicine. In addition, it describes some of the applications of AI in disease diagnosis, prognosis, and treatment, as well as some of the practical issues that they raise, the barriers to their implementation, and some of the legal and regulatory challenges they create.
Affiliation(s)
- Robert J Petrella
- Emergency Departments, CharterCARE Health Partners, Providence and North Providence, RI; Emergency Department, Boston VA Medical Center, Boston, MA; Emergency Departments, Steward Health Care System, Boston and Methuen, MA; Harvard Medical School, Boston, MA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA; Department of Medicine, Brigham and Women's Hospital, Boston, MA
4. Harada Y, Sakamoto T, Sugimoto S, Shimizu T. Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study. JMIR Form Res 2024;8:e53985. PMID: 38758588. PMCID: PMC11143391. DOI: 10.2196/53985.
Abstract
BACKGROUND Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited. OBJECTIVE This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world. METHODS This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We only included patients who underwent an AI-based symptom checkup at the index visit, and the diagnosis was finally confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the final diagnosis in a list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). RESULTS A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. 
Overall, the differential diagnosis list created by the AI-based symptom checker included the final diagnosis in 172 of 381 cases (45.1%), and this accuracy did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list. CONCLUSIONS A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker implemented in real-world clinical practice settings showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.
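The year-to-year comparison above rests on a chi-square test over a hits-versus-misses contingency table. As a rough check (the table is reconstructed from the per-year counts reported above), the Pearson chi-square statistic can be computed directly:

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# correct vs. incorrect diagnosis lists per year, from the counts above
table = [[97, 122], [32, 40], [43, 47]]
print(round(chi2_statistic(table), 3))  # ≈ 0.33
```

With 2 degrees of freedom, a statistic of about 0.33 corresponds to the reported P=.85, i.e. no detectable change in accuracy across the 3 years.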
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Department of General Medicine, Nagano Chuo Hospital, Nagano, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Shu Sugimoto
- Department of Medicine (Neurology and Rheumatology), Shinshu University School of Medicine, Matsumoto, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
5. Müller R, Klemmt M, Koch R, Ehni HJ, Henking T, Langmann E, Wiesing U, Ranisch R. "That's just Future Medicine" - a qualitative study on users' experiences of symptom checker apps. BMC Med Ethics 2024;25:17. PMID: 38365749. PMCID: PMC10874001. DOI: 10.1186/s12910-024-01011-5.
Abstract
BACKGROUND Symptom checker apps (SCAs) are mobile or online applications for lay people that usually have two main functions: symptom analysis and recommendations. SCAs ask users questions about their symptoms via a chatbot, give a list of possible causes, and provide a recommendation, such as seeing a physician. However, it is unclear whether the actual performance of an SCA corresponds to the users' experiences. This qualitative study investigates the subjective perspectives of SCA users to close the empirical gap identified in the literature and answers the following main research question: How do individuals (healthy users and patients) experience the usage of SCAs, including their attitudes, expectations, motivations, and concerns regarding their SCA use? METHODS A qualitative interview study was chosen to clarify the relatively unknown experience of SCA use. Semi-structured qualitative interviews with SCA users were carried out by two researchers in tandem via video call. Qualitative content analysis was selected as the methodology for data analysis. RESULTS Fifteen interviews with SCA users were conducted and seven main categories identified: (1) Attitudes towards findings and recommendations, (2) Communication, (3) Contact with physicians, (4) Expectations (prior to use), (5) Motivations, (6) Risks, and (7) SCA use for others. CONCLUSIONS The aspects identified in the analysis emphasise the specific perspective of SCA users and, at the same time, the immense scope of different experiences. Moreover, the study reveals ethical issues, such as relational aspects, that are often overlooked in debates on mHealth. More empirical and ethical research is needed, as awareness of the subjective experience of those affected is an essential component of the responsible development and implementation of health apps such as SCAs. TRIAL REGISTRATION German Clinical Trials Register (DRKS): DRKS00022465. 07/08/2020.
Affiliation(s)
- Regina Müller
- Institute of Philosophy, University Bremen, Bremen, Germany
- Malte Klemmt
- Institute of General Practice and Palliative Care, Hannover Medical School, Hannover, Germany
- Roland Koch
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
- Hans-Jörg Ehni
- Institute of Ethics and History of Medicine, University Tübingen, Tübingen, Germany
- Tanja Henking
- Institute of Applied Social Science, University of Applied Science Würzburg-Schweinfurt, Würzburg, Germany
- Elisabeth Langmann
- Institute of Ethics and History of Medicine, University Tübingen, Tübingen, Germany
- Urban Wiesing
- Institute of Ethics and History of Medicine, University Tübingen, Tübingen, Germany
- Robert Ranisch
- Faculty of Health Science Brandenburg, University of Potsdam, Potsdam, Germany
6. Kearney LE, Jansen E, Kathuria H, Steiling K, Jones KC, Walkey A, Cordella N. Efficacy of Digital Outreach Strategies for Collecting Smoking Data: Pragmatic Randomized Trial. JMIR Form Res 2024;8:e50465. PMID: 38335012. PMCID: PMC10891497. DOI: 10.2196/50465.
Abstract
BACKGROUND Tobacco smoking is an important risk factor for disease, but inaccurate smoking history data in the electronic medical record (EMR) limits the reach of lung cancer screening (LCS) and tobacco cessation interventions. Patient-generated health data is a novel approach to documenting smoking history; however, the comparative effectiveness of different approaches is unclear. OBJECTIVE We designed a quality improvement intervention to evaluate the effectiveness of portal questionnaires compared to SMS text message-based surveys, to compare message frames, and to evaluate the completeness of patient-generated smoking histories. METHODS We randomly assigned patients aged between 50 and 80 years with a history of tobacco use who identified English as a preferred language and have never undergone LCS to receive an EMR portal questionnaire or a text survey. The portal questionnaire used a "helpfulness" message, while the text survey tested frame types informed by behavior economics ("gain," "loss," and "helpfulness") and nudge messaging. The primary outcome was the response rate for each modality and framing type. Completeness and consistency with documented structured smoking data were also evaluated. RESULTS Participants were more likely to respond to the text survey (191/1000, 19.1%) compared to the portal questionnaire (35/504, 6.9%). Across all text survey rounds, patients were less responsive to the "helpfulness" frame compared with the "gain" frame (odds ratio [OR] 0.29, 95% CI 0.09-0.91; P<.05) and "loss" frame (OR 0.32, 95% CI 11.8-99.4; P<.05). Compared to the structured data in the EMR, the patient-generated data were significantly more likely to be complete enough to determine LCS eligibility both compared to the portal questionnaire (OR 34.2, 95% CI 3.8-11.1; P<.05) and to the text survey (OR 6.8, 95% CI 3.8-11.1; P<.05). 
CONCLUSIONS We found that an approach using patient-generated data is a feasible way to engage patients and collect complete smoking histories. Patients are likely to respond to a text survey using "gain" or "loss" framing to report detailed smoking histories. Optimizing an SMS text message approach to collect medical information has implications for preventative and follow-up clinical care beyond smoking histories, LCS, and smoking cessation therapy.
Affiliation(s)
- Lauren E Kearney
- The Pulmonary Center, Boston University, Boston, MA, United States
- Emily Jansen
- Department of Quality and Patient Safety, Boston Medical Center, Boston, MA, United States
- Katrina Steiling
- The Pulmonary Center, Boston University, Boston, MA, United States
- Kayla C Jones
- The Evan's Center for Implementation & Improvement Sciences, Boston University, Boston, MA, United States
- Allan Walkey
- The Pulmonary Center, Boston University, Boston, MA, United States
- The Evan's Center for Implementation & Improvement Sciences, Boston University, Boston, MA, United States
- Nicholas Cordella
- Department of Quality and Patient Safety, Boston Medical Center, Boston, MA, United States
7. Schnoor K, Versluis A, Chavannes NH, Talboom-Kamp EPWA. Digital Triage Tools for Sexually Transmitted Infection Testing Compared With General Practitioners' Advice: Vignette-Based Qualitative Study With Interviews Among General Practitioners. JMIR Hum Factors 2024;11:e49221. PMID: 38252474. PMCID: PMC10845018. DOI: 10.2196/49221.
Abstract
BACKGROUND Digital triage tools for sexually transmitted infection (STI) testing can potentially be used as a substitute for the triage that general practitioners (GPs) perform to lower their work pressure. The studied tool is based on medical guidelines. The same guidelines support GPs' decision-making process. However, research has shown that GPs make decisions from a holistic perspective and, therefore, do not always adhere to those guidelines. To have a high-quality digital triage tool that results in an efficient care process, it is important to learn more about GPs' decision-making process. OBJECTIVE The first objective was to identify whether the advice of the studied digital triage tool aligned with GPs' daily medical practice. The second objective was to learn which factors influence GPs' decisions regarding referral for diagnostic testing. In addition, this study provides insights into GPs' decision-making process. METHODS A qualitative vignette-based study using semistructured interviews was conducted. In total, 6 vignettes representing patient cases were discussed with the participants (GPs). Participants were asked to think aloud about whether they would advise an STI test for the patient and why. A thematic analysis was conducted on the transcripts of the interviews. The vignette patient cases were also passed through the digital triage tool, resulting in advice to test or not for an STI. A comparison was made between the advice of the tool and that of the participants. RESULTS In total, 10 interviews were conducted. Participants (GPs) had a mean age of 48.30 (SD 11.88) years. For 3 vignettes, the advice of the digital triage tool and of all participants was the same. In those vignettes, the patients' risk factors were sufficiently clear for the participants to advise the same as the digital tool. For 3 vignettes, the advice of the digital tool differed from that of the participants.
Patient-related factors that influenced the participants' decision-making process were the patient's anxiety, young age, and willingness to be tested. Participants would test at a lower threshold than the triage tool because of those factors. Sometimes, participants wanted more information than was provided in the vignette or would like to conduct a physical examination. These elements were not part of the digital triage tool. CONCLUSIONS The advice to conduct a diagnostic STI test differed between a digital triage tool and GPs. The digital triage tool considered only medical guidelines, whereas GPs were open to discussion reasoning from a holistic perspective. The GPs' decision-making process was influenced by patients' anxiety, willingness to be tested, and age. On the basis of these results, we believe that the digital triage tool for STI testing could support GPs and even replace consultations in the future. Further research must substantiate how this can be done safely.
Affiliation(s)
- Kyma Schnoor
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Anke Versluis
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Niels H Chavannes
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Esther P W A Talboom-Kamp
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Zuyderland, Sittard-Geleen, Netherlands
8. Wetzel AJ, Koch R, Koch N, Klemmt M, Müller R, Preiser C, Rieger M, Rösel I, Ranisch R, Ehni HJ, Joos S. 'Better see a doctor?' Status quo of symptom checker apps in Germany: A cross-sectional survey with a mixed-methods design (CHECK.APP). Digit Health 2024;10:20552076241231555. PMID: 38434790. PMCID: PMC10908232. DOI: 10.1177/20552076241231555.
Abstract
Background Symptom checker apps (SCAs) offer symptom classification and low-threshold self-triage for laypeople. They are already in use despite their poor accuracy and concerns that they may negatively affect primary care. This study assesses the extent to which SCAs are used by medical laypeople in Germany and which software is most popular. We examined associations between satisfaction with the general practitioner (GP) and SCA use, as well as between the number of GP visits and SCA use. Furthermore, we assessed the reasons for intentional non-use. Methods We conducted a survey comprising standardised and open-ended questions. Quantitative data were weighted, and open-ended responses were examined using thematic analysis. Results This study included 850 participants. The SCA usage rate was 8%, and approximately 50% of SCA non-users were uninterested in trying SCAs. The most commonly used SCAs were NetDoktor and Ada. Surprisingly, SCAs were most frequently used in the 51-55 years age group. No significant associations were found between SCA usage and satisfaction with the GP or between SCA usage and the number of GP visits. Thematic analysis revealed skepticism regarding the results and recommendations of SCAs and discrepancies between users' requirements and the features of the apps. Conclusion SCAs are still widely unknown in the German population and have been used sparsely so far. Many participants were not interested in trying SCAs, and we found no positive or negative associations between SCA use and primary care.
Affiliation(s)
- Anna-Jasmin Wetzel
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
- Roland Koch
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
- Nadine Koch
- Institute of Software Engineering, University of Stuttgart, Stuttgart, Germany
- Malte Klemmt
- Institute of Applied Social Science, University of Applied Science Würzburg-Schweinfurt, Würzburg, Germany
- Regina Müller
- Institute of Philosophy, University of Bremen, Bremen, Germany
- Christine Preiser
- Institute of Occupational and Social Medicine and Health Services Research, University Hospital Tübingen, Tübingen, Germany
- Monika Rieger
- Institute of Occupational and Social Medicine and Health Services Research, University Hospital Tübingen, Tübingen, Germany
- Inka Rösel
- Institute of Clinical Epidemiology and Applied Biometry, University Hospital Tübingen, Tübingen, Germany
- Robert Ranisch
- Faculty of Health Sciences, University of Potsdam, Potsdam, Germany
- Hans-Jörg Ehni
- Institute of Ethics and History of Medicine, University Hospital Tübingen, Tübingen, Germany
- Stefanie Joos
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
9. Peven K, Wickham AP, Wilks O, Kaplan YC, Marhol A, Ahmed S, Bamford R, Cunningham AC, Prentice C, Meczner A, Fenech M, Gilbert S, Klepchukova A, Ponzo S, Zhaunova L. Assessment of a Digital Symptom Checker Tool's Accuracy in Suggesting Reproductive Health Conditions: Clinical Vignettes Study. JMIR Mhealth Uhealth 2023;11:e46718. PMID: 38051574. PMCID: PMC10731551. DOI: 10.2196/46718.
Abstract
BACKGROUND Reproductive health conditions such as endometriosis, uterine fibroids, and polycystic ovary syndrome (PCOS) affect a large proportion of women and people who menstruate worldwide. Prevalence estimates for these conditions range from 5% to 40% of women of reproductive age. Long diagnostic delays, up to 12 years, are common and contribute to health complications and increased health care costs. Symptom checker apps provide users with information and tools to better understand their symptoms and thus have the potential to reduce the time to diagnosis for reproductive health conditions. OBJECTIVE This study aimed to evaluate the agreement between clinicians and 3 symptom checkers (developed by Flo Health UK Limited) in assessing symptoms of endometriosis, uterine fibroids, and PCOS using vignettes. We also aimed to present a robust example of vignette case creation, review, and classification in the context of predeployment testing and validation of digital health symptom checker tools. METHODS Independent general practitioners were recruited to create clinical case vignettes of simulated users for the purpose of testing each condition symptom checker; vignettes created for each condition contained a mixture of condition-positive and condition-negative outcomes. A second panel of general practitioners then reviewed, approved, and modified (if necessary) each vignette. A third group of general practitioners reviewed each vignette case and designated a final classification. Vignettes were then entered into the symptom checkers by a fourth, different group of general practitioners. The outcomes of each symptom checker were then compared with the final classification of each vignette to produce accuracy metrics including percent agreement, sensitivity, specificity, positive predictive value, and negative predictive value. RESULTS A total of 24 cases were created per condition. 
Overall, exact matches between the vignette general practitioner classification and the symptom checker outcome were 83% (n=20) for endometriosis, 83% (n=20) for uterine fibroids, and 88% (n=21) for PCOS. For each symptom checker, sensitivity was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 100% for PCOS; specificity was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 75% for PCOS; positive predictive value was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 80% for PCOS; and negative predictive value was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 100% for PCOS. CONCLUSIONS The single-condition symptom checkers have high levels of agreement with general practitioner classification for endometriosis, uterine fibroids, and PCOS. Given long delays in diagnosis for many reproductive health conditions, which lead to increased medical costs and potential health complications for individuals and health care providers, innovative health apps and symptom checkers hold the potential to improve care pathways.
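All of the accuracy metrics reported above derive from a 2x2 confusion matrix of symptom checker outcome versus general practitioner classification. As an illustration (the counts below are a reconstruction consistent with the endometriosis figures, not counts published in the paper), 9 true positives, 2 false positives, 2 false negatives, and 11 true negatives across 24 vignettes reproduce the reported values:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard agreement metrics from a 2x2 confusion matrix."""
    return {
        "agreement": (tp + tn) / (tp + fp + fn + tn),  # exact-match rate
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "specificity": tn / (tn + fp),                 # true negative rate
        "ppv": tp / (tp + fp),                         # positive predictive value
        "npv": tn / (tn + fn),                         # negative predictive value
    }

# counts consistent with the endometriosis figures above (24 vignettes)
m = diagnostic_metrics(tp=9, fp=2, fn=2, tn=11)
print({k: round(v * 100, 1) for k, v in m.items()})
```

Note that with these counts agreement is 20/24 (83%), sensitivity and PPV are both 9/11 (81.8%), and specificity and NPV are both 11/13 (84.6%), matching the reported endometriosis row.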
Affiliation(s)
- Stephen Gilbert
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Sonia Ponzo
- Flo Health UK Limited, London, United Kingdom
10. Bushuven S, Bentele M, Bentele S, Gerber B, Bansbach J, Ganter J, Trifunovic-Koenig M, Ranisch R. "ChatGPT, Can You Help Me Save My Child's Life?" - Diagnostic Accuracy and Supportive Capabilities to Lay Rescuers by ChatGPT in Prehospital Basic Life Support and Paediatric Advanced Life Support Cases - An In-silico Analysis. J Med Syst 2023;47:123. PMID: 37987870. PMCID: PMC10663183. DOI: 10.1007/s10916-023-02019-x.
Abstract
BACKGROUND Paediatric emergencies are challenging for healthcare workers, first aiders, and parents waiting for emergency medical services to arrive. With the expected rise of virtual assistants, people will likely seek help from such digital AI tools, especially in regions lacking emergency medical services. Large language models like ChatGPT have proved effective in providing health-related information and are competent in medical exams but are questioned regarding patient safety. Currently, there is no information on ChatGPT's performance in supporting parents during paediatric emergencies that require emergency medical services. This study tested ChatGPT and GPT-4 on 20 paediatric and two basic life support case vignettes to assess their performance and safety for children. METHODS We provided each case three times to the two models, ChatGPT and GPT-4, and assessed the diagnostic accuracy, emergency call advice, and the validity of the advice given to parents. RESULTS Both models recognized the emergency in the cases, except for septic shock and pulmonary embolism, and identified the correct diagnosis in 94% of cases. However, ChatGPT/GPT-4 reliably advised calling emergency services in only 12 of 22 cases (54%), gave correct first aid instructions in 9 cases (45%), and incorrectly advised advanced life support techniques to parents in 3 of 22 cases (13.6%). CONCLUSION Given these results for the recent ChatGPT versions, the validity, reliability, and thus safety of ChatGPT/GPT-4 as an emergency support tool are questionable. However, whether humans would perform better in the same situation is uncertain. Moreover, other studies have shown that human emergency call operators are also inaccurate, partly with worse performance than ChatGPT/GPT-4 in our study.
However, a main limitation of the study is that we used prototypical cases; management may differ between urban and rural areas and between countries, indicating the need for further evaluation of the model's context sensitivity and adaptability. Nevertheless, ChatGPT and the new versions under development may be promising tools for assisting lay first responders, operators, and professionals in diagnosing a paediatric emergency. TRIAL REGISTRATION Not applicable.
Affiliation(s)
- Stefan Bushuven
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Department of Anesthesiology and Critical Care, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Institute for Medical Education, University Hospital, LMU Munich, Munich, Germany
- Michael Bentele
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Stefanie Bentele
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Bianka Gerber
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Joachim Bansbach
- Department of Anesthesiology and Critical Care, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Julian Ganter
- Department of Anesthesiology and Critical Care, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Robert Ranisch
- Faculty for Health Sciences Brandenburg, University of Potsdam, Potsdam, Germany

11
Hirosawa T, Mizuta K, Harada Y, Shimizu T. Comparative Evaluation of Diagnostic Accuracy Between Google Bard and Physicians. Am J Med 2023; 136:1119-1123.e18. [PMID: 37643659] [DOI: 10.1016/j.amjmed.2023.08.003]
Abstract
BACKGROUND In this study, we evaluated the diagnostic accuracy of Google Bard, a generative artificial intelligence (AI) platform. METHODS We searched published case reports from our department for difficult or uncommon case descriptions and used mock cases created by physicians for common case descriptions. We entered the case descriptions into the prompt of Google Bard to generate the top 10 differential-diagnosis lists. As in previous studies, other physicians created differential-diagnosis lists by reading the same clinical descriptions. RESULTS A total of 82 clinical descriptions (52 case reports and 30 mock cases) were used. The accuracy rates of Google Bard remained lower than those of physicians for the top 10 (56.1% vs 82.9%, P < .001), the top 5 (53.7% vs 78.0%, P = .002), and the top differential diagnosis (40.2% vs 64.6%, P = .003). Even within the specific context of case reports, physicians consistently outperformed Google Bard. For mock cases, the differential-diagnosis lists generated by Google Bard performed no differently from those of the physicians in the top 10 (80.0% vs 96.6%, P = .11) and the top 5 (76.7% vs 96.6%, P = .06), except for the top diagnosis (60.0% vs 90.0%, P = .02). CONCLUSION While physicians excelled overall, and particularly with case reports, Google Bard displayed comparable diagnostic performance in common cases. This suggests that Google Bard has room for further improvement and refinement in its diagnostic capabilities. Generative AIs, including Google Bard, are anticipated to become increasingly beneficial in augmenting diagnostic accuracy.
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Kazuya Mizuta
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan

12
de Koning E, van der Haas Y, Saguna S, Stoop E, Bosch J, Beeres S, Schalij M, Boogers M. AI Algorithm to Predict Acute Coronary Syndrome in Prehospital Cardiac Care: Retrospective Cohort Study. JMIR Cardio 2023; 7:e51375. [PMID: 37906226] [PMCID: PMC10646678] [DOI: 10.2196/51375]
Abstract
BACKGROUND Overcrowding of hospitals and emergency departments (EDs) is a growing problem. However, not all ED consultations are necessary. For example, 80% of patients in the ED with chest pain do not have an acute coronary syndrome (ACS). Artificial intelligence (AI) is useful in analyzing (medical) data and might aid health care workers in prehospital clinical decision-making before patients are presented to the hospital. OBJECTIVE The aim of this study was to develop an AI model able to predict ACS before patients visit the ED. The model retrospectively analyzed prehospital data acquired by emergency medical services' nurse paramedics. METHODS Patients presenting to the emergency medical services with symptoms suggestive of ACS between September 2018 and September 2020 were included. An AI model using a supervised text classification algorithm was developed to analyze data from all 7458 patients (mean age 68, SD 15 years; 54% men). Specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for control and intervention groups. First, a machine learning algorithm was chosen; the required features were then selected; the model was tested and improved through iterative evaluation and, in a further step, hyperparameter tuning; and finally, a method was selected to explain the final model. RESULTS The AI model had a specificity of 11% and a sensitivity of 99.5%, whereas usual care had a specificity of 1% and a sensitivity of 99.5%. The PPV of the AI model was 15% and the NPV was 99%. The PPV of usual care was 13% and the NPV was 94%. CONCLUSIONS The AI model was able to predict ACS based on retrospective data from the prehospital setting. It led to an increase in specificity (from 1% to 11%) and NPV (from 94% to 99%) compared with usual care, with similar sensitivity.
Due to the retrospective nature of this study and the singular focus on ACS it should be seen as a proof-of-concept. Other (possibly life-threatening) diagnoses were not analyzed. Future prospective validation is necessary before implementation.
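The abstract describes the model only as a supervised text classifier tuned for a safe, high-sensitivity rule-out of ACS. As a loose illustration of that trade-off, here is a toy keyword-scoring classifier; the terms, weights, and threshold are invented for this sketch and bear no relation to the study's actual model:

```python
# Toy illustration of high-sensitivity triage: score free-text prehospital
# notes for ACS-suggestive terms and only rule out below a low threshold.
# Keywords, weights, and threshold are invented; the study's real model was
# a supervised text classifier trained on 7458 prehospital records.
ACS_TERMS = {"chest pain": 2, "radiating": 1, "sweating": 1,
             "nausea": 1, "pressure": 1, "shortness of breath": 1}

def acs_risk_score(note: str) -> int:
    note = note.lower()
    return sum(w for term, w in ACS_TERMS.items() if term in note)

def predict_acs(note: str, threshold: int = 1) -> bool:
    # A low threshold favours sensitivity over specificity, mirroring the
    # study's goal of safely ruling out ACS before the ED visit.
    return acs_risk_score(note) >= threshold

print(predict_acs("crushing chest pain radiating to left arm"))        # True
print(predict_acs("twisted ankle while jogging, no other complaints"))  # False
```

Raising the threshold would trade sensitivity for specificity, which is exactly the balance the study reports (near-perfect sensitivity with modest specificity).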
Affiliation(s)
- Enrico de Koning
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands
- Esmee Stoop
- Clinical AI and Research lab, Leiden University Medical Center, Leiden, Netherlands
- Jan Bosch
- Research and Development, Regional Ambulance Service Hollands-Midden, Leiden, Netherlands
- Saskia Beeres
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands
- Martin Schalij
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands
- Mark Boogers
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands

13
Hirosawa T, Kawamura R, Harada Y, Mizuta K, Tokumasu K, Kaji Y, Suzuki T, Shimizu T. ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation. JMIR Med Inform 2023; 11:e48808. [PMID: 37812468] [PMCID: PMC10594139] [DOI: 10.2196/48808]
Abstract
BACKGROUND The diagnostic accuracy of differential diagnoses generated by artificial intelligence chatbots, including ChatGPT models, for complex clinical vignettes derived from general internal medicine (GIM) department case reports is unknown. OBJECTIVE This study aims to evaluate the accuracy of the differential-diagnosis lists generated by both third-generation ChatGPT (ChatGPT-3.5) and fourth-generation ChatGPT (ChatGPT-4) using case vignettes from case reports published by the Department of GIM of Dokkyo Medical University Hospital, Japan. METHODS We searched PubMed for case reports. Upon identification, physicians selected diagnostic cases, determined the final diagnosis, and condensed the cases into clinical vignettes. Physicians entered the clinical vignettes into the ChatGPT-3.5 and ChatGPT-4 prompts to generate the top 10 differential diagnoses. The ChatGPT models were not specially trained or further reinforced for this task. Three GIM physicians from other medical institutions created differential-diagnosis lists by reading the same clinical vignettes. We measured the rate of correct diagnosis within the top 10 differential-diagnosis lists, the top 5 differential-diagnosis lists, and the top diagnosis. RESULTS In total, 52 case reports were analyzed. The rates of correct diagnosis by ChatGPT-4 within the top 10 differential-diagnosis lists, top 5 differential-diagnosis lists, and top diagnosis were 83% (43/52), 81% (42/52), and 60% (31/52), respectively. The rates of correct diagnosis by ChatGPT-3.5 within the top 10 differential-diagnosis lists, top 5 differential-diagnosis lists, and top diagnosis were 73% (38/52), 65% (34/52), and 42% (22/52), respectively.
The rates of correct diagnosis by ChatGPT-4 were comparable to those by physicians within the top 10 (43/52, 83% vs 39/52, 75%, respectively; P=.47) and within the top 5 (42/52, 81% vs 35/52, 67%, respectively; P=.18) differential diagnosis lists and top diagnosis (31/52, 60% vs 26/52, 50%, respectively; P=.43) although the difference was not significant. The ChatGPT models' diagnostic accuracy did not significantly vary based on open access status or the publication date (before 2011 vs 2022). CONCLUSIONS This study demonstrates the potential diagnostic accuracy of differential diagnosis lists generated using ChatGPT-3.5 and ChatGPT-4 for complex clinical vignettes from case reports published by the GIM department. The rate of correct diagnoses within the top 10 and top 5 differential diagnosis lists generated by ChatGPT-4 exceeds 80%. Although derived from a limited data set of case reports from a single department, our findings highlight the potential utility of ChatGPT-4 as a supplementary tool for physicians, particularly for those affiliated with the GIM department. Further investigations should explore the diagnostic accuracy of ChatGPT by using distinct case materials beyond its training data. Such efforts will provide a comprehensive insight into the role of artificial intelligence in enhancing clinical decision-making.
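The top-1/top-5/top-10 rates reported in these studies are top-k accuracies: the fraction of cases whose final diagnosis appears within the first k entries of the generated list. A minimal sketch with hypothetical cases (the diagnoses below are invented for illustration):

```python
def topk_accuracy(predictions, truths, k):
    """Proportion of cases whose final diagnosis appears in the top-k list."""
    hits = sum(truth in preds[:k] for preds, truth in zip(predictions, truths))
    return hits / len(truths)

# Hypothetical three-case example, not taken from the study.
diff_lists = [
    ["pulmonary embolism", "pneumonia", "pericarditis"],
    ["giant cell arteritis", "migraine", "tension headache"],
    ["gout", "septic arthritis", "pseudogout"],
]
final_dx = ["pneumonia", "cluster headache", "gout"]

print(topk_accuracy(diff_lists, final_dx, k=1))  # 1 of 3 top diagnoses correct
print(topk_accuracy(diff_lists, final_dx, k=3))  # 2 of 3 matched within the top 3
```

Note that top-k accuracy is monotone in k, which is why the top-10 rates reported above are always at least as high as the top-5 and top-1 rates.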
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Kazuya Mizuta
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Kazuki Tokumasu
- Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan
- Yuki Kaji
- Department of General Medicine, International University of Health and Welfare Narita Hospital, Chiba, Japan
- Tomoharu Suzuki
- Department of Hospital Medicine, Urasoe General Hospital, Okinawa, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan

14
Fraser H, Crossland D, Bacher I, Ranney M, Madsen T, Hilliard R. Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study. JMIR Mhealth Uhealth 2023; 11:e49995. [PMID: 37788063] [PMCID: PMC10582809] [DOI: 10.2196/49995]
Abstract
BACKGROUND Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other health care workers. Symptom checkers (SCs) have been designed to improve diagnosis and triage (ie, which level of care to seek) by patients. OBJECTIVE The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems compared with the final emergency department (ED) diagnoses and physician reviews. METHODS We used previously collected, deidentified, self-report data from 40 patients presenting to an ED for care who used the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data were entered into ChatGPT versions 3.5 and 4.0 and WebMD by a research assistant blinded to diagnoses and triage. Diagnoses from all 4 systems were compared with the previously abstracted final diagnoses in the ED as well as with diagnoses and triage recommendations from three independent board-certified ED physicians who had blindly reviewed the self-report clinical data from Ada. Diagnostic accuracy was calculated as the proportion of the diagnoses from ChatGPT, Ada SC, WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). Triage accuracy was calculated as the number of recommendations from ChatGPT, WebMD, or Ada that agreed with at least 2 of the independent physicians or were rated "unsafe" or "too cautious." RESULTS Overall, 30 and 37 cases had sufficient data for diagnostic and triage analysis, respectively. The rate of top-1 diagnosis matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. 
The rate of top-3 diagnostic matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for physicians. The distribution of triage results for Ada was 62% (n=23) agree, 14% (n=5) unsafe, and 24% (n=9) too cautious; that for ChatGPT 3.5 was 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; that for ChatGPT 4.0 was 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and that for WebMD was 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher (P=.009) than that of Ada (14%). CONCLUSIONS ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy, but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.
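The triage scoring described above (agreement with at least 2 of 3 physician reviewers, with mismatches labelled "unsafe" or "too cautious") can be sketched as follows; the four-level ordinal scale is an assumption for illustration, not the study's exact instrument:

```python
from collections import Counter

def physician_reference(ratings):
    """Triage level agreed by at least 2 of the 3 physician reviewers, else None."""
    level, count = Counter(ratings).most_common(1)[0]
    return level if count >= 2 else None

# Assumed ordinal scale, lowest to highest acuity (illustrative only).
LEVELS = ["self-care", "primary care", "urgent care", "emergency"]

def classify_triage(tool_level, ref_level):
    # "unsafe" = tool under-triages relative to the physician reference;
    # "too cautious" = tool over-triages. Labels follow the study's
    # terminology; the ordinal comparison is this sketch's assumption.
    t, r = LEVELS.index(tool_level), LEVELS.index(ref_level)
    if t < r:
        return "unsafe"
    if t > r:
        return "too cautious"
    return "agree"

ref = physician_reference(["emergency", "emergency", "urgent care"])
print(classify_triage("self-care", ref))  # under-triage -> "unsafe"
```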
Affiliation(s)
- Hamish Fraser
- Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Department of Health Services, Policy and Practice, Brown University School of Public Health, Providence, RI, United States
- Daven Crossland
- Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Department of Epidemiology, Brown University School of Public Health, Providence, RI, United States
- Ian Bacher
- Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Megan Ranney
- School of Public Health, Yale University, New Haven, CT, United States
- Tracy Madsen
- Department of Epidemiology, Brown University School of Public Health, Providence, RI, United States
- Department of Emergency Medicine, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Ross Hilliard
- Department of Internal Medicine, Maine Medical Center, Portland, ME, United States

15
Wiedermann CJ, Mahlknecht A, Piccoliori G, Engl A. Redesigning Primary Care: The Emergence of Artificial-Intelligence-Driven Symptom Diagnostic Tools. J Pers Med 2023; 13:1379. [PMID: 37763147] [PMCID: PMC10532810] [DOI: 10.3390/jpm13091379]
Abstract
Modern healthcare is facing a juxtaposition of increasing patient demands owing to an aging population and a decreasing general practitioner workforce, leading to strained access to primary care. The coronavirus disease 2019 pandemic has emphasized the potential for alternative consultation methods, highlighting opportunities to minimize unnecessary care. This article discusses the role of artificial-intelligence-driven symptom checkers, particularly their efficiency, utility, and challenges in primary care. Based on a study conducted in Italian general practices, insights from both physicians and patients were gathered regarding this emergent technology, highlighting differences in perceived utility, user satisfaction, and potential challenges. While symptom checkers are seen as potential tools for addressing healthcare challenges, concerns regarding their accuracy and the potential for misdiagnosis persist. Patients generally viewed them positively, valuing their ease of use and the empowerment they provide in managing health. However, some general practitioners perceive these tools as challenges to their expertise. This article proposes that artificial-intelligence-based symptom checkers can optimize medical-history taking for the benefit of both general practitioners and patients, with potential enhancements in complex diagnostic tasks rather than routine diagnoses. It underscores the importance of carefully integrating digital innovations while preserving the essential human touch in healthcare. Symptom checkers offer promising solutions; ensuring their accuracy, reliability, and effective integration into primary care requires rigorous research, clinical guidance, and an understanding of varied user perceptions. Collaboration among technologists, clinicians, and patients is paramount for the successful evolution of digital tools in healthcare.
Affiliation(s)
- Christian J. Wiedermann
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy
- Department of Public Health, Medical Decision Making and HTA, University of Health Sciences, Medical Informatics and Technology-Tyrol, 6060 Hall, Austria
- Angelika Mahlknecht
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy
- Giuliano Piccoliori
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy
- Adolf Engl
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy

16
Mahlknecht A, Engl A, Piccoliori G, Wiedermann CJ. Supporting primary care through symptom checking artificial intelligence: a study of patient and physician attitudes in Italian general practice. BMC Prim Care 2023; 24:174. [PMID: 37661285] [PMCID: PMC10476397] [DOI: 10.1186/s12875-023-02143-0]
Abstract
BACKGROUND Rapid advancements in artificial intelligence (AI) have led to the adoption of AI-driven symptom checkers in primary care. This study aimed to evaluate both patients' and physicians' attitudes towards these tools in Italian general practice settings, focusing on their perceived utility, user satisfaction, and potential challenges. METHODS This feasibility study involved ten general practitioners (GPs) and patients visiting GP offices. Before their medical visit, patients used a chatbot-based symptom checker that conducted anamnestic screening for COVID-19 and ran a medical-history algorithm concerning the current medical problem. The entered data were forwarded to the GP as a medical-history aid. After the medical visit, physicians and patients each completed their respective evaluations. Additionally, physicians performed a final overall evaluation of the symptom checker after the conclusion of the practice phase. RESULTS Most patients had not used symptom checkers before. Overall, 49% of patients and 27% of physicians reported being rather or very satisfied with the symptom checker. The most frequent patient-reported reasons for satisfaction were ease of use, precise and comprehensive questions, perceived time-saving potential, and encouragement of self-reflection. Half of the patients would consider at-home use of the symptom checker for a first appraisal of health problems, to save time, to reduce unnecessary visits, and/or as an aid for the physician. Patients' attitudes towards the symptom checker were not significantly associated with age, sex, or level of education. Most patients (75%) and physicians (84%) indicated that the symptom checker had no effect on the duration of the medical visit. Only a few participants found the use of the symptom checker to be disruptive to the medical visit or its quality. CONCLUSIONS The findings suggest a positive reception of the symptom checker, albeit with differing focus between patients and physicians.
With the potential to be integrated further into primary care, these tools require meticulous clinical guidance to maximize their benefits. TRIAL REGISTRATION The study was not registered, as it did not include direct medical intervention on human participants.
Affiliation(s)
- Angelika Mahlknecht
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Adolf Engl
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Giuliano Piccoliori
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Christian Josef Wiedermann
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Department of Public Health, Medical Decision Making and HTA, University of Health Sciences, Medical Informatics and Technology, Eduard-Wallnöfer Place 1, 6060, Hall, Austria

17
Määttä J, Lindell R, Hayward N, Martikainen S, Honkanen K, Inkala M, Hirvonen P, Martikainen TJ. Diagnostic Performance, Triage Safety, and Usability of a Clinical Decision Support System Within a University Hospital Emergency Department: Algorithm Performance and Usability Study. JMIR Med Inform 2023; 11:e46760. [PMID: 37656018] [PMCID: PMC10501486] [DOI: 10.2196/46760]
Abstract
Background Computerized clinical decision support systems (CDSSs) are increasingly adopted in health care to optimize resources and streamline patient flow. However, they often lack scientific validation against standard medical care. Objective The purpose of this study was to assess the performance, safety, and usability of a CDSS in a university hospital emergency department setting in Kuopio, Finland. Methods Patients entering the emergency department were asked to voluntarily participate in this study. Patients aged 17 years or younger, patients with cognitive impairments, and patients who entered the unit in an ambulance or with the need for immediate care were excluded. Patients completed the CDSS web-based form and usability questionnaire when waiting for the triage nurse's evaluation. The CDSS data were anonymized and did not affect the patients' usual evaluation or treatment. Retrospectively, 2 medical doctors evaluated the urgency of each patient's condition by using the triage nurse's information, and urgent and nonurgent groups were created. The International Statistical Classification of Diseases, Tenth Revision diagnoses were collected from the electronic health records. Usability was assessed by using a positive version of the System Usability Scale questionnaire. Results In total, our analyses included 248 patients. Regarding urgency, the mean sensitivities were 85% and 19%, respectively, for urgent and nonurgent cases when assessing the performance of CDSS evaluations in comparison to that of physicians. The mean sensitivities were 85% and 35%, respectively, when comparing the evaluations between the two physicians. Our CDSS did not miss any cases that were evaluated to be emergencies by physicians; thus, all emergency cases evaluated by physicians were evaluated as either urgent cases or emergency cases by the CDSS. In differential diagnosis, the CDSS had an exact match accuracy of 45.5% (97/213). 
The usability was good, with a mean System Usability Scale score of 78.2 (SD 16.8). Conclusions In a university hospital emergency department setting with a large real-world population, our CDSS was found to be equally as sensitive in urgent patient cases as physicians and was found to have an acceptable differential diagnosis accuracy, with good usability. These results suggest that this CDSS can be safely assessed further in a real-world setting. A CDSS could accelerate triage by providing patient-provided data in advance of patients' initial consultations and categorize patient cases as urgent and nonurgent cases upon patients' arrival to the emergency department.
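The usability figure above is a System Usability Scale (SUS) score. For the positively worded SUS variant the study mentions, scoring is conventionally simplified to summing (rating - 1) over the 10 items and scaling the sum to 0-100; the study's exact scoring procedure is not given in the abstract, so this is a sketch of the published convention:

```python
def sus_score(responses):
    """
    Score a positively worded System Usability Scale questionnaire:
    10 items, each rated 1-5. Every item contributes (rating - 1),
    and the sum is scaled by 2.5 onto a 0-100 range. (This follows the
    published convention for the positive SUS; the study's exact
    implementation is an assumption here.)
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    return sum(r - 1 for r in responses) * 2.5

print(sus_score([5] * 10))                          # best possible score: 100.0
print(sus_score([4, 4, 5, 4, 3, 4, 4, 5, 4, 4]))    # 77.5, near the study's mean
```

A mean of 78.2 sits in the range usually interpreted as "good" usability on the SUS benchmark scale.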
Affiliation(s)
- Rony Lindell
- Klinik Healthcare Solutions Oy, Helsinki, Finland
- Nick Hayward
- Klinik Healthcare Solutions Oy, Helsinki, Finland
- Susanna Martikainen
- Department of Health and Social Management, University of Eastern Finland, Kuopio, Finland
- Katri Honkanen
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland
- Matias Inkala
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland
- Tero J Martikainen
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland

18
Kafke SD, Kuhlmey A, Schuster J, Blüher S, Czimmeck C, Zoellick JC, Grosse P. Can clinical decision support systems be an asset in medical education? An experimental approach. BMC Med Educ 2023; 23:570. [PMID: 37568144] [PMCID: PMC10416486] [DOI: 10.1186/s12909-023-04568-8]
Abstract
BACKGROUND Diagnostic accuracy is one of the major cornerstones of appropriate and successful medical decision-making. Clinical decision support systems (CDSSs) have recently been used to facilitate physicians' diagnostic considerations. However, to date, little is known about the potential assets of CDSSs for medical students in an educational setting. The purpose of our study was to explore the usefulness of CDSSs for medical students by assessing their diagnostic performance and the influence of such software on students' trust in their own diagnostic abilities. METHODS Based on paper cases, students had to diagnose two different patients using a CDSS and conventional methods (e.g., textbooks), respectively. Both patients had a common disease; in one setting the clinical presentation was typical (tonsillitis), whereas in the other (pulmonary embolism) the patient presented atypically. We used a 2x2x2 between- and within-subjects cluster-randomised controlled trial to assess diagnostic accuracy in medical students, also varying the order of the resources used (CDSS first or second). RESULTS Medical students in their 4th and 5th year performed equally well using conventional methods or the CDSS across the two cases (t(164) = 1.30; p = 0.197). Diagnostic accuracy and trust in the correct diagnosis were higher in the typical presentation condition than in the atypical presentation condition (t(85) = 19.97; p < .0001 and t(150) = 7.67; p < .0001). These results refute our main hypothesis that students diagnose more accurately when using conventional methods compared to the CDSS. CONCLUSIONS Medical students in their 4th and 5th year performed equally well in diagnosing two cases of common diseases with typical or atypical clinical presentations using conventional methods or a CDSS. Students were proficient in diagnosing a common disease with a typical presentation but underestimated their own factual knowledge in this scenario.
Also, students were aware of their own diagnostic limitations when presented with a challenging case with an atypical presentation for which the use of a CDSS seemingly provided no additional insights.
Affiliation(s)
- Sean D Kafke
- Adelheid Kuhlmey
- Johanna Schuster
- Stefan Blüher
- Constanze Czimmeck
- Jan C Zoellick
- Pascal Grosse
- All authors: Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
19
Kopka M, Scatturin L, Napierala H, Fürstenau D, Feufel MA, Balzer F, Schmieding ML. Characteristics of Users and Nonusers of Symptom Checkers in Germany: Cross-Sectional Survey Study. J Med Internet Res 2023; 25:e46231. [PMID: 37338970] [DOI: 10.2196/46231]
Abstract
BACKGROUND Previous studies have revealed that users of symptom checkers (SCs, apps that support self-diagnosis and self-triage) are predominantly female, are younger than average, and have higher levels of formal education. Little data are available for Germany, and no study has so far compared usage patterns with people's awareness of SCs and the perception of usefulness. OBJECTIVE We explored the sociodemographic and individual characteristics that are associated with the awareness, usage, and perceived usefulness of SCs in the German population. METHODS We conducted a cross-sectional online survey among 1084 German residents in July 2022 regarding personal characteristics and people's awareness and usage of SCs. Using random sampling from a commercial panel, we collected participant responses stratified by gender, state of residence, income, and age to reflect the German population. We analyzed the collected data exploratively. RESULTS Of all respondents, 16.3% (177/1084) were aware of SCs and 6.5% (71/1084) had used them before. Those aware of SCs were younger (mean 38.8, SD 14.6 years, vs mean 48.3, SD 15.7 years), were more often female (107/177, 60.5%, vs 453/907, 49.9%), and had higher formal education levels (eg, 72/177, 40.7%, vs 238/907, 26.2%, with a university/college degree) than those unaware. The same pattern applied to users compared to nonusers but disappeared when comparing users to nonusers who were aware of SCs. Among users, 40.8% (29/71) considered these tools useful. Those considering them useful reported higher self-efficacy (mean 4.21, SD 0.66, vs mean 3.63, SD 0.81, on a scale of 1-5) and a higher net household income (mean EUR 2591.63, SD EUR 1103.96 [mean US $2798.96, SD US $1192.28], vs mean EUR 1626.60, SD EUR 649.05 [mean US $1756.73, SD US $700.97]) than those who considered them not useful. More women (13/44, 29.5%) than men (4/26, 15.4%) considered SCs unhelpful.
CONCLUSIONS Concurring with studies from other countries, our findings show associations between sociodemographic characteristics and SC usage in a German sample: users were on average younger, of higher socioeconomic status, and more commonly female compared to nonusers. However, usage cannot be explained by sociodemographic differences alone. It rather seems that sociodemographics explain who is or is not aware of the technology, but those who are aware of SCs are equally likely to use them, independently of sociodemographic differences. Although in some groups (eg, people with anxiety disorder), more participants reported knowing and using SCs, they tended to perceive them as less useful. In other groups (eg, male participants), fewer respondents were aware of SCs, but those who used them perceived them to be more useful. Thus, SCs should be designed to fit specific user needs, and strategies should be developed to help reach individuals who could benefit but are not aware of SCs yet.
Affiliation(s)
- Marvin Kopka
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Lennart Scatturin
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Hendrik Napierala
- Institute of General Practice and Family Medicine, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Daniel Fürstenau
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Department of Business IT, IT University of Copenhagen, København, Denmark
- Markus A Feufel
- Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Felix Balzer
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Malte L Schmieding
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
20
Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study. Int J Environ Res Public Health 2023; 20:3378. [PMID: 36834073] [PMCID: PMC9967747] [DOI: 10.3390/ijerph20043378]
Abstract
The diagnostic accuracy of differential diagnoses generated by artificial intelligence (AI) chatbots, including the generative pretrained transformer 3 (GPT-3) chatbot (ChatGPT-3), is unknown. This study evaluated the accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical vignettes with common chief complaints. General internal medicine physicians created clinical cases, correct diagnoses, and five differential diagnoses for ten common chief complaints. The rate of correct diagnosis by ChatGPT-3 within the ten differential-diagnosis lists was 28/30 (93.3%). The rate of correct diagnosis by physicians was still superior to that by ChatGPT-3 within the five differential-diagnosis lists (98.3% vs. 83.3%, p = 0.03). The rate of correct diagnosis by physicians was also superior to that by ChatGPT-3 for the top diagnosis (93.3% vs. 53.3%, p < 0.001). The rate of consistent differential diagnoses among physicians within the ten differential-diagnosis lists generated by ChatGPT-3 was 62/88 (70.5%). In summary, this study demonstrates the high diagnostic accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical cases with common chief complaints. This suggests that AI chatbots such as ChatGPT-3 can generate a well-differentiated diagnosis list for common chief complaints, although the ranking within these lists can still be improved.
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi 321-0293, Japan
21
Pairon A, Philips H, Verhoeven V. A scoping review on the use and usefulness of online symptom checkers and triage systems: How to proceed? Front Med (Lausanne) 2023; 9:1040926. [PMID: 36687416] [PMCID: PMC9853165] [DOI: 10.3389/fmed.2022.1040926]
Abstract
Background Patients are increasingly turning to the Internet for health information. Numerous online symptom checkers and digital triage tools are currently available to the general public in an effort to meet this need, simultaneously acting as a demand management strategy to aid the overburdened health care system. The implementation of these services requires an evidence-based approach, warranting a review of the available literature on this rapidly evolving topic. Objective This scoping review aims to provide an overview of the current state of the art and identify research gaps through an analysis of the strengths and weaknesses of the presently available literature. Methods A systematic search strategy was formed and applied to six databases: Cochrane Library, NICE, DARE, NIHR, PubMed, and Web of Science. Data extraction was performed by two researchers according to a pre-established data charting methodology, allowing for a thematic analysis of the results. Results A total of 10,250 articles were identified, and 28 publications were found eligible for inclusion. Users of these tools are often younger, female, more highly educated, and more technologically literate, potentially widening the digital divide and affecting health equity. Triage algorithms remain risk-averse, which challenges their accuracy. Recent evolutions in algorithms have had varying degrees of success. Results on impact are highly variable, with potential effects on demand, accessibility of care, health literacy, and syndromic surveillance. Both patients and healthcare providers are generally positive about the technology and seem amenable to the advice given, but there are still improvements to be made toward a more patient-centered approach. The significant heterogeneity across studies and triage systems remains the primary challenge for the field, limiting the transferability of findings.
Conclusion The evidence included in this review is characterized by considerable variability in study design and outcomes, highlighting significant challenges for future research. An evolution toward more homogeneous methodologies, studies tailored to the intended setting, regulation and standardization of evaluations, and a patient-centered approach could benefit the field.
22
Kopka M, Feufel MA, Berner ES, Schmieding ML. How suitable are clinical vignettes for the evaluation of symptom checker apps? A test theoretical perspective. Digit Health 2023; 9:20552076231194929. [PMID: 37614591] [PMCID: PMC10444026] [DOI: 10.1177/20552076231194929]
Abstract
Objective To evaluate the ability of case vignettes to assess the performance of symptom checker applications and to suggest refinements to the methodology used in case vignette-based audit studies. Methods We re-analyzed the publicly available data of two prominent case vignette-based symptom checker audit studies by calculating common metrics of test theory. Furthermore, we developed a new metric, the Capability Comparison Score (CCS), which compares symptom checker capability while controlling for the difficulty of the set of cases each symptom checker evaluated. We then scrutinized whether applying test theory and the CCS altered the performance ranking of the investigated symptom checkers. Results In both studies, most symptom checkers changed their rank order when adjusting the triage capability for item difficulty (ID) with the CCS. The previously reported triage accuracies commonly overestimated the capability of symptom checkers because they did not account for the fact that symptom checkers tend to selectively appraise easier cases (i.e., with high ID values). Also, many case vignettes in both studies showed insufficient (very low and even negative) values of item-total correlation (ITC), suggesting that individual items or the composition of item sets are of low quality. Conclusions A test-theoretic perspective helps identify previously undetected threats to the validity of case vignette-based symptom checker assessments and provides guidance and specific metrics to improve the quality of case vignettes, in particular by controlling for the difficulty of the vignettes an app was (not) able to evaluate correctly. Such measures might prove more meaningful than accuracy alone for the competitive assessment of symptom checkers. Our approach helps elaborate and standardize the methodology used for appraising symptom checker capability, which, ultimately, may yield more reliable results.
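The test-theoretic quantities named in this abstract are standard item-analysis metrics: item difficulty (ID) is the proportion of correct appraisals a vignette receives, and item-total correlation (ITC) correlates each vignette's score with the total score on the remaining vignettes. A minimal sketch of how these could be computed for a vignette-based audit, using hypothetical response data (the study's Capability Comparison Score formula is not reproduced in the abstract and is not shown here):

```python
# Item analysis for a vignette-based symptom checker audit.
# rows = raters (symptom checkers or laypersons), columns = case vignettes;
# 1 = correct appraisal, 0 = incorrect. All data here are hypothetical.

def item_difficulty(matrix, item):
    """Proportion of correct responses on one vignette (high value = easy item)."""
    col = [row[item] for row in matrix]
    return sum(col) / len(col)

def item_total_correlation(matrix, item):
    """Pearson r between one item and the total score of the *other* items
    (corrected item-total correlation). Low or negative values flag vignettes
    that do not measure the same capability as the rest of the set."""
    x = [row[item] for row in matrix]
    y = [sum(row) - row[item] for row in matrix]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

responses = [  # 4 raters x 5 vignettes, hypothetical
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
]
print(item_difficulty(responses, 0))                   # 1.0 (a very easy vignette)
print(round(item_total_correlation(responses, 1), 2))  # 0.58
```

Vignettes with ID near 1.0 add little discriminating information, and vignettes with low or negative ITC are candidates for revision or removal, which is the kind of screening the authors advocate.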
Affiliation(s)
- Marvin Kopka
- Department of Psychology and Ergonomics (IPA), Division of Ergonomics, Technische Universität Berlin, Berlin, Germany
- Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Markus A Feufel
- Department of Psychology and Ergonomics (IPA), Division of Ergonomics, Technische Universität Berlin, Berlin, Germany
- Eta S Berner
- Department of Health Services Administration, University of Alabama at Birmingham, Birmingham, AL, USA
- Malte L Schmieding
- Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
|
23
|
North F, Jensen TB, Stroebel RJ, Nelson EM, Johnson BJ, Thompson MC, Pecina JL, Crum BA. Self-Triage Use, Subsequent Healthcare Utilization, and Diagnoses: A Retrospective Study of Process and Clinical Outcomes Following Self-Triage and Self-Scheduling for Ear or Hearing Symptoms. Health Serv Res Manag Epidemiol 2023; 10:23333928231168121. [PMID: 37101803] [PMCID: PMC10123887] [DOI: 10.1177/23333928231168121]
Abstract
Background Self-triage is becoming more widespread, but little is known about the people who are using online self-triage tools and their outcomes. For self-triage researchers, there are significant barriers to capturing subsequent healthcare outcomes. Our integrated healthcare system was able to capture subsequent healthcare utilization of individuals who used self-triage integrated with self-scheduling of provider visits. Methods We retrospectively examined healthcare utilization and diagnoses after patients had used self-triage and self-scheduling for ear or hearing symptoms. Outcomes and counts of office visits, telemedicine interactions, emergency department visits, and hospitalizations were captured. Diagnosis codes associated with subsequent provider visits were dichotomously categorized as being associated with ear or hearing concerns or not. Non-visit care encounters of patient-initiated messages, nurse triage calls, and clinical communications were also captured. Results For 2168 self-triage uses, we were able to capture subsequent healthcare encounters within 7 days of self-triage for 80.5% (1745/2168). Of the 1092 subsequent office visits with diagnoses, 83.1% (891/1092) were associated with relevant ear, nose, and throat diagnoses. Only 0.24% (4/1662) of patients with captured outcomes had a hospitalization within 7 days. Self-triage resulted in a self-scheduled office visit in 7.2% (126/1745) of uses. Self-scheduled office visits had significantly fewer combined non-visit care encounters per office visit (fewer combined nurse triage calls, patient messages, and clinical communication messages) than office visits that were not self-scheduled (-0.51; 95% CI, -0.72 to -0.29; P < .0001). Conclusion In an appropriate healthcare setting, self-triage outcomes can be captured for a high percentage of uses to examine safety, patient adherence to recommendations, and the efficiency of self-triage. With the ear or hearing self-triage, most uses had subsequent visit diagnoses relevant to ear or hearing, suggesting that most patients selected the appropriate self-triage pathway for their symptoms.
Affiliation(s)
- Frederick North
- Department of Medicine, Division of Community Internal Medicine, Geriatrics, and Palliative Care, Mayo Clinic, Rochester, MN, USA
- Teresa B Jensen
- Department of Family Medicine, Mayo Clinic, Rochester, MN, USA
- Robert J Stroebel
- Department of Medicine, Division of Community Internal Medicine, Geriatrics, and Palliative Care, Mayo Clinic, Rochester, MN, USA
- Elissa M Nelson
- Enterprise Office of Access Management, Mayo Clinic, Rochester, MN, USA
- Brenda J Johnson
- Enterprise Office of Access Management, Mayo Clinic, Rochester, MN, USA
- Brian A Crum
- Department of Neurology, Mayo Clinic, Rochester, MN, USA
24
Painter A, Hayhoe B, Riboli-Sasco E, El-Osta A. Online Symptom Checkers: Recommendations for a Vignette-Based Clinical Evaluation Standard. J Med Internet Res 2022; 24:e37408. [DOI: 10.2196/37408]
Abstract
The use of patient-facing online symptom checkers (OSCs) has expanded in recent years, but their accuracy, safety, and impact on patient behaviors and health care systems remain unclear. The lack of a standardized process of clinical evaluation has resulted in significant variation in approaches to OSC validation and evaluation. The aim of this paper is to characterize a set of congruent requirements for a standardized vignette-based clinical evaluation process for OSCs. Discrepancies in the findings of comparative studies to date suggest that different steps in OSC evaluation methodology can significantly influence outcomes. A standardized process with a clear specification for vignette-based clinical evaluation is urgently needed to guide developers and facilitate the objective comparison of OSCs. We propose 15 recommended requirements for an OSC evaluation standard. A third-party evaluation process and protocols for prospective real-world evidence studies should also be prioritized to quality-assure OSC assessment.
25
Napierala H, Kopka M, Altendorf MB, Bolanaki M, Schmidt K, Piper SK, Heintze C, Möckel M, Balzer F, Slagman A, Schmieding ML. Examining the impact of a symptom assessment application on patient-physician interaction among self-referred walk-in patients in the emergency department (AKUSYM): study protocol for a multi-center, randomized controlled, parallel-group superiority trial. Trials 2022; 23:791. [PMID: 36127742] [PMCID: PMC9490986] [DOI: 10.1186/s13063-022-06688-w]
Abstract
Background Due to the increasing use of online health information, symptom checkers have been developed to provide an individualized assessment of health complaints, potential diagnoses, and an urgency estimate. It is assumed that they support patient empowerment and have a positive impact on patient-physician interaction and satisfaction with care. In the emergency department (ED) in particular, symptom checkers could be integrated to bridge waiting times, and both patients and physicians could benefit from potential positive effects. Our study therefore aims to assess the impact of symptom assessment application (SAA) use, compared to no SAA use, on patient-physician interaction among self-referred walk-in patients in the ED. Methods In this multi-center, 1:1 randomized, controlled, parallel-group superiority trial, 440 self-referred adult walk-in patients with a non-urgent triage category will be recruited in three EDs in Berlin. Eligible participants in the intervention group will use a SAA directly after initial triage. The control group receives standard care without using a SAA. The primary endpoint is patients’ satisfaction with the patient-physician interaction, assessed by the Patient Satisfaction Questionnaire. Discussion The results of this trial could influence the implementation of SAA into acute care to improve satisfaction with the patient-physician interaction. Trial registration German Clinical Trials Registry DRKS00028598. Registered on 25.03.2022.
Affiliation(s)
- Hendrik Napierala
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of General Practice and Family Medicine, Charitéplatz 1, 10117, Berlin, Germany
- Marvin Kopka
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Straße des 17. Juni 135, 10623, Berlin, Germany
- Maria B Altendorf
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Myrto Bolanaki
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Konrad Schmidt
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of General Practice and Family Medicine, Charitéplatz 1, 10117, Berlin, Germany
- Jena University Hospital, Institute of General Practice and Family Medicine, Bachstr. 18, 07743, Jena, Germany
- Sophie K Piper
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, 10117, Berlin, Germany
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Christoph Heintze
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of General Practice and Family Medicine, Charitéplatz 1, 10117, Berlin, Germany
- Martin Möckel
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Felix Balzer
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- Anna Slagman
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Malte L Schmieding
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- docport Services GmbH, Tußmannstr. 75, 40477, Düsseldorf, Germany
|
26
|
Fraser HSF, Cohan G, Koehler C, Anderson J, Lawrence A, Pateña J, Bacher I, Ranney ML. Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study. JMIR Mhealth Uhealth 2022; 10:e38364. [PMID: 36121688] [PMCID: PMC9531004] [DOI: 10.2196/38364]
Abstract
Background Symptom checkers are clinical decision support apps for patients, used by tens of millions of people annually. They are designed to provide diagnostic and triage advice and assist users in seeking the appropriate level of care. Little evidence is available regarding their diagnostic and triage accuracy with direct use by patients for urgent conditions. Objective The aim of this study is to determine the diagnostic and triage accuracy and usability of a symptom checker in use by patients presenting to an emergency department (ED). Methods We recruited a convenience sample of English-speaking patients presenting for care in an urban ED. Each consenting patient used a leading symptom checker from Ada Health before the ED evaluation. Diagnostic accuracy was evaluated by comparing the symptom checker’s diagnoses, and those of 3 independent emergency physicians viewing the patient-entered symptom data, with the final diagnoses from the ED evaluation. The Ada diagnoses and triage were also critiqued by the independent physicians. The patients completed a usability survey based on the Technology Acceptance Model. Results A total of 40 (80%) of the 50 participants approached completed the symptom checker assessment and usability survey. Their mean age was 39.3 (SD 15.9; range 18-76) years, and they were 65% (26/40) female, 68% (27/40) White, 48% (19/40) Hispanic or Latino, and 13% (5/40) Black or African American. Some cases had missing data or lacked a clear ED diagnosis; 75% (30/40) were included in the analysis of diagnosis and 93% (37/40) in the analysis of triage. The sensitivity for at least one of the final ED diagnoses by Ada (based on its top 5 diagnoses) was 70% (95% CI 54%-86%), close to the mean sensitivity of 68.9% for the 3 physicians (on their top 3 diagnoses). The physicians fully agreed with the Ada triage decision in 62% (23/37) of cases and rated it safe but too cautious in 24% (9/37). The triage decision was rated unsafe and too risky in 22% (8/37) of cases by at least one physician, in 14% (5/37) of cases by at least two physicians, and in 5% (2/37) of cases by all 3 physicians. Usability was rated highly; participants agreed or strongly agreed with the 7 Technology Acceptance Model usability questions, with a mean score of 84.6%, although “satisfaction” and “enjoyment” were rated low. Conclusions This study provides preliminary evidence that a symptom checker can provide acceptable usability and diagnostic accuracy for patients with various urgent conditions. A total of 14% (5/37) of symptom checker triage recommendations were deemed unsafe and too risky by at least two physicians based on the symptoms recorded, similar to the results of studies on telephone and nurse triage. Larger studies are needed of diagnosis and triage performance with direct patient use in different clinical environments.
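The headline diagnostic metric in studies like this one is top-k sensitivity: the fraction of cases in which at least one final ED diagnosis appears in the checker's top k suggestions. A minimal sketch with hypothetical cases (the diagnosis strings and the matching-by-name simplification are illustrative only; real studies match diagnoses by clinical judgment, not string equality):

```python
# Top-k diagnostic sensitivity: did any final ED diagnosis appear in the
# symptom checker's top-k suggestion list? Data below are hypothetical and
# matching is by case-insensitive name only, a deliberate simplification.

def top_k_match(checker_diagnoses, final_diagnoses, k=5):
    """True if at least one final diagnosis is in the checker's top-k list."""
    top_k = {d.lower() for d in checker_diagnoses[:k]}
    return any(d.lower() in top_k for d in final_diagnoses)

def top_k_sensitivity(cases, k=5):
    """Fraction of cases with at least one top-k match."""
    hits = sum(top_k_match(c["checker"], c["final"], k) for c in cases)
    return hits / len(cases)

cases = [  # hypothetical vignettes
    {"checker": ["migraine", "tension headache", "sinusitis"],
     "final": ["migraine"]},
    {"checker": ["gastritis", "biliary colic", "pancreatitis"],
     "final": ["appendicitis"]},
    {"checker": ["asthma", "pneumonia", "bronchitis", "pulmonary embolism"],
     "final": ["pneumonia", "sepsis"]},
]
print(round(top_k_sensitivity(cases, k=5), 2))  # 0.67 (2 of 3 cases matched)
print(round(top_k_sensitivity(cases, k=1), 2))  # 0.33 (only the first case)
```

Reporting both a top-1 and a top-k figure, as diagnostic-accuracy studies commonly do, separates "was the diagnosis considered at all" from "was it ranked first".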
Affiliation(s)
- Hamish S F Fraser
- Brown Center for Biomedical Informatics, Warren Alpert Medical School, Brown University, Providence, RI, United States
- School of Public Health, Brown University, Providence, RI, United States
- Gregory Cohan
- Warren Alpert Medical School, Brown University, Providence, RI, United States
- Christopher Koehler
- Department of Emergency Medicine, Brown University, Providence, RI, United States
- Jared Anderson
- Department of Emergency Medicine, Brown University, Providence, RI, United States
- Alexis Lawrence
- Harvard Medical Faculty Physicians, Department of Emergency Medicine, St Luke's Hospital, New Bedford, MA, United States
- John Pateña
- Brown-Lifespan Center for Digital Health, Providence, RI, United States
- Ian Bacher
- Brown Center for Biomedical Informatics, Warren Alpert Medical School, Brown University, Providence, RI, United States
- Megan L Ranney
- School of Public Health, Brown University, Providence, RI, United States
- Department of Emergency Medicine, Brown University, Providence, RI, United States
- Brown-Lifespan Center for Digital Health, Providence, RI, United States
|
27
|
Nguyen H, Meczner A, Burslam-Dawe K, Hayhoe B. Triage Errors in Primary and Pre-Primary Care. J Med Internet Res 2022; 24:e37209. [PMID: 35749166] [PMCID: PMC9270711] [DOI: 10.2196/37209]
Abstract
Triage errors are a major concern in health care due to resulting harmful delays in treatments or inappropriate allocation of resources. With the increasing popularity of digital symptom checkers in pre–primary care settings, and amid claims that artificial intelligence outperforms doctors, the accuracy of triage by digital symptom checkers is ever more scrutinized. This paper examines the context and challenges of triage in primary care, pre–primary care, and emergency care, as well as reviews existing evidence on the prevalence of triage errors in all three settings. Implications for development, research, and practice are highlighted, and recommendations are made on how digital symptom checkers should be best positioned.
Affiliation(s)
- Hai Nguyen
- Your.MD Ltd, London, United Kingdom
- Health Services and Population Research, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- Benedict Hayhoe
- eConsult Ltd, London, United Kingdom
- Department of Primary Care, School of Public Health, Imperial College London, London, United Kingdom
|
28
|
Kopka M, Feufel MA, Balzer F, Schmieding ML. Triage Capability of Laypersons: Retrospective, Exploratory Analysis. JMIR Form Res 2022; 6:e38977. [PMID: 36222793] [PMCID: PMC9607917] [DOI: 10.2196/38977]
Abstract
Background: Although medical decision-making may be thought of as a task for health professionals, many decisions, including critical health-related decisions, are made by laypersons alone. Specifically, as the first step of most care episodes, it is the patient who decides whether and where to seek health care (triage). Overcautious self-assessments (ie, overtriaging) may lead to overutilization of health care facilities and overcrowded emergency departments, whereas imprudent decisions (ie, undertriaging) pose a risk to the patient's health. Recently, patient-facing decision support systems, commonly known as symptom checkers, have been developed to assist laypersons in these decisions.
Objective: The purpose of this study is to identify factors influencing laypersons' ability to self-triage and their risk averseness in self-triage decisions.
Methods: We analyzed publicly available data on 91 laypersons appraising 45 short fictitious patient descriptions (case vignettes; N=4095 appraisals). Using signal detection theory and descriptive and inferential statistics, we explored whether the type of medical decision laypersons face, their confidence in their decision, and sociodemographic factors influence their triage accuracy and the type of errors they make. We distinguished between 2 decisions: whether emergency care was required (decision 1) and whether self-care was sufficient (decision 2).
Results: The accuracy of detecting emergencies (decision 1) was higher (mean 82.2%, SD 5.9%) than that of deciding whether any type of medical care is required (decision 2; mean 75.9%, SD 5.25%; t(90)=8.4; P<.001; Cohen d=0.9). Sensitivity for decision 1 was lower (mean 67.5%, SD 16.4%) than its specificity (mean 89.6%, SD 8.6%), whereas sensitivity for decision 2 was higher (mean 90.5%, SD 8.3%) than its specificity (mean 46.7%, SD 15.95%). Female participants were more risk averse and overtriaged more often than male participants, but age and level of education showed no association with participants' risk averseness. Participants' triage accuracy was higher when they were certain about their appraisal (2114/3381, 62.5%) than when they were uncertain (378/714, 52.9%). However, most errors occurred when participants were certain of their decision (1267/1603, 79.0%). Participants were more often certain of their overtriage errors (mean 80.9%, SD 23.8%) than of their undertriage errors (mean 72.5%, SD 30.9%; t(89)=3.7; P<.001; d=0.39).
Conclusions: Our study suggests that laypersons are overcautious in deciding whether they require medical care at all, yet miss a considerable portion of emergencies. Our results further indicate that women are more risk averse than men in both types of decisions. Laypersons made most triage errors when they were certain of their own appraisal; thus, they might not follow or even seek advice (eg, from symptom checkers) in most instances where advice would be useful.
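The sensitivity and specificity figures reported in this abstract are standard signal-detection measures derived from a confusion matrix of layperson decisions against gold-standard vignette labels. A minimal sketch of that computation, using hypothetical appraisal data (the function `sensitivity_specificity` is illustrative, not code from the study):

```python
def sensitivity_specificity(decisions, gold):
    """Return (sensitivity, specificity) of binary decisions vs. gold labels.

    decisions, gold: sequences of booleans, where True marks the positive
    class (e.g. 'this vignette requires emergency care').
    """
    tp = sum(d and g for d, g in zip(decisions, gold))          # true positives
    fn = sum((not d) and g for d, g in zip(decisions, gold))    # missed emergencies (undertriage)
    tn = sum((not d) and (not g) for d, g in zip(decisions, gold))
    fp = sum(d and (not g) for d, g in zip(decisions, gold))    # false alarms (overtriage)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# Hypothetical appraisals of 8 case vignettes (True = emergency care required).
gold      = [True, True, True, True, False, False, False, False]
decisions = [True, True, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(decisions, gold)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")  # sensitivity=0.50 specificity=0.75
```

The asymmetry the study reports (low sensitivity but high specificity for emergencies) corresponds to many false negatives relative to false positives in this matrix.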
Affiliation(s)
- Marvin Kopka
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Markus A Feufel
- Division of Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Felix Balzer
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Malte L Schmieding
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
29
Kopka M, Schmieding ML, Rieger T, Roesler E, Balzer F, Feufel MA. Trust Me, I’m Not a Doctor! Determinants of Laypersons’ Trust in Medical Decision Aids: Experimental Study. JMIR Hum Factors 2022; 9:e35219. [PMID: 35503248 PMCID: PMC9115664 DOI: 10.2196/35219] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/25/2021] [Revised: 02/09/2022] [Accepted: 03/06/2022] [Indexed: 11/13/2022] Open
Affiliation(s)
- Marvin Kopka
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Malte L Schmieding
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Tobias Rieger
- Work, Engineering and Organizational Psychology, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Eileen Roesler
- Work, Engineering and Organizational Psychology, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Felix Balzer
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Markus A Feufel
- Division of Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany