1. Kachman MM, Brennan I, Oskvarek JJ, Waseem T, Pines JM. How artificial intelligence could transform emergency care. Am J Emerg Med 2024;81:40-46. PMID: 38663302. DOI: 10.1016/j.ajem.2024.04.024.
Abstract
Artificial intelligence (AI) in healthcare is the ability of a computer to perform tasks typically associated with clinical care (e.g. medical decision-making and documentation). AI will soon be integrated into an increasing number of healthcare applications, including elements of emergency department (ED) care. Here, we describe the basics of AI, various categories of its functions (including machine learning and natural language processing), and review emerging and potential future use cases for emergency care. For example, AI-assisted symptom checkers could help direct patients to the appropriate setting, models could assist in assigning triage levels, and ambient AI systems could document clinical encounters. AI could also help provide focused summaries of charts, summarize encounters for hand-offs, and create discharge instructions with an appropriate language and reading level. Additional use cases include medical decision making for decision rules, real-time models that predict clinical deterioration or sepsis, and efficient extraction of unstructured data for coding, billing, research, and quality initiatives. We discuss the potential transformative benefits of AI, as well as the concerns regarding its use (e.g. privacy, data accuracy, and the potential for changing the doctor-patient relationship).
Affiliation(s)
- Marika M Kachman
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, Virginia Hospital Center, Arlington, VA, United States of America
- Irina Brennan
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, Inova Alexandria Hospital, Alexandria, VA, United States of America
- Jonathan J Oskvarek
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, Summa Health, Akron, OH, United States of America
- Tayab Waseem
- Department of Emergency Medicine, George Washington University, Washington, DC, United States of America
- Jesse M Pines
- US Acute Care Solutions, Canton, OH, United States of America; Department of Emergency Medicine, George Washington University, Washington, DC, United States of America
2. Meczner A, Cohen N, Qureshi A, Reza M, Sutaria S, Blount E, Bagyura Z, Malak T. Controlling Inputter Variability in Vignette Studies Assessing Web-Based Symptom Checkers: Evaluation of Current Practice and Recommendations for Isolated Accuracy Metrics. JMIR Form Res 2024;8:e49907. PMID: 38820578. PMCID: PMC11179013. DOI: 10.2196/49907.
Abstract
BACKGROUND The rapid growth of web-based symptom checkers (SCs) is not matched by advances in quality assurance. Currently, there are no widely accepted criteria assessing SCs' performance. Vignette studies are widely used to evaluate SCs, measuring the accuracy of outcome. Accuracy behaves as a composite metric as it is affected by a number of individual SC- and tester-dependent factors. In contrast to clinical studies, vignette studies have a small number of testers. Hence, measuring accuracy alone in vignette studies may not provide a reliable assessment of performance due to tester variability. OBJECTIVE This study aims to investigate the impact of tester variability on the accuracy of outcome of SCs, using clinical vignettes. It further aims to investigate the feasibility of measuring isolated aspects of performance. METHODS Healthily's SC was assessed using 114 vignettes by 3 groups of 3 testers who processed vignettes with different instructions: free interpretation of vignettes (free testers), specified chief complaints (partially free testers), and specified chief complaints with strict instruction for answering additional symptoms (restricted testers). κ statistics were calculated to assess agreement of top outcome condition and recommended triage. Crude and adjusted accuracy was measured against a gold standard. Adjusted accuracy was calculated using only results of consultations identical to the vignette, following a review and selection process. A feasibility study for assessing symptom comprehension of SCs was performed using different variations of 51 chief complaints across 3 SCs. RESULTS Intertester agreement of most likely condition and triage was, respectively, 0.49 and 0.51 for the free tester group, 0.66 and 0.66 for the partially free group, and 0.72 and 0.71 for the restricted group. For the restricted group, accuracy ranged from 43.9% to 57% for individual testers, averaging 50.6% (SD 5.35%). Adjusted accuracy was 56.1%. 
Assessing symptom comprehension was feasible for all 3 SCs. Comprehension scores ranged from 52.9% to 68%. CONCLUSIONS We demonstrated that improving standardization of the vignette testing process significantly improves the agreement of outcome between testers. However, significant variability remained due to uncontrollable tester-dependent factors, reflected by varying outcome accuracy. Tester-dependent factors, combined with a small number of testers, limit the reliability and generalizability of outcome accuracy when used as a composite measure in vignette studies. Measuring and reporting different aspects of SC performance in isolation provides a more reliable assessment. We developed an adjusted accuracy measure using a review and selection process to assess data algorithm quality. In addition, we demonstrated that symptom comprehension with different input methods can be feasibly compared. Future studies reporting accuracy need to apply vignette testing standardization and isolated metrics.
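The intertester agreement figures reported above are κ statistics. As an illustrative sketch (the labels below are hypothetical and not taken from the study), Cohen's κ for two testers' top-outcome labels compares observed agreement against the agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(ca) | set(cb)
    p_e = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# hypothetical top-condition labels from two testers on 8 vignettes
t1 = ["flu", "uti", "flu", "gerd", "uti", "flu", "gerd", "uti"]
t2 = ["flu", "uti", "gerd", "gerd", "uti", "flu", "uti", "uti"]
print(round(cohens_kappa(t1, t2), 3))  # → 0.619
```

A κ of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the 0.49-0.72 range reported above spans moderate to substantial agreement.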
Affiliation(s)
- András Meczner
- Healthily, London, United Kingdom
- Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
- Zsolt Bagyura
- Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
3. Petrella RJ. The AI Future of Emergency Medicine. Ann Emerg Med 2024:S0196-0644(24)00043-X. PMID: 38795081. DOI: 10.1016/j.annemergmed.2024.01.031.
Abstract
In the coming years, artificial intelligence (AI) and machine learning will likely give rise to profound changes in the field of emergency medicine, and medicine more broadly. This article discusses these anticipated changes in terms of 3 overlapping yet distinct stages of AI development. It reviews some fundamental concepts in AI and explores their relation to clinical practice, with a focus on emergency medicine. In addition, it describes some of the applications of AI in disease diagnosis, prognosis, and treatment, as well as some of the practical issues that they raise, the barriers to their implementation, and some of the legal and regulatory challenges they create.
Affiliation(s)
- Robert J Petrella
- Emergency Departments, CharterCARE Health Partners, Providence and North Providence, RI; Emergency Department, Boston VA Medical Center, Boston, MA; Emergency Departments, Steward Health Care System, Boston and Methuen, MA; Harvard Medical School, Boston, MA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA; Department of Medicine, Brigham and Women's Hospital, Boston, MA
4. Harada Y, Sakamoto T, Sugimoto S, Shimizu T. Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study. JMIR Form Res 2024;8:e53985. PMID: 38758588. PMCID: PMC11143391. DOI: 10.2196/53985.
Abstract
BACKGROUND Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited. OBJECTIVE This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world. METHODS This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We only included patients who underwent an AI-based symptom checkup at the index visit, and the diagnosis was finally confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the final diagnosis in a list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). RESULTS A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. 
Overall, the differential diagnosis list created by the AI-based symptom checker included the final diagnosis in 172 of 381 cases (45.1%), and this accuracy did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list. CONCLUSIONS A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker implemented in real-world clinical practice settings showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.
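The year-to-year comparison above rests on a chi-square test over a hits-versus-misses contingency table. As a rough check (the table is reconstructed from the per-year counts reported above), the Pearson chi-square statistic can be computed directly:

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# correct vs. incorrect diagnosis lists per year, from the counts above
table = [[97, 122], [32, 40], [43, 47]]
print(round(chi2_statistic(table), 3))  # ≈ 0.33
```

With 2 degrees of freedom, a statistic of about 0.33 corresponds to the reported P=.85, i.e. no detectable change in accuracy across the 3 years.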
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Department of General Medicine, Nagano Chuo Hospital, Nagano, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Shu Sugimoto
- Department of Medicine (Neurology and Rheumatology), Shinshu University School of Medicine, Matsumoto, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
5. Müller R, Klemmt M, Koch R, Ehni HJ, Henking T, Langmann E, Wiesing U, Ranisch R. "That's just Future Medicine" - a qualitative study on users' experiences of symptom checker apps. BMC Med Ethics 2024;25:17. PMID: 38365749. PMCID: PMC10874001. DOI: 10.1186/s12910-024-01011-5.
Abstract
BACKGROUND Symptom checker apps (SCAs) are mobile or online applications for lay people that usually have two main functions: symptom analysis and recommendations. SCAs ask users questions about their symptoms via a chatbot, give a list of possible causes, and provide a recommendation, such as seeing a physician. However, it is unclear whether the actual performance of an SCA corresponds to the users' experiences. This qualitative study investigates the subjective perspectives of SCA users to close the empirical gap identified in the literature and answers the following main research question: How do individuals (healthy users and patients) experience the usage of SCAs, including their attitudes, expectations, motivations, and concerns regarding their SCA use? METHODS A qualitative interview study was chosen to clarify the relatively unknown experience of SCA use. Semi-structured qualitative interviews with SCA users were carried out by two researchers in tandem via video call. Qualitative content analysis was selected as the methodology for data analysis. RESULTS Fifteen interviews with SCA users were conducted and seven main categories identified: (1) Attitudes towards findings and recommendations, (2) Communication, (3) Contact with physicians, (4) Expectations (prior to use), (5) Motivations, (6) Risks, and (7) SCA use for others. CONCLUSIONS The aspects identified in the analysis emphasise the specific perspective of SCA users and, at the same time, the immense scope of different experiences. Moreover, the study reveals ethical issues, such as relational aspects, that are often overlooked in debates on mHealth. More empirical and ethical research is needed, as awareness of the subjective experience of those affected is an essential component of the responsible development and implementation of health apps such as SCAs. TRIAL REGISTRATION German Clinical Trials Register (DRKS): DRKS00022465. 07/08/2020.
Affiliation(s)
- Regina Müller
- Institute of Philosophy, University Bremen, Bremen, Germany
- Malte Klemmt
- Institute of General Practice and Palliative Care, Hannover Medical School, Hannover, Germany
- Roland Koch
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
- Hans-Jörg Ehni
- Institute of Ethics and History of Medicine, University Tübingen, Tübingen, Germany
- Tanja Henking
- Institute of Applied Social Science, University of Applied Science Würzburg-Schweinfurt, Würzburg, Germany
- Elisabeth Langmann
- Institute of Ethics and History of Medicine, University Tübingen, Tübingen, Germany
- Urban Wiesing
- Institute of Ethics and History of Medicine, University Tübingen, Tübingen, Germany
- Robert Ranisch
- Faculty of Health Science Brandenburg, University of Potsdam, Potsdam, Germany
6. Kearney LE, Jansen E, Kathuria H, Steiling K, Jones KC, Walkey A, Cordella N. Efficacy of Digital Outreach Strategies for Collecting Smoking Data: Pragmatic Randomized Trial. JMIR Form Res 2024;8:e50465. PMID: 38335012. PMCID: PMC10891497. DOI: 10.2196/50465.
Abstract
BACKGROUND Tobacco smoking is an important risk factor for disease, but inaccurate smoking history data in the electronic medical record (EMR) limits the reach of lung cancer screening (LCS) and tobacco cessation interventions. Patient-generated health data is a novel approach to documenting smoking history; however, the comparative effectiveness of different approaches is unclear. OBJECTIVE We designed a quality improvement intervention to evaluate the effectiveness of portal questionnaires compared to SMS text message-based surveys, to compare message frames, and to evaluate the completeness of patient-generated smoking histories. METHODS We randomly assigned patients aged between 50 and 80 years with a history of tobacco use who identified English as a preferred language and have never undergone LCS to receive an EMR portal questionnaire or a text survey. The portal questionnaire used a "helpfulness" message, while the text survey tested frame types informed by behavior economics ("gain," "loss," and "helpfulness") and nudge messaging. The primary outcome was the response rate for each modality and framing type. Completeness and consistency with documented structured smoking data were also evaluated. RESULTS Participants were more likely to respond to the text survey (191/1000, 19.1%) compared to the portal questionnaire (35/504, 6.9%). Across all text survey rounds, patients were less responsive to the "helpfulness" frame compared with the "gain" frame (odds ratio [OR] 0.29, 95% CI 0.09-0.91; P<.05) and "loss" frame (OR 0.32, 95% CI 11.8-99.4; P<.05). Compared to the structured data in the EMR, the patient-generated data were significantly more likely to be complete enough to determine LCS eligibility both compared to the portal questionnaire (OR 34.2, 95% CI 3.8-11.1; P<.05) and to the text survey (OR 6.8, 95% CI 3.8-11.1; P<.05). 
CONCLUSIONS We found that an approach using patient-generated data is a feasible way to engage patients and collect complete smoking histories. Patients are likely to respond to a text survey using "gain" or "loss" framing to report detailed smoking histories. Optimizing an SMS text message approach to collect medical information has implications for preventative and follow-up clinical care beyond smoking histories, LCS, and smoking cessation therapy.
Affiliation(s)
- Lauren E Kearney
- The Pulmonary Center, Boston University, Boston, MA, United States
- Emily Jansen
- Department of Quality and Patient Safety, Boston Medical Center, Boston, MA, United States
- Katrina Steiling
- The Pulmonary Center, Boston University, Boston, MA, United States
- Kayla C Jones
- The Evan's Center for Implementation & Improvement Sciences, Boston University, Boston, MA, United States
- Allan Walkey
- The Pulmonary Center, Boston University, Boston, MA, United States
- The Evan's Center for Implementation & Improvement Sciences, Boston University, Boston, MA, United States
- Nicholas Cordella
- Department of Quality and Patient Safety, Boston Medical Center, Boston, MA, United States
7. Schnoor K, Versluis A, Chavannes NH, Talboom-Kamp EPWA. Digital Triage Tools for Sexually Transmitted Infection Testing Compared With General Practitioners' Advice: Vignette-Based Qualitative Study With Interviews Among General Practitioners. JMIR Hum Factors 2024;11:e49221. PMID: 38252474. PMCID: PMC10845018. DOI: 10.2196/49221.
Abstract
BACKGROUND Digital triage tools for sexually transmitted infection (STI) testing can potentially be used as a substitute for the triage that general practitioners (GPs) perform to lower their work pressure. The studied tool is based on medical guidelines. The same guidelines support GPs' decision-making process. However, research has shown that GPs make decisions from a holistic perspective and, therefore, do not always adhere to those guidelines. To have a high-quality digital triage tool that results in an efficient care process, it is important to learn more about GPs' decision-making process. OBJECTIVE The first objective was to identify whether the advice of the studied digital triage tool aligned with GPs' daily medical practice. The second objective was to learn which factors influence GPs' decisions regarding referral for diagnostic testing. In addition, this study provides insights into GPs' decision-making process. METHODS A qualitative vignette-based study using semistructured interviews was conducted. In total, 6 vignettes representing patient cases were discussed with the participants (GPs). Participants were asked to think aloud about whether they would advise an STI test for the patient and why. A thematic analysis was conducted on the transcripts of the interviews. The vignette patient cases were also passed through the digital triage tool, resulting in advice to test or not for an STI. A comparison was made between the advice of the tool and that of the participants. RESULTS In total, 10 interviews were conducted. Participants (GPs) had a mean age of 48.30 (SD 11.88) years. For 3 vignettes, the advice of the digital triage tool and of all participants was the same. In those vignettes, the patients' risk factors were sufficiently clear for the participants to advise the same as the digital tool. For 3 vignettes, the advice of the digital tool differed from that of the participants.
Patient-related factors that influenced the participants' decision-making process were the patient's anxiety, young age, and willingness to be tested. Participants would test at a lower threshold than the triage tool because of those factors. Sometimes, participants wanted more information than was provided in the vignette or would like to conduct a physical examination. These elements were not part of the digital triage tool. CONCLUSIONS The advice to conduct a diagnostic STI test differed between a digital triage tool and GPs. The digital triage tool considered only medical guidelines, whereas GPs were open to discussion reasoning from a holistic perspective. The GPs' decision-making process was influenced by patients' anxiety, willingness to be tested, and age. On the basis of these results, we believe that the digital triage tool for STI testing could support GPs and even replace consultations in the future. Further research must substantiate how this can be done safely.
Affiliation(s)
- Kyma Schnoor
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Anke Versluis
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Niels H Chavannes
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Esther P W A Talboom-Kamp
- Public Health and Primary Care, Leiden University Medical Center, Leiden, Netherlands
- National eHealth Living Lab, Leiden University Medical Center, Leiden, Netherlands
- Zuyderland, Sittard-Geleen, Netherlands
8. Wetzel AJ, Koch R, Koch N, Klemmt M, Müller R, Preiser C, Rieger M, Rösel I, Ranisch R, Ehni HJ, Joos S. 'Better see a doctor?' Status quo of symptom checker apps in Germany: A cross-sectional survey with a mixed-methods design (CHECK.APP). Digit Health 2024;10:20552076241231555. PMID: 38434790. PMCID: PMC10908232. DOI: 10.1177/20552076241231555.
Abstract
Background Symptom checker apps (SCAs) offer symptom classification and low-threshold self-triage for laypeople. They are already in use despite their poor accuracy and concerns that they may negatively affect primary care. This study assesses the extent to which SCAs are used by medical laypeople in Germany and which software is most popular. We examined associations between satisfaction with the general practitioner (GP) and SCA use, as well as between the number of GP visits and SCA use. Furthermore, we assessed the reasons for intentional non-use. Methods We conducted a survey comprising standardised and open-ended questions. Quantitative data were weighted, and open-ended responses were examined using thematic analysis. Results This study included 850 participants. The SCA usage rate was 8%, and approximately 50% of SCA non-users were uninterested in trying SCAs. The most commonly used SCAs were NetDoktor and Ada. Surprisingly, SCAs were most frequently used in the 51-55 years age group. No significant associations were found between SCA usage and satisfaction with the GP or between SCA usage and the number of GP visits. Thematic analysis revealed skepticism regarding the results and recommendations of SCAs and discrepancies between users' requirements and the features of the apps. Conclusion SCAs are still widely unknown in the German population and have been used sparsely so far. Many participants were not interested in trying SCAs, and we found no positive or negative associations between SCA use and primary care.
Affiliation(s)
- Anna-Jasmin Wetzel
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
- Roland Koch
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
- Nadine Koch
- Institute of Software Engineering, University of Stuttgart, Stuttgart, Germany
- Malte Klemmt
- Institute of Applied Social Science, University of Applied Science Würzburg-Schweinfurt, Würzburg, Germany
- Regina Müller
- Institute of Philosophy, University of Bremen, Bremen, Germany
- Christine Preiser
- Institute of Occupational and Social Medicine and Health Services Research, University Hospital Tübingen, Tübingen, Germany
- Monika Rieger
- Institute of Occupational and Social Medicine and Health Services Research, University Hospital Tübingen, Tübingen, Germany
- Inka Rösel
- Institute of Clinical Epidemiology and Applied Biometry, University Hospital Tübingen, Tübingen, Germany
- Robert Ranisch
- Faculty of Health Sciences, University of Potsdam, Potsdam, Germany
- Hans-Jörg Ehni
- Institute of Ethics and History of Medicine, University Hospital Tübingen, Tübingen, Germany
- Stefanie Joos
- Institute of General Practice and Interprofessional Care, University Hospital Tübingen, Tübingen, Germany
9. Peven K, Wickham AP, Wilks O, Kaplan YC, Marhol A, Ahmed S, Bamford R, Cunningham AC, Prentice C, Meczner A, Fenech M, Gilbert S, Klepchukova A, Ponzo S, Zhaunova L. Assessment of a Digital Symptom Checker Tool's Accuracy in Suggesting Reproductive Health Conditions: Clinical Vignettes Study. JMIR Mhealth Uhealth 2023;11:e46718. PMID: 38051574. PMCID: PMC10731551. DOI: 10.2196/46718.
Abstract
BACKGROUND Reproductive health conditions such as endometriosis, uterine fibroids, and polycystic ovary syndrome (PCOS) affect a large proportion of women and people who menstruate worldwide. Prevalence estimates for these conditions range from 5% to 40% of women of reproductive age. Long diagnostic delays, up to 12 years, are common and contribute to health complications and increased health care costs. Symptom checker apps provide users with information and tools to better understand their symptoms and thus have the potential to reduce the time to diagnosis for reproductive health conditions. OBJECTIVE This study aimed to evaluate the agreement between clinicians and 3 symptom checkers (developed by Flo Health UK Limited) in assessing symptoms of endometriosis, uterine fibroids, and PCOS using vignettes. We also aimed to present a robust example of vignette case creation, review, and classification in the context of predeployment testing and validation of digital health symptom checker tools. METHODS Independent general practitioners were recruited to create clinical case vignettes of simulated users for the purpose of testing each condition symptom checker; vignettes created for each condition contained a mixture of condition-positive and condition-negative outcomes. A second panel of general practitioners then reviewed, approved, and modified (if necessary) each vignette. A third group of general practitioners reviewed each vignette case and designated a final classification. Vignettes were then entered into the symptom checkers by a fourth, different group of general practitioners. The outcomes of each symptom checker were then compared with the final classification of each vignette to produce accuracy metrics including percent agreement, sensitivity, specificity, positive predictive value, and negative predictive value. RESULTS A total of 24 cases were created per condition. 
Overall, exact matches between the vignette general practitioner classification and the symptom checker outcome were 83% (n=20) for endometriosis, 83% (n=20) for uterine fibroids, and 88% (n=21) for PCOS. For each symptom checker, sensitivity was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 100% for PCOS; specificity was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 75% for PCOS; positive predictive value was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 80% for PCOS; and negative predictive value was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 100% for PCOS. CONCLUSIONS The single-condition symptom checkers have high levels of agreement with general practitioner classification for endometriosis, uterine fibroids, and PCOS. Given long delays in diagnosis for many reproductive health conditions, which lead to increased medical costs and potential health complications for individuals and health care providers, innovative health apps and symptom checkers hold the potential to improve care pathways.
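All of the accuracy metrics reported above derive from a 2x2 confusion matrix of symptom checker outcome versus general practitioner classification. As an illustration (the counts below are a reconstruction consistent with the endometriosis figures, not counts published in the paper), 9 true positives, 2 false positives, 2 false negatives, and 11 true negatives across 24 vignettes reproduce the reported values:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard agreement metrics from a 2x2 confusion matrix."""
    return {
        "agreement": (tp + tn) / (tp + fp + fn + tn),  # exact-match rate
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "specificity": tn / (tn + fp),                 # true negative rate
        "ppv": tp / (tp + fp),                         # positive predictive value
        "npv": tn / (tn + fn),                         # negative predictive value
    }

# counts consistent with the endometriosis figures above (24 vignettes)
m = diagnostic_metrics(tp=9, fp=2, fn=2, tn=11)
print({k: round(v * 100, 1) for k, v in m.items()})
```

Note that with these counts agreement is 20/24 (83%), sensitivity and PPV are both 9/11 (81.8%), and specificity and NPV are both 11/13 (84.6%), matching the reported endometriosis row.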
Affiliation(s)
- Stephen Gilbert
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Sonia Ponzo
- Flo Health UK Limited, London, United Kingdom
10. Bushuven S, Bentele M, Bentele S, Gerber B, Bansbach J, Ganter J, Trifunovic-Koenig M, Ranisch R. "ChatGPT, Can You Help Me Save My Child's Life?" - Diagnostic Accuracy and Supportive Capabilities to Lay Rescuers by ChatGPT in Prehospital Basic Life Support and Paediatric Advanced Life Support Cases - An In-silico Analysis. J Med Syst 2023;47:123. PMID: 37987870. PMCID: PMC10663183. DOI: 10.1007/s10916-023-02019-x.
Abstract
BACKGROUND Paediatric emergencies are challenging for healthcare workers, first aiders, and parents waiting for emergency medical services to arrive. With the expected rise of virtual assistants, people will likely seek help from such digital AI tools, especially in regions lacking emergency medical services. Large language models like ChatGPT have proved effective in providing health-related information and are competent in medical exams but are questioned regarding patient safety. Currently, there is no information on ChatGPT's performance in supporting parents during paediatric emergencies that require emergency medical services. This study tested ChatGPT and GPT-4 on 20 paediatric and two basic life support case vignettes to assess their performance and safety for children. METHODS We provided each case three times to the two models, ChatGPT and GPT-4, and assessed the diagnostic accuracy, emergency call advice, and the validity of the advice given to parents. RESULTS Both models recognized the emergency in the cases, except for septic shock and pulmonary embolism, and identified the correct diagnosis in 94% of cases. However, ChatGPT/GPT-4 reliably advised calling emergency services in only 12 of 22 cases (54%), gave correct first aid instructions in 9 cases (45%), and incorrectly advised advanced life support techniques to parents in 3 of 22 cases (13.6%). CONCLUSION Given these results for the recent ChatGPT versions, the validity, reliability, and thus safety of ChatGPT/GPT-4 as an emergency support tool are questionable. However, whether humans would perform better in the same situation is uncertain. Moreover, other studies have shown that human emergency call operators are also inaccurate, partly with worse performance than ChatGPT/GPT-4 in our study.
However, a main limitation of the study is that we used prototypical cases; management may differ between urban and rural areas and between countries, indicating the need for further evaluation of the model's context sensitivity and adaptability. Nevertheless, ChatGPT and the new versions under development may be promising tools for assisting lay first responders, operators, and professionals in diagnosing a paediatric emergency. TRIAL REGISTRATION Not applicable.
Affiliation(s)
- Stefan Bushuven
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Department of Anesthesiology and Critical Care, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Institute for Medical Education, University Hospital, LMU Munich, Munich, Germany
- Michael Bentele
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Stefanie Bentele
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Bianka Gerber
- Training Center for Emergency Medicine (NOTIS e.V), Breite Strasse 7, Engen, 78234, Germany
- Joachim Bansbach
- Department of Anesthesiology and Critical Care, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Julian Ganter
- Department of Anesthesiology and Critical Care, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Robert Ranisch
- Faculty for Health Sciences Brandenburg, University of Potsdam, Potsdam, Germany

11
Hirosawa T, Mizuta K, Harada Y, Shimizu T. Comparative Evaluation of Diagnostic Accuracy Between Google Bard and Physicians. Am J Med 2023; 136:1119-1123.e18. [PMID: 37643659] [DOI: 10.1016/j.amjmed.2023.08.003]
Abstract
BACKGROUND In this study, we evaluated the diagnostic accuracy of Google Bard, a generative artificial intelligence (AI) platform. METHODS We searched published case reports from our department for difficult or uncommon case descriptions and used mock cases created by physicians for common case descriptions. We entered the case descriptions into the prompt of Google Bard to generate the top 10 differential-diagnosis lists. As in previous studies, other physicians created differential-diagnosis lists by reading the same clinical descriptions. RESULTS A total of 82 clinical descriptions (52 case reports and 30 mock cases) were used. The accuracy rates of Google Bard remained lower than those of physicians for the top 10 (56.1% vs 82.9%, P < .001), the top 5 (53.7% vs 78.0%, P = .002), and the top differential diagnosis (40.2% vs 64.6%, P = .003). Even within the specific context of case reports, physicians consistently outperformed Google Bard. For mock cases, the differential-diagnosis lists generated by Google Bard performed no differently from those of the physicians in the top 10 (80.0% vs 96.6%, P = .11) and the top 5 (76.7% vs 96.6%, P = .06), except for the top diagnosis (60.0% vs 90.0%, P = .02). CONCLUSION While physicians excelled overall, and particularly with case reports, Google Bard displayed comparable diagnostic performance in common cases. This suggests that Google Bard has room for further improvement and refinement in its diagnostic capabilities. Generative AIs, including Google Bard, are anticipated to become increasingly beneficial in augmenting diagnostic accuracy.
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Kazuya Mizuta
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan

12
de Koning E, van der Haas Y, Saguna S, Stoop E, Bosch J, Beeres S, Schalij M, Boogers M. AI Algorithm to Predict Acute Coronary Syndrome in Prehospital Cardiac Care: Retrospective Cohort Study. JMIR Cardio 2023; 7:e51375. [PMID: 37906226] [PMCID: PMC10646678] [DOI: 10.2196/51375]
Abstract
BACKGROUND Overcrowding of hospitals and emergency departments (EDs) is a growing problem. However, not all ED consultations are necessary. For example, 80% of patients in the ED with chest pain do not have an acute coronary syndrome (ACS). Artificial intelligence (AI) is useful in analyzing (medical) data and might aid health care workers in prehospital clinical decision-making before patients are presented to the hospital. OBJECTIVE The aim of this study was to develop an AI model able to predict ACS before patients visit the ED. The model retrospectively analyzed prehospital data acquired by emergency medical services' nurse paramedics. METHODS Patients presenting to the emergency medical services with symptoms suggestive of ACS between September 2018 and September 2020 were included. An AI model using a supervised text classification algorithm was developed to analyze data from all 7458 patients (mean age 68, SD 15 years; 54% men). Specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for control and intervention groups. First, a machine learning algorithm was chosen; the required features were then selected; the model was tested and improved through iterative evaluation and, in a further step, hyperparameter tuning; and finally, a method was selected to explain the final model. RESULTS The AI model had a specificity of 11% and a sensitivity of 99.5%, whereas usual care had a specificity of 1% and a sensitivity of 99.5%. The PPV of the AI model was 15% and the NPV was 99%. The PPV of usual care was 13% and the NPV was 94%. CONCLUSIONS The AI model was able to predict ACS based on retrospective data from the prehospital setting. It led to an increase in specificity (from 1% to 11%) and NPV (from 94% to 99%) compared with usual care, with similar sensitivity.
Due to the retrospective nature of this study and the singular focus on ACS it should be seen as a proof-of-concept. Other (possibly life-threatening) diagnoses were not analyzed. Future prospective validation is necessary before implementation.
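The abstract describes the model only as a supervised text classifier tuned for a safe, high-sensitivity rule-out of ACS. As a loose illustration of that trade-off, here is a toy keyword-scoring classifier; the terms, weights, and threshold are invented for this sketch and bear no relation to the study's actual model:

```python
# Toy illustration of high-sensitivity triage: score free-text prehospital
# notes for ACS-suggestive terms and only rule out below a low threshold.
# Keywords, weights, and threshold are invented; the study's real model was
# a supervised text classifier trained on 7458 prehospital records.
ACS_TERMS = {"chest pain": 2, "radiating": 1, "sweating": 1,
             "nausea": 1, "pressure": 1, "shortness of breath": 1}

def acs_risk_score(note: str) -> int:
    note = note.lower()
    return sum(w for term, w in ACS_TERMS.items() if term in note)

def predict_acs(note: str, threshold: int = 1) -> bool:
    # A low threshold favours sensitivity over specificity, mirroring the
    # study's goal of safely ruling out ACS before the ED visit.
    return acs_risk_score(note) >= threshold

print(predict_acs("crushing chest pain radiating to left arm"))        # True
print(predict_acs("twisted ankle while jogging, no other complaints"))  # False
```

Raising the threshold would trade sensitivity for specificity, which is exactly the balance the study reports (near-perfect sensitivity with modest specificity).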
Affiliation(s)
- Enrico de Koning
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands
- Esmee Stoop
- Clinical AI and Research lab, Leiden University Medical Center, Leiden, Netherlands
- Jan Bosch
- Research and Development, Regional Ambulance Service Hollands-Midden, Leiden, Netherlands
- Saskia Beeres
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands
- Martin Schalij
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands
- Mark Boogers
- Cardiology Department, Leiden University Medical Center, Leiden, Netherlands

13
Hirosawa T, Kawamura R, Harada Y, Mizuta K, Tokumasu K, Kaji Y, Suzuki T, Shimizu T. ChatGPT-Generated Differential Diagnosis Lists for Complex Case-Derived Clinical Vignettes: Diagnostic Accuracy Evaluation. JMIR Med Inform 2023; 11:e48808. [PMID: 37812468] [PMCID: PMC10594139] [DOI: 10.2196/48808]
Abstract
BACKGROUND The diagnostic accuracy of differential diagnoses generated by artificial intelligence chatbots, including ChatGPT models, for complex clinical vignettes derived from general internal medicine (GIM) department case reports is unknown. OBJECTIVE This study aims to evaluate the accuracy of the differential-diagnosis lists generated by both third-generation ChatGPT (ChatGPT-3.5) and fourth-generation ChatGPT (ChatGPT-4) using case vignettes from case reports published by the Department of GIM of Dokkyo Medical University Hospital, Japan. METHODS We searched PubMed for case reports. Upon identification, physicians selected diagnostic cases, determined the final diagnosis, and condensed the cases into clinical vignettes. Physicians entered the clinical vignettes into the ChatGPT-3.5 and ChatGPT-4 prompts to generate the top 10 differential diagnoses. The ChatGPT models were not specially trained or further reinforced for this task. Three GIM physicians from other medical institutions created differential-diagnosis lists by reading the same clinical vignettes. We measured the rate of correct diagnosis within the top 10 differential-diagnosis lists, the top 5 differential-diagnosis lists, and the top diagnosis. RESULTS In total, 52 case reports were analyzed. The rates of correct diagnosis by ChatGPT-4 within the top 10 differential-diagnosis lists, top 5 differential-diagnosis lists, and top diagnosis were 83% (43/52), 81% (42/52), and 60% (31/52), respectively. The rates of correct diagnosis by ChatGPT-3.5 within the top 10 differential-diagnosis lists, top 5 differential-diagnosis lists, and top diagnosis were 73% (38/52), 65% (34/52), and 42% (22/52), respectively.
The rates of correct diagnosis by ChatGPT-4 were comparable to those by physicians within the top 10 (43/52, 83% vs 39/52, 75%, respectively; P=.47) and within the top 5 (42/52, 81% vs 35/52, 67%, respectively; P=.18) differential diagnosis lists and top diagnosis (31/52, 60% vs 26/52, 50%, respectively; P=.43) although the difference was not significant. The ChatGPT models' diagnostic accuracy did not significantly vary based on open access status or the publication date (before 2011 vs 2022). CONCLUSIONS This study demonstrates the potential diagnostic accuracy of differential diagnosis lists generated using ChatGPT-3.5 and ChatGPT-4 for complex clinical vignettes from case reports published by the GIM department. The rate of correct diagnoses within the top 10 and top 5 differential diagnosis lists generated by ChatGPT-4 exceeds 80%. Although derived from a limited data set of case reports from a single department, our findings highlight the potential utility of ChatGPT-4 as a supplementary tool for physicians, particularly for those affiliated with the GIM department. Further investigations should explore the diagnostic accuracy of ChatGPT by using distinct case materials beyond its training data. Such efforts will provide a comprehensive insight into the role of artificial intelligence in enhancing clinical decision-making.
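The top-1/top-5/top-10 rates reported in these studies are top-k accuracies: the fraction of cases whose final diagnosis appears within the first k entries of the generated list. A minimal sketch with hypothetical cases (the diagnoses below are invented for illustration):

```python
def topk_accuracy(predictions, truths, k):
    """Proportion of cases whose final diagnosis appears in the top-k list."""
    hits = sum(truth in preds[:k] for preds, truth in zip(predictions, truths))
    return hits / len(truths)

# Hypothetical three-case example, not taken from the study.
diff_lists = [
    ["pulmonary embolism", "pneumonia", "pericarditis"],
    ["giant cell arteritis", "migraine", "tension headache"],
    ["gout", "septic arthritis", "pseudogout"],
]
final_dx = ["pneumonia", "cluster headache", "gout"]

print(topk_accuracy(diff_lists, final_dx, k=1))  # 1 of 3 top diagnoses correct
print(topk_accuracy(diff_lists, final_dx, k=3))  # 2 of 3 matched within the top 3
```

Note that top-k accuracy is monotone in k, which is why the top-10 rates reported above are always at least as high as the top-5 and top-1 rates.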
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Ren Kawamura
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Kazuya Mizuta
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan
- Kazuki Tokumasu
- Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan
- Yuki Kaji
- Department of General Medicine, International University of Health and Welfare Narita Hospital, Chiba, Japan
- Tomoharu Suzuki
- Department of Hospital Medicine, Urasoe General Hospital, Okinawa, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi, Japan

14
Fraser H, Crossland D, Bacher I, Ranney M, Madsen T, Hilliard R. Comparison of Diagnostic and Triage Accuracy of Ada Health and WebMD Symptom Checkers, ChatGPT, and Physicians for Patients in an Emergency Department: Clinical Data Analysis Study. JMIR Mhealth Uhealth 2023; 11:e49995. [PMID: 37788063] [PMCID: PMC10582809] [DOI: 10.2196/49995]
Abstract
BACKGROUND Diagnosis is a core component of effective health care, but misdiagnosis is common and can put patients at risk. Diagnostic decision support systems can play a role in improving diagnosis by physicians and other health care workers. Symptom checkers (SCs) have been designed to improve diagnosis and triage (ie, which level of care to seek) by patients. OBJECTIVE The aim of this study was to evaluate the performance of the new large language model ChatGPT (versions 3.5 and 4.0), the widely used WebMD SC, and an SC developed by Ada Health in the diagnosis and triage of patients with urgent or emergent clinical problems compared with the final emergency department (ED) diagnoses and physician reviews. METHODS We used previously collected, deidentified, self-report data from 40 patients presenting to an ED for care who used the Ada SC to record their symptoms prior to seeing the ED physician. Deidentified data were entered into ChatGPT versions 3.5 and 4.0 and WebMD by a research assistant blinded to diagnoses and triage. Diagnoses from all 4 systems were compared with the previously abstracted final diagnoses in the ED as well as with diagnoses and triage recommendations from three independent board-certified ED physicians who had blindly reviewed the self-report clinical data from Ada. Diagnostic accuracy was calculated as the proportion of the diagnoses from ChatGPT, Ada SC, WebMD SC, and the independent physicians that matched at least one ED diagnosis (stratified as top 1 or top 3). Triage accuracy was calculated as the number of recommendations from ChatGPT, WebMD, or Ada that agreed with at least 2 of the independent physicians or were rated "unsafe" or "too cautious." RESULTS Overall, 30 and 37 cases had sufficient data for diagnostic and triage analysis, respectively. The rate of top-1 diagnosis matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 9 (30%), 12 (40%), 10 (33%), and 12 (40%), respectively, with a mean rate of 47% for the physicians. 
The rate of top-3 diagnostic matches for Ada, ChatGPT 3.5, ChatGPT 4.0, and WebMD was 19 (63%), 19 (63%), 15 (50%), and 17 (57%), respectively, with a mean rate of 69% for physicians. The distribution of triage results for Ada was 62% (n=23) agree, 14% (n=5) unsafe, and 24% (n=9) too cautious; that for ChatGPT 3.5 was 59% (n=22) agree, 41% (n=15) unsafe, and 0% (n=0) too cautious; that for ChatGPT 4.0 was 76% (n=28) agree, 22% (n=8) unsafe, and 3% (n=1) too cautious; and that for WebMD was 70% (n=26) agree, 19% (n=7) unsafe, and 11% (n=4) too cautious. The unsafe triage rate for ChatGPT 3.5 (41%) was significantly higher (P=.009) than that of Ada (14%). CONCLUSIONS ChatGPT 3.5 had high diagnostic accuracy but a high unsafe triage rate. ChatGPT 4.0 had the poorest diagnostic accuracy, but a lower unsafe triage rate and the highest triage agreement with the physicians. The Ada and WebMD SCs performed better overall than ChatGPT. Unsupervised patient use of ChatGPT for diagnosis and triage is not recommended without improvements to triage accuracy and extensive clinical evaluation.
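The triage scoring described above (agreement with at least 2 of 3 physician reviewers, with mismatches labelled "unsafe" or "too cautious") can be sketched as follows; the four-level ordinal scale is an assumption for illustration, not the study's exact instrument:

```python
from collections import Counter

def physician_reference(ratings):
    """Triage level agreed by at least 2 of the 3 physician reviewers, else None."""
    level, count = Counter(ratings).most_common(1)[0]
    return level if count >= 2 else None

# Assumed ordinal scale, lowest to highest acuity (illustrative only).
LEVELS = ["self-care", "primary care", "urgent care", "emergency"]

def classify_triage(tool_level, ref_level):
    # "unsafe" = tool under-triages relative to the physician reference;
    # "too cautious" = tool over-triages. Labels follow the study's
    # terminology; the ordinal comparison is this sketch's assumption.
    t, r = LEVELS.index(tool_level), LEVELS.index(ref_level)
    if t < r:
        return "unsafe"
    if t > r:
        return "too cautious"
    return "agree"

ref = physician_reference(["emergency", "emergency", "urgent care"])
print(classify_triage("self-care", ref))  # under-triage -> "unsafe"
```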
Affiliation(s)
- Hamish Fraser
- Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Department of Health Services, Policy and Practice, Brown University School of Public Health, Providence, RI, United States
- Daven Crossland
- Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Department of Epidemiology, Brown University School of Public Health, Providence, RI, United States
- Ian Bacher
- Brown Center for Biomedical Informatics, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Megan Ranney
- School of Public Health, Yale University, New Haven, CT, United States
- Tracy Madsen
- Department of Epidemiology, Brown University School of Public Health, Providence, RI, United States
- Department of Emergency Medicine, The Warren Alpert Medical School of Brown University, Providence, RI, United States
- Ross Hilliard
- Department of Internal Medicine, Maine Medical Center, Portland, ME, United States

15
Wiedermann CJ, Mahlknecht A, Piccoliori G, Engl A. Redesigning Primary Care: The Emergence of Artificial-Intelligence-Driven Symptom Diagnostic Tools. J Pers Med 2023; 13:1379. [PMID: 37763147] [PMCID: PMC10532810] [DOI: 10.3390/jpm13091379]
Abstract
Modern healthcare is facing a juxtaposition of increasing patient demands owing to an aging population and a decreasing general practitioner workforce, leading to strained access to primary care. The coronavirus disease 2019 pandemic has emphasized the potential for alternative consultation methods, highlighting opportunities to minimize unnecessary care. This article discusses the role of artificial-intelligence-driven symptom checkers, particularly their efficiency, utility, and challenges in primary care. Based on a study conducted in Italian general practices, insights from both physicians and patients were gathered regarding this emergent technology, highlighting differences in perceived utility, user satisfaction, and potential challenges. While symptom checkers are seen as potential tools for addressing healthcare challenges, concerns regarding their accuracy and the potential for misdiagnosis persist. Patients generally viewed them positively, valuing their ease of use and the empowerment they provide in managing health. However, some general practitioners perceive these tools as challenges to their expertise. This article proposes that artificial-intelligence-based symptom checkers can optimize medical-history taking for the benefit of both general practitioners and patients, with potential enhancements in complex diagnostic tasks rather than routine diagnoses. It underscores the importance of carefully integrating digital innovations while preserving the essential human touch in healthcare. Symptom checkers offer promising solutions; ensuring their accuracy, reliability, and effective integration into primary care requires rigorous research, clinical guidance, and an understanding of varied user perceptions. Collaboration among technologists, clinicians, and patients is paramount for the successful evolution of digital tools in healthcare.
Affiliation(s)
- Christian J. Wiedermann
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy
- Department of Public Health, Medical Decision Making and HTA, University of Health Sciences, Medical Informatics and Technology-Tyrol, 6060 Hall, Austria
- Angelika Mahlknecht
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy
- Giuliano Piccoliori
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy
- Adolf Engl
- Institute of General Practice and Public Health, Claudiana—College of Health Professions, 39100 Bolzano, Italy

16
Mahlknecht A, Engl A, Piccoliori G, Wiedermann CJ. Supporting primary care through symptom checking artificial intelligence: a study of patient and physician attitudes in Italian general practice. BMC Prim Care 2023; 24:174. [PMID: 37661285] [PMCID: PMC10476397] [DOI: 10.1186/s12875-023-02143-0]
Abstract
BACKGROUND Rapid advancements in artificial intelligence (AI) have led to the adoption of AI-driven symptom checkers in primary care. This study aimed to evaluate both patients' and physicians' attitudes towards these tools in Italian general practice settings, focusing on their perceived utility, user satisfaction, and potential challenges. METHODS This feasibility study involved ten general practitioners (GPs) and patients visiting GP offices. Before their medical visit, patients used a chatbot-based symptom checker that conducted anamnestic screening for COVID-19 and ran a medical-history algorithm concerning the current medical problem. The entered data were forwarded to the GP as a medical-history aid. After the medical visit, physicians and patients each completed their respective evaluations. Additionally, physicians performed a final overall evaluation of the symptom checker after the conclusion of the practice phase. RESULTS Most patients had not used symptom checkers before. Overall, 49% of patients and 27% of physicians reported being rather or very satisfied with the symptom checker. The most frequent patient-reported reasons for satisfaction were ease of use, precise and comprehensive questions, perceived time-saving potential, and encouragement of self-reflection. Half of the patients would consider at-home use of the symptom checker for a first appraisal of health problems, to save time, to reduce unnecessary visits, and/or as an aid for the physician. Patients' attitudes towards the symptom checker were not significantly associated with age, sex, or level of education. Most patients (75%) and physicians (84%) indicated that the symptom checker had no effect on the duration of the medical visit. Only a few participants found the use of the symptom checker to be disruptive to the medical visit or its quality. CONCLUSIONS The findings suggest a positive reception of the symptom checker, albeit with differing focus between patients and physicians.
With the potential to be integrated further into primary care, these tools require meticulous clinical guidance to maximize their benefits. TRIAL REGISTRATION The study was not registered, as it did not include direct medical intervention on human participants.
Affiliation(s)
- Angelika Mahlknecht
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Adolf Engl
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Giuliano Piccoliori
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Christian Josef Wiedermann
- Institute of General Practice and Public Health, College of Health Care Professions (Claudiana), Lorenz Böhler Street 13, 39100, Bolzano, Italy
- Department of Public Health, Medical Decision Making and HTA, University of Health Sciences, Medical Informatics and Technology, Eduard-Wallnöfer Place 1, 6060, Hall, Austria

17
Määttä J, Lindell R, Hayward N, Martikainen S, Honkanen K, Inkala M, Hirvonen P, Martikainen TJ. Diagnostic Performance, Triage Safety, and Usability of a Clinical Decision Support System Within a University Hospital Emergency Department: Algorithm Performance and Usability Study. JMIR Med Inform 2023; 11:e46760. [PMID: 37656018] [PMCID: PMC10501486] [DOI: 10.2196/46760]
Abstract
Background Computerized clinical decision support systems (CDSSs) are increasingly adopted in health care to optimize resources and streamline patient flow. However, they often lack scientific validation against standard medical care. Objective The purpose of this study was to assess the performance, safety, and usability of a CDSS in a university hospital emergency department setting in Kuopio, Finland. Methods Patients entering the emergency department were asked to voluntarily participate in this study. Patients aged 17 years or younger, patients with cognitive impairments, and patients who entered the unit in an ambulance or with the need for immediate care were excluded. Patients completed the CDSS web-based form and usability questionnaire when waiting for the triage nurse's evaluation. The CDSS data were anonymized and did not affect the patients' usual evaluation or treatment. Retrospectively, 2 medical doctors evaluated the urgency of each patient's condition by using the triage nurse's information, and urgent and nonurgent groups were created. The International Statistical Classification of Diseases, Tenth Revision diagnoses were collected from the electronic health records. Usability was assessed by using a positive version of the System Usability Scale questionnaire. Results In total, our analyses included 248 patients. Regarding urgency, the mean sensitivities were 85% and 19%, respectively, for urgent and nonurgent cases when assessing the performance of CDSS evaluations in comparison to that of physicians. The mean sensitivities were 85% and 35%, respectively, when comparing the evaluations between the two physicians. Our CDSS did not miss any cases that were evaluated to be emergencies by physicians; thus, all emergency cases evaluated by physicians were evaluated as either urgent cases or emergency cases by the CDSS. In differential diagnosis, the CDSS had an exact match accuracy of 45.5% (97/213). 
The usability was good, with a mean System Usability Scale score of 78.2 (SD 16.8). Conclusions In a university hospital emergency department setting with a large real-world population, our CDSS was found to be equally as sensitive in urgent patient cases as physicians and was found to have an acceptable differential diagnosis accuracy, with good usability. These results suggest that this CDSS can be safely assessed further in a real-world setting. A CDSS could accelerate triage by providing patient-provided data in advance of patients' initial consultations and categorize patient cases as urgent and nonurgent cases upon patients' arrival to the emergency department.
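The usability figure above is a System Usability Scale (SUS) score. For the positively worded SUS variant the study mentions, scoring is conventionally simplified to summing (rating - 1) over the 10 items and scaling the sum to 0-100; the study's exact scoring procedure is not given in the abstract, so this is a sketch of the published convention:

```python
def sus_score(responses):
    """
    Score a positively worded System Usability Scale questionnaire:
    10 items, each rated 1-5. Every item contributes (rating - 1),
    and the sum is scaled by 2.5 onto a 0-100 range. (This follows the
    published convention for the positive SUS; the study's exact
    implementation is an assumption here.)
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    return sum(r - 1 for r in responses) * 2.5

print(sus_score([5] * 10))                          # best possible score: 100.0
print(sus_score([4, 4, 5, 4, 3, 4, 4, 5, 4, 4]))    # 77.5, near the study's mean
```

A mean of 78.2 sits in the range usually interpreted as "good" usability on the SUS benchmark scale.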
Affiliation(s)
- Rony Lindell
- Klinik Healthcare Solutions Oy, Helsinki, Finland
- Nick Hayward
- Klinik Healthcare Solutions Oy, Helsinki, Finland
- Susanna Martikainen
- Department of Health and Social Management, University of Eastern Finland, Kuopio, Finland
- Katri Honkanen
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland
- Matias Inkala
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland
- Tero J Martikainen
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland

18
Kafke SD, Kuhlmey A, Schuster J, Blüher S, Czimmeck C, Zoellick JC, Grosse P. Can clinical decision support systems be an asset in medical education? An experimental approach. BMC Med Educ 2023; 23:570. [PMID: 37568144] [PMCID: PMC10416486] [DOI: 10.1186/s12909-023-04568-8]
Abstract
BACKGROUND Diagnostic accuracy is one of the major cornerstones of appropriate and successful medical decision-making. Clinical decision support systems (CDSSs) have recently been used to facilitate physicians' diagnostic considerations. However, to date, little is known about the potential assets of CDSSs for medical students in an educational setting. The purpose of our study was to explore the usefulness of CDSSs for medical students by assessing their diagnostic performance and the influence of such software on students' trust in their own diagnostic abilities. METHODS Based on paper cases, students had to diagnose two different patients using a CDSS and conventional methods (e.g., textbooks), respectively. Both patients had a common disease; in one setting the clinical presentation was typical (tonsillitis), whereas in the other (pulmonary embolism) the patient presented atypically. We used a 2x2x2 between- and within-subjects cluster-randomised controlled trial to assess diagnostic accuracy in medical students, also varying the order of the resources used (CDSS first or second). RESULTS Medical students in their 4th and 5th year performed equally well using conventional methods or the CDSS across the two cases (t(164) = 1.30; p = 0.197). Diagnostic accuracy and trust in the correct diagnosis were higher in the typical presentation condition than in the atypical presentation condition (t(85) = 19.97; p < .0001 and t(150) = 7.67; p < .0001). These results refute our main hypothesis that students diagnose more accurately when using conventional methods compared to the CDSS. CONCLUSIONS Medical students in their 4th and 5th year performed equally well in diagnosing two cases of common diseases with typical or atypical clinical presentations using conventional methods or a CDSS. Students were proficient in diagnosing a common disease with a typical presentation but underestimated their own factual knowledge in this scenario.
Also, students were aware of their own diagnostic limitations when presented with a challenging case with an atypical presentation for which the use of a CDSS seemingly provided no additional insights.
Affiliation(s)
- Sean D Kafke
- Adelheid Kuhlmey
- Johanna Schuster
- Stefan Blüher
- Constanze Czimmeck
- Jan C Zoellick
- Pascal Grosse
- All authors: Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
19
Kopka M, Scatturin L, Napierala H, Fürstenau D, Feufel MA, Balzer F, Schmieding ML. Characteristics of Users and Nonusers of Symptom Checkers in Germany: Cross-Sectional Survey Study. J Med Internet Res 2023; 25:e46231. [PMID: 37338970] [DOI: 10.2196/46231]
Abstract
BACKGROUND Previous studies have revealed that users of symptom checkers (SCs, apps that support self-diagnosis and self-triage) are predominantly female, are younger than average, and have higher levels of formal education. Little data are available for Germany, and no study has so far compared usage patterns with people's awareness of SCs and the perception of usefulness. OBJECTIVE We explored the sociodemographic and individual characteristics that are associated with the awareness, usage, and perceived usefulness of SCs in the German population. METHODS We conducted a cross-sectional online survey among 1084 German residents in July 2022 regarding personal characteristics and people's awareness and usage of SCs. Using random sampling from a commercial panel, we collected participant responses stratified by gender, state of residence, income, and age to reflect the German population. We analyzed the collected data exploratively. RESULTS Of all respondents, 16.3% (177/1084) were aware of SCs and 6.5% (71/1084) had used them before. Those aware of SCs were younger (mean 38.8, SD 14.6 years, vs mean 48.3, SD 15.7 years), were more often female (107/177, 60.5%, vs 453/907, 49.9%), and had higher formal education levels (eg, 72/177, 40.7%, vs 238/907, 26.2%, with a university/college degree) than those unaware. The same pattern applied to users compared to nonusers but disappeared when comparing users to nonusers who were aware of SCs. Among users, 40.8% (29/71) considered these tools useful. Those considering them useful reported higher self-efficacy (mean 4.21, SD 0.66, vs mean 3.63, SD 0.81, on a scale of 1-5) and a higher net household income (mean EUR 2591.63, SD EUR 1103.96 [mean US $2798.96, SD US $1192.28], vs mean EUR 1626.60, SD EUR 649.05 [mean US $1756.73, SD US $700.97]) than those who considered them not useful. More women (13/44, 29.5%) than men (4/26, 15.4%) considered SCs unhelpful.
CONCLUSIONS Concurring with studies from other countries, our findings show associations between sociodemographic characteristics and SC usage in a German sample: users were on average younger, of higher socioeconomic status, and more commonly female compared to nonusers. However, usage cannot be explained by sociodemographic differences alone. It rather seems that sociodemographics explain who is or is not aware of the technology, but those who are aware of SCs are equally likely to use them, independently of sociodemographic differences. Although in some groups (eg, people with anxiety disorder), more participants reported knowing and using SCs, they tended to perceive them as less useful. In other groups (eg, male participants), fewer respondents were aware of SCs, but those who used them perceived them to be more useful. Thus, SCs should be designed to fit specific user needs, and strategies should be developed to help reach individuals who could benefit but are not aware of SCs yet.
Affiliation(s)
- Marvin Kopka
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Lennart Scatturin
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Hendrik Napierala
- Institute of General Practice and Family Medicine, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Daniel Fürstenau
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Department of Business IT, IT University of Copenhagen, København, Denmark
- Markus A Feufel
- Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Felix Balzer
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Malte L Schmieding
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
20
Hirosawa T, Harada Y, Yokose M, Sakamoto T, Kawamura R, Shimizu T. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study. Int J Environ Res Public Health 2023; 20:3378. [PMID: 36834073] [PMCID: PMC9967747] [DOI: 10.3390/ijerph20043378]
Abstract
The diagnostic accuracy of differential diagnoses generated by artificial intelligence (AI) chatbots, including the generative pretrained transformer 3 (GPT-3) chatbot (ChatGPT-3), is unknown. This study evaluated the accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical vignettes with common chief complaints. General internal medicine physicians created clinical cases, correct diagnoses, and five differential diagnoses for ten common chief complaints. The rate of correct diagnosis by ChatGPT-3 within the ten differential-diagnosis lists was 28/30 (93.3%). The rate of correct diagnosis by physicians was still superior to that by ChatGPT-3 within the five differential-diagnosis lists (98.3% vs. 83.3%, p = 0.03). The rate of correct diagnosis by physicians was also superior to that by ChatGPT-3 for the top diagnosis (93.3% vs. 53.3%, p < 0.001). The rate of consistent differential diagnoses among physicians within the ten differential-diagnosis lists generated by ChatGPT-3 was 62/88 (70.5%). In summary, this study demonstrates the high diagnostic accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical cases with common chief complaints. This suggests that AI chatbots such as ChatGPT-3 can generate a well-differentiated diagnosis list for common chief complaints, although the ranking within these lists can still be improved.
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Tochigi 321-0293, Japan
21
Pairon A, Philips H, Verhoeven V. A scoping review on the use and usefulness of online symptom checkers and triage systems: How to proceed? Front Med (Lausanne) 2023; 9:1040926. [PMID: 36687416] [PMCID: PMC9853165] [DOI: 10.3389/fmed.2022.1040926]
Abstract
Background Patients are increasingly turning to the Internet for health information. Numerous online symptom checkers and digital triage tools are currently available to the general public in an effort to meet this need, simultaneously acting as a demand management strategy to aid the overburdened health care system. The implementation of these services requires an evidence-based approach, warranting a review of the available literature on this rapidly evolving topic. Objective This scoping review aims to provide an overview of the current state of the art and identify research gaps through an analysis of the strengths and weaknesses of the presently available literature. Methods A systematic search strategy was formed and applied to six databases: Cochrane Library, NICE, DARE, NIHR, PubMed, and Web of Science. Data extraction was performed by two researchers according to a pre-established data charting methodology, allowing for a thematic analysis of the results. Results A total of 10,250 articles were identified, and 28 publications were found eligible for inclusion. Users of these tools are often younger, female, more highly educated, and more technologically literate, potentially widening the digital divide and affecting health equity. Triage algorithms remain risk-averse, which challenges their accuracy. Recent evolutions in algorithms have had varying degrees of success. Results on impact are highly variable, with potential effects on demand, accessibility of care, health literacy, and syndromic surveillance. Both patients and healthcare providers are generally positive about the technology and seem amenable to the advice given, but there are still improvements to be made toward a more patient-centered approach. The significant heterogeneity across studies and triage systems remains the primary challenge for the field, limiting the transferability of findings.
Conclusion The evidence included in this review is characterized by considerable variability in study design and outcomes, highlighting significant challenges for future research. An evolution toward more homogeneous methodologies, studies tailored to the intended setting, regulation and standardization of evaluations, and a patient-centered approach could benefit the field.
22
Kopka M, Feufel MA, Berner ES, Schmieding ML. How suitable are clinical vignettes for the evaluation of symptom checker apps? A test theoretical perspective. Digit Health 2023; 9:20552076231194929. [PMID: 37614591] [PMCID: PMC10444026] [DOI: 10.1177/20552076231194929]
Abstract
Objective To evaluate the ability of case vignettes to assess the performance of symptom checker applications and to suggest refinements to the methodology used in case vignette-based audit studies. Methods We re-analyzed the publicly available data of two prominent case vignette-based symptom checker audit studies by calculating common metrics of test theory. Furthermore, we developed a new metric, the Capability Comparison Score (CCS), which compares symptom checker capability while controlling for the difficulty of the set of cases each symptom checker evaluated. We then scrutinized whether applying test theory and the CCS altered the performance ranking of the investigated symptom checkers. Results In both studies, most symptom checkers changed their rank order when adjusting the triage capability for item difficulty (ID) with the CCS. The previously reported triage accuracies commonly overestimated the capability of symptom checkers because they did not account for the fact that symptom checkers tend to selectively appraise easier cases (i.e., with high ID values). Also, many case vignettes in both studies showed insufficient (very low and even negative) values of item-total correlation (ITC), suggesting that individual items or the composition of item sets are of low quality. Conclusions A test-theoretic perspective helps identify previously undetected threats to the validity of case vignette-based symptom checker assessments and provides guidance and specific metrics to improve the quality of case vignettes, in particular by controlling for the difficulty of the vignettes an app was (not) able to evaluate correctly. Such measures might prove more meaningful than accuracy alone for the competitive assessment of symptom checkers. Our approach helps elaborate and standardize the methodology used for appraising symptom checker capability, which, ultimately, may yield more reliable results.
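The test-theoretic quantities named in this abstract are standard item-analysis metrics: item difficulty (ID) is the proportion of correct appraisals a vignette receives, and item-total correlation (ITC) correlates each vignette's score with the total score on the remaining vignettes. A minimal sketch of how these could be computed for a vignette-based audit, using hypothetical response data (the study's Capability Comparison Score formula is not reproduced in the abstract and is not shown here):

```python
# Item analysis for a vignette-based symptom checker audit.
# rows = raters (symptom checkers or laypersons), columns = case vignettes;
# 1 = correct appraisal, 0 = incorrect. All data here are hypothetical.

def item_difficulty(matrix, item):
    """Proportion of correct responses on one vignette (high value = easy item)."""
    col = [row[item] for row in matrix]
    return sum(col) / len(col)

def item_total_correlation(matrix, item):
    """Pearson r between one item and the total score of the *other* items
    (corrected item-total correlation). Low or negative values flag vignettes
    that do not measure the same capability as the rest of the set."""
    x = [row[item] for row in matrix]
    y = [sum(row) - row[item] for row in matrix]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

responses = [  # 4 raters x 5 vignettes, hypothetical
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0],
]
print(item_difficulty(responses, 0))                   # 1.0 (a very easy vignette)
print(round(item_total_correlation(responses, 1), 2))  # 0.58
```

Vignettes with ID near 1.0 add little discriminating information, and vignettes with low or negative ITC are candidates for revision or removal, which is the kind of screening the authors advocate.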
Affiliation(s)
- Marvin Kopka
- Department of Psychology and Ergonomics (IPA), Division of Ergonomics, Technische Universität Berlin, Berlin, Germany
- Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Markus A Feufel
- Department of Psychology and Ergonomics (IPA), Division of Ergonomics, Technische Universität Berlin, Berlin, Germany
- Eta S Berner
- Department of Health Services Administration, University of Alabama at Birmingham, Birmingham, AL, USA
- Malte L Schmieding
- Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
|
23
|
North F, Jensen TB, Stroebel RJ, Nelson EM, Johnson BJ, Thompson MC, Pecina JL, Crum BA. Self-Triage Use, Subsequent Healthcare Utilization, and Diagnoses: A Retrospective Study of Process and Clinical Outcomes Following Self-Triage and Self-Scheduling for Ear or Hearing Symptoms. Health Serv Res Manag Epidemiol 2023; 10:23333928231168121. [PMID: 37101803] [PMCID: PMC10123887] [DOI: 10.1177/23333928231168121]
Abstract
Background Self-triage is becoming more widespread, but little is known about the people who are using online self-triage tools and their outcomes. For self-triage researchers, there are significant barriers to capturing subsequent healthcare outcomes. Our integrated healthcare system was able to capture subsequent healthcare utilization of individuals who used self-triage integrated with self-scheduling of provider visits. Methods We retrospectively examined healthcare utilization and diagnoses after patients had used self-triage and self-scheduling for ear or hearing symptoms. Outcomes and counts of office visits, telemedicine interactions, emergency department visits, and hospitalizations were captured. Diagnosis codes associated with subsequent provider visits were dichotomously categorized as being associated with ear or hearing concerns or not. Non-visit care encounters of patient-initiated messages, nurse triage calls, and clinical communications were also captured. Results For 2168 self-triage uses, we were able to capture subsequent healthcare encounters within 7 days of self-triage for 80.5% (1745/2168). Of the 1092 subsequent office visits with diagnoses, 83.1% (891/1092) were associated with relevant ear, nose, and throat diagnoses. Only 0.24% (4/1662) of patients with captured outcomes had a hospitalization within 7 days. Self-triage resulted in a self-scheduled office visit in 7.2% (126/1745) of uses. Self-scheduled office visits had significantly fewer combined non-visit care encounters per office visit (fewer combined nurse triage calls, patient messages, and clinical communication messages) than office visits that were not self-scheduled (-0.51; 95% CI, -0.72 to -0.29; P < .0001). Conclusion In an appropriate healthcare setting, self-triage outcomes can be captured for a high percentage of uses to examine safety, patient adherence to recommendations, and the efficiency of self-triage. With the ear or hearing self-triage, most uses had subsequent visit diagnoses relevant to ear or hearing, suggesting that most patients selected the appropriate self-triage pathway for their symptoms.
Affiliation(s)
- Frederick North
- Department of Medicine, Division of Community Internal Medicine, Geriatrics, and Palliative Care, Mayo Clinic, Rochester, MN, USA
- Teresa B Jensen
- Department of Family Medicine, Mayo Clinic, Rochester, MN, USA
- Robert J Stroebel
- Department of Medicine, Division of Community Internal Medicine, Geriatrics, and Palliative Care, Mayo Clinic, Rochester, MN, USA
- Elissa M Nelson
- Enterprise Office of Access Management, Mayo Clinic, Rochester, MN, USA
- Brenda J Johnson
- Enterprise Office of Access Management, Mayo Clinic, Rochester, MN, USA
- Brian A Crum
- Department of Neurology, Mayo Clinic, Rochester, MN, USA
24
Painter A, Hayhoe B, Riboli-Sasco E, El-Osta A. Online Symptom Checkers: Recommendations for a Vignette-Based Clinical Evaluation Standard. J Med Internet Res 2022; 24:e37408. [DOI: 10.2196/37408]
Abstract
The use of patient-facing online symptom checkers (OSCs) has expanded in recent years, but their accuracy, safety, and impact on patient behaviors and health care systems remain unclear. The lack of a standardized process of clinical evaluation has resulted in significant variation in approaches to OSC validation and evaluation. The aim of this paper is to characterize a set of congruent requirements for a standardized vignette-based clinical evaluation process for OSCs. Discrepancies in the findings of comparative studies to date suggest that different steps in OSC evaluation methodology can significantly influence outcomes. A standardized process with a clear specification for vignette-based clinical evaluation is urgently needed to guide developers and facilitate the objective comparison of OSCs. We propose 15 recommended requirements for an OSC evaluation standard. A third-party evaluation process and protocols for prospective real-world evidence studies should also be prioritized to quality-assure OSC assessment.
25
Napierala H, Kopka M, Altendorf MB, Bolanaki M, Schmidt K, Piper SK, Heintze C, Möckel M, Balzer F, Slagman A, Schmieding ML. Examining the impact of a symptom assessment application on patient-physician interaction among self-referred walk-in patients in the emergency department (AKUSYM): study protocol for a multi-center, randomized controlled, parallel-group superiority trial. Trials 2022; 23:791. [PMID: 36127742] [PMCID: PMC9490986] [DOI: 10.1186/s13063-022-06688-w]
Abstract
Background Due to the increasing use of online health information, symptom checkers have been developed to provide an individualized assessment of health complaints, potential diagnoses, and an urgency estimate. It is assumed that they support patient empowerment and have a positive impact on patient-physician interaction and satisfaction with care. In the emergency department (ED) in particular, symptom checkers could be integrated to bridge waiting times, and both patients and physicians could benefit from potential positive effects. Our study therefore aims to assess the impact of symptom assessment application (SAA) use, compared to no SAA use, on patient-physician interaction among self-referred walk-in patients in the ED. Methods In this multi-center, 1:1 randomized, controlled, parallel-group superiority trial, 440 self-referred adult walk-in patients with a non-urgent triage category will be recruited in three EDs in Berlin. Eligible participants in the intervention group will use a SAA directly after initial triage. The control group receives standard care without using a SAA. The primary endpoint is patients’ satisfaction with the patient-physician interaction, assessed by the Patient Satisfaction Questionnaire. Discussion The results of this trial could influence the implementation of SAA into acute care to improve satisfaction with the patient-physician interaction. Trial registration German Clinical Trials Registry DRKS00028598. Registered on 25.03.2022.
Affiliation(s)
- Hendrik Napierala
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of General Practice and Family Medicine, Charitéplatz 1, 10117, Berlin, Germany
- Marvin Kopka
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Straße des 17. Juni 135, 10623, Berlin, Germany
- Maria B Altendorf
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Myrto Bolanaki
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Konrad Schmidt
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of General Practice and Family Medicine, Charitéplatz 1, 10117, Berlin, Germany
- Jena University Hospital, Institute of General Practice and Family Medicine, Bachstr. 18, 07743, Jena, Germany
- Sophie K Piper
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, 10117, Berlin, Germany
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Christoph Heintze
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of General Practice and Family Medicine, Charitéplatz 1, 10117, Berlin, Germany
- Martin Möckel
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Felix Balzer
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- Anna Slagman
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Emergency and Acute Medicine and Health Services Research in Emergency Medicine (CVK, CCM), Charitéplatz 1, 10117, Berlin, Germany
- Malte L Schmieding
- Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Institute of Medical Informatics, Charitéplatz 1, 10117, Berlin, Germany
- docport Services GmbH, Tußmannstr. 75, 40477, Düsseldorf, Germany
|
26
|
Fraser HSF, Cohan G, Koehler C, Anderson J, Lawrence A, Pateña J, Bacher I, Ranney ML. Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study. JMIR Mhealth Uhealth 2022; 10:e38364. [PMID: 36121688] [PMCID: PMC9531004] [DOI: 10.2196/38364]
Abstract
Background Symptom checkers are clinical decision support apps for patients, used by tens of millions of people annually. They are designed to provide diagnostic and triage advice and assist users in seeking the appropriate level of care. Little evidence is available regarding their diagnostic and triage accuracy with direct use by patients for urgent conditions. Objective The aim of this study is to determine the diagnostic and triage accuracy and usability of a symptom checker in use by patients presenting to an emergency department (ED). Methods We recruited a convenience sample of English-speaking patients presenting for care in an urban ED. Each consenting patient used a leading symptom checker from Ada Health before the ED evaluation. Diagnostic accuracy was evaluated by comparing the symptom checker’s diagnoses, and those of 3 independent emergency physicians viewing the patient-entered symptom data, with the final diagnoses from the ED evaluation. The Ada diagnoses and triage were also critiqued by the independent physicians. The patients completed a usability survey based on the Technology Acceptance Model. Results A total of 40 (80%) of the 50 participants approached completed the symptom checker assessment and usability survey. Their mean age was 39.3 (SD 15.9; range 18-76) years, and they were 65% (26/40) female, 68% (27/40) White, 48% (19/40) Hispanic or Latino, and 13% (5/40) Black or African American. Some cases had missing data or lacked a clear ED diagnosis; 75% (30/40) were included in the analysis of diagnosis and 93% (37/40) in the analysis of triage. The sensitivity for at least one of the final ED diagnoses by Ada (based on its top 5 diagnoses) was 70% (95% CI 54%-86%), close to the mean sensitivity of 68.9% for the 3 physicians (on their top 3 diagnoses). The physicians fully agreed with the Ada triage decision in 62% (23/37) of cases and rated it safe but too cautious in 24% (9/37). The triage decision was rated unsafe and too risky in 22% (8/37) of cases by at least one physician, in 14% (5/37) of cases by at least two physicians, and in 5% (2/37) of cases by all 3 physicians. Usability was rated highly; participants agreed or strongly agreed with the 7 Technology Acceptance Model usability questions, with a mean score of 84.6%, although “satisfaction” and “enjoyment” were rated low. Conclusions This study provides preliminary evidence that a symptom checker can provide acceptable usability and diagnostic accuracy for patients with various urgent conditions. A total of 14% (5/37) of symptom checker triage recommendations were deemed unsafe and too risky by at least two physicians based on the symptoms recorded, similar to the results of studies on telephone and nurse triage. Larger studies are needed of diagnosis and triage performance with direct patient use in different clinical environments.
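The headline diagnostic metric in studies like this one is top-k sensitivity: the fraction of cases in which at least one final ED diagnosis appears in the checker's top k suggestions. A minimal sketch with hypothetical cases (the diagnosis strings and the matching-by-name simplification are illustrative only; real studies match diagnoses by clinical judgment, not string equality):

```python
# Top-k diagnostic sensitivity: did any final ED diagnosis appear in the
# symptom checker's top-k suggestion list? Data below are hypothetical and
# matching is by case-insensitive name only, a deliberate simplification.

def top_k_match(checker_diagnoses, final_diagnoses, k=5):
    """True if at least one final diagnosis is in the checker's top-k list."""
    top_k = {d.lower() for d in checker_diagnoses[:k]}
    return any(d.lower() in top_k for d in final_diagnoses)

def top_k_sensitivity(cases, k=5):
    """Fraction of cases with at least one top-k match."""
    hits = sum(top_k_match(c["checker"], c["final"], k) for c in cases)
    return hits / len(cases)

cases = [  # hypothetical vignettes
    {"checker": ["migraine", "tension headache", "sinusitis"],
     "final": ["migraine"]},
    {"checker": ["gastritis", "biliary colic", "pancreatitis"],
     "final": ["appendicitis"]},
    {"checker": ["asthma", "pneumonia", "bronchitis", "pulmonary embolism"],
     "final": ["pneumonia", "sepsis"]},
]
print(round(top_k_sensitivity(cases, k=5), 2))  # 0.67 (2 of 3 cases matched)
print(round(top_k_sensitivity(cases, k=1), 2))  # 0.33 (only the first case)
```

Reporting both a top-1 and a top-k figure, as diagnostic-accuracy studies commonly do, separates "was the diagnosis considered at all" from "was it ranked first".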
Affiliation(s)
- Hamish S F Fraser
- Brown Center for Biomedical Informatics, Warren Alpert Medical School, Brown University, Providence, RI, United States
- School of Public Health, Brown University, Providence, RI, United States
- Gregory Cohan
- Warren Alpert Medical School, Brown University, Providence, RI, United States
- Christopher Koehler
- Department of Emergency Medicine, Brown University, Providence, RI, United States
- Jared Anderson
- Department of Emergency Medicine, Brown University, Providence, RI, United States
- Alexis Lawrence
- Harvard Medical Faculty Physicians, Department of Emergency Medicine, St Luke's Hospital, New Bedford, MA, United States
- John Pateña
- Brown-Lifespan Center for Digital Health, Providence, RI, United States
- Ian Bacher
- Brown Center for Biomedical Informatics, Warren Alpert Medical School, Brown University, Providence, RI, United States
- Megan L Ranney
- School of Public Health, Brown University, Providence, RI, United States
- Department of Emergency Medicine, Brown University, Providence, RI, United States
- Brown-Lifespan Center for Digital Health, Providence, RI, United States
|
27
|
Nguyen H, Meczner A, Burslam-Dawe K, Hayhoe B. Triage Errors in Primary and Pre-Primary Care. J Med Internet Res 2022; 24:e37209. [PMID: 35749166] [PMCID: PMC9270711] [DOI: 10.2196/37209]
Abstract
Triage errors are a major concern in health care due to resulting harmful delays in treatments or inappropriate allocation of resources. With the increasing popularity of digital symptom checkers in pre–primary care settings, and amid claims that artificial intelligence outperforms doctors, the accuracy of triage by digital symptom checkers is ever more scrutinized. This paper examines the context and challenges of triage in primary care, pre–primary care, and emergency care, as well as reviews existing evidence on the prevalence of triage errors in all three settings. Implications for development, research, and practice are highlighted, and recommendations are made on how digital symptom checkers should be best positioned.
Affiliation(s)
- Hai Nguyen
- Your.MD Ltd, London, United Kingdom
- Health Services and Population Research, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- Benedict Hayhoe
- eConsult Ltd, London, United Kingdom
- Department of Primary Care, School of Public Health, Imperial College London, London, United Kingdom
|
28
|
Kopka M, Feufel MA, Balzer F, Schmieding ML. Triage Capability of Laypersons: Retrospective, Exploratory Analysis. JMIR Form Res 2022; 6:e38977. [PMID: 36222793] [PMCID: PMC9607917] [DOI: 10.2196/38977]
Abstract
Background: Although medical decision-making may be thought of as a task for health professionals, many decisions, including critical health-related decisions, are made by laypersons alone. Specifically, as the first step of most care episodes, it is the patient who decides whether and where to seek health care (triage). Overcautious self-assessments (ie, overtriaging) may lead to overutilization of health care facilities and overcrowded emergency departments, whereas imprudent decisions (ie, undertriaging) pose a risk to the patient's health. Recently, patient-facing decision support systems, commonly known as symptom checkers, have been developed to assist laypersons in these decisions.
Objective: The purpose of this study is to identify factors influencing laypersons' ability to self-triage and their risk averseness in self-triage decisions.
Methods: We analyzed publicly available data on 91 laypersons appraising 45 short fictitious patient descriptions (case vignettes; N=4095 appraisals). Using signal detection theory and descriptive and inferential statistics, we explored whether the type of medical decision laypersons face, their confidence in their decision, and sociodemographic factors influence their triage accuracy and the type of errors they make. We distinguished between 2 decisions: whether emergency care was required (decision 1) and whether self-care was sufficient (decision 2).
Results: The accuracy of detecting emergencies (decision 1) was higher (mean 82.2%, SD 5.9%) than that of deciding whether any type of medical care is required (decision 2; mean 75.9%, SD 5.25%; t(90)=8.4; P<.001; Cohen d=0.9). Sensitivity for decision 1 was lower (mean 67.5%, SD 16.4%) than its specificity (mean 89.6%, SD 8.6%), whereas sensitivity for decision 2 was higher (mean 90.5%, SD 8.3%) than its specificity (mean 46.7%, SD 15.95%). Female participants were more risk averse and overtriaged more often than male participants, but age and level of education showed no association with participants' risk averseness. Participants' triage accuracy was higher when they were certain about their appraisal (2114/3381, 62.5%) than when they were uncertain (378/714, 52.9%). However, most errors occurred when participants were certain of their decision (1267/1603, 79.0%). Participants were more often certain of their overtriage errors (mean 80.9%, SD 23.8%) than of their undertriage errors (mean 72.5%, SD 30.9%; t(89)=3.7; P<.001; d=0.39).
Conclusions: Our study suggests that laypersons are overcautious in deciding whether they require medical care at all, yet miss a considerable portion of emergencies. Our results further indicate that women are more risk averse than men in both types of decisions. Laypersons made most triage errors when they were certain of their own appraisal; thus, they might not follow or even seek advice (eg, from symptom checkers) in most instances where advice would be useful.
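The sensitivity and specificity figures reported in this abstract are standard signal-detection measures derived from a confusion matrix of layperson decisions against gold-standard vignette labels. A minimal sketch of that computation, using hypothetical appraisal data (the function `sensitivity_specificity` is illustrative, not code from the study):

```python
def sensitivity_specificity(decisions, gold):
    """Return (sensitivity, specificity) of binary decisions vs. gold labels.

    decisions, gold: sequences of booleans, where True marks the positive
    class (e.g. 'this vignette requires emergency care').
    """
    tp = sum(d and g for d, g in zip(decisions, gold))          # true positives
    fn = sum((not d) and g for d, g in zip(decisions, gold))    # missed emergencies (undertriage)
    tn = sum((not d) and (not g) for d, g in zip(decisions, gold))
    fp = sum(d and (not g) for d, g in zip(decisions, gold))    # false alarms (overtriage)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# Hypothetical appraisals of 8 case vignettes (True = emergency care required).
gold      = [True, True, True, True, False, False, False, False]
decisions = [True, True, False, False, False, False, False, True]

sens, spec = sensitivity_specificity(decisions, gold)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")  # sensitivity=0.50 specificity=0.75
```

The asymmetry the study reports (low sensitivity but high specificity for emergencies) corresponds to many false negatives relative to false positives in this matrix.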
Affiliation(s)
- Marvin Kopka
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Markus A Feufel
- Division of Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Felix Balzer
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Malte L Schmieding
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
29
Kopka M, Schmieding ML, Rieger T, Roesler E, Balzer F, Feufel MA. Trust Me, I’m Not a Doctor! Determinants of Laypersons’ Trust in Medical Decision Aids: Experimental Study. JMIR Hum Factors 2022; 9:e35219. [PMID: 35503248 PMCID: PMC9115664 DOI: 10.2196/35219] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/25/2021] [Revised: 02/09/2022] [Accepted: 03/06/2022] [Indexed: 11/13/2022] Open
Affiliation(s)
- Marvin Kopka
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Malte L Schmieding
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Tobias Rieger
- Work, Engineering and Organizational Psychology, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Eileen Roesler
- Work, Engineering and Organizational Psychology, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Felix Balzer
- Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Markus A Feufel
- Division of Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany