1. Liu V, Kaila M, Koskela T. Triage Accuracy and the Safety of User-Initiated Symptom Assessment With an Electronic Symptom Checker in a Real-Life Setting: Instrument Validation Study. JMIR Hum Factors 2024; 11:e55099. [PMID: 39326038 PMCID: PMC11467609 DOI: 10.2196/55099]
Abstract
BACKGROUND Previous studies have evaluated the accuracy of the diagnostics of electronic symptom checkers (ESCs) and triage using clinical case vignettes. National Omaolo digital services (Omaolo) in Finland consist of an ESC for various symptoms. Omaolo is a medical device with a Conformité Européenne marking (risk class: IIa), based on Duodecim Clinical Decision Support, EBMEDS. OBJECTIVE This study investigates how well triage performed by the ESC matches nurse triage within the chief symptom list available in Omaolo (anal region symptoms, cough, diarrhea, discharge from the eye or watery or reddish eye, headache, heartburn, knee symptom or injury, lower back pain or injury, oral health, painful or blocked ear, respiratory tract infection, sexually transmitted disease, shoulder pain or stiffness or injury, sore throat or throat symptom, and urinary tract infection). In addition, the accuracy, specificity, sensitivity, and safety of the Omaolo ESC were assessed. METHODS This is a clinical validation study in a real-life setting performed at multiple primary health care (PHC) centers across Finland. The included units were of the walk-in model of primary care, where no previous phone call or contact was required. Upon arriving at the PHC center, users (patients) answered the ESC questions and received a triage recommendation; a nurse then assessed their triage. Findings on 877 patients were analyzed by matching the ESC recommendations with triage by the triage nurse. RESULTS Safe assessments by the ESC accounted for 97.6% (856/877; 95% CI 95.6%-98.0%) of all assessments made. The mean exact-match rate across all symptom assessments was 53.7% (471/877; 95% CI 49.2%-55.9%). The mean rate of assessments that were an exact match or overly conservative but suitable (the ESC's assessment was 1 triage level higher than the nurse's triage) was 66.6% (584/877; 95% CI 63.4%-69.7%). When the nurse concluded that urgent treatment was needed, the ESC's exact-match accuracy was 70.9% (244/344; 95% CI 65.8%-75.7%). Sensitivity for the Omaolo ESC was 62.6%, and specificity was 69.2%. A total of 21 critical assessments were identified for further analysis: there was no indication of compromised patient safety. CONCLUSIONS The primary objectives of this study were to evaluate the safety and to explore the accuracy, specificity, and sensitivity of the Omaolo ESC. The results indicate that the ESC is safe in a real-life setting when appraised with assessments conducted by triage nurses. Furthermore, the Omaolo ESC exhibits the potential to guide patients to appropriate triage destinations effectively, helping them to receive timely and suitable care. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) RR2-10.2196/41423.
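The two triage agreement measures reported above (exact match, and exact match plus "overly conservative but suitable", ie, the ESC one triage level higher than the nurse) can be illustrated with a minimal Python sketch. This is not the study's code; the ordinal triage scale, variable names, and example data are assumptions.

```python
# Minimal sketch (not the study's code): "exact match" and "exact match or one
# level more conservative" agreement between symptom checker triage and nurse
# triage. Triage levels are assumed to be ordered integers, e.g. 0 = self-care,
# 1 = routine, 2 = urgent, 3 = emergency (a hypothetical scale).
from typing import Sequence


def triage_agreement(esc: Sequence[int], nurse: Sequence[int]) -> dict:
    """Return exact-match and 'exact or overly conservative' proportions."""
    assert len(esc) == len(nurse) and len(esc) > 0
    exact = sum(e == r for e, r in zip(esc, nurse))
    # "Overly conservative but suitable": ESC exactly one level more urgent.
    conservative = sum(e == r + 1 for e, r in zip(esc, nurse))
    n = len(esc)
    return {
        "exact_match": exact / n,
        "exact_or_one_level_over": (exact + conservative) / n,
    }


if __name__ == "__main__":
    esc_levels = [2, 1, 3, 0, 2, 2]
    nurse_levels = [2, 1, 2, 0, 1, 3]
    print(triage_agreement(esc_levels, nurse_levels))
```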
Affiliation(s)
- Ville Liu
- Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Minna Kaila
- Public Health Medicine, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Tuomas Koskela
- Department of General Practice, Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- The Wellbeing Services County of Pirkanmaa, Tampere, Finland
2. Chen A, Chen DO, Tian L. Benchmarking the symptom-checking capabilities of ChatGPT for a broad range of diseases. J Am Med Inform Assoc 2024; 31:2084-2088. [PMID: 38109889 PMCID: PMC11339504 DOI: 10.1093/jamia/ocad245]
Abstract
OBJECTIVE This study evaluates ChatGPT's symptom-checking accuracy across a broad range of diseases using the Mayo Clinic Symptom Checker patient service as a benchmark. METHODS We prompted ChatGPT with symptoms of 194 distinct diseases. By comparing its predictions with expectations, we calculated a relative comparative score (RCS) to gauge accuracy. RESULTS ChatGPT's GPT-4 model achieved an average RCS of 78.8%, outperforming the GPT-3.5-turbo by 10.5%. Some specialties scored above 90%. DISCUSSION The test set, although extensive, was not exhaustive. Future studies should include a more comprehensive disease spectrum. CONCLUSION ChatGPT exhibits high accuracy in symptom checking for a broad range of diseases, showcasing its potential as a medical training tool in learning health systems to enhance care quality and address health disparities.
Affiliation(s)
- Anjun Chen
- Health Sciences, ELHS Institute, Palo Alto, CA 94306, United States
- LHS Tech Forum Initiative, Learning Health Community, Palo Alto, CA 94306, United States
- Drake O Chen
- LHS Tech Forum Initiative, Learning Health Community, Palo Alto, CA 94306, United States
- Lu Tian
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States
3. Meczner A, Cohen N, Qureshi A, Reza M, Sutaria S, Blount E, Bagyura Z, Malak T. Controlling Inputter Variability in Vignette Studies Assessing Web-Based Symptom Checkers: Evaluation of Current Practice and Recommendations for Isolated Accuracy Metrics. JMIR Form Res 2024; 8:e49907. [PMID: 38820578 PMCID: PMC11179013 DOI: 10.2196/49907]
Abstract
BACKGROUND The rapid growth of web-based symptom checkers (SCs) is not matched by advances in quality assurance. Currently, there are no widely accepted criteria assessing SCs' performance. Vignette studies are widely used to evaluate SCs, measuring the accuracy of outcome. Accuracy behaves as a composite metric as it is affected by a number of individual SC- and tester-dependent factors. In contrast to clinical studies, vignette studies have a small number of testers. Hence, measuring accuracy alone in vignette studies may not provide a reliable assessment of performance due to tester variability. OBJECTIVE This study aims to investigate the impact of tester variability on the accuracy of outcome of SCs, using clinical vignettes. It further aims to investigate the feasibility of measuring isolated aspects of performance. METHODS Healthily's SC was assessed using 114 vignettes by 3 groups of 3 testers who processed vignettes with different instructions: free interpretation of vignettes (free testers), specified chief complaints (partially free testers), and specified chief complaints with strict instruction for answering additional symptoms (restricted testers). κ statistics were calculated to assess agreement of top outcome condition and recommended triage. Crude and adjusted accuracy were measured against a gold standard. Adjusted accuracy was calculated using only results of consultations identical to the vignette, following a review and selection process. A feasibility study for assessing symptom comprehension of SCs was performed using different variations of 51 chief complaints across 3 SCs. RESULTS Intertester agreement of most likely condition and triage was, respectively, 0.49 and 0.51 for the free tester group, 0.66 and 0.66 for the partially free group, and 0.72 and 0.71 for the restricted group. For the restricted group, accuracy ranged from 43.9% to 57% for individual testers, averaging 50.6% (SD 5.35%). Adjusted accuracy was 56.1%. Assessing symptom comprehension was feasible for all 3 SCs. Comprehension scores ranged from 52.9% to 68%. CONCLUSIONS We demonstrated that by improving standardization of the vignette testing process, there is a significant improvement in the agreement of outcome between testers. However, significant variability remained due to uncontrollable tester-dependent factors, reflected by varying outcome accuracy. Tester-dependent factors, combined with a small number of testers, limit the reliability and generalizability of outcome accuracy when used as a composite measure in vignette studies. Measuring and reporting different aspects of SC performance in isolation provides a more reliable assessment of SC performance. We developed an adjusted accuracy measure using a review and selection process to assess data algorithm quality. In addition, we demonstrated that symptom comprehension with different input methods can be feasibly compared. Future studies reporting accuracy need to apply vignette testing standardization and isolated metrics.
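The intertester agreement statistic named above can be outlined with pairwise Cohen's κ on the top condition returned for each vignette. A hedged sketch follows; the tester names and labels are invented, and this is not the study's analysis code.

```python
# Illustrative sketch only: pairwise Cohen's kappa for inter-tester agreement on
# the top condition returned by a symptom checker. Labels below are invented;
# the study's data and exact procedure are not reproduced here.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# One list of top-condition labels per tester, aligned by vignette.
testers = {
    "tester_1": ["migraine", "uti", "gerd", "otitis_media", "uti"],
    "tester_2": ["migraine", "uti", "gerd", "pharyngitis", "uti"],
    "tester_3": ["tension_headache", "uti", "gerd", "otitis_media", "cystitis"],
}

for (name_a, labels_a), (name_b, labels_b) in combinations(testers.items(), 2):
    kappa = cohen_kappa_score(labels_a, labels_b)
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")
```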
Affiliation(s)
- András Meczner
- Healthily, London, United Kingdom
- Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
- Zsolt Bagyura
- Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
4. Harada Y, Sakamoto T, Sugimoto S, Shimizu T. Longitudinal Changes in Diagnostic Accuracy of a Differential Diagnosis List Developed by an AI-Based Symptom Checker: Retrospective Observational Study. JMIR Form Res 2024; 8:e53985. [PMID: 38758588 PMCID: PMC11143391 DOI: 10.2196/53985]
Abstract
BACKGROUND Artificial intelligence (AI) symptom checker models should be trained using real-world patient data to improve their diagnostic accuracy. Given that AI-based symptom checkers are currently used in clinical practice, their performance should improve over time. However, longitudinal evaluations of the diagnostic accuracy of these symptom checkers are limited. OBJECTIVE This study aimed to assess the longitudinal changes in the accuracy of differential diagnosis lists created by an AI-based symptom checker used in the real world. METHODS This was a single-center, retrospective, observational study. Patients who visited an outpatient clinic without an appointment between May 1, 2019, and April 30, 2022, and who were admitted to a community hospital in Japan within 30 days of their index visit were considered eligible. We included only patients who completed the AI-based symptom checker assessment at the index visit and whose final diagnosis was confirmed during follow-up. Final diagnoses were categorized as common or uncommon, and all cases were categorized as typical or atypical. The primary outcome measure was the accuracy of the differential diagnosis list created by the AI-based symptom checker, defined as the inclusion of the final diagnosis in the list of 10 differential diagnoses created by the symptom checker. To assess the change in the symptom checker's diagnostic accuracy over 3 years, we used a chi-square test to compare the primary outcome over 3 periods: from May 1, 2019, to April 30, 2020 (first year); from May 1, 2020, to April 30, 2021 (second year); and from May 1, 2021, to April 30, 2022 (third year). RESULTS A total of 381 patients were included. Common diseases comprised 257 (67.5%) cases, and typical presentations were observed in 298 (78.2%) cases. Overall, the accuracy of the differential diagnosis list created by the AI-based symptom checker was 45.1% (172/381), which did not differ across the 3 years (first year: 97/219, 44.3%; second year: 32/72, 44.4%; and third year: 43/90, 47.7%; P=.85). The accuracy of the differential diagnosis list created by the symptom checker was low in those with uncommon diseases (30/124, 24.2%) and atypical presentations (12/83, 14.5%). In the multivariate logistic regression model, common disease (P<.001; odds ratio 4.13, 95% CI 2.50-6.98) and typical presentation (P<.001; odds ratio 6.92, 95% CI 3.62-14.2) were significantly associated with the accuracy of the differential diagnosis list created by the symptom checker. CONCLUSIONS A 3-year longitudinal survey of the diagnostic accuracy of differential diagnosis lists developed by an AI-based symptom checker, which has been implemented in real-world clinical practice settings, showed no improvement over time. Uncommon diseases and atypical presentations were independently associated with a lower diagnostic accuracy. In the future, symptom checkers should be trained to recognize uncommon conditions.
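The year-over-year comparison described in the methods (a chi-square test of top-10 accuracy across the 3 periods) can be sketched with the per-year counts quoted in the results. This is an illustration of the test, not the authors' analysis code.

```python
# Sketch of the chi-square comparison described in the abstract, using the
# per-year hit/miss counts it reports (97/219, 32/72, 43/90). Illustrative only.
from scipy.stats import chi2_contingency

hits = [97, 32, 43]            # final diagnosis within the top-10 list
totals = [219, 72, 90]
misses = [t - h for h, t in zip(hits, totals)]

table = [hits, misses]         # 2 x 3 contingency table (outcome x year)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2f}")  # p is close to the reported .85
```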
Affiliation(s)
- Yukinori Harada
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Department of General Medicine, Nagano Chuo Hospital, Nagano, Japan
- Tetsu Sakamoto
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Shu Sugimoto
- Department of Medicine (Neurology and Rheumatology), Shinshu University School of Medicine, Matsumoto, Japan
- Taro Shimizu
- Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
5. Hammoud M, Douglas S, Darmach M, Alawneh S, Sanyal S, Kanbour Y. Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study. JMIR AI 2024; 3:e46875. [PMID: 38875676 PMCID: PMC11091811 DOI: 10.2196/46875]
Abstract
BACKGROUND Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, whereby patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches. OBJECTIVE This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics. METHODS We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of their differential list, F1-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list's ranking quality, among others. RESULTS The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the differences (ranges) in M1, F1-score, and NDCG between the best-performing and worst-performing symptom checkers were 65.3%, 39.2%, and 74.2%, respectively. The same was observed among the participating human physicians, whereby the M1, F1-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F1-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively. CONCLUSIONS The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)-based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing exponentially.
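Two of the metrics named above, M1 (main diagnosis returned at the top of the differential list) and NDCG, can be sketched as follows. The binary-relevance treatment and the toy differential lists are assumptions made for illustration, not the paper's implementation.

```python
# Toy sketch of two metrics named in the abstract: M1 (main diagnosis at rank 1)
# and NDCG of the differential list, here with binary relevance (1 for the
# vignette's main diagnosis, 0 otherwise) -- an assumption, not necessarily the
# paper's exact formulation.
import math


def m1(differentials: list[list[str]], truths: list[str]) -> float:
    hits = sum(bool(d) and d[0] == t for d, t in zip(differentials, truths))
    return hits / len(truths)


def ndcg(differential: list[str], truth: str) -> float:
    # Single relevant item, so the ideal DCG is 1 (relevant item at rank 1).
    for rank, diagnosis in enumerate(differential, start=1):
        if diagnosis == truth:
            return 1.0 / math.log2(rank + 1)
    return 0.0


cases = [
    (["appendicitis", "gastroenteritis", "urinary tract infection"], "appendicitis"),
    (["migraine", "sinusitis", "tension headache"], "tension headache"),
]
diffs = [differential for differential, _ in cases]
truths = [truth for _, truth in cases]
print("M1   =", m1(diffs, truths))
print("NDCG =", sum(ndcg(d, t) for d, t in zip(diffs, truths)) / len(cases))
```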
6. Peven K, Wickham AP, Wilks O, Kaplan YC, Marhol A, Ahmed S, Bamford R, Cunningham AC, Prentice C, Meczner A, Fenech M, Gilbert S, Klepchukova A, Ponzo S, Zhaunova L. Assessment of a Digital Symptom Checker Tool's Accuracy in Suggesting Reproductive Health Conditions: Clinical Vignettes Study. JMIR Mhealth Uhealth 2023; 11:e46718. [PMID: 38051574 PMCID: PMC10731551 DOI: 10.2196/46718]
Abstract
BACKGROUND Reproductive health conditions such as endometriosis, uterine fibroids, and polycystic ovary syndrome (PCOS) affect a large proportion of women and people who menstruate worldwide. Prevalence estimates for these conditions range from 5% to 40% of women of reproductive age. Long diagnostic delays, up to 12 years, are common and contribute to health complications and increased health care costs. Symptom checker apps provide users with information and tools to better understand their symptoms and thus have the potential to reduce the time to diagnosis for reproductive health conditions. OBJECTIVE This study aimed to evaluate the agreement between clinicians and 3 symptom checkers (developed by Flo Health UK Limited) in assessing symptoms of endometriosis, uterine fibroids, and PCOS using vignettes. We also aimed to present a robust example of vignette case creation, review, and classification in the context of predeployment testing and validation of digital health symptom checker tools. METHODS Independent general practitioners were recruited to create clinical case vignettes of simulated users for the purpose of testing each condition symptom checker; vignettes created for each condition contained a mixture of condition-positive and condition-negative outcomes. A second panel of general practitioners then reviewed, approved, and modified (if necessary) each vignette. A third group of general practitioners reviewed each vignette case and designated a final classification. Vignettes were then entered into the symptom checkers by a fourth, different group of general practitioners. The outcomes of each symptom checker were then compared with the final classification of each vignette to produce accuracy metrics including percent agreement, sensitivity, specificity, positive predictive value, and negative predictive value. RESULTS A total of 24 cases were created per condition. Overall, exact matches between the vignette general practitioner classification and the symptom checker outcome were 83% (n=20) for endometriosis, 83% (n=20) for uterine fibroids, and 88% (n=21) for PCOS. For each symptom checker, sensitivity was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 100% for PCOS; specificity was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 75% for PCOS; positive predictive value was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, 80% for PCOS; and negative predictive value was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 100% for PCOS. CONCLUSIONS The single-condition symptom checkers have high levels of agreement with general practitioner classification for endometriosis, uterine fibroids, and PCOS. Given long delays in diagnosis for many reproductive health conditions, which lead to increased medical costs and potential health complications for individuals and health care providers, innovative health apps and symptom checkers hold the potential to improve care pathways.
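A brief sketch of how the reported sensitivity, specificity, positive predictive value, and negative predictive value follow from a 2x2 table of symptom checker outcome versus the general practitioners' vignette classification. The counts below are inferred from the endometriosis percentages in this abstract, not taken directly from the paper.

```python
# Minimal sketch: sensitivity, specificity, PPV, and NPV from a 2x2 table of
# symptom checker outcome vs the GPs' vignette classification. The counts
# (TP=9, FN=2, FP=2, TN=11) are inferred from the endometriosis figures in the
# abstract (81.8% / 84.6% on 24 vignettes), not taken directly from the paper.
def two_by_two_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "percent_agreement": (tp + tn) / total,
    }


print(two_by_two_metrics(tp=9, fn=2, fp=2, tn=11))
# -> sensitivity 0.818, specificity 0.846, ppv 0.818, npv 0.846, agreement 0.833
```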
Affiliation(s)
- Stephen Gilbert
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Sonia Ponzo
- Flo Health UK Limited, London, United Kingdom
7. Egli A. ChatGPT, GPT-4, and Other Large Language Models: The Next Revolution for Clinical Microbiology? Clin Infect Dis 2023; 77:1322-1328. [PMID: 37399030 PMCID: PMC10640689 DOI: 10.1093/cid/ciad407]
Abstract
ChatGPT, GPT-4, and Bard are highly advanced natural language processing-based computer programs (chatbots) that simulate and process human conversation in written or spoken form. Recently released by the company OpenAI, ChatGPT was trained on billions of unknown text elements (tokens) and rapidly gained wide attention for its ability to respond to questions in an articulate manner across a wide range of knowledge domains. These potentially disruptive large language model (LLM) technologies have a broad range of conceivable applications in medicine and medical microbiology. In this opinion article, I describe how chatbot technologies work and discuss the strengths and weaknesses of ChatGPT, GPT-4, and other LLMs for applications in the routine diagnostic laboratory, focusing on various use cases for the pre- to post-analytical process.
Affiliation(s)
- Adrian Egli
- Institute of Medical Microbiology, University of Zurich, Zurich, Switzerland
8. Määttä J, Lindell R, Hayward N, Martikainen S, Honkanen K, Inkala M, Hirvonen P, Martikainen TJ. Diagnostic Performance, Triage Safety, and Usability of a Clinical Decision Support System Within a University Hospital Emergency Department: Algorithm Performance and Usability Study. JMIR Med Inform 2023; 11:e46760. [PMID: 37656018 PMCID: PMC10501486 DOI: 10.2196/46760]
Abstract
Background Computerized clinical decision support systems (CDSSs) are increasingly adopted in health care to optimize resources and streamline patient flow. However, they often lack scientific validation against standard medical care. Objective The purpose of this study was to assess the performance, safety, and usability of a CDSS in a university hospital emergency department setting in Kuopio, Finland. Methods Patients entering the emergency department were asked to voluntarily participate in this study. Patients aged 17 years or younger, patients with cognitive impairments, and patients who entered the unit in an ambulance or with the need for immediate care were excluded. Patients completed the CDSS web-based form and usability questionnaire when waiting for the triage nurse's evaluation. The CDSS data were anonymized and did not affect the patients' usual evaluation or treatment. Retrospectively, 2 medical doctors evaluated the urgency of each patient's condition by using the triage nurse's information, and urgent and nonurgent groups were created. The International Statistical Classification of Diseases, Tenth Revision diagnoses were collected from the electronic health records. Usability was assessed by using a positive version of the System Usability Scale questionnaire. Results In total, our analyses included 248 patients. Regarding urgency, the mean sensitivities were 85% and 19%, respectively, for urgent and nonurgent cases when assessing the performance of CDSS evaluations in comparison to that of physicians. The mean sensitivities were 85% and 35%, respectively, when comparing the evaluations between the two physicians. Our CDSS did not miss any cases that were evaluated to be emergencies by physicians; thus, all emergency cases evaluated by physicians were evaluated as either urgent cases or emergency cases by the CDSS. In differential diagnosis, the CDSS had an exact match accuracy of 45.5% (97/213). The usability was good, with a mean System Usability Scale score of 78.2 (SD 16.8). Conclusions In a university hospital emergency department setting with a large real-world population, our CDSS was found to be equally as sensitive in urgent patient cases as physicians and was found to have an acceptable differential diagnosis accuracy, with good usability. These results suggest that this CDSS can be safely assessed further in a real-world setting. A CDSS could accelerate triage by providing patient-provided data in advance of patients' initial consultations and categorize patient cases as urgent and nonurgent cases upon patients' arrival to the emergency department.
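The usability figure above comes from a positive (all positively worded) version of the System Usability Scale. A short sketch of the usual 0-100 scoring for an all-positive SUS follows; the exact scoring convention used in the study and the example responses are assumptions.

```python
# Sketch of System Usability Scale scoring for an all-positive ("positive
# version") questionnaire: each of the 10 items is answered 1-5, each item
# contributes (response - 1), and the sum is scaled by 2.5 to give 0-100.
# The study's exact scoring convention is assumed, not confirmed, here.
def sus_score(responses: list[int]) -> float:
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("expected ten responses on a 1-5 scale")
    return sum(r - 1 for r in responses) * 2.5


respondents = [
    [4, 5, 4, 4, 5, 4, 3, 4, 5, 4],
    [5, 4, 4, 5, 4, 5, 4, 4, 4, 5],
]
scores = [sus_score(r) for r in respondents]
print(scores, "mean =", sum(scores) / len(scores))
```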
Affiliation(s)
- Rony Lindell
- Klinik Healthcare Solutions Oy, Helsinki, Finland
- Nick Hayward
- Klinik Healthcare Solutions Oy, Helsinki, Finland
- Susanna Martikainen
- Department of Health and Social Management, University of Eastern Finland, Kuopio, Finland
- Katri Honkanen
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland
- Matias Inkala
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland
- Tero J Martikainen
- Department of Emergency Care, Kuopio University Hospital, Kuopio, Finland
9. Painter A, Hayhoe B, Riboli-Sasco E, El-Osta A. Online Symptom Checkers: Recommendations for a Vignette-Based Clinical Evaluation Standard. J Med Internet Res 2022; 24:e37408. [DOI: 10.2196/37408]
Abstract
The use of patient-facing online symptom checkers (OSCs) has expanded in recent years, but their accuracy, safety, and impact on patient behaviors and health care systems remain unclear. The lack of a standardized process of clinical evaluation has resulted in significant variation in approaches to OSC validation and evaluation. The aim of this paper is to characterize a set of congruent requirements for a standardized vignette-based clinical evaluation process of OSCs. Discrepancies in the findings of comparative studies to date suggest that different steps in OSC evaluation methodology can significantly influence outcomes. A standardized process with a clear specification for vignette-based clinical evaluation is urgently needed to guide developers and facilitate the objective comparison of OSCs. We propose 15 recommended requirements for an OSC evaluation standard. A third-party evaluation process and protocols for prospective real-world evidence studies should also be prioritized to quality-assure OSC assessment.
10. Wallace W, Chan C, Chidambaram S, Hanna L, Iqbal FM, Acharya A, Normahani P, Ashrafian H, Markar SR, Sounderajah V, Darzi A. The diagnostic and triage accuracy of digital and online symptom checker tools: a systematic review. NPJ Digit Med 2022; 5:118. [PMID: 35977992 PMCID: PMC9385087 DOI: 10.1038/s41746-022-00667-w]
Abstract
Digital and online symptom checkers are an increasingly adopted class of health technologies that enable patients to input their symptoms and biodata to produce a set of likely diagnoses and associated triage advice. However, concerns regarding the accuracy and safety of these symptom checkers have been raised. This systematic review evaluates the accuracy of symptom checkers in providing diagnoses and appropriate triage advice. MEDLINE and Web of Science were searched for studies that used either real or simulated patients to evaluate online or digital symptom checkers. The primary outcomes were the diagnostic and triage accuracy of the symptom checkers. The QUADAS-2 tool was used to assess study quality. Of the 177 studies retrieved, 10 studies met the inclusion criteria. Researchers evaluated the accuracy of symptom checkers using a variety of medical conditions, including ophthalmological conditions, inflammatory arthritides and HIV. A total of 50% of the studies recruited real patients, while the remainder used simulated cases. The diagnostic accuracy of the primary diagnosis was low across included studies (range: 19–37.9%) and varied between individual symptom checkers, despite consistent symptom data input. Triage accuracy (range: 48.8–90.1%) was typically higher than diagnostic accuracy. Overall, the diagnostic and triage accuracy of symptom checkers is variable and generally low. Given the increasing push towards adopting this class of technologies across numerous health systems, this study demonstrates that reliance upon symptom checkers could pose significant patient safety hazards. Large-scale primary studies, based upon real-world data, are warranted to demonstrate the adequate performance of these technologies in a manner that is non-inferior to current best practices. Moreover, an urgent assessment of how these systems are regulated and implemented is required.
Affiliation(s)
- William Wallace
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Calvin Chan
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Swathikan Chidambaram
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Lydia Hanna
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Fahad Mujtaba Iqbal
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK
- Amish Acharya
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK
- Pasha Normahani
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Hutan Ashrafian
- Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK
- Sheraz R Markar
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden
- Nuffield Department of Surgery, Churchill Hospital, University of Oxford, OX3 7LE, Oxford, UK
- Viknesh Sounderajah
- Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK.
- Ara Darzi
- Department of Surgery & Cancer, Imperial College London, St. Mary's Hospital, London, W2 1NY, UK
- Institute of Global Health Innovation, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK