1
Meczner A, Cohen N, Qureshi A, Reza M, Sutaria S, Blount E, Bagyura Z, Malak T. Controlling Inputter Variability in Vignette Studies Assessing Web-Based Symptom Checkers: Evaluation of Current Practice and Recommendations for Isolated Accuracy Metrics. JMIR Form Res 2024; 8:e49907. [PMID: 38820578 PMCID: PMC11179013 DOI: 10.2196/49907] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 08/10/2023] [Accepted: 04/24/2024] [Indexed: 06/02/2024]
Abstract
BACKGROUND The rapid growth of web-based symptom checkers (SCs) is not matched by advances in quality assurance. Currently, there are no widely accepted criteria assessing SCs' performance. Vignette studies are widely used to evaluate SCs, measuring the accuracy of outcome. Accuracy behaves as a composite metric as it is affected by a number of individual SC- and tester-dependent factors. In contrast to clinical studies, vignette studies have a small number of testers. Hence, measuring accuracy alone in vignette studies may not provide a reliable assessment of performance due to tester variability. OBJECTIVE This study aims to investigate the impact of tester variability on the accuracy of outcome of SCs, using clinical vignettes. It further aims to investigate the feasibility of measuring isolated aspects of performance. METHODS Healthily's SC was assessed using 114 vignettes by 3 groups of 3 testers who processed vignettes with different instructions: free interpretation of vignettes (free testers), specified chief complaints (partially free testers), and specified chief complaints with strict instruction for answering additional symptoms (restricted testers). κ statistics were calculated to assess agreement of top outcome condition and recommended triage. Crude and adjusted accuracy was measured against a gold standard. Adjusted accuracy was calculated using only results of consultations identical to the vignette, following a review and selection process. A feasibility study for assessing symptom comprehension of SCs was performed using different variations of 51 chief complaints across 3 SCs. RESULTS Intertester agreement of most likely condition and triage was, respectively, 0.49 and 0.51 for the free tester group, 0.66 and 0.66 for the partially free group, and 0.72 and 0.71 for the restricted group. For the restricted group, accuracy ranged from 43.9% to 57% for individual testers, averaging 50.6% (SD 5.35%). Adjusted accuracy was 56.1%. 
Assessing symptom comprehension was feasible for all 3 SCs. Comprehension scores ranged from 52.9% to 68%. CONCLUSIONS We demonstrated that by improving standardization of the vignette testing process, there is a significant improvement in the agreement of outcome between testers. However, significant variability remained due to uncontrollable tester-dependent factors, reflected by varying outcome accuracy. Tester-dependent factors, combined with a small number of testers, limit the reliability and generalizability of outcome accuracy when used as a composite measure in vignette studies. Measuring and reporting different aspects of SC performance in isolation provides a more reliable assessment of SC performance. We developed an adjusted accuracy measure using a review and selection process to assess data algorithm quality. In addition, we demonstrated that symptom comprehension with different input methods can be feasibly compared. Future studies reporting accuracy need to apply vignette testing standardization and isolated metrics.
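The abstract reports a κ value per tester group but does not name the variant. As a minimal sketch, assuming Fleiss' κ (a common choice when 3 testers each assign one categorical label per vignette), intertester agreement on the top condition could be computed as follows. The function name `fleiss_kappa` and the toy ratings are illustrative and are not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of vignettes, each a list of one label per tester.

    Assumes every vignette was rated by the same number of testers (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every vignette needs the same number of ratings")

    category_totals = Counter()
    observed = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of tester pairs that agree on this vignette.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agreeing_pairs / (n_raters * (n_raters - 1))
    p_observed = observed / n_items

    # Agreement expected by chance from the overall label frequencies.
    total = n_items * n_raters
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # only one label ever used: kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical top-condition outputs from 3 testers on 3 vignettes.
ratings = [
    ["migraine", "migraine", "migraine"],
    ["migraine", "tension headache", "migraine"],
    ["sinusitis", "sinusitis", "tension headache"],
]
print(round(fleiss_kappa(ratings), 2))
```

Running the same function once per tester group would give values comparable in kind to the per-group figures quoted above, provided the study used the same κ variant.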
Affiliation(s)
- András Meczner
  - Healthily, London, United Kingdom
  - Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
- Zsolt Bagyura
  - Institute for Clinical Data Management, Semmelweis University, Budapest, Hungary
2
Hammoud M, Douglas S, Darmach M, Alawneh S, Sanyal S, Kanbour Y. Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study. JMIR AI 2024; 3:e46875. [PMID: 38875676 PMCID: PMC11091811 DOI: 10.2196/46875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 06/15/2023] [Accepted: 03/02/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, whereby patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches. OBJECTIVE This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics. METHODS We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of their differential list, F1-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list's ranking quality, among others. RESULTS The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the differences in the M1, F1-score, and NDCG results between the best-performing and worst-performing symptom checkers or ranges were 65.3%, 39.2%, and 74.2%, respectively. 
The same was observed among the participating human physicians, whereby the M1, F1-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F1-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively. CONCLUSIONS The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)-based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing exponentially.
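Three of the metrics named above can be illustrated for a single vignette. This is a sketch under stated assumptions, not the study's scoring code: relevance is treated as binary (a returned diagnosis either appears in the vignette's gold-standard differential or it does not), returned lists are assumed to contain no duplicates, and all function names and diagnoses are hypothetical.

```python
import math


def m1(returned, main_diagnosis):
    """True when the vignette's main diagnosis is ranked first."""
    return bool(returned) and returned[0] == main_diagnosis


def f1_score(returned, gold):
    """Harmonic mean of precision and recall over the differential list."""
    hits = len(set(returned) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / len(set(returned))
    recall = hits / len(set(gold))
    return 2 * precision * recall / (precision + recall)


def ndcg(returned, gold):
    """Normalized Discounted Cumulative Gain with binary relevance."""
    gold_set = set(gold)
    # Each relevant diagnosis contributes 1 / log2(rank + 1), rank starting at 1.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, diagnosis in enumerate(returned, start=1)
        if diagnosis in gold_set
    )
    # Ideal ordering: all relevant diagnoses placed at the top of the list.
    ideal_hits = min(len(gold_set), len(returned))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical vignette: gold-standard differential vs. a checker's output.
gold = ["migraine", "tension headache"]
returned = ["migraine", "sinusitis", "tension headache"]
print(m1(returned, gold[0]), f1_score(returned, gold), ndcg(returned, gold))
```

The example shows why the metrics can rank systems differently: the list gets the main diagnosis right (M1), is penalized on precision for the extra diagnosis (F1-score), and is penalized on NDCG only for placing the second relevant diagnosis below an irrelevant one.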
3
He X, Zheng X, Ding H. Existing Barriers Faced by and Future Design Recommendations for Direct-to-Consumer Health Care Artificial Intelligence Apps: Scoping Review. J Med Internet Res 2023; 25:e50342. [PMID: 38109173 PMCID: PMC10758939 DOI: 10.2196/50342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Revised: 09/20/2023] [Accepted: 11/28/2023] [Indexed: 12/19/2023]
Abstract
BACKGROUND Direct-to-consumer (DTC) health care artificial intelligence (AI) apps hold the potential to bridge the spatial and temporal disparities in health care resources, but they also come with individual and societal risks due to AI errors. Furthermore, the manner in which consumers interact directly with health care AI is reshaping traditional physician-patient relationships. However, the academic community lacks a systematic comprehension of the research overview for such apps. OBJECTIVE This paper systematically delineated and analyzed the characteristics of included studies, identified existing barriers and design recommendations for DTC health care AI apps mentioned in the literature and also provided a reference for future design and development. METHODS This scoping review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews guidelines and was conducted according to Arksey and O'Malley's 5-stage framework. Peer-reviewed papers on DTC health care AI apps published until March 27, 2023, in Web of Science, Scopus, the ACM Digital Library, IEEE Xplore, PubMed, and Google Scholar were included. The papers were analyzed using Braun and Clarke's reflective thematic analysis approach. RESULTS Of the 2898 papers retrieved, 32 (1.1%) covering this emerging field were included. The included papers were recently published (2018-2023), and most (23/32, 72%) were from developed countries. The medical field was mostly general practice (8/32, 25%). In terms of users and functionalities, some apps were designed solely for single-consumer groups (24/32, 75%), offering disease diagnosis (14/32, 44%), health self-management (8/32, 25%), and health care information inquiry (4/32, 13%). Other apps connected to physicians (5/32, 16%), family members (1/32, 3%), nursing staff (1/32, 3%), and health care departments (2/32, 6%), generally to alert these groups to abnormal conditions of consumer users. 
In addition, 8 barriers and 6 design recommendations related to DTC health care AI apps were identified. Some more subtle obstacles that are particularly worth noting and corresponding design recommendations in consumer-facing health care AI systems, including enhancing human-centered explainability, establishing calibrated trust and addressing overtrust, demonstrating empathy in AI, improving the specialization of consumer-grade products, and expanding the diversity of the test population, were further discussed. CONCLUSIONS The booming DTC health care AI apps present both risks and opportunities, which highlights the need to explore their current status. This paper systematically summarized and sorted the characteristics of the included studies, identified existing barriers faced by, and made future design recommendations for such apps. To the best of our knowledge, this is the first study to systematically summarize and categorize academic research on these apps. Future studies conducting the design and development of such systems could refer to the results of this study, which is crucial to improve the health care services provided by DTC health care AI apps.
Affiliation(s)
- Xin He
  - School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
- Xi Zheng
  - School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
- Huiyuan Ding
  - School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
4
Peven K, Wickham AP, Wilks O, Kaplan YC, Marhol A, Ahmed S, Bamford R, Cunningham AC, Prentice C, Meczner A, Fenech M, Gilbert S, Klepchukova A, Ponzo S, Zhaunova L. Assessment of a Digital Symptom Checker Tool's Accuracy in Suggesting Reproductive Health Conditions: Clinical Vignettes Study. JMIR Mhealth Uhealth 2023; 11:e46718. [PMID: 38051574 PMCID: PMC10731551 DOI: 10.2196/46718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Revised: 09/06/2023] [Accepted: 11/07/2023] [Indexed: 12/07/2023]
Abstract
BACKGROUND Reproductive health conditions such as endometriosis, uterine fibroids, and polycystic ovary syndrome (PCOS) affect a large proportion of women and people who menstruate worldwide. Prevalence estimates for these conditions range from 5% to 40% of women of reproductive age. Long diagnostic delays, up to 12 years, are common and contribute to health complications and increased health care costs. Symptom checker apps provide users with information and tools to better understand their symptoms and thus have the potential to reduce the time to diagnosis for reproductive health conditions. OBJECTIVE This study aimed to evaluate the agreement between clinicians and 3 symptom checkers (developed by Flo Health UK Limited) in assessing symptoms of endometriosis, uterine fibroids, and PCOS using vignettes. We also aimed to present a robust example of vignette case creation, review, and classification in the context of predeployment testing and validation of digital health symptom checker tools. METHODS Independent general practitioners were recruited to create clinical case vignettes of simulated users for the purpose of testing each condition symptom checker; vignettes created for each condition contained a mixture of condition-positive and condition-negative outcomes. A second panel of general practitioners then reviewed, approved, and modified (if necessary) each vignette. A third group of general practitioners reviewed each vignette case and designated a final classification. Vignettes were then entered into the symptom checkers by a fourth, different group of general practitioners. The outcomes of each symptom checker were then compared with the final classification of each vignette to produce accuracy metrics including percent agreement, sensitivity, specificity, positive predictive value, and negative predictive value. RESULTS A total of 24 cases were created per condition. 
Overall, exact matches between the vignette general practitioner classification and the symptom checker outcome were 83% (n=20) for endometriosis, 83% (n=20) for uterine fibroids, and 88% (n=21) for PCOS. For each symptom checker, sensitivity was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, and 100% for PCOS; specificity was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 75% for PCOS; positive predictive value was reported as 81.8% for endometriosis, 84.6% for uterine fibroids, 80% for PCOS; and negative predictive value was reported as 84.6% for endometriosis, 81.8% for uterine fibroids, and 100% for PCOS. CONCLUSIONS The single-condition symptom checkers have high levels of agreement with general practitioner classification for endometriosis, uterine fibroids, and PCOS. Given long delays in diagnosis for many reproductive health conditions, which lead to increased medical costs and potential health complications for individuals and health care providers, innovative health apps and symptom checkers hold the potential to improve care pathways.
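The accuracy metrics listed above all derive from a 2x2 confusion matrix. The sketch below shows the arithmetic; the abstract gives only percentages, so the cell counts used here are inferred values that happen to reproduce the reported figures for 24 vignettes per condition, and they should be read as an assumption rather than data quoted from the paper. The function name is likewise illustrative.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Agreement, sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "agreement": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # condition-positive vignettes flagged
        "specificity": tn / (tn + fp),  # condition-negative vignettes cleared
        "ppv": tp / (tp + fp),          # flagged vignettes that were positive
        "npv": tn / (tn + fn),          # cleared vignettes that were negative
    }


# Inferred counts (assumption): chosen so that the percentages match the abstract.
inferred_counts = {
    "endometriosis": dict(tp=9, fp=2, fn=2, tn=11),
    "uterine fibroids": dict(tp=11, fp=2, fn=2, tn=9),
    "PCOS": dict(tp=12, fp=3, fn=0, tn=9),
}

for condition, counts in inferred_counts.items():
    metrics = diagnostic_metrics(**counts)
    print(condition, {name: round(100 * value, 1) for name, value in metrics.items()})
```

With only 24 vignettes per condition, a single reclassified case moves each metric by several percentage points, which is worth keeping in mind when comparing the three checkers.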
Affiliation(s)
- Stephen Gilbert
  - Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
- Sonia Ponzo
  - Flo Health UK Limited, London, United Kingdom
5
Maia E, Vieira P, Praça I. Empowering Preventive Care with GECA Chatbot. Healthcare (Basel) 2023; 11:2532. [PMID: 37761729 PMCID: PMC10531007 DOI: 10.3390/healthcare11182532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 09/05/2023] [Accepted: 09/08/2023] [Indexed: 09/29/2023]
Abstract
Chatbots have become increasingly popular in the healthcare industry. In the area of preventive care, chatbots can provide personalized and timely solutions that aid individuals in maintaining their well-being and forestalling the development of chronic conditions. This paper presents GECA, a chatbot designed specifically for preventive care, that offers information, advice, and monitoring to patients who are undergoing home treatment, providing a cost-effective, personalized, and engaging solution. Moreover, its adaptable architecture enables extension to other diseases and conditions seamlessly. The chatbot's bilingual capabilities enhance accessibility for a wider range of users, including those with reading or writing difficulties, thereby improving the overall user experience. GECA's ability to connect with external resources offers a higher degree of personalization, which is a crucial aspect in engaging users effectively. The integration of standards and security protocols in these connections allows patient privacy, security and smooth adaptation to emerging healthcare information sources. GECA has demonstrated a remarkable level of accuracy and precision in its interactions with the diverse features, boasting an impressive 97% success rate in delivering accurate responses. Presently, preparations are underway for a pilot project at a Portuguese hospital that will conduct exhaustive testing and evaluate GECA, encompassing aspects such as its effectiveness, efficiency, quality, goal achievability, and user satisfaction.
Affiliation(s)
- Eva Maia
  - GECAD—Research Group on Intelligent Engineering and Computing for Advanced Innovation and Development, School of Engineering of the Polytechnic of Porto (ISEP), 4249-015 Porto, Portugal
- Pedro Vieira
  - GECAD—Research Group on Intelligent Engineering and Computing for Advanced Innovation and Development, School of Engineering of the Polytechnic of Porto (ISEP), 4249-015 Porto, Portugal
- Isabel Praça
  - GECAD—Research Group on Intelligent Engineering and Computing for Advanced Innovation and Development, School of Engineering of the Polytechnic of Porto (ISEP), 4249-015 Porto, Portugal
6
Kopka M, Scatturin L, Napierala H, Fürstenau D, Feufel MA, Balzer F, Schmieding ML. Characteristics of Users and Nonusers of Symptom Checkers in Germany: Cross-Sectional Survey Study. J Med Internet Res 2023; 25:e46231. [PMID: 37338970 DOI: 10.2196/46231] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/06/2023] [Revised: 04/12/2023] [Accepted: 05/03/2023] [Indexed: 06/21/2023]
Abstract
BACKGROUND Previous studies have revealed that users of symptom checkers (SCs, apps that support self-diagnosis and self-triage) are predominantly female, are younger than average, and have higher levels of formal education. Little data are available for Germany, and no study has so far compared usage patterns with people's awareness of SCs and the perception of usefulness. OBJECTIVE We explored the sociodemographic and individual characteristics that are associated with the awareness, usage, and perceived usefulness of SCs in the German population. METHODS We conducted a cross-sectional online survey among 1084 German residents in July 2022 regarding personal characteristics and people's awareness and usage of SCs. Using random sampling from a commercial panel, we collected participant responses stratified by gender, state of residence, income, and age to reflect the German population. We analyzed the collected data exploratively. RESULTS Of all respondents, 16.3% (177/1084) were aware of SCs and 6.5% (71/1084) had used them before. Those aware of SCs were younger (mean 38.8, SD 14.6 years, vs mean 48.3, SD 15.7 years), were more often female (107/177, 60.5%, vs 453/907, 49.9%), and had higher formal education levels (eg, 72/177, 40.7%, vs 238/907, 26.2%, with a university/college degree) than those unaware. The same observation applied to users compared to nonusers. It disappeared, however, when comparing users to nonusers who were aware of SCs. Among users, 40.8% (29/71) considered these tools useful. Those considering them useful reported higher self-efficacy (mean 4.21, SD 0.66, vs mean 3.63, SD 0.81, on a scale of 1-5) and a higher net household income (mean EUR 2591.63, SD EUR 1103.96 [mean US $2798.96, SD US $1192.28], vs mean EUR 1626.60, SD EUR 649.05 [mean US $1756.73, SD US $700.97]) than those who considered them not useful. More women considered SCs unhelpful (13/44, 29.5%) compared to men (4/26, 15.4%). 
CONCLUSIONS Concurring with studies from other countries, our findings show associations between sociodemographic characteristics and SC usage in a German sample: users were on average younger, of higher socioeconomic status, and more commonly female compared to nonusers. However, usage cannot be explained by sociodemographic differences alone. It rather seems that sociodemographics explain who is or is not aware of the technology, but those who are aware of SCs are equally likely to use them, independently of sociodemographic differences. Although in some groups (eg, people with anxiety disorder), more participants reported to know and use SCs, they tended to perceive them as less useful. In other groups (eg, male participants), fewer respondents were aware of SCs, but those who used them perceived them to be more useful. Thus, SCs should be designed to fit specific user needs, and strategies should be developed to help reach individuals who could benefit but are not aware of SCs yet.
Affiliation(s)
- Marvin Kopka
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Lennart Scatturin
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Hendrik Napierala
  - Institute of General Practice and Family Medicine, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Daniel Fürstenau
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Department of Business IT, IT University of Copenhagen, København, Denmark
- Markus A Feufel
  - Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Felix Balzer
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Malte L Schmieding
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
7
Associations between literacy and attitudes toward artificial intelligence–assisted medical consultations: The mediating role of perceived distrust and efficiency of artificial intelligence. COMPUTERS IN HUMAN BEHAVIOR 2023. [DOI: 10.1016/j.chb.2022.107529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
8
Kopka M, Feufel MA, Berner ES, Schmieding ML. How suitable are clinical vignettes for the evaluation of symptom checker apps? A test theoretical perspective. Digit Health 2023; 9:20552076231194929. [PMID: 37614591 PMCID: PMC10444026 DOI: 10.1177/20552076231194929] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/01/2023] [Accepted: 07/28/2023] [Indexed: 08/25/2023]
Abstract
Objective To evaluate the ability of case vignettes to assess the performance of symptom checker applications and to suggest refinements to the methodology used in case vignette-based audit studies. Methods We re-analyzed the publicly available data of two prominent case vignette-based symptom checker audit studies by calculating common metrics of test theory. Furthermore, we developed a new metric, the Capability Comparison Score (CCS), which compares symptom checker capability while controlling for the difficulty of the set of cases each symptom checker evaluated. We then scrutinized whether applying test theory and the CCS altered the performance ranking of the investigated symptom checkers. Results In both studies, most symptom checkers changed their rank order when adjusting the triage capability for item difficulty (ID) with the CCS. The previously reported triage accuracies commonly overestimated the capability of symptom checkers because they did not account for the fact that symptom checkers tend to selectively appraise easier cases (i.e., with high ID values). Also, many case vignettes in both studies showed insufficient (very low and even negative) values of item-total correlation (ITC), suggesting that individual items or the composition of item sets are of low quality. Conclusions A test-theoretic perspective helps identify previously undetected threats to the validity of case vignette-based symptom checker assessments and provides guidance and specific metrics to improve the quality of case vignettes, in particular by controlling for the difficulty of the vignettes an app was (not) able to evaluate correctly. Such measures might prove more meaningful than accuracy alone for the competitive assessment of symptom checkers. Our approach helps elaborate and standardize the methodology used for appraising symptom checker capability, which, ultimately, may yield more reliable results.
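The two test-theory quantities mentioned above can be sketched on a small correctness matrix. This assumes the textbook definitions: item difficulty (ID) as the proportion of symptom checkers that solved a vignette, so higher values mean easier cases, and item-total correlation (ITC) as the Pearson correlation between a vignette's scores and each checker's total on the remaining vignettes. The toy matrix and function names are hypothetical, every checker is assumed to have evaluated every vignette, and the Capability Comparison Score is not reproduced because the abstract does not define it.

```python
def item_difficulty(results):
    """results[app][vignette] is 1 (correct) or 0 (incorrect); returns ID per vignette."""
    n_apps = len(results)
    n_items = len(results[0])
    return [sum(row[j] for row in results) / n_apps for j in range(n_items)]


def pearson(x, y):
    """Pearson correlation; NaN when either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def item_total_correlation(results, j):
    """Corrected ITC: vignette j against each app's total on all other vignettes."""
    item_scores = [row[j] for row in results]
    rest_totals = [sum(row) - row[j] for row in results]
    return pearson(item_scores, rest_totals)


# Hypothetical matrix: 4 symptom checkers (rows) by 4 vignettes (columns).
results = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
print(item_difficulty(results))
print([item_total_correlation(results, j) for j in range(4)])
```

In this toy matrix the last vignette is solved only by a checker that does poorly elsewhere, which yields a negative ITC of the kind the abstract flags as a sign of a low-quality item.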
Affiliation(s)
- Marvin Kopka
  - Department of Psychology and Ergonomics (IPA), Division of Ergonomics, Technische Universität Berlin, Berlin, Germany
  - Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Markus A Feufel
  - Department of Psychology and Ergonomics (IPA), Division of Ergonomics, Technische Universität Berlin, Berlin, Germany
- Eta S Berner
  - Department of Health Services Administration, University of Alabama at Birmingham, Birmingham, AL, USA
- Malte L Schmieding
  - Institute of Medical Informatics, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
9
Ilicki J. Challenges in evaluating the accuracy of AI-containing digital triage systems: A systematic review. PLoS One 2022; 17:e0279636. [PMID: 36574438 PMCID: PMC9794085 DOI: 10.1371/journal.pone.0279636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Accepted: 12/12/2022] [Indexed: 12/28/2022]
Abstract
INTRODUCTION Patient-operated digital triage systems with AI components are becoming increasingly common. However, previous reviews have found a limited amount of research on such systems' accuracy. This systematic review of the literature aimed to identify the main challenges in determining the accuracy of patient-operated digital AI-based triage systems. METHODS A systematic review was designed and conducted in accordance with PRISMA guidelines in October 2021 using PubMed, Scopus and Web of Science. Articles were included if they assessed the accuracy of a patient-operated digital triage system that had an AI-component and could triage a general primary care population. Limitations and other pertinent data were extracted, synthesized and analysed. Risk of bias was not analysed as this review studied the included articles' limitations (rather than results). Results were synthesized qualitatively using a thematic analysis. RESULTS The search generated 76 articles and following exclusion 8 articles (6 primary articles and 2 reviews) were included in the analysis. Articles' limitations were synthesized into three groups: epistemological, ontological and methodological limitations. Limitations varied with regards to intractability and the level to which they can be addressed through methodological choices. Certain methodological limitations related to testing triage systems using vignettes can be addressed through methodological adjustments, whereas epistemological and ontological limitations require that readers of such studies appraise the studies with limitations in mind. DISCUSSION The reviewed literature highlights recurring limitations and challenges in studying the accuracy of patient-operated digital triage systems with AI components. Some of these challenges can be addressed through methodology whereas others are intrinsic to the area of inquiry and involve unavoidable trade-offs. 
Future studies should take these limitations into consideration in order to better address the current knowledge gaps in the literature.
10
Fraser HSF, Cohan G, Koehler C, Anderson J, Lawrence A, Pateña J, Bacher I, Ranney ML. Evaluation of Diagnostic and Triage Accuracy and Usability of a Symptom Checker in an Emergency Department: Observational Study. JMIR Mhealth Uhealth 2022; 10:e38364. [PMID: 36121688 PMCID: PMC9531004 DOI: 10.2196/38364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 05/31/2022] [Accepted: 06/10/2022] [Indexed: 11/26/2022]
Abstract
Background Symptom checkers are clinical decision support apps for patients, used by tens of millions of people annually. They are designed to provide diagnostic and triage advice and assist users in seeking the appropriate level of care. Little evidence is available regarding their diagnostic and triage accuracy with direct use by patients for urgent conditions. Objective The aim of this study is to determine the diagnostic and triage accuracy and usability of a symptom checker in use by patients presenting to an emergency department (ED). Methods We recruited a convenience sample of English-speaking patients presenting for care in an urban ED. Each consenting patient used a leading symptom checker from Ada Health before the ED evaluation. Diagnostic accuracy was evaluated by comparing the symptom checker’s diagnoses and those of 3 independent emergency physicians viewing the patient-entered symptom data, with the final diagnoses from the ED evaluation. The Ada diagnoses and triage were also critiqued by the independent physicians. The patients completed a usability survey based on the Technology Acceptance Model. Results A total of 40 (80%) of the 50 participants approached completed the symptom checker assessment and usability survey. Their mean age was 39.3 (SD 15.9; range 18-76) years, and they were 65% (26/40) female, 68% (27/40) White, 48% (19/40) Hispanic or Latino, and 13% (5/40) Black or African American. Some cases had missing data or a lack of a clear ED diagnosis; 75% (30/40) were included in the analysis of diagnosis, and 93% (37/40) for triage. The sensitivity for at least one of the final ED diagnoses by Ada (based on its top 5 diagnoses) was 70% (95% CI 54%-86%), close to the mean sensitivity for the 3 physicians (on their top 3 diagnoses) of 68.9%. The physicians rated the Ada triage decisions as 62% (23/37) fully agree and 24% (9/37) safe but too cautious. 
It was rated as unsafe and too risky in 22% (8/37) of cases by at least one physician, in 14% (5/37) of cases by at least two physicians, and in 5% (2/37) of cases by all 3 physicians. Usability was rated highly; participants agreed or strongly agreed with the 7 Technology Acceptance Model usability questions with a mean score of 84.6%, although “satisfaction” and “enjoyment” were rated low. Conclusions This study provides preliminary evidence that a symptom checker can provide acceptable usability and diagnostic accuracy for patients with various urgent conditions. A total of 14% (5/37) of symptom checker triage recommendations were deemed unsafe and too risky by at least two physicians based on the symptoms recorded, similar to the results of studies on telephone and nurse triage. Larger studies are needed of diagnosis and triage performance with direct patient use in different clinical environments.
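The reported sensitivity of 70% (95% CI 54%-86%) on 30 analyzed cases can be checked with a quick calculation. The abstract does not say how the interval was computed; the sketch below assumes a normal-approximation (Wald) interval and a numerator of 21 correct cases (70% of 30), both of which are inferences rather than stated facts. The function name is illustrative.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion, clipped to [0, 1]."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Assumed counts: 21 of the 30 cases included in the diagnosis analysis.
low, high = wald_interval(21, 30)
print(f"{low:.0%} to {high:.0%}")
```

Under these assumptions the bounds round to the interval quoted in the abstract. With samples this small, a Wilson or exact interval would usually be preferred, and it would give somewhat different bounds.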
Affiliation(s)
- Hamish S F Fraser
  - Brown Center for Biomedical Informatics, Warren Alpert Medical School, Brown University, Providence, RI, United States
  - School of Public Health, Brown University, Providence, RI, United States
- Gregory Cohan
  - Warren Alpert Medical School, Brown University, Providence, RI, United States
- Christopher Koehler
  - Department of Emergency Medicine, Brown University, Providence, RI, United States
- Jared Anderson
  - Department of Emergency Medicine, Brown University, Providence, RI, United States
- Alexis Lawrence
  - Harvard Medical Faculty Physicians, Department of Emergency Medicine, St Luke's Hospital, New Bedford, MA, United States
- John Pateña
  - Brown-Lifespan Center for Digital Health, Providence, RI, United States
- Ian Bacher
  - Brown Center for Biomedical Informatics, Warren Alpert Medical School, Brown University, Providence, RI, United States
- Megan L Ranney
  - School of Public Health, Brown University, Providence, RI, United States
  - Department of Emergency Medicine, Brown University, Providence, RI, United States
  - Brown-Lifespan Center for Digital Health, Providence, RI, United States
11
Chen PC, Lu YR, Kang YN, Chang CC. The Accuracy of Artificial Intelligence in the Endoscopic Diagnosis of Early Gastric Cancer: Pooled Analysis Study. J Med Internet Res 2022; 24:e27694. [PMID: 35576561 PMCID: PMC9152716 DOI: 10.2196/27694] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 02/02/2021] [Revised: 10/23/2021] [Accepted: 11/15/2021] [Indexed: 12/24/2022]
Abstract
BACKGROUND Artificial intelligence (AI) for gastric cancer diagnosis has been discussed in recent years. The role of AI in early gastric cancer is more important than in advanced gastric cancer since early gastric cancer is not easily identified in clinical practice. However, to our knowledge, past syntheses appear to have limited focus on the populations with early gastric cancer. OBJECTIVE The purpose of this study is to evaluate the diagnostic accuracy of AI in the diagnosis of early gastric cancer from endoscopic images. METHODS We conducted a systematic review from database inception to June 2020 of all studies assessing the performance of AI in the endoscopic diagnosis of early gastric cancer. Studies not concerning early gastric cancer were excluded. The outcome of interest was the diagnostic accuracy (comprising sensitivity, specificity, and accuracy) of AI systems. Study quality was assessed on the basis of the revised Quality Assessment of Diagnostic Accuracy Studies. Meta-analysis was primarily based on a bivariate mixed-effects model. A summary receiver operating curve and a hierarchical summary receiver operating curve were constructed, and the area under the curve was computed. RESULTS We analyzed 12 retrospective case control studies (n=11,685) in which AI identified early gastric cancer from endoscopic images. The pooled sensitivity and specificity of AI for early gastric cancer diagnosis were 0.86 (95% CI 0.75-0.92) and 0.90 (95% CI 0.84-0.93), respectively. The area under the curve was 0.94. Sensitivity analysis of studies using support vector machines and narrow-band imaging demonstrated more consistent results. CONCLUSIONS For early gastric cancer, to our knowledge, this was the first synthesis study on the use of endoscopic images in AI in diagnosis. AI may support the diagnosis of early gastric cancer. However, the collocation of imaging techniques and optimal algorithms remain unclear. 
Competing models of AI for the diagnosis of early gastric cancer are worthy of future investigation. TRIAL REGISTRATION PROSPERO CRD42020193223; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=193223.
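The three outcome measures named in the methods (sensitivity, specificity, and accuracy) all derive from a 2x2 confusion matrix. The sketch below is a minimal single-study illustration, not the bivariate mixed-effects pooling used in the meta-analysis; the counts are made up and were chosen only so that the results echo the pooled values.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cancers that were detected
        "specificity": tn / (tn + fp),  # share of non-cancers correctly ruled out
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts for one study (not data from the review).
metrics = diagnostic_metrics(tp=86, fp=10, fn=14, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Pooling across studies needs more than averaging these per-study values, because sensitivity and specificity are correlated through each study's decision threshold; that is the motivation for the bivariate model.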
Affiliation(s)
- Pei-Chin Chen
  - Department of Internal Medicine, Taipei Medical University Hospital, Taipei, Taiwan
  - Department of General Medicine, Taipei Medical University Hospital, Taipei, Taiwan
- Yun-Ru Lu
  - Department of General Medicine, Taipei Medical University Hospital, Taipei, Taiwan
  - Department of Anesthesiology, Wan Fang Hospital, Taipei Medical University, Taipei, Taiwan
- Yi-No Kang
  - Evidence-Based Medicine Center, Wan Fang Hospital, Taipei Medical University, Taipei, Taiwan
  - Institute of Health Behaviors and Community Sciences, College of Public Health, National Taiwan University, Taipei, Taiwan
  - Cochrane Taiwan, Taipei Medical University, Taipei, Taiwan
  - Department of Health Care Management, College of Health Technology, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan
- Chun-Chao Chang
  - Division of Gastroenterology and Hepatology, Department of Internal Medicine, Taipei Medical University Hospital, Taipei, Taiwan
12
Schmieding ML, Kopka M, Schmidt K, Schulz-Niethammer S, Balzer F, Feufel MA. Triage Accuracy of Symptom Checker Apps: 5-Year Follow-up Evaluation. J Med Internet Res 2022; 24:e31810. [PMID: 35536633 PMCID: PMC9131144 DOI: 10.2196/31810] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 07/07/2021] [Revised: 11/19/2021] [Accepted: 01/30/2022] [Indexed: 12/16/2022]
Abstract
Background Symptom checkers are digital tools assisting laypersons in self-assessing the urgency and potential causes of their medical complaints. They are widely used but face concerns from both patients and health care professionals, especially regarding their accuracy. A 2015 landmark study substantiated these concerns using case vignettes to demonstrate that symptom checkers commonly err in their triage assessment. Objective This study aims to revisit the landmark index study to investigate whether and how symptom checkers’ capabilities have evolved since 2015 and how they currently compare with laypersons’ stand-alone triage appraisal. Methods In early 2020, we searched for smartphone and web-based applications providing triage advice. We evaluated these apps on the same 45 case vignettes as the index study. Using descriptive statistics, we compared our findings with those of the index study and with publicly available data on laypersons’ triage capability. Results We retrieved 22 symptom checkers providing triage advice. The median triage accuracy in 2020 (55.8%, IQR 15.1%) was close to that in 2015 (59.1%, IQR 15.5%). The apps in 2020 were less risk averse (odds 1.11:1, the ratio of overtriage errors to undertriage errors) than those in 2015 (odds 2.82:1), missing >40% of emergencies. Few apps outperformed laypersons in either deciding whether emergency care was required or whether self-care was sufficient. No apps outperformed the laypersons on both decisions. Conclusions Triage performance of symptom checkers has, on average, not improved over the course of 5 years. It decreased in 2 use cases (advice on when emergency care is required and when no health care is needed for the moment). However, triage capability varies widely within the sample of symptom checkers. Whether it is beneficial to seek advice from symptom checkers depends on the app chosen and on the specific question to be answered. 
Future research should develop resources (eg, case vignette repositories) to audit the capabilities of symptom checkers continuously and independently and provide guidance on when and to whom they should be recommended.
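The two headline metrics in this abstract, triage accuracy and the ratio of overtriage to undertriage errors, can be sketched as follows. The three-level urgency scheme, the function name `triage_summary`, and the six example results are assumptions made for illustration; they are not the study's data or its exact scoring rules.

```python
# Urgency levels ordered from least to most urgent (assumed three-level scheme).
LEVELS = {"self-care": 0, "non-emergency": 1, "emergency": 2}

def triage_summary(pairs):
    """Summarize (gold_standard, app_advice) pairs of urgency level names."""
    correct = over = under = 0
    for gold, advice in pairs:
        diff = LEVELS[advice] - LEVELS[gold]
        if diff == 0:
            correct += 1
        elif diff > 0:
            over += 1   # app advised more urgent care than needed
        else:
            under += 1  # app advised less urgent care than needed
    total = correct + over + under
    if total == 0:
        raise ValueError("no vignette results supplied")
    return {
        "accuracy": correct / total,
        "overtriage": over,
        "undertriage": under,
        # Values above 1 indicate a risk-averse app.
        "over_to_under": over / under if under else float("inf"),
    }

# Made-up results for one app on six vignettes (not data from the study).
example = [
    ("emergency", "emergency"),
    ("emergency", "non-emergency"),
    ("non-emergency", "emergency"),
    ("non-emergency", "non-emergency"),
    ("self-care", "non-emergency"),
    ("self-care", "self-care"),
]
summary = triage_summary(example)
print(summary)
```

Read this way, the reported drop from 2.82:1 to 1.11:1 means that errors in 2020 were split almost evenly between over- and undertriage, whereas in 2015 they leaned heavily toward overtriage.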
Affiliation(s)
- Malte L Schmieding
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Marvin Kopka
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Konrad Schmidt
  - Institute of General Practice and Family Medicine, Jena University Hospital, Jena, Germany
  - Institute of General Practice and Family Medicine, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Sven Schulz-Niethammer
  - Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
- Felix Balzer
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Markus A Feufel
  - Division of Ergonomics, Department of Psychology and Ergonomics, Technische Universität Berlin, Berlin, Germany
13
Ollier J, Nißen M, von Wangenheim F. The Terms of "You(s)": How the Term of Address Used by Conversational Agents Influences User Evaluations in French and German Linguaculture. Front Public Health 2022; 9:691595. [PMID: 35071147 PMCID: PMC8767023 DOI: 10.3389/fpubh.2021.691595] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/06/2021] [Accepted: 12/03/2021] [Indexed: 11/26/2022]
Abstract
Background: Conversational agents (CAs) are a novel approach to delivering digital health interventions. In human interactions, terms of address often change depending on the context or relationship between interlocutors. In many languages, this encompasses T/V distinction—formal and informal forms of the second-person pronoun “You”—that conveys different levels of familiarity. Yet, few research articles have examined whether CAs' use of T/V distinction across language contexts affects users' evaluations of digital health applications. Methods: In an online experiment (N = 284), we manipulated a public health CA prototype to use either informal or formal T/V distinction forms in French (“tu” vs. “vous”) and German (“du” vs. “Sie”) language settings. A MANCOVA and post-hoc tests were performed to examine the effects of the independent variables (i.e., T/V distinction and Language) and the moderating role of users' demographic profile (i.e., Age and Gender) on eleven user evaluation variables. These were related to four themes: (i) Sociability, (ii) CA-User Collaboration, (iii) Service Evaluation, and (iv) Behavioral Intentions. Results: Results showed a four-way interaction between T/V Distinction, Language, Age, and Gender, influencing user evaluations across all outcome themes. For French speakers, when the informal “T form” (“Tu”) was used, higher user evaluation scores were generated for younger women and older men (e.g., the CA felt more humanlike or individuals were more likely to recommend the CA), whereas when the formal “V form” (“Vous”) was used, higher user evaluation scores were generated for younger men and older women. For German speakers, when the informal T form (“Du”) was used, younger users' evaluations were comparable regardless of Gender, however, as individuals' Age increased, the use of “Du” resulted in lower user evaluation scores, with this effect more pronounced in men. 
When the formal V form (“Sie”) was used, user evaluation scores were relatively stable regardless of Gender and increased only slightly with Age. Conclusions: Results highlight how user evaluations of CAs vary with the T/V distinction used and the language setting, and show that even within a culturally homogeneous language group, evaluations vary with user demographics, underscoring the importance of personalizing CA language.
Affiliation(s)
- Joseph Ollier
  - Chair of Technology Marketing, Department of Management, Economics and Technology (D-MTEC), ETH Zürich, Zurich, Switzerland
  - Centre for Digital Health Interventions (CDHI), Department of Management, Economics and Technology (D-MTEC), ETH Zürich, Zurich, Switzerland
- Marcia Nißen
  - Chair of Technology Marketing, Department of Management, Economics and Technology (D-MTEC), ETH Zürich, Zurich, Switzerland
  - Centre for Digital Health Interventions (CDHI), Department of Management, Economics and Technology (D-MTEC), ETH Zürich, Zurich, Switzerland
- Florian von Wangenheim
  - Chair of Technology Marketing, Department of Management, Economics and Technology (D-MTEC), ETH Zürich, Zurich, Switzerland
  - Centre for Digital Health Interventions (CDHI), Department of Management, Economics and Technology (D-MTEC), ETH Zürich, Zurich, Switzerland
14
Kopka M, Schmieding ML, Rieger T, Roesler E, Balzer F, Feufel MA. Trust Me, I’m Not a Doctor! Determinants of Laypersons’ Trust in Medical Decision Aids: Experimental Study (Preprint). JMIR Hum Factors 2021; 9:e35219. [PMID: 35503248 PMCID: PMC9115664 DOI: 10.2196/35219] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/25/2021] [Revised: 02/09/2022] [Accepted: 03/06/2022] [Indexed: 11/13/2022]
Affiliation(s)
- Marvin Kopka
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
  - Cognitive Psychology and Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Malte L Schmieding
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Tobias Rieger
  - Work, Engineering and Organizational Psychology, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Eileen Roesler
  - Work, Engineering and Organizational Psychology, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
- Felix Balzer
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Markus A Feufel
  - Division of Ergonomics, Department of Psychology and Ergonomics (IPA), Technische Universität Berlin, Berlin, Germany
15
Ćirković A. Author's Reply to: Periodic Manual Algorithm Updates and Generalizability: A Developer's Response. Comment on "Evaluation of Four Artificial Intelligence-Assisted Self-Diagnosis Apps on Three Diagnoses: Two-Year Follow-Up Study". J Med Internet Res 2021; 23:e29336. [PMID: 34132643 PMCID: PMC8277319 DOI: 10.2196/29336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2021] [Accepted: 05/13/2021] [Indexed: 11/16/2022]
16
Gilbert S, Fenech M, Idris A, Türk E. Periodic Manual Algorithm Updates and Generalizability: A Developer's Response. Comment on "Evaluation of Four Artificial Intelligence-Assisted Self-Diagnosis Apps on Three Diagnoses: Two-Year Follow-Up Study". J Med Internet Res 2021; 23:e26514. [PMID: 34132641 PMCID: PMC8277354 DOI: 10.2196/26514] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/15/2020] [Accepted: 05/13/2021] [Indexed: 01/16/2023]
17
Deng L, Chen L, Yang T, Liu M, Li S, Jiang T. Constructing High-Fidelity Phenotype Knowledge Graphs for Infectious Diseases With a Fine-Grained Semantic Information Model: Development and Usability Study. J Med Internet Res 2021; 23:e26892. [PMID: 34128811 PMCID: PMC8277235 DOI: 10.2196/26892] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/02/2021] [Revised: 04/01/2021] [Accepted: 05/06/2021] [Indexed: 01/19/2023]
Abstract
BACKGROUND Phenotypes characterize the clinical manifestations of diseases and provide important information for diagnosis. Therefore, the construction of phenotype knowledge graphs for diseases is valuable to the development of artificial intelligence in medicine. However, phenotype knowledge graphs in current knowledge bases such as WikiData and DBpedia are coarse-grained knowledge graphs because they only consider the core concepts of phenotypes while neglecting the details (attributes) associated with these phenotypes. OBJECTIVE To characterize the details of disease phenotypes for clinical guidelines, we proposed a fine-grained semantic information model named PhenoSSU (semantic structured unit of phenotypes). METHODS PhenoSSU is an "entity-attribute-value" model by its very nature, and it aims to capture the full semantic information underlying phenotype descriptions with a series of attributes and values. A total of 193 clinical guidelines for infectious diseases from Wikipedia were selected as the study corpus, and 12 attributes from SNOMED-CT were introduced into the PhenoSSU model based on the co-occurrences of phenotype concepts and attribute values. The expressive power of the PhenoSSU model was evaluated by analyzing whether PhenoSSU instances could capture the full semantics underlying the descriptions of the corresponding phenotypes. To automatically construct fine-grained phenotype knowledge graphs, a hybrid strategy that first recognized phenotype concepts with the MetaMap tool and then predicted the attribute values of phenotypes with machine learning classifiers was developed. RESULTS Fine-grained phenotype knowledge graphs of 193 infectious diseases were manually constructed with the BRAT annotation tool. A total of 4020 PhenoSSU instances were annotated in these knowledge graphs, and 3757 of them (89.5%) were found to be able to capture the full semantics underlying the descriptions of the corresponding phenotypes listed in clinical guidelines. 
By comparison, other information models, such as the clinical element model and the HL7 fast health care interoperability resource model, could only capture the full semantics underlying 48.4% (2034/4020) and 21.8% (914/4020) of the descriptions of phenotypes listed in clinical guidelines, respectively. The hybrid strategy achieved an F1-score of 0.732 for the subtask of phenotype concept recognition and an average weighted accuracy of 0.776 for the subtask of attribute value prediction. CONCLUSIONS PhenoSSU is an effective information model for the precise representation of phenotype knowledge for clinical guidelines, and machine learning can be used to improve the efficiency of constructing PhenoSSU-based knowledge graphs. Our work will potentially shift the focus of medical knowledge engineering from a coarse-grained level to a more fine-grained level.
Affiliation(s)
- Lizong Deng
  - Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
  - Suzhou Institute of Systems Medicine, Suzhou, China
- Luming Chen
  - Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
  - Suzhou Institute of Systems Medicine, Suzhou, China
- Tao Yang
  - Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
  - Suzhou Institute of Systems Medicine, Suzhou, China
- Mi Liu
  - Jiangsu Institute of Clinical Immunology, Jiangsu Key Laboratory of Clinical Immunology, The First Affiliated Hospital of Soochow University, Suzhou, China
- Shicheng Li
  - Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
  - Suzhou Institute of Systems Medicine, Suzhou, China
- Taijiao Jiang
  - Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
  - Suzhou Institute of Systems Medicine, Suzhou, China
  - Guangzhou Laboratory, Guangzhou, China