1. Šuster S, Baldwin T, Verspoor K. Zero- and few-shot prompting of generative large language models provides weak assessment of risk of bias in clinical trials. Res Synth Methods 2024;15:988-1000. PMID: 39176994. DOI: 10.1002/jrsm.1749.
Abstract
Existing systems for automating the assessment of risk-of-bias (RoB) in medical studies are supervised approaches that require substantial training data to work well. However, recent revisions to RoB guidelines have resulted in a scarcity of available training data. In this study, we investigate the effectiveness of generative large language models (LLMs) for assessing RoB. Their application requires little or no training data and, if successful, could serve as a valuable tool to assist human experts during the construction of systematic reviews. Following Cochrane's latest guidelines (RoB2) designed for human reviewers, we prepare instructions that are fed as input to LLMs, which then infer the risk associated with a trial publication. We distinguish between two modelling tasks: directly predicting RoB2 from text; and employing decomposition, in which a RoB2 decision is made after the LLM responds to a series of signalling questions. We curate new testing data sets and evaluate the performance of four general- and medical-domain LLMs. The results fall short of expectations, with LLMs seldom surpassing trivial baselines. On the direct RoB2 prediction test set (n = 5993), LLMs perform on a par with the baselines (F1: 0.1-0.2). In the decomposition task setup (n = 28,150), similar F1 scores are observed. Our additional comparative evaluation on RoB1 data also reveals results substantially below those of a supervised system. This testifies to the difficulty of solving this task based on (complex) instructions alone. Assessing RoB2 thus currently seems beyond the reach of LLMs as an assistive technology.
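Editor's note: to make the direct-prediction setup concrete, here is a minimal Python sketch under stated assumptions. The prompt wording and the query_llm() helper are hypothetical stand-ins for the paper's actual RoB2 instructions; only the three RoB2 judgement labels and the F1-based scoring come from the abstract, and macro-averaging is one plausible reading of the reported F1.

```python
# Sketch of the "direct RoB2 prediction" task: the LLM receives RoB2-style
# instructions plus the trial text and must return one overall judgement.
# The prompt text and query_llm() are hypothetical, not the authors' own.
from sklearn.metrics import f1_score

ROB2_LABELS = ["low risk", "some concerns", "high risk"]

def build_prompt(trial_text: str) -> str:
    return (
        "Following Cochrane's RoB2 guidance, assess the overall risk of "
        "bias of this randomized trial. Answer with exactly one of: "
        + ", ".join(ROB2_LABELS) + ".\n\nTrial report:\n" + trial_text
    )

def parse_label(response: str) -> str:
    # Map a free-text reply onto the label set; unmatched replies fall
    # back to the middle category.
    response = response.lower()
    for label in ROB2_LABELS:
        if label in response:
            return label
    return "some concerns"

# Evaluation as in the abstract (macro-averaged F1 is an assumed choice):
# preds = [parse_label(query_llm(build_prompt(t))) for t in trial_texts]
# print(f1_score(gold_labels, preds, labels=ROB2_LABELS, average="macro"))
```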
Affiliation(s)
- Simon Šuster
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Timothy Baldwin
- Department of Natural Language Processing, MBZUAI, Abu Dhabi, United Arab Emirates
- Karin Verspoor
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia
- School of Computing Technologies, RMIT University, Melbourne, Victoria, Australia
2. Kolaski K, Clarke M, Logan LR. Analysis of risk of bias assessments in a sample of intervention systematic reviews, Part II: focus on risk of bias tools reveals few meet current appraisal standards. J Clin Epidemiol 2024;174:111460. PMID: 39025376. DOI: 10.1016/j.jclinepi.2024.111460.
Abstract
OBJECTIVES Risk of bias (RoB) assessment is a critical part of any systematic review (SR). There are multiple tools available for assessing RoB of the studies included in a SR. The conduct of these assessments in intervention SRs is addressed by three items in AMSTAR-2, considered the preferred tool for critically appraising an intervention SR. This study focuses attention on item 9, which assesses the ability of a RoB tool to adequately address sources of bias, particularly in randomized trials (RCTs) and nonrandomized studies of interventions (NRSI). Our main objective is to report the detailed results of our examination of both Cochrane and non-Cochrane RoB tools and distinguish those that meet AMSTAR-2 item 9 appraisal standards. STUDY DESIGN AND SETTING We identified critical appraisal tools reported in a sample of 126 SRs reporting on interventions for persons with cerebral palsy published from 2014 to 2021. Eligible tools were those that had been used to assess the primary studies included in these SRs and for which assessment results were reported in enough detail to allow appraisal of the tool. We identified the version of the tool applied as original, modified, or novel and established the applicable study designs as intended by the tools' developers. We then evaluated the potential ability of these tools to assess the four sources of bias specified in AMSTAR-2 item 9 for RCTs and NRSI. We adapted item 9 to appraise tools applied to single-case experimental designs, which we also encountered in this sample of SRs. RESULTS Most of the eligible tools are recognized by name in the published literature and were applied in the original or modified form. Modifications were applied with considerable variability across the sample. Of the 37 tools we examined, those judged to fully meet the appraisal standards for RCTs included all the Cochrane tools, the original and modified Downs and Black Checklist, and the quality assessment standard for a cross-over study by Ding et al.; for NRSI, these included all the Cochrane tools, the original and modified Downs and Black Checklist, and the Research Triangle Institute item bank on Risk of Bias and Precision of Observational Studies for NRSI. In general, tools developed for a specific study design were judged to meet the appraisal standards fully or partially for that design. These results suggest it is unlikely that a single tool will be adequate by AMSTAR-2 item 9 appraisal standards for an intervention SR that includes studies of various designs. CONCLUSION To our knowledge, this is the first resource providing SR authors with practical information about the appropriateness and adequacy of RoB tools by the appraisal standards specified in AMSTAR-2 item 9 for RCTs and NRSI. We propose similar methods for appraisal of tools applied to single-case experimental designs. We encourage authors to seek contemporary RoB tools developed for use in healthcare-related intervention SRs and designed to evaluate relevant study design features. The tools should address attributes unique to the review topic and research question but not be subjected to unjustified and excessive modifications. We promote recognition of the potential shortcomings of both Cochrane and non-Cochrane RoB tools, even those that perform well by AMSTAR-2 item 9 appraisal standards.
Affiliation(s)
- Kat Kolaski
- Departments of Orthopaedic Surgery, Pediatrics, and Neurology, Wake Forest School of Medicine, Winston-Salem, NC, USA.
- Mike Clarke
- Director of Northern Ireland Methodology Hub; School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Lynne Romeiser Logan
- Department of Physical Medicine and Rehabilitation, SUNY Upstate Medical University, Syracuse, NY, USA
3. Affengruber L, van der Maten MM, Spiero I, Nussbaumer-Streit B, Mahmić-Kaknjo M, Ellen ME, Goossen K, Kantorova L, Hooft L, Riva N, Poulentzas G, Lalagkas PN, Silva AG, Sassano M, Sfetcu R, Marqués ME, Friessova T, Baladia E, Pezzullo AM, Martinez P, Gartlehner G, Spijker R. An exploration of available methods and tools to improve the efficiency of systematic review production: a scoping review. BMC Med Res Methodol 2024;24:210. PMID: 39294580. PMCID: PMC11409535. DOI: 10.1186/s12874-024-02320-4.
Abstract
BACKGROUND Systematic reviews (SRs) are time-consuming and labor-intensive to perform. With the growing number of scientific publications, the SR development process becomes even more laborious. This is problematic because timely SR evidence is essential for decision-making in evidence-based healthcare and policymaking. Numerous methods and tools that accelerate SR development have recently emerged. To date, no scoping review has been conducted to provide a comprehensive summary of methods and ready-to-use tools to improve efficiency in SR production. OBJECTIVE To present an overview of primary studies that evaluated ready-to-use tools or review methods intended to improve efficiency in the review process. METHODS We conducted a scoping review. An information specialist performed a systematic literature search in four databases, supplemented with citation-based and grey literature searching. We included studies reporting the performance of methods and ready-to-use tools for improving efficiency when producing or updating a SR in the health field. We performed dual, independent title and abstract screening, full-text selection, and data extraction. The results were analyzed descriptively and presented narratively. RESULTS We included 103 studies: 51 studies reported on methods, 54 studies on tools, and 2 studies reported on both methods and tools to make SR production more efficient. A total of 72 studies evaluated the validity (n = 69) or usability (n = 3) of one method (n = 33) or tool (n = 39), and 31 studies performed comparative analyses of different methods (n = 15) or tools (n = 16). Twenty studies conducted prospective evaluations in real-time workflows. Most studies evaluated methods or tools that aimed at screening titles and abstracts (n = 42) and literature searching (n = 24), while for other steps of the SR process, only a few studies were found. Regarding the outcomes included, most studies reported on validity outcomes (n = 84), while outcomes such as impact on results (n = 23), time-saving (n = 24), usability (n = 13), and cost-saving (n = 3) were less often evaluated. CONCLUSION For title and abstract screening and literature searching, various evaluated methods and tools are available that aim at improving the efficiency of SR production. However, only a few studies have addressed the influence of these methods and tools in real-world workflows, and few evaluate methods or tools supporting the remaining SR tasks. Additionally, while validity outcomes are frequently reported, there is a lack of evaluation regarding other outcomes.
Affiliation(s)
- Lisa Affengruber
- Cochrane Austria, Department for Evidence-Based Medicine and Clinical Epidemiology, University for Continuing Education Krems, Krems an der Donau, Austria.
- School for Public Health and Primary Care (CAPHRI), Maastricht University, Maastricht, the Netherlands.
- Miriam M van der Maten
- Knowledge Institute of Federation of Medical Specialists, Utrecht, The Netherlands
- Cochrane Netherlands, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Isa Spiero
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Barbara Nussbaumer-Streit
- Cochrane Austria, Department for Evidence-Based Medicine and Clinical Epidemiology, University for Continuing Education Krems, Krems an der Donau, Austria
- Mersiha Mahmić-Kaknjo
- Zenica Cantonal Hospital, Department for Clinical Pharmacology, Zenica, Bosnia and Herzegovina
- Moriah E Ellen
- Department of Health Policy and Management, Guilford Glazer Faculty of Business and Management and Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
- Institute of Health Policy Management and Evaluation, Dalla Lana School Of Public Health, University of Toronto, Toronto, Canada
- McMaster Health Forum, McMaster University, Hamilton, Canada
- Käthe Goossen
- Witten/Herdecke University, Institute for Research in Operative Medicine (IFOM), Cologne, Germany
- Lucia Kantorova
- Czech National Centre for Evidence-Based Healthcare and Knowledge Translation (Cochrane Czech Republic, Czech CEBHC: JBI Centre of Excellence, Masaryk University GRADE Centre), Institute of Biostatistics and Analyses, Faculty of Medicine, Masaryk University, Brno, Czech Republic
- Lotty Hooft
- Cochrane Netherlands, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Nicoletta Riva
- Department of Pathology, Faculty of Medicine and Surgery, University of Malta, Msida, Malta
- Georgios Poulentzas
- Laboratory of Hygiene and Environmental Protection, Department of Medicine, Democritus University of Thrace, Alexandroupolis, Greece
- Panagiotis Nikolaos Lalagkas
- Laboratory of Hygiene and Environmental Protection, Department of Medicine, Democritus University of Thrace, Alexandroupolis, Greece
- Anabela G Silva
- CINTESIS.RISE@UA, University of Aveiro, Campus Universitário de Santiago, Aveiro, Portugal
- Michele Sassano
- Section of Hygiene, University Department of Life Sciences and Public Health, Università Cattolica del Sacro Cuore, Rome, Italy
- Department of Medical and Surgical Sciences, University of Bologna, Bologna, Italy
- Raluca Sfetcu
- National Institute for Health Services Management, Bucharest, Romania
- Spiru Haret University, Faculty of Psychology and Educational Sciences, Bucharest, Romania
- María E Marqués
- Red de Nutrición Basada en La Evidencia, Academia Española de Nutrición y Dietética, Pamplona, Spain
- Tereza Friessova
- Department of Health Sciences, Faculty of Medicine, Masaryk University, Brno, Czech Republic
- Eduard Baladia
- Red de Nutrición Basada en La Evidencia, Academia Española de Nutrición y Dietética, Pamplona, Spain
- Angelo Maria Pezzullo
- Section of Hygiene, University Department of Life Sciences and Public Health, Università Cattolica del Sacro Cuore, Rome, Italy
- Patricia Martinez
- Red de Nutrición Basada en La Evidencia, Academia Española de Nutrición y Dietética, Pamplona, Spain
- Techné Research Group, Department of Knowledge Engineering of the Faculty of Science, University of Granada, Granada, Spain
- Gerald Gartlehner
- Cochrane Austria, Department for Evidence-Based Medicine and Clinical Epidemiology, University for Continuing Education Krems, Krems an der Donau, Austria
- RTI International, Center for Public Health Methods, Research Triangle Park, Durham, NC, USA
- René Spijker
- Cochrane Netherlands, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Amsterdam UMC, University of Amsterdam, Medical Library, Amsterdam Public Health, Amsterdam, the Netherlands
4. Leenaars CHC, Stafleu FR, Häger C, Bleich A. A case study of the informative value of risk of bias and reporting quality assessments for systematic reviews. Syst Rev 2024;13:230. PMID: 39244603. PMCID: PMC11380326. DOI: 10.1186/s13643-024-02650-w.
Abstract
While undisputedly important, and part of any systematic review (SR) by definition, evaluation of the risk of bias within the included studies is one of the most time-consuming parts of performing an SR. In this paper, we describe a case study comprising an extensive analysis of risk of bias (RoB) and reporting quality (RQ) assessment from a previously published review (CRD42021236047). The review included both animal and human studies, which compared baseline diseased subjects with controls, assessed the effects of investigational treatments, or both. We compared RoB and RQ between the different types of included primary studies. We also assessed the "informative value" of each of the separate elements for meta-researchers, based on the notion that variation in reporting may be more interesting for the meta-researcher than consistently high/low or reported/non-reported scores. In general, reporting of experimental details was low. This resulted in frequent unclear risk-of-bias scores. We observed this both for animal and for human studies and both for disease-control comparisons and investigations of experimental treatments. Plots and explorative chi-square tests showed that reporting was slightly better for human studies of investigational treatments than for the other study types. With the evidence reported as is, risk-of-bias assessments for systematic reviews have low informative value other than repeatedly showing that reporting of experimental details needs to improve in all kinds of in vivo research. Particularly for reviews that do not directly inform treatment decisions, it could be efficient to perform a thorough but partial assessment of the quality of the included studies, either of a random subset of the included publications or of a subset of relatively informative elements, comprising, e.g., ethics evaluation, conflicts of interest statements, study limitations, baseline characteristics, and the unit of analysis. This publication suggests several potential procedures.
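Editor's note: a minimal sketch of the kind of "explorative chi-square test" mentioned above, comparing how often an item is reported across study types. The counts are invented for illustration; the actual frequencies are in the paper.

```python
# Chi-square test of independence between study type and whether an item
# (e.g., a conflict-of-interest statement) was reported. Toy counts only.
from scipy.stats import chi2_contingency

#                  reported  not reported
table = [[34, 16],   # human studies of investigational treatments
         [21, 29]]   # animal studies of investigational treatments

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```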
Affiliation(s)
- Cathalijn H C Leenaars
- Institute for Laboratory Animal Science, Hannover Medical School, Carl Neubergstrasse 1, 30625, Hannover, Germany.
- Frans R Stafleu
- Department of Animals in Science and Society, Utrecht University, Yalelaan 2, Utrecht, 3584 CM, the Netherlands
- Christine Häger
- Institute for Laboratory Animal Science, Hannover Medical School, Carl Neubergstrasse 1, 30625, Hannover, Germany
- André Bleich
- Institute for Laboratory Animal Science, Hannover Medical School, Carl Neubergstrasse 1, 30625, Hannover, Germany
5. Lai H, Ge L, Sun M, Pan B, Huang J, Hou L, Yang Q, Liu J, Liu J, Ye Z, Xia D, Zhao W, Wang X, Liu M, Talukdar JR, Tian J, Yang K, Estill J. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Netw Open 2024;7:e2412687. PMID: 38776081. PMCID: PMC11112444. DOI: 10.1001/jamanetworkopen.2024.12687.
Abstract
Importance Large language models (LLMs) may facilitate the labor-intensive process of systematic reviews. However, the exact methods and reliability remain uncertain. Objective To explore the feasibility and reliability of using LLMs to assess risk of bias (ROB) in randomized clinical trials (RCTs). Design, Setting, and Participants A survey study was conducted between August 10, 2023, and October 30, 2023. Thirty RCTs were selected from published systematic reviews. Main Outcomes and Measures A structured prompt was developed to guide ChatGPT (LLM 1) and Claude (LLM 2) in assessing the ROB in these RCTs using a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. Each RCT was assessed twice by both models, and the results were documented. The results were compared with an assessment by 3 experts, which was considered a criterion standard. Correct assessment rates, sensitivity, specificity, and F1 scores were calculated to reflect accuracy, both overall and for each domain of the Cochrane ROB tool; consistent assessment rates and Cohen κ were calculated to gauge consistency; and assessment time was calculated to measure efficiency. Performance between the 2 models was compared using risk differences. Results Both models demonstrated high correct assessment rates. LLM 1 reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and LLM 2 reached a significantly higher rate of 89.5% (95% CI, 87.0%-91.8%). The risk difference between the 2 models was 0.05 (95% CI, 0.01-0.09). In most domains, domain-specific correct rates were around 80% to 90%; however, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns). Domains 4 (missing outcome data), 5 (selective outcome reporting), and 6 had F1 scores below 0.50. The consistent rates between the 2 assessments were 84.0% for LLM 1 and 87.3% for LLM 2. LLM 1's κ exceeded 0.80 in 7 domains and LLM 2's in 8. The mean (SD) time needed for assessment was 77 (16) seconds for LLM 1 and 53 (12) seconds for LLM 2. Conclusions In this survey study of applying LLMs for ROB assessment, LLM 1 and LLM 2 demonstrated substantial accuracy and consistency in evaluating RCTs, suggesting their potential as supportive tools in systematic review processes.
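Editor's note: for concreteness, a minimal sketch of the accuracy and consistency metrics named above, computed for a single ROB domain with judgements binarized against the experts' criterion standard. The arrays are toy data, not the study's, and the binarization is an assumption made for illustration.

```python
# Accuracy metrics against the expert criterion standard, plus consistency
# metrics between a model's two assessment rounds. Toy data only.
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score

experts  = ["high", "low", "low", "high", "low", "low"]   # criterion standard
llm_run1 = ["high", "low", "high", "high", "low", "low"]
llm_run2 = ["high", "low", "low", "high", "low", "high"]

tn, fp, fn, tp = confusion_matrix(experts, llm_run1, labels=["low", "high"]).ravel()
correct_rate = (tp + tn) / len(experts)   # "correct assessment rate"
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = f1_score(experts, llm_run1, pos_label="high")

# Consistency across the model's two assessments of the same trials:
kappa = cohen_kappa_score(llm_run1, llm_run2)
consistent_rate = sum(a == b for a, b in zip(llm_run1, llm_run2)) / len(llm_run1)
print(correct_rate, sensitivity, specificity, f1, kappa, consistent_rate)
```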
Affiliation(s)
- Honghao Lai
- Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China
- Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
- Long Ge
- Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China
- Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
- Key Laboratory of Evidence Based Medicine and Knowledge Translation of Gansu Province, Lanzhou, China
- Mingyao Sun
- Evidence-Based Nursing Center, School of Nursing, Lanzhou University, Lanzhou, China
- Bei Pan
- Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Jiajie Huang
- College of Nursing, Gansu University of Chinese Medicine, Lanzhou, China
- Liangying Hou
- Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Ontario, Canada
- Qiuyu Yang
- Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China
- Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
- Jiayi Liu
- Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China
- Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
- Jianing Liu
- College of Nursing, Gansu University of Chinese Medicine, Lanzhou, China
- Ziying Ye
- Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China
- Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
- Danni Xia
- Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China
- Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
- Weilong Zhao
- Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China
- Evidence-Based Social Science Research Center, School of Public Health, Lanzhou University, Lanzhou, China
- Xiaoman Wang
- Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Ming Liu
- Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Ontario, Canada
- Jhalok Ronjan Talukdar
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Ontario, Canada
- Jinhui Tian
- Key Laboratory of Evidence Based Medicine and Knowledge Translation of Gansu Province, Lanzhou, China
- Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Kehu Yang
- Key Laboratory of Evidence Based Medicine and Knowledge Translation of Gansu Province, Lanzhou, China
- Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Janne Estill
- Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, Lanzhou, China
- Institute of Global Health, University of Geneva, Geneva, Switzerland
6. Liu S, Bourgeois FT, Narang C, Dunn AG. A comparison of machine learning methods to find clinical trials for inclusion in new systematic reviews from their PROSPERO registrations prior to searching and screening. Res Synth Methods 2024;15:73-85. PMID: 37749068. PMCID: PMC10872991. DOI: 10.1002/jrsm.1672.
Abstract
Searching for trials is a key task in systematic reviews and a focus of automation. Previous approaches required knowing examples of relevant trials in advance, and most methods are focused on published trial articles. To complement existing tools, we compared methods for finding relevant trial registrations given an International Prospective Register of Systematic Reviews (PROSPERO) entry and where no relevant trials have been screened for inclusion in advance. We compared PICO extraction based on SciBERT (a variant of Bidirectional Encoder Representations from Transformers pretrained on scientific text), MetaMap, and term-based representations using an imperfect dataset mined from 3632 PROSPERO entries connected to a subset of 65,662 trial registrations and 65,834 trial articles known to be included in systematic reviews. Performance was measured by the median rank and recall by rank of trials that were eventually included in the published systematic reviews. When ranking trial registrations relative to PROSPERO entries, 296 trial registrations needed to be screened to identify half of the relevant trials, and the best-performing approach used a basic term-based representation. When ranking trial articles relative to PROSPERO entries, 162 trial articles needed to be screened to identify half of the relevant trials, and the best-performing approach used a term-based representation. The results show that MetaMap and term-based representations outperformed approaches that included PICO extraction for this use case. The results suggest that when starting with a PROSPERO entry and where no trials have been screened for inclusion, automated methods can reduce workload, but additional processes are still needed to efficiently identify trial registrations or trial articles that meet the inclusion criteria of a systematic review.
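Editor's note: a minimal sketch of the kind of basic term-based ranking the abstract identifies as best-performing, assuming TF-IDF vectors with cosine similarity; the paper's exact term weighting may differ, and the function and variable names below are illustrative.

```python
# Rank candidate trial registrations against a PROSPERO entry using a
# basic term-based representation: TF-IDF vectors + cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(prospero_text: str, candidates: list[str]) -> np.ndarray:
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([prospero_text] + candidates)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return np.argsort(-sims)  # candidate indices, best match first

# Workload evaluation as in the abstract: record the ranks at which the
# trials eventually included in the review appear, then take the median:
# order = rank_candidates(prospero_entry, registrations)
# ranks = sorted(np.where(np.isin(order, relevant_idx))[0] + 1)
# median_rank = ranks[len(ranks) // 2]
```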
Affiliation(s)
- Shifeng Liu
- Biomedical Informatics and Digital Health, Faculty of Medicine and Health, The University of Sydney, Sydney, New South Wales, Australia
- Florence T Bourgeois
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA
- Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
- Claire Narang
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA
- Adam G Dunn
- Biomedical Informatics and Digital Health, Faculty of Medicine and Health, The University of Sydney, Sydney, New South Wales, Australia
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA
7. Šuster S, Baldwin T, Verspoor K. Analysis of predictive performance and reliability of classifiers for quality assessment of medical evidence revealed important variation by medical area. J Clin Epidemiol 2023;159:58-69. PMID: 37120028. DOI: 10.1016/j.jclinepi.2023.04.006.
Abstract
OBJECTIVES A major obstacle to the deployment of models for automated quality assessment is their reliability. We analyze their calibration and selective classification performance. STUDY DESIGN AND SETTING We examine two systems for assessing the quality of medical evidence, EvidenceGRADEr and RobotReviewer, both developed from the Cochrane Database of Systematic Reviews (CDSR), which measure the strength of bodies of evidence and the risk of bias (RoB) of individual studies, respectively. We report their calibration error and Brier scores, present their reliability diagrams, and analyze the risk-coverage trade-off in selective classification. RESULTS The models are reasonably well calibrated on most quality criteria (expected calibration error [ECE] 0.04-0.09 for EvidenceGRADEr, 0.03-0.10 for RobotReviewer). However, we discover that both calibration and predictive performance vary significantly by medical area. This has ramifications for the application of such models in practice, as average performance is a poor indicator of group-level performance (e.g., health and safety at work, allergy and intolerance, and public health see much worse performance than cancer, pain and anesthesia, and neurology). We explore the reasons behind this disparity. CONCLUSION Practitioners adopting automated quality assessment should expect large fluctuations in system reliability and predictive performance depending on the medical area. Prospective indicators of such behavior should be further researched.
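Editor's note: for reference, a minimal sketch of the two calibration metrics reported above, for binary probabilistic predictions. The use of 10 equal-width confidence bins for ECE is an assumption; the abstract does not state the paper's binning scheme.

```python
# Expected calibration error (equal-width bins) and Brier score.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """probs: predicted probability of the positive class; labels: 0/1."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            # |observed frequency - mean confidence|, weighted by bin size
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece

def brier_score(probs, labels):
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    return float(np.mean((probs - labels) ** 2))

# Well-calibrated toy predictions give low ECE and low Brier score:
# print(expected_calibration_error([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```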
Affiliation(s)
- Simon Šuster
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
- Timothy Baldwin
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia; School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
8. Oliveira Dos Santos Á, Sergio da Silva E, Machado Couto L, Valadares Labanca Reis G, Silva Belo V. The use of artificial intelligence for automating or semi-automating biomedical literature analyses: a scoping review. J Biomed Inform 2023;142:104389. PMID: 37187321. DOI: 10.1016/j.jbi.2023.104389.
Abstract
OBJECTIVE Evidence-based medicine (EBM) is a decision-making process based on the conscious and judicious use of the best available scientific evidence. However, the exponential increase in the amount of information currently available likely exceeds the capacity of human-only analysis. In this context, artificial intelligence (AI) and its branches such as machine learning (ML) can be used to facilitate human efforts in analyzing the literature to foster EBM. The present scoping review aimed to examine the use of AI in the automation of biomedical literature survey and analysis with a view to establishing the state-of-the-art and identifying knowledge gaps. MATERIALS AND METHODS Comprehensive searches of the main databases were performed for articles published up to June 2022 and studies were selected according to inclusion and exclusion criteria. Data were extracted from the included articles and the findings categorized. RESULTS The total number of records retrieved from the databases was 12,145, of which 273 were included in the review. Classification of the studies according to the use of AI in evaluating the biomedical literature revealed three main application groups, namely assembly of scientific evidence (n=127; 47%), mining the biomedical literature (n=112; 41%) and quality analysis (n=34; 12%). Most studies addressed the preparation of systematic reviews, while articles focusing on the development of guidelines and evidence synthesis were the least frequent. The biggest knowledge gap was identified within the quality analysis group, particularly regarding methods and tools that assess the strength of recommendation and consistency of evidence. CONCLUSION Our review shows that, despite significant progress in the automation of biomedical literature surveys and analyses in recent years, intense research is needed to fill knowledge gaps on more difficult aspects of ML, deep learning and natural language processing, and to consolidate the use of automation by end-users (biomedical researchers and healthcare professionals).
Affiliation(s)
- Eduardo Sergio da Silva
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
- Letícia Machado Couto
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
- Vinícius Silva Belo
- Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil.
9. Hartling L, Gates A. Friend or Foe? The Role of Robots in Systematic Reviews. Ann Intern Med 2022;175:1045-1046. PMID: 35635849. DOI: 10.7326/m22-1439.
Affiliation(s)
- Lisa Hartling
- Alberta Research Centre for Health Evidence, Department of Pediatrics, Faculty of Medicine & Dentistry, University of Alberta, Edmonton, Alberta, Canada
- Allison Gates
- Alberta Research Centre for Health Evidence, Department of Pediatrics, Faculty of Medicine & Dentistry, University of Alberta, Edmonton, Alberta, Canada