1
Witte C, Schmidt DM, Cimiano P. Comparing generative and extractive approaches to information extraction from abstracts describing randomized clinical trials. J Biomed Semantics 2024; 15:3. [PMID: 38654304 PMCID: PMC11036632 DOI: 10.1186/s13326-024-00305-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Accepted: 04/05/2024] [Indexed: 04/25/2024]
Abstract
BACKGROUND Systematic reviews of Randomized Controlled Trials (RCTs) are an important part of the evidence-based medicine paradigm. However, the creation of such systematic reviews by clinical experts is costly as well as time-consuming, and results can quickly become outdated after publication. Most RCTs are structured based on the Patient, Intervention, Comparison, Outcomes (PICO) framework, and many approaches aim to extract PICO elements automatically. The automatic extraction of PICO information from RCTs has the potential to significantly speed up the creation of systematic reviews and thereby benefit the field of evidence-based medicine. RESULTS Previous work has addressed the extraction of PICO elements as the task of identifying relevant text spans or sentences, but without populating a structured representation of a trial. In contrast, in this work, we treat PICO elements as structured templates with slots to do justice to the complex nature of the information they represent. We present two different approaches to extract this structured information from the abstracts of RCTs. The first approach is an extractive approach based on our previous work, extended to capture full document representations and by a clustering step that infers the number of instances of each template type. The second approach is a generative approach based on a seq2seq model that encodes the abstract describing the RCT and uses a decoder to infer a structured representation of a trial including its arms, treatments, endpoints and outcomes. Both approaches are evaluated with different base models on an existing, manually annotated dataset comprising 211 clinical trial abstracts for type 2 diabetes and glaucoma. For both diseases, the extractive approach (with flan-t5-base) reached the best F1 score, i.e. 0.547 (± 0.006) for type 2 diabetes and 0.636 (± 0.006) for glaucoma.
Generally, the F1 scores were higher for glaucoma than for type 2 diabetes, and the standard deviation was higher for the generative approach. CONCLUSION In our experiments, both approaches show promising performance extracting structured PICO information from RCTs, especially considering that most related work focuses on the far easier task of predicting less structured objects. In our experimental results, the extractive approach performs best in both cases, although the lead is greater for glaucoma than for type 2 diabetes. For future work, it remains to be investigated how the base model size affects the performance of both approaches in comparison. Although the extractive approach currently leaves more room for direct improvements, the generative approach might benefit from larger models.
Affiliation(s)
- Christian Witte
  - Semantic Computing Group, Center for Cognitive Interaction Technology, Bielefeld University, Inspiration 1, Bielefeld, 33619, NRW, Germany
- David M Schmidt
  - Semantic Computing Group, Center for Cognitive Interaction Technology, Bielefeld University, Inspiration 1, Bielefeld, 33619, NRW, Germany
- Philipp Cimiano
  - Semantic Computing Group, Center for Cognitive Interaction Technology, Bielefeld University, Inspiration 1, Bielefeld, 33619, NRW, Germany
2
Zhang G, Zhou Y, Hu Y, Xu H, Weng C, Peng Y. A span-based model for extracting overlapping PICO entities from randomized controlled trial publications. J Am Med Inform Assoc 2024; 31:1163-1171. [PMID: 38471120 PMCID: PMC11031223 DOI: 10.1093/jamia/ocae065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 02/20/2024] [Accepted: 03/11/2024] [Indexed: 03/14/2024]
Abstract
OBJECTIVES Extracting PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method, PICOX, to extract overlapping PICO entities. MATERIALS AND METHODS PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then, it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated using 1 of the best-performing baselines, EBM-NLP, and 3 more datasets, ie, PICO-Corpus and randomized controlled trial publications on Alzheimer's Disease (AD) or COVID-19, using entity-level precision, recall, and F1 scores. RESULTS PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (P ≪.01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline. CONCLUSION PICOX excels in identifying overlapping entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision.
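The entity-level micro precision, recall, and F1 scores reported in this abstract can be illustrated with a minimal sketch. It assumes exact-match scoring over `(doc_id, start, end, label)` tuples, so the same span may carry several PICO labels (overlapping entities). The tuple layout, the function name `micro_prf`, and the example data are illustrative assumptions, not taken from PICOX.

```python
from typing import Iterable, Tuple

# Hypothetical entity representation: (doc_id, start_token, end_token, label).
# Two entities on the same span with different labels count as two entities,
# which is how overlapping PICO annotations are modelled in this sketch.
Entity = Tuple[str, int, int, str]


def micro_prf(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Exact-match micro precision, recall, and F1 pooled over all labels."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: tokens 0-3 are both a Population and an Intervention.
gold = [("d1", 0, 3, "P"), ("d1", 0, 3, "I"), ("d1", 5, 7, "O")]
pred = [("d1", 0, 3, "P"), ("d1", 5, 7, "O"), ("d1", 8, 9, "O")]
precision, recall, f1 = micro_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Published evaluations may use relaxed or partial matching instead of exact match, in which case the intersection step would need to be replaced by an overlap criterion.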
Affiliation(s)
- Gongbo Zhang
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Yiliang Zhou
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
- Yan Hu
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States
- Hua Xu
  - Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
3
Nagel J, Wegener F, Grim C, Hoppe MW. Effects of Digital Physical Health Exercises on Musculoskeletal Diseases: Systematic Review With Best-Evidence Synthesis. JMIR Mhealth Uhealth 2024; 12:e50616. [PMID: 38261356 PMCID: PMC10848133 DOI: 10.2196/50616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Revised: 10/21/2023] [Accepted: 11/02/2023] [Indexed: 01/24/2024]
Abstract
BACKGROUND Musculoskeletal diseases affect 1.71 billion people worldwide, impose a high biopsychosocial burden on patients, and are associated with high economic costs. The use of digital health interventions is a promising cost-saving approach for the treatment of musculoskeletal diseases. As physical exercise is the best clinical practice in the treatment of musculoskeletal diseases, digital health interventions that provide physical exercises could have a highly positive impact on musculoskeletal diseases, but evidence is lacking. OBJECTIVE This systematic review aims to evaluate the impact of digital physical health exercises on patients with musculoskeletal diseases concerning the localization of the musculoskeletal disease, patient-reported outcomes, and medical treatment types. METHODS We performed systematic literature research using the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. The search was conducted using the PubMed, BISp, Cochrane Library, and Web of Science databases. The Scottish Intercollegiate Guidelines Network checklist was used to assess the quality of the included original studies. To determine the evidence and direction of the impact of digital physical health exercises, a best-evidence synthesis was conducted, whereby only studies with at least acceptable methodological quality were included for validity purposes. RESULTS A total of 8988 studies were screened, of which 30 (0.33%) randomized controlled trials met the inclusion criteria. Of these, 16 studies (53%) were of acceptable or high quality; they included 1840 patients (1008/1643, 61.35% female; 3 studies including 197 patients did not report gender distribution) with various musculoskeletal diseases. A total of 3 different intervention types (app-based interventions, internet-based exercises, and telerehabilitation) were used to deliver digital physical health exercises. 
Strong evidence was found for the positive impact of digital physical health exercises on musculoskeletal diseases located in the back. Moderate evidence was found for diseases located in the shoulder and hip, whereas evidence for the entire body was limited. Conflicting evidence was found for diseases located in the knee and hand. For patient-reported outcomes, strong evidence was found for impairment and quality of life. Conflicting evidence was found for pain and function. Regarding the medical treatment type, conflicting evidence was found for operative and conservative therapies. CONCLUSIONS Strong to moderate evidence was found for a positive impact on musculoskeletal diseases located in the back, shoulder, and hip and on the patient-reported outcomes of impairment and quality of life. Thus, digital physical health exercises could have a positive effect on a variety of symptoms of musculoskeletal diseases.
Affiliation(s)
- Johanna Nagel
  - Movement and Training Science, Leipzig University, Leipzig, Germany
- Florian Wegener
  - Movement and Training Science, Leipzig University, Leipzig, Germany
- Casper Grim
  - Center for Musculoskeletal Surgery Osnabrück, Klinikum Osnabrück, Osnabrück, Germany
4
Hu Y, Keloth VK, Raja K, Chen Y, Xu H. Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach. Bioinformatics 2023; 39:btad542. [PMID: 37669123 PMCID: PMC10500081 DOI: 10.1093/bioinformatics/btad542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2023] [Revised: 08/24/2023] [Accepted: 09/03/2023] [Indexed: 09/07/2023]
Abstract
MOTIVATION Automated extraction of participants, intervention, comparison/control, and outcome (PICO) from randomized controlled trial (RCT) abstracts is important for evidence synthesis. Previous studies have demonstrated the feasibility of applying natural language processing (NLP) for PICO extraction. However, the performance is not optimal due to the complexity of PICO information in RCT abstracts and the challenges involved in their annotation. RESULTS We propose a two-step NLP pipeline to extract PICO elements from RCT abstracts: (i) sentence classification using a prompt-based learning model and (ii) PICO extraction using a named entity recognition (NER) model. First, the sentences in abstracts were categorized into four sections, namely background, methods, results, and conclusions. Next, the NER model was applied to extract the PICO elements from the sentences within the title and methods sections, which include >96% of PICO information. We evaluated our proposed NLP pipeline on three datasets: the EBM-NLPmod dataset (a randomly selected and reannotated set of 500 RCT abstracts from the EBM-NLP corpus), a dataset of 150 COVID-19 RCT abstracts, and a dataset of 150 Alzheimer's disease (AD) RCT abstracts. The end-to-end evaluation reveals that our proposed approach achieved an overall micro F1 score of 0.833 on the EBM-NLPmod dataset, 0.928 on the COVID-19 dataset, and 0.899 on the AD dataset when measured at the token level, and an overall micro F1 score of 0.712 on the EBM-NLPmod dataset, 0.850 on the COVID-19 dataset, and 0.805 on the AD dataset when measured at the entity level. AVAILABILITY Our codes and datasets are publicly available at https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
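The two-step flow described in this abstract (classify sentences into sections, then run NER only on the title and methods sentences) can be sketched as follows. This is only a structural illustration: `extract_pico`, `toy_classify`, and `toy_ner` are hypothetical names, and the keyword rules stand in for the prompt-based classifier and NER model, which are not reproduced here.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]  # (surface text, PICO label); layout is an assumption


def extract_pico(
    title: str,
    sentences: List[str],
    classify_section: Callable[[str], str],
    run_ner: Callable[[str], List[Entity]],
) -> List[Entity]:
    """Step 1: keep the title plus sentences classified as 'methods'.
    Step 2: apply the NER callable to the selected text only."""
    selected = [title] + [s for s in sentences if classify_section(s) == "methods"]
    entities: List[Entity] = []
    for text in selected:
        entities.extend(run_ner(text))
    return entities


def toy_classify(sentence: str) -> str:
    """Keyword stand-in for a trained sentence classifier (hypothetical)."""
    lowered = sentence.lower()
    if "were assigned" in lowered or "randomized" in lowered:
        return "methods"
    if "improved" in lowered:
        return "results"
    return "background"


def toy_ner(text: str) -> List[Entity]:
    """Dictionary stand-in for a trained NER model (hypothetical)."""
    lowered = text.lower()
    found: List[Entity] = []
    if "metformin" in lowered:
        found.append(("metformin", "Intervention"))
    if "placebo" in lowered:
        found.append(("placebo", "Comparison"))
    return found


abstract_sentences = [
    "Diabetes is common.",
    "Adults were assigned to metformin or placebo.",
    "HbA1c improved with metformin.",
]
result = extract_pico("Metformin versus placebo in adults", abstract_sentences, toy_classify, toy_ner)
print(result)
```

Passing the classifier and NER model as callables keeps the section filter independent of any particular model, so either step could be swapped for a trained component.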
Affiliation(s)
- Yan Hu
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77054, United States
- Vipina K Keloth
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, 100 College St, New Haven, CT 06510, United States
- Kalpana Raja
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, 100 College St, New Haven, CT 06510, United States
- Yong Chen
  - Center for Health Analytics and Synthesis of Evidence (CHASE), Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, 423 Guardian Dr, Philadelphia, PA 19104, United States
  - Penn Medicine Center for Evidence-based Practice (CEP), University of Pennsylvania, 3600 Civic Center Blvd, Philadelphia, PA 19104, United States
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, 100 College St, New Haven, CT 06510, United States
5
Aletaha A, Nemati-Anaraki L, Keshtkar A, Sedghi S, Keramatfar A, Korolyova A. A Scoping Review of Adopted Information Extraction Methods for RCTs. Med J Islam Repub Iran 2023; 37:95. [PMID: 38021383 PMCID: PMC10657257 DOI: 10.47176/mjiri.37.95] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2022] [Indexed: 12/01/2023]
Abstract
Background Randomized controlled trials (RCTs) provide the strongest evidence for therapeutic interventions and their effects on groups of subjects. However, the large amount of unstructured information in these trials makes it challenging and time-consuming to make decisions and identify important concepts and valid evidence. This study aims to explore methods for automating or semi-automating information extraction from reports of RCT studies. Methods We conducted a systematic search of PubMed, ACM Digital Library, and Web of Science to identify relevant articles published between January 1, 2010, and 2022. We focused on published Natural Language Processing (NLP), machine learning, and deep learning methods that automate or semi-automate key elements of information extraction in the context of RCTs. Results A total of 26 publications were included, which discussed the automatic extraction of key characteristics of RCTs using various PICO frameworks (PIBOSO and PECODR). Among these publications, 14 (53.8%) extracted key characteristics based on PICO, PIBOSO, and PECODR, while 12 (46.1%) discussed information extraction methods in RCT studies. Common approaches mentioned included word/phrase matching, machine learning algorithms such as binary classification using the Naïve Bayes algorithm and powerful BERT network for feature extraction, support vector machine for data classification, conditional random field, non-machine-dependent automation, and machine learning or deep learning approaches. Conclusion The lack of publicly available software and limited access to existing software makes it difficult to determine the most powerful information extraction system. However, deep learning models like Transformers and BERT language models have shown better performance in natural language processing.
Affiliation(s)
- Azadeh Aletaha
  - Department of Medical Library and Information Science, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
  - Evidence-Based Medicine Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Leila Nemati-Anaraki
  - Department of Medical Library and Information Science, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
  - Health Management and Economics Research Center, Health Management Research Institute, Iran University of Medical Sciences, Tehran, Iran
- AbbasAli Keshtkar
  - Department of Health Science Educational Development, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
- Shahram Sedghi
  - Department of Medical Library and Information Science, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
  - Economics Research Center, Iran University of Medical Sciences, PO Box 14665-354, Tehran, Iran
- Anna Korolyova
  - Computer Science Laboratory for Mechanics and Engineering Sciences (LIMSI), CNRS, Université Paris-Saclay, F-91405 Orsay, France
  - School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW)
  - Fraser House, White Cross Business Park, Lancaster, LA1 4XQ
6
Newton AJH, Chartash D, Kleinstein SH, McDougal RA. A pipeline for the retrieval and extraction of domain-specific information with application to COVID-19 immune signatures. BMC Bioinformatics 2023; 24:292. [PMID: 37474900 PMCID: PMC10357743 DOI: 10.1186/s12859-023-05397-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 06/23/2023] [Indexed: 07/22/2023]
Abstract
BACKGROUND The accelerating pace of biomedical publication has made it impractical to manually, systematically identify papers containing specific information and extract this information. This is especially challenging when the information itself resides beyond titles or abstracts. For emerging science, with a limited set of known papers of interest and an incomplete information model, this is of pressing concern. A timely example in retrospect is the identification of immune signatures (coherent sets of biomarkers) driving differential SARS-CoV-2 infection outcomes. IMPLEMENTATION We built a classifier to identify papers containing domain-specific information from the document embeddings of the title and abstract. To train this classifier with limited data, we developed an iterative process leveraging pre-trained SPECTER document embeddings, SVM classifiers and web-enabled expert review to iteratively augment the training set. This training set was then used to create a classifier to identify papers containing domain-specific information. Finally, information was extracted from these papers through a semi-automated system that directly solicited the paper authors to respond via a web-based form. RESULTS We demonstrate a classifier that retrieves papers with human COVID-19 immune signatures with a positive predictive value of 86%. The type of immune signature (e.g., gene expression vs. other types of profiling) was also identified with a positive predictive value of 74%. Semi-automated queries to the corresponding authors of these publications requesting signature information achieved a 31% response rate. CONCLUSIONS Our results demonstrate the efficacy of using a SVM classifier with document embeddings of the title and abstract, to retrieve papers with domain-specific information, even when that information is rarely present in the abstract. 
Targeted author engagement based on classifier predictions offers a promising pathway to build a semi-structured representation of such information. Through this approach, partially automated literature mining can help rapidly create semi-structured knowledge repositories for automatic analysis of emerging health threats.
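The positive predictive values quoted in this abstract follow the standard definition PPV = TP / (TP + FP), i.e. the share of classifier-flagged papers that truly contain the target information. The sketch below shows that arithmetic; the counts are invented for illustration and are not the study's data.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV (precision): fraction of flagged items that are truly relevant."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when nothing was flagged")
    return true_positives / flagged


# Hypothetical counts: 50 papers flagged, 43 confirmed by expert review.
ppv = positive_predictive_value(true_positives=43, false_positives=7)
print(f"PPV = {ppv:.0%}")
```

PPV depends on how common relevant papers are in the screened pool, so a value measured on one corpus does not transfer directly to a corpus with a different prevalence.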
Affiliation(s)
- Adam J H Newton
  - Department of Physiology and Pharmacology, SUNY Downstate Health Sciences University, Brooklyn, NY, 11203, USA
  - Yale Center for Medical Informatics, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06511, USA
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
- David Chartash
  - Yale Center for Medical Informatics, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06511, USA
  - School of Medicine, University College Dublin - National University of Ireland, Dublin, Co. Dublin, Republic of Ireland
- Steven H Kleinstein
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Immunobiology, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06511, USA
- Robert A McDougal
  - Yale Center for Medical Informatics, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06511, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06511, USA
7
Kang T, Sun Y, Kim JH, Ta C, Perotte A, Schiffer K, Wu M, Zhao Y, Moustafa-Fahmy N, Peng Y, Weng C. EvidenceMap: a three-level knowledge representation for medical evidence computation and comprehension. J Am Med Inform Assoc 2023; 30:1022-1031. [PMID: 36921288 PMCID: PMC10198523 DOI: 10.1093/jamia/ocad036] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/16/2022] [Revised: 02/03/2023] [Accepted: 02/28/2023] [Indexed: 03/17/2023]
Abstract
OBJECTIVE To develop a computable representation for medical evidence and to contribute a gold standard dataset of annotated randomized controlled trial (RCT) abstracts, along with a natural language processing (NLP) pipeline for transforming free-text RCT evidence in PubMed into the structured representation. MATERIALS AND METHODS Our representation, EvidenceMap, consists of 3 levels of abstraction: Medical Evidence Entity, Proposition and Map, to represent the hierarchical structure of medical evidence composition. Randomly selected RCT abstracts were annotated following EvidenceMap based on the consensus of 2 independent annotators to train an NLP pipeline. Via a user study, we measured how the EvidenceMap improved evidence comprehension and analyzed its representative capacity by comparing the evidence annotation with EvidenceMap representation and without following any specific guidelines. RESULTS Two corpora including 229 disease-agnostic and 80 COVID-19 RCT abstracts were annotated, yielding 12 725 entities and 1602 propositions. EvidenceMap saves users 51.9% of the time compared to reading raw-text abstracts. Most evidence elements identified during the freeform annotation were successfully represented by EvidenceMap, and users gave the enrollment, study design, and study Results sections mean 5-scale Likert ratings of 4.85, 4.70, and 4.20, respectively. The end-to-end evaluations of the pipeline show that the evidence proposition formulation achieves F1 scores of 0.84 and 0.86 in the adjusted random index score. CONCLUSIONS EvidenceMap extends the participant, intervention, comparator, and outcome framework into 3 levels of abstraction for transforming free-text evidence from the clinical literature into a computable structure. It can be used as an interoperable format for better evidence retrieval and synthesis and an interpretable representation to efficiently comprehend RCT findings.
Affiliation(s)
- Tian Kang
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Yingcheng Sun
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Jae Hyun Kim
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Casey Ta
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Adler Perotte
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Kayla Schiffer
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Mutong Wu
  - Department of Statistics, Columbia University, New York, New York, USA
- Yang Zhao
  - Department of Statistics, Columbia University, New York, New York, USA
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
8
Yeun YR, Kim SD. Psychological Effects of Online-Based Mindfulness Programs during the COVID-19 Pandemic: A Systematic Review of Randomized Controlled Trials. International Journal of Environmental Research and Public Health 2022; 19:ijerph19031624. [PMID: 35162646 PMCID: PMC8835139 DOI: 10.3390/ijerph19031624] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 12/16/2021] [Revised: 01/29/2022] [Accepted: 01/29/2022] [Indexed: 12/17/2022]
Abstract
(1) Background: The COVID-19 outbreak has caused psychological problems worldwide. This review explored the psychological effects of online-based mindfulness programs during the COVID-19 pandemic. (2) Methods: This systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Randomized controlled trials that were published in the English language from 1 January 2020 to 31 May 2021 on online-based mindfulness programs for psychological problems due to the COVID-19 pandemic were searched in electronic databases. Quality assessment was conducted on the retrieved RCTs using the Cochrane risk of bias tool for RCTs. (3) Results: Six RCTs were included in this review. Quality appraisal of included RCTs ranged from 1 for low risk of bias to 5 for high risk of bias. There is evidence from the six RCTs that online-based mindfulness interventions may have favorable effects for reducing the levels of psychological problems, such as anxiety, depression, and stress. (4) Conclusions: Online-based mindfulness programs may be used as complementary interventions for clinical populations, healthy individuals, and healthcare workers with psychological problems due to the COVID-19 pandemic.
9
Constructing public health evidence knowledge graph for decision-making support from COVID-19 literature of modelling study. Journal of Safety Science and Resilience 2021. [PMCID: PMC8361008 DOI: 10.1016/j.jnlssr.2021.08.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 05/03/2023]
Abstract
The need to mitigate the COVID-19 epidemic prompts policymakers to make public health-related decisions guided by science. The tremendous volume of unstructured COVID-19 publications makes it challenging for policymakers to obtain relevant evidence. Knowledge graphs (KGs) can formalize unstructured knowledge into structured form and have recently been used to support decision-making. Here, we introduce a novel framework that can extract the COVID-19 public health evidence knowledge graph (CPHE-KG) from papers relating to modelling studies. We screen out a corpus of 3096 COVID-19 modelling study papers by performing a literature assessment process. We define a novel annotation schema to construct the COVID-19 modelling study-related IE dataset (CPHIE). We also propose a novel multi-task document-level information extraction model, SS-DYGIE++, based on the dataset. Leveraging the model on the new corpus, we construct CPHE-KG containing 60,967 entities and 51,140 relations. Finally, we seek to apply our KG to support evidence querying and evidence mapping visualization. Our SS-DYGIE++(SpanBERT) model achieved F1 scores of 0.77 and 0.55, respectively, in the document-level entity recognition and coreference resolution tasks. It has also shown high performance in the relation identification task. With evidence querying, our KG can present the dynamic transmission of the COVID-19 pandemic in different countries and regions. The evidence mapping of our KG can show the impacts of variable non-pharmacological interventions on the COVID-19 pandemic. Analysis demonstrates the quality of our KG and shows that it has the potential to support COVID-19 policy making in public health.
10
Schmidt L, Finnerty Mutlu AN, Elmore R, Olorisade BK, Thomas J, Higgins JPT. Data extraction methods for systematic review (semi)automation: Update of a living systematic review. F1000Res 2021; 10:401. [PMID: 34408850 PMCID: PMC8361807 DOI: 10.12688/f1000research.51117.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/27/2023] [Indexed: 10/12/2023]
Abstract
Background: The reliable and usable (semi)automation of data extraction can support the field of systematic review by reducing the workload required to gather information about the conduct and results of the included studies. This living systematic review examines published approaches for data extraction from reports of clinical studies. Methods: We systematically and continually search PubMed, ACL Anthology, arXiv, OpenAlex via EPPI-Reviewer, and the dblp computer science bibliography. Full text screening and data extraction are conducted within an open-source living systematic review application created for the purpose of this review. This living review update includes publications up to December 2022 and OpenAlex content up to March 2023. Results: 76 publications are included in this review. Of these, 64 (84%) of the publications addressed extraction of data from abstracts, while 19 (25%) used full texts. A total of 71 (93%) publications developed classifiers for randomised controlled trials. Over 30 entities were extracted, with PICOs (population, intervention, comparator, outcome) being the most frequently extracted. Data are available from 25 (33%), and code from 30 (39%) publications. Six (8%) implemented publicly available tools. Conclusions: This living systematic review presents an overview of (semi)automated data-extraction literature of interest to different types of literature review. We identified a broad evidence base of publications describing data extraction for interventional reviews and a small number of publications extracting epidemiological or diagnostic accuracy data. Between review updates, trends for sharing data and code increased strongly: in the base-review, data and code were available for 13 and 19% respectively; these numbers increased to 78 and 87% within the 23 new publications. 
Compared with the base-review, we observed another research trend, away from straightforward data extraction and towards additionally extracting relations between entities or automatic text summarisation. With this living review we aim to review the literature continually.
Affiliation(s)
- Lena Schmidt
  - NIHR Innovation Observatory, Newcastle University, Newcastle upon Tyne, NE4 5TG, UK
  - Sciome LLC, Research Triangle Park, North Carolina, 27713, USA
  - Bristol Medical School, University of Bristol, Bristol, BS8 2PS, UK
- Rebecca Elmore
  - Sciome LLC, Research Triangle Park, North Carolina, 27713, USA
- Babatunde K. Olorisade
  - Bristol Medical School, University of Bristol, Bristol, BS8 2PS, UK
  - Evaluate Ltd, London, SE1 2RE, UK
  - Cardiff School of Technologies, Cardiff Metropolitan University, Cardiff, CF5 2YB, UK
- James Thomas
  - UCL Social Research Institute, University College London, London, WC1H 0AL, UK
|
11
|
Schmidt L, Finnerty Mutlu AN, Elmore R, Olorisade BK, Thomas J, Higgins JPT. Data extraction methods for systematic review (semi)automation: A living systematic review. F1000Res 2021; 10:401. [PMID: 34408850 PMCID: PMC8361807 DOI: 10.12688/f1000research.51117.1] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Accepted: 05/04/2021] [Indexed: 01/07/2023]
Abstract
Background: The reliable and usable (semi)automation of data extraction can support the field of systematic review by reducing the workload required to gather information about the conduct and results of the included studies. This living systematic review examines published approaches for data extraction from reports of clinical studies. Methods: We systematically and continually search MEDLINE, Institute of Electrical and Electronics Engineers (IEEE), arXiv, and the dblp computer science bibliography databases. Full text screening and data extraction are conducted within an open-source living systematic review application created for the purpose of this review. This iteration of the living review includes publications up to a cut-off date of 22 April 2020. Results: In total, 53 publications are included in this version of our review. Of these, 41 (77%) of the publications addressed extraction of data from abstracts, while 14 (26%) used full texts. A total of 48 (90%) publications developed and evaluated classifiers that used randomised controlled trials as the main target texts. Over 30 entities were extracted, with PICOs (population, intervention, comparator, outcome) being the most frequently extracted. A description of their datasets was provided by 49 publications (94%), but only seven (13%) made the data publicly available. Code was made available by 10 (19%) publications, and five (9%) implemented publicly available tools. Conclusions: This living systematic review presents an overview of (semi)automated data-extraction literature of interest to different types of systematic review. We identified a broad evidence base of publications describing data extraction for interventional reviews and a small number of publications extracting epidemiological or diagnostic accuracy data. 
The lack of publicly available gold-standard data for evaluation, and lack of application thereof, makes it difficult to draw conclusions on which is the best-performing system for each data extraction target. With this living review we aim to review the literature continually.
Affiliation(s)
- Lena Schmidt
  - NIHR Innovation Observatory, Newcastle University, Newcastle upon Tyne, NE4 5TG, UK
  - Sciome LLC, Research Triangle Park, North Carolina, 27713, USA
  - Bristol Medical School, University of Bristol, Bristol, BS8 2PS, UK
- Rebecca Elmore
  - Sciome LLC, Research Triangle Park, North Carolina, 27713, USA
- Babatunde K. Olorisade
  - Bristol Medical School, University of Bristol, Bristol, BS8 2PS, UK
  - Evaluate Ltd, London, SE1 2RE, UK
  - Cardiff School of Technologies, Cardiff Metropolitan University, Cardiff, CF5 2YB, UK
- James Thomas
  - UCL Social Research Institute, University College London, London, WC1H 0AL, UK
|
12
|
Kang T, Perotte A, Tang Y, Ta C, Weng C. UMLS-based data augmentation for natural language processing of clinical research literature. J Am Med Inform Assoc 2021; 28:812-823. [PMID: 33367705 DOI: 10.1093/jamia/ocaa309] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 07/09/2020] [Accepted: 11/23/2020] [Indexed: 01/17/2023]
Abstract
OBJECTIVE The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity. MATERIALS AND METHODS We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT. RESULTS UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82). CONCLUSIONS This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
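The abstract above does not spell out which augmentation operations UMLS-EDA applies. As a rough illustration of the general idea only, the following Python sketch shows one EDA-style operation, synonym replacement, adapted to BIO-tagged NER data so that entity labels stay aligned after a mention is swapped. Everything here is an assumption for illustration: `TOY_SYNONYMS` is a hypothetical hand-written dictionary standing in for a UMLS lookup, and `find_spans` and `augment` are hypothetical helper names, not the authors' implementation.

```python
import random

# Hypothetical stand-in for a UMLS synonym lookup (assumption: the real
# resource would be queried per concept, not via a hard-coded dict).
TOY_SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}


def find_spans(labels):
    """Return (start, end, entity_type) for each BIO-tagged entity span."""
    spans = []
    start = None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:
                spans.append((start, i, labels[start][2:]))
            start = i
        elif lab.startswith("I-") and start is not None:
            continue  # still inside the current span
        else:
            if start is not None:
                spans.append((start, i, labels[start][2:]))
                start = None
    if start is not None:
        spans.append((start, len(labels), labels[start][2:]))
    return spans


def augment(tokens, labels, synonyms, rng):
    """Swap entity mentions for a synonym and regenerate their BIO labels."""
    out_tokens, out_labels = [], []
    cursor = 0
    for start, end, etype in find_spans(labels):
        # Copy the untouched tokens between the previous span and this one.
        out_tokens.extend(tokens[cursor:start])
        out_labels.extend(labels[cursor:start])
        mention = " ".join(tokens[start:end]).lower()
        candidates = synonyms.get(mention)
        new = rng.choice(candidates).split() if candidates else tokens[start:end]
        out_tokens.extend(new)
        # The replacement may differ in length, so rebuild B-/I- tags.
        out_labels.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(new) - 1))
        cursor = end
    out_tokens.extend(tokens[cursor:])
    out_labels.extend(labels[cursor:])
    return out_tokens, out_labels


tokens = ["Patients", "with", "heart", "attack", "received", "aspirin"]
labels = ["O", "O", "B-Condition", "I-Condition", "O", "B-Drug"]
new_tokens, new_labels = augment(tokens, labels, TOY_SYNONYMS, random.Random(0))
print(" ".join(new_tokens))
print(new_labels)
```

Replacing whole mentions and re-deriving the tags, instead of swapping single tokens, keeps the token and label sequences the same length even when a synonym has a different number of words than the original mention.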
Affiliation(s)
- Tian Kang
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Adler Perotte
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Youlan Tang
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Casey Ta
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
|