1
Liu H, Soroush A, Nestor JG, Park E, Idnay B, Fang Y, Pan J, Liao S, Bernard M, Peng Y, Weng C. Retrieval augmented scientific claim verification. JAMIA Open 2024; 7:ooae021. [PMID: 38455840 PMCID: PMC10919922 DOI: 10.1093/jamiaopen/ooae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 01/19/2024] [Accepted: 02/14/2024] [Indexed: 03/09/2024]
Abstract
Objective To automate scientific claim verification using PubMed abstracts. Materials and Methods We developed CliVER, an end-to-end scientific Claim VERification system that leverages retrieval-augmented techniques to automatically retrieve relevant clinical trial abstracts, extract pertinent sentences, and use the PICO framework to support or refute a scientific claim. We also created an ensemble of three state-of-the-art deep learning models to classify each rationale as support, refute, or neutral. We then constructed CoVERt, a new COVID VERification dataset comprising 15 PICO-encoded drug claims accompanied by 96 manually selected and labeled clinical trial abstracts that either support or refute each claim. We used CoVERt and SciFact (a public scientific claim verification dataset) to assess CliVER's performance in predicting labels. Finally, we compared CliVER to clinicians in the verification of 19 claims from 6 disease domains, using 189 648 PubMed abstracts extracted from January 2010 to October 2021. Results In the evaluation of label prediction accuracy on CoVERt, CliVER achieved an F1 score of 0.92, highlighting the efficacy of the retrieval-augmented models. The ensemble model outperformed each individual state-of-the-art model by an absolute increase of 3% to 11% in the F1 score. Moreover, when compared with four clinicians, CliVER achieved a precision of 79.0% for abstract retrieval, 67.4% for sentence selection, and 63.2% for label prediction. Conclusion CliVER demonstrates its early potential to automate scientific claim verification using retrieval-augmented strategies to harness the wealth of clinical trial abstracts in PubMed. Future studies are warranted to further test its clinical utility.
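The abstract above does not say how the three models' outputs are combined or how F1 is averaged, so the following is only a minimal sketch of one common choice: a majority vote over the three label predictions, with a per-label F1 computed on the result. The function names, the tie-break to "neutral", and the toy predictions are all assumptions for illustration, not CliVER's actual method or data.

```python
from collections import Counter

def majority_vote(predictions, fallback="neutral"):
    """Return the label most models agree on; use `fallback` on a tie (assumed rule)."""
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]

def f1_for_label(gold, predicted, label):
    """F1 score for one label, computed from true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions from three models on four claim-abstract pairs.
model_outputs = [
    ["support", "refute", "neutral", "support"],  # model A
    ["support", "support", "neutral", "refute"],  # model B
    ["refute", "refute", "neutral", "neutral"],   # model C
]
gold = ["support", "refute", "neutral", "support"]

# Combine the three votes for each pair, then score the "support" label.
ensemble = [majority_vote(votes) for votes in zip(*model_outputs)]
support_f1 = f1_for_label(gold, ensemble, "support")
print(ensemble, round(support_f1, 3))
```

In this toy run the fourth pair receives three different votes, so the assumed tie-break assigns "neutral"; a weighted or confidence-based vote would be an equally plausible design.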
Affiliation(s)
- Hao Liu
  - School of Computing, Montclair State University, Montclair, NJ 07043, United States
- Ali Soroush
  - Department of Medicine, Columbia University, New York, NY 10027, United States
- Jordan G Nestor
  - Department of Medicine, Columbia University, New York, NY 10027, United States
- Elizabeth Park
  - Department of Medicine, Columbia University, New York, NY 10027, United States
- Betina Idnay
  - Department of Biomedical Informatics, Columbia University, New York, NY 10027, United States
- Yilu Fang
  - Department of Biomedical Informatics, Columbia University, New York, NY 10027, United States
- Jane Pan
  - Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY 10027, United States
- Stan Liao
  - Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY 10027, United States
- Marguerite Bernard
  - Institute of Human Nutrition, Columbia University, New York, NY 10027, United States
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10027, United States

2
Newton AJH, Chartash D, Kleinstein SH, McDougal RA. A pipeline for the retrieval and extraction of domain-specific information with application to COVID-19 immune signatures. BMC Bioinformatics 2023; 24:292. [PMID: 37474900 PMCID: PMC10357743 DOI: 10.1186/s12859-023-05397-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 06/23/2023] [Indexed: 07/22/2023]
Abstract
BACKGROUND The accelerating pace of biomedical publication has made it impractical to manually and systematically identify papers containing specific information and extract this information. This is especially challenging when the information itself resides beyond titles or abstracts. For emerging science, with a limited set of known papers of interest and an incomplete information model, this is of pressing concern. A timely example in retrospect is the identification of immune signatures (coherent sets of biomarkers) driving differential SARS-CoV-2 infection outcomes. IMPLEMENTATION We built a classifier to identify papers containing domain-specific information from the document embeddings of the title and abstract. To train this classifier with limited data, we developed an iterative process leveraging pre-trained SPECTER document embeddings, SVM classifiers and web-enabled expert review to iteratively augment the training set. This training set was then used to create a classifier to identify papers containing domain-specific information. Finally, information was extracted from these papers through a semi-automated system that directly solicited the paper authors to respond via a web-based form. RESULTS We demonstrate a classifier that retrieves papers with human COVID-19 immune signatures with a positive predictive value of 86%. The type of immune signature (e.g., gene expression vs. other types of profiling) was also identified with a positive predictive value of 74%. Semi-automated queries to the corresponding authors of these publications requesting signature information achieved a 31% response rate. CONCLUSIONS Our results demonstrate the efficacy of using an SVM classifier with document embeddings of the title and abstract to retrieve papers with domain-specific information, even when that information is rarely present in the abstract. Targeted author engagement based on classifier predictions offers a promising pathway to build a semi-structured representation of such information. Through this approach, partially automated literature mining can help rapidly create semi-structured knowledge repositories for automatic analysis of emerging health threats.
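The results above are reported as positive predictive values. As a reminder of what that measures, here is a minimal sketch: of the papers a classifier flags, the fraction that expert review confirms. The function name and the paper identifiers are hypothetical, and the numbers are toy values, not the study's data.

```python
def positive_predictive_value(flagged_ids, confirmed_ids):
    """PPV = true positives / all items the classifier flagged as positive."""
    flagged = set(flagged_ids)
    if not flagged:
        return 0.0
    true_positives = len(flagged & set(confirmed_ids))
    return true_positives / len(flagged)

# Hypothetical toy data: seven papers flagged, six of them confirmed by reviewers.
flagged_by_classifier = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
confirmed_by_experts = ["p1", "p2", "p3", "p5", "p6", "p7", "p9"]

ppv = positive_predictive_value(flagged_by_classifier, confirmed_by_experts)
print(f"PPV = {ppv:.3f}")
```

PPV says nothing about papers the classifier missed ("p9" here), which is why it is usually read alongside recall.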
Affiliation(s)
- Adam J H Newton
  - Department of Physiology and Pharmacology, SUNY Downstate Health Sciences University, Brooklyn, NY, 11203, USA
  - Yale Center for Medical Informatics, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06511, USA
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
- David Chartash
  - Yale Center for Medical Informatics, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06511, USA
  - School of Medicine, University College Dublin - National University of Ireland, Dublin, Co. Dublin, Republic of Ireland
- Steven H Kleinstein
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Immunobiology, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06511, USA
- Robert A McDougal
  - Yale Center for Medical Informatics, Yale School of Medicine, Yale University, New Haven, CT, 06511, USA
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, 06511, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06511, USA

3
Khader A, Ensan F. Learning to rank query expansion terms for COVID-19 scholarly search. J Biomed Inform 2023; 142:104386. [PMID: 37178780 PMCID: PMC10174726 DOI: 10.1016/j.jbi.2023.104386] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/19/2022] [Revised: 04/19/2023] [Accepted: 05/05/2023] [Indexed: 05/15/2023]
Abstract
OBJECTIVE With the onset of the Coronavirus Disease 2019 (COVID-19) pandemic, there has been a surge in the number of publicly available biomedical information sources, which makes it increasingly challenging to retrieve text relevant to a topic of interest. In this paper, we propose a Contextual Query Expansion framework based on the clinical Domain knowledge (CQED) for formalizing an effective search over PubMed to retrieve COVID-19 scholarly articles relevant to a given information need. MATERIALS AND METHODS For training and evaluation, we use the widely adopted TREC-COVID benchmark. Given a query, the proposed framework utilizes a contextual and a domain-specific neural language model to generate a set of candidate query expansion terms that enrich the original query. Moreover, the framework includes a multi-head attention mechanism that is trained alongside a learning-to-rank model for re-ranking the list of generated expansion candidate terms. The original query and the top-ranked expansion terms are posed to the PubMed search engine to retrieve scholarly articles relevant to an information need. The framework, CQED, can have four different variations, depending upon the learning path adopted for training and re-ranking the candidate expansion terms. RESULTS The model drastically improves the search performance when compared to the original query. The performance improvement in comparison to the original query is 190.85% in terms of RECALL@1000 and 343.55% in terms of NDCG@1000. Additionally, the model outperforms all existing state-of-the-art baselines. In terms of P@10, the model that has been optimized based on Precision outperforms all baselines (0.7987). On the other hand, in terms of NDCG@10 (0.7986), MAP (0.3450) and bpref (0.4900), the CQED model that has been optimized based on an average of all retrieval measures outperforms all the baselines.
CONCLUSION The proposed model successfully expands queries posed to PubMed and improves search performance compared to all existing baselines. A success/failure analysis shows that the model improved the search performance of each of the evaluated queries. Moreover, an ablation study showed that if ranking of generated candidate terms is not conducted, the overall performance decreases. For future work, we would like to explore the application of the presented query expansion framework in conducting technology-assisted Systematic Literature Reviews (SLR).
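The results above are stated in terms of P@k, Recall@k and NDCG@k, plus a relative improvement over the original query. The sketch below shows how these measures are commonly computed for a single query. The document identifiers and relevance grades are made up, and the linear-gain form of DCG is an assumption (some evaluations use an exponential gain instead); this is not the TREC-COVID evaluation code.

```python
import math

def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def recall_at_k(ranked_ids, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked_ids[:k] if d in relevant) / len(relevant)

def ndcg_at_k(ranked_ids, gains, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (linear gain assumed)."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def relative_improvement(new, old):
    """Percentage change of `new` over `old`, e.g. expanded vs. original query."""
    return 100.0 * (new - old) / old

# Hypothetical graded judgments (2 = highly relevant, 1 = partially relevant).
gains = {"d1": 2, "d2": 1, "d5": 1}
relevant = set(gains)
ranking = ["d3", "d1", "d7", "d2"]

p_at_2 = precision_at_k(ranking, relevant, 2)
r_at_4 = recall_at_k(ranking, relevant, 4)
ndcg_at_4 = ndcg_at_k(ranking, gains, 4)
print(p_at_2, round(r_at_4, 3), round(ndcg_at_4, 3))
```

Read this way, a relative improvement of 190.85% means the expanded query's score is close to three times the original query's score, not that it is 190.85 percentage points higher.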
Affiliation(s)
- Ayesha Khader
  - Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada
- Faezeh Ensan
  - Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada

4
Goto A, Rodriguez-Esteban R, Scharf SH, Morris GM. Understanding the genetics of viral drug resistance by integrating clinical data and mining of the scientific literature. Sci Rep 2022; 12:14476. [PMID: 36008431 PMCID: PMC9403226 DOI: 10.1038/s41598-022-17746-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2021] [Accepted: 07/30/2022] [Indexed: 11/16/2022]
Abstract
Drug resistance caused by mutations is a public health threat for existing and emerging viral diseases. A wealth of evidence about these mutations and their clinically associated phenotypes is scattered across the literature, but a comprehensive perspective is usually lacking. This work aimed to produce a clinically relevant view for the case of Hepatitis B virus (HBV) mutations by combining a chronic HBV clinical study with a compendium of genetic mutations systematically gathered from the scientific literature. We enriched clinical mutation data by systematically mining 2,472,725 scientific articles from PubMed Central in order to gather information about the HBV mutational landscape. By performing this analysis, we were able to identify mutational hotspots for each HBV genotype (A-E) and gene (C, X, P, S), as well as the location of disulfide bonds associated with these mutations. Through a modelling study, we also identified a mutation position common in both the clinical data and the literature that is located at the binding pocket for a known anti-HBV drug, namely entecavir. The results of this novel approach show the potential of integrated analyses to assist in the development of new drugs for viral diseases that are more robust to resistance. Such analyses should be of particular interest due to the increasing importance of viral resistance in established and emerging viruses, such as for newly developed drugs against SARS-CoV-2.
Affiliation(s)
- An Goto
  - Oxford Protein Informatics Group, Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, UK
- Garrett M Morris
  - Oxford Protein Informatics Group, Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, UK

5
GFCNet: Utilizing graph feature collection networks for coronavirus knowledge graph embeddings. Inf Sci (N Y) 2022; 608:1557-1571. [PMID: 35855405 PMCID: PMC9279179 DOI: 10.1016/j.ins.2022.07.031] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/05/2021] [Revised: 04/04/2022] [Accepted: 07/03/2022] [Indexed: 01/25/2023]
Abstract
In response to the COVID-19 pandemic, researchers in machine learning and artificial intelligence have constructed medical knowledge graphs (KGs) based on existing COVID-19 datasets. However, these KGs contain a considerable number of semantic relations that are incomplete or missing. In this paper, we focus on the task of knowledge graph embedding (KGE), which serves as an important solution for inferring the missing relations. Many knowledge graph embedding models with different scoring functions have been published to learn entity and relation embeddings. However, these models share the same problem: they rarely take important KG features other than relation triples, such as attribute features, into account when dealing with heterogeneous, complex and incomplete COVID-19 medical data. To address this issue, we propose a graph feature collection network (GFCNet) for the COVID-19 KGE task, which considers both neighbor and attribute features in KGs. Extensive experiments conducted on the COVID-19 drug KG dataset show promising results and demonstrate the effectiveness and efficiency of our proposed model. In addition, we discuss future directions for deepening the study of the COVID-19 KGE task.

6
Gu J, Xiang R, Wang X, Li J, Li W, Qian L, Zhou G, Huang CR. Multi-probe attention neural network for COVID-19 semantic indexing. BMC Bioinformatics 2022; 23:259. [PMID: 35768777 PMCID: PMC9241329 DOI: 10.1186/s12859-022-04803-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2021] [Accepted: 06/15/2022] [Indexed: 11/25/2022]
Abstract
BACKGROUND The COVID-19 pandemic has increasingly accelerated the publication pace of scientific literature. How to efficiently curate and index this large amount of biomedical literature under the current crisis is of great importance. Previous literature indexing is mainly performed by human experts using Medical Subject Headings (MeSH), which is labor-intensive and time-consuming. Therefore, to alleviate the expensive time consumption and monetary cost, there is an urgent need for automatic semantic indexing technologies for the emerging COVID-19 domain. RESULTS In this research, to investigate the semantic indexing problem for COVID-19, we first construct the new COVID-19 Semantic Indexing dataset, which consists of more than 80 thousand biomedical articles. We then propose a novel semantic indexing framework based on the multi-probe attention neural network (MPANN) to address the COVID-19 semantic indexing problem. Specifically, we employ a k-nearest neighbour based MeSH masking approach to generate candidate topic terms for each input article. We encode and feed the selected candidate terms as well as other contextual information as probes into the downstream attention-based neural network. Each semantic probe carries specific aspects of biomedical knowledge and provides informatively discriminative features for the input article. After extracting the semantic features at both term-level and document-level through the attention-based neural network, MPANN adopts a linear multi-view classifier to conduct the final topic prediction for COVID-19 semantic indexing. CONCLUSION The experimental results suggest that MPANN promises to represent the semantic features of biomedical texts and is effective in predicting semantic topics for COVID-19 related biomedical articles.
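The abstract above mentions a k-nearest-neighbour step that generates candidate topic terms for each input article but gives no details. One plausible reading, sketched below under that assumption, is to find the k most similar already-indexed articles and take the union of their MeSH terms as the candidate set, masking out every other term. The vectors, the cosine similarity measure, and the function names are illustrative and are not taken from MPANN.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def candidate_terms(query_vec, indexed_articles, k):
    """Union of the MeSH terms attached to the k articles nearest to the query."""
    neighbours = sorted(
        indexed_articles,
        key=lambda item: cosine(query_vec, item[0]),
        reverse=True,
    )[:k]
    candidates = set()
    for _, terms in neighbours:
        candidates |= terms
    return candidates

# Hypothetical toy index: (article embedding, MeSH terms assigned by indexers).
indexed_articles = [
    ([1.0, 0.0, 0.0], {"COVID-19", "Vaccines"}),
    ([0.9, 0.1, 0.0], {"COVID-19", "SARS-CoV-2"}),
    ([0.0, 0.0, 1.0], {"Neoplasms"}),
]
new_article_vec = [1.0, 0.05, 0.0]

candidates = candidate_terms(new_article_vec, indexed_articles, k=2)
print(sorted(candidates))
```

Restricting the classifier to such a candidate set shrinks the label space from the full MeSH vocabulary to a handful of terms per article, at the cost of missing any correct term that no neighbour carries.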
Affiliation(s)
- Jinghang Gu
  - Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
- Rong Xiang
  - Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
- Jing Li
  - Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
- Wenjie Li
  - Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
- Longhua Qian
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Guodong Zhou
  - School of Computer Science and Technology, Soochow University, Suzhou, China
- Chu-Ren Huang
  - Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China

7
Cafarella M, Anderson M, Beltagy I, Cattan A, Chasins S, Dagan I, Downey D, Etzioni O, Feldman S, Gao T, Hope T, Huang K, Johnson S, King D, Lo K, Lou Y, Shapiro M, Shen D, Subramanian S, Wang LL, Wang Y, Wang Y, Weld DS, Vo‐Phamhi J, Zeng A, Zou J. Infrastructure for rapid open knowledge network development. AI MAG 2022. [DOI: 10.1002/aaai.12038] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Affiliation(s)
- Iz Beltagy
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Arie Cattan
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Ido Dagan
  - Bar‐Ilan University, Ramat Gan, Israel
- Doug Downey
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Oren Etzioni
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Sergey Feldman
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Tian Gao
  - University of Michigan, Ann Arbor, Michigan, USA
- Tom Hope
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Kexin Huang
  - University of Michigan, Ann Arbor, Michigan, USA
- Sophie Johnson
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Daniel King
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Kyle Lo
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Yuze Lou
  - University of Michigan, Ann Arbor, Michigan, USA
- Lucy Lu Wang
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Yuning Wang
  - University of Michigan, Ann Arbor, Michigan, USA
- Yitong Wang
  - University of Michigan, Ann Arbor, Michigan, USA
- Daniel S. Weld
  - Allen Institute for Artificial Intelligence, Seattle, Washington, USA
- Anna Zeng
  - MIT CSAIL, Cambridge, Massachusetts, USA
- Jiayun Zou
  - University of Michigan, Ann Arbor, Michigan, USA

8
Otegi A, San Vicente I, Saralegi X, Peñas A, Lozano B, Agirre E. Information retrieval and question answering: A case study on COVID-19 scientific literature. Knowl Based Syst 2022; 240:108072. [PMID: 35002094 PMCID: PMC8719365 DOI: 10.1016/j.knosys.2021.108072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/05/2021] [Revised: 12/21/2021] [Accepted: 12/24/2021] [Indexed: 11/04/2022]
Abstract
Biosanitary experts around the world are directing their efforts towards the study of COVID-19. This effort generates a large volume of scientific publications at a speed that makes the effective acquisition of new knowledge difficult. Therefore, Information Systems are needed to assist biosanitary experts in accessing, consulting and analyzing these publications. In this work we study the variables involved in the development of a Question Answering system that receives a set of questions asked by experts about the disease COVID-19 and its causal virus SARS-CoV-2, and provides a ranked list of expert-level answers to each question. In particular, we address the interrelation of the Information Retrieval and the Answer Extraction steps. We found that recall-oriented document retrieval, which leaves the scanning of whole documents to a neural answer extraction module that finds the best answer, is a better strategy than relying on precise passage retrieval before extracting the answer span.
Affiliation(s)
- Anselmo Peñas
  - NLP & IR Group, UNED, C/Juan del Rosal 16, 28040 Madrid, Spain
- Borja Lozano
  - NLP & IR Group, UNED, C/Juan del Rosal 16, 28040 Madrid, Spain

9
Nguyen V, Rybinski M, Karimi S, Xing Z. Search like an expert: Reducing expertise disparity using a hybrid neural index for COVID-19 queries. J Biomed Inform 2022; 127:104005. [PMID: 35144000 PMCID: PMC9759932 DOI: 10.1016/j.jbi.2022.104005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/20/2021] [Revised: 01/19/2022] [Accepted: 01/24/2022] [Indexed: 11/17/2022]
Abstract
Consumers from non-medical backgrounds often look for information regarding a specific medical information need; however, they are limited by their lack of medical knowledge and may not be able to find reputable resources. As a case study, we investigate reducing this knowledge barrier to allow consumers to achieve search effectiveness comparable to that of an expert, or a medical professional, for COVID-19 related questions. We introduce and evaluate a hybrid index model that allows a consumer to formulate queries using consumer language to find relevant answers to COVID-19 questions. Our aim is to reduce performance degradation between medical professional queries and those of a consumer. We use a universal sentence embedding model to project consumer queries into the same semantic space as professional queries. We then incorporate sentence embeddings into a search framework alongside an inverted index. Documents from this index are retrieved using a novel scoring function that considers sentence embeddings and BM25 scoring. We find that our framework alleviates the expertise disparity, which we validate using an additional set of crowdsourced-consumer-queries even in an unsupervised setting. We also propose an extension of our method, where the sentence encoder is optimised in a supervised setup. Our framework allows for a consumer to search using consumer queries to match the search performance with that of a professional.
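The abstract above describes a scoring function that considers both sentence embeddings and BM25 but does not give its form. The sketch below therefore assumes the simplest combination, a linear interpolation between a max-normalised BM25 score and an embedding cosine similarity; the weight `alpha`, the BM25 parameters, the documents, and the two-dimensional "embeddings" are all hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Okapi BM25 score of every document for the query (non-negative IDF variant)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_scores(bm25, cosines, alpha=0.5):
    """Assumed combination: alpha * normalised BM25 + (1 - alpha) * cosine."""
    top = max(bm25) or 1.0  # avoid dividing by zero when nothing matches lexically
    return [alpha * (s / top) + (1 - alpha) * c for s, c in zip(bm25, cosines)]

# Hypothetical consumer query and documents, with made-up 2-d embeddings.
query = "loss of smell covid".split()
docs = [
    "anosmia reported in covid patients".split(),
    "covid vaccine dosing schedule".split(),
    "loss of smell after viral infection".split(),
]
query_vec = [0.9, 0.1]
doc_vecs = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]

bm25 = bm25_scores(query, docs)
cosines = [cosine(query_vec, v) for v in doc_vecs]
hybrid = hybrid_scores(bm25, cosines)
print([round(s, 3) for s in hybrid])
```

In this toy case BM25 alone slightly prefers the vaccine document over the anosmia document, because both share only the word "covid" and the former is shorter; adding the embedding term reverses that order, which is the kind of vocabulary gap between consumer and professional language the abstract describes.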
Affiliation(s)
- Vincent Nguyen
  - The Australian National University, Canberra, Australia; CSIRO Data61, Sydney, NSW, Australia
- Zhenchang Xing
  - The Australian National University, Canberra, Australia

10
Napolitano F, Xu X, Gao X. Impact of computational approaches in the fight against COVID-19: an AI guided review of 17 000 studies. Brief Bioinform 2022; 23:bbab456. [PMID: 34788381 PMCID: PMC8689952 DOI: 10.1093/bib/bbab456] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 07/08/2021] [Revised: 09/08/2021] [Accepted: 10/07/2021] [Indexed: 12/15/2022]
Abstract
SARS-CoV-2 caused the first severe pandemic of the digital era. Computational approaches have been used ubiquitously in an attempt to cope with the resulting global health crisis in a timely and effective manner. To extensively assess this contribution, we collected, categorized and prioritized over 17 000 COVID-19-related research articles, including both peer-reviewed and preprint publications, that make relevant use of computational approaches. Using machine learning methods, we identified six broad application areas: Molecular Pharmacology and Biomarkers, Molecular Virology, Epidemiology, Healthcare, Clinical Medicine and Clinical Imaging. We then used our prioritization model as guidance through an extensive, systematic review of the most relevant studies. We believe that the remarkable contribution provided by computational applications during the ongoing pandemic motivates additional efforts toward their further development and adoption, with the aim of enhancing preparedness and critical response for current and future emergencies.
Affiliation(s)
- Francesco Napolitano
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Makkah, Saudi Arabia
- Xiaopeng Xu
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Makkah, Saudi Arabia
- Xin Gao
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Makkah, Saudi Arabia

11
Analyzing COVID-19 Medical Papers Using Artificial Intelligence: Insights for Researchers and Medical Professionals. BIG DATA AND COGNITIVE COMPUTING 2022. [DOI: 10.3390/bdcc6010004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023]
Abstract
Since the beginning of the COVID-19 pandemic almost two years ago, more than 700,000 scientific papers have been published on the subject. An individual researcher cannot possibly become acquainted with such a huge text corpus; therefore, help from artificial intelligence (AI) is needed. We propose an AI-based tool to help researchers navigate collections of medical papers in a meaningful way and extract knowledge from scientific COVID-19 papers. The main idea of our approach is to obtain as much semi-structured information from the text corpus as possible, using named entity recognition (NER) with a model called PubMedBERT and the Text Analytics for Health service, and then to store the data in a NoSQL database for fast further processing and insight generation. Additionally, the contexts in which the entities were used (neutral or negative) are determined. Applying NLP and text-based emotion detection (TBED) methods to the COVID-19 text corpus allows us to gain insights into important issues of diagnosis and treatment (such as changes in medical treatment over time, joint treatment strategies using several medications, and the connection between signs and symptoms of coronavirus).

12
Zerva C, Taylor S, Soto AJ, Nguyen NTH, Ananiadou S. A term-based and citation network-based search system for COVID-19. JAMIA Open 2021; 4:ooab104. [PMID: 34927002 PMCID: PMC8672931 DOI: 10.1093/jamiaopen/ooab104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2021] [Revised: 11/15/2021] [Accepted: 11/24/2021] [Indexed: 11/14/2022]
Abstract
The COVID-19 pandemic resulted in an unprecedented production of scientific literature spanning several fields. To facilitate navigation of the scientific literature related to various aspects of the pandemic, we developed an exploratory search system. The system is based on automatically identified technical terms, document citations, and their visualization, accelerating identification of relevant documents. It offers a multi-view interactive search and navigation interface, bringing together unsupervised approaches of term extraction and citation analysis. We conducted a user evaluation with domain experts, including epidemiologists, biochemists, medicinal chemists, and medicine students. In general, most users were satisfied with the relevance and speed of the search results. More interestingly, participants mostly agreed on the capacity of the system to enable exploration and discovery of the search space using the graph visualization and filters. The system is updated on a weekly basis and it is publicly available at http://www.nactem.ac.uk/cord/.
In this article, we present a search system and exploratory tool built on the documents of the COVID-19 Open Research Dataset, which is a large and open collection of scholarly articles related to COVID-19 (Coronavirus disease 2019), SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus-2), and related coronaviruses. The search system aims to facilitate navigation of the scientific literature related to various aspects of the pandemic. Specifically, we identify 3 types of core information per paper to be used as navigation facets including technical terminologies, citation/reference links from 1 paper to others, and bibliometric data. Unlike other exploratory-based search engines, our system allows users to combine information from text mining and bibliometrics analysis to explore the data in a more versatile manner tailored to their needs. The system is automatically updated on a weekly basis to ensure timely and updated access to recent information. We also conducted a user evaluation that included epidemiologists, biochemists, medicinal chemists, and medicine students. In general, most users were satisfied with the relevance and speed of the search results. More interestingly, participants mostly agreed on the capacity of the system to enable exploration and discovery of the search space using the graph visualization and filters.
Collapse
Affiliation(s)
- Chrysoula Zerva
- Department of Computer Science, National Centre for Text Mining, Manchester Interdisciplinary Biocentre, The University of Manchester, Manchester, UK. Chrysoula Zerva's affiliation at the time of submission/publication is Instituto de Telecomunicações (IT), Lisbon, Portugal. All work was carried out while the author was employed at the University of Manchester, UK.
- Samuel Taylor
- Department of Computer Science, National Centre for Text Mining, Manchester Interdisciplinary Biocentre, The University of Manchester, Manchester, UK
- Axel J Soto
- Department of Computer Science and Engineering, Universidad Nacional del Sur & Institute for Computer Science and Engineering (ICIC, UNS-CONICET), Bahia Blanca, Argentina
- Nhung T H Nguyen
- Department of Computer Science, National Centre for Text Mining, Manchester Interdisciplinary Biocentre, The University of Manchester, Manchester, UK
- Sophia Ananiadou
- Department of Computer Science, National Centre for Text Mining, Manchester Interdisciplinary Biocentre, The University of Manchester, Manchester, UK; The Alan Turing Institute, London, UK
|
13
|
Xia Y, Cai J, Li Y, Dou Z, Zhang Y, Wu L, Huang Z, Xu S, Sun J, Liu Y, Wu D, Han D. A precision‐preferred comprehensive information extraction system for clinical articles in traditional Chinese Medicine. INT J INTELL SYST 2021. [DOI: 10.1002/int.22748] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Affiliation(s)
- Ye Xia
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Jianxiong Cai
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, Guangzhou, China
- The Second Affiliated Hospital of Guangzhou University of Chinese Medicine (Guangdong Provincial Hospital of Chinese Medicine), Guangzhou, China
- Yizhen Li
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Zhili Dou
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Yunan Zhang
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Lin Wu
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Zhe Huang
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Shujing Xu
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Jiayi Sun
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
- Yixing Liu
- School of Management, Beijing University of Chinese Medicine, Beijing, China
- Darong Wu
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, Guangzhou, China
- The Second Affiliated Hospital of Guangzhou University of Chinese Medicine (Guangdong Provincial Hospital of Chinese Medicine), Guangzhou, China
- Guangdong Provincial Key Laboratory of Clinical Research on Traditional Chinese Medicine Syndrome, Guangzhou, China
- Dongran Han
- School of Life and Science, Beijing University of Chinese Medicine, Beijing, China
|
14
|
Roitero K, Soprano M, Portelli B, De Luise M, Spina D, Mea VD, Serra G, Mizzaro S, Demartini G. Can the crowd judge truthfulness? A longitudinal study on recent misinformation about COVID-19. PERSONAL AND UBIQUITOUS COMPUTING 2021; 27:59-89. [PMID: 34545278 PMCID: PMC8444165 DOI: 10.1007/s00779-021-01604-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/01/2021] [Accepted: 07/12/2021] [Indexed: 06/13/2023]
Abstract
Recently, the misinformation problem has been addressed with a crowdsourcing-based approach: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of non-experts is exploited. We study whether crowdsourcing is an effective and reliable method to assess truthfulness during a pandemic, targeting statements related to COVID-19, thus addressing (mis)information that is both related to a sensitive and personal issue and very recent as compared to when the judgment is done. In our experiments, crowd workers are asked to assess the truthfulness of statements, and to provide evidence for the assessments. Besides showing that the crowd is able to accurately judge the truthfulness of the statements, we report results on workers' behavior, agreement among workers, and the effects of aggregation functions, scale transformations, and workers' background and bias. We perform a longitudinal study by re-launching the task multiple times with both novice and experienced workers, deriving important insights on how the behavior and quality change over time. Our results show that workers are able to detect and objectively categorize online (mis)information related to COVID-19; both crowdsourced and expert judgments can be transformed and aggregated to improve quality; worker background and other signals (e.g., source of information, behavior) impact the quality of the data. The longitudinal study demonstrates that the time-span has a major effect on the quality of the judgments, for both novice and experienced workers. Finally, we provide an extensive failure analysis of the statements misjudged by the crowd-workers.
|
15
|
Singh I, Scarton C, Bontcheva K. Multistage BiCross encoder for multilingual access to COVID-19 health information. PLoS One 2021; 16:e0256874. [PMID: 34492073 PMCID: PMC8423231 DOI: 10.1371/journal.pone.0256874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2021] [Accepted: 08/17/2021] [Indexed: 11/18/2022] Open
Abstract
The Coronavirus (COVID-19) pandemic has led to a rapidly growing ‘infodemic’ of health information online. This has motivated the need for accurate semantic search and retrieval of reliable COVID-19 information across millions of documents, in multiple languages. To address this challenge, this paper proposes a novel high precision and high recall neural Multistage BiCross encoder approach. It is a sequential three-stage ranking pipeline which uses the Okapi BM25 retrieval algorithm and transformer-based bi-encoder and cross-encoder to effectively rank the documents with respect to the given query. We present experimental results from our participation in the Multilingual Information Access (MLIA) shared task on COVID-19 multilingual semantic search. The independently evaluated MLIA results validate our approach and demonstrate that it outperforms other state-of-the-art approaches according to nearly all evaluation metrics in cases of both monolingual and bilingual runs.
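As a rough illustration of the first stage named in this abstract, the sketch below scores tokenized documents with the standard Okapi BM25 formula and keeps the top candidates that a later bi-encoder and cross-encoder would re-rank. The function names, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; the abstract does not state the authors' settings or implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common defaults, assumed here; they are not from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def first_stage(query, docs, top_k=2):
    """Return indices of the top_k documents by BM25 score (hypothetical helper)."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy corpus, invented for illustration only.
docs = [
    "covid vaccine trial results".split(),
    "influenza season forecast".split(),
    "covid symptoms and vaccine side effects".split(),
]
candidates = first_stage("covid vaccine".split(), docs)
```

In a pipeline of the kind described, only the documents in `candidates` would be passed to the more expensive transformer-based stages, which is the usual motivation for putting a cheap lexical ranker first.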
Affiliation(s)
- Iknoor Singh
- Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
- Carolina Scarton
- Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
- Kalina Bontcheva
- Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
|
16
|
Hassoun S, Jefferson F, Shi X, Stucky B, Wang J, Rosa E. Artificial Intelligence for Biology. Integr Comp Biol 2021; 61:2267-2275. [PMID: 34448841 DOI: 10.1093/icb/icab188] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2021] [Revised: 07/14/2021] [Accepted: 08/23/2021] [Indexed: 01/18/2023] Open
Abstract
Despite efforts to integrate research across different subdisciplines of biology, the scale of integration remains limited. We hypothesize that future generations of Artificial Intelligence (AI) technologies specifically adapted for biological sciences will help enable the reintegration of biology. AI technologies will allow us not only to collect, connect and analyze data at unprecedented scales, but also to build comprehensive predictive models that span various subdisciplines. They will make possible both targeted (testing specific hypotheses) and untargeted discoveries. AI for biology will be the cross-cutting technology that will enhance our ability to do biological research at every scale. We expect AI to revolutionize biology in the 21st century much like statistics transformed biology in the 20th century. The difficulties, however, are many, including data curation and assembly, development of new science in the form of theories that connect the subdisciplines, and new predictive and interpretable AI models that are more suited to biology than existing machine learning and AI techniques. Development efforts will require strong collaborations between biological and computational scientists. This white paper provides a vision for AI for Biology and highlights some challenges.
Affiliation(s)
- Soha Hassoun
- Department of Computer Science, Tufts University, Medford, MA 02155, USA
- Felicia Jefferson
- Biology Academic Department, Fort Valley State University, Fort Valley, GA 31030, USA
- Xinghua Shi
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
- Brian Stucky
- Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA
- Jin Wang
- Department of Mathematics, University of Tennessee at Chattanooga, Chattanooga, TN 37403, USA
- Epaminondas Rosa
- Department of Physics and School of Biological Sciences, Illinois State University, Normal, IL 61790, USA
|
17
|
Firouzi F, Farahani B, Daneshmand M, Grise K, Song J, Saracco R, Wang LL, Lo K, Angelov P, Soares E, Loh PS, Talebpour Z, Moradi R, Goodarzi M, Ashraf H, Talebpour M, Talebpour A, Romeo L, Das R, Heidari H, Pasquale D, Moody J, Woods C, Huang ES, Barnaghi P, Sarrafzadeh M, Li R, Beck KL, Isayev O, Sung N, Luo A. Harnessing the Power of Smart and Connected Health to Tackle COVID-19: IoT, AI, Robotics, and Blockchain for a Better World. IEEE INTERNET OF THINGS JOURNAL 2021; 8:12826-12846. [PMID: 35782886 PMCID: PMC8769005 DOI: 10.1109/jiot.2021.3073904] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/15/2020] [Revised: 03/09/2021] [Accepted: 04/02/2021] [Indexed: 05/07/2023]
Abstract
As COVID-19 hounds the world, the common cause of finding a swift solution to manage the pandemic has brought together researchers, institutions, governments, and society at large. The Internet of Things (IoT), artificial intelligence (AI), including machine learning (ML) and Big Data analytics, as well as Robotics and Blockchain, are the four decisive areas of technological innovation that have been ingeniously harnessed to fight this pandemic and future ones. While these highly interrelated smart and connected health technologies cannot resolve the pandemic overnight and may not be the only answer to the crisis, they can provide greater insight into the disease and support frontline efforts to prevent and control the pandemic. This article provides a blend of discussions on the contribution of these digital technologies, proposes several complementary and multidisciplinary techniques to combat COVID-19, offers opportunities for more holistic studies, and accelerates knowledge acquisition and scientific discoveries in pandemic research. First, four areas where IoT can contribute are discussed, namely: 1) tracking and tracing; 2) remote patient monitoring (RPM) by wearable IoT (WIoT); 3) personal digital twins (PDTs); and 4) real-life use case: ICT/IoT solution in South Korea. Second, the role and novel applications of AI are explained, namely: 1) diagnosis and prognosis; 2) risk prediction; 3) vaccine and drug development; 4) research data set; 5) early warnings and alerts; 6) social control and fake news detection; and 7) communication and chatbot. Third, the main uses of robotics and drone technology are analyzed, including: 1) crowd surveillance; 2) public announcements; 3) screening and diagnosis; and 4) essential supply delivery. Finally, we discuss how distributed ledger technologies (DLTs), of which blockchain is a common example, can be combined with other technologies for tackling COVID-19.
Affiliation(s)
- Farshad Firouzi
- Electrical and Computer Engineering Department, Duke University, Durham, NC 27708, USA
- Bahar Farahani
- Cyberspace Research Institute, Shahid Beheshti University, Tehran 1983969411, Iran
- Mahmoud Daneshmand
- Business Intelligence and Analytics, Stevens Institute of Technology, Hoboken, NJ 07030, USA
- Kathy Grise
- IEEE Future Directions, Piscataway, NJ 08854, USA
- Jaeseung Song
- Department of Computer and Information Security, Sejong University, Seoul 15600, South Korea
- Lucy Lu Wang
- Allen Institute for Artificial Intelligence, Seattle, WA 98112, USA
- Kyle Lo
- Allen Institute for Artificial Intelligence, Seattle, WA 98112, USA
- Plamen Angelov
- School of Computing and Communications, Lancaster University, Lancashire LA1 4YW, U.K.
- Eduardo Soares
- School of Computing and Communications, Lancaster University, Lancashire LA1 4YW, U.K.
- Po-Shen Loh
- Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Zeynab Talebpour
- Cyberspace Research Institute, Shahid Beheshti University, Tehran 1983969411, Iran
- Reza Moradi
- Cyberspace Research Institute, Shahid Beheshti University, Tehran 1983969411, Iran
- Mohsen Goodarzi
- Cyberspace Research Institute, Shahid Beheshti University, Tehran 1983969411, Iran
- Alireza Talebpour
- Cyberspace Research Institute, Shahid Beheshti University, Tehran 1983969411, Iran
- Luca Romeo
- Department of Information Engineering, Università Politecnica delle Marche, 60121 Ancona, Italy
- Rupam Das
- James Watt School of Engineering, University of Glasgow, Glasgow G12 8QQ, U.K.
- Hadi Heidari
- James Watt School of Engineering, University of Glasgow, Glasgow G12 8QQ, U.K.
- Dana Pasquale
- School of Medicine and Duke Health, Duke University, Durham, NC 27708, USA
- James Moody
- School of Medicine and Duke Health, Duke University, Durham, NC 27708, USA
- Chris Woods
- School of Medicine and Duke Health, Duke University, Durham, NC 27708, USA
- Erich S Huang
- School of Medicine and Duke Health, Duke University, Durham, NC 27708, USA
- Payam Barnaghi
- Department of Brain Sciences, Imperial College London, London SW7 2AZ, U.K.
- U.K. Dementia Research Institute, London, U.K.
- Majid Sarrafzadeh
- Computer Science Department & Electrical and Computer Engineering Department, University of California at Los Angeles, Los Angeles, CA 90095, USA
- Ron Li
- Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Olexandr Isayev
- Department of Chemistry, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Nakmyoung Sung
- Korea Electronics Technology Institute, Seongnam 13509, South Korea
- Alan Luo
- Computer Science Department, Stanford University, Stanford, CA 94305, USA
|
18
|
Teodoro D, Ferdowsi S, Borissov N, Kashani E, Vicente Alvarez D, Copara J, Gouareb R, Naderi N, Amini P. Information retrieval in an infodemic: the case of COVID-19 publications. J Med Internet Res 2021; 23:e30161. [PMID: 34375298 PMCID: PMC8451964 DOI: 10.2196/30161] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2021] [Revised: 07/22/2021] [Accepted: 08/05/2021] [Indexed: 12/31/2022] Open
Abstract
Background The COVID-19 global health crisis has led to an exponential surge in published scientific literature. In an attempt to tackle the pandemic, extremely large COVID-19–related corpora are being created, sometimes with inaccurate information, at a scale that human analysis can no longer handle. Objective In the context of searching for scientific evidence in the deluge of COVID-19–related literature, we present an information retrieval methodology for effective identification of relevant sources to answer biomedical queries posed using natural language. Methods Our multistage retrieval methodology combines probabilistic weighting models and reranking algorithms based on deep neural architectures to boost the ranking of relevant documents. The similarity between COVID-19 queries and documents is computed, and a series of postprocessing methods is applied to the initial ranking list to improve the match between the query and the biomedical information source and boost the position of relevant documents. Results The methodology was evaluated in the context of the TREC-COVID challenge, achieving competitive results with the top-ranking teams participating in the competition. Particularly, the combination of bag-of-words and deep neural language models significantly outperformed an Okapi Best Match 25–based baseline, retrieving on average, 83% of relevant documents in the top 20. Conclusions These results indicate that multistage retrieval supported by deep learning could enhance identification of literature for COVID-19–related questions posed using natural language.
Affiliation(s)
- Douglas Teodoro
- HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH; SIB Swiss Institute of Bioinformatics, Lausanne, CH
- Sohrab Ferdowsi
- HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
- Elham Kashani
- Institute of Pathology, University of Bern, Bern, CH
- David Vicente Alvarez
- HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
- Jenny Copara
- HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH; SIB Swiss Institute of Bioinformatics, Lausanne, CH; University of Geneva, Geneva, CH
- Racha Gouareb
- HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
- Nona Naderi
- HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH; SIB Swiss Institute of Bioinformatics, Lausanne, CH
- Poorya Amini
- Risklick AG, Bern, CH; Clinical Trials Unit Bern, Bern, CH
|
19
|
Chen Q, Leaman R, Allot A, Luo L, Wei CH, Yan S, Lu Z. Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu Rev Biomed Data Sci 2021; 4:313-339. [PMID: 34465169 DOI: 10.1146/annurev-biodatasci-021821-061045] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Alexis Allot
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Ling Luo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Shankai Yan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
20
|
Roberts K, Alam T, Bedrick S, Demner-Fushman D, Lo K, Soboroff I, Voorhees E, Wang LL, Hersh WR. Searching for scientific evidence in a pandemic: An overview of TREC-COVID. J Biomed Inform 2021; 121:103865. [PMID: 34245913 PMCID: PMC8264272 DOI: 10.1016/j.jbi.2021.103865] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2020] [Revised: 06/30/2021] [Accepted: 07/05/2021] [Indexed: 12/15/2022]
Abstract
We present an overview of the TREC-COVID Challenge, an information retrieval (IR) shared task to evaluate search on scientific literature related to COVID-19. The goals of TREC-COVID include the construction of a pandemic search test collection and the evaluation of IR methods for COVID-19. The challenge was conducted over five rounds from April to July 2020, with participation from 92 unique teams and 556 individual submissions. A total of 50 topics (sets of related queries) were used in the evaluation, starting at 30 topics for Round 1 and adding 5 new topics per round to target emerging topics at that state of the still-emerging pandemic. This paper provides a comprehensive overview of the structure and results of TREC-COVID. Specifically, the paper provides details on the background, task structure, topic structure, corpus, participation, pooling, assessment, judgments, results, top-performing systems, lessons learned, and benchmark datasets.
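The topic counts in this abstract follow a simple arithmetic progression: 30 topics in Round 1 plus 5 new topics in each later round, which reaches the stated total of 50 in Round 5. The short check below works through that arithmetic; the function name and its parameters are hypothetical and only restate the figures given in the abstract.

```python
def topics_in_round(round_number, initial=30, added_per_round=5):
    """Cumulative number of topics in use at a given round (1-indexed)."""
    if round_number < 1:
        raise ValueError("round_number must be at least 1")
    return initial + added_per_round * (round_number - 1)


# Cumulative topic counts for the five rounds described in the abstract.
counts = [topics_in_round(r) for r in range(1, 6)]
```

The final entry of `counts` corresponds to the 50 topics used in the overall evaluation.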
Affiliation(s)
- Kirk Roberts
- University of Texas Health Science Center at Houston, Houston, TX, USA
- Kyle Lo
- Allen Institute for AI, Seattle, WA, USA
- Ian Soboroff
- National Institute of Standards and Technology, Gaithersburg, MD, USA
- Ellen Voorhees
- National Institute of Standards and Technology, Gaithersburg, MD, USA
|
21
|
Developing a sampling method and preliminary taxonomy for classifying COVID-19 public health guidance for healthcare organizations and the general public. J Biomed Inform 2021; 120:103852. [PMID: 34192573 PMCID: PMC8236411 DOI: 10.1016/j.jbi.2021.103852] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2020] [Revised: 05/09/2021] [Accepted: 06/24/2021] [Indexed: 02/06/2023]
Abstract
BACKGROUND Development and dissemination of public health (PH) guidance to healthcare organizations and the general public (e.g., businesses, schools, individuals) during emergencies like the COVID-19 pandemic is vital for policy, clinical, and public decision-making. Yet, the rapidly evolving nature of these events poses significant challenges for guidance development and dissemination strategies predicated on well-understood concepts and clearly defined access and distribution pathways. Taxonomies are an important but underutilized tool for guidance authoring, dissemination and updating in such dynamic scenarios. OBJECTIVE To design a rapid, semi-automated method for sampling and developing a PH guidance taxonomy using widely available Web crawling tools and streamlined manual content analysis. METHODS Iterative samples of guidance documents were taken from four state PH agency websites, the US Centers for Disease Control and Prevention, and the World Health Organization. Documents were used to derive and refine a preliminary taxonomy of COVID-19 PH guidance via content analysis. RESULTS Eight iterations of guidance document sampling and taxonomy revisions were performed, with a final corpus of 226 documents. The preliminary taxonomy contains 110 branches distributed between three major domains: stakeholders (24 branches), settings (25 branches) and topics (61 branches). Thematic saturation measures indicated rapid saturation (≤5% change) for the domains of "stakeholders" and "settings", and "topic"-related branches for clinical decision-making. Branches related to business reopening and economic consequences remained dynamic throughout sampling iterations. CONCLUSION The PH guidance taxonomy can support public health agencies by aligning guidance development with curation and indexing strategies; supporting targeted dissemination; increasing the speed of updates; and enhancing public-facing guidance repositories and information retrieval tools. Taxonomies are essential to support knowledge management activities during rapidly evolving scenarios such as disease outbreaks and natural disasters.
|
22
|
Abstract
The SARS-CoV-2 pandemic has caused a surge in research exploring all aspects of the virus and its effects on human health. The overwhelming publication rate means that researchers are unable to keep abreast of the literature. To ameliorate this, we present the CoronaCentral resource that uses machine learning to process the research literature on SARS-CoV-2 together with SARS-CoV and MERS-CoV. We categorize the literature into useful topics and article types and enable analysis of the contents, pace, and emphasis of research during the crisis with integration of Altmetric data. These topics include therapeutics, disease forecasting, as well as growing areas such as “long COVID” and studies of inequality. This resource, available at https://coronacentral.ai, is updated daily.
|
23
|
COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization. NPJ Digit Med 2021; 4:68. [PMID: 33846532 PMCID: PMC8041998 DOI: 10.1038/s41746-021-00437-0] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2020] [Accepted: 03/08/2021] [Indexed: 11/09/2022] Open
Abstract
The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. Throughout 2020, over 400,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset. Here, we present CO-Search, a semantic, multi-stage, search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers and avoiding misinformation during a time of crisis. CO-Search is built from two sequential parts: a hybrid semantic-keyword retriever, which takes an input query and returns a sorted list of the 1000 most relevant documents, and a re-ranker, which further orders them by relevance. The retriever is composed of a deep learning model (Siamese-BERT) that encodes query-level meaning, along with two keyword-based models (BM25, TF-IDF) that emphasize the most important words of a query. The re-ranker assigns a relevance score to each document, computed from the outputs of (1) a question–answering module which gauges how much each document answers the query, and (2) an abstractive summarization module which determines how well a query matches a generated summary of the document. To account for the relatively limited dataset, we develop a text augmentation technique which splits the documents into pairs of paragraphs and the citations contained in them, creating millions of (citation title, paragraph) tuples for training the retriever. We evaluate our system (http://einstein.ai/covid) on the data of the TREC-COVID information retrieval challenge, obtaining strong performance across multiple key information retrieval metrics.
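The two-part design described in this abstract (a retriever that merges several scoring signals, followed by a re-ranker that combines two further scores) can be outlined structurally as below. This is only a sketch under stated assumptions: the scoring functions `overlap` and `brevity` are trivial stand-ins invented for illustration, not the Siamese-BERT, BM25, TF-IDF, question-answering, or summarization components of CO-Search, and the weighted-sum and plain-sum combinations are assumptions because the abstract does not say how the scores are merged.

```python
from typing import Callable, List, Sequence

# A scorer maps (query, document) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def retrieve(query: str, docs: Sequence[str], scorers: Sequence[Scorer],
             weights: Sequence[float], top_k: int) -> List[str]:
    """Stage 1: rank documents by a weighted sum of scorer outputs (assumed rule)."""
    scored = []
    for doc in docs:
        score = sum(w * s(query, doc) for s, w in zip(scorers, weights))
        scored.append((score, doc))
    # Stable sort on the score only, so ties keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def rerank(query: str, candidates: Sequence[str],
           qa_score: Scorer, summary_score: Scorer) -> List[str]:
    """Stage 2: reorder candidates by the sum of two module scores (assumed rule)."""
    return sorted(candidates,
                  key=lambda d: qa_score(query, d) + summary_score(query, d),
                  reverse=True)


def overlap(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def brevity(query: str, doc: str) -> float:
    """Stand-in scorer that favors shorter documents; ignores the query."""
    return 1.0 / (1 + len(doc.split()))


# Toy documents, invented for illustration only.
docs = [
    "covid transmission in schools and masks policy debate",
    "masks reduce transmission of covid",
    "diet and exercise",
]
query = "covid masks transmission"
candidates = retrieve(query, docs, scorers=[overlap], weights=[1.0], top_k=2)
reranked = rerank(query, candidates, qa_score=overlap, summary_score=brevity)
```

In this toy run the first stage keeps the two on-topic documents in input order, and the second stage swaps them because the stand-in `brevity` score prefers the shorter one; a real system would use a much larger `top_k` (the abstract mentions 1000) and learned scoring modules.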
|
24
|
Chen JS, Hersh WR. A comparative analysis of system features used in the TREC-COVID information retrieval challenge. J Biomed Inform 2021; 117:103745. [PMID: 33831536 PMCID: PMC8021447 DOI: 10.1016/j.jbi.2021.103745] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2020] [Revised: 12/02/2020] [Accepted: 03/05/2021] [Indexed: 11/18/2022]
Abstract
The COVID-19 pandemic has resulted in a rapidly growing quantity of scientific publications from journal articles, preprints, and other sources. The TREC-COVID Challenge was created to evaluate information retrieval (IR) methods and systems for this quickly expanding corpus. Using the COVID-19 Open Research Dataset (CORD-19), several dozen research teams participated in over 5 rounds of the TREC-COVID Challenge. While previous work has compared IR techniques used on other test collections, there are no studies that have analyzed the methods used by participants in the TREC-COVID Challenge. We manually reviewed team run reports from Rounds 2 and 5, extracted features from the documented methodologies, and used a univariate and multivariate regression-based analysis to identify features associated with higher retrieval performance. We observed that fine-tuning datasets with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5. Though the relatively decreased heterogeneity of runs in Round 5 may explain the lack of significance in that round, fine-tuning has been found to improve search performance in previous challenge evaluations by improving a system’s ability to map relevant queries and phrases to documents. Furthermore, term expansion was associated with improvement in system performance, and the use of the narrative field in the TREC-COVID topics was associated with decreased system performance in both rounds. These findings emphasize the need for clear queries in search. While our study has some limitations in its generalizability and scope of techniques analyzed, we identified some IR techniques that may be useful in building search systems for COVID-19 using the TREC-COVID test collections.
Affiliation(s)
- Jimmy S Chen
- School of Medicine, Oregon Health & Science University, Portland, OR, USA
- William R Hersh
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
|
25
|
Bakken S. Informatics impact requires effective, scalable tools and standards-based infrastructure. J Am Med Inform Assoc 2021; 27:1341-1342. [PMID: 32989458 DOI: 10.1093/jamia/ocaa187] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2020] [Accepted: 07/23/2020] [Indexed: 11/13/2022] Open
Affiliation(s)
- Suzanne Bakken
- School of Nursing, Department of Biomedical Informatics, and Data Science Institute, Columbia University, New York, New York, USA
|
26
|
Wang LL, Lo K. Text mining approaches for dealing with the rapidly expanding literature on COVID-19. Brief Bioinform 2021; 22:781-799. [PMID: 33279995 PMCID: PMC7799291 DOI: 10.1093/bib/bbaa296] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2020] [Revised: 10/02/2020] [Accepted: 10/07/2020] [Indexed: 12/13/2022] Open
Abstract
More than 50 000 papers have been published about COVID-19 since the beginning of 2020 and several hundred new papers continue to be published every day. This incredible rate of scientific productivity leads to information overload, making it difficult for researchers, clinicians and public health officials to keep up with the latest findings. Automated text mining techniques for searching, reading and summarizing papers are helpful for addressing information overload. In this review, we describe the many resources that have been introduced to support text mining applications over the COVID-19 literature; specifically, we discuss the corpora, modeling resources, systems and shared tasks that have been introduced for COVID-19. We compile a list of 39 systems that provide functionality such as search, discovery, visualization and summarization over the COVID-19 literature. For each system, we provide a qualitative description and assessment of the system's performance, unique data or user interface features and modeling decisions. Many systems focus on search and discovery, though several systems provide novel features, such as the ability to summarize findings over multiple documents or linking between scientific articles and clinical trials. We also describe the public corpora, models and shared tasks that have been introduced to help reduce repeated effort among community members; some of these resources (especially shared tasks) can provide a basis for comparing the performance of different systems. Finally, we summarize promising results and open challenges for text mining the COVID-19 literature.
Affiliation(s)
- Lucy Lu Wang
- The Allen Institute for Artificial Intelligence, Seattle, WA 98112, USA
- Kyle Lo
- The Allen Institute for Artificial Intelligence, Seattle, WA 98112, USA
|
27
|
Soni S, Roberts K. An evaluation of two commercial deep learning-based information retrieval systems for COVID-19 literature. J Am Med Inform Assoc 2021; 28:132-137. [PMID: 33197268 PMCID: PMC7717324 DOI: 10.1093/jamia/ocaa271] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 07/24/2020] [Indexed: 11/17/2022] Open
Abstract
The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, leading to both corpora for COVID-19 literature and search engines to query such data. While most search engine research is performed in academia with rigorous evaluation, major commercial companies dominate the web search market. Thus, it is expected that commercial pandemic-specific search engines will gain much higher traction than academic alternatives, leading to questions about the empirical performance of these tools. This paper seeks to empirically evaluate two commercial search engines for COVID-19 (Google and Amazon) in comparison with academic prototypes evaluated in the TREC-COVID task. We performed several steps to reduce bias in the manual judgments to ensure a fair comparison of all systems. We find the commercial search engines sizably underperformed those evaluated under TREC-COVID. This has implications for trust in popular health search engines and developing biomedical search engines for future health crises.
Affiliation(s)
- Sarvesh Soni
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
- Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
|
28
|
Tworowski D, Gorohovski A, Mukherjee S, Carmi G, Levy E, Detroja R, Mukherjee SB, Frenkel-Morgenstern M. COVID19 Drug Repository: text-mining the literature in search of putative COVID19 therapeutics. Nucleic Acids Res 2021; 49:D1113-D1121. [PMID: 33166390 PMCID: PMC7778969 DOI: 10.1093/nar/gkaa969] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 08/25/2020] [Revised: 10/07/2020] [Accepted: 11/04/2020] [Indexed: 12/12/2022] Open
Abstract
The recent outbreak of COVID-19 has generated an enormous amount of Big Data. To date, the COVID-19 Open Research Dataset (CORD-19) lists ∼130,000 articles from the WHO COVID-19 database, PubMed Central, medRxiv, and bioRxiv, as collected by Semantic Scholar. According to LitCovid (11 August 2020), ∼40,300 COVID19-related articles are currently listed in PubMed. It has been shown in clinical settings that the analysis of past research results and the mining of available data can provide novel opportunities for the successful application of currently approved therapeutics and their combinations for the treatment of conditions caused by a novel SARS-CoV-2 infection. As such, effective responses to the pandemic require the development of efficient applications, methods and algorithms for data navigation, text-mining, clustering, classification, analysis, and reasoning. Thus, our COVID19 Drug Repository represents a modular platform for drug data navigation and analysis, with an emphasis on COVID-19-related information currently being reported. The COVID19 Drug Repository enables users to focus on different levels of complexity, starting from general information about (FDA-) approved drugs, PubMed references, clinical trials, recipes as well as the descriptions of molecular mechanisms of drugs' action. Our COVID19 Drug Repository provides an up-to-date collection of drugs that have been repurposed for COVID19 treatment around the world.
Affiliation(s)
- Dmitry Tworowski
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Alessandro Gorohovski
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Sumit Mukherjee
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Gon Carmi
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Eliad Levy
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Rajesh Detroja
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Sunanda Biswas Mukherjee
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
- Milana Frenkel-Morgenstern
- Laboratory of Cancer Genomics and Biocomputing of Complex Diseases, Azrieli Faculty of Medicine, Bar-Ilan University, Henrietta Szold 8, Safed 13195, Israel
|
29
|
Lever J, Altman RB. Analyzing the vast coronavirus literature with CoronaCentral. bioRxiv: The Preprint Server for Biology 2020. [PMID: 33398279 PMCID: PMC7781314 DOI: 10.1101/2020.12.21.423860] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/25/2022]
Abstract
The global SARS-CoV-2 pandemic has caused a surge in research exploring all aspects of the virus and its effects on human health. The overwhelming rate of publications means that human researchers are unable to keep abreast of the research. To ameliorate this, we present the CoronaCentral resource, which uses machine learning to process the research literature on SARS-CoV-2 along with articles on SARS-CoV and MERS-CoV. We break the literature down into useful categories and enable analysis of the contents, pace, and emphasis of research during the crisis. These categories cover therapeutics and forecasting as well as growing areas such as “Long Covid” and studies of inequality and misinformation. Using this data, we compare the topics that appear in original research articles with those in commentaries and other article types. Finally, using Altmetric data, we identify the topics that have gained the most media attention. This resource, available at https://coronacentral.ai, is updated multiple times per day and provides an easy-to-navigate system to find papers in different categories, focussing on different aspects of the virus along with currently trending articles.
Affiliation(s)
- Jake Lever
- Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA 94305
- Russ B Altman
- Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA 94305
|
30
|
Rybinski M, Karimi S, Nguyen V, Paris C. A2A: a platform for research in biomedical literature search. BMC Bioinformatics 2020; 21:572. [PMID: 33349237 PMCID: PMC7751125 DOI: 10.1186/s12859-020-03894-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/17/2020] [Accepted: 11/18/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Finding relevant literature is crucial for many biomedical research activities and in the practice of evidence-based medicine. Search engines such as PubMed provide a means to search and retrieve published literature, given a query. However, they are limited in how users can control the processing of queries and articles (or, as we call them, documents) by the search engine. To give this control to both biomedical researchers and computer scientists working in biomedical information retrieval, we introduce a public online tool for searching over biomedical literature. Our setup is guided by the NIST setup of the relevant TREC evaluation tasks in genomics, clinical decision support, and precision medicine. RESULTS To provide benchmark results for some of the most common biomedical information retrieval strategies, such as querying MeSH subject headings with a specific weight or querying over the title of the articles only, we present our evaluations on public datasets. Our experiments report well-known information retrieval metrics such as precision at a cutoff of ranked documents. CONCLUSIONS We introduce the A2A search and benchmarking tool, which is publicly available for researchers who want to explore different search strategies over published biomedical literature. We outline several query formulation strategies and present their evaluations with known human judgements for a large pool of topics, from genomics to precision medicine.
Affiliation(s)
- Vincent Nguyen
- CSIRO Data61, Sydney, Australia
- Australian National University, Canberra, Australia
|
31
|
López Carreño R, Martínez Méndez FJ. Sistemas de recuperación de información implementados a partir de CORD-19: herramientas clave en la gestión de la información sobre COVID-19 [Information retrieval systems built on CORD-19: key tools for managing information about COVID-19]. Revista Española de Documentación Científica 2020. [DOI: 10.3989/redc.2020.4.1794] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022] Open
Abstract
Research on the coronavirus has generated an extraordinary output of scientific documents. Processing and assimilating them has required the scientific community to rely on specifically designed information retrieval systems. Some of the world's leading institutions in the fight against the pandemic developed the CORD-19 dataset, which stands out among other projects of a similar nature. The documents collected in this source have been processed by various information retrieval tools, some of them prototypes and others systems that were already deployed. The typology and main characteristics of these systems were analyzed, leading to the conclusion that they fall into three broad categories that are not mutually exclusive: terminological search, information visualization, and natural language processing. Strikingly, the great majority of them rely primarily on semantic search technologies in order to facilitate knowledge acquisition for researchers and to help them in their enormous task. Semantic search engines have taken advantage of the crisis caused by the pandemic to find their place.
|
32
|
Cabanac G, Frommholz I, Mayr P. Scholarly literature mining with information retrieval and natural language processing: Preface. Scientometrics 2020; 125:2835-2840. [PMID: 33223580 PMCID: PMC7670972 DOI: 10.1007/s11192-020-03763-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/09/2020] [Indexed: 11/25/2022]
Affiliation(s)
- Guillaume Cabanac
- Computer Science Department, IRIT UMR 5505 CNRS, University of Toulouse, 118 Route de Narbonne, 31062 Toulouse Cedex 9, France
- Philipp Mayr
- GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
|
33
|
Wang LL, Lo K, Chandrasekhar Y, Reas R, Yang J, Burdick D, Eide D, Funk K, Katsis Y, Kinney R, Li Y, Liu Z, Merrill W, Mooney P, Murdick D, Rishi D, Sheehan J, Shen Z, Stilson B, Wade AD, Wang K, Wang NXR, Wilhelm C, Xie B, Raymond D, Weld DS, Etzioni O, Kohlmeier S. CORD-19: The Covid-19 Open Research Dataset. arXiv 2020:arXiv:2004.10706v4. [PMID: 32510522 PMCID: PMC7251955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 07/10/2020] [Indexed: 06/11/2023]
Abstract
The Covid-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on Covid-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many Covid-19 text mining and discovery systems. In this article, we describe the mechanics of dataset construction, highlighting challenges and key design decisions, provide an overview of how CORD-19 has been used, and describe several shared tasks built around the dataset. We hope this resource will continue to bring together the computing community, biomedical experts, and policy makers in the search for effective treatments and management policies for Covid-19.
|