1. Focsa M, Tan C, Chen M, Yan M, Zhang N, Huang S, Liu X. State-of-the-Art Evidence Retriever for Precision Medicine: Algorithm Development and Validation. JMIR Med Inform 2022;10:e40743. PMID: 36409468; PMCID: PMC9801267; DOI: 10.2196/40743.
Abstract
BACKGROUND Under the paradigm of precision medicine (PM), patients with the same disease can receive different personalized therapies according to their clinical and genetic features. These therapies are determined by the totality of the available clinical evidence, including results from case reports, clinical trials, and systematic reviews. However, it is increasingly difficult for physicians to locate such evidence in the scientific literature, which is growing at an unprecedented pace. OBJECTIVE In this work, we propose the PM-Search system to facilitate the retrieval of clinical literature that contains critical evidence for or against giving specific therapies to particular patients with cancer. METHODS The PM-Search system combines a baseline retriever that selects document candidates at a large scale and an evidence reranker that finely reorders the candidates by their evidence quality. The baseline retriever uses query expansion and keyword matching with the Elasticsearch retrieval engine, and the evidence reranker fits pretrained language models to expert annotations derived from an active learning strategy. RESULTS The PM-Search system achieved the best performance in the retrieval of high-quality clinical evidence at the Text Retrieval Conference (TREC) PM Track 2020, outperforming the second-ranking systems by large margins (0.4780 vs 0.4238 for standard normalized discounted cumulative gain at rank 30 and 0.4519 vs 0.4193 for exponential normalized discounted cumulative gain at rank 30). CONCLUSIONS We present PM-Search, a state-of-the-art search engine to assist the practice of evidence-based PM. PM-Search uses a novel active learning strategy based on Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) that models evidence quality and improves model performance. Our analyses show that evidence quality is a distinct aspect from general relevance, and that modeling evidence quality beyond general relevance is required for a PM search engine.
Affiliation(s)
- Xiaozhong Liu
- Indiana University Bloomington, Bloomington, IN, United States
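To make the evaluation metric quoted in the PM-Search entry above concrete, the sketch below computes normalized discounted cumulative gain at a cutoff (here NDCG@30) from graded relevance judgments; the "exponential" variant replaces each gain g with 2^g - 1. The document ids and grades are invented for illustration and are not from the TREC PM data.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain over the top-k gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(ranked_doc_ids, judgments, k=30, exponential=False):
    """NDCG@k for one query.

    ranked_doc_ids: system ranking (best first).
    judgments: dict mapping doc id -> graded relevance (0, 1, 2, ...).
    exponential=True uses gain 2^rel - 1 instead of the raw grade.
    """
    transform = (lambda r: 2 ** r - 1) if exponential else (lambda r: r)
    gains = [transform(judgments.get(d, 0)) for d in ranked_doc_ids]
    ideal = sorted((transform(r) for r in judgments.values()), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example (hypothetical documents and grades).
judgments = {"d1": 2, "d2": 0, "d3": 1, "d4": 2}
print(ndcg(["d3", "d1", "d2", "d4"], judgments, k=30))
print(ndcg(["d3", "d1", "d2", "d4"], judgments, k=30, exponential=True))
```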
2. Medical social networks content mining for a semantic annotation. Social Network Analysis and Mining 2021. DOI: 10.1007/s13278-021-00848-7.
3. Simple but Effective Knowledge-Based Query Reformulations for Precision Medicine Retrieval. Information 2021. DOI: 10.3390/info12100402.
Abstract
In Information Retrieval (IR), the semantic gap denotes the mismatch between users' queries and how retrieval models answer those queries. In this paper, we explore how to use external knowledge resources to enhance bag-of-words representations and reduce the effect of the semantic gap between queries and documents. To this end, we propose several simple but effective knowledge-based query expansion and reduction techniques and evaluate them in the medical domain. The proposed query reformulations increase the probability of retrieving relevant documents by adding highly specific terms to, or removing them from, the original query. Experimental analyses on different test collections for Precision Medicine IR show the effectiveness of the developed techniques. In particular, a specific subset of query reformulations allows retrieval models to achieve top-performing results on all the considered test collections.
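As a rough illustration of the kind of knowledge-based query reformulation described above (not the authors' exact method), the sketch below expands a query with synonyms looked up in a small domain dictionary and reduces it by dropping overly generic terms; the dictionary and generic-term list are invented for the example.

```python
# Hypothetical knowledge resource: term -> highly specific synonyms.
SYNONYMS = {
    "melanoma": ["malignant melanoma", "cutaneous melanoma"],
    "braf": ["braf v600e"],
}
# Generic terms assumed to add little discriminative power.
GENERIC_TERMS = {"disease", "patient", "treatment", "therapy"}

def expand_query(query: str) -> str:
    """Append dictionary synonyms of each query term (knowledge-based expansion)."""
    terms = query.lower().split()
    expansions = [syn for t in terms for syn in SYNONYMS.get(t, [])]
    return " ".join(terms + expansions)

def reduce_query(query: str) -> str:
    """Drop highly generic terms from the query (knowledge-based reduction)."""
    kept = [t for t in query.lower().split() if t not in GENERIC_TERMS]
    return " ".join(kept)

print(expand_query("melanoma BRAF treatment"))
print(reduce_query("melanoma BRAF treatment"))
```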
4. Huang X, Zhang J, Xu Z, Ou L, Tong J. A knowledge graph based question answering method for medical domain. PeerJ Comput Sci 2021;7:e667. PMID: 34604514; PMCID: PMC8444078; DOI: 10.7717/peerj-cs.667.
Abstract
Question answering (QA) is an active research field in natural language processing. A major challenge is answering questions from knowledge-dependent domains such as disease diagnosis and drug recommendation, which traditional QA handles poorly. Recent work has therefore focused on knowledge-based question answering (KBQA), but existing KBQA approaches are limited by the range of historical cases available and require substantial human labor. To address these problems, this paper proposes a knowledge graph based question answering (KGQA) method for the medical domain. The method first constructs a medical knowledge graph by extracting named entities, and the relations between them, from medical documents. To understand a question, it extracts the key information in the question using the named entities and recognizes the question's intention with an information gain criterion. An inference method based on weighted path ranking over the knowledge graph then scores related entities according to the key information and intention of the question. Finally, the inferred candidate entities are assembled into answers. The approach can understand questions, connect them to the knowledge graph, and infer answers over the graph. Theoretical analysis and experiments on real-life data show the effectiveness of the approach.
Affiliation(s)
- Xiaofeng Huang
- School of Computer Science, Hubei University of Technology, Wuhan, Hubei, China
- Jixin Zhang
- School of Computer Science, Hubei University of Technology, Wuhan, Hubei, China
- Zisang Xu
- Computer and Communication Engineer Institute, Changsha University of Science and Technology, Changsha, Hunan, China
- Lu Ou
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Jianbin Tong
- Hunan Province Key Laboratory of Brain Homeostasis, Third Xiangya Hospital, Central South University, Changsha, Hunan, China
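The weighted path ranking idea in the entry above can be illustrated with a minimal sketch: candidate entities in a toy knowledge graph are scored by the strongest weighted path connecting them to the question's key entities. The graph, edge weights, and entities are invented for the example and do not reproduce the authors' system.

```python
import heapq

# Toy medical knowledge graph: node -> [(neighbor, edge_weight)], higher weight = stronger relation.
GRAPH = {
    "diabetes": [("metformin", 0.9), ("insulin", 0.8), ("hypertension", 0.3)],
    "metformin": [("diabetes", 0.9)],
    "insulin": [("diabetes", 0.8)],
    "hypertension": [("diabetes", 0.3), ("amlodipine", 0.7)],
    "amlodipine": [("hypertension", 0.7)],
}

def best_path_score(source: str, target: str) -> float:
    """Maximum product of edge weights over any path (Dijkstra-style search)."""
    best = {source: 1.0}
    heap = [(-1.0, source)]
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if node == target:
            return score
        for nbr, w in GRAPH.get(node, []):
            cand = score * w
            if cand > best.get(nbr, 0.0):
                best[nbr] = cand
                heapq.heappush(heap, (-cand, nbr))
    return 0.0

def rank_candidates(question_entities, candidates):
    """Score each candidate by its summed path strength to the question entities."""
    scored = [(sum(best_path_score(e, c) for e in question_entities), c) for c in candidates]
    return sorted(scored, reverse=True)

print(rank_candidates(["diabetes"], ["metformin", "amlodipine", "insulin"]))
```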
5. Chen JS, Hersh WR. A comparative analysis of system features used in the TREC-COVID information retrieval challenge. J Biomed Inform 2021;117:103745. PMID: 33831536; PMCID: PMC8021447; DOI: 10.1016/j.jbi.2021.103745.
Abstract
The COVID-19 pandemic has resulted in a rapidly growing quantity of scientific publications from journal articles, preprints, and other sources. The TREC-COVID Challenge was created to evaluate information retrieval (IR) methods and systems for this quickly expanding corpus. Using the COVID-19 Open Research Dataset (CORD-19), several dozen research teams participated in the 5 rounds of the TREC-COVID Challenge. While previous work has compared IR techniques used on other test collections, no studies have analyzed the methods used by participants in the TREC-COVID Challenge. We manually reviewed team run reports from Rounds 2 and 5, extracted features from the documented methodologies, and used a univariate and multivariate regression-based analysis to identify features associated with higher retrieval performance. We observed that fine-tuning with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5. Though the relatively decreased heterogeneity of runs in Round 5 may explain the lack of significance in that round, fine-tuning has been found to improve search performance in previous challenge evaluations by improving a system's ability to map relevant queries and phrases to documents. Furthermore, term expansion was associated with improvement in system performance, and the use of the narrative field in the TREC-COVID topics was associated with decreased system performance in both rounds. These findings emphasize the need for clear queries in search. While our study has some limitations in its generalizability and the scope of techniques analyzed, we identified some IR techniques that may be useful in building search systems for COVID-19 using the TREC-COVID test collections.
Affiliation(s)
- Jimmy S Chen
- School of Medicine, Oregon Health & Science University, Portland, OR, USA
- William R Hersh
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
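The entry above associates documented run features with retrieval performance through univariate and multivariate regression; the hedged sketch below shows one way such an analysis could look with scikit-learn, regressing per-run scores on binary feature indicators. The feature names, runs, and scores are fabricated for illustration and are not the study's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-run data: binary indicators of documented system features.
feature_names = ["fine_tuned", "term_expansion", "used_narrative_field"]
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
])
y = np.array([0.62, 0.55, 0.48, 0.41, 0.58, 0.40])  # made-up NDCG@10 per run

# Multivariate regression: all features modeled jointly.
multi = LinearRegression().fit(X, y)
for name, coef in zip(feature_names, multi.coef_):
    print(f"{name}: joint effect {coef:+.3f}")

# Univariate regressions: one feature at a time.
for j, name in enumerate(feature_names):
    uni = LinearRegression().fit(X[:, [j]], y)
    print(f"{name}: marginal effect {uni.coef_[0]:+.3f}")
```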
6. Automatically transforming full length biomedical articles into search queries for retrieving related articles. Egyptian Informatics Journal 2021. DOI: 10.1016/j.eij.2020.04.004.
7. Roberts K, Alam T, Bedrick S, Demner-Fushman D, Lo K, Soboroff I, Voorhees E, Wang LL, Hersh WR. TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. J Am Med Inform Assoc 2020;27:1431-1436. PMID: 32365190; PMCID: PMC7239098; DOI: 10.1093/jamia/ocaa091.
Abstract
TREC-COVID is an information retrieval (IR) shared task initiated to support clinicians and clinical research during the COVID-19 pandemic. IR for pandemics breaks many normal assumptions, which can be seen by examining 9 important basic IR research questions related to pandemic situations. TREC-COVID differs from traditional IR shared task evaluations with special considerations for the expected users, IR modality considerations, topic development, participant requirements, assessment process, relevance criteria, evaluation metrics, iteration process, projected timeline, and the implications of data use as a post-task test collection. This article describes how all these were addressed for the particular requirements of developing IR systems under a pandemic situation. Finally, initial participation numbers are also provided, which demonstrate the tremendous interest the IR community has in this effort.
Affiliation(s)
- Kirk Roberts
- University of Texas Health Science Center at Houston, Houston, Texas, USA
- Tasmeer Alam
- National Institute of Standards and Technology, Gaithersburg, Maryland, USA
- Steven Bedrick
- Oregon Health & Science University, Portland, Oregon, USA
- Kyle Lo
- Allen Institute for AI, Seattle, Washington, USA
- Ian Soboroff
- National Institute of Standards and Technology, Gaithersburg, Maryland, USA
- Ellen Voorhees
- National Institute of Standards and Technology, Gaithersburg, Maryland, USA
- Lucy Lu Wang
- Allen Institute for AI, Seattle, Washington, USA
8. Hassanzadeh H, Karimi S, Nguyen A. Matching patients to clinical trials using semantically enriched document representation. J Biomed Inform 2020;105:103406. DOI: 10.1016/j.jbi.2020.103406.
9. He J, Fu M, Tu M. Applying deep matching networks to Chinese medical question answering: a study and a dataset. BMC Med Inform Decis Mak 2019;19:52. PMID: 30961607; PMCID: PMC6454599; DOI: 10.1186/s12911-019-0761-8.
Abstract
BACKGROUND Medical and clinical question answering (QA) has recently attracted considerable research attention. Despite remarkable advances in this field, progress in the Chinese medical domain lags behind, which can be attributed to the difficulty of Chinese text processing and the lack of large-scale datasets. To bridge the gap, this paper introduces a Chinese medical QA dataset and proposes effective methods for the task. METHODS We first construct a large-scale Chinese medical QA dataset. We then leverage deep matching neural networks to capture the semantic interaction between words in questions and answers. Considering that Chinese word segmentation (CWS) tools may fail to identify clinical terms, we design a module that merges word segments and produces a new representation. It learns common compositions of words or segments using convolutional kernels and selects the strongest signals by windowed pooling. RESULTS We identify the best-performing popular CWS tool on our dataset. In our experiments, deep matching models substantially outperform existing methods. Results also show that the proposed semantic clustered representation module improves model performance by up to 5.5% in Precision at 1 and 4.9% in Mean Average Precision. CONCLUSIONS In this paper, we introduce a large-scale Chinese medical QA dataset and cast the task as a semantic matching problem. We also compare different CWS tools and input units. Of the two state-of-the-art deep matching neural networks, MatchPyramid performs better. Results also show the effectiveness of the proposed semantic clustered representation module.
Affiliation(s)
- Junqing He
- Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Mingming Fu
- Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Manshu Tu
- Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
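The evaluation metrics quoted in the entry above, Precision at 1 and Mean Average Precision, can be computed as in the short sketch below; the candidate rankings and relevance labels are invented for illustration.

```python
def precision_at_1(ranked_labels):
    """ranked_labels: relevance (0/1) of candidate answers in ranked order for one question."""
    return float(ranked_labels[0]) if ranked_labels else 0.0

def average_precision(ranked_labels):
    """Mean of the precision values at each rank where a relevant answer appears."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(all_ranked_labels):
    return sum(average_precision(r) for r in all_ranked_labels) / len(all_ranked_labels)

# Toy rankings for three questions (1 = relevant answer candidate).
rankings = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
print("P@1:", sum(precision_at_1(r) for r in rankings) / len(rankings))
print("MAP:", mean_average_precision(rankings))
```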
10. Palotti J, Zuccon G, Hanbury A. Consumer Health Search on the Web: Study of Web Page Understandability and Its Integration in Ranking Algorithms. J Med Internet Res 2019;21:e10986. PMID: 30698536; PMCID: PMC6372940; DOI: 10.2196/10986.
Abstract
BACKGROUND Understandability plays a key role in ensuring that people accessing health information are capable of gaining insights that can assist them with their health concerns and choices. Access to unclear or misleading information has been shown to negatively affect the health decisions of the general public. OBJECTIVE The aim of this study was to investigate methods for estimating the understandability of health Web pages and to use these to improve the retrieval of information for people seeking health advice on the Web. METHODS Our investigation considered methods to automatically estimate the understandability of health information in Web pages, and it provided a thorough evaluation of these methods using human assessments as well as an analysis of preprocessing factors affecting understandability estimations and associated pitfalls. Furthermore, the lessons learned for estimating Web page understandability were applied to the construction of retrieval methods, with specific attention to retrieving information understandable by the general public. RESULTS We found that machine learning techniques were more suitable for estimating health Web page understandability than traditional readability formulae, which are often used as guidelines and benchmarks by health information providers on the Web (on the Conference and Labs of the Evaluation Forum eHealth 2015 collection, a Pearson correlation of .602 using a gradient boosting regressor compared with .438 using the Simple Measure of Gobbledygook index). CONCLUSIONS The findings reported in this paper are important for specialized search services tailored to supporting the general public in seeking health advice on the Web, as they document and empirically validate state-of-the-art techniques and settings for this domain application.
Affiliation(s)
- Joao Palotti
- Qatar Computing Research Institute, Doha, Qatar
- Institute for Information Systems Engineering, Technische Universität Wien, Vienna, Austria
- Allan Hanbury
- Institute for Information Systems Engineering, Technische Universität Wien, Vienna, Austria
- Complexity Science Hub Vienna, Vienna, Austria
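The result in the entry above compares a learned regressor against a readability formula. As a rough, hedged sketch of the learned side only, the code below fits a scikit-learn gradient boosting regressor on simple surface features of a page's text and reports the Pearson correlation with fabricated human understandability scores; the features, pages, and scores are illustrative, not the study's setup.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor

def surface_features(text: str):
    """Very simple surface features: words per sentence, characters per word, word count."""
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return [
        len(words) / max(len(sentences), 1),
        sum(len(w) for w in words) / max(len(words), 1),
        len(words),
    ]

# Fabricated training pages and human understandability scores (higher = easier to read).
pages = [
    "Take one tablet daily. Drink water. Call your doctor if pain persists.",
    "Myocardial infarction results from prolonged ischemia secondary to coronary occlusion.",
    "Wash your hands often. Rest and drink fluids when you have a cold.",
    "Pharmacokinetic interactions necessitate hepatic enzyme monitoring during coadministration.",
]
scores = np.array([0.9, 0.3, 0.85, 0.25])

X = np.array([surface_features(p) for p in pages])
model = GradientBoostingRegressor(random_state=0).fit(X, scores)
predicted = model.predict(X)
print("Pearson r on the training pages:", pearsonr(predicted, scores)[0])
```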
11. A New Biomedical Passage Retrieval Framework for Laboratory Medicine: Leveraging Domain-specific Ontology, Multilevel PRF, and Negation Differential Weighting. Journal of Healthcare Engineering 2019;2018:3943417. PMID: 30675333; PMCID: PMC6323463; DOI: 10.1155/2018/3943417.
Abstract
Clinical decision support (CDS) search is performed to retrieve key medical literature that can assist the practice of medical experts by offering appropriate medical information relevant to the medical case at hand. In this paper, we present a novel CDS search framework designed for passage retrieval from biomedical textbooks in order to support clinical decision-making using laboratory test results. The framework exploits two unique characteristics of the textual reports derived from the test results: syntax variation and negation information. The proposed framework consists of three components: a domain ontology, an index repository, and a query processing engine. We first created a domain ontology to resolve syntax variation, applying the ontology to detect medical concepts in the test results with language translation. We then preprocessed and indexed biomedical textbooks recommended by clinicians for passage retrieval. We finally built the query-processing engine tailored for CDS, including translation, concept detection, query expansion, pseudo-relevance feedback at the local and global levels, and ranking with differential weighting of negation information. To evaluate the effectiveness of the proposed framework, we followed the standard information retrieval evaluation procedure. An evaluation dataset was created, including 28,581 textual reports for 30 laboratory test results and 56,228 passages from widely used biomedical textbooks recommended by clinicians. Overall, our proposed passage retrieval framework, GPRF-NEG, outperforms the baseline by 36.2, 100.5, and 69.7 percent for MRR, R-precision, and Precision at 5, respectively. Our study results indicate that the proposed CDS search framework, specifically designed for passage retrieval from biomedical literature, represents a practically viable choice for clinicians, as it supports their decision-making processes by providing relevant passages extracted from the sources they prefer to consult, with improved performance.
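Pseudo-relevance feedback (PRF), mentioned in the entry above, can be sketched with a Rocchio-style expansion: terms that are frequent in the top-ranked passages from a first retrieval pass are added to the query with reduced weight. This is a generic illustration, not the GPRF-NEG framework itself; the passages, threshold, and weights are invented.

```python
from collections import Counter

def rocchio_expand(query_terms, top_passages, n_expansion_terms=3, beta=0.5):
    """Add the most frequent terms from pseudo-relevant passages to the query.

    Returns term -> weight; original terms get weight 1.0, expansion terms get beta.
    """
    weights = {t: 1.0 for t in query_terms}
    counts = Counter(
        term
        for passage in top_passages
        for term in passage.lower().split()
        if term not in weights and len(term) > 3
    )
    for term, _ in counts.most_common(n_expansion_terms):
        weights[term] = beta
    return weights

# Toy example: a query and the top-ranked passages from a first retrieval pass.
query = ["elevated", "troponin"]
passages = [
    "elevated troponin suggests myocardial injury or infarction",
    "troponin elevation after infarction requires cardiology referral",
]
print(rocchio_expand(query, passages))
```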
12. Noh J, Kavuluru R. Document Retrieval for Biomedical Question Answering with Neural Sentence Matching. Proceedings of the International Conference on Machine Learning and Applications 2018;2018:194-201. PMID: 30714048; PMCID: PMC6353660; DOI: 10.1109/icmla.2018.00036.
Abstract
Document retrieval (DR) forms an important component in end-to-end question-answering (QA) systems where particular answers are sought for well-formed questions. DR in the QA scenario is also useful by itself, even without a more involved natural language processing component to extract exact answers from the retrieved documents. This latter step may simply be done by humans, as in traditional search engines, provided the retrieved documents contain the answer. In this paper, we take advantage of datasets made available through the BioASQ end-to-end QA shared task series and build an effective biomedical DR system that relies on relevant answer snippets in the BioASQ training datasets. At the core of our approach is a question-answer sentence matching neural network that learns a measure of relevance of a sentence to an input question in the form of a matching score. In addition to this matching score feature, we also exploit two auxiliary features for scoring document relevance: the name of the journal in which a document is published and the presence or absence of semantic relations (subject-predicate-object triples) in a candidate answer sentence connecting entities mentioned in the question. We rerank our baseline sequential dependence model scores using these three additional features, weighted via adaptive random search and other learning-to-rank methods. Our full system placed 2nd in the final batch of Phase A (DR) of task B (QA) in BioASQ 2018. Our ablation experiments highlight the significance of the neural matching network component in the full system.
Affiliation(s)
- Jiho Noh
- Department of Computer Science, University of Kentucky, Lexington, KY
- Ramakanth Kavuluru
- Div. of Biomedical Informatics (Internal Medicine), University of Kentucky, Lexington, KY
13. Goodwin TR, Harabagiu SM. Knowledge Representations and Inference Techniques for Medical Question Answering. ACM Transactions on Intelligent Systems and Technology 2018. DOI: 10.1145/3106745.
Abstract
Answering medical questions related to complex medical cases, as required in modern Clinical Decision Support (CDS) systems, requires (1) access to vast medical knowledge and (2) sophisticated inference techniques. In this article, we examine the representation and role of combining medical knowledge automatically derived from (a) clinical practice and (b) research findings for inferring answers to medical questions. Knowledge from medical practice was distilled from a vast Electronic Medical Record (EMR) system, while research knowledge was processed from biomedical articles available in PubMed Central. The knowledge automatically acquired from the EMR system took into account the clinical picture and therapy recognized from each medical record to generate a probabilistic Markov network denoted as a Clinical Picture and Therapy Graph (CPTG). Moreover, we represented the background of medical questions available from the description of each complex medical case as a medical knowledge sketch. We considered three possible representations of medical knowledge sketches that were used by four different probabilistic inference methods to pinpoint the answers from the CPTG. In addition, several answer-informed relevance models were developed to provide a ranked list of biomedical articles containing the answers. Evaluations on the TREC-CDS data show which of the medical knowledge representations and inference methods perform optimally. The experiments indicate a 49% improvement in biomedical article ranking over state-of-the-art results.
14. Cieslewicz A, Dutkiewicz J, Jedrzejek C. Baseline and extensions approach to information retrieval of complex medical data: Poznan's approach to the bioCADDIE 2016. Database (Oxford) 2018;2018:4930756. PMID: 29688372; PMCID: PMC5846287; DOI: 10.1093/database/bax103.
Abstract
Database URL https://biocaddie.org/benchmark-data.
Affiliation(s)
- Artur Cieslewicz
- Department of Clinical Pharmacology, Poznan University of Medical Sciences, Dluga 1/2 Str., 61-848 Poznan, Poland
- Jakub Dutkiewicz
- Institute of Control, Robotics and Information Engineering, Poznan University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland
- Czeslaw Jedrzejek
- Institute of Control, Robotics and Information Engineering, Poznan University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland
15. Moskovitch R, Wang F, Pei J, Friedman C. JASIST special issue on biomedical information retrieval. J Assoc Inf Sci Technol 2017. DOI: 10.1002/asi.23972.
Affiliation(s)
- Robert Moskovitch
- Software and Information Systems Engineering, Ben Gurion University of the Negev, Israel
- Fei Wang
- Healthcare Policy and Research, Cornell University, USA
- Jian Pei
- Computer Science, Simon Fraser University, Canada
17. Chinese Medical Question Answer Matching Using End-to-End Character-Level Multi-Scale CNNs. Applied Sciences (Basel) 2017. DOI: 10.3390/app7080767.
Abstract
This paper focuses mainly on the problem of Chinese medical question answer matching, which is arguably more challenging than open-domain question answer matching in English due to the combination of its domain-restricted nature and the language-specific features of Chinese. We present an end-to-end character-level multi-scale convolutional neural framework in which character embeddings instead of word embeddings are used to avoid Chinese word segmentation in text preprocessing, and multi-scale convolutional neural networks (CNNs) are then introduced to extract contextual information from either question or answer sentences over different scales. The proposed framework can be trained with minimal human supervision and does not require any handcrafted features, rule-based patterns, or external resources. To validate our framework, we create a new text corpus, named cMedQA, by harvesting questions and answers from an online Chinese health and wellness community. The experimental results on the cMedQA dataset show that our framework significantly outperforms several strong baselines, and achieves an improvement of top-1 accuracy by up to 19%.
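A minimal PyTorch sketch of the general idea behind the entry above, assuming PyTorch is available: characters are embedded, passed through parallel 1D convolutions with different kernel widths, max-pooled, and the resulting question and answer encodings are compared by cosine similarity. The vocabulary size, kernel widths, and other hyperparameters are illustrative assumptions, not the authors' exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleCharEncoder(nn.Module):
    """Encode a character-id sequence with parallel convolutions of several widths."""

    def __init__(self, vocab_size=5000, emb_dim=64, n_filters=32, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=k, padding=k - 1) for k in kernel_sizes
        )

    def forward(self, char_ids):                     # (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)     # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)              # (batch, n_filters * len(kernel_sizes))

encoder = MultiScaleCharEncoder()
question = torch.randint(1, 5000, (1, 30))           # fake character ids
answer = torch.randint(1, 5000, (1, 60))
score = F.cosine_similarity(encoder(question), encoder(answer))
print(score)
```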
18. Moreno I, Boldrini E, Moreda P, Romá-Ferri MT. DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics. J Biomed Inform 2017. PMID: 28624642; DOI: 10.1016/j.jbi.2017.06.013.
Abstract
For the healthcare sector, it is critical to exploit the vast amount of textual health-related information. Nevertheless, healthcare providers have difficulty benefiting from this quantity of data during pharmacotherapeutic care, because the information is stored in different sources and the time available to consult them is limited. In this context, Natural Language Processing techniques can be applied to efficiently transform textual data into structured information so that it can be used in critical healthcare applications that support physicians in their daily workload, such as decision support systems, cohort identification, and patient management. Developing these techniques requires annotated corpora, but such resources are scarce in this domain and, in most cases, the few available concern English. This paper presents the definition and creation of the DrugSemantics corpus, a collection of Summaries of Product Characteristics in Spanish. It was manually annotated with pharmacotherapeutic named entities, detailed in the DrugSemantics annotation scheme. The annotators were a Registered Nurse (RN) and two students from the Degree in Nursing. The quality of the DrugSemantics corpus has been assessed by measuring its annotation reliability (overall F=79.33% [95% CI: 78.35-80.31]) as well as its annotation precision (overall P=94.65% [95% CI: 94.11-95.19]). In addition, the gold-standard construction process is described in detail. In total, our corpus contains more than 2000 named entities, 780 sentences, and 226,729 tokens. Finally, a Named Entity Classification module trained on DrugSemantics is presented to demonstrate the quality of our corpus and to provide an example of how to use it.
Affiliation(s)
- Isabel Moreno
- Department of Software and Computing Systems, University of Alicante, Alicante, Spain
- Ester Boldrini
- Department of Software and Computing Systems, University of Alicante, Alicante, Spain
- Paloma Moreda
- Department of Software and Computing Systems, University of Alicante, Alicante, Spain
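The reliability figures in the entry above (F and precision between annotators) can be computed as in the sketch below, which treats one annotator's entity spans as the reference and the other's as predictions; the spans are invented for the example, and real studies typically also report confidence intervals.

```python
def span_prf(reference_spans, predicted_spans):
    """Exact-match precision, recall, and F1 between two sets of (start, end, label) spans."""
    ref, pred = set(reference_spans), set(predicted_spans)
    tp = len(ref & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy annotations from two annotators over the same document (character offsets, entity type).
annotator_a = {(0, 11, "DRUG"), (25, 34, "DISEASE"), (50, 57, "DOSE")}
annotator_b = {(0, 11, "DRUG"), (25, 34, "DISEASE"), (60, 66, "DOSE")}
print(span_prf(annotator_a, annotator_b))
```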
19. Mork J, Aronson A, Demner-Fushman D. 12 years on - Is the NLM medical text indexer still useful and relevant? J Biomed Semantics 2017;8:8. PMID: 28231809; PMCID: PMC5324252; DOI: 10.1186/s13326-017-0113-5.
Abstract
Background Facing a growing workload and dwindling resources, the US National Library of Medicine (NLM) created the Indexing Initiative project in 1996. This cross-library team's mission is to explore indexing methodologies for ensuring the quality and currency of NLM document collections. The NLM Medical Text Indexer (MTI) is the main product of this project and has been providing automated indexing recommendations since 2002. After all this time, the question arises whether MTI is still useful and relevant. Methods To answer the question of MTI usefulness, we track a wide variety of statistics related to how frequently MEDLINE indexers refer to MTI recommendations, how well MTI performs against human indexing, and how often MTI is used. To answer the question of MTI relevancy compared with other available tools, we participated in the 2013 and 2014 BioASQ Challenges. The BioASQ Challenges provided us with an unbiased comparison between the MTI system and other systems performing the same task. Results Indexers have continually increased their use of MTI recommendations over the years, from 15.75% of the articles they indexed in 2002 to 62.44% in 2014, showing that indexers find MTI increasingly useful. The MTI performance statistics show significant improvement in Precision (+0.2992) and F1 (+0.1997), with modest gains in Recall (+0.0454), over the years. MTI consistency is comparable to the available indexer consistency studies. MTI performed well in both BioASQ Challenges, ranking among the top-tier teams. Conclusions Based on our findings, yes, MTI is still relevant and useful, and it needs to be improved and expanded. The BioASQ Challenge results have shown that we need to incorporate more machine learning into MTI while still retaining the indexing rules that have earned MTI the indexers' trust over the years. We also need to expand MTI through the use of full text, when and where it is available, to provide coverage of indexing terms that are typically only found in the full text. The role of MTI at NLM is also expanding into new areas, further reinforcing the idea that MTI is increasingly useful and relevant.
Affiliation(s)
- James Mork
- US National Library of Medicine, 8600 Rockville Pike, Bethesda, USA
- Alan Aronson
- US National Library of Medicine, 8600 Rockville Pike, Bethesda, USA
20. Cohen T, Roberts K, Gururaj AE, Chen X, Pournejati S, Alter G, Hersh WR, Demner-Fushman D, Ohno-Machado L, Xu H. A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge. Database (Oxford) 2017;2017:4085942. PMID: 29220453; PMCID: PMC5737202; DOI: 10.1093/database/bax061.
Abstract
Database URL https://biocaddie.org/benchmark-data.
Affiliation(s)
- Trevor Cohen
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- Anupama E. Gururaj
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- Xiaoling Chen
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- Saeid Pournejati
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
- George Alter
- Population Studies Center, University of Michigan, 426 Thompson St., Ann Arbor, MI 48104, USA
- William R. Hersh
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Rd, Portland, OR 97239, USA
- Dina Demner-Fushman
- U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Lucila Ohno-Machado
- Department of Biomedical Informatics, University of California San Diego, Altman Clinical and Translational Research Institute Building, 9452 Medical Center Drive, La Jolla, CA 92093, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St. Suite 600, Houston, TX 77030, USA
21. Scerri A, Kuriakose J, Deshmane AA, Stanger M, Cotroneo P, Moore R, Naik R, de Waard A. Elsevier's approach to the bioCADDIE 2016 Dataset Retrieval Challenge. Database (Oxford) 2017;2017:4090923. PMID: 29220454; PMCID: PMC5737073; DOI: 10.1093/database/bax056.
Abstract
Database URL https://data.mendeley.com/datasets/zd9dxpyybg/1.
Affiliation(s)
- Antony Scerri
- Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
- John Kuriakose
- Infosys, Hosur Road, Electronics City, Bengaluru 560 100, India
- Mark Stanger
- Search Technologies Corp, 1110 Herndon Parkway, Suite 306, Herndon, VA 20170, USA
- Peter Cotroneo
- Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
- Rebekah Moore
- Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
- Raj Naik
- Elsevier Ltd, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK
22. Wright TB, Ball D, Hersh W. Query expansion using MeSH terms for dataset retrieval: OHSU at the bioCADDIE 2016 dataset retrieval challenge. Database (Oxford) 2017;2017:4561576. PMID: 29220467; PMCID: PMC5737054; DOI: 10.1093/database/bax065.
Abstract
Scientific data are being generated at an ever-increasing rate. The Biomedical and Healthcare Data Discovery Index Ecosystem (bioCADDIE) is an NIH-funded Data Discovery Index that aims to provide a platform for researchers to locate, retrieve, and share research datasets. The bioCADDIE 2016 Dataset Retrieval Challenge was held to identify the most effective dataset retrieval methods. We aimed to assess the value of Medical Subject Heading (MeSH) term-based query expansion for improving retrieval. Our system, based on the open-source search engine Elasticsearch, expands queries by identifying synonyms from the MeSH vocabulary and adding these to the original query. The number and relative weighting of MeSH terms are variable. The top 1000 search results for the 15 challenge queries were submitted for evaluation. After the challenge, we performed additional runs to determine the optimal number of MeSH terms and weighting. Our best overall score used five MeSH terms with a 1:5 terms:words weighting ratio, achieving an inferred normalized discounted cumulative gain (infNDCG) of 0.445, which was the third highest score among the 10 research groups who participated in the challenge. Further testing revealed that our initial combination of MeSH terms and weighting yielded the best overall performance. Scores varied considerably between queries as well as with different variations of MeSH terms and weights. Query expansion using MeSH terms can enhance the search relevance of biomedical datasets. High variability across queries and system variables suggests room for improvement and directions for further research. Database URL: https://biocaddie.org/benchmark-data
Affiliation(s)
- Theodore B Wright
- Department of Medical Informatics & Clinical Epidemiology, School of Medicine, Oregon Health & Science University, 5th Floor, Biomedical Information Communication Center (BICC), 3181 S.W. Sam Jackson Park Rd., Portland, OR 97239, USA
- David Ball
- Department of Medical Informatics & Clinical Epidemiology, School of Medicine, Oregon Health & Science University, 5th Floor, Biomedical Information Communication Center (BICC), 3181 S.W. Sam Jackson Park Rd., Portland, OR 97239, USA
- William Hersh
- Department of Medical Informatics & Clinical Epidemiology, School of Medicine, Oregon Health & Science University, 5th Floor, Biomedical Information Communication Center (BICC), 3181 S.W. Sam Jackson Park Rd., Portland, OR 97239, USA
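One hedged way to realize the weighted terms:words expansion described in the entry above is an Elasticsearch bool/should query in which the original query words and the added MeSH synonyms carry different boosts; the index name, field name, synonyms, and the 1:5 weighting shown here are illustrative assumptions, not the authors' exact configuration.

```python
def build_expanded_query(original_query, mesh_synonyms, mesh_weight=1.0, word_weight=5.0):
    """Elasticsearch bool query: original words boosted 5x relative to MeSH expansion terms."""
    should = [
        {"match": {"text": {"query": original_query, "boost": word_weight}}}
    ]
    should += [
        {"match": {"text": {"query": syn, "boost": mesh_weight}}} for syn in mesh_synonyms
    ]
    return {"query": {"bool": {"should": should}}}

# Hypothetical topic and MeSH synonyms found for its key concept.
query_body = build_expanded_query(
    "glioblastoma gene expression data",
    mesh_synonyms=["glioblastoma multiforme", "grade IV astrocytoma"],
)
print(query_body)

# With the official client this body could be submitted as, for example:
# from elasticsearch import Elasticsearch
# hits = Elasticsearch().search(index="biocaddie_datasets", body=query_body)
```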
23. Wang Y, Rastegar-Mojarad M, Komandur-Elayavilli R, Liu H. Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts. Database (Oxford) 2017;2017:bax091. PMID: 31725862; PMCID: PMC7243926; DOI: 10.1093/database/bax091.
Abstract
The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming to facilitate their reuse and accelerate scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the datasets relevant to researchers' queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset, including the title and description, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings, and a re-ranking strategy to enhance retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms the other systems in terms of inferred Average Precision and inferred normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.
Affiliation(s)
- Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA
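A minimal sketch of word-embedding-based query expansion in the spirit of the entry above, assuming a pretrained word2vec-format model is available locally (the file path, similarity threshold, and example query are assumptions): each query term is expanded with its nearest neighbors in embedding space.

```python
from gensim.models import KeyedVectors

# Hypothetical pretrained biomedical embeddings in word2vec text format.
embeddings = KeyedVectors.load_word2vec_format("biomedical_vectors.txt", binary=False)

def expand_with_embeddings(query, topn=3, min_similarity=0.6):
    """Add nearest-neighbor terms of each query word that exceed a similarity threshold."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        if term in embeddings:
            for neighbor, sim in embeddings.most_similar(term, topn=topn):
                if sim >= min_similarity and neighbor not in expanded:
                    expanded.append(neighbor)
    return " ".join(expanded)

print(expand_with_embeddings("glioblastoma rna-seq"))
```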
24. Goodwin TR, Harabagiu SM. Medical Question Answering for Clinical Decision Support. Proceedings of the ACM International Conference on Information and Knowledge Management 2016;2016:297-306. PMID: 28758046; PMCID: PMC5530755; DOI: 10.1145/2983323.2983819.
Abstract
The goal of modern Clinical Decision Support (CDS) systems is to provide physicians with information relevant to their management of patient care. When faced with a medical case, a physician asks questions about the diagnosis, the tests, or the treatments that should be administered. Recently, the TREC-CDS track has addressed this challenge by evaluating the retrieval of relevant scientific articles in which the answers to medical questions in support of CDS can be found. Although retrieving relevant medical articles instead of identifying the answers was believed to be an easier task, state-of-the-art results are not yet sufficiently promising. In this paper, we present a novel framework for answering medical questions in the spirit of TREC-CDS by first discovering the answer and then selecting and ranking scientific articles that contain the answer. Answer discovery is the result of probabilistic inference operating on a probabilistic knowledge graph, automatically generated by processing the medical language of large collections of electronic medical records (EMRs). The probabilistic inference of answers combines knowledge from medical practice (EMRs) with knowledge from medical research (scientific articles). It also takes into account the medical knowledge automatically discerned from the medical case description. We show that this novel form of medical question answering (Q/A) produces very promising results in (a) accurately identifying the answers and (b) improving medical article ranking by 40%.
Affiliation(s)
- Travis R Goodwin
- Human Language Technology Research Institute, Department of Computer Science, University of Texas at Dallas, 800 W. Campbell Rd., Richardson, Texas 75080
- Sanda M Harabagiu
- Human Language Technology Research Institute, Department of Computer Science, University of Texas at Dallas, 800 W. Campbell Rd., Richardson, Texas 75080
25. Goeuriot L, Jones GJF, Kelly L, Müller H, Zobel J. Medical information retrieval: introduction to the special issue. Information Retrieval Journal 2016. DOI: 10.1007/s10791-015-9277-8.