1
Humbert-Droz M, Corley J, Tamang S, Gevaert O. Development and validation of MedDRA Tagger: a tool for extraction and structuring medical information from clinical notes. medRxiv 2022:2022.12.14.22283470. [PMID: 36561189 PMCID: PMC9774225 DOI: 10.1101/2022.12.14.22283470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
Rapid and automated extraction of clinical information from patients' notes is a desirable though difficult task. Natural language processing (NLP) and machine learning have great potential to automate and accelerate such applications, but developing such models can require a large amount of labeled clinical text, which can be a slow and laborious process. To address this gap, we propose the MedDRA tagger, a fast annotation tool that makes use of industrial level libraries such as spaCy, biomedical ontologies and weak supervision to annotate and extract clinical concepts at scale. The tool can be used to annotate clinical text and obtain labels for training machine learning models and further refine the clinical concept extraction performance, or to extract clinical concepts for observational study purposes. To demonstrate the usability and versatility of our tool, we present three different use cases: we use the tagger to determine patients with a primary brain cancer diagnosis, we show evidence of rising mental health symptoms at the population level and our last use case shows the evolution of COVID-19 symptomatology throughout three waves between February 2020 and October 2021. The validation of our tool showed good performance on both specific annotations from our development set (F1 score 0.81) and open source annotated data set (F1 score 0.79). We successfully demonstrate the versatility of our pipeline with three different use cases. Finally, we note that the modular nature of our tool allows for a straightforward adaptation to another biomedical ontology. We also show that our tool is independent of EHR system, and as such generalizable.
Affiliation(s)
- Marie Humbert-Droz
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA
- Suzanne Tamang
- Department of Biomedical Data Science, Stanford University, Stanford, CA
- Olivier Gevaert
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Stanford, CA
- Department of Biomedical Data Science, Stanford University, Stanford, CA
2
Cho H, Kim B, Choi W, Lee D, Lee H. Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes. Sci Data 2022; 9:235. [PMID: 35618736 PMCID: PMC9135735 DOI: 10.1038/s41597-022-01350-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 05/03/2022] [Indexed: 11/09/2022] Open
Abstract
Medicinal plants have demonstrated therapeutic potential for applicability for a wide range of observable characteristics in the human body, known as "phenotype," and have been considered favorably in clinical treatment. With an ever increasing interest in plants, many researchers have attempted to extract meaningful information by identifying relationships between plants and phenotypes from the existing literature. Although natural language processing (NLP) aims to extract useful information from unstructured textual data, there is no appropriate corpus available to train and evaluate the NLP model for plants and phenotypes. Therefore, in the present study, we have presented the plant-phenotype relationship (PPR) corpus, a high-quality resource that supports the development of various NLP fields; it includes information derived from 600 PubMed abstracts corresponding to 5,668 plant and 11,282 phenotype entities, and demonstrates a total of 9,709 relationships. We have also described benchmark results through named entity recognition and relation extraction systems to verify the quality of our data and to show the significant performance of NLP tasks in the PPR test set.
Affiliation(s)
- Hyejin Cho
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Baeksoo Kim
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Wonjun Choi
- Digital Curation Center, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
- Doheon Lee
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Republic of Korea
- Hyunju Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea.
3
Zhang Y, Cui S, Gao H. Adverse drug reaction detection on social media with deep linguistic features. J Biomed Inform 2020; 106:103437. [PMID: 32360987 DOI: 10.1016/j.jbi.2020.103437] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 11/16/2019] [Revised: 04/02/2020] [Accepted: 04/26/2020] [Indexed: 11/26/2022]
Abstract
Adverse reactions caused by drugs are one of the most important public health problems. Social media has encouraged more patients to share their drug use experiences and has become a major source for the detection of professionally unreported adverse drug reactions (ADRs). Since a large number of user posts do not mention any ADR, accurate detection of the presence of ADRs in each user post is necessary before further research can be conducted. Previous feature-based methods focus on extracting more shallow linguistic features that are unable to capture deep and subtle information in the context, ultimately failing to provide satisfactory accuracy. To overcome the limitations of previous studies, this paper proposes a novel method that can extract deep linguistic features and then combine them with shallow linguistic features for ADR detection. We first extract predicate-ADR pairs under the guidance of extended syntactic dependencies and ADR lexicon. Then, we extract semantic and part-of-speech (POS) features for each pair and pool the features of different pairs to generate a holistic representation of deep linguistic features. Finally, we use the collection of deep features and several shallow features to train the predictive models. A series of experiments are performed on data sets collected from DailyStrength and Twitter. Our approach can achieve AUCs of 94.44% and 88.97% on the two data sets, respectively, outperforming other state-of-the-art methods. The results demonstrate the potential benefits of deep linguistic features for ADR detection on social data. This method can be applied to multiple other healthcare and text analysis tasks and can be used to support pharmacovigilance research.
Affiliation(s)
- Ying Zhang
- School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China; School of Business, University of Jinan, Jinan 250022, China.
- Shaoze Cui
- School of Economics and Management, Dalian University of Technology, Dalian 116023, China.
- Huiying Gao
- School of Management and Economics, Beijing Institute of Technology, Beijing 100081, China.
4
Artificial intelligence's essential role in the process of drug discovery. Future Drug Discovery 2019. [DOI: 10.4155/fdd-2019-0026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/20/2023] Open
5
Combi C, Zorzi M, Pozzani G, Arzenton E, Moretti U. Normalizing Spontaneous Reports Into MedDRA: Some Experiments With MagiCoder. IEEE J Biomed Health Inform 2018; 23:95-102. [PMID: 30059326 DOI: 10.1109/jbhi.2018.2861213] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/09/2022]
Abstract
Text normalization into medical dictionaries is useful to support clinical tasks. A typical setting is pharmacovigilance (PV). The manual detection of suspected adverse drug reactions (ADRs) in narrative reports is time consuming and natural language processing (NLP) provides a concrete help to PV experts. In this paper, we carry out experiments for testing performances of MagiCoder, an NLP application designed to extract MedDRA terms from narrative clinical text. Given a narrative description, MagiCoder proposes an automatic encoding. The pharmacologist reviews, (possibly) corrects, and then, validates the solution. This drastically reduces the time needed for the validation of reports with respect to a completely manual encoding. In previous work, we mainly tested MagiCoder performances on Italian written spontaneous reports. In this paper, we include some new features, change the experiment design, and carry on more tests about MagiCoder. Moreover, we do a change of language, moving to English documents. In particular, we tested MagiCoder on the CADEC dataset, a corpus of manually annotated posts about ADRs collected from the social media.
6
An MCEM Framework for Drug Safety Signal Detection and Combination from Heterogeneous Real World Evidence. Sci Rep 2018; 8:1806. [PMID: 29379048 PMCID: PMC5789130 DOI: 10.1038/s41598-018-19979-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/12/2017] [Accepted: 01/11/2018] [Indexed: 11/08/2022] Open
Abstract
Delayed drug safety insights can impact patients, pharmaceutical companies, and the whole society. Post-market drug safety surveillance plays a critical role in providing drug safety insights, where real world evidence such as spontaneous reporting systems (SRS) and a series of disproportional analysis serve as a cornerstone of proactive and predictive drug safety surveillance. However, they still face several challenges including concomitant drugs confounders, rare adverse drug reaction (ADR) detection, data bias, and the under-reporting issue. In this paper, we are developing a new framework that detects improved drug safety signals from multiple data sources via Monte Carlo Expectation-Maximization (MCEM) and signal combination. In MCEM procedure, we propose a new sampling approach to generate more accurate SRS signals for each ADR through iteratively down-weighting their associations with irrelevant drugs in case reports. While in signal combination step, we adopt Bayesian hierarchical model and propose a new summary statistic such that SRS signals can be combined with signals derived from other observational health data allowing for related signals to borrow statistical support with adjustment of data reliability. They combined effectively alleviate the concomitant confounders, data bias, rare ADR and under-reporting issues. Experimental results demonstrated the effectiveness and usefulness of the proposed framework.
8
Natarajan S, Bangera V, Khot T, Picado J, Wazalwar A, Costa VS, Page D, Caldwell M. Markov Logic Networks for Adverse Drug Event Extraction from Text. Knowl Inf Syst 2017; 51:435-457. [PMID: 29123330 PMCID: PMC5673137 DOI: 10.1007/s10115-016-0980-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/28/2022]
Abstract
Adverse drug events (ADEs) are a major concern and point of emphasis for the medical profession, government, and society. A diverse set of techniques from epidemiology, statistics, and computer science are being proposed and studied for ADE discovery from observational health data (e.g., EHR and claims data), social network data (e.g., Google and Twitter posts), and other information sources. Methodologies are needed for evaluating, quantitatively measuring, and comparing the ability of these various approaches to accurately discover ADEs. This work is motivated by the observation that text sources such as the Medline/Medinfo library provide a wealth of information on human health. Unfortunately, ADEs often result from unexpected interactions, and the connection between conditions and drugs is not explicit in these sources. Thus, in this work we address the question of whether we can quantitatively estimate relationships between drugs and conditions from the medical literature. This paper proposes and studies a state-of-the-art NLP-based extraction of ADEs from text.
Affiliation(s)
- Tushar Khot
- Indiana University, University of Wisconsin-Madison
- David Page
- Indiana University, University of Wisconsin-Madison
9
Karystianis G, Sheppard T, Dixon WG, Nenadic G. Modelling and extraction of variability in free-text medication prescriptions from an anonymised primary care electronic medical record research database. BMC Med Inform Decis Mak 2016; 16:18. [PMID: 26860263 PMCID: PMC4748480 DOI: 10.1186/s12911-016-0255-x] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 08/14/2015] [Accepted: 01/28/2016] [Indexed: 11/30/2022] Open
Abstract
Background Free-text medication prescriptions contain detailed instruction information that is key when preparing drug data for analysis. The objective of this study was to develop a novel model and automated text-mining method to extract detailed structured medication information from free-text prescriptions and explore their variability (e.g. optional dosages) in primary care research databases. Methods We introduce a prescription model that provides minimum and maximum values for dose number, frequency and interval, allowing modelling variability and flexibility within a drug prescription. We developed a text mining system that relies on rules to extract such structured information from prescription free-text dosage instructions. The system was applied to medication prescriptions from an anonymised primary care electronic record database (Clinical Practice Research Datalink, CPRD). Results We have evaluated our approach on a test set of 220 CPRD prescription free-text directions. The system achieved an overall accuracy of 91% at the prescription level, with 97% accuracy across the attribute levels. We then further analysed over 56,000 most common free text prescriptions from CPRD records and found that 1 in 4 has inherent variability, i.e. a choice in taking medication specified by different minimum and maximum doses, duration or frequency. Conclusions Our approach provides an accurate, automated way of coding prescription free text information, including information about flexibility and variability within a prescription. The method allows the researcher to decide how best to prepare the prescription data for drug efficacy and safety analyses in any given setting, and test various scenarios and their impact.
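The minimum/maximum prescription model described in this abstract can be illustrated with a small sketch. This is not the authors' system: the `DosageRange` class, the single regular-expression rule, and the example directions are all hypothetical, and only dose number and frequency are covered (interval is omitted for brevity).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class DosageRange:
    """Hypothetical container: min/max dose number and min/max daily frequency."""
    dose_min: float
    dose_max: float
    freq_min: float
    freq_max: float


# One illustrative rule (an assumption, not taken from the paper). It matches
# directions such as "Take 1-2 tablets 3 to 4 times a day".
PATTERN = re.compile(
    r"take\s+(?P<d1>\d+)(?:\s*(?:-|to|or)\s*(?P<d2>\d+))?\s+tablets?\s+"
    r"(?P<f1>\d+)(?:\s*(?:-|to|or)\s*(?P<f2>\d+))?\s+times\s+(?:a|per)\s+day",
    re.IGNORECASE,
)


def parse_direction(text: str) -> Optional[DosageRange]:
    """Return a DosageRange for a matching direction, or None if no rule fires."""
    m = PATTERN.search(text)
    if m is None:
        return None
    d1 = float(m.group("d1"))
    d2 = float(m.group("d2") or d1)  # no upper bound given: max equals min
    f1 = float(m.group("f1"))
    f2 = float(m.group("f2") or f1)
    return DosageRange(d1, d2, f1, f2)


def has_variability(r: DosageRange) -> bool:
    """A prescription is 'variable' when any minimum differs from its maximum."""
    return r.dose_min != r.dose_max or r.freq_min != r.freq_max


flexible = parse_direction("Take 1-2 tablets 3 to 4 times a day")
fixed = parse_direction("Take 1 tablet 2 times a day")
print(flexible, has_variability(flexible))
print(fixed, has_variability(fixed))
```

Storing a minimum and a maximum for every attribute lets fixed and flexible directions share one representation, which is what makes a "1 in 4 has inherent variability" style of count straightforward to compute.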
Affiliation(s)
- George Karystianis
- School of Computer Science, University of Manchester, Manchester, UK; The Christie NHS Foundation Trust, Manchester, UK
- Therese Sheppard
- Arthritis Research UK Centre for Epidemiology, University of Manchester, Manchester, UK
- William G Dixon
- Arthritis Research UK Centre for Epidemiology, University of Manchester, Manchester, UK; The Farr Institute of Health Informatics Research, Health eResearch Centre, Manchester, UK
- Goran Nenadic
- School of Computer Science, University of Manchester, Manchester, UK; The Farr Institute of Health Informatics Research, Health eResearch Centre, Manchester, UK; Manchester Institute of Biotechnology, University of Manchester, Manchester, UK.
10
Peek N, Combi C, Marin R, Bellazzi R. Thirty years of artificial intelligence in medicine (AIME) conferences: A review of research themes. Artif Intell Med 2015; 65:61-73. [DOI: 10.1016/j.artmed.2015.07.003] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 01/28/2015] [Revised: 07/17/2015] [Accepted: 07/17/2015] [Indexed: 10/23/2022]
11
Segura-Bedmar I, Martínez P, Revert R, Moreno-Schneider J. Exploring Spanish health social media for detecting drug effects. BMC Med Inform Decis Mak 2015; 15 Suppl 2:S6. [PMID: 26100267 PMCID: PMC4474583 DOI: 10.1186/1472-6947-15-s2-s6] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Adverse Drug reactions (ADR) cause a high number of deaths among hospitalized patients in developed countries. Major drug agencies have devoted a great interest in the early detection of ADRs due to their high incidence and increasing health care costs. Reporting systems are available in order for both healthcare professionals and patients to alert about possible ADRs. However, several studies have shown that these adverse events are underestimated. Our hypothesis is that health social networks could be a significant information source for the early detection of ADRs as well as of new drug indications. METHODS In this work we present a system for detecting drug effects (which include both adverse drug reactions as well as drug indications) from user posts extracted from a Spanish health forum. Texts were processed using MeaningCloud, a multilingual text analysis engine, to identify drugs and effects. In addition, we developed the first Spanish database storing drugs as well as their effects automatically built from drug package inserts gathered from online websites. We then applied a distant-supervision method using the database on a collection of 84,000 messages in order to extract the relations between drugs and their effects. To classify the relation instances, we used a kernel method based only on shallow linguistic information of the sentences. RESULTS Regarding Relation Extraction of drugs and their effects, the distant supervision approach achieved a recall of 0.59 and a precision of 0.48. CONCLUSIONS The task of extracting relations between drugs and their effects from social media is a complex challenge due to the characteristics of social media texts. These texts, typically posts or tweets, usually contain many grammatical errors and spelling mistakes. Moreover, patients use lay terminology to refer to diseases, symptoms and indications that is not usually included in lexical resources in languages other than English.
12
Nikfarjam A, Sarker A, O'Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc 2015; 22:671-81. [PMID: 25755127 PMCID: PMC4457113 DOI: 10.1093/jamia/ocu041] [Citation(s) in RCA: 221] [Impact Index Per Article: 24.6] [Received: 07/29/2014] [Accepted: 12/04/2014] [Indexed: 02/06/2023] Open
Abstract
Objective Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words’ semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.
Affiliation(s)
- Azadeh Nikfarjam
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA
- Abeed Sarker
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA
- Karen O'Connor
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA
- Rachel Ginn
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA
- Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA
13
Declerck G, Hussain S, Daniel C, Yuksel M, Laleci GB, Twagirumukiza M, Jaulent MC. Bridging data models and terminologies to support adverse drug event reporting using EHR data. Methods Inf Med 2014; 54:24-31. [PMID: 25487120 DOI: 10.3414/me13-02-0025] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 06/15/2013] [Accepted: 03/17/2014] [Indexed: 12/23/2022]
Abstract
INTRODUCTION This article is part of the Focus Theme of Methods of Information in Medicine on "Managing Interoperability and Complexity in Health Systems". BACKGROUND SALUS project aims at building an interoperability platform and a dedicated toolkit to enable secondary use of electronic health records (EHR) data for post marketing drug surveillance. An important component of this toolkit is a drug-related adverse events (AE) reporting system designed to facilitate and accelerate the reporting process using automatic prepopulation mechanisms. OBJECTIVE To demonstrate SALUS approach for establishing syntactic and semantic interoperability for AE reporting. METHOD Standard (e.g. HL7 CDA-CCD) and proprietary EHR data models are mapped to the E2B(R2) data model via SALUS Common Information Model. Terminology mapping and terminology reasoning services are designed to ensure the automatic conversion of source EHR terminologies (e.g. ICD-9-CM, ICD-10, LOINC or SNOMED-CT) to the target terminology MedDRA which is expected in AE reporting forms. A validated set of terminology mappings is used to ensure the reliability of the reasoning mechanisms. RESULTS The percentage of data elements of a standard E2B report that can be completed automatically has been estimated for two pilot sites. In the best scenario (i.e. the available fields in the EHR have actually been filled), only 36% (pilot site 1) and 38% (pilot site 2) of E2B data elements remain to be filled manually. In addition, most of these data elements shall not be filled in each report. CONCLUSION SALUS platform's interoperability solutions enable partial automation of the AE reporting process, which could contribute to improve current spontaneous reporting practices and reduce under-reporting, which is currently one major obstacle in the process of acquisition of pharmacovigilance data.
Affiliation(s)
- G Declerck
- Gunnar Declerck, Centre de recherche des Cordeliers, 15 rue de l'école de médecine, 75006 Paris, France
14
Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J Biomed Inform 2014; 53:196-207. [PMID: 25451103 DOI: 10.1016/j.jbi.2014.11.002] [Citation(s) in RCA: 145] [Impact Index Per Article: 14.5] [Received: 07/07/2014] [Revised: 10/24/2014] [Accepted: 11/02/2014] [Indexed: 10/24/2022]
Abstract
OBJECTIVE Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user posted data is available, which have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user posted internet data; and (iii) to investigate if combining training data from distinct corpora can improve automatic classification accuracies. METHODS One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies. RESULTS Our feature-rich classification approach performs significantly better than previously published approaches with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units) respectively.
CONCLUSIONS Our research results indicate that using advanced NLP techniques for generating information rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future.
Affiliation(s)
- Abeed Sarker
- Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd., Scottsdale, AZ 85259, USA.
- Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, 13212 East Shea Blvd., Scottsdale, AZ 85259, USA.
15
Xu R, Wang Q. Large-scale combining signals from both biomedical literature and the FDA Adverse Event Reporting System (FAERS) to improve post-marketing drug safety signal detection. BMC Bioinformatics 2014; 15:17. [PMID: 24428898 PMCID: PMC3906761 DOI: 10.1186/1471-2105-15-17] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.7] [Received: 07/03/2013] [Accepted: 01/13/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Independent data sources can be used to augment post-marketing drug safety signal detection. The vast amount of publicly available biomedical literature contains rich side effect information for drugs at all clinical stages. In this study, we present a large-scale signal boosting approach that combines over 4 million records in the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) and over 21 million biomedical articles. RESULTS The datasets are comprised of 4,285,097 records from FAERS and 21,354,075 MEDLINE articles. We first extracted all drug-side effect (SE) pairs from FAERS. Our study implemented a total of seven signal ranking algorithms. We then compared these different ranking algorithms before and after they were boosted with signals from MEDLINE sentences or abstracts. Finally, we manually curated all drug-cardiovascular (CV) pairs that appeared in both data sources and investigated whether our approach can detect many true signals that have not been included in FDA drug labels. We extracted a total of 2,787,797 drug-SE pairs from FAERS with a low initial precision of 0.025. The ranking algorithm combined signals from both FAERS and MEDLINE, significantly improving the precision from 0.025 to 0.371 for top-ranked pairs, representing a 13.8 fold elevation in precision. We showed by manual curation that drug-SE pairs that appeared in both data sources were highly enriched with true signals, many of which have not yet been included in FDA drug labels. CONCLUSIONS We have developed an efficient and effective drug safety signal ranking and strengthening approach. We demonstrate that large-scale combining information from FAERS and biomedical literature can significantly contribute to drug safety surveillance.
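As a small worked check of the precision figures quoted in this abstract, the sketch below computes both the plain ratio and the relative increase between 0.025 and 0.371. The function names are illustrative. The reading that the reported "13.8 fold" corresponds to the relative increase rather than the ratio is an assumption drawn from the arithmetic, not something the abstract states.

```python
def fold_change(before: float, after: float) -> float:
    """Plain ratio: how many times larger `after` is than `before`."""
    return after / before


def relative_increase(before: float, after: float) -> float:
    """Increase expressed in multiples of the starting value."""
    return (after - before) / before


before, after = 0.025, 0.371  # precision values quoted in the abstract
print(round(fold_change(before, after), 2))        # 14.84
print(round(relative_increase(before, after), 2))  # 13.84
```

The two conventions always differ by exactly one, so it is worth checking which one a reported "fold" figure uses before comparing it across papers.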
Affiliation(s)
- Rong Xu
- Medical Informatics Division, Case Western Reserve, Cleveland, Ohio, USA
16
Xu R, Wang Q. Automatic signal extraction, prioritizing and filtering approaches in detecting post-marketing cardiovascular events associated with targeted cancer drugs from the FDA Adverse Event Reporting System (FAERS). J Biomed Inform 2013; 47:171-7. [PMID: 24177320 DOI: 10.1016/j.jbi.2013.10.008] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 07/07/2013] [Revised: 09/07/2013] [Accepted: 10/21/2013] [Indexed: 02/07/2023]
Abstract
OBJECTIVE Targeted drugs dramatically improve the treatment outcomes in cancer patients; however, these innovative drugs are often associated with unexpectedly high cardiovascular toxicity. Currently, cardiovascular safety represents both a challenging issue for drug developers, regulators, researchers, and clinicians and a concern for patients. While FDA drug labels have captured many of these events, spontaneous reporting systems are a main source for post-marketing drug safety surveillance in 'real-world' (outside of clinical trials) cancer patients. In this study, we present approaches to extracting, prioritizing, filtering, and confirming cardiovascular events associated with targeted cancer drugs from the FDA Adverse Event Reporting System (FAERS). DATA AND METHODS The dataset includes records of 4,285,097 patients from FAERS. We first extracted drug-cardiovascular event (drug-CV) pairs from FAERS through named entity recognition and mapping processes. We then compared six ranking algorithms in prioritizing true positive signals among extracted pairs using known drug-CV pairs derived from FDA drug labels. We also developed three filtering algorithms to further improve precision. Finally, we manually validated extracted drug-CV pairs using 21 million published MEDLINE records. RESULTS We extracted a total of 11,173 drug-CV pairs from FAERS. We showed that ranking by frequency is significantly more effective than by the five standard signal detection methods (246% improvement in precision for top-ranked pairs). The filtering algorithm we developed further improved overall precision by 91.3%. By manual curation using literature evidence, we show that about 51.9% of the 617 drug-CV pairs that appeared in both FAERS and MEDLINE sentences are true positives. In addition, 80.6% of these positive pairs have not been captured by FDA drug labeling. 
CONCLUSIONS The unique drug-CV association dataset that we created based on FAERS could facilitate our understanding and prediction of cardiotoxic events associated with targeted cancer drugs.
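As a rough illustration of the comparison described in this abstract, the sketch below ranks drug-event pairs both by raw report frequency and by the proportional reporting ratio (PRR). PRR is used here only as a familiar example of a disproportionality measure; the abstract does not name the five signal detection methods it evaluated. The function name, the input layout (one drug-event tuple per report), and the toy data are hypothetical.

```python
from collections import Counter


def rank_pairs(reports):
    """Rank (drug, event) pairs by frequency and by PRR.

    reports: list of (drug, event) tuples, one per report (toy layout,
    an assumption for this sketch, not the FAERS schema).
    """
    pair_counts = Counter(reports)
    drug_counts = Counter(drug for drug, _ in reports)
    event_counts = Counter(event for _, event in reports)
    total = len(reports)

    def prr(drug, event):
        a = pair_counts[(drug, event)]   # this drug, this event
        b = drug_counts[drug] - a        # this drug, other events
        c = event_counts[event] - a      # other drugs, this event
        d = total - a - b - c            # other drugs, other events
        if c == 0 or (c + d) == 0:
            return float("inf")          # event never seen with other drugs
        return (a / (a + b)) / (c / (c + d))

    by_frequency = sorted(pair_counts, key=lambda p: pair_counts[p], reverse=True)
    by_prr = sorted(pair_counts, key=lambda p: prr(*p), reverse=True)
    return by_frequency, by_prr


# Toy data: drug "B" with event "y" is most frequent,
# while drug "A" with event "x" is the most disproportionate.
toy = [("A", "x")] * 3 + [("A", "y")] + [("B", "x")] + [("B", "y")] * 5
by_frequency, by_prr = rank_pairs(toy)
```

The two orderings can disagree, which is why the choice of ranking function matters when only the top-ranked pairs are reviewed.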
Affiliation(s)
- Rong Xu
- Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, United States.
|
17
|
Identifying potential adverse effects using the web: a new approach to medical hypothesis generation. J Biomed Inform 2011; 44:989-96. [PMID: 21820083 DOI: 10.1016/j.jbi.2011.07.005] [Citation(s) in RCA: 118] [Impact Index Per Article: 9.1] [Received: 12/15/2010] [Revised: 07/16/2011] [Accepted: 07/20/2011] [Indexed: 11/23/2022]
Abstract
Medical message boards are online resources where users with a particular condition exchange information, some of which they might not otherwise share with medical providers. Many of these boards contain a large number of posts and contain patient opinions and experiences that would be potentially useful to clinicians and researchers. We present an approach that is able to collect a corpus of medical message board posts, de-identify the corpus, and extract information on potential adverse drug effects discussed by users. Using a corpus of posts to breast cancer message boards, we identified drug event pairs using co-occurrence statistics. We then compared the identified drug event pairs with adverse effects listed on the package labels of tamoxifen, anastrozole, exemestane, and letrozole. Of the pairs identified by our system, 75-80% were documented on the drug labels. Some of the undocumented pairs may represent previously unidentified adverse drug effects.
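The abstract mentions co-occurrence statistics without saying which statistic was used. As a minimal sketch under that caveat, the example below scores drug-event pairs by pointwise mutual information (PMI) over posts. PMI, the whitespace tokenization, the function name, and the toy posts are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter
from itertools import product


def cooccurrence_scores(posts, drugs, events):
    """Score drug-event pairs by PMI, counting each post at most once.

    drugs and events are sets of lowercase single-word terms
    (a simplification for this sketch).
    """
    n_posts = len(posts)
    drug_df, event_df, pair_df = Counter(), Counter(), Counter()
    for text in posts:
        tokens = set(text.lower().split())
        found_drugs = tokens & drugs
        found_events = tokens & events
        drug_df.update(found_drugs)
        event_df.update(found_events)
        pair_df.update(product(found_drugs, found_events))
    # PMI = log2( P(drug, event) / (P(drug) * P(event)) )
    return {
        (drug, event): math.log2(count * n_posts / (drug_df[drug] * event_df[event]))
        for (drug, event), count in pair_df.items()
    }


posts = [
    "tamoxifen gave me nausea",
    "letrozole and fatigue",
    "tamoxifen nausea again",
    "fatigue today",
]
scores = cooccurrence_scores(posts, {"tamoxifen", "letrozole"}, {"nausea", "fatigue"})
```

Pairs that never share a post are simply absent from the result, so only observed co-occurrences would be compared against the package labels.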
|
18
|
Mork JG, Bodenreider O, Demner-Fushman D, Dogan RI, Lang FM, Lu Z, Névéol A, Peters L, Shooshan SE, Aronson AR. Extracting Rx information from clinical narrative. J Am Med Inform Assoc 2010; 17:536-9. [PMID: 20819859 DOI: 10.1136/jamia.2010.003970] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE The authors used the i2b2 Medication Extraction Challenge to evaluate their entity extraction methods, contribute to the generation of a publicly available collection of annotated clinical notes, and start developing methods for ontology-based reasoning using structured information generated from the unstructured clinical narrative. DESIGN Extraction of salient features of medication orders from the text of de-identified hospital discharge summaries was addressed with a knowledge-based approach using simple rules and lookup lists. The entity recognition tool, MetaMap, was combined with dose, frequency, and duration modules specifically developed for the Challenge as well as a prototype module for reason identification. MEASUREMENTS Evaluation metrics and corresponding results were provided by the Challenge organizers. RESULTS The results indicate that robust rule-based tools achieve satisfactory results in extraction of simple elements of medication orders, but more sophisticated methods are needed for identification of reasons for the orders and durations. LIMITATIONS Owing to the time constraints and nature of the Challenge, some obvious follow-on analysis has not been completed yet. CONCLUSIONS The authors plan to integrate the new modules with MetaMap to enhance its accuracy. This integration effort will provide guidance in retargeting existing tools for better processing of clinical text.
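The "simple rules and lookup lists" approach described in this abstract can be sketched with a few regular expressions. This is not MetaMap or the authors' modules; the patterns, the function name, and the example sentence are hypothetical and cover only a handful of common phrasings.

```python
import re

# Illustrative patterns only; real medication orders need far broader coverage.
DOSE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|units?)\b", re.IGNORECASE)
FREQUENCY = re.compile(
    r"\b(once daily|twice daily|daily|bid|tid|qid|every \d+ hours?)\b",
    re.IGNORECASE,
)
DURATION = re.compile(r"\bfor (\d+ (?:days?|weeks?|months?))\b", re.IGNORECASE)


def extract_order(sentence):
    """Return the first dose, frequency, and duration found, or None for each."""
    dose = DOSE.search(sentence)
    frequency = FREQUENCY.search(sentence)
    duration = DURATION.search(sentence)
    return {
        "dose": f"{dose.group(1)} {dose.group(2).lower()}" if dose else None,
        "frequency": frequency.group(1).lower() if frequency else None,
        "duration": duration.group(1).lower() if duration else None,
    }


order = extract_order("Lisinopril 10 mg twice daily for 2 weeks")
```

Dose and frequency follow fairly regular surface patterns, which fits the abstract's finding that rule-based tools handle simple elements well. Reasons for an order are expressed far more freely, so no comparable pattern is attempted here.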
Affiliation(s)
- James G Mork
- US National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
|