1
Leroy G, Andrews JG, KeAlohi-Preece M, Jaswani A, Song H, Galindo MK, Rice SA. Transparent deep learning to identify autism spectrum disorders (ASD) in EHR using clinical notes. J Am Med Inform Assoc 2024; 31:1313-1321. [PMID: 38626184 PMCID: PMC11105145 DOI: 10.1093/jamia/ocae080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 03/25/2024] [Accepted: 04/03/2024] [Indexed: 04/18/2024] Open
Abstract
OBJECTIVE Machine learning (ML) is increasingly employed to diagnose medical conditions, with algorithms trained to assign a single label using a black-box approach. We created an ML approach using deep learning that generates outcomes that are transparent and in line with clinical, diagnostic rules. We demonstrate our approach for autism spectrum disorders (ASD), a neurodevelopmental condition with increasing prevalence. METHODS We use unstructured data from the Centers for Disease Control and Prevention (CDC) surveillance records labeled by a CDC-trained clinician with ASD A1-3 and B1-4 criterion labels per sentence and with ASD case labels per record using Diagnostic and Statistical Manual of Mental Disorders (DSM-5) rules. One rule-based and three deep ML algorithms and six ensembles were compared and evaluated using a test set with 6773 sentences (N = 35 cases) set aside in advance. Criterion and case labeling were evaluated for each ML algorithm and ensemble. Case labeling outcomes were also compared with seven traditional tests. RESULTS Performance for criterion labeling was highest for the hybrid BiLSTM ML model. The best case labeling was achieved by an ensemble of two BiLSTM ML models using a majority vote. It achieved 100% precision (or PPV), 83% recall (or sensitivity), 100% specificity, 91% accuracy, and 0.91 F-measure. A comparison with existing diagnostic tests shows that our best ensemble was more accurate overall. CONCLUSIONS Transparent ML is achievable even with small datasets. By focusing on intermediate steps, deep ML can provide transparent decisions. By leveraging data redundancies, ML errors at the intermediate level have a low impact on final outcomes.
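The F-measure reported in this abstract can be checked against the reported precision and recall. The following is a minimal sketch, assuming the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the best ensemble (assumed to be F1).
f1 = f1_score(precision=1.00, recall=0.83)
print(round(f1, 2))  # 0.91
```

With precision 1.00 and recall 0.83, the harmonic mean is about 0.907, which rounds to the reported 0.91.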
Affiliation(s)
- Gondy Leroy
  - Department of Management Information Systems, The University of Arizona, Tucson, AZ 85621, United States
- Jennifer G Andrews
  - Department of Pediatrics, The University of Arizona, Tucson, AZ 85621, United States
- Ajay Jaswani
  - Department of Management Information Systems, The University of Arizona, Tucson, AZ 85621, United States
- Hyunju Song
  - Department of Computer Science, The University of Arizona, Tucson, AZ 85621, United States
- Maureen Kelly Galindo
  - Department of Pediatrics, The University of Arizona, Tucson, AZ 85621, United States
- Sydney A Rice
  - Department of Pediatrics, The University of Arizona, Tucson, AZ 85621, United States
2
Modi S, Kasmiran KA, Mohd Sharef N, Sharum MY. Extracting adverse drug events from clinical Notes: A systematic review of approaches used. J Biomed Inform 2024; 151:104603. [PMID: 38331081 DOI: 10.1016/j.jbi.2024.104603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Revised: 01/31/2024] [Accepted: 02/01/2024] [Indexed: 02/10/2024]
Abstract
BACKGROUND An adverse drug event (ADE) is any unfavorable effect that occurs due to the use of a drug. Extracting ADEs from unstructured clinical notes is essential to biomedical text extraction research because it helps with pharmacovigilance and patient medication studies. OBJECTIVE From the considerable amount of clinical narrative text, natural language processing (NLP) researchers have developed methods for extracting ADEs and their related attributes. This work presents a systematic review of current methods. METHODOLOGY Two biomedical databases have been searched from June 2022 until December 2023 for relevant publications regarding this review, namely the databases PubMed and Medline. Similarly, we searched the multi-disciplinary databases IEEE Xplore, Scopus, ScienceDirect, and the ACL Anthology. We adopted the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement guidelines and recommendations for reporting systematic reviews in conducting this review. Initially, we obtained 5,537 articles from the search results from the various databases between 2015 and 2023. Based on predefined inclusion and exclusion criteria for article selection, 100 publications have undergone full-text review, of which we consider 82 for our analysis. RESULTS We determined the general pattern for extracting ADEs from clinical notes, with named entity recognition (NER) and relation extraction (RE) being the dual tasks considered. Researchers that tackled both NER and RE simultaneously have approached ADE extraction as a "pipeline extraction" problem (n = 22), as a "joint task extraction" problem (n = 7), and as a "multi-task learning" problem (n = 6), while others have tackled only NER (n = 27) or RE (n = 20). We further grouped the reviews based on the approaches for data extraction, namely rule-based (n = 8), machine learning (n = 11), deep learning (n = 32), comparison of two or more approaches (n = 11), hybrid (n = 12) and large language models (n = 8). The most used datasets are MADE 1.0, TAC 2017 and n2c2 2018. CONCLUSION Extracting ADEs is crucial, especially for pharmacovigilance studies and patient medications. This survey showcases advances in ADE extraction research, approaches, datasets, and state-of-the-art performance in them. Challenges and future research directions are highlighted. We hope this review will guide researchers in gaining background knowledge and developing more innovative ways to address the challenges.
Affiliation(s)
- Salisu Modi
  - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Selangor, Malaysia; Department of Computer Science, Sokoto State University, Sokoto, Nigeria.
- Khairul Azhar Kasmiran
  - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Selangor, Malaysia.
- Nurfadhlina Mohd Sharef
  - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Selangor, Malaysia.
- Mohd Yunus Sharum
  - Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Selangor, Malaysia.
3
Zhou H, Austin R, Lu SC, Silverman GM, Zhou Y, Kilicoglu H, Xu H, Zhang R. Complementary and Integrative Health Information in the literature: its lexicon and named entity recognition. J Am Med Inform Assoc 2024; 31:426-434. [PMID: 37952122 PMCID: PMC10797266 DOI: 10.1093/jamia/ocad216] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/15/2023] [Revised: 10/20/2023] [Accepted: 11/08/2023] [Indexed: 11/14/2023] Open
Abstract
OBJECTIVE To construct an exhaustive Complementary and Integrative Health (CIH) Lexicon (CIHLex) to help better represent the often underrepresented physical and psychological CIH approaches in standard terminologies, and to also apply state-of-the-art natural language processing (NLP) techniques to help recognize them in the biomedical literature. MATERIALS AND METHODS We constructed the CIHLex by integrating various resources, compiling and integrating data from biomedical literature and relevant sources of knowledge. The Lexicon encompasses 724 unique concepts with 885 corresponding unique terms. We matched these concepts to the Unified Medical Language System (UMLS), and we developed and utilized BERT models comparing their efficiency in CIH named entity recognition to well-established models including MetaMap and CLAMP, as well as the large language model GPT3.5-turbo. RESULTS Of the 724 unique concepts in CIHLex, 27.2% could be matched to at least one term in the UMLS. About 74.9% of the mapped UMLS Concept Unique Identifiers were categorized as "Therapeutic or Preventive Procedure." Among the models applied to CIH named entity recognition, BLUEBERT delivered the highest macro-average F1-score of 0.91, surpassing other models. CONCLUSION Our CIHLex significantly augments representation of CIH approaches in biomedical literature. Demonstrating the utility of advanced NLP models, BERT notably excelled in CIH entity recognition. These results highlight promising strategies for enhancing standardization and recognition of CIH terminology in biomedical contexts.
Affiliation(s)
- Huixue Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, United States
- Robin Austin
  - School of Nursing, University of Minnesota, Minneapolis, MN, United States
- Sheng-Chieh Lu
  - Department of Symptom Research, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
- Greg Marc Silverman
  - Department of Surgery, University of Minnesota, Minneapolis, MN, United States
- Yuqi Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, United States
  - Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, United States
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, United States
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
- Rui Zhang
  - Department of Surgery, University of Minnesota, Minneapolis, MN, United States
4
Jędrejko K, Catlin O, Stewart T, Anderson A, Muszyńska B, Catlin DH. Unauthorized ingredients in "nootropic" dietary supplements: A review of the history, pharmacology, prevalence, international regulations, and potential as doping agents. Drug Test Anal 2023. [PMID: 37357012 DOI: 10.1002/dta.3529] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/05/2023] [Revised: 04/11/2023] [Accepted: 04/18/2023] [Indexed: 06/27/2023]
Abstract
The first nootropic prohibited in sport was fonturacetam (4-phenylpiracetam, carphedon) in 1998. Presented here 25 years later is a broad-scale consideration of the history, pharmacology, prevalence, regulations, and doping potential of nootropics viewed through a lens of 50 selected dietary supplements (DS) marketed as "cognitive enhancement," "brain health," "brain boosters," or "nootropics," with a focus on unauthorized ingredients. Nootropic DS have risen to prominence over the last decade often as multicomponent formulations of bioactive ingredients presenting compelling pharmacological questions and potential public health concerns. Many popular nootropics are unauthorized food or DS ingredients according to the European Commission including huperzine A, yohimbine, and dimethylaminoethanol; unapproved pharmaceuticals like phenibut or emoxypine (mexidol); previously registered drugs like meclofenoxate or reserpine; EU authorized pharmaceuticals like piracetam or vinpocetine; infamous doping agents like methylhexaneamine or dimethylbutylamine; and other investigational substances and peptides. Several are authorized DS ingredients in the United States resulting in significant global variability as to what qualifies as a legal nootropic. Prohibited stimulants or β2-agonists commonly used in "pre-workout," "weight loss," or "thermogenic" DS such as octodrine, hordenine, or higenamine are often stacked with nootropic substances. While stimulants and β2-agonists are defined as doping agents by the World Anti-Doping Agency (WADA), many nootropics are not, although some may qualify as non-approved substances or related substances under catch-all language in the WADA Prohibited List. Synergistic combinations, excessive dosing, or recently researched pharmacology may justify listing certain nootropics as doping agents or warrant additional attention in future regulations.
Affiliation(s)
- Karol Jędrejko
  - Faculty of Pharmacy, Department of Pharmaceutical Botany, Jagiellonian University Medical College, Kraków, Poland
- Oliver Catlin
  - Banned Substances Control Group (BSCG), Los Angeles, California, USA
- Timothy Stewart
  - Banned Substances Control Group (BSCG), Los Angeles, California, USA
- Ashley Anderson
  - International Sports Pharmacists Network, Fort Collins, Colorado, USA
- Bożena Muszyńska
  - Faculty of Pharmacy, Department of Pharmaceutical Botany, Jagiellonian University Medical College, Kraków, Poland
- Don H Catlin
  - Banned Substances Control Group (BSCG), Los Angeles, California, USA
  - Department of Medicine and Molecular and Medical Pharmacology, University of California Los Angeles (UCLA), Los Angeles, California, USA
5
Keloth VK, Banda JM, Gurley M, Heider PM, Kennedy G, Liu H, Liu F, Miller T, Natarajan K, Patterson OV, Peng Y, Raja K, Reeves RM, Rouhizadeh M, Shi J, Wang X, Wang Y, Wei WQ, Williams AE, Zhang R, Belenkaya R, Reich C, Blacketer C, Ryan P, Hripcsak G, Elhadad N, Xu H. Representing and utilizing clinical textual data for real world studies: An OHDSI approach. J Biomed Inform 2023; 142:104343. [PMID: 36935011 PMCID: PMC10428170 DOI: 10.1016/j.jbi.2023.104343] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/25/2022] [Revised: 01/21/2023] [Accepted: 03/13/2023] [Indexed: 03/19/2023]
Abstract
Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Informatics (OHDSI) consortium was established to develop methods and tools to promote the use of textual data and NLP in real-world observational studies. In this paper, we describe a framework for representing and utilizing textual data in real-world evidence generation, including representations of information from clinical text in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the workflow and tools that were developed to extract, transform and load (ETL) data from clinical notes into tables in OMOP CDM, as well as current applications and specific use cases of the proposed OHDSI NLP solution at large consortia and individual institutions with English textual data. Challenges faced and lessons learned during the process are also discussed to provide valuable insights for researchers who are planning to implement NLP solutions in real-world studies.
Affiliation(s)
- Vipina K Keloth
  - Section of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Juan M Banda
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- Michael Gurley
  - Lurie Cancer Center, Northwestern University, Chicago, Illinois, USA
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
- Georgina Kennedy
  - Ingham Institute for Applied Medical Research, Sydney, Australia
- Hongfang Liu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA
- Feifan Liu
  - Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Timothy Miller
  - Computational Health Informatics Program, Boston Children's Hospital, and Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Karthik Natarajan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
- Olga V Patterson
  - VA Informatics and Computing Infrastructure, Department of Veterans Affairs Salt Lake City Health Care System, Salt Lake City, Utah, USA; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, Utah, USA; Verily Life Sciences, Mountain View, CA, USA
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Kalpana Raja
  - Section of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Ruth M Reeves
  - TN Valley Healthcare System, U.S. Department of Veterans Affairs, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Masoud Rouhizadeh
  - Department of Pharmaceutical Outcomes & Policy, University of Florida, Gainesville, FL, USA; Biomedical Informatics and Data Science, Johns Hopkins University, Baltimore, MD, USA
- Jianlin Shi
  - VA Informatics and Computing Infrastructure, Department of Veterans Affairs Salt Lake City Health Care System, Salt Lake City, Utah, USA; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, Utah, USA; Department of Biomedical Informatics, University of Utah, Salt Lake City, USA
- Xiaoyan Wang
  - Sema4 Mount Sinai Genomics Incorporation, Stamford, CT, USA
- Yanshan Wang
  - Department of Health Information Management, Department of Biomedical Informatics, and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Rui Zhang
  - Institute for Health Informatics, and Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
- Clair Blacketer
  - Janssen Pharmaceutical Research and Development LLC, Titusville, NJ, USA; Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Patrick Ryan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA; Janssen Pharmaceutical Research and Development LLC, Titusville, NJ, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
- Noémie Elhadad
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA.
- Hua Xu
  - Section of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
6
Gao J, He S, Hu J, Chen G. A hybrid system to understand the relations between assessments and plans in progress notes. J Biomed Inform 2023; 141:104363. [PMID: 37054961 DOI: 10.1016/j.jbi.2023.104363] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/06/2023] [Revised: 04/05/2023] [Accepted: 04/07/2023] [Indexed: 04/15/2023]
Abstract
OBJECTIVE The paper presents a novel solution to the 2022 National NLP Clinical Challenges (n2c2) Track 3, which aims to predict the relations between assessment and plan subsections in progress notes. METHODS Our approach goes beyond standard transformer models and incorporates external information such as medical ontology and order information to comprehend the semantics of progress notes. We fine-tuned transformers to understand the textual data and incorporated medical ontology concepts and their relationships to enhance the model's accuracy. We also captured order information that regular transformers cannot by taking into account the position of the assessment and plan subsections in progress notes. RESULTS Our submission earned third place in the challenge phase with a macro-F1 score of 0.811. After refining our pipeline further, we achieved a macro-F1 of 0.826, outperforming the top-performing system during the challenge phase. CONCLUSION Our approach, which combines fine-tuned transformers, medical ontology, and order information, outperformed other systems in predicting the relationships between assessment and plan subsections in progress notes. This highlights the importance of incorporating external information beyond textual data in natural language processing (NLP) tasks related to medical documentation. Our work could potentially improve the efficiency and accuracy of progress note analysis.
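Macro-F1, the metric reported in this abstract, gives every class equal weight by averaging the per-class F1 scores. The following is a minimal sketch in plain Python; the class names and toy predictions are hypothetical placeholders, not the n2c2 data or the authors' code.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels for illustration only.
gold = ["class_a", "class_b", "class_c", "class_a"]
pred = ["class_a", "class_c", "class_c", "class_a"]
score = macro_f1(gold, pred)
```

Because each class contributes equally, a rare class that is always missed (here `class_b`) pulls the macro average down more than it would under a frequency-weighted average.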
Affiliation(s)
- Jifan Gao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 610 Walnut St, Madison, 53726, WI, USA
- Shilu He
  - Department of Mathematics, University of Wisconsin-Madison, 480 Lincoln Dr, Madison, 53706, WI, USA
- Junjie Hu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 610 Walnut St, Madison, 53726, WI, USA.
- Guanhua Chen
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 610 Walnut St, Madison, 53726, WI, USA.
7
Zhou S, Wang N, Wang L, Liu H, Zhang R. CancerBERT: a cancer domain-specific language model for extracting breast cancer phenotypes from electronic health records. J Am Med Inform Assoc 2022; 29:1208-1216. [PMID: 35333345 PMCID: PMC9196678 DOI: 10.1093/jamia/ocac040] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 12/15/2021] [Revised: 03/06/2022] [Accepted: 03/09/2022] [Indexed: 11/16/2022] Open
Abstract
OBJECTIVE Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models. MATERIALS AND METHODS A cancer-related corpus of breast cancer patients was extracted from the electronic health records of a local hospital. We annotated named entities in 200 pathology reports and 50 clinical notes for 8 cancer phenotypes for fine-tuning and evaluation. We kept pretraining the BlueBERT model on the cancer corpus with expanded vocabularies (using both term frequency-based and manually reviewed methods) to obtain CancerBERT models. The CancerBERT models were evaluated and compared with other baseline models on the cancer phenotype extraction task. RESULTS All CancerBERT models outperformed all other models on the cancer phenotyping NER task. Both CancerBERT models with customized vocabularies outperformed the CancerBERT with the original BERT vocabulary. The CancerBERT model with manually reviewed customized vocabulary achieved the best performance with macro F1 scores equal to 0.876 (95% CI, 0.873-0.879) and 0.904 (95% CI, 0.902-0.906) for exact match and lenient match, respectively. CONCLUSIONS The CancerBERT models were developed to extract the cancer phenotypes in clinical notes and pathology reports. The results validated that using customized vocabulary may further improve the performances of domain specific BERT models in clinical NLP tasks. The CancerBERT models developed in the study would further help clinical decision support.
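This abstract reports scores under both exact and lenient matching. The following is a minimal sketch of that distinction, assuming the common NER evaluation convention in which an exact match requires identical span boundaries and label, and a lenient match requires the same label and any character overlap. The paper's precise criteria may differ, and the `Span` type, label name, and offsets are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def exact_match(gold: Span, pred: Span) -> bool:
    """Same boundaries and same label."""
    return gold == pred


def lenient_match(gold: Span, pred: Span) -> bool:
    """Same label and at least one overlapping character."""
    return (
        gold.label == pred.label
        and gold.start < pred.end
        and pred.start < gold.end
    )


# Hypothetical example: the prediction covers only part of the gold entity.
gold = Span(10, 25, "TUMOR_SIZE")
pred = Span(10, 20, "TUMOR_SIZE")
print(exact_match(gold, pred), lenient_match(gold, pred))  # False True
```

Under this convention, lenient scores are never lower than exact scores, which is consistent with the 0.904 versus 0.876 macro F1 reported above.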
Affiliation(s)
- Sicheng Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Nan Wang
  - School of Statistics, University of Minnesota, Minneapolis, Minnesota, USA
- Liwei Wang
  - Department of AI and Informatics Research, Mayo Clinic, Rochester, Minnesota, USA
- Hongfang Liu
  - Department of AI and Informatics Research, Mayo Clinic, Rochester, Minnesota, USA
- Rui Zhang
  - Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
  - Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, Minnesota, USA
8
An Entity Relation Extraction Method Based on Dynamic Context and Multi-Feature Fusion. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12031532] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
A dynamic context selector is a masking mechanism that divides the matrix into regions and dynamically selects the information in a region as the model input. We apply dynamic context during training to improve entity relation extraction (ERE). Most existing models for joint extraction of entities and relations are based on static context, which suffers from missing features and results in poor performance. To address this problem, we propose a span-based joint extraction method based on dynamic context and multi-feature fusion (SPERT-DC). The context area is selected dynamically using a threshold in the feature selection layer of the model. We also use Bi-LSTM_ATT in the feature extraction layer to better handle longer text, and we enhance context information by combining it with entity tags in the feature fusion layer. The model outperforms prior work by up to 1% F1 score on the public dataset, which supports the effectiveness of dynamic context for ERE models.
9
Chopard D, Treder MS, Corcoran P, Ahmed N, Johnson C, Busse M, Spasic I. Text Mining of Adverse Events in Clinical Trials: Deep Learning Approach. JMIR Med Inform 2021; 9:e28632. [PMID: 34951601 PMCID: PMC8742206 DOI: 10.2196/28632] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/09/2021] [Revised: 08/01/2021] [Accepted: 11/14/2021] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND Pharmacovigilance and safety reporting, which involve processes for monitoring the use of medicines in clinical trials, play a critical role in the identification of previously unrecognized adverse events or changes in the patterns of adverse events. OBJECTIVE This study aims to demonstrate the feasibility of automating the coding of adverse events described in the narrative section of the serious adverse event report forms to enable statistical analysis of the aforementioned patterns. METHODS We used the Unified Medical Language System (UMLS) as the coding scheme, which integrates 217 source vocabularies, thus enabling coding against other relevant terminologies such as the International Classification of Diseases-10th Revision, Medical Dictionary for Regulatory Activities, and Systematized Nomenclature of Medicine. We used MetaMap, a highly configurable dictionary lookup software, to identify the mentions of the UMLS concepts. We trained a binary classifier using Bidirectional Encoder Representations from Transformers (BERT), a transformer-based language model that captures contextual relationships, to differentiate between mentions of the UMLS concepts that represented adverse events and those that did not. RESULTS The model achieved a high F1 score of 0.8080, despite the class imbalance. This is 10.15 percentage points lower than human-like performance but also 17.45 percentage points higher than that of the baseline approach. CONCLUSIONS These results confirmed that automated coding of adverse events described in the narrative section of serious adverse event reports is feasible. Once coded, adverse events can be statistically analyzed so that any correlations with the trialed medicines can be estimated in a timely fashion.
Affiliation(s)
- Daphne Chopard
  - School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom
- Matthias S Treder
  - School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom
- Padraig Corcoran
  - School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom
- Nagheen Ahmed
  - Centre for Trials Research, Cardiff University, Cardiff, United Kingdom
- Claire Johnson
  - Centre for Trials Research, Cardiff University, Cardiff, United Kingdom
- Monica Busse
  - Centre for Trials Research, Cardiff University, Cardiff, United Kingdom
- Irena Spasic
  - School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom