1
Le ND, Nguyen NTH. A metric learning-based method for biomedical entity linking. Front Res Metr Anal 2023; 8:1247094. PMID: 38173988; PMCID: PMC10762861; DOI: 10.3389/frma.2023.1247094. Received: 06/25/2023; Accepted: 11/29/2023; Indexed: 01/05/2024.
Abstract
Biomedical entity linking is the task of mapping a mention that occurs in a particular textual context to a unique concept or entity in a knowledge base, e.g., the Unified Medical Language System (UMLS). One of the most challenging aspects of the task is the ambiguity of mentions, i.e., (1) mentions whose surface forms are very similar but which map to different entities in different contexts, and (2) entities that can be expressed using diverse types of mentions. Recent studies have used BERT-based encoders to encode mentions and entities into distinguishable representations such that their similarity can be measured using distance metrics. However, most real-world biomedical datasets suffer from severe imbalance, i.e., some classes have many instances while others appear only once or are completely absent from the training data. A common way to address this issue is to down-sample the dataset, i.e., to reduce the number of instances of the majority classes to make the dataset more balanced. In the context of entity linking, however, down-sampling reduces the model's ability to comprehensively learn the representations of mentions in different contexts, which is very important. To tackle this issue, we propose a metric learning-based method that treats a given entity and its mentions as a whole, regardless of the number of mentions in the training set. Specifically, our method uses a triplet loss-based function in conjunction with a clustering technique to learn the representations of mentions and entities. Through evaluations on two challenging biomedical datasets, MedMentions and BC5CDR, we show that our proposed method addresses the issue of imbalanced data and performs competitively with other state-of-the-art models. Moreover, our method significantly reduces the computational cost of both training and inference. Our source code is publicly available here.
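The triplet loss-based objective mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the distance metric, margin, and vectors are toy assumptions chosen for clarity:

```python
import math

def euclidean(u, v):
    # Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge-style triplet loss: zero once the positive (gold entity)
    # is closer to the anchor (mention) than the negative entity
    # by at least `margin`
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

In the paper's setting the anchor would be a mention embedding and the positive/negative would be entity (or cluster) representations produced by a BERT-based encoder.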
Affiliation(s)
- Ngoc D. Le
- Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
- Nhung T. H. Nguyen
- Department of Computer Science, School of Engineering, University of Manchester, Manchester, United Kingdom
2
Dolatabadi E, Moyano D, Bales M, Spasojevic S, Bhambhoria R, Bhatti J, Debnath S, Hoell N, Li X, Leng C, Nanda S, Saab J, Sahak E, Sie F, Uppal S, Vadlamudi NK, Vladimirova A, Yakimovich A, Yang X, Kocak SA, Cheung AM. Using Social Media to Help Understand Patient-Reported Health Outcomes of Post-COVID-19 Condition: Natural Language Processing Approach. J Med Internet Res 2023; 25:e45767. PMID: 37725432; PMCID: PMC10510753; DOI: 10.2196/45767. Received: 01/16/2023; Revised: 05/18/2023; Accepted: 06/05/2023; Indexed: 09/21/2023.
Abstract
BACKGROUND: While scientific knowledge of post-COVID-19 condition (PCC) is growing, there remains significant uncertainty in the definition of the disease, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes as the content is produced at high resolution by patients and caregivers, representing experiences that may be unavailable to most clinicians.
OBJECTIVE: In this study, we aimed to determine the validity and effectiveness of advanced natural language processing approaches built to derive insight into PCC-related patient-reported health outcomes from the social media platforms Twitter and Reddit. We extracted PCC-related terms, including symptoms and conditions, and measured their occurrence frequency. We compared the outputs with human annotations and clinical outcomes and tracked symptom and condition term occurrences over time and locations to explore the pipeline's potential as a surveillance tool.
METHODS: We used bidirectional encoder representations from transformers (BERT) models to extract and normalize PCC symptom and condition terms from English posts on Twitter and Reddit. We compared 2 named entity recognition models and implemented a 2-step normalization task to map extracted terms to unique concepts in standardized terminology. The normalization steps were done using a semantic search approach with BERT biencoders. We evaluated the effectiveness of BERT models in extracting the terms using a human-annotated corpus and a proximity-based score. We also compared the validity and reliability of the extracted and normalized terms to a web-based survey with more than 3000 participants from several countries.
RESULTS: UmlsBERT-Clinical had the highest accuracy in predicting entities closest to those extracted by human annotators. Based on our findings, the top 3 most commonly occurring groups of PCC symptom and condition terms were systemic (such as fatigue), neuropsychiatric (such as anxiety and brain fog), and respiratory (such as shortness of breath). We also found novel symptom and condition terms that had not been categorized in previous studies, such as infection and pain. Regarding co-occurring symptoms, the pair of fatigue and headaches was among the most co-occurring term pairs across both platforms. Based on the temporal analysis, the neuropsychiatric terms were the most prevalent, followed by the systemic category, on both social media platforms. Our spatial analysis concluded that 42% (10,938/26,247) of the analyzed terms included location information, with the majority coming from the United States, United Kingdom, and Canada.
CONCLUSIONS: The outcome of our social media-derived pipeline is comparable with the results of peer-reviewed articles relevant to PCC symptoms. Overall, this study provides unique insights into patient-reported health outcomes of PCC and valuable information about the patient's journey that can help health care providers anticipate future needs.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1101/2022.12.14.22283419.
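The semantic-search normalization step this abstract describes (nearest concept by embedding similarity) can be sketched minimally. The toy vectors stand in for BERT biencoder outputs, and the concept labels are hypothetical:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize_term(term_vec, concept_index):
    # Return the concept whose embedding is most similar to the
    # extracted term's embedding (one step of semantic search)
    return max(concept_index, key=lambda c: cosine(term_vec, concept_index[c]))
```

A real pipeline would index many thousands of concept embeddings and typically use an approximate-nearest-neighbor library rather than a linear scan.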
Affiliation(s)
- Elham Dolatabadi
- Faculty of Health, School of Health Policy and Management, York University, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
- Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- Rohan Bhambhoria
- Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada
- Xin Li
- Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- Jad Saab
- TELUS Health, Montreal, QC, Canada
- Esmat Sahak
- Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- Fanny Sie
- Hoffmann-La Roche Ltd, Toronto, ON, Canada
- Nirma Khatri Vadlamudi
- Department of Pediatrics, Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
- Angela M Cheung
- Department of Medicine and Joint Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- University Health Network, Toronto, ON, Canada
3
Mullin S, McDougal R, Cheung KH, Kilicoglu H, Beck A, Zeiss CJ. Chemical Entity Normalization for Successful Translational Development of Alzheimer's Disease and Dementia Therapeutics. Res Sq 2023:rs.3.rs-2547912. PMID: 36824778; PMCID: PMC9949240; DOI: 10.21203/rs.3.rs-2547912/v1. Indexed: 02/18/2023.
Abstract
Background: Identifying chemical mentions within the Alzheimer's and dementia literature can provide a powerful tool to further therapeutic research. Leveraging the Chemical Entities of Biological Interest (ChEBI) ontology, which is rich in hierarchical and other relationship types, for entity normalization can provide an advantage for future downstream applications. We provide a reproducible hybrid approach that combines an ontology-enhanced PubMedBERT model for disambiguation with a dictionary-based method for candidate selection.
Results: There were 56,553 chemical mentions in the titles of 44,812 unique PubMed article abstracts. Based on our gold standard, our method of disambiguation improved entity normalization by 25.3 percentage points compared to using only the dictionary-based approach with fuzzy-string matching for disambiguation. For our Alzheimer's and dementia cohort, we were able to add 47.1% more potential mappings between MeSH and ChEBI when compared to BioPortal.
Conclusion: Use of natural language models like PubMedBERT and resources such as ChEBI and PubChem provides a beneficial way to link entity mentions to ontology terms, while further supporting downstream tasks like filtering ChEBI mentions based on roles and assertions to find beneficial therapies for Alzheimer's and dementia.
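The dictionary-based candidate selection with fuzzy-string matching described above can be sketched with the standard library's difflib. The lexicon here is a toy stand-in for a ChEBI term dictionary, and the cutoff is an assumed value:

```python
import difflib

def select_candidates(mention, lexicon, n=5, cutoff=0.8):
    # Rank dictionary entries by string similarity to the surface
    # mention; a disambiguation model (e.g., an ontology-enhanced
    # PubMedBERT) would then choose among these candidates.
    return difflib.get_close_matches(mention.lower(), lexicon, n=n, cutoff=cutoff)
```

Lowercasing is a simplification; real chemical dictionaries need case- and punctuation-aware normalization (e.g., for formulas and salts).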
Affiliation(s)
- Sarah Mullin
- Yale University School of Medicine, New Haven, CT, USA
- Amanda Beck
- Marine Ecology Department, Institute of Marine Sciences Kiel, Bronx, NY, USA
4
Cuffy C, French E, Fehrmann S, McInnes BT. Exploring Representations for Singular and Multi-Concept Relations for Biomedical Named Entity Normalization. Proc Int World Wide Web Conf 2022; 2022:823-832. PMID: 37465200; PMCID: PMC10353314; DOI: 10.1145/3487553.3524701. Indexed: 07/20/2023.
Abstract
Since the rise of the COVID-19 pandemic, peer-reviewed biomedical repositories have experienced a surge in chemical- and disease-related queries. These queries use a wide variety of naming conventions and nomenclatures, from trademark and generic names to chemical composition mentions. Normalizing or disambiguating these mentions within texts provides researchers and data curators with more relevant articles returned by their search queries. Named entity normalization aims to automate this disambiguation process by linking entity mentions to their appropriate candidate concepts within a biomedical knowledge base or ontology. We explore several term embedding aggregation techniques, in addition to how a term's context affects evaluation performance. We also evaluate our embedding approaches for normalizing term instances containing one or many relations within unstructured texts.
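One of the simplest term embedding aggregation techniques alluded to in this abstract is mean pooling over a term's token embeddings. This sketch is illustrative only, with plain lists standing in for model output vectors:

```python
def mean_pool(vectors):
    # Aggregate a list of equal-length token embeddings into a single
    # term embedding by element-wise averaging
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

Alternatives explored in this line of work include summation, max pooling, or using only the first subword's embedding; the choice interacts with how much surrounding context the encoder sees.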
Affiliation(s)
- Clint Cuffy
- Virginia Commonwealth University, Richmond, Virginia, USA
- Evan French
- Virginia Commonwealth University, Richmond, Virginia, USA
5
Noh J, Kavuluru R. Joint Learning for Biomedical NER and Entity Normalization: Encoding Schemes, Counterfactual Examples, and Zero-Shot Evaluation. ACM BCB 2021; 2021. PMID: 34505115; DOI: 10.1145/3459930.3469533. Indexed: 10/20/2022.
Abstract
Named entity recognition (NER) and normalization (EN) form an indispensable first step to many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this effort we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). The second strategy is to use additional counterfactual training examples to handle the issue of models learning spurious correlations between surrounding context and normalized concepts in training data. We conduct elaborate experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that our first strategy performs better in entity normalization when compared with the standard coding scheme. The second, data augmentation, strategy uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are more prominent when evaluating in zero-shot settings, for concepts that have never been encountered during training.
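The first strategy, decoupling entity encoding tags, can be sketched directly from the abstract's own examples. A minimal illustration, not the authors' code:

```python
def decouple_tag(tag):
    # Split an IOB-style tag such as "B-Drug" into its positional
    # part ("B") and its type part ("Drug"); "O" carries no type.
    if tag == "O":
        return ("O", None)
    position, entity_type = tag.split("-", 1)
    return (position, entity_type)
```

Under this scheme a model predicts the positional and type labels separately, so the output spaces are much smaller than the full cross-product of combined tags.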
Affiliation(s)
- Jiho Noh
- Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Ramakanth Kavuluru
- Division of Biomedical Informatics (Internal Medicine), University of Kentucky, Lexington, Kentucky, USA
6
Ferré A, Ba M, Bossy R. Improving the CONTES method for normalizing biomedical text entities with concepts from an ontology with (almost) no training data. Genomics Inform 2019; 17:e20. PMID: 31307135; PMCID: PMC6808633; DOI: 10.5808/gi.2019.17.2.e20. Received: 03/14/2019; Accepted: 05/31/2019; Indexed: 11/20/2022.
Abstract
Entity normalization, or entity linking in the general domain, is an information extraction task that aims to annotate/bind multiple words/expressions in raw text with semantic references, such as concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy of terms, which captures knowledge of a domain. Presently, machine-learning methods, often coupled with distributional representations, achieve good performance. However, these require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has some limitations: it does not scale well with very large ontologies, it tends to overgeneralize predictions, and it lacks valid representations for out-of-vocabulary words. Here, we assess different methods to reduce the dimensionality of the ontology representation. We also propose to calibrate parameters in order to make the predictions more accurate, and to address the problem of out-of-vocabulary words with a specific method.
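Methods in this line of work represent a concept partly through its position in the ontology hierarchy, which is what makes the representation grow with ontology size. A simplified sketch of collecting a concept's ancestor closure for such a representation; the toy hierarchy and the `parents` mapping are hypothetical:

```python
def ancestor_closure(concept, parents):
    # Collect the concept and all of its ancestors by walking the
    # `parents` mapping (child -> list of direct parents); the result
    # can back a (high-dimensional) one-hot concept representation.
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen
```

Because the closure vector has one dimension per ontology concept, dimensionality reduction of this space is exactly the scaling concern the abstract raises.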
Affiliation(s)
- Arnaud Ferré
- MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France
- LIMSI, CNRS, Paris-Saclay University, 91405 Orsay, France
- Mouhamadou Ba
- MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France
- Robert Bossy
- MaIAGE, INRA, Paris-Saclay University, 78350 Jouy-en-Josas, France