1
Lambert J, Leutenegger AL, Baudot A, Jannot AS. Improving patient clustering by incorporating structured variable label relationships in similarity measures. BMC Med Res Methodol 2025; 25:72. [PMID: 40089699 PMCID: PMC11910865 DOI: 10.1186/s12874-025-02459-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Accepted: 01/03/2025] [Indexed: 03/17/2025]
Abstract
BACKGROUND: Patient stratification is the cornerstone of numerous health investigations, serving to enhance the estimation of treatment efficacy and to facilitate patient matching. To stratify patients, similarity measures between patients can be computed from clinical variables contained in medical health records. These variables have both values and labels structured in ontologies or other classification systems. The relevance of considering variable label relationships in the computation of patient similarity measures has been poorly studied. OBJECTIVE: We adapt and evaluate several weighted versions of the Cosine similarity in order to consider structured label relationships when computing patient similarities from a medico-administrative database. MATERIALS AND METHODS: As a use case, we clustered patients aged 60 years from their annual medicine reimbursements contained in the Échantillon Généraliste des Bénéficiaires, a random sample of a French medico-administrative database. We used four patient similarity measures: the standard Cosine similarity, a weighted Cosine similarity measure that includes variable frequencies, and two weighted Cosine similarity measures that consider variable label relationships. We constructed patient networks from each similarity measure and identified clusters of patients using the Markov Cluster algorithm. We evaluated the performance of the different similarity measures with enrichment tests based on patient diagnoses. RESULTS: The weighted similarity measures that include structured variable label relationships perform better at identifying similar patients. Indeed, using these weighted measures, we identify more clusters associated with different diagnosis enrichments. Importantly, the enrichment tests provide clinically interpretable insights into these patient clusters. CONCLUSION: Considering label relationships when computing patient similarities improves stratification of patients regarding their health status.
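As an illustration of the weighted Cosine similarity mentioned in this abstract, the following is a minimal sketch, assuming one non-negative weight per variable. The function name, the patient vectors, and the weight values are hypothetical; the abstract does not say how the authors derive weights from variable frequencies or label relationships, so none of that is reproduced here.

```python
import math


def weighted_cosine(x, y, w):
    """Cosine similarity in which variable i contributes with weight w[i]."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    norm_x = math.sqrt(sum(wi * xi * xi for xi, wi in zip(x, w)))
    norm_y = math.sqrt(sum(wi * yi * yi for yi, wi in zip(y, w)))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero profiles
    return num / (norm_x * norm_y)


# Hypothetical reimbursement counts for three drug classes.
patient_a = [2, 0, 1]
patient_b = [1, 0, 3]

uniform = [1.0, 1.0, 1.0]            # reduces to the standard cosine
downweight_first = [0.1, 1.0, 1.0]   # assumed weights, for illustration only

standard = weighted_cosine(patient_a, patient_b, uniform)
weighted = weighted_cosine(patient_a, patient_b, downweight_first)
print(f"standard={standard:.4f} weighted={weighted:.4f}")
```

With uniform weights the function is the ordinary cosine similarity, which makes the unweighted measure a special case and gives a convenient baseline when comparing weighting schemes.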
Affiliation(s)
- Judith Lambert
  - Sorbonne Université, Université Paris Cité, INSERM, Centre de Recherche des Cordeliers, Paris, F-75006, France
  - HeKA, Inria Paris, Paris, F-75015, France
  - Aix Marseille Univ, INSERM, MMG, Marseille, UMR1251, France
- Anaïs Baudot
  - Aix Marseille Univ, INSERM, MMG, Marseille, UMR1251, France
  - CNRS, Marseille, France
  - Barcelona Supercomputing Center, Barcelona, Spain
- Anne-Sophie Jannot
  - HeKA, Inria Paris, Paris, F-75015, France
  - Université Paris Cité, Sorbonne Université, INSERM, Centre de Recherche des Cordeliers, F-75006, Paris, France
  - French National Rare Disease Registry (BNDMR), Greater Paris University Hospitals (AP-HP), Paris, France
2
Leist IC, Rivas-Torrubia M, Alarcón-Riquelme ME, Barturen G, Consortium PC, Gut IG, Rueda M. Pheno-Ranker: a toolkit for comparison of phenotypic data stored in GA4GH standards and beyond. BMC Bioinformatics 2024; 25:373. [PMID: 39633268 PMCID: PMC11616229 DOI: 10.1186/s12859-024-05993-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 11/19/2024] [Indexed: 12/07/2024]
Abstract
BACKGROUND: Phenotypic data comparison is essential for disease association studies, patient stratification, and genotype-phenotype correlation analysis. To support these efforts, the Global Alliance for Genomics and Health (GA4GH) established Phenopackets v2 and Beacon v2 standards for storing, sharing, and discovering genomic and phenotypic data. These standards provide a consistent framework for organizing biological data, simplifying their transformation into computer-friendly formats. However, matching participants using GA4GH-based formats remains challenging, as current methods are not fully compatible, limiting their effectiveness. RESULTS: Here, we introduce Pheno-Ranker, an open-source software toolkit for individual-level comparison of phenotypic data. As input, it accepts JSON/YAML data exchange formats from Beacon v2 and Phenopackets v2 data models, as well as any data structure encoded in JSON, YAML, or CSV formats. Internally, the hierarchical data structure is flattened to one dimension and then transformed through one-hot encoding. This allows for efficient pairwise (all-to-all) comparisons within cohorts or for matching of a patient's profile in cohorts. Users have the flexibility to refine their comparisons by including or excluding terms, applying weights to variables, and obtaining statistical significance through Z-scores and p-values. The output consists of text files, which can be further analyzed using unsupervised learning techniques, such as clustering or multidimensional scaling (MDS), and with graph analytics. Pheno-Ranker's performance has been validated with simulated and synthetic data, showing its accuracy, robustness, and efficiency across various health data scenarios. A real data use case from the PRECISESADS study highlights its practical utility in clinical research.
CONCLUSIONS: Pheno-Ranker is a user-friendly, lightweight software for semantic similarity analysis of phenotypic data in Beacon v2 and Phenopackets v2 formats, extendable to other data types. It enables the comparison of a wide range of variables beyond HPO or OMIM terms while preserving full context. The software is designed as a command-line tool with additional utilities for CSV import, data simulation, summary statistics plotting, and QR code generation. For interactive analysis, it also includes a web-based user interface built with R Shiny. Links to the online documentation, including a Google Colab tutorial, and the tool's source code are available on the project home page: https://github.com/CNAG-Biomedical-Informatics/pheno-ranker .
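The flatten-then-one-hot step described in this abstract can be sketched roughly as follows. This is not Pheno-Ranker's code or API: the helper names, the simplified record layout, and the choice of Hamming distance for the pairwise comparison are all assumptions made for illustration, and the records are toy data rather than valid Phenopackets.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts/lists into a set of 'path:value' strings.

    List positions are ignored, so element order does not matter.
    """
    items = set()
    if isinstance(record, dict):
        for key, value in record.items():
            items |= flatten(value, f"{prefix}{key}.")
    elif isinstance(record, list):
        for value in record:
            items |= flatten(value, prefix)
    else:
        items.add(f"{prefix.rstrip('.')}:{record}")
    return items


def one_hot(records):
    """Return the sorted vocabulary and one binary vector per record."""
    flat = [flatten(r) for r in records]
    vocabulary = sorted(set().union(*flat))
    vectors = [[1 if term in f else 0 for term in vocabulary] for f in flat]
    return vocabulary, vectors


def hamming(u, v):
    """Number of positions at which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Toy records with a simplified, hypothetical layout.
cohort = [
    {"sex": "female", "phenotypicFeatures": [{"type": "HP:0001250"}, {"type": "HP:0000118"}]},
    {"sex": "female", "phenotypicFeatures": [{"type": "HP:0001250"}]},
    {"sex": "male", "phenotypicFeatures": [{"type": "HP:0004322"}]},
]

vocabulary, vectors = one_hot(cohort)
# All-to-all comparison matrix.
distances = [[hamming(u, v) for v in vectors] for u in vectors]
print(distances)
```

Encoding each path and value pair as its own binary column keeps the comparison independent of the nesting depth of the input, at the cost of treating every distinct value as unrelated to every other.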
Affiliation(s)
- Ivo C Leist
  - Centro Nacional de Análisis Genómico, C/Baldiri Reixac 4, 08028, Barcelona, Spain
  - Universitat de Barcelona (UB), Barcelona, Spain
- María Rivas-Torrubia
  - Pfizer-University of Granada-Junta de Andalucía Centre for Genomics and Oncological Research, Granada, Spain
- Marta E Alarcón-Riquelme
  - Pfizer-University of Granada-Junta de Andalucía Centre for Genomics and Oncological Research, Granada, Spain
  - Institute of Environmental Medicine, Karolinska Institute, Stockholm, Sweden
- Guillermo Barturen
  - Pfizer-University of Granada-Junta de Andalucía Centre for Genomics and Oncological Research, Granada, Spain
  - Department of Genetics, Faculty of Science, University of Granada, 18071, Granada, Spain
  - Bioinformatics Laboratory, Centro de Investigación Biomédica, Biotechnology Institute, PTS, Avda del Conocimiento S/N, 18100, Granada, Spain
- Ivo G Gut
  - Centro Nacional de Análisis Genómico, C/Baldiri Reixac 4, 08028, Barcelona, Spain
  - Universitat de Barcelona (UB), Barcelona, Spain
- Manuel Rueda
  - Centro Nacional de Análisis Genómico, C/Baldiri Reixac 4, 08028, Barcelona, Spain
  - Universitat de Barcelona (UB), Barcelona, Spain
3
Abbasi OR, Alesheikh AA, Lotfata A. Semantic similarity is not enough: A novel NLP-based semantic similarity measure in geospatial context. iScience 2024; 27:109883. [PMID: 38974474 PMCID: PMC11225810 DOI: 10.1016/j.isci.2024.109883] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2023] [Revised: 01/20/2024] [Accepted: 04/30/2024] [Indexed: 07/09/2024]
Abstract
In this study, we addressed two primary challenges: firstly, the issue of domain shift, which pertains to changes in data characteristics or context that can impact model performance, and secondly, the discrepancy between semantic similarity and geographical distance. We employed topic modeling in conjunction with the BERT architecture. Our model was crafted to enhance similarity computations applied to geospatial text, aiming to integrate both semantic similarity and geographical proximity. We tested the model on two datasets, Persian Wikipedia articles and rental property advertisements. The findings demonstrate that the model effectively improved the correlation between semantic similarity and geographical distance. Furthermore, evaluation by real-world users within a recommender system context revealed a notable increase in user satisfaction by approximately 22% for Wikipedia articles and 56% for advertisements.
Affiliation(s)
- Omid Reza Abbasi
  - Department of Geospatial Information Systems, K. N. Toosi University of Technology, Tehran, Iran
- Ali Asghar Alesheikh
  - Department of Geospatial Information Systems, K. N. Toosi University of Technology, Tehran, Iran
- Aynaz Lotfata
  - Department of Pathology, Microbiology, and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, USA
4
Zahra FA, Kate RJ. Obtaining clinical term embeddings from SNOMED CT ontology. J Biomed Inform 2024; 149:104560. [PMID: 38070816 DOI: 10.1016/j.jbi.2023.104560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 11/29/2023] [Accepted: 12/05/2023] [Indexed: 01/22/2024]
Abstract
Clinical term embeddings are traditionally obtained using corpus-based methods; however, these methods cannot incorporate knowledge about clinical terms that is already present in medical ontologies. On the other hand, graph-based methods can obtain embeddings of clinical concepts from ontologies, but they cannot obtain embeddings for clinical terms and words. In this paper, a novel method is presented to obtain embeddings for clinical terms and words from the SNOMED CT ontology. The method first obtains embeddings of clinical concepts from SNOMED CT using a graph-based method. Next, these concept embeddings are used as targets to train a deep learning model to map clinical terms to concept embeddings. The learned model then provides embeddings for clinical terms and words and also maps novel clinical terms to their embeddings. The embeddings obtained using the method outperformed corpus-based embeddings on the task of predicting clinical term similarity on five benchmark datasets. On the clinical term normalization task, using these embeddings simply as a means of computing similarity between clinical terms achieved accuracy competitive with methods trained specifically for this task. Both corpus-based and ontology-based embeddings share the limitation that they tend to learn similar embeddings for opposite or analogous terms. To counter this, we also introduce a method to automatically learn patterns that indicate when two clinical terms represent the same concept and when they represent different concepts. Supplementing the normalization process with these patterns showed improvement. Although clinical term embeddings obtained from SNOMED CT incorporate ontological knowledge that is missed by corpus-based embeddings, they do not incorporate the linguistic knowledge needed for sentence-based tasks. Hence, combining ontology-based embeddings with corpus-based embeddings is an avenue for future work.
Affiliation(s)
- Fuad Abu Zahra
  - Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA
- Rohit J Kate
  - Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, USA
5
Biggers FB, Mohanty SD, Manda P. A deep semantic matching approach for identifying relevant messages for social media analysis. Sci Rep 2023; 13:12005. [PMID: 37491443 PMCID: PMC10368660 DOI: 10.1038/s41598-023-38761-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 07/14/2023] [Indexed: 07/27/2023]
Abstract
There is a growing interest in using social media content for Natural Language Processing applications. However, it is not easy to computationally identify the most relevant set of tweets related to any specific event. Challenging semantics, coupled with the different ways of using natural language in social media, make it difficult to retrieve the most relevant set of data from any social media outlet. This paper seeks to demonstrate a way to present the changing semantics of Twitter within the context of a crisis event, specifically tweets during Hurricane Irma. These methods can be used to identify the most relevant corpus of text for analysis of a specific incident such as a hurricane. Using an implementation of the Word2Vec method of neural network training to create word embeddings, this paper will: discuss how the relative meaning of words changes as events unfold; present a mechanism for scoring tweets based upon dynamic, relative context relatedness; and show that similarity between words is not necessarily static. We present different methods for training the vector model in Word2Vec for identification of the most relevant tweets for any search query. The impact of tuning parameters such as Word Window Size, Minimum Word Frequency, Hidden Layer Dimensionality, and Negative Sampling on model performance was explored. The window containing the local maximum for AU_ROC for each parameter serves as a guide for other studies using the methods presented here for social media data analysis.
Affiliation(s)
- Frederick Brown Biggers
  - Artificial Intelligence and Natural Language Processing, United Health Group, Raleigh, NC, USA
- Somya D Mohanty
  - Electronic Resources and Information Technology, University of North Carolina at Greensboro, Greensboro, NC, USA
- Prashanti Manda
  - Informatics and Analytics, University of North Carolina at Greensboro, Greensboro, NC, USA
6
Giancani S, Albertoni R, Catalano CE. Quality of word and concept embeddings in targetted biomedical domains. Heliyon 2023; 9:e16818. [PMID: 37332929 PMCID: PMC10272317 DOI: 10.1016/j.heliyon.2023.e16818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 05/29/2023] [Accepted: 05/30/2023] [Indexed: 06/20/2023]
Abstract
Embeddings are fundamental resources often reused for building intelligent systems in the biomedical context. As a result, evaluating the quality of previously trained embeddings and ensuring they cover the desired information is critical for the success of applications. This paper proposes a new evaluation methodology to test the coverage of embeddings against a targeted domain of interest. It defines measures to assess the terminology, similarity, and analogy coverage, which are core aspects of the embeddings. Then, it discusses the experimentation carried out on existing biomedical embeddings in the specific context of pulmonary diseases. The proposed methodology and measures are general and may be applied to any application domain.
Affiliation(s)
- Salvatore Giancani
  - Institut de Neurosciences de la Timone, Unité Mixte de Recherche 7289 Centre National de la Recherche Scientifique and Aix-Marseille Université, Faculty of Medicine, 27, Boulevard Jean Moulin, 13385 Marseille Cedex 05, France
  - Istituto di Matematica Applicata e Tecnologie Informatiche, Consiglio Nazionale delle Ricerche, Via De Marini 16, 16149 Genova, Italy
- Riccardo Albertoni
  - Istituto di Matematica Applicata e Tecnologie Informatiche, Consiglio Nazionale delle Ricerche, Via De Marini 16, 16149 Genova, Italy
- Chiara Eva Catalano
  - Istituto di Matematica Applicata e Tecnologie Informatiche, Consiglio Nazionale delle Ricerche, Via De Marini 16, 16149 Genova, Italy
7
Kartheeswaran KP, Rayan AXA, Varrieth GT. Enhanced disease-disease association with information enriched disease representation. Math Biosci Eng 2023; 20:8892-8932. [PMID: 37161227 DOI: 10.3934/mbe.2023391] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/11/2023]
Abstract
OBJECTIVE: Quantification of disease-disease association (DDA) enables the understanding of disease relationships for discovering disease progression and finding comorbidity. For effective DDA strength calculation, the main challenge is to integrate the various biomedical aspects of DDA to obtain an information-rich disease representation. MATERIALS AND METHODS: An enhanced and integrated DDA framework is developed that integrates an enriched literature-based DDA representation with a concept-based one. The literature component of the proposed framework uses PubMed abstracts and consists of an improved neural network model that classifies DDAs for an enhanced literature-based DDA representation. Similarly, an ontology-based joint multi-source association embedding model is proposed in the ontology component using Disease Ontology (DO), UMLS, insurance claims, clinical notes, etc. RESULTS AND DISCUSSION: The obtained information-rich disease representation is evaluated on different aspects of DDA datasets such as Gene, Variant, Gene Ontology (GO) and a human-rated benchmark dataset. The DDA scores calculated using the proposed method achieved a high correlation, mainly on the gene-based dataset. The quantified scores also showed a better correlation of 0.821 when evaluated on 213 human-rated disease pairs. In addition, the generated disease representation is shown to have a substantial effect on the correlation of DDA scores for different categories of disease pairs. CONCLUSION: The enhanced context and semantic DDA framework provides an enriched disease representation, resulting in highly correlated results with different DDA datasets. We have also presented the biological interpretation of disease pairs. The developed framework can also be used for deriving the strength of other biomedical associations.
8
Babalou S, Algergawy A, König-Ries B. SimBio: Adopting Particle Swarm Optimization for ontology-based biomedical term similarity assessment. Data Knowl Eng 2023. [DOI: 10.1016/j.datak.2022.102137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]
9
Martinez-Gil J. A comprehensive review of stacking methods for semantic similarity measurement. Machine Learning with Applications 2022. [DOI: 10.1016/j.mlwa.2022.100423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
10
Junaid SB, Imam AA, Balogun AO, De Silva LC, Surakat YA, Kumar G, Abdulkarim M, Shuaibu AN, Garba A, Sahalu Y, Mohammed A, Mohammed TY, Abdulkadir BA, Abba AA, Kakumi NAI, Mahamad S. Recent Advancements in Emerging Technologies for Healthcare Management Systems: A Survey. Healthcare (Basel) 2022; 10:1940. [PMID: 36292387 PMCID: PMC9601636 DOI: 10.3390/healthcare10101940] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 09/01/2022] [Revised: 09/26/2022] [Accepted: 09/28/2022] [Indexed: 11/16/2022]
Abstract
In recent times, the growth of the Internet of Things (IoT), artificial intelligence (AI), and Blockchain technologies has quickly gained pace as a new study niche in numerous academic and industrial sectors, notably in the healthcare sector. Recent advancements in healthcare delivery have given many patients access to advanced personalized healthcare, which has improved their well-being. The subsequent phase in healthcare is to seamlessly consolidate these emerging technologies, such as IoT-assisted wearable sensor devices, AI, and Blockchain. Owing to the rapid uptake of smart wearable sensors, IoT and AI-enabled technology are shifting healthcare from a conventional hub-based system to a more personalized healthcare management system (HMS). However, implementing smart sensors, advanced IoT, AI, and Blockchain technologies synchronously in HMS remains a significant challenge. Prominent and recurring issues such as the scarcity of cost-effective and accurate smart medical sensors, unstandardized IoT system architectures, heterogeneity of connected wearable devices, the multidimensionality of the data generated, and high demand for interoperability are clear problems affecting the advancement of HMS. Hence, this survey paper presents a detailed evaluation of the application of these emerging technologies (Smart Sensor, IoT, AI, Blockchain) in HMS to better understand the progress thus far. Specifically, current studies and findings on the deployment of these emerging technologies in healthcare are investigated, as well as key enabling factors, noteworthy use cases, and successful deployments. This survey also examines essential issues that are frequently encountered by IoT-assisted wearable sensor systems, AI, and Blockchain, as well as the critical concerns that must be addressed to enhance the application of these emerging technologies in the HMS.
Affiliation(s)
- Abdullahi Abubakar Imam
  - School of Digital Science, Universiti Brunei Darussalam, Brunei Darussalam, Jalan Tungku Link, Gadong BE1410, Brunei
- Abdullateef Oluwagbemiga Balogun
  - Department of Computer Science, University of Ilorin, Ilorin 1515, Nigeria
  - Department of Computer and Information Science, Universiti Teknologi PETRONAS, Sri Iskandar 32610, Malaysia
- Ganesh Kumar
  - Department of Computer and Information Science, Universiti Teknologi PETRONAS, Sri Iskandar 32610, Malaysia
- Muhammad Abdulkarim
  - Department of Computer Science, Ahmadu Bello University, Zaria 810211, Nigeria
- Aliyu Nuhu Shuaibu
  - Department of Electrical Engineering, University of Jos, Bauchi Road, Jos 930105, Nigeria
- Aliyu Garba
  - Department of Computer Science, Ahmadu Bello University, Zaria 810211, Nigeria
- Yusra Sahalu
  - SEHA Abu Dhabi Health Services Co., Abu Dhabi 109090, United Arab Emirates
- Abdullahi Mohammed
  - Department of Computer Science, Ahmadu Bello University, Zaria 810211, Nigeria
- Nana Aliyu Iliyasu Kakumi
  - Patient Care Department, General Ward, Saudi German Hospital Cairo, Taha Hussein Rd, Huckstep, El Nozha, Cairo Governorate 4473303, Egypt
- Saipunidzam Mahamad
  - Department of Computer and Information Science, Universiti Teknologi PETRONAS, Sri Iskandar 32610, Malaysia
11
Chiu C, Villena F, Martin K, Núñez F, Besa C, Dunstan J. Training and intrinsic evaluation of lightweight word embeddings for the clinical domain in Spanish. Front Artif Intell 2022; 5:970517. [PMID: 36213168 PMCID: PMC9533099 DOI: 10.3389/frai.2022.970517] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/16/2022] [Accepted: 08/25/2022] [Indexed: 11/17/2022]
Abstract
Resources for Natural Language Processing (NLP) are less numerous for languages other than English. In the clinical domain, where these resources are vital for obtaining new knowledge about human health and diseases, creating new resources for the Spanish language is imperative. One of the most common approaches in NLP is word embeddings, which are dense vector representations of a word that take the word's context into account. This vector representation is usually the first step in various NLP tasks, such as text classification or information extraction. Therefore, in order to enrich Spanish language NLP tools, we built a Spanish clinical corpus from waiting list diagnostic suspicions, a biomedical corpus from medical journals, and term sequences sampled from the Unified Medical Language System (UMLS). These three corpora can be used to compute word embedding models from scratch using the Word2vec and fastText algorithms. Furthermore, to validate the quality of the calculated embeddings, we adapted several evaluation datasets in English, including some tests that, to the best of our knowledge, have not been used in Spanish. These translations were validated by two bilingual clinicians following an ad hoc validation standard for the translation. Even though contextualized word embeddings nowadays receive enormous attention, their calculation and deployment require specialized hardware and giant training corpora. Our static embeddings can be used in clinical applications with limited computational resources. The validation of the intrinsic test we present here can help groups working on static and contextualized word embeddings. We are releasing the training corpus and the embeddings within this publication.
Affiliation(s)
- Carolina Chiu
  - Department of Mathematical Engineering, FCFM, Universidad de Chile, Santiago, Chile
- Fabián Villena
  - Center for Mathematical Modeling & CNRS IRL2807, FCFM, Universidad de Chile, Santiago, Chile
  - Department of Computer Sciences, FCFM, University of Chile, Santiago, Chile
- Kinan Martin
  - Department of Computer Sciences, Massachusetts Institute of Technology, Cambridge, MA, United States
- Fredy Núñez
  - Center for Mathematical Modeling & CNRS IRL2807, FCFM, Universidad de Chile, Santiago, Chile
  - Department of Language Sciences, Pontificia Universidad Católica de Chile, Santiago, Chile
- Cecilia Besa
  - Department of Radiology, School of Medicine, Pontificia Universidad Católica de Chile, Santiago, Chile
  - Millenium Institute for Intelligent Healthcare Engineering, ANID, Santiago, Chile
- Jocelyn Dunstan
  - Center for Mathematical Modeling & CNRS IRL2807, FCFM, Universidad de Chile, Santiago, Chile
  - Millenium Institute for Intelligent Healthcare Engineering, ANID, Santiago, Chile
  - Initiative for Data & Artificial Intelligence, FCFM, University of Chile, Santiago, Chile
  - *Correspondence: Jocelyn Dunstan
12
Chang E. A vector-based semantic relatedness measure using multiple relations within SNOMED CT and UMLS. J Biomed Inform 2022; 131:104118. [PMID: 35690349 DOI: 10.1016/j.jbi.2022.104118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2022] [Revised: 05/26/2022] [Accepted: 06/05/2022] [Indexed: 11/27/2022]
Abstract
OBJECTIVE: To propose a new vector-based relatedness metric that derives word vectors from the intrinsic structure of biomedical ontologies, without consulting external resources such as large-scale biomedical corpora. MATERIALS AND METHODS: SNOMED CT on the mapping layer of UMLS was used as a testbed ontology. Vectors were created for every concept at the end of all semantic relations (attribute-value relations and descendants as well as the is_a relation) of the defining concept. The cosine similarity between the averages of those vectors with respect to each defining concept was computed to produce a final semantic relatedness. RESULTS: Two benchmark sets that include a total of 62 biomedical term pairs were used for evaluation. Spearman's rank coefficient of the current method was 0.655, 0.744, and 0.742 with the relatedness rated by physicians, coders, and medical experts, respectively. The proposed method was comparable to a word-embedding method and outperformed path-based, information content-based, and another multiple-relation-based relatedness metric. DISCUSSION: The current study demonstrated that the addition of attribute relations to the is_a hierarchy of SNOMED CT better conforms to the human sense of relatedness than models based on taxonomic relations. The current approach also showed that it is robust to the design inconsistency of ontologies. CONCLUSION: Unlike the previous vector-based approach, the current study exploited the intrinsic semantic structure of an ontology, precluding the need for external textual resources to obtain context information of defining terms. Future research is recommended to prove the validity of the current method with other biomedical ontologies.
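The final step described in this abstract, taking the cosine similarity between the averages of the vectors attached to each defining concept, can be sketched as below. The three-dimensional vectors and the term names are hypothetical toy values; how the study builds the per-concept vectors from SNOMED CT relations is not specified in the abstract and is not reproduced here.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vectors of the concepts reached from each defining concept.
related_vectors = {
    "term_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "term_b": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
}

relatedness = cosine(
    mean_vector(related_vectors["term_a"]),
    mean_vector(related_vectors["term_b"]),
)
print(round(relatedness, 3))
```

In this toy case the two terms share one of their two related concepts, so the averaged vectors overlap in a single dimension and the score falls halfway between unrelated and identical.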
Affiliation(s)
- Eunsuk Chang
  - Carolina Health Informatics Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
13
Chanda AK, Bai T, Yang Z, Vucetic S. Improving medical term embeddings using UMLS Metathesaurus. BMC Med Inform Decis Mak 2022; 22:114. [PMID: 35488252 PMCID: PMC9052653 DOI: 10.1186/s12911-022-01850-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2020] [Accepted: 03/29/2022] [Indexed: 11/25/2022]
Abstract
BACKGROUND: Health providers create Electronic Health Records (EHRs) to describe the conditions and procedures used to treat their patients. Medical notes entered by medical staff in the form of free text are a particularly insightful component of EHRs. There is great interest in applying machine learning tools to medical notes in numerous medical informatics applications. Learning vector representations, or embeddings, of terms in the notes is an important pre-processing step in such applications. However, learning good embeddings is challenging because medical notes are rich in specialized terminology, and the number of available EHRs in practical applications is often very small. METHODS: In this paper, we propose a novel algorithm to learn embeddings of medical terms from a limited set of medical notes. The algorithm, called definition2vec, exploits external information in the form of medical term definitions. It is an extension of a skip-gram algorithm that incorporates textual definitions of medical terms provided by the Unified Medical Language System (UMLS) Metathesaurus. RESULTS: To evaluate the proposed approach, we used a publicly available Medical Information Mart for Intensive Care (MIMIC-III) EHR data set. We performed quantitative and qualitative experiments to measure the usefulness of the learned embeddings. The experimental results show that definition2vec keeps semantically similar medical terms together in the embedding vector space even when they are rare or unobserved in the corpus. We also demonstrate that the learned vector embeddings are helpful in downstream medical informatics applications. CONCLUSION: This paper shows that medical term definitions can be helpful when learning embeddings of rare or previously unseen medical terms from a small corpus of specialized documents such as medical notes.
Affiliation(s)
- Ashis Kumar Chanda
  - Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
- Tian Bai
  - Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
- Ziyu Yang
  - Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
- Slobodan Vucetic
  - Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
14
Xiang J, Zhang J, Zhao Y, Wu FX, Li M. Biomedical data, computational methods and tools for evaluating disease-disease associations. Brief Bioinform 2022; 23:6522999. [PMID: 35136949 DOI: 10.1093/bib/bbac006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Revised: 01/04/2022] [Accepted: 01/05/2022] [Indexed: 12/12/2022] Open
Abstract
In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, a lot of computational methods and tools/platforms have been developed to reveal intrinsic relationship between diseases, which can provide useful insights to the study of complex diseases, e.g. understanding molecular mechanisms of diseases and discovering new treatment of diseases. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups in terms of the usages of biomedical data, including disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we give a discussion and summary on the research of disease-disease associations. This review provides a systematic overview for current disease association research, which could promote the development and applications of computational methods and tools/platforms for disease-disease associations.
Affiliation(s)
- Ju Xiang: School of Computer Science and Engineering, Central South University, China
- Jiashuai Zhang: Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yichao Zhao: School of Computer Science and Engineering, Central South University, China
- Fang-Xiang Wu: Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Li: Division of Biomedical Engineering and Department of Mechanical Engineering at University of Saskatchewan, Saskatoon, Canada
|
15
|
Lastra-Díaz JJ, Lara-Clares A, Garcia-Serrano A. HESML: a real-time semantic measures library for the biomedical domain with a reproducible survey. BMC Bioinformatics 2022; 23:23. [PMID: 34991460 PMCID: PMC8734250 DOI: 10.1186/s12859-021-04539-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 12/15/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are being extensively used in many applications in biomedical text mining and genomics respectively, which has encouraged the development of semantic measures libraries based on the aforementioned ontologies. However, current state-of-the-art semantic measures libraries have some performance and scalability drawbacks derived from their ontology representations based on relational databases, or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that one hybrid IC-based measure which integrates a shortest-path computation sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for their real-time computation prevents both their practical use in any application and the use of any other path-based semantic similarity measure. RESULTS To bridge the two aforementioned gaps, this work introduces for the first time an updated version of the HESML Java software library especially designed for the biomedical domain, which implements the most efficient and scalable ontology representation reported in the literature, together with a new method for the approximation of the Dijkstra's algorithm for taxonomies, called Ancestors-based Shortest-Path Length (AncSPL), which allows the real-time computation of any path-based semantic similarity measure. CONCLUSIONS We introduce a set of reproducible benchmarks showing that HESML outperforms by several orders of magnitude the current state-of-the-art libraries in the three aforementioned biomedical ontologies, as well as the real-time performance and approximation quality of the new AncSPL shortest-path algorithm. Likewise, we show that AncSPL linearly scales regarding the dimension of the common ancestor subgraph regardless of the ontology size. 
Path-based measures based on the new AncSPL algorithm are up to six orders of magnitude faster than their exact implementation in large ontologies like SNOMED-CT and GO. Finally, we provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
Affiliation(s)
- Juan J. Lastra-Díaz: NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
- Alicia Lara-Clares: NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
- Ana Garcia-Serrano: NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), C/Juan del Rosal 16, 28040 Madrid, Spain
|
16
|
Abstract
In low-resource domains, it is challenging to achieve good performance using existing machine learning methods due to a lack of training data and mixed data types (numeric and categorical). In particular, categorical variables with high cardinality pose a challenge to machine learning tasks such as classification and regression because training requires sufficiently many data points for the possible values of each variable. Since interpolation is not possible, nothing can be learned for values not seen in the training set. This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data. We propose to address this challenge by using embeddings for categorical variables that are based on an explicit representation of domain knowledge (KR), namely a hierarchy of concepts. Our approach is to (1) define a semantic similarity measure between categories, based on the hierarchy (we propose a purely hierarchy-based measure, but other similarity measures from the literature can be used), and (2) use that similarity measure to define a modified one-hot encoding. We propose two embedding schemes for single-valued and multi-valued categorical data. We perform experiments on three different use cases. We first compare existing similarity approaches with our approach on a word pair similarity use case. This is followed by creating word embeddings using different similarity approaches. A comparison with existing methods such as Google, Word2Vec and GloVe embeddings on several benchmarks shows better performance on concept categorisation tasks when using knowledge-based embeddings. The third use case uses a medical dataset to compare the performance of semantic-based embeddings and standard binary encodings. Significant improvement in performance of the downstream classification tasks is achieved by using semantic information.
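The two steps described in this abstract (a hierarchy-based similarity, then a similarity-weighted one-hot vector) can be illustrated with a minimal sketch. The toy hierarchy, the function names, and the Wu-Palmer-style similarity used here are assumptions for illustration only; the abstract does not specify the authors' own measure or encoding details.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy (child -> parent); the root has parent None.
PARENT: Dict[str, Optional[str]] = {
    "disease": None,
    "infection": "disease",
    "cancer": "disease",
    "flu": "infection",
    "pneumonia": "infection",
    "melanoma": "cancer",
}


def ancestors(node: str) -> List[str]:
    """Return the path from `node` up to the root, starting with `node`."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def depth(node: str) -> int:
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(ancestors(node))


def similarity(a: str, b: str) -> float:
    """Wu-Palmer-style stand-in: 2 * depth(lca) / (depth(a) + depth(b))."""
    ancestors_of_b = set(ancestors(b))
    # The first shared node on the way up from `a` is the deepest common ancestor.
    lca = next(n for n in ancestors(a) if n in ancestors_of_b)
    return 2 * depth(lca) / (depth(a) + depth(b))


def semantic_one_hot(category: str, vocabulary: List[str]) -> List[float]:
    """Replace the 0/1 entries of a one-hot vector with similarity scores."""
    return [similarity(category, other) for other in vocabulary]


vocabulary = ["flu", "pneumonia", "melanoma"]
print(semantic_one_hot("flu", vocabulary))
```

Unlike a plain one-hot vector, the encoding of "flu" carries a non-zero weight for "pneumonia", so a downstream model can share information between related categories even when one of them is rare in the training data.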
|
17
|
Measuring associational thinking through word embeddings. Artif Intell Rev 2021. [DOI: 10.1007/s10462-021-10056-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article is aimed at estimating automatically the strength of associative words that can be semantically related or not. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies on not only the rank ordering of word pairs but also the strength of associations can reveal some findings that go unnoticed by traditional measures such as Spearman’s and Pearson’s correlation coefficients.
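As a rough illustration of the weighted average of cosine-similarity coefficients mentioned in this abstract, the sketch below combines the cosine computed in a corpus-based space with the cosine computed in a network-based space. The function names, the toy vectors, and the default weight of 0.5 are assumptions; the abstract does not state how the weight is chosen.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(
    corpus_pair: Tuple[Vector, Vector],
    network_pair: Tuple[Vector, Vector],
    weight: float = 0.5,  # hypothetical weight given to the corpus-based space
) -> float:
    """Weighted average of the cosines from two independent embedding spaces."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    corpus_cos = cosine(*corpus_pair)
    network_cos = cosine(*network_pair)
    return weight * corpus_cos + (1.0 - weight) * network_cos


# The same word pair represented in two hypothetical embedding spaces.
corpus_vectors = ([1.0, 0.0], [1.0, 0.0])
network_vectors = ([1.0, 0.0], [0.0, 1.0])
print(combined_similarity(corpus_vectors, network_vectors, weight=0.25))
```

Keeping the two spaces separate and mixing only the resulting scores means the embeddings never need to share a dimensionality or a training procedure.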
|
18
|
Noh J, Kavuluru R. Improved biomedical word embeddings in the transformer era. J Biomed Inform 2021; 120:103867. [PMID: 34284119 PMCID: PMC8373296 DOI: 10.1016/j.jbi.2021.103867] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2020] [Revised: 05/10/2021] [Accepted: 07/11/2021] [Indexed: 10/20/2022]
Abstract
BACKGROUND Recent natural language processing (NLP) research is dominated by neural network methods that employ word embeddings as basic building blocks. Pre-training with neural methods that capture local and global distributional properties (e.g., skip-gram, GLoVE) using free text corpora is often used to embed both words and concepts. Pre-trained embeddings are typically leveraged in downstream tasks using various neural architectures that are designed to optimize task-specific objectives that might further tune such embeddings. OBJECTIVE Despite advances in contextualized language model based embeddings, static word embeddings still form an essential starting point in BioNLP research and applications. They are useful in low resource settings and in lexical semantics studies. Our main goal is to build improved biomedical word embeddings and make them publicly available for downstream applications. METHODS We jointly learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the transformer-based BERT architecture in the two-sentence input mode with a classification objective that captures MeSH pair co-occurrence. We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts. RESULTS Both in qualitative and quantitative evaluations we demonstrate that our methods produce improved biomedical embeddings in comparison with other static embedding efforts. Without selectively culling concepts and terms (as was pursued by previous efforts), we believe we offer the most exhaustive evaluation of biomedical embeddings to date with clear performance improvements across the board. 
CONCLUSION We repurposed a transformer architecture (typically used to generate dynamic embeddings) to improve static biomedical word embeddings using concept correlations. We provide our code and embeddings for public use for downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings.
Affiliation(s)
- Jiho Noh: Department of Computer Science, University of Kentucky, United States of America
- Ramakanth Kavuluru: Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, United States of America; Department of Computer Science, University of Kentucky, United States of America
|
19
|
|
20
|
Yum Y, Lee JM, Jang MJ, Kim Y, Kim JH, Kim S, Shin U, Song S, Joo HJ. A Word Pair Dataset for Semantic Similarity and Relatedness in Korean Medical Vocabulary: Reference Development and Validation. JMIR Med Inform 2021; 9:e29667. [PMID: 34185005 PMCID: PMC8277378 DOI: 10.2196/29667] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 05/08/2021] [Accepted: 05/16/2021] [Indexed: 01/16/2023] Open
Abstract
Background The fact that medical terms require special expertise and are becoming increasingly complex makes it difficult to employ natural language processing techniques in medical informatics. Several human-validated reference standards for medical terms have been developed to evaluate word embedding models using the semantic similarity and relatedness of medical word pairs. However, there are very few reference standards in non-English languages. In addition, because the existing reference standards were developed a long time ago, there is a need to develop an updated standard to represent recent findings in medical sciences. Objective We propose a new Korean word pair reference set to verify embedding models. Methods From January 2010 to December 2020, 518 medical textbooks, 72,844 health information news, and 15,698 medical research articles were collected, and the top 10,000 medical terms were selected to develop medical word pairs. Attending physicians (n=16) participated in the verification of the developed set with 607 word pairs. Results The proportion of word pairs answered by all participants was 90.8% (551/607) for the similarity task and 86.5% (525/605) for the relatedness task. The similarity and relatedness of the word pair showed a high correlation (ρ=0.70, P<.001). The intraclass correlation coefficients to assess the interrater agreements of the word pair sets were 0.47 on the similarity task and 0.53 on the relatedness task. The final reference standard was 604 word pairs for the similarity task and 599 word pairs for relatedness, excluding word pairs with answers corresponding to outliers and word pairs that were answered by less than 50% of all the respondents. 
When FastText models were applied to the final reference standard word pair sets, the embedding models learning medical documents had a higher correlation between the calculated cosine similarity scores compared to human-judged similarity and relatedness scores (namu, ρ=0.12 vs with medical text for the similarity task, ρ=0.47; namu, ρ=0.02 vs with medical text for the relatedness task, ρ=0.30). Conclusions Korean medical word pair reference standard sets for semantic similarity and relatedness were developed based on medical documents from the past 10 years. It is expected that our word pair reference sets will be actively utilized in the development of medical and multilingual natural language processing technology in the future.
Affiliation(s)
- Yunjin Yum: Department of Biostatistics, Korea University College of Medicine, Seoul, Republic of Korea; Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Jeong Moon Lee: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Moon Joung Jang: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Yoojoong Kim: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Jong-Ho Kim: Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea; Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea
- Seongtae Kim: Department of Linguistics, Korea University, Seoul, Republic of Korea
- Unsub Shin: Department of Linguistics, Korea University, Seoul, Republic of Korea
- Sanghoun Song: Department of Linguistics, Korea University, Seoul, Republic of Korea
- Hyung Joon Joo: Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea; Korea University Research Institute for Medical Bigdata Science, Korea University Anam Hospital, Seoul, Republic of Korea; Department of Medical Informatics, Korea University College of Medicine, Seoul, Republic of Korea
|
21
|
Mao Y, Fung KW. Use of word and graph embedding to measure semantic relatedness between Unified Medical Language System concepts. J Am Med Inform Assoc 2021; 27:1538-1546. [PMID: 33029614 PMCID: PMC7566472 DOI: 10.1093/jamia/ocaa136] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2020] [Revised: 06/03/2020] [Accepted: 06/04/2020] [Indexed: 12/03/2022] Open
Abstract
Objective The study sought to explore the use of deep learning techniques to measure the semantic relatedness between Unified Medical Language System (UMLS) concepts. Materials and Methods Concept sentence embeddings were generated for UMLS concepts by applying the word embedding models BioWordVec and various flavors of BERT to concept sentences formed by concatenating UMLS terms. Graph embeddings were generated by the graph convolutional networks and 4 knowledge graph embedding models, using graphs built from UMLS hierarchical relations. Semantic relatedness was measured by the cosine between the concepts’ embedding vectors. Performance was compared with 2 traditional path-based (shortest path and Leacock-Chodorow) measurements and the publicly available concept embeddings, cui2vec, generated from large biomedical corpora. The concept sentence embeddings were also evaluated on a word sense disambiguation (WSD) task. Reference standards used included the semantic relatedness and semantic similarity datasets from the University of Minnesota, concept pairs generated from the Standardized MedDRA Queries and the MeSH (Medical Subject Headings) WSD corpus. Results Sentence embeddings generated by BioWordVec outperformed all other methods used individually in semantic relatedness measurements. Graph convolutional network graph embedding uniformly outperformed path-based measurements and was better than some word embeddings for the Standardized MedDRA Queries dataset. When used together, combined word and graph embedding achieved the best performance in all datasets. For WSD, the enhanced versions of BERT outperformed BioWordVec. Conclusions Word and graph embedding techniques can be used to harness terms and relations in the UMLS to measure semantic relatedness between concepts. Concept sentence embedding outperforms path-based measurements and cui2vec, and can be further enhanced by combining with graph embedding.
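For readers unfamiliar with the two path-based baselines named in this abstract, the following is a minimal sketch on a toy taxonomy. The graph, the `MAX_DEPTH` constant, and the function names are hypothetical; the Leacock-Chodorow score is written in its common form, the negative log of the path length in nodes divided by twice the taxonomy depth, which may differ in detail from the variant used in the study.

```python
import math
from collections import deque
from typing import Dict, List

# Hypothetical toy taxonomy stored as undirected adjacency lists.
EDGES: Dict[str, List[str]] = {
    "disorder": ["heart disease", "lung disease"],
    "heart disease": ["disorder", "myocardial infarction"],
    "lung disease": ["disorder", "asthma"],
    "myocardial infarction": ["heart disease"],
    "asthma": ["lung disease"],
}
MAX_DEPTH = 3  # nodes on the longest root-to-leaf path of the toy taxonomy


def shortest_path_edges(source: str, target: str) -> int:
    """Breadth-first search returning the number of edges between two concepts."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in EDGES[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    raise ValueError("concepts are not connected")


def path_similarity(a: str, b: str) -> float:
    """Shortest-path measure: 1 / (1 + number of edges)."""
    return 1.0 / (1.0 + shortest_path_edges(a, b))


def leacock_chodorow(a: str, b: str) -> float:
    """Leacock-Chodorow measure: -log(path length in nodes / (2 * depth))."""
    nodes_on_path = shortest_path_edges(a, b) + 1
    return -math.log(nodes_on_path / (2 * MAX_DEPTH))


print(path_similarity("asthma", "lung disease"))
print(leacock_chodorow("asthma", "myocardial infarction"))
```

Both scores depend only on the hierarchy, which is why the abstract treats them as a baseline against embedding-based cosine similarity.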
Affiliation(s)
- Yuqing Mao: National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Kin Wah Fung: National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
|
22
|
Parikh S, Davoudi A, Yu S, Giraldo C, Schriver E, Mowery D. Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation. JMIR Med Inform 2021; 9:e21679. [PMID: 33544689 PMCID: PMC7901592 DOI: 10.2196/21679] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2020] [Revised: 09/20/2020] [Accepted: 01/31/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Scientists are developing new computational methods and prediction models to better clinically understand COVID-19 prevalence, treatment efficacy, and patient outcomes. These efforts could be improved by leveraging documented COVID-19-related symptoms, findings, and disorders from clinical text sources in an electronic health record. Word embeddings can identify terms related to these clinical concepts from both the biomedical and nonbiomedical domains, and are being shared with the open-source community at large. However, it's unclear how useful openly available word embeddings are for developing lexicons for COVID-19-related concepts. OBJECTIVE Given an initial lexicon of COVID-19-related terms, this study aims to characterize the returned terms by similarity across various open-source word embeddings and determine common semantic and syntactic patterns between the COVID-19 queried terms and returned terms specific to the word embedding source. METHODS We compared seven openly available word embedding sources. Using a series of COVID-19-related terms for associated symptoms, findings, and disorders, we conducted an interannotator agreement study to determine how accurately the most similar returned terms could be classified according to semantic types by three annotators. We conducted a qualitative study of COVID-19 queried terms and their returned terms to detect informative patterns for constructing lexicons. We demonstrated the utility of applying such learned synonyms to discharge summaries by reporting the proportion of patients identified by concept among three patient cohorts: pneumonia (n=6410), acute respiratory distress syndrome (n=8647), and COVID-19 (n=2397). RESULTS We observed high pairwise interannotator agreement (Cohen kappa) for symptoms (0.86-0.99), findings (0.93-0.99), and disorders (0.93-0.99). 
Word embedding sources generated based on characters tend to return more synonyms (mean count of 7.2 synonyms) compared to token-based embedding sources (mean counts range from 2.0 to 3.4). Word embedding sources queried using a qualifier term (eg, dry cough or muscle pain) more often returned qualifiers of the similar semantic type (eg, "dry" returns consistency qualifiers like "wet" and "runny") compared to a single term (eg, cough or pain) queries. A higher proportion of patients had documented fever (0.61-0.84), cough (0.41-0.55), shortness of breath (0.40-0.59), and hypoxia (0.51-0.56) retrieved than other clinical features. Terms for dry cough returned a higher proportion of patients with COVID-19 (0.07) than the pneumonia (0.05) and acute respiratory distress syndrome (0.03) populations. CONCLUSIONS Word embeddings are valuable technology for learning related terms, including synonyms. When leveraging openly available word embedding sources, choices made for the construction of the word embeddings can significantly influence the words learned.
Affiliation(s)
- Soham Parikh: School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA, United States
- Anahita Davoudi: Department of Biostatistics, Epidemiology, & Informatics, University of Pennsylvania, Philadelphia, PA, United States
- Shun Yu: Division of Hematology/Oncology, Department of Medicine, Hospital of the University of Pennsylvania, Philadelphia, PA, United States
- Carolina Giraldo: Philadelphia College of Osteopathic Medicine, Philadelphia, PA, United States
- Emily Schriver: Data Analytics Center, Penn Medicine, Philadelphia, PA, United States
- Danielle Mowery: Department of Biostatistics, Epidemiology, & Informatics, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States
|
23
|
Ramos-Vargas RE, Román-Godínez I, Torres-Ramos S. Comparing general and specialized word embeddings for biomedical named entity recognition. PeerJ Comput Sci 2021; 7:e384. [PMID: 33817030 PMCID: PMC7959609 DOI: 10.7717/peerj-cs.384] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2020] [Accepted: 01/14/2021] [Indexed: 06/12/2023]
Abstract
Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl), or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, considering that they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. The latter, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) with respect to two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as unique features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. In the first, we set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. 
The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with respect to several gold standards; while the fourth evaluates the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better in the DrugBank corpus, despite having less word coverage and a lower internal semantic relationship than the classic specific word embedding, Pyysalo PM + PMC; while in the contextualized word embeddings the best results are presented in the specific ones. We conclude, therefore, when using classic word embeddings as features on the BioNER task, the general ones could be considered a good option. On the other hand, when using contextualized word embeddings, the specific ones are the best option.
|
24
|
A large reproducible benchmark of ontology-based methods and word embeddings for word similarity. INFORM SYST 2021. [DOI: 10.1016/j.is.2020.101636] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
25
|
Yeganova L, Kim S, Chen Q, Balasanov G, Wilbur WJ, Lu Z. Better synonyms for enriching biomedical search. J Am Med Inform Assoc 2020; 27:1894-1902. [PMID: 33083825 PMCID: PMC7727334 DOI: 10.1093/jamia/ocaa151] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2020] [Revised: 05/20/2020] [Accepted: 08/20/2020] [Indexed: 01/12/2023] Open
Abstract
OBJECTIVE In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be. MATERIALS AND METHODS In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms. RESULTS Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods. CONCLUSIONS We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.
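The pool adjacent violators (PAV) step mentioned in this abstract can be sketched in a few lines. This is a generic isotonic-regression illustration, not the authors' implementation: the function names and the toy scores and labels are assumptions, and tied scores are not pooled specially.

```python
from typing import List, Sequence, Tuple


def pav_fit(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing step function from cosine score to P(synonym)."""
    pairs = sorted(zip(scores, labels))
    # Each block stores [sum of labels, number of points, highest score in block].
    blocks: List[List[float]] = []
    for score, label in pairs:
        blocks.append([float(label), 1.0, score])
        # Pool adjacent blocks while the earlier mean exceeds the later mean.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    thresholds = [block[2] for block in blocks]
    probabilities = [block[0] / block[1] for block in blocks]
    return thresholds, probabilities


def pav_predict(
    score: float, thresholds: Sequence[float], probabilities: Sequence[float]
) -> float:
    """Look up the probability of the first block whose upper bound covers `score`."""
    for upper, probability in zip(thresholds, probabilities):
        if score <= upper:
            return probability
    return probabilities[-1]


# Hypothetical annotated pairs: cosine score and 1 if the pair is a synonym.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
toy_labels = [0, 1, 0, 0, 1, 1]
thresholds, probabilities = pav_fit(toy_scores, toy_labels)
print(pav_predict(0.3, thresholds, probabilities))
```

Fitting one such curve per embedding method is what would let scores from different methods be read on a common probability scale, as the abstract describes.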
Affiliation(s)
- Lana Yeganova: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Sun Kim: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Qingyu Chen: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Grigory Balasanov: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- W John Wilbur: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
|
26
|
Jiang S, Wu W, Tomita N, Ganoe C, Hassanpour S. Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts. J Biomed Inform 2020; 111:103581. [PMID: 33010425 DOI: 10.1016/j.jbi.2020.103581] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2020] [Revised: 09/22/2020] [Accepted: 09/26/2020] [Indexed: 11/25/2022]
Abstract
OBJECTIVE Currently, a major limitation for natural language processing (NLP) analyses in clinical applications is that concepts are not effectively referenced in various forms across different texts. This paper introduces Multi-Ontology Refined Embeddings (MORE), a novel hybrid framework that incorporates domain knowledge from multiple ontologies into a distributional semantic model, learned from a corpus of clinical text. MATERIALS AND METHODS We use the RadCore and MIMIC-III free-text datasets for the corpus-based component of MORE. For the ontology-based part, we use the Medical Subject Headings (MeSH) ontology and three state-of-the-art ontology-based similarity measures. In our approach, we propose a new learning objective, modified from the sigmoid cross-entropy objective function. RESULTS AND DISCUSSION We used two established datasets of semantic similarities among biomedical concept pairs to evaluate the quality of the generated word embeddings. On the first dataset with 29 concept pairs, with similarity scores established by physicians and medical coders, MORE's similarity scores have the highest combined correlation (0.633), which is 5.0% higher than that of the baseline model, and 12.4% higher than that of the best ontology-based similarity measure. On the second dataset with 449 concept pairs, MORE's similarity scores have a correlation of 0.481, based on the average of four medical residents' similarity ratings, and that outperforms the skip-gram model by 8.1%, and the best ontology measure by 6.9%. Furthermore, MORE outperforms three pre-trained transformer-based word embedding models (i.e., BERT, ClinicalBERT, and BioBERT) on both datasets. CONCLUSION MORE incorporates knowledge from several biomedical ontologies into an existing corpus-based distributional semantics model, improving both the accuracy of the learned word embeddings and the extensibility of the model to a broader range of biomedical concepts. 
MORE allows for more accurate clustering of concepts across a wide range of applications, such as analyzing patient health records to identify subjects with similar pathologies, or integrating heterogeneous clinical data to improve interoperability between hospitals.
Affiliation(s)
- Steven Jiang: Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA
- Weiyi Wu: Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Naofumi Tomita: Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Craig Ganoe: Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Saeed Hassanpour: Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA; Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA; Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
27
Hier DB, Kopel J, Brint SU, Wunsch DC, Olbricht GR, Azizi S, Allen B. Evaluation of standard and semantically-augmented distance metrics for neurology patients. BMC Med Inform Decis Mak 2020; 20:203. [PMID: 32843023 PMCID: PMC7448345 DOI: 10.1186/s12911-020-01217-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/27/2020] [Accepted: 08/12/2020] [Indexed: 12/23/2022] Open
Abstract
Background Patient distances can be calculated based on signs and symptoms derived from an ontological hierarchy. There is controversy as to whether patient distance metrics that consider the semantic similarity between concepts can outperform standard patient distance metrics that are agnostic to concept similarity. The choice of distance metric can dominate the performance of classification or clustering algorithms. Our objective was to determine if semantically augmented distance metrics would outperform standard metrics on machine learning tasks. Methods We converted the neurological findings from 382 published neurology cases into sets of concepts with corresponding machine-readable codes. We calculated patient distances by four different metrics (cosine distance, a semantically augmented cosine distance, Jaccard distance, and a semantically augmented bipartite distance). Semantic augmentation for two of the metrics depended on concept similarities from a hierarchical neuro-ontology. For machine learning algorithms, we used the patient diagnosis as the ground truth label and patient findings as machine learning features. We assessed classification accuracy for four classifiers and cluster quality for two clustering algorithms for each of the distance metrics. Results Inter-patient distances were smaller when the distance metric was semantically augmented. Classification accuracy and cluster quality were not significantly different by distance metric. Conclusion Although semantic augmentation reduced inter-patient distances, we did not find improved classification accuracy or improved cluster quality with semantically augmented patient distance metrics when applied to a dataset of neurology patients. Further work is needed to assess the utility of semantically augmented patient distances.
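One common way to make a cosine distance aware of concept similarity is the soft cosine, which places a concept-by-concept similarity matrix inside the dot products. The sketch below illustrates that idea only; it is not necessarily the formulation the authors used, and the matrix `S`, the function name `soft_cosine`, and the toy patient vectors are hypothetical.

```python
import math


def soft_cosine(a, b, S):
    """Cosine similarity where S[i][j] is the similarity of concepts i and j.

    With S equal to the identity matrix this reduces to the standard cosine.
    """
    def bilinear(x, y):
        return sum(x[i] * S[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    denom = math.sqrt(bilinear(a, a) * bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


# Two toy patients with one finding each, over three concepts.
patient_a = [1, 0, 0]
patient_b = [0, 1, 0]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Hypothetical ontology-derived similarities: concepts 0 and 1 are close.
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

standard_distance = 1.0 - soft_cosine(patient_a, patient_b, identity)
augmented_distance = 1.0 - soft_cosine(patient_a, patient_b, S)
```

In this toy case the augmented distance is smaller than the standard one because the two patients have different but related findings, which matches the direction of the effect reported in the Results.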
Affiliation(s)
- Daniel B Hier: Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, IL, 60612, USA
- Jonathan Kopel: Department of Internal Medicine, Texas Tech University Health Sciences Center, Lubbock, TX, USA
- Steven U Brint: Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, IL, 60612, USA
- Donald C Wunsch: Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, 65401, USA
- Gayla R Olbricht: Department of Mathematics and Statistics, Missouri University of Science and Technology, Rolla, MO, 65401, USA
- Sima Azizi: Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, 65401, USA
- Blaine Allen: Department of Electrical and Computer Engineering, Missouri University of Science and Technology, Rolla, MO, 65401, USA
28
Arguello Casteleiro M, Des Diz J, Maroto N, Fernandez Prieto MJ, Peters S, Wroe C, Sevillano Torrado C, Maseda Fernandez D, Stevens R. Semantic Deep Learning: Prior Knowledge and a Type of Four-Term Embedding Analogy to Acquire Treatments for Well-Known Diseases. JMIR Med Inform 2020; 8:e16948. [PMID: 32759099 PMCID: PMC7441383 DOI: 10.2196/16948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2019] [Revised: 02/27/2020] [Accepted: 02/27/2020] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND How to treat a disease remains the most common type of clinical question. Obtaining evidence-based answers from biomedical literature is difficult. Analogical reasoning with embeddings from deep learning (embedding analogies) may extract such biomedical facts, although the state of the art focuses on pair-based proportional (pairwise) analogies such as man:woman::king:queen ("queen = -man +king +woman"). OBJECTIVE This study aimed to systematically extract disease treatment statements with a Semantic Deep Learning (SemDeep) approach underpinned by prior knowledge and another type of 4-term analogy (other than pairwise). METHODS As preliminaries, we investigated Continuous Bag-of-Words (CBOW) embedding analogies in a common-English corpus with five lines of text and observed a type of 4-term analogy (not pairwise) applying the 3CosAdd formula and relating the semantic fields person and death: "dagger = -Romeo +die +died" (search query: -Romeo +die +died). Our SemDeep approach worked with pre-existing items of knowledge (what is known) to make inferences sanctioned by a 4-term analogy (search query -x +z1 +z2) from CBOW and Skip-gram embeddings created with a PubMed systematic reviews subset (PMSB dataset). Stage 1: Knowledge acquisition. Obtaining a set of terms, candidate y, from embeddings using vector arithmetic. Some n-gram pairs obtained from the cosine and validated with evidence (prior knowledge) are the input for the 3CosAdd, seeking a type of 4-term analogy relating the semantic fields disease and treatment. Stage 2: Knowledge organization. Identification of candidates sanctioned by the analogy belonging to the semantic field treatment and mapping these candidates to Unified Medical Language System (UMLS) Metathesaurus concepts with MetaMap. A concept pair is a brief disease treatment statement (biomedical fact). Stage 3: Knowledge validation. An evidence-based evaluation followed by human validation of biomedical facts potentially useful for clinicians.
RESULTS We obtained 5352 n-gram pairs from 446 search queries by applying the 3CosAdd. The microaveraging performance of MetaMap for candidate y belonging to the semantic field treatment was F-measure=80.00% (precision=77.00%, recall=83.25%). We developed an empirical heuristic with some predictive power for clinical winners, that is, search queries bringing candidate y with evidence of a therapeutic intent for target disease x. The search queries -asthma +inhaled_corticosteroids +inhaled_corticosteroid and -epilepsy +valproate +antiepileptic_drug were clinical winners, finding eight evidence-based beneficial treatments. CONCLUSIONS Extracting treatments with therapeutic intent by analogical reasoning from embeddings (423K n-grams from the PMSB dataset) is an ambitious goal. Our SemDeep approach is knowledge-based, underpinned by embedding analogies that exploit prior knowledge. Biomedical facts from embedding analogies (4-term type, not pairwise) are potentially useful for clinicians. The heuristic offers a practical way to discover beneficial treatments for well-known diseases. Learning from deep learning models does not require a massive amount of data. Embedding analogies are not limited to pairwise analogies; hence, analogical reasoning with embeddings is underexploited.
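The search-query form "-x +z1 +z2" used with 3CosAdd can be sketched as follows. Each candidate term y is scored by adding its cosine similarity to the positive terms and subtracting its cosine similarity to the negative term, and the query terms themselves are excluded from the ranking. The toy three-dimensional vectors and the function name `three_cos_add` are assumptions made for illustration; they are not the PMSB embeddings or the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def three_cos_add(embeddings, negative, positives, top_n=3):
    """Rank candidates y by sum(cos(y, z)) over positives minus cos(y, negative)."""
    exclude = {negative, *positives}
    scored = []
    for term, vec in embeddings.items():
        if term in exclude:
            continue  # query terms are not allowed as answers
        score = sum(cosine(vec, embeddings[p]) for p in positives)
        score -= cosine(vec, embeddings[negative])
        scored.append((term, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy embeddings for the classic pairwise example.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.0, 0.5],
}

# Search query: -man +king +woman
ranked = three_cos_add(toy, negative="man", positives=["king", "woman"])
```

A disease-treatment query such as "-asthma +inhaled_corticosteroids +inhaled_corticosteroid" would use the same call shape with the disease as `negative` and the two known terms as `positives`, given embeddings that contain those n-grams.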
Affiliation(s)
- Nava Maroto: Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología, Universidad Politécnica de Madrid, Madrid, Spain
- Simon Peters: School of Social Sciences, University of Manchester, Manchester, United Kingdom
- Robert Stevens: Department of Computer Science, University of Manchester, Manchester, United Kingdom
29
Abdalla M, Abdalla M, Rudzicz F, Hirst G. Using word embeddings to improve the privacy of clinical notes. J Am Med Inform Assoc 2020; 27:901-907. [PMID: 32388549 PMCID: PMC7309261 DOI: 10.1093/jamia/ocaa038] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/24/2020] [Revised: 03/10/2020] [Accepted: 03/23/2020] [Indexed: 11/24/2022] Open
Abstract
OBJECTIVE In this work, we introduce a privacy technique for anonymizing clinical notes that guarantees all private health information is secured (including sensitive data, such as family history, that are not adequately covered by current techniques). MATERIALS AND METHODS We employ a new "random replacement" paradigm (replacing each token in clinical notes with neighboring word vectors from the embedding space) to achieve 100% recall on the removal of sensitive information, unachievable with current "search-and-secure" paradigms. We demonstrate the utility of this paradigm on multiple corpora in a diverse set of classification tasks. RESULTS We empirically evaluate the effect of our anonymization technique both on upstream and downstream natural language processing tasks to show that our perturbations, while increasing security (ie, achieving 100% recall on any dataset), do not greatly impact the results of end-to-end machine learning approaches. DISCUSSION As long as current approaches utilize precision and recall to evaluate deidentification algorithms, there will remain a risk of overlooking sensitive information. Inspired by differential privacy, we sought to make it statistically infeasible to recreate the original data, although at the cost of readability. We hope that the work will serve as a catalyst to further research into alternative deidentification methods that can address current weaknesses. CONCLUSION Our proposed technique can secure clinical texts at a low cost and extremely high recall with a readability trade-off while remaining useful for natural language processing classification tasks. We hope that our work can be used by risk-averse data holders to release clinical texts to researchers.
Affiliation(s)
- Mohamed Abdalla: ICES, Toronto, Canada; The Vector Institute for Artificial Intelligence, Toronto, Canada; Department of Computer Science, University of Toronto, Toronto, Canada
- Moustafa Abdalla: Computational Statistics & Machine Learning Group, Department of Statistics, University of Oxford, Oxford, UK; Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK; Harvard Medical School, Harvard University, Boston, USA
- Frank Rudzicz: The Vector Institute for Artificial Intelligence, Toronto, Canada; Department of Computer Science, University of Toronto, Toronto, Canada; International Centre for Surgical Safety, Li Ka Shing Knowledge Institute, St Michael’s Hospital, Toronto, Canada
- Graeme Hirst: The Vector Institute for Artificial Intelligence, Toronto, Canada; Department of Computer Science, University of Toronto, Toronto, Canada
30
Kuppili V, Biswas M, Edla DR, Prasad KJR, Suri JS. A Mechanics-Based Similarity Measure for Text Classification in Machine Learning Paradigm. IEEE Transactions on Emerging Topics in Computational Intelligence 2020. [DOI: 10.1109/tetci.2018.2863728] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/08/2022]
31
Schumacher E, Dredze M. Learning unsupervised contextual representations for medical synonym discovery. JAMIA Open 2020; 2:538-546. [PMID: 32025651 PMCID: PMC6994012 DOI: 10.1093/jamiaopen/ooz057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/22/2019] [Revised: 09/23/2019] [Accepted: 10/02/2019] [Indexed: 11/14/2022] Open
Abstract
Objectives An important component of processing medical texts is the identification of synonymous words or phrases. Synonyms can inform learned representations of patients or improve linking mentioned concepts to medical ontologies. However, medical synonyms can be lexically similar (“dilated RA” and “dilated RV”) or dissimilar (“cerebrovascular accident” and “stroke”); contextual information can determine if 2 strings are synonymous. Medical professionals utilize extensive variation of medical terminology, often not evidenced in structured medical resources. Therefore, the ability to discover synonyms, especially without reliance on training data, is an important component in processing training notes. The ability to discover synonyms from models trained on large amounts of unannotated data removes the need to rely on annotated pairs of similar words. Models relying solely on non-annotated data can be trained on a wider variety of texts without the cost of annotation, and thus may capture a broader variety of language. Materials and Methods Recent contextualized deep learning representation models, such as ELMo (Peters et al., 2019) and BERT (Devlin et al., 2019), have shown strong improvements over previous approaches in a broad variety of tasks. We leverage these contextualized deep learning models to build representations of synonyms, which integrate the context of the surrounding sentence and use character-level models to alleviate out-of-vocabulary issues. Using these models, we perform unsupervised discovery of likely synonym matches, which reduces the reliance on expensive training data. Results We use the ShARe/CLEF eHealth Evaluation Lab 2013 Task 1b data to evaluate our synonym discovery method. Comparing our proposed contextualized deep learning representations to previous non-neural representations, we find that the contextualized representations show consistent improvement over non-contextualized models in all metrics.
Conclusions Our results show that contextualized models produce effective representations for synonym discovery. We expect that the use of these representations in other tasks would produce similar gains in performance.
Affiliation(s)
- Elliot Schumacher: Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Mark Dredze: Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
32
Cardoso C, Sousa RT, Köhler S, Pesquita C. A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain. Database (Oxford) 2020; 2020:baaa078. [PMID: 33181823 PMCID: PMC7661097 DOI: 10.1093/database/baaa078] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/22/2020] [Revised: 08/13/2020] [Accepted: 08/24/2020] [Indexed: 01/12/2023]
Abstract
The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein-protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein-protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
Affiliation(s)
- Carlota Cardoso: Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Rita T Sousa: Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Catia Pesquita: Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal
33
SECNLP: A survey of embeddings in clinical natural language processing. J Biomed Inform 2020; 101:103323. [DOI: 10.1016/j.jbi.2019.103323] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 03/17/2019] [Revised: 09/12/2019] [Accepted: 10/27/2019] [Indexed: 12/11/2022]
34
35
Fan Y, Pakhomov S, McEwan R, Zhao W, Lindemann E, Zhang R. Using word embeddings to expand terminology of dietary supplements on clinical notes. JAMIA Open 2019; 2:246-253. [PMID: 31825016 PMCID: PMC6904105 DOI: 10.1093/jamiaopen/ooz007] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Indexed: 12/31/2022] Open
Abstract
Objective The objective of this study is to demonstrate the feasibility of applying word embeddings to expand the terminology of dietary supplements (DS) using over 26 million clinical notes. Methods Word embedding models (ie, word2vec and GloVe) trained on clinical notes were used to predefine a list of top 40 semantically related terms for each of 14 commonly used DS. Each list was further evaluated by experts to generate semantically similar terms. We investigated the effect of corpus size and other settings (ie, vector size and window size) as well as the 2 word embedding models on performance for DS term expansion. We compared the number of clinical notes (and patients they represent) that were retrieved using the word embedding expanded terms to both the baseline terms and external DS sources expanded terms. Results Using the word embedding models trained on clinical notes, we could identify 1–12 semantically similar terms for each DS. Using the word embedding expanded terms, we were able to retrieve on average 8.39% more clinical notes and 11.68% more patients for each DS compared with the other 2 sets of terms. Increasing the corpus size results in more misspellings, but not more semantic variants and brand names. The word2vec model was also found to be more capable of detecting semantically similar terms than GloVe. Conclusion Our study demonstrates the utility of word embeddings on clinical notes for terminology expansion on 14 DS. We propose that this method can be potentially applied to create a DS vocabulary for downstream applications, such as information extraction.
Affiliation(s)
- Yadan Fan: Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Serguei Pakhomov: Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA; College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
- Reed McEwan: Academic Health Center-Information Systems, University of Minnesota, Minneapolis, Minnesota, USA
- Wendi Zhao: Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Rui Zhang: Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA; College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
36
Semantic association computation: a comprehensive survey. Artif Intell Rev 2019. [DOI: 10.1007/s10462-019-09781-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
37
Arguello-Casteleiro M, Stevens R, Des-Diz J, Wroe C, Fernandez-Prieto MJ, Maroto N, Maseda-Fernandez D, Demetriou G, Peters S, Noble PJM, Jones PH, Dukes-McEwan J, Radford AD, Keane J, Nenadic G. Exploring semantic deep learning for building reliable and reusable one health knowledge from PubMed systematic reviews and veterinary clinical notes. J Biomed Semantics 2019; 10:22. [PMID: 31711540 PMCID: PMC6849172 DOI: 10.1186/s13326-019-0212-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND Deep Learning opens up opportunities for routinely scanning large bodies of biomedical literature and clinical narratives to represent the meaning of biomedical and clinical terms. However, the validation and integration of this knowledge at scale requires cross-checking with ground truths (i.e. evidence-based resources) that are unavailable in an actionable or computable form. In this paper we explore how to turn information about diagnoses, prognoses, therapies and other clinical concepts into computable knowledge using free-text data about human and animal health. We used a Semantic Deep Learning approach that combines the Semantic Web technologies and Deep Learning to acquire and validate knowledge about 11 well-known medical conditions mined from two sets of unstructured free-text data: 300 K PubMed Systematic Review articles (the PMSB dataset) and 2.5 M veterinary clinical notes (the VetCN dataset). For each target condition we obtained 20 related clinical concepts using two deep learning methods applied separately on the two datasets, resulting in 880 term pairs (target term, candidate term). Each concept, represented by an n-gram, is mapped to UMLS using MetaMap; we also developed a bespoke method for mapping short forms (e.g. abbreviations and acronyms). Existing ontologies were used to formally represent associations. We also create ontological modules and illustrate how the extracted knowledge can be queried. The evaluation was performed using the content within BMJ Best Practice. RESULTS MetaMap achieves an F measure of 88% (precision 85%, recall 91%) when applied directly to the total of 613 unique candidate terms for the 880 term pairs. When the processing of short forms is included, MetaMap achieves an F measure of 94% (precision 92%, recall 96%). Validation of the term pairs with BMJ Best Practice yields precision between 98 and 99%.
CONCLUSIONS The Semantic Deep Learning approach can transform neural embeddings built from unstructured free-text data into reliable and reusable One Health knowledge using ontologies and content from BMJ Best Practice.
Affiliation(s)
- Robert Stevens: School of Computer Science, University of Manchester, Manchester, UK
- Julio Des-Diz: Hospital do Salnés, Villagarcía de Arousa, Pontevedra, Spain
- Nava Maroto: Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología, Universidad Politécnica de Madrid, Madrid, Spain
- Diego Maseda-Fernandez: Midcheshire Hospital Foundation Trust, NHS England, Crewe, UK; School of Medical Sciences, University of Manchester, Manchester, UK
- George Demetriou: School of Computer Science, University of Manchester, Manchester, UK
- Simon Peters: School of Social Sciences, University of Manchester, Manchester, UK
- Peter-John M Noble: Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- Phil H Jones: Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- Jo Dukes-McEwan: Small Animal Teaching Hospital, University of Liverpool, Liverpool, UK
- Alan D Radford: Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- John Keane: School of Computer Science, University of Manchester, Manchester, UK; Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
- Goran Nenadic: School of Computer Science, University of Manchester, Manchester, UK; Manchester Institute of Biotechnology, University of Manchester, Manchester, UK; Health eResearch Centre, University of Manchester, Manchester, UK
38
39
Lavertu A, Altman RB. RedMed: Extending drug lexicons for social media applications. J Biomed Inform 2019; 99:103307. [PMID: 31627020 PMCID: PMC6874884 DOI: 10.1016/j.jbi.2019.103307] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 06/11/2019] [Revised: 10/02/2019] [Accepted: 10/11/2019] [Indexed: 10/25/2022]
Abstract
Social media has been identified as a promising potential source of information for pharmacovigilance. The adoption of social media data has been hindered by the massive and noisy nature of the data. Initial attempts to use social media data have relied on exact text matches to drugs of interest, and therefore suffer from the gap between formal drug lexicons and the informal nature of social media. The Reddit comment archive represents an ideal corpus for bridging this gap. We trained a word embedding model, RedMed, to facilitate the identification and retrieval of health entities from Reddit data. We compare the performance of our model trained on a consumer-generated corpus against publicly available models trained on expert-generated corpora. Our automated classification pipeline achieves an accuracy of 0.88 and a specificity of >0.9 across four different term classes. Of all drug mentions, an average of 79% (±0.5%) were exact matches to a generic or trademark drug name, 14% (±0.5%) were misspellings, 6.4% (±0.3%) were synonyms, and 0.13% (±0.05%) were pill marks. We find that our system captures an additional 20% of mentions; these would have been missed by approaches that rely solely on exact string matches. We provide a lexicon of misspellings and synonyms for 2978 drugs and a word embedding model trained on a health-oriented subset of Reddit.
Affiliation(s)
- Adam Lavertu: Biomedical Informatics Training Program, Stanford University, Stanford, CA 94305, USA
- Russ B Altman: Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
40
Wikidata: A large-scale collaborative ontological medical database. J Biomed Inform 2019; 99:103292. [DOI: 10.1016/j.jbi.2019.103292] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 02/16/2019] [Revised: 08/10/2019] [Accepted: 09/18/2019] [Indexed: 01/09/2023]
41
Hassanzadeh H, Nguyen A, Verspoor K. Quantifying semantic similarity of clinical evidence in the biomedical literature to facilitate related evidence synthesis. J Biomed Inform 2019; 100:103321. [PMID: 31676460 DOI: 10.1016/j.jbi.2019.103321] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 06/04/2019] [Revised: 09/28/2019] [Accepted: 10/25/2019] [Indexed: 10/25/2022]
Abstract
OBJECTIVE Published clinical trials and high quality peer reviewed medical publications are considered as the main sources of evidence used for synthesizing systematic reviews or practicing Evidence Based Medicine (EBM). Finding all relevant published evidence for a particular medical case is a time and labour intensive task, given the breadth of the biomedical literature. Automatic quantification of conceptual relationships between key clinical evidence within and across publications, despite variations in the expression of clinically-relevant concepts, can help to facilitate synthesis of evidence. In this study, we aim to provide an approach towards expediting evidence synthesis by quantifying semantic similarity of key evidence as expressed in the form of individual sentences. Such semantic textual similarity can be applied as a key approach for supporting selection of related studies. MATERIAL AND METHODS We propose a generalisable approach for quantifying semantic similarity of clinical evidence in the biomedical literature, specifically considering the similarity of sentences corresponding to a given type of evidence, such as clinical interventions, population information, clinical findings, etc. We develop three sets of generic, ontology-based, and vector-space models of similarity measures that make use of a variety of lexical, conceptual, and contextual information to quantify the similarity of full sentences containing clinical evidence. To understand the impact of different similarity measures on the overall evidence semantic similarity quantification, we provide a comparative analysis of these measures when used as input to an unsupervised linear interpolation and a supervised regression ensemble. In order to provide a reliable test-bed for this experiment, we generate a dataset of 1000 pairs of sentences from biomedical publications that are annotated by ten human experts. We also extend the experiments on an external dataset for further generalisability testing. RESULTS The combination of all diverse similarity measures showed stronger correlations with the gold standard similarity scores in the dataset than any individual kind of measure. Our approach reached near 0.80 average Pearson correlation across different clinical evidence types using the devised similarity measures. Although they were more effective when combined together, individual generic and vector-space measures also resulted in strong similarity quantification when used in both unsupervised and supervised models. On the external dataset, our similarity measures were highly competitive with the state-of-the-art approaches developed and trained specifically on that dataset for predicting semantic similarity. CONCLUSION Experimental results showed that the proposed semantic similarity quantification approach can effectively identify related clinical evidence that is reported in the literature. The comparison with a state-of-the-art method demonstrated the effectiveness of the approach, and experiments with an external dataset support its generalisability.
Affiliation(s)
- Hamed Hassanzadeh
  - The Australian e-Health Research Centre, CSIRO, Brisbane, QLD, Australia.
- Anthony Nguyen
  - The Australian e-Health Research Centre, CSIRO, Brisbane, QLD, Australia.
- Karin Verspoor
  - School of Computing and Information Systems, University of Melbourne, Melbourne, VIC, Australia.
|
42
|
Bazan J, Bazan-Socha S, Ochab M, Buregwa-Czuma S, Nowakowski T, Woźniak M. Effective construction of classifiers with the k-NN method supported by a concept ontology. Knowl Inf Syst 2019. [DOI: 10.1007/s10115-019-01391-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
|
43
|
Henry S, McInnes BT. Indirect association and ranking hypotheses for literature based discovery. BMC Bioinformatics 2019; 20:425. [PMID: 31416434 PMCID: PMC6694578 DOI: 10.1186/s12859-019-2989-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/24/2018] [Accepted: 07/09/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Literature Based Discovery (LBD) produces more potential hypotheses than can be manually reviewed, making automatically ranking these hypotheses critical. In this paper, we introduce the indirect association measures of Linking Term Association (LTA), Minimum Weight Association (MWA), and Shared B to C Set Association (SBC), and compare them to Linking Set Association (LSA), concept embeddings vector cosine, Linking Term Count (LTC), and direct co-occurrence vector cosine. Our proposed indirect association measures extend traditional association measures to quantify indirect rather than direct associations while preserving valuable statistical properties. RESULTS We perform a comparison between several different hypothesis ranking methods for LBD, and compare them against our proposed indirect association measures. We intrinsically evaluate each method's performance using its ability to estimate semantic relatedness on standard evaluation datasets. We extrinsically evaluate each method's ability to rank hypotheses in LBD using a time-slicing dataset based on co-occurrence information, and another time-slicing dataset based on SemRep extracted-relationships. Precision and recall curves are generated by ranking term pairs and applying a threshold at each rank. CONCLUSIONS Results differ depending on the evaluation methods and datasets, but it is unclear if this is a result of biases in the evaluation datasets or if one method is truly better than another. We conclude that LTC and SBC are the best suited methods for hypothesis ranking in LBD, but there is value in having a variety of methods to choose from.
Affiliation(s)
- Sam Henry
  - Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
- Bridget T. McInnes
  - Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
|
44
|
Lin C, Lou YS, Tsai DJ, Lee CC, Hsu CJ, Wu DC, Wang MC, Fang WH. Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study. JMIR Med Inform 2019; 7:e14499. [PMID: 31339103 PMCID: PMC6683650 DOI: 10.2196/14499] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/26/2019] [Revised: 06/13/2019] [Accepted: 06/17/2019] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of initial word embeddings. Word embedding trained by electronic health records (EHRs) is considered the best, but the vocabulary diversity is limited by previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider the particularity of the disease classification, wherein discharge notes present only positive disease descriptions. OBJECTIVE We aimed to propose a projection word2vec model and a hybrid sampling method. In addition, we aimed to conduct a series of experiments to validate the effectiveness of these methods. METHODS We compared the projection word2vec model and traditional word2vec model using two corpora sources: English Wikipedia and PubMed journal abstracts. We used seven published datasets to measure the medical semantic understanding of the word2vec models and used these embeddings to identify the three-character-level ICD-10-CM diagnostic codes in a set of discharge notes. On the basis of embedding technology improvement, we also tried to apply the hybrid sampling method to improve accuracy. The 94,483 labeled discharge notes from the Tri-Service General Hospital of Taipei, Taiwan, from June 1, 2015, to June 30, 2017, were used. To evaluate the model performance, 24,762 discharge notes from July 1, 2017, to December 31, 2017, from the same hospital were used. Moreover, 74,324 additional discharge notes collected from seven other hospitals were tested. The F-measure, which is the major global measure of effectiveness, was adopted. RESULTS In medical semantic understanding, the original EHR embeddings and PubMed embeddings exhibited superior performance to the original Wikipedia embeddings. After projection training technology was applied, the projection Wikipedia embeddings exhibited an obvious improvement but did not reach the level of original EHR embeddings or PubMed embeddings. In the subsequent ICD-10-CM coding experiment, the model that used both projection PubMed and Wikipedia embeddings had the highest testing mean F-measure (0.7362 and 0.6693 in Tri-Service General Hospital and the seven other hospitals, respectively). Moreover, the hybrid sampling method was found to improve the model performance (F-measure=0.7371/0.6698). CONCLUSIONS The word embeddings trained using EHR and PubMed could understand medical semantics better, and the proposed projection word2vec model improved the ability of medical semantics extraction in Wikipedia embeddings. Although the improvement from the projection word2vec model in the real ICD-10-CM coding task was not substantial, the models could effectively handle emerging diseases. The proposed hybrid sampling method enables the model to behave like a human expert.
Affiliation(s)
- Chin Lin
  - Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
  - School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Yu-Sheng Lou
  - Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
  - School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Dung-Jang Tsai
  - Graduate Institute of Life Sciences, National Defense Medical Center, Taipei, Taiwan
  - School of Public Health, National Defense Medical Center, Taipei, Taiwan
- Chia-Cheng Lee
  - Planning and Management Office, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Chia-Jung Hsu
  - Planning and Management Office, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Ding-Chung Wu
  - Department of Medical Record, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Mei-Chuen Wang
  - Department of Medical Record, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Wen-Hui Fang
  - Department of Family and Community Medicine, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
|
45
|
Timilsina M, Tandan M, d'Aquin M, Yang H. Discovering Links Between Side Effects and Drugs Using a Diffusion Based Method. Sci Rep 2019; 9:10436. [PMID: 31320740 PMCID: PMC6639365 DOI: 10.1038/s41598-019-46939-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 04/30/2019] [Accepted: 07/05/2019] [Indexed: 12/14/2022] Open
Abstract
Identifying the unintended effects of drugs (side effects) is a very important issue in pharmacological studies. The laboratory verification of associations between drugs and side effects requires costly, time-intensive research. Thus, an approach to predicting drug side effects based on known side effects, using a computational model, is highly desirable. To provide such a model, we used openly available data resources to model drugs and side effects as a bipartite graph. The drug-drug network is constructed using the word2vec model where the edges between drugs represent the semantic similarity between them. We integrated the bipartite graph and the semantic similarity graph using a matrix factorization method and a diffusion based model. Our results show the effectiveness of this integration by computing weighted (i.e., ranked) predictions of initially unknown links between side effects and drugs.
Affiliation(s)
- Mohan Timilsina
  - Data Science Institute, Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland.
- Meera Tandan
  - Discipline of General Practice, School of Medicine, National University of Ireland Galway, Galway, Ireland
- Mathieu d'Aquin
  - Data Science Institute, Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland
- Haixuan Yang
  - School of Mathematics, Statistics and Applied Mathematics, National University of Ireland Galway, Galway, Ireland
|
46
|
Gopalakrishnan V, Jha K, Xun G, Ngo HQ, Zhang A. Towards self-learning based hypotheses generation in biomedical text domain. Bioinformatics 2018; 34:2103-2115. [PMID: 29293920 DOI: 10.1093/bioinformatics/btx837] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 04/26/2017] [Accepted: 12/22/2017] [Indexed: 01/01/2023] Open
Abstract
Motivation The overwhelming amount of research articles in the domain of bio-medicine might cause important connections to remain unnoticed. Literature Based Discovery is a sub-field within biomedical text mining that peruses these articles to formulate high confident hypotheses on possible connections between medical concepts. Although many alternate methodologies have been proposed over the last decade, they still suffer from scalability issues. The primary reason, apart from the dense inter-connections between biological concepts, is the absence of information on the factors that lead to the edge-formation. In this work, we formulate this problem as a collaborative filtering task and leverage a relatively new concept of word-vectors to learn and mimic the implicit edge-formation process. Along with single-class classifier, we prune the search-space of redundant and irrelevant hypotheses to increase the efficiency of the system and at the same time maintaining and in some cases even boosting the overall accuracy. Results We show that our proposed framework is able to prune up to 90% of the hypotheses while still retaining high recall in top-K results. This level of efficiency enables the discovery algorithm to look for higher-order hypotheses, something that was infeasible until now. Furthermore, the generic formulation allows our approach to be agile to perform both open and closed discovery. We also experimentally validate that the core data-structures upon which the system bases its decision have a high concordance with the opinion of the experts. This coupled with the ability to understand the edge formation process provides us with interpretable results without any manual intervention. Availability and implementation The relevant JAVA codes are available at: https://github.com/vishrawas/Medline-Code_v2. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Vishrawas Gopalakrishnan
  - Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
- Kishlay Jha
  - Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
- Guangxu Xun
  - Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
- Hung Q Ngo
  - Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
- Aidong Zhang
  - Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY, USA
|
47
|
Torjmen-Khemakhem M, Gasmi K. Document/query expansion based on selecting significant concepts for context based retrieval of medical images. J Biomed Inform 2019; 95:103210. [DOI: 10.1016/j.jbi.2019.103210] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 01/05/2019] [Revised: 05/15/2019] [Accepted: 05/16/2019] [Indexed: 11/28/2022]
|
48
|
Moon S, Liu S, Chen D, Wang Y, Wood DL, Chaudhry R, Liu H, Kingsbury P. Salience of Medical Concepts of Inside Clinical Texts and Outside Medical Records for Referred Cardiovascular Patients. Journal of Healthcare Informatics Research 2019; 3:200-219. [PMID: 35415427 PMCID: PMC8982748 DOI: 10.1007/s41666-019-00044-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/12/2018] [Revised: 11/29/2018] [Accepted: 01/05/2019] [Indexed: 12/03/2022]
Abstract
Outside medical records (OMRs) accompanying referred patients are frequently sent as faxes from external healthcare providers. Accessing useful and relevant information from these OMRs in a timely manner is a challenging task due to a combination of the presence of machine-illegible information and the limited system interoperability inherent in healthcare. Little research has been done on investigating information in OMRs. This paper evaluated overlapping and non-overlapping medical concepts captured from digitally faxed OMRs for patients transferring to the Department of Cardiovascular Medicine and from clinical consultant notes generated at the Mayo Clinic. We used optical character recognition (OCR) techniques to make faxed OMRs machine-readable and used natural language processing (NLP) techniques to capture clinical concepts from both machine-readable OMRs and Mayo clinical notes. We measured the level of overlap in medical concepts between OMRs and Mayo clinical narratives in the quantitative approaches and assessed the salience of concepts specific to Cardiovascular Medicine by calculating the ratio of those mentioned concepts relative to an independent clinical corpus. Among the concepts collected from the OMRs, 11.19% of those were also present in the Mayo clinical narratives that were generated within the 3 months after their initial encounter at the Mayo Clinic. For those common concepts, 73.97% were identified in initial consultant notes (ICNs) and 26.03% were captured over subsequent follow-up consultant notes (FCNs). These findings implied that information collected from the OMRs is potentially informative for patient care, but some valuable information (additionally identified in FCNs) collected from the OMRs is not fully used in an earlier stage of the care process. The concepts collected from the ICNs have the highest salience to Cardiovascular Medicine (0.112) compared to concepts in OMRs and concepts in FCNs. Additionally, unique concepts captured in ICNs (unseen in OMRs or FCNs) carried the most salient information (0.094), which demonstrated that ICNs provided the most informative concepts for the care of transferred patients.
Affiliation(s)
- Sungrim Moon
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN USA
- Sijia Liu
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN USA
  - Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY USA
- David Chen
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN USA
- Yanshan Wang
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN USA
- Rajeev Chaudhry
  - Department of Medicine and Center for Translational Informatics, Mayo Clinic, Rochester, MN USA
- Hongfang Liu
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN USA
- Paul Kingsbury
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN USA
|
49
|
Abstract
BACKGROUND Given the increasing amount of biomedical resources that are being annotated with concepts from more than one ontology and covering multiple domains of knowledge, it is important to devise mechanisms to compare these resources that take into account the various domains of annotation. For example, metabolic pathways are annotated with their enzymes and their metabolites, and thus similarity measures should compare them with respect to both of those domains simultaneously. RESULTS In this paper, we propose two approaches to lift existing single-ontology semantic similarity measures into multi-domain measures. The aggregative approach compares domains independently and averages the various similarity values into a final score. The integrative approach integrates all the relevant ontologies into a single one, calculating similarity in the resulting multi-domain ontology using the single-ontology measure. CONCLUSIONS We evaluated the two approaches in a multidisciplinary epidemiology dataset by evaluating the capacity of the similarity measures to predict new annotations based on the existing ones. The results show a promising increase in performance of the multi-domain measures over the single-ontology ones in the vast majority of the cases. These results show that multi-domain measures outperform single-domain ones, and should be considered by the community as a starting point to study more efficient multi-domain semantic similarity measures.
Affiliation(s)
- João D. Ferreira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
|
50
|
Rodriguez-Prieto O, Araujo L, Martinez-Romo J. Discovering related scientific literature beyond semantic similarity: a new co-citation approach. Scientometrics 2019. [DOI: 10.1007/s11192-019-03125-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/27/2022]
|