1
Han P, Li X, Zhang Z, Zhong Y, Gu L, Hua Y, Li X. CMCN: Chinese medical concept normalization using continual learning and knowledge-enhanced. Artif Intell Med 2024; 157:102965. [PMID: 39241561 DOI: 10.1016/j.artmed.2024.102965] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 05/10/2024] [Accepted: 08/19/2024] [Indexed: 09/09/2024]
Abstract
Medical Concept Normalization (MCN) is a crucial process for deep information extraction and natural language processing tasks, and it plays a vital role in biomedical research. Although MCN research in English has made significant progress, Chinese medical concept normalization (CMCN) remains insufficiently explored due to its complex syntactic structure and the paucity of Chinese medical semantic and ontology resources. In recent years, deep learning has been extensively applied across numerous natural language processing tasks, owing to its robust learning capabilities, adaptability, and transferability. It has proven to be well suited for intricate and specialized knowledge discovery research in the biomedical field. In this study, we conduct research on CMCN through the lens of deep learning. Specifically, our research introduces a model that leverages polymorphic semantic information and knowledge enhancement through multi-task learning and retains more of the important medical features through continual learning. As the cornerstone of CMCN, disease names are the main focus of this research. We evaluated various methodologies on a Chinese disease dataset that we built, achieving 76.12% Accuracy@1, 87.20% Accuracy@5, and 90.02% Accuracy@10 with our best-performing model, GCBM-BSCL. This research not only advances the fields of knowledge mining and medical concept normalization but also enhances the integration and application of artificial intelligence in the medical and health field. We have published the source code and results at https://github.com/BearLiX/CMCN.
Affiliation(s)
- Pu Han
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China; Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
- Xiong Li
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Zhanpeng Zhang
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Yule Zhong
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Liang Gu
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Yingying Hua
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Xiaoyan Li
  - School of Basic Medical Sciences, Nanjing Medical University, Nanjing 210029, China
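The Accuracy@1, Accuracy@5, and Accuracy@10 figures in the CMCN abstract above are top-k ranking metrics. As a minimal sketch of how such a metric is commonly computed (this is not the authors' evaluation code; the function name `accuracy_at_k` and the concept IDs are hypothetical):

```python
from typing import Sequence


def accuracy_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of mentions whose gold concept appears among the top-k candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, g in zip(ranked, gold) if g in candidates[:k])
    return hits / len(gold)


# Hypothetical ranked candidate IDs for three mentions (illustrative, not from the paper).
ranked = [
    ["C1", "C2", "C3"],
    ["C5", "C4", "C6"],
    ["C7", "C8", "C9"],
]
gold = ["C1", "C4", "C0"]

acc1 = accuracy_at_k(ranked, gold, 1)  # only the first mention is correct at rank 1
acc2 = accuracy_at_k(ranked, gold, 2)  # the second mention is recovered at rank 2
```

Accuracy@k can only stay the same or rise as k grows, which matches the ordering of the three reported scores.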
2
Li J, Li Y, Pan Y, Guo J, Sun Z, Li F, He Y, Tao C. Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models. J Biomed Semantics 2024; 15:14. [PMID: 39123237 PMCID: PMC11316402 DOI: 10.1186/s13326-024-00318-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Accepted: 07/31/2024] [Indexed: 08/12/2024] Open
Abstract
BACKGROUND Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lacks standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. RESULTS In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context. We conducted a ranking of the top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% accuracy for the top-1 candidate and 90.0% accuracy for the top-10 candidates. CONCLUSION This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
Affiliation(s)
- Jianfu Li
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
- Yiming Li
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Yuanyi Pan
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
- Jinjing Guo
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
- Zenan Sun
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Fang Li
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
- Yongqun He
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
- Cui Tao
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
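The vaccine-mapping abstract above mentions a weighted ensemble of several pre-trained models followed by a top-10 ranking. The abstract does not give the actual rules or weights, so the following is only a sketch of one plausible form, a plain weighted sum of per-model similarity scores. The function `ensemble_rank`, the model names, the concept IDs, and all scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def ensemble_rank(
    model_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Combine per-model candidate scores by weighted sum and return the top-k concepts."""
    combined: Dict[str, float] = defaultdict(float)
    for model, scores in model_scores.items():
        weight = weights.get(model, 0.0)  # models without a weight contribute nothing
        for concept, score in scores.items():
            combined[concept] += weight * score
    # Sort by descending score; break ties by concept ID for a deterministic order.
    ranking = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranking[:top_k]


# Illustrative scores from two hypothetical models for one vaccine mention.
model_scores = {
    "model_a": {"VO_A": 0.9, "VO_B": 0.6},
    "model_b": {"VO_A": 0.4, "VO_B": 0.8, "VO_C": 0.5},
}
weights = {"model_a": 0.5, "model_b": 0.5}

top = ensemble_rank(model_scores, weights, top_k=10)
top_ids = [concept for concept, _ in top]
```

In this toy input, `VO_B` overtakes `VO_A` once both models are combined, even though the first model alone prefers `VO_A`.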
3
Livne M, Miftahutdinov Z, Tutubalina E, Kuznetsov M, Polykovskiy D, Brundyn A, Jhunjhunwala A, Costa A, Aliper A, Aspuru-Guzik A, Zhavoronkov A. nach0: multimodal natural and chemical languages foundation model. Chem Sci 2024; 15:8380-8389. [PMID: 38846388 PMCID: PMC11151847 DOI: 10.1039/d4sc00966e] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/08/2024] [Accepted: 04/26/2024] [Indexed: 06/09/2024] Open
Abstract
Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
Affiliation(s)
- Micha Livne
  - NVIDIA 2788 San Tomas Expressway Santa Clara 95051 CA USA
- Zulfat Miftahutdinov
  - Insilico Medicine Canada Inc. 3710-1250 René-Lévesque West Montreal Quebec Canada
- Elena Tutubalina
  - Insilico Medicine Hong Kong Ltd. Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok New Territories Hong Kong
- Maksim Kuznetsov
  - Insilico Medicine Canada Inc. 3710-1250 René-Lévesque West Montreal Quebec Canada
- Daniil Polykovskiy
  - Insilico Medicine Canada Inc. 3710-1250 René-Lévesque West Montreal Quebec Canada
- Annika Brundyn
  - NVIDIA 2788 San Tomas Expressway Santa Clara 95051 CA USA
- Anthony Costa
  - NVIDIA 2788 San Tomas Expressway Santa Clara 95051 CA USA
- Alex Aliper
  - Insilico Medicine AI Ltd. Level 6, Unit 08, Block A, IRENA HQ Building, Masdar City Abu Dhabi United Arab Emirates
- Alán Aspuru-Guzik
  - University of Toronto Lash Miller Building 80 St. George Street Toronto Ontario Canada
- Alex Zhavoronkov
  - Insilico Medicine Hong Kong Ltd. Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok New Territories Hong Kong
4
Rouhizadeh H, Nikishina I, Yazdani A, Bornet A, Zhang B, Ehrsam J, Gaudet-Blavignac C, Naderi N, Teodoro D. A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models. Sci Data 2024; 11:455. [PMID: 38704422 PMCID: PMC11069517 DOI: 10.1038/s41597-024-03317-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 04/25/2024] [Indexed: 05/06/2024] Open
Abstract
Due to the complexity of the biomedical domain, the ability to capture semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in the past years, no evaluation benchmark has been developed to evaluate how well language models represent biomedical concepts according to their corresponding context. Inspired by the Word-in-Context (WiC) benchmark, in which word sense disambiguation is reformulated as a binary classification task, we propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. BioWiC comprises 20'156 instances, covering over 7'400 unique biomedical terms, making it the largest WiC dataset in the biomedical domain. We evaluate BioWiC both intrinsically and extrinsically and show that it could be used as a reliable benchmark for evaluating context-dependent embeddings in biomedical corpora. In addition, we conduct several experiments using a variety of discriminative and generative large language models to establish robust baselines that can serve as a foundation for future research.
Affiliation(s)
- Hossein Rouhizadeh
  - Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Irina Nikishina
  - Department of Informatics, University of Hamburg, Hamburg, Germany
- Anthony Yazdani
  - Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Alban Bornet
  - Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Boya Zhang
  - Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Julien Ehrsam
  - Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
  - Division of Medical Information Sciences, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Christophe Gaudet-Blavignac
  - Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
  - Division of Medical Information Sciences, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Nona Naderi
  - Laboratoire Interdisciplinaire des Sciences du Numerique, CNRS, Paris-Saclay University, Orsay, France
- Douglas Teodoro
  - Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
5
Abdulnazar A, Roller R, Schulz S, Kreuzthaler M. Unsupervised SapBERT-based bi-encoders for medical concept annotation of clinical narratives with SNOMED CT. Digit Health 2024; 10:20552076241288681. [PMID: 39493636 PMCID: PMC11531008 DOI: 10.1177/20552076241288681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Accepted: 09/03/2024] [Indexed: 11/05/2024] Open
Abstract
Objective Clinical narratives provide comprehensive patient information. Achieving interoperability involves mapping relevant details to standardized medical vocabularies. Typically, natural language processing divides this task into named entity recognition (NER) and medical concept normalization (MCN). State-of-the-art results require supervised setups with abundant training data. However, the limited availability of annotated data due to sensitivity and time constraints poses challenges. This study addressed the need for unsupervised medical concept annotation (MCA) to overcome these limitations and support the creation of annotated datasets. Method We use an unsupervised SapBERT-based bi-encoder model to analyze n-grams from narrative text and measure their similarity to SNOMED CT concepts. At the end, we apply a syntactical re-ranker. For evaluation, we use the semantic tags of SNOMED CT candidates to assess the NER phase and their concept IDs to assess the MCN phase. The approach is evaluated with both English and German narratives. Result Without training data, our unsupervised approach achieves an F1 score of 0.765 in English and 0.557 in German for MCN. Evaluation at the semantic tag level reveals that "disorder" has the highest F1 scores, 0.871 and 0.648 on English and German datasets. Furthermore, the MCA approach on the semantic tag "disorder" shows F1 scores of 0.839 and 0.696 in English and 0.685 and 0.437 in German for NER and MCN, respectively. Conclusion This unsupervised approach demonstrates potential for initial annotation (pre-labeling) in manual annotation tasks. While promising for certain semantic tags, challenges remain, including false positives, contextual errors, and variability of clinical language, requiring further fine-tuning.
Affiliation(s)
- Akhila Abdulnazar
  - Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
  - CBmed GmbH – Center for Biomarker Research in Medicine, Graz, Austria
- Roland Roller
  - German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
- Stefan Schulz
  - Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
- Markus Kreuzthaler
  - Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
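The method described in the abstract above scores n-grams from narrative text against concept labels with a bi-encoder. The sketch below illustrates only that matching loop. It swaps the SapBERT encoder for a toy character-trigram embedding so that it runs without any model download, and it omits the syntactic re-ranker. The concept IDs, labels, threshold, and function names are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def word_ngrams(tokens: List[str], max_n: int = 3) -> List[str]:
    """All word n-grams of length 1..max_n, shortest first."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def embed(text: str) -> Counter:
    """Toy stand-in for a neural encoder: a bag of character trigrams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def annotate(text: str, concepts: Dict[str, str], threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Return (span, concept_id, similarity) for spans whose best match clears the threshold."""
    concept_vecs = {cid: embed(label) for cid, label in concepts.items()}
    matches = []
    for span in word_ngrams(text.split()):
        vec = embed(span)
        best_id, best_sim = max(
            ((cid, cosine(vec, cvec)) for cid, cvec in concept_vecs.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            matches.append((span, best_id, best_sim))
    return matches


# Hypothetical concept inventory; the IDs are placeholders, not real SNOMED CT codes.
concepts = {"X1": "chest pain", "X2": "fever"}
matches = annotate("patient reports chest pain and fever", concepts)
matched_pairs = [(span, cid) for span, cid, _ in matches]
```

The two encoders in a bi-encoder setup embed spans and concept labels independently, so the concept vectors can be computed once and reused for every span, as `annotate` does here.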
6
Aliper A, Kudrin R, Polykovskiy D, Kamya P, Tutubalina E, Chen S, Ren F, Zhavoronkov A. Prediction of Clinical Trials Outcomes Based on Target Choice and Clinical Trial Design with Multi-Modal Artificial Intelligence. Clin Pharmacol Ther 2023; 114:972-980. [PMID: 37483175 DOI: 10.1002/cpt.3008] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 02/27/2023] [Accepted: 07/10/2023] [Indexed: 07/25/2023]
Abstract
Drug discovery and development is a notoriously risky process with high failure rates at every stage, including disease modeling, target discovery, hit discovery, lead optimization, preclinical development, human safety, and efficacy studies. Accurate prediction of clinical trial outcomes may help significantly improve the efficiency of this process by prioritizing therapeutic programs that are more likely to succeed in clinical trials and ultimately benefit patients. Here, we describe inClinico, a transformer-based artificial intelligence software platform designed to predict the outcome of phase II clinical trials. The platform combines an ensemble of clinical trial outcome prediction engines that leverage generative artificial intelligence and multimodal data, including omics, text, clinical trial design, and small molecule properties. inClinico was validated in retrospective, quasi-prospective, and prospective validation studies internally and with pharmaceutical companies and financial institutions. The platform achieved 0.88 receiver operating characteristic area under the curve in predicting the phase II to phase III transition on a quasi-prospective validation dataset. The first prospective predictions were made and placed on date-stamped preprint servers in 2016. To validate our model in a real-world setting, we published forecasted outcomes for several phase II clinical trials achieving 79% accuracy for the trials that have read out. We also present an investment application of inClinico using date stamped virtual trading portfolio demonstrating 35% 9-month return on investment.
Affiliation(s)
- Alex Aliper
  - Insilico Medicine AI Ltd, Masdar City, Abu Dhabi, United Arab Emirates
- Roman Kudrin
  - Insilico Medicine AI Ltd, Masdar City, Abu Dhabi, United Arab Emirates
- Petrina Kamya
  - Insilico Medicine Canada Inc., Quebec, Montreal, Canada
- Elena Tutubalina
  - Insilico Medicine Hong Kong Ltd, New Territories, Pak Shek Kok, Hong Kong
- Shan Chen
  - Insilico Medicine Shanghai Ltd, Pudong New District, Shanghai, China
- Feng Ren
  - Insilico Medicine Shanghai Ltd, Pudong New District, Shanghai, China
- Alex Zhavoronkov
  - Insilico Medicine AI Ltd, Masdar City, Abu Dhabi, United Arab Emirates
  - Insilico Medicine Hong Kong Ltd, New Territories, Pak Shek Kok, Hong Kong
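The inClinico abstract above reports a receiver operating characteristic area under the curve (ROC AUC) of 0.88. For readers unfamiliar with the metric, here is a small sketch that computes ROC AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The labels and scores are made up for illustration and have no connection to the paper's data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Illustrative data: 1 = trial advanced to the next phase, 0 = it did not.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; sort-based implementations are the usual choice for large datasets.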
7
Whitton J, Hunter A. Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations. Artif Intell Med 2023; 144:102661. [PMID: 37783549 DOI: 10.1016/j.artmed.2023.102661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Revised: 07/05/2023] [Accepted: 09/04/2023] [Indexed: 10/04/2023]
Abstract
Evidence-based medicine, the practice in which healthcare professionals refer to the best available evidence when making decisions, forms the foundation of modern healthcare. However, it relies on labour-intensive systematic reviews, where domain specialists must aggregate and extract information from thousands of publications, primarily of randomised controlled trial (RCT) results, into evidence tables. This paper investigates automating evidence table generation by decomposing the problem across two language processing tasks: named entity recognition, which identifies key entities within text, such as drug names, and relation extraction, which maps their relationships for separating them into ordered tuples. We focus on the automatic tabulation of sentences from published RCT abstracts that report the results of the study outcomes. Two deep neural net models were developed as part of a joint extraction pipeline, using the principles of transfer learning and transformer-based language representations. To train and test these models, a new gold-standard corpus was developed, comprising over 550 result sentences from six disease areas. This approach demonstrated significant advantages, with our system performing well across multiple natural language processing tasks and disease areas, as well as in generalising to disease domains unseen during training. Furthermore, we show these results were achievable through training our models on as few as 170 example sentences. The final system is a proof of concept that the generation of evidence tables can be semi-automated, representing a step towards fully automating systematic reviews.
Affiliation(s)
- Jetsun Whitton
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
- Anthony Hunter
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
8
Li J, Li Y, Pan Y, Guo J, Sun Z, Li F, He Y, Tao C. Mapping Vaccine Names in Clinical Trials to Vaccine Ontology using Cascaded Fine-Tuned Domain-Specific Language Models. RESEARCH SQUARE 2023:rs.3.rs-3362256. [PMID: 37841880 PMCID: PMC10571639 DOI: 10.21203/rs.3.rs-3362256/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2023]
Abstract
Background Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lacks standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. Results In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context. We conducted a ranking of the top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% accuracy for the top-1 candidate and 90.0% accuracy for the top-10 candidates. Conclusion This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
Affiliation(s)
- Jianfu Li
  - The University of Texas Health Science Center at Houston
- Yiming Li
  - The University of Texas Health Science Center at Houston
- Zenan Sun
  - The University of Texas Health Science Center at Houston
- Fang Li
  - The University of Texas Health Science Center at Houston
- Cui Tao
  - The University of Texas Health Science Center at Houston
9
Keloth VK, Zhou S, Lindemann L, Zheng L, Elhanan G, Einstein AJ, Geller J, Perl Y. Mining of EHR for interface terminology concepts for annotating EHRs of COVID patients. BMC Med Inform Decis Mak 2023; 23:40. [PMID: 36829139 PMCID: PMC9951157 DOI: 10.1186/s12911-023-02136-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/22/2022] [Accepted: 02/09/2023] [Indexed: 02/26/2023] Open
Abstract
BACKGROUND Two years into the COVID-19 pandemic and with more than five million deaths worldwide, the healthcare establishment continues to struggle with every new wave of the pandemic resulting from a new coronavirus variant. Research has demonstrated that there are variations in the symptoms, and even in the order of symptom presentations, in COVID-19 patients infected by different SARS-CoV-2 variants (e.g., Alpha and Omicron). Textual data in the form of admission notes and physician notes in the Electronic Health Records (EHRs) is rich in information regarding the symptoms and their orders of presentation. Unstructured EHR data is often underutilized in research due to the lack of annotations that enable automatic extraction of useful information from the available extensive volumes of textual data. METHODS We present the design of a COVID Interface Terminology (CIT), not just a generic COVID-19 terminology, but one serving a specific purpose of enabling automatic annotation of EHRs of COVID-19 patients. CIT was constructed by integrating existing COVID-related ontologies and mining additional fine granularity concepts from clinical notes. The iterative mining approach utilized the techniques of 'anchoring' and 'concatenation' to identify potential fine granularity concepts to be added to the CIT. We also tested the generalizability of our approach on a hold-out dataset and compared the annotation coverage to the coverage obtained for the dataset used to build the CIT. RESULTS Our experiments demonstrate that this approach results in higher annotation coverage compared to existing ontologies such as SNOMED CT and Coronavirus Infectious Disease Ontology (CIDO). The final version of CIT achieved about 20% more coverage than SNOMED CT and 50% more coverage than CIDO. In the future, the concepts mined and added into CIT could be used as training data for machine learning models for mining even more concepts into CIT and further increasing the annotation coverage. 
CONCLUSION In this paper, we demonstrated the construction of a COVID interface terminology that can be utilized for automatically annotating EHRs of COVID-19 patients. The techniques presented can identify frequently documented fine granularity concepts that are missing in other ontologies thereby increasing the annotation coverage.
Affiliation(s)
- Vipina K Keloth
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Shuxin Zhou
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Luke Lindemann
  - School of Medicine and Health Sciences, The George Washington University, Washington (D.C.), USA
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, USA
- Gai Elhanan
  - Renown Institute for Health Innovation, Desert Research Institute, Reno, NV, USA
- Andrew J Einstein
  - Cardiology Division, Department of Medicine, Columbia University Irving Medical Center, New York, NY, USA
  - Department of Radiology, Columbia University Irving Medical Center, New York, NY, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
10
French E, McInnes BT. An overview of biomedical entity linking throughout the years. J Biomed Inform 2023; 137:104252. [PMID: 36464228 PMCID: PMC9845184 DOI: 10.1016/j.jbi.2022.104252] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 06/22/2022] [Revised: 09/19/2022] [Accepted: 11/15/2022] [Indexed: 12/04/2022]
Abstract
Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. This is an important task in natural language processing, both for translational information extraction applications and for providing context for downstream tasks like relationship extraction. In this paper, we survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, discuss the technical components that comprise BEL systems, and discuss possible directions for the future of the field.
Affiliation(s)
- Evan French
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Bridget T McInnes
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
11
Ruas P, Couto FM. NILINKER: Attention-based approach to NIL Entity Linking. J Biomed Inform 2022; 132:104137. [PMID: 35811025 DOI: 10.1016/j.jbi.2022.104137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Revised: 06/29/2022] [Accepted: 07/05/2022] [Indexed: 10/17/2022]
Abstract
The existence of unlinkable (NIL) entities is a major hurdle affecting the performance of Named Entity Linking approaches, and, consequently, the performance of downstream models that depend on them. Existing approaches to deal with NIL entities focus mainly on clustering and prediction and are limited to general entities. However, other domains, such as the biomedical sciences, are also prone to the existence of NIL entities, given the growing nature of scientific literature. We propose NILINKER, a model that includes a candidate retrieval module for biomedical NIL entities and a neural network that leverages the attention mechanism to find the top-k relevant concepts from target Knowledge Bases (MEDIC, CTD-Chemicals, ChEBI, HP, CTD-Anatomy and Gene Ontology-Biological Process) that may partially represent a given NIL entity. We also make available a new evaluation dataset designated by EvaNIL, suitable for training and evaluating models focusing on the NIL entity linking task. This dataset contains 846,165 documents (abstracts and full-text biomedical articles), including 1,071,776 annotations, distributed by six different partitions: EvaNIL-MEDIC, EvaNIL-CTD-Chemicals, EvaNIL-ChEBI, EvaNIL-HP, EvaNIL-CTD-Anatomy and EvaNIL-Gene Ontology-Biological Process. NILINKER was integrated into a graph-based Named Entity Linking model (REEL) and the results of the experiments show that this approach is able to increase the performance of the Named Entity Linking model.
Affiliation(s)
- Pedro Ruas
- LASIGE, Faculdade de Ciencias, Universidade de Lisboa, Lisbon, 1749-016, Portugal.
- Francisco M Couto
- LASIGE, Faculdade de Ciencias, Universidade de Lisboa, Lisbon, 1749-016, Portugal
12
Fang Y, Idnay B, Sun Y, Liu H, Chen Z, Marder K, Xu H, Schnall R, Weng C. Combining human and machine intelligence for clinical trial eligibility querying. J Am Med Inform Assoc 2022; 29:1161-1171. [PMID: 35426943 PMCID: PMC9196697 DOI: 10.1093/jamia/ocac051] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/25/2022] [Accepted: 03/29/2022] [Indexed: 11/13/2022]
Abstract
OBJECTIVE: To combine machine efficiency and human intelligence for converting complex clinical trial eligibility criteria text into cohort queries. MATERIALS AND METHODS: Criteria2Query (C2Q) 2.0 was developed to enable real-time user intervention for criteria selection and simplification, parsing error correction, and concept mapping. The accuracy, precision, recall, and F1 score of the enhanced modules for negation scope detection, temporal normalization, and value normalization were evaluated using a previously curated gold standard, the annotated eligibility criteria of 1010 COVID-19 clinical trials. The usability and usefulness were evaluated by 10 research coordinators in a task-oriented usability evaluation using 5 Alzheimer's disease trials. Data were collected by user interaction logging, a demographic questionnaire, the Health Information Technology Usability Evaluation Scale (Health-ITUES), and a feature-specific questionnaire. RESULTS: The accuracies of negation scope detection, temporal normalization, and value normalization were 0.924, 0.916, and 0.966, respectively. C2Q 2.0 achieved a moderate usability score (3.84 out of 5) and a high learnability score (4.54 out of 5). On average, 9.9 modifications were made for a clinical study. Experienced researchers made more modifications than novice researchers. The most frequent modification was deletion (5.35 per study). Furthermore, the evaluators favored cohort queries resulting from modifications (score 4.1 out of 5) and the user engagement features (score 4.3 out of 5). DISCUSSION AND CONCLUSION: Features to engage domain experts and to overcome the limitations in automated machine output are shown to be useful and user-friendly. We concluded that human-computer collaboration is key to improving the adoption and user-friendliness of natural language processing.
Affiliation(s)
- Yilu Fang
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Betina Idnay
- School of Nursing, Columbia University, New York, New York, USA; Department of Neurology, Columbia University, New York, New York, USA
- Yingcheng Sun
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Hao Liu
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Zhehuan Chen
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
- Karen Marder
- Department of Neurology, Columbia University, New York, New York, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Rebecca Schnall
- School of Nursing, Columbia University, New York, New York, USA; Heilbrunn Department of Population and Family Health, Mailman School of Public Health, Columbia University, New York, New York, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
13
Khrabrov K, Shenbin I, Ryabov A, Tsypin A, Telepov A, Alekseev A, Grishin A, Strashnov P, Zhilyaev P, Nikolenko S, Kadurin A. nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset. Phys Chem Chem Phys 2022; 24:25853-25863. [DOI: 10.1039/d2cp03966d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
In this work we present nablaDFT, a new dataset and benchmark for Density Functional Theory Hamiltonian and energy prediction. We provide data for over 1 million different molecules and over 5 million conformations, along with baseline models for both tasks.
Affiliation(s)
- Kuzma Khrabrov
- AIRI, Kutuzovskiy prospect house 32 building K.1, Moscow, 121170, Russia
- Ilya Shenbin
- St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences, nab. r. Fontanki 27, St. Petersburg 191011, Russia
- Alexander Ryabov
- Center for Materials Technologies, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, Moscow, 121205, Russia
- Moscow Institute of Physics and Technology (National Research University), Institutsky lane, 9, Dolgoprudny, Moscow Region 141700, Russia
- Artem Tsypin
- AIRI, Kutuzovskiy prospect house 32 building K.1, Moscow, 121170, Russia
- Alexander Telepov
- AIRI, Kutuzovskiy prospect house 32 building K.1, Moscow, 121170, Russia
- Anton Alekseev
- St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences, nab. r. Fontanki 27, St. Petersburg 191011, Russia
- St. Petersburg University, 7-9 Universitetskaya Embankment, St Petersburg, 199034, Russia
- Alexander Grishin
- AIRI, Kutuzovskiy prospect house 32 building K.1, Moscow, 121170, Russia
- Pavel Strashnov
- AIRI, Kutuzovskiy prospect house 32 building K.1, Moscow, 121170, Russia
- Petr Zhilyaev
- Center for Materials Technologies, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, Moscow, 121205, Russia
- Sergey Nikolenko
- St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences, nab. r. Fontanki 27, St. Petersburg 191011, Russia
- ISP RAS Research Center for Trusted Artificial Intelligence, Alexander Solzhenitsyn st. 25, Moscow, 109004, Russia
- Artur Kadurin
- AIRI, Kutuzovskiy prospect house 32 building K.1, Moscow, 121170, Russia
- Kuban State University, Stavropolskaya Street, 149, Krasnodar 350040, Russia