1
Tarasova OA, Rudik AV, Biziukova NY, Filimonov DA, Poroikov VV. Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach. J Cheminform 2022; 14:55. [PMID: 35964150 PMCID: PMC9375066 DOI: 10.1186/s13321-022-00633-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 07/12/2022] [Indexed: 11/24/2022] Open
Abstract
Motivation: Chemical named entity recognition (CNER) algorithms retrieve chemical compound identifiers from texts and allow them to be associated with physical–chemical properties and biological activities. Scientific texts are low-formalized sources of information. Most CNER methods are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced. Methods and results: We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to earlier CNER methods, our approach represents the data as a set of fragments of text (FoTs) and then prepares a set of multi-n-grams (sequences from one to n symbols) for each FoT. Our approach may enable the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92 based on five-fold cross validation. We applied the developed algorithm to the extracted CNEs of potential severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
Conclusion: The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry. Supplementary Information: The online version contains supplementary material available at 10.1186/s13321-022-00633-4.
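To make the multi-n-gram idea in the abstract above concrete, here is a minimal Python sketch. It assumes that "sequences from one to n symbols" means every contiguous character substring of length 1 to n within a fragment of text; the abstract does not spell this out, so the function `multi_n_grams` and its behaviour are illustrative assumptions, not the authors' implementation. The helper `balanced_accuracy` uses the standard definition (mean of sensitivity and specificity), which gives about 0.915 for the reported 0.95 and 0.88, consistent with the reported 0.92 after rounding.

```python
def multi_n_grams(fragment: str, n: int) -> set[str]:
    """Return every contiguous character sequence of length 1..n in the fragment.

    Assumption: this is one plausible reading of "multi-n-grams"; the paper's
    exact construction may differ.
    """
    grams = set()
    for size in range(1, n + 1):
        # range() is empty when size exceeds the fragment length
        for start in range(len(fragment) - size + 1):
            grams.add(fragment[start:start + size])
    return grams


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Standard balanced accuracy: the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # "NaCl" is a made-up example fragment, not taken from the paper.
    print(sorted(multi_n_grams("NaCl", 3)))
    print(balanced_accuracy(0.95, 0.88))
```

A set is used because the sketch only records which n-grams occur; a classifier that needs frequencies would count them instead.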
Affiliation(s)
- O A Tarasova
  - Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
- A V Rudik
  - Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
- N Yu Biziukova
  - Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
- D A Filimonov
  - Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
- V V Poroikov
  - Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry, 10 bldg. 8, Pogodinskaya Str., Moscow, 119121, Russia
2
Wang J, Ren Y, Zhang Z, Xu H, Zhang Y. From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents. Front Res Metr Anal 2021; 6:691105. [PMID: 35005421 PMCID: PMC8727901 DOI: 10.3389/frma.2021.691105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2021] [Accepted: 11/02/2021] [Indexed: 11/28/2022] Open
Abstract
Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information on chemical reactions is usually embedded in the free text of patents. The rapid accumulation of chemical patents calls for automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction. This work describes the participation of the Melax Tech team in the CLEF 2020-ChEMU Task of Chemical Reaction Extraction from Patent. The task consisted of two subtasks: (1) named entity recognition to identify compounds and different semantic roles in the chemical reaction and (2) event extraction to identify event triggers of chemical reactions and their relations with the semantic roles recognized in subtask 1. To build an end-to-end system with high performance, multiple strategies tailored to chemical patents were applied and evaluated, ranging from optimizing the tokenization and pre-training patent language models based on self-supervision to domain knowledge-based rules. Our hybrid approaches combining different strategies achieved state-of-the-art results in both subtasks, with the top-ranked F1 of 0.957 for entity recognition and the top-ranked F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.
Affiliation(s)
- Jingqi Wang
  - Melax Technologies, Inc., Houston, TX, United States
- Yuankai Ren
  - School of Medicine, Nantong University, Nantong, China
- Zhi Zhang
  - School of Medicine, Nantong University, Nantong, China
- Hua Xu
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States
- Yaoyun Zhang
  - Melax Technologies, Inc., Houston, TX, United States
3
Naderi N, Knafou J, Copara J, Ruch P, Teodoro D. Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora. Front Res Metr Anal 2021; 6:689803. [PMID: 34870074 PMCID: PMC8640190 DOI: 10.3389/frma.2021.689803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/01/2021] [Accepted: 10/11/2021] [Indexed: 11/13/2022] Open
Abstract
The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that the ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
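The majority voting mentioned in the abstract above can be sketched as follows. This is a generic per-token vote over label sequences, assuming each model emits one label per token; the function name `majority_vote`, the tie-breaking rule, and the example labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def majority_vote(predictions: list[list[str]]) -> list[str]:
    """Combine per-token labels from several models by majority vote.

    Assumption: ties are resolved in favour of the label that appears first
    in model order; the paper's tie-breaking strategy is not stated.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(labels) != length for labels in predictions):
        raise ValueError("all models must label the same number of tokens")
    voted = []
    for token_labels in zip(*predictions):
        # most_common keeps first-encountered order among equal counts
        voted.append(Counter(token_labels).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    # Hypothetical outputs of three fine-tuned models for a three-token sentence.
    model_a = ["B-CHEM", "O", "O"]
    model_b = ["B-CHEM", "I-CHEM", "O"]
    model_c = ["O", "I-CHEM", "O"]
    print(majority_vote([model_a, model_b, model_c]))
```

Voting per token is the simplest variant; voting on whole entity spans would avoid combining labels into inconsistent tag sequences, at the cost of more bookkeeping.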
Affiliation(s)
- Nona Naderi
  - Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
  - Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Knafou
  - Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
  - Swiss Institute of Bioinformatics, Geneva, Switzerland
  - Computer Science Department, University of Geneva, Geneva, Switzerland
- Jenny Copara
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
  - Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
  - Swiss Institute of Bioinformatics, Geneva, Switzerland
- Patrick Ruch
  - Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
  - Swiss Institute of Bioinformatics, Geneva, Switzerland
- Douglas Teodoro
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
  - Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
  - Swiss Institute of Bioinformatics, Geneva, Switzerland
4
Krestel R, Chikkamath R, Hewel C, Risch J. A survey on deep learning for patent analysis. WORLD PATENT INFORMATION 2021. [DOI: 10.1016/j.wpi.2021.102035] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 10/21/2022]
5
Jouffroy J, Feldman SF, Lerner I, Rance B, Burgun A, Neuraz A. Hybrid Deep Learning for Medication-Related Information Extraction From Clinical Texts in French: MedExt Algorithm Development Study. JMIR Med Inform 2021; 9:e17934. [PMID: 33724196 PMCID: PMC8077811 DOI: 10.2196/17934] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/23/2020] [Revised: 12/29/2020] [Accepted: 01/20/2021] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND: Information related to patient medication is crucial for health care; however, up to 80% of the information resides solely in unstructured text. Manual extraction is difficult and time-consuming, and there is little research on using natural language processing to extract medical information from unstructured text in French corpora. OBJECTIVE: We aimed to develop a system to extract medication-related information from clinical text written in French. METHODS: We developed a hybrid system combining an expert rule-based system, contextual word embedding (embedding for language model) trained on clinical notes, and a deep recurrent neural network (bidirectional long short-term memory-conditional random field). The task consisted of extracting drug mentions and their related information (eg, dosage, frequency, duration, route, condition). We manually annotated 320 clinical notes from a French clinical data warehouse to train and evaluate the model. We compared the performance of our approach to those of standard approaches: rule-based or machine learning only and classic word embeddings. We evaluated the models using token-level recall, precision, and F-measure. RESULTS: The overall F-measure was 89.9% (precision: 90.8; recall: 89.2) when combining expert rules and contextualized embeddings, compared to 88.1% (precision: 89.5; recall: 87.2) without expert rules or contextualized embeddings. The F-measures for each category were 95.3% for medication name, 64.4% for drug class mentions, 95.3% for dosage, 92.2% for frequency, 78.8% for duration, and 62.2% for condition of the intake. CONCLUSIONS: Associating expert rules, deep contextualized embedding, and deep neural networks improved medication information extraction. Our results revealed a synergy when associating expert knowledge and latent knowledge.
Affiliation(s)
- Jordan Jouffroy
  - Department of Biomedical Informatics, Necker-Enfants malades Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
  - UMRS 1138 team 22, Institut National de la Santé et de la Recherche Médicale, Université de Paris, Paris, France
- Sarah F Feldman
  - Department of Biomedical Informatics, Necker-Enfants malades Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
  - UMRS 1138 team 22, Institut National de la Santé et de la Recherche Médicale, Université de Paris, Paris, France
- Ivan Lerner
  - Department of Biomedical Informatics, Necker-Enfants malades Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
  - UMRS 1138 team 22, Institut National de la Santé et de la Recherche Médicale, Université de Paris, Paris, France
- Bastien Rance
  - UMRS 1138 team 22, Institut National de la Santé et de la Recherche Médicale, Université de Paris, Paris, France
  - Department of Biomedical Informatics, Georges Pompidou European Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
- Anita Burgun
  - Department of Biomedical Informatics, Necker-Enfants malades Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
  - UMRS 1138 team 22, Institut National de la Santé et de la Recherche Médicale, Université de Paris, Paris, France
- Antoine Neuraz
  - Department of Biomedical Informatics, Necker-Enfants malades Hospital, Assistance Publique-Hôpitaux de Paris, Paris, France
  - UMRS 1138 team 22, Institut National de la Santé et de la Recherche Médicale, Université de Paris, Paris, France
6
Thomas A, Sangeetha S. An innovative hybrid approach for extracting named entities from unstructured text data. Comput Intell 2019. [DOI: 10.1111/coin.12214] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 11/28/2022]
Affiliation(s)
- Anu Thomas
  - Text Analytics & NLP Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India
- S. Sangeetha
  - Text Analytics & NLP Lab, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India
7
Aristodemou L, Tietze F. The state-of-the-art on Intellectual Property Analytics (IPA): A literature review on artificial intelligence, machine learning and deep learning methods for analysing intellectual property (IP) data. WORLD PATENT INFORMATION 2018. [DOI: 10.1016/j.wpi.2018.07.002] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Indexed: 10/28/2022]
8
Jiang N, Rong W, Nie Y, Shen Y, Xiong Z. Biological Event Trigger Identification with Noise Contrastive Estimation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1549-1559. [PMID: 30296207 DOI: 10.1109/tcbb.2017.2710048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
Biological event extraction is an important task towards the goal of extracting biomedical knowledge from scientific publications by capturing biomedical entities and their complex relations from the texts. As a crucial step in event extraction, event trigger identification, which assigns words a suitable trigger category, has recently attracted substantial attention. As triggers are scattered across a large corpus, it is hard for traditional linguistic parsers to generate syntactic features from them. The resulting trigger sparsity problem restricts the model's learning process and becomes one of the main hindrances in trigger identification. In this paper, we employ Noise Contrastive Estimation with a Multi-Layer Perceptron model to address the trigger sparsity problem. Meanwhile, in light of recent advances in distributed word representation, word-embedding features generated by a language model are used for semantic and syntactic information extraction. Finally, an experimental study on the commonly used MLEE dataset against baseline methods has demonstrated promising results.
9
Campos L, Pedro V, Couto F. Impact of translation on named-entity recognition in radiology texts. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2017:4097790. [PMID: 29220455 PMCID: PMC5737072 DOI: 10.1093/database/bax064] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 03/30/2017] [Accepted: 08/03/2017] [Indexed: 11/17/2022]
Abstract
Radiology reports describe the results of radiography procedures and have the potential of being a useful source of information which can bring benefits to health care systems around the world. One way to automatically extract information from the reports is by using Text Mining tools. The problem is that these tools are mostly developed for English and reports are usually written in the native language of the radiologist, which is not necessarily English. This creates an obstacle to the sharing of Radiology information between different communities. This work explores the solution of translating the reports to English before applying the Text Mining tools, probing the question of what translation approach should be used. We created MRRAD (Multilingual Radiology Research Articles Dataset), a parallel corpus of Portuguese research articles related to Radiology and a number of alternative translations (human, automatic and semi-automatic) to English. This is a novel corpus which can be used to move forward the research on this topic. Using MRRAD we studied which kind of automatic or semi-automatic translation approach is more effective on the named-entity recognition task of finding RadLex terms in the English version of the articles. Considering the terms extracted from human translations as our gold standard, we calculated how similar the terms extracted using other translations were to this standard. We found that a completely automatic translation approach using Google leads to F-scores (between 0.861 and 0.868, depending on the extraction approach) similar to the ones obtained through a more expensive semi-automatic translation approach using Unbabel (between 0.862 and 0.870). To better understand the results we also performed a qualitative analysis of the type of errors found in the automatic and semi-automatic translations. Database URL: https://github.com/lasigeBioTM/MRRAD
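The comparison against a gold standard described in the abstract above can be illustrated with a set-based F-score. This is a minimal sketch that assumes terms are compared as exact-match sets per document; the study's actual matching and aggregation may differ, and the function name `term_f_score` and the example terms are hypothetical.

```python
def term_f_score(gold_terms: set[str], candidate_terms: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) of candidate terms against gold terms.

    Assumption: a term counts as correct only if the identical string
    appears in the gold set.
    """
    true_positives = len(gold_terms & candidate_terms)
    precision = true_positives / len(candidate_terms) if candidate_terms else 0.0
    recall = true_positives / len(gold_terms) if gold_terms else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Made-up term sets standing in for terms found in a human translation
    # (gold) and in a machine translation (candidate).
    gold = {"pneumothorax", "pleural effusion", "nodule", "atelectasis"}
    candidate = {"pneumothorax", "nodule", "atelectasis", "mass"}
    print(term_f_score(gold, candidate))
```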
Affiliation(s)
- Luís Campos
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Vasco Pedro
  - Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal
- Francisco Couto
  - LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
10
Abd MT, Mohd M, Abd MT. Investigation of Data Representation Methods with Machine Learning Algorithms for Biomedical Named Entity Recognition. 2018 FOURTH INTERNATIONAL CONFERENCE ON INFORMATION RETRIEVAL AND KNOWLEDGE MANAGEMENT (CAMP) 2018. [DOI: 10.1109/infrkm.2018.8464816] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
11
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 13.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain
  - Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
12
Soysal E, Lee HJ, Zhang Y, Huang LC, Chen X, Wei Q, Zheng W, Chang JT, Cohen T, Sun J, Xu H. CATTLE (CAncer treatment treasury with linked evidence): An integrated knowledge base for personalized oncology research and practice. CPT-PHARMACOMETRICS & SYSTEMS PHARMACOLOGY 2017; 6:188-196. [PMID: 28296354 PMCID: PMC5351410 DOI: 10.1002/psp4.12174] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/09/2016] [Revised: 01/16/2017] [Accepted: 01/17/2017] [Indexed: 01/15/2023]
Abstract
Despite the existence of various databases cataloging cancer drugs, there is an emerging need to support the development and application of personalized therapies, where an integrated understanding of the clinical factors and drug mechanism of action and its gene targets is necessary. We have developed CATTLE (CAncer Treatment Treasury with Linked Evidence), a comprehensive cancer drug knowledge base providing information across the complete spectrum of the drug life cycle. The CATTLE system collects relevant data from 22 heterogeneous databases, integrates them into a unified model centralized on drugs, and presents comprehensive drug information via an interactive web portal with a download function. A total of 2,323 unique cancer drugs are currently linked to rich information from these databases in CATTLE. Through two use cases, we demonstrate that CATTLE can be used in supporting both research and practice in personalized oncology.
Affiliation(s)
- E Soysal
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- H-J Lee
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- Y Zhang
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- L-C Huang
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- X Chen
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- Q Wei
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- W Zheng
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- J T Chang
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- T Cohen
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- J Sun
  - University of Texas Health Science Center at Houston, Houston, Texas, USA
- H Xu
  - University of Texas Health Science Center at Houston, Houston, Texas, USA