1
Schilling-Wilhelmi M, Ríos-García M, Shabih S, Gil MV, Miret S, Koch CT, Márquez JA, Jablonka KM. From text to insight: large language models for chemical data extraction. Chem Soc Rev 2024. [PMID: 39703015 DOI: 10.1039/d4cs00913d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2024]
Abstract
The vast majority of chemical knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling non-experts to extract structured, actionable data from unstructured text efficiently. While applying LLMs to chemical and materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This tutorial review provides a comprehensive overview of LLM-based structured data extraction in chemistry, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and chemical expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven chemical research. The insights presented here could significantly enhance how researchers across chemical disciplines access and utilize scientific information, potentially accelerating the development of novel compounds and materials for critical societal needs.
Affiliation(s)
- Mara Schilling-Wilhelmi
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
- Martiño Ríos-García
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Sherjeel Shabih
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- María Victoria Gil
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Christoph T Koch
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- José A Márquez
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- Kevin Maik Jablonka
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Center for Energy and Environmental Chemistry Jena (CEEC Jena), Friedrich Schiller University Jena, Philosophenweg 7a, 07743 Jena, Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Lessingstrasse 12-14, 07743 Jena, Germany
2
Ai Q, Meng F, Shi J, Pelkie B, Coley CW. Extracting structured data from organic synthesis procedures using a fine-tuned large language model. Digital Discovery 2024; 3:1822-1831. [PMID: 39157760 PMCID: PMC11322921 DOI: 10.1039/d4dd00091a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Accepted: 07/30/2024] [Indexed: 08/20/2024]
Abstract
The popularity of data-driven approaches and machine learning (ML) techniques in organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD "messages" (e.g., full compound, workups, or condition definitions) and 92.25% for individual data fields (e.g., compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.
Affiliation(s)
- Qianxiang Ai
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Fanwang Meng
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jiale Shi
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Brenden Pelkie
  - Department of Chemical Engineering, University of Washington, Seattle, WA, USA
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
3
Blakey M, Pearman-Kanza S, Frey JG. Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents. J Cheminform 2024; 16:42. [PMID: 38622746 PMCID: PMC11017645 DOI: 10.1186/s13321-024-00831-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 03/23/2024] [Indexed: 04/17/2024] Open
Abstract
PURPOSE: Wiswesser Line Notation (WLN) is an old line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been surpassed by SMILES and InChI, distribution of WLN during its active years was extensive. In the context of modernising chemical data, we present a comprehensive WLN parser developed using the OpenBabel toolkit, capable of translating WLN strings into various formats supported by the library. Furthermore, we have devised a specialised Finite State Machine, constructed from the rules of WLN, enabling the recognition and extraction of chemical strings out of large bodies of text. Available open-access WLN data with corresponding SMILES or InChI notation is rare; however, ChEMBL, ChemSpider and PubChem all contain WLN records, which were used for conversion scoring. Our investigation revealed a notable proportion of inaccuracies within the database entries, and we have taken steps to rectify these errors whenever feasible. SCIENTIFIC CONTRIBUTION: Tools for both the extraction and conversion of WLN from chemical documents have been successfully developed. Both the Deterministic Finite Automaton (DFA) and parser handle the majority of WLN rules officially endorsed in the three major WLN manuals, with the parser showing a clear jump in accuracy and chemical coverage over previous submissions. The GitHub repository can be found here: https://github.com/Mblakey/wiswesser
Affiliation(s)
- Michael Blakey
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
- Samantha Pearman-Kanza
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
- Jeremy G Frey
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
4
Dong Q, Cole JM. Snowball 2.0: Generic Material Data Parser for ChemDataExtractor. J Chem Inf Model 2023; 63:7045-7055. [PMID: 37934697 PMCID: PMC10685441 DOI: 10.1021/acs.jcim.3c01281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2023] [Revised: 10/19/2023] [Accepted: 10/20/2023] [Indexed: 11/09/2023]
Abstract
The ever-growing amount of chemical data found in the scientific literature has led to the emergence of data-driven materials discovery. The first step in the pipeline, to automatically extract chemical information from plain text, has been driven by the development of software toolkits such as ChemDataExtractor. Such data extraction processes have created a demand for parsers that efficiently enable text mining. Here, we present Snowball 2.0, a sentence parser based on a semisupervised machine-learning algorithm. It can be used to extract any chemical property without additional training. We validate its precision, recall, and F-score by training and testing a model with sentences of semiconductor band gap information curated from journal articles. Snowball 2.0 builds on two previously developed Snowball algorithms. Evaluation of Snowball 2.0 shows a 15-20% increase in recall with marginally reduced precision over the previous version which has been incorporated into ChemDataExtractor 2.0, giving Snowball 2.0 better performance in most configurations. Snowball 2.0 offers more and better parsing options for ChemDataExtractor, and it is more capable in the pipeline of automated data extraction. Snowball 2.0 also features better generalizability, performance, learning efficiencies, and user-friendliness.
Affiliation(s)
- Qingyang Dong
  - Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, U.K.
- Jacqueline M. Cole
  - Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, U.K.
  - ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
5
Wang J, Shen Z, Liao Y, Yuan Z, Li S, He G, Lan M, Qian X, Zhang K, Li H. Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space. Brief Bioinform 2022; 23:6761958. [PMID: 36252922 PMCID: PMC9677486 DOI: 10.1093/bib/bbac461] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/31/2022] [Revised: 09/21/2022] [Accepted: 09/26/2022] [Indexed: 12/14/2022] Open
Abstract
Identification of new chemical compounds with desired structural diversity and biological properties plays an essential role in drug discovery, yet the construction of such a potential space with elements of 'near-drug' properties is still a challenging task. In this work, we propose a multimodal chemical information reconstruction system to automatically process, extract and align heterogeneous information from the text descriptions and structural images of chemical patents. Our key innovation lies in a heterogeneous data generator that produces cross-modality training data in the form of text descriptions and Markush structure images, from which a two-branch model with image- and text-processing units can then learn to both recognize heterogeneous chemical entities and simultaneously capture their correspondence. In particular, we have collected chemical structures from the ChEMBL database and chemical patents from the European Patent Office and the US Patent and Trademark Office using keywords 'A61P, compound, structure' in the years from 2010 to 2020, and generated heterogeneous chemical information datasets with 210K structural images and 7818 annotated text snippets. Based on the reconstructed results and substituent replacement rules, structural libraries of a huge number of near-drug compounds can be generated automatically. In quantitative evaluations, our model correctly reconstructs 97% of the molecular images into a structured format and achieves an F1-score of around 97-98% in the recognition of chemical entities. This demonstrates the effectiveness of our model for automatic information extraction from chemical patents; we hope it can help transform them into a user-friendly, structured molecular database that enriches the near-drug space and supports intelligent retrieval of chemical knowledge.
Affiliation(s)
- Yichen Liao
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
- Zhen Yuan
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
- Shiliang Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, Shanghai 200237, China
- Gaoqi He
  - School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
- Man Lan
  - School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
- Xuhong Qian
  - Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
- Kai Zhang (corresponding author)
  - School of Computer Science and Technology, Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
- Honglin Li (corresponding author)
  - Shanghai Key Laboratory of New Drug Design, East China University of Science & Technology, Shanghai 200237, China
  - Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
6
Coghlan A, Padalino G, O'Boyle NM, Hoffmann KF, Berriman M. Identification of anti-schistosomal, anthelmintic and anti-parasitic compounds curated and text-mined from the scientific literature. Wellcome Open Res 2022; 7:193. [PMID: 36003342 PMCID: PMC9363976 DOI: 10.12688/wellcomeopenres.17987.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/08/2022] [Indexed: 11/20/2022] Open
Abstract
More than a billion people are infected with parasitic worms, including nematodes, such as hookworms, and flatworms, such as blood flukes. Few drugs are available to treat worm infections, but high-throughput screening approaches hold promise to identify novel drug candidates. One problem for researchers who find an interesting ‘hit’ from a high-throughput screen is to identify whether that compound, or a similar compound has previously been published as having anthelmintic or anti-parasitic activity. Here, we present (i) data sets of 2,828 anthelmintic compounds, and 1,269 specific anti-schistosomal compounds, manually curated from scientific papers and books, and (ii) a data set of 24,335 potential anthelmintic and anti-parasitic compounds identified by text-mining PubMed abstracts. We provide their structures in simplified molecular-input line-entry system (SMILES) format so that researchers can easily compare ‘hits’ from their screens to these anthelmintic compounds and anti-parasitic compounds and find previous literature on them to support/halt their progression in drug discovery pipelines.
Affiliation(s)
- Avril Coghlan
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
- Gilda Padalino
  - The Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Wales, SY23 3DA, UK
  - School of Pharmacy and Pharmaceutical Sciences, Cardiff University, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB, UK
- Noel M. O'Boyle
  - NextMove Software Ltd., Cambridge Science Park, Milton Rd., Cambridge, CB4 0WG, UK
  - Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG, UK
- Karl F. Hoffmann
  - The Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Wales, SY23 3DA, UK
- Matthew Berriman
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
  - Wellcome Centre for Integrative Parasitology, Institute of Infection, Immunity and Inflammation, University of Glasgow, 120 University Place, Glasgow, G12 8TA, UK
7
Auto-generated database of semiconductor band gaps using ChemDataExtractor. Sci Data 2022; 9:193. [PMID: 35504897 PMCID: PMC9065101 DOI: 10.1038/s41597-022-01294-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 05/18/2021] [Accepted: 02/08/2022] [Indexed: 11/16/2022] Open
Abstract
Large-scale databases of band gap information about semiconductors that are curated from the scientific literature are highly useful for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using ChemDataExtractor version 2.0, a ‘chemistry-aware’ software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to achieve optimal quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery.
Measurement(s): semiconductor band gaps. Technology type(s): natural language processing.
8
López-Úbeda P, Díaz-Galiano MC, Ureña-López LA, Martín-Valdivia MT. Combining word embeddings to extract chemical and drug entities in biomedical literature. BMC Bioinformatics 2021; 22:599. [PMID: 34920708 PMCID: PMC8684055 DOI: 10.1186/s12859-021-04188-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/19/2021] [Accepted: 05/12/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. METHODS: In this paper we evaluate two important NLP tasks: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve the results obtained in the PharmaCoNER challenge. RESULTS: For the NER task we present a neural network composed of a BiLSTM with a CRF sequential layer, where different word embeddings are combined as an input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. In the supervised model, we use the training set to find previously trained concepts, and the unsupervised model is based on a 6-step architecture. This architecture uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. CONCLUSION: On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature. We achieved results of 91.41% for precision, 90.14% for recall, and 90.77% for F1-score using micro-averaging. On the other hand, our indexing system achieves a 92.67% F1-score, 92.44% for recall, and 92.91% for precision. These results would place us first in the final ranking.
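The dictionary-plus-Levenshtein lookup mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' 6-step architecture: the synonym table, the placeholder codes (which are not real SNOMED-CT identifiers), the `max_distance` threshold, and all function names are assumptions made for illustration. The small `f1` helper only checks that the reported F1-scores are the harmonic mean of the reported precision and recall.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical synonym dictionary: surface form -> placeholder concept code.
SYNONYMS = {
    "paracetamol": "CODE-A",
    "acetaminophen": "CODE-A",
    "ibuprofen": "CODE-B",
}


def assign_code(mention, synonyms=SYNONYMS, max_distance=2):
    """Return the code of the closest synonym, or None if nothing is close enough."""
    mention = mention.lower().strip()
    if mention in synonyms:  # exact dictionary hit
        return synonyms[mention]
    best = min(synonyms, key=lambda s: levenshtein(mention, s))
    return synonyms[best] if levenshtein(mention, best) <= max_distance else None


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(assign_code("Paracetamoll"))   # misspelling, one edit away from a dictionary entry
print(round(f1(91.41, 90.14), 2))    # NER figures from the abstract
print(round(f1(92.91, 92.44), 2))    # indexing figures from the abstract
```

A distance threshold keeps the lookup from forcing every unknown mention onto some code; the right value would depend on the terminology and would need tuning on held-out data.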
Affiliation(s)
- Pilar López-Úbeda
  - Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
- Manuel Carlos Díaz-Galiano
  - Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
- L Alfonso Ureña-López
  - Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
- M Teresa Martín-Valdivia
  - Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
9
Mavračić J, Court CJ, Isazawa T, Elliott SR, Cole JM. ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science. J Chem Inf Model 2021; 61:4280-4289. [PMID: 34529432 DOI: 10.1021/acs.jcim.1c00446] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Indexed: 11/28/2022]
Abstract
The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. While in the past, in the physical sciences domain, the focus has been on the precise extraction of individual properties, attention has recently been devoted to the extraction of higher-level relationships. Here, we present a framework for an automated population of ontologies. That is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles, yielding an overall precision of 92.2%, across 26 different journals. Our method and associated toolkit, ChemDataExtractor 2.0, offers a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.
Affiliation(s)
- Juraj Mavračić
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
  - Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Callum J Court
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Taketomo Isazawa
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Stephen R Elliott
  - Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Jacqueline M Cole
  - Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0FS, U.K.
  - ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
10
Li Z, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Lexicon Knowledge Boosted Interaction Graph Network for Adverse Drug Reaction Recognition From Social Media. IEEE J Biomed Health Inform 2021; 25:2777-2786. [PMID: 33275589 DOI: 10.1109/jbhi.2020.3042549] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
The World Health Organization underlines the significance of adverse drug reaction (ADR) reports for patients' safety. In practice, many potential ADRs tend to be under-reported in post-market ADR surveillance. Recognizing ADRs from social media is important and could complement post-market ADR surveillance for more effective pharmacovigilance studies. However, previous approaches face two challenges: 1) ADRs show high expression variability in social media, and thus many potential ADRs are out-of-lexicon ones, which are difficult to recognize, and 2) most phrasal ADRs are non-standard mentions and their boundaries are difficult to identify accurately. To tackle these challenges, we design three interaction graphs and propose a neural network approach, i.e., Interaction Graph Network (IGN). Specifically, to recognize more out-of-lexicon ADRs, besides the mentions in the ADR lexicon, noun phrases in the input sentence are regarded as candidate phrases and their features are taken into consideration. Moreover, in an attempt to accurately identify ADR boundaries, three word-phrase interaction graphs are designed to represent lexicon knowledge and are encoded using graph attention networks (GATs) to directly integrate various boundary and contextual information of candidate phrases into ADR recognition. Experimental results on two benchmark datasets show that IGN can recognize ADRs accurately and consistently outperforms other state-of-the-art approaches.
11
Zaslavsky L, Cheng T, Gindulyte A, He S, Kim S, Li Q, Thiessen P, Yu B, Bolton EE. Discovering and Summarizing Relationships Between Chemicals, Genes, Proteins, and Diseases in PubChem. Front Res Metr Anal 2021; 6:689059. [PMID: 34322655 PMCID: PMC8311438 DOI: 10.3389/frma.2021.689059] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/31/2021] [Accepted: 06/17/2021] [Indexed: 11/13/2022] Open
Abstract
The literature knowledge panels developed and implemented in PubChem are described. These help to uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature abstracts. Named entities in PubMed records are matched with chemical names in PubChem, disease names in Medical Subject Headings (MeSH), and gene/protein names in popular gene/protein information resources, and the most closely related entities are identified using statistical analysis and relevance-based sampling. Knowledge panels for the co-occurrence of chemical, disease, and gene/protein entities are included in PubChem Compound, Protein, and Gene pages, summarizing these in a compact form. Statistical methods for removing redundancy and estimating relevance scores are discussed, along with benefits and pitfalls of relying on automated (i.e., not human-curated) methods operating on data from multiple heterogeneous sources.
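The co-occurrence idea described in the abstract can be sketched in a few lines. The abstract does not specify which statistics PubChem uses, so pointwise mutual information (PMI) is swapped in here purely as a familiar association score; the input format (one set of entity identifiers per abstract), the toy data, and the function names are likewise assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count entity and entity-pair occurrences.

    docs: iterable of sets of entity identifiers, one set per abstract.
    Returns (number of documents, per-entity counts, per-pair counts).
    """
    n = 0
    single, pair = Counter(), Counter()
    for entities in docs:
        n += 1
        single.update(entities)
        pair.update(frozenset(p) for p in combinations(sorted(entities), 2))
    return n, single, pair


def pmi(n, single, pair, a, b):
    """Pointwise mutual information (base 2) of two entities; -inf if never together."""
    joint = pair[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (single[a] * single[b]))


# Toy input standing in for annotated abstracts (hypothetical data).
docs = [
    {"aspirin", "pain"},
    {"aspirin", "pain", "COX1"},
    {"caffeine", "pain"},
    {"caffeine"},
]
n, single, pair = cooccurrence_counts(docs)
print(pmi(n, single, pair, "aspirin", "pain"))
```

Ranking an entity's neighbours by such a score, rather than by raw counts, keeps very frequent terms from dominating every panel; any redundancy removal or sampling step would sit on top of this.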
Affiliation(s)
- Leonid Zaslavsky
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Tiejun Cheng
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Asta Gindulyte
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Siqian He
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Sunghwan Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Qingliang Li
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Paul Thiessen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Bo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
- Evan E Bolton
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
12
Alfattni G, Belousov M, Peek N, Nenadic G. Extracting Drug Names and Associated Attributes From Discharge Summaries: Text Mining Study. JMIR Med Inform 2021; 9:e24678. [PMID: 33949962 PMCID: PMC8135022 DOI: 10.2196/24678] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/30/2020] [Revised: 02/15/2021] [Accepted: 02/20/2021] [Indexed: 11/13/2022] Open
Abstract
Background: Drug prescriptions are often recorded in free-text clinical narratives; making this information available in a structured form is important to support many health-related tasks. Although several natural language processing (NLP) methods have been proposed to extract such information, many challenges remain. Objective: This study evaluates the feasibility of using NLP and deep learning approaches for extracting and linking drug names and associated attributes identified in clinical free-text notes and presents an extensive error analysis of different methods. This study began with participation in the 2018 National NLP Clinical Challenges (n2c2) shared task on adverse drug events and medication extraction. Methods: The proposed system (DrugEx) consists of a named entity recognizer (NER) to identify drugs and associated attributes and a relation extraction (RE) method to identify the relations between them. For NER, we explored deep learning-based approaches (ie, bidirectional long short-term memory with conditional random fields [BiLSTM-CRFs]) with various embeddings (ie, word embedding, character embedding [CE], and semantic-feature embedding) to investigate how different embeddings influence the performance. A rule-based method was implemented for RE and compared with a context-aware long short-term memory (LSTM) model. The methods were trained and evaluated using the 2018 n2c2 shared task data. Results: The experiments showed that the best model (BiLSTM-CRFs with pretrained word embeddings [PWE] and CE) achieved lenient micro F-scores of 0.921 for NER, 0.927 for RE, and 0.855 for the end-to-end system. NER, which relies on the pretrained word and semantic embeddings, performed better on most individual entity types, but NER with PWE and CE had the highest classification efficiency among the proposed approaches. Extracting relations using the rule-based method achieved higher accuracy than the context-aware LSTM for most relations. Interestingly, the LSTM model performed notably better in the reason-drug relations, the most challenging relation type. Conclusions: The proposed end-to-end system achieved encouraging results and demonstrated the feasibility of using deep learning methods to extract medication information from free-text data.
Affiliation(s)
- Ghada Alfattni
- Department of Computer Science, University of Manchester, Manchester, United Kingdom
- Department of Computer Science, Jamoum University College, Umm Al-Qura University, Makkah, Saudi Arabia
| | - Maksim Belousov
- Department of Computer Science, University of Manchester, Manchester, United Kingdom
| | - Niels Peek
- Centre for Health Informatics, Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, United Kingdom
- National Institute of Health Research Manchester Biomedical Research Centre, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom
- The Alan Turing Institute, Manchester, United Kingdom
| | - Goran Nenadic
- Department of Computer Science, University of Manchester, Manchester, United Kingdom
- The Alan Turing Institute, Manchester, United Kingdom
| |
|
13
|
He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H, Albahem A, Cavedon L, Cohn T, Baldwin T, Verspoor K. ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents. Front Res Metr Anal 2021; 6:654438. [PMID: 33870071 PMCID: PMC8028406 DOI: 10.3389/frma.2021.654438] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2021] [Accepted: 02/24/2021] [Indexed: 11/21/2022] Open
Abstract
Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.
Affiliation(s)
- Jiayuan He
- The University of Melbourne, Parkville, VIC, Australia
- RMIT University, Melbourne, VIC, Australia
| | - Dat Quoc Nguyen
- The University of Melbourne, Parkville, VIC, Australia
- VinAI Research, Hanoi, Vietnam
| | | | | | - Camilo Thorne
- Elsevier Information Systems GmbH, Frankfurt, Germany
| | - Ralph Hoessel
- Elsevier Information Systems GmbH, Frankfurt, Germany
| | | | - Zenan Zhai
- The University of Melbourne, Parkville, VIC, Australia
| | - Biaoyan Fang
- The University of Melbourne, Parkville, VIC, Australia
| | - Hiyori Yoshikawa
- The University of Melbourne, Parkville, VIC, Australia
- Fujitsu Laboratories Ltd., Tokyo, Japan
| | - Ameer Albahem
- The University of Melbourne, Parkville, VIC, Australia
- RMIT University, Melbourne, VIC, Australia
| | | | - Trevor Cohn
- The University of Melbourne, Parkville, VIC, Australia
| | | | - Karin Verspoor
- The University of Melbourne, Parkville, VIC, Australia
- RMIT University, Melbourne, VIC, Australia
| |
|
14
|
Kononova O, He T, Huo H, Trewartha A, Olivetti EA, Ceder G. Opportunities and challenges of text mining in materials research. iScience 2021; 24:102155. [PMID: 33665573 PMCID: PMC7905448 DOI: 10.1016/j.isci.2021.102155] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
Research publications are the major repository of scientific knowledge. However, their unstructured and highly heterogeneous format creates a significant obstacle to large-scale analysis of the information contained within. Recent progress in natural language processing (NLP) has provided a variety of tools for high-quality information extraction from unstructured text. These tools are primarily trained on non-technical text and struggle to produce accurate results when applied to scientific text, which involves specific technical terminology. In recent years, significant efforts in information retrieval have been made for biomedical and biochemical publications. For materials science, text mining (TM) methodology is still at the dawn of its development. In this review, we survey the recent progress in creating and applying TM and NLP approaches to the materials science field. This review is directed at the broad class of researchers aiming to learn the fundamentals of TM as applied to materials science publications.
Affiliation(s)
- Olga Kononova
- Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Tanjin He
- Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Haoyan Huo
- Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Amalie Trewartha
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Elsa A. Olivetti
- Department of Materials Science & Engineering, MIT, Cambridge, MA 02139, USA
| | - Gerbrand Ceder
- Department of Materials Science & Engineering, University of California, Berkeley, CA 94720, USA
- Materials Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| |
|
15
|
|
16
|
Schwaller P, Probst D, Vaucher AC, Nair VH, Kreutter D, Laino T, Reymond JL. Mapping the space of chemical reactions using attention-based neural networks. Nat Mach Intell 2021. [DOI: 10.1038/s42256-020-00284-w] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
17
|
Wilbraham L, Mehr SHM, Cronin L. Digitizing Chemistry Using the Chemical Processing Unit: From Synthesis to Discovery. Acc Chem Res 2021; 54:253-262. [PMID: 33370095 DOI: 10.1021/acs.accounts.0c00674] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
Abstract
The digitization of chemistry is not simply about using machine learning or artificial intelligence systems to process chemical data, or about the development of ever more capable automation hardware; instead, it is the creation of a hard link between an abstracted process ontology of chemistry and bespoke hardware for performing reactions or exploring reactivity. Chemical digitization is therefore about the unambiguous development of an architecture, a chemical state machine, that uses this ontology to connect precise instruction sets to hardware that performs chemical transformations. This approach enables a universal standard for describing chemistry procedures via a chemical programming language and facilitates unambiguous dissemination of these procedures. We predict that this standard will revolutionize the ability of chemists to collaborate, increase reproducibility and safety, as well as optimize for cost and efficiency. Most importantly, the digitization of chemistry will dramatically reduce the labor needed to make new compounds and broaden accessible chemical space. In recent years, the developments of automation in chemistry have gone beyond flow chemistry alone, with many bespoke workflows being developed not only for automating chemical synthesis but also for materials, nanomaterials, and formulation production. Indeed, the leap from fixed-configuration synthesis machines like peptide, nucleic acid, or dedicated cross-coupling engines is important for developing a truly universal approach to "dial-a-molecule". In this case, a key conceptual leap is the use of a batch system that can encode the chemical reagents, solvent, and products as packets which can be moved around the system, and a graph-based approach for the description of hardware modules that allows the compilation of chemical code that runs on, in principle, any hardware.
Further, the integration of sensor systems for monitoring and controlling the state of the chemical synthesis machine, as well as high resolution spectroscopic tools, is vital if these systems are to facilitate closed-loop autonomous experiments. Systems that not only make molecules and materials, but also optimize their function, and use algorithms to assist with the development of new synthetic pathways and process optimization are also possible. Here, we discuss how the digitization of chemistry is happening, building on the plethora of technological developments in hardware and software. Importantly, digital-chemical robot systems need to integrate feedback from simple sensors, e.g., conductivity or temperature, as well as online analytics in order to navigate process space autonomously. This will open the door to accessing known molecules (synthesis), exploring whether known compounds/reactions are possible under new conditions (optimization), and searching chemical space for unknown and unexpected new molecules, reactions, and modes of reactivity (discovery). We will also discuss the role of chemical knowledge and how this can be used to challenge bias, as well as define and expand synthetically accessible chemical space using programmable robotic chemical state machines.
Affiliation(s)
- Liam Wilbraham
- School of Chemistry, The University of Glasgow, University Avenue, Glasgow G12 8QQ, United Kingdom
| | - S. Hessam M. Mehr
- School of Chemistry, The University of Glasgow, University Avenue, Glasgow G12 8QQ, United Kingdom
| | - Leroy Cronin
- School of Chemistry, The University of Glasgow, University Avenue, Glasgow G12 8QQ, United Kingdom
| |
|
19
|
Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 2021; 49:D1388-D1395. [PMID: 33151290 PMCID: PMC7778930 DOI: 10.1093/nar/gkaa971] [Citation(s) in RCA: 1904] [Impact Index Per Article: 476.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 10/06/2020] [Accepted: 10/11/2020] [Indexed: 02/06/2023] Open
Abstract
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource that serves the scientific community as well as the general public, with millions of unique users per month. In the past two years, PubChem made substantial improvements. Data from more than 100 new data sources were added to PubChem, including chemical-literature links from Thieme Chemistry, chemical and physical property links from SpringerMaterials, and patent links from the World Intellectual Property Organization (WIPO). PubChem's homepage and individual record pages were updated to help users find desired information faster. This update involved a data model change for the data objects used by these pages as well as by programmatic users. Several new services were introduced, including the PubChem Periodic Table and Element pages, Pathway pages, and Knowledge panels. Additionally, in response to the coronavirus disease 2019 (COVID-19) outbreak, PubChem created a special data collection that contains PubChem data related to COVID-19 and the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
Affiliation(s)
- Sunghwan Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Jie Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Tiejun Cheng
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Asta Gindulyte
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Jia He
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Siqian He
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Qingliang Li
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Benjamin A Shoemaker
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Paul A Thiessen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Bo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Leonid Zaslavsky
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Jian Zhang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| | - Evan E Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, 20894, USA
| |
|
20
|
Abstract
The discovery of materials is an important element in the development of new technologies and abilities that can help humanity tackle many challenges. Materials discovery is frustratingly slow, with the large time and resource cost often providing only small gains in property performance. Furthermore, researchers are unwilling to take large risks that they will only know the outcome of months or years later. Computation is playing an increasing role in allowing rapid screening of large numbers of materials from vast search space to identify promising candidates for laboratory synthesis and testing. However, there is a problem, in that many materials computationally predicted to have encouraging properties cannot be readily realised in the lab. This minireview looks at how we can tackle the problem of confirming that hypothetical materials are synthetically realisable, through consideration of all the stages of the materials discovery process, from obtaining the components, reacting them to a material in the correct structure, through to processing into a desired form. In an ideal world, a material prediction would come with an associated 'recipe' for the successful laboratory preparation of the material. We discuss the opportunity to thus prevent wasted effort in experimental discovery programmes, including those using automation, to accelerate the discovery of novel materials.
Affiliation(s)
- Filip T Szczypiński
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub White City Campus, Wood Lane London W12 0BZ UK
| | - Steven Bennett
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub White City Campus, Wood Lane London W12 0BZ UK
| | - Kim E Jelfs
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub White City Campus, Wood Lane London W12 0BZ UK
| |
|
21
|
Whiteland H, Crusco A, Bloemberg LW, Tibble-Howlings J, Forde-Thomas J, Coghlan A, Murphy PJ, Hoffmann KF. Quorum sensing N-Acyl homoserine lactones are a new class of anti-schistosomal. PLoS Negl Trop Dis 2020; 14:e0008630. [PMID: 33075069 PMCID: PMC7595621 DOI: 10.1371/journal.pntd.0008630] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2020] [Revised: 10/29/2020] [Accepted: 09/24/2020] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Schistosomiasis is a prevalent neglected tropical disease that affects approximately 300 million people worldwide. Its treatment is through a single class chemotherapy, praziquantel. Concerns surrounding the emergence of praziquantel insensitivity have led to a need for developing novel anthelmintics. METHODOLOGY/PRINCIPAL FINDINGS Through evaluating and screening fourteen compounds (initially developed for anti-cancer and anti-viral projects) against Schistosoma mansoni, one of three species responsible for most cases of human schistosomiasis, a racemic N-acyl homoserine (1) demonstrated good efficacy against all intra mammalian lifecycle stages including schistosomula (EC50 = 4.7 μM), juvenile worms (EC50 = 4.3 μM) and adult worms (EC50 = 8.3 μM). To begin exploring structure-activity relationships, a further 8 analogues of this compound were generated, including individual (R)- and (S)- enantiomers. Upon anti-schistosomal screening of these analogues, the (R)- enantiomer retained activity, whereas the (S)- lost activity. Furthermore, modification of the lactone ring to a thiolactone ring (3) improved potency against schistosomula (EC50 = 2.1 μM), juvenile worms (EC50 = 0.5 μM) and adult worms (EC50 = 4.8 μM). As the effective racemic parent compound is structurally similar to quorum sensing signaling peptides used by bacteria, further evaluation of its effect (along with its stereoisomers and the thiolactone analogues) against Gram+ (Staphylococcus aureus) and Gram- (Escherichia coli) species was conducted. While some activity was observed against both Gram+ and Gram- bacteria species for the racemic compound 1 (MIC 125 mg/L), the (R) stereoisomer had better activity (125 mg/L) than the (S) (>125 mg/L). However, the greatest antimicrobial activity (MIC 31.25 mg/L against S. aureus) was observed for the thiolactone containing analogue (3).
CONCLUSION/SIGNIFICANCE To the best of our knowledge, this is the first demonstration that N-Acyl homoserines exhibit anthelmintic activities. Furthermore, their additional action on Gram+ bacteria opens a new avenue for exploring these molecules more broadly as part of future anti-infective initiatives.
Affiliation(s)
- Helen Whiteland
- Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Wales, United Kingdom
| | - Alessandra Crusco
- Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Wales, United Kingdom
| | - Lisa W. Bloemberg
- Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Wales, United Kingdom
| | | | - Josephine Forde-Thomas
- Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Wales, United Kingdom
| | - Avril Coghlan
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
| | - Patrick J. Murphy
- School of Natural Sciences, Bangor University, Gwynedd, United Kingdom
| | - Karl F. Hoffmann
- Institute of Biological, Environmental and Rural Sciences (IBERS), Aberystwyth University, Aberystwyth, Wales, United Kingdom
| |
|
22
|
Vaucher AC, Zipoli F, Geluykens J, Nair VH, Schwaller P, Laino T. Automated extraction of chemical synthesis actions from experimental procedures. Nat Commun 2020; 11:3601. [PMID: 32681088 PMCID: PMC7367864 DOI: 10.1038/s41467-020-17266-6] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2019] [Accepted: 06/15/2020] [Indexed: 11/09/2022] Open
Abstract
Experimental procedures for chemical synthesis are commonly reported in prose in patents or in the scientific literature. The extraction of the details necessary to reproduce and validate a synthesis in a chemical laboratory is often a tedious task requiring extensive human intervention. We present a method to convert unstructured experimental procedures written in English to structured synthetic steps (action sequences) reflecting all the operations needed to successfully conduct the corresponding chemical reactions. To achieve this, we design a set of synthesis actions with predefined properties and a deep-learning sequence to sequence model based on the transformer architecture to convert experimental procedures to action sequences. The model is pretrained on vast amounts of data generated automatically with a custom rule-based natural language processing approach and refined on manually annotated samples. Predictions on our test set result in a perfect (100%) match of the action sequence for 60.8% of sentences, a 90% match for 71.3% of sentences, and a 75% match for 82.4% of sentences.
Affiliation(s)
- Alain C Vaucher
- IBM Research Europe, Säumerstrasse 4, Rüschlikon, 8803, Switzerland.
| | - Federico Zipoli
- IBM Research Europe, Säumerstrasse 4, Rüschlikon, 8803, Switzerland
| | - Joppe Geluykens
- IBM Research Europe, Säumerstrasse 4, Rüschlikon, 8803, Switzerland
| | - Vishnu H Nair
- IBM Research Europe, Säumerstrasse 4, Rüschlikon, 8803, Switzerland
| | | | - Teodoro Laino
- IBM Research Europe, Säumerstrasse 4, Rüschlikon, 8803, Switzerland
| |
|
23
|
Savery ME, Rogers WJ, Pillai M, Mork JG, Demner-Fushman D. Chemical Entity Recognition for MEDLINE Indexing. AMIA Jt Summits Transl Sci Proc 2020; 2020:561-568. [PMID: 32477678 PMCID: PMC7233078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Chemical entity recognition is essential for indexing scientific literature in the MEDLINE database at the National Library of Medicine. However, the tool currently used to suggest terms for indexing, the Medical Text Indexer, was not originally conceived as a chemical recognition tool. It has instead been adapted to the task via its use of MetaMap and the addition of in-house patterns and rules. In order to develop a tool more suitable for chemical recognition, we have created a collection of 200 MEDLINE titles and abstracts annotated with genes, proteins, inorganic and organic chemicals, as well as other biological molecules. We use this collection to evaluate eleven chemical entity recognition systems, where we seek to identify a tool that effectively recognizes chemical entities for indexing and also performs well on chemical recognition beyond the indexing task. We observe the highest performance with a SciBERT ensemble.
Affiliation(s)
- Max E Savery
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Willie J Rogers
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Malvika Pillai
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - James G Mork
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
| |
|
24
|
Abstract
The world needs new materials to stimulate the chemical industry in key sectors of our economy: environment and sustainability, information storage, optical telecommunications, and catalysis. Yet, nearly all functional materials are still discovered by "trial-and-error", of which the lack of predictability affords a major materials bottleneck to technological innovation. The average "molecule-to-market" lead time for materials discovery is currently 20 years. This is far too long for industrial needs, as highlighted by the Materials Genome Initiative, which has ambitious targets of up to 4-fold reductions in average molecule-to-market lead times. Such a large step change in progress can only be realistically achieved if one adopts an entirely new approach to materials discovery. Fortunately, a fundamentally new approach to materials discovery has been emerging, whereby data science with artificial intelligence offers a prospective solution to speed up these average molecule-to-market lead times. This approach is known as data-driven materials discovery. Its broad prospects have only recently become a reality, given the timely and major advances in "big data", artificial intelligence, and high-performance computing (HPC). Access to massive data sets has been stimulated by government-regulated open-access requirements for data and literature. Natural-language processing (NLP) and machine-learning (ML) tools that can mine data and find patterns therein are becoming mainstream. Exascale HPC capabilities that can aid data mining and pattern recognition and also generate their own data from calculations are now within our grasp.
These timely advances present an ideal opportunity to develop data-driven materials-discovery strategies to systematically design and predict new chemicals for a given device application. This Account shows how data science can afford materials discovery via a four-step "design-to-device" pipeline that entails (1) data extraction, (2) data enrichment, (3) material prediction, and (4) experimental validation. Massive databases of cognate chemical and property information are first forged from "chemistry-aware" natural-language-processing tools, such as ChemDataExtractor, and enriched using machine-learning methods and high-throughput quantum-chemical calculations. New materials for a bespoke application can then be predicted by mining these databases with algorithmic encodings of relationships between chemical structures and physical properties that are known to deliver functional materials. These may take the form of classification, enumeration, or machine-learning algorithms. A data-mining workflow short-lists these predictions to a handful of lead candidate materials that go forward to experimental validation. This design-to-device approach is being developed to offer a roadmap for the accelerated discovery of new chemicals for functional applications. Case studies presented demonstrate its utility for photovoltaic, optical, and catalytic applications. While this Account is focused on applications in the physical sciences, the generic pipeline discussed is readily transferable to other scientific disciplines such as biology and medicine.
Affiliation(s)
- Jacqueline M. Cole
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K
- ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K
- Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K
- Mathematical Institute, University of Oxford, Woodstock Road, Oxford OX2 6GG, U.K
|
25
|
Xu K, Yang Z, Kang P, Wang Q, Liu W. Document-level attention-based BiLSTM-CRF incorporating disease dictionary for disease named entity recognition. Comput Biol Med 2019; 108:122-132. [PMID: 31003175 DOI: 10.1016/j.compbiomed.2019.04.002] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2018] [Revised: 04/01/2019] [Accepted: 04/01/2019] [Indexed: 02/06/2023]
Abstract
BACKGROUND Disease named entity recognition (NER) plays an important role in biomedical research. There are a significant number of challenging issues to be addressed; among these, the identification of rare diseases and complex disease names and the problem of tagging inconsistency (i.e., if an entity is tagged differently in a document) are attracting substantial research attention. METHODS We propose a new neural network method named Dic-Att-BiLSTM-CRF (DABLC) for disease NER. DABLC applies an efficient exact string matching method to match disease entities with a disease dictionary; here, the dictionary is constructed based on the Disease Ontology. Furthermore, DABLC constructs a dictionary attention layer by incorporating a disease dictionary matching method and document-level attention mechanism. Finally, a bidirectional long short-term memory network and conditional random field (BiLSTM-CRF) with a dictionary attention layer is proposed to combine the disease dictionary to develop disease NER. RESULTS Extensive experiments are conducted on two widely-used corpora: the NCBI disease corpus and the BioCreative V CDR corpus. We apply each test on 10 executions of each model, with a 95% confidence interval. DABLC achieves the highest F1 scores (NCBI: Precision = 0.883, Recall = 0.89, F1 = 0.886; BioCreative V CDR: Precision = 0.891, Recall = 0.875, F1 = 0.883), outperforming the state-of-the-art methods. CONCLUSION DABLC combines the advantages of both external dictionary resources and deep attention neural networks. This aids the identification of rare diseases and complex disease names; moreover, it reduces the impact of tagging inconsistency. Special disease NER and deep learning models addressing long sentences are noteworthy areas for future examination.
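The F1 scores quoted in this abstract can be sanity-checked against the quoted precision and recall. The following is a minimal sketch, not code from the paper: the function name `f1_score` and the `reported` dictionary are illustrative assumptions, and because the abstract reports results over 10 executions, the F1 of the quoted precision and recall need not match the quoted F1 exactly in general.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract above.
reported = {
    "NCBI": (0.883, 0.89, 0.886),
    "BioCreative V CDR": (0.891, 0.875, 0.883),
}

for corpus, (p, r, f1) in reported.items():
    # Compare the recomputed value with the reported one at 3 decimals.
    print(f"{corpus}: computed F1 = {f1_score(p, r):.3f}, reported F1 = {f1}")
```

For both corpora the recomputed value agrees with the reported F1 to three decimal places.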
Affiliation(s)
- Kai Xu
- Department of Computer Science, Guangdong University of Technology, Guangzhou, China.
| | - Zhenguo Yang
- Department of Computer Science, Guangdong University of Technology, Guangzhou, China; Department of Computer Science, City University of Hong Kong, Hong Kong, China.
| | - Peipei Kang
- Department of Computer Science, Guangdong University of Technology, Guangzhou, China.
| | - Qi Wang
- Department of Computer Science, Guangdong University of Technology, Guangzhou, China.
| | - Wenyin Liu
- Department of Computer Science, Guangdong University of Technology, Guangzhou, China.
|
26
|
Korvigo I, Holmatov M, Zaikovskii A, Skoblov M. Putting hands to rest: efficient deep CNN-RNN architecture for chemical named entity recognition with no hand-crafted rules. J Cheminform 2018; 10:28. [PMID: 29796778 PMCID: PMC5966369 DOI: 10.1186/s13321-018-0280-0] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2017] [Accepted: 05/14/2018] [Indexed: 11/10/2022] Open
Abstract
Chemical named entity recognition (NER) is an active field of research in biomedical natural language processing. To facilitate the development of new and superior chemical NER systems, BioCreative released the CHEMDNER corpus, an extensive dataset of diverse manually annotated chemical entities. Most of the systems trained on the corpus rely on complicated hand-crafted rules or curated databases for data preprocessing, feature extraction and output post-processing, though modern machine learning algorithms, such as deep neural networks, can automatically design the rules with little to no human intervention. Here we explored this approach by experimenting with various deep learning architectures for targeted tokenisation and named entity recognition. Our final model, based on a combination of convolutional and stateful recurrent neural networks with attention-like loops and hybrid word- and character-level embeddings, reaches near human-level performance on the testing dataset with no manually asserted rules. To make our model easily accessible for standalone use and integration in third-party software, we've developed a Python package with a minimalistic user interface.
Affiliation(s)
- Ilia Korvigo
- Laboratory of Functional Analysis of the Genome, Moscow Institute of Physics and Technology, Moscow, Russia
- All-Russia Institute for Agricultural Microbiology, St. Petersburg, Russia
- ITMO University, St. Petersburg, Russia
| | - Maxim Holmatov
- St. Petersburg State Pediatric Medical University, St. Petersburg, Russia
- N.N. Petrov Institute of Oncology, Department of Tumor Biology, St. Petersburg, Russia
| | | | - Mikhail Skoblov
- Laboratory of Functional Analysis of the Genome, Moscow Institute of Physics and Technology, Moscow, Russia
- School of Biomedicine, Far Eastern Federal University, Vladivostok, Russia
- Laboratory of Functional Genomics, Research Centre for Medical Genetics, Moscow, Russia
|
27
|
Eftimov T, Koroušić Seljak B, Korošec P. A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations. PLoS One 2017. [PMID: 28644863 PMCID: PMC5482438 DOI: 10.1371/journal.pone.0179488] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
Evidence-based dietary information represented as unstructured text is crucial information that needs to be accessed to help dietitians keep up with the new knowledge that arrives daily in newly published scientific reports. Different named-entity recognition (NER) methods have been introduced previously to extract useful information from the biomedical literature. They focus on, for example, extracting gene mentions, protein mentions, relationships between genes and proteins, chemical concepts and relationships between drugs and diseases. In this paper, we present a novel NER method, called drNER, for knowledge extraction of evidence-based dietary information. To the best of our knowledge, this is the first attempt at extracting dietary concepts. DrNER is a rule-based NER that consists of two phases. The first one involves the detection and determination of the entity mentions, and the second one involves the selection and extraction of the entities. We evaluate the method by using text corpora from heterogeneous sources, including text from several scientifically validated web sites and text from scientific publications. Evaluation of the method showed that drNER gives good results and can be used for knowledge extraction of evidence-based dietary recommendations.
Affiliation(s)
- Tome Eftimov
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
| | | | - Peter Korošec
- Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia
- Faculty of Mathematics, Natural Science and Information Technologies, Koper, Slovenia
|
28
|
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 13.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
| | - Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
| | - Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
|
29
|
Schneider N, Stiefl N, Landrum GA. What's What: The (Nearly) Definitive Guide to Reaction Role Assignment. J Chem Inf Model 2016; 56:2336-2346. [PMID: 28024398 DOI: 10.1021/acs.jcim.6b00564] [Citation(s) in RCA: 66] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
When analyzing chemical reactions it is essential to know which molecules are actively involved in the reaction and which educts will form the product molecules. Assigning reaction roles, like reactant, reagent, or product, to the molecules of a chemical reaction might be a trivial problem for hand-curated reaction schemes but it is more difficult to automate, an essential step when handling large amounts of reaction data. Here, we describe a new fingerprint-based and data-driven approach to assign reaction roles which is also applicable to rather unbalanced and noisy reaction schemes. Given a set of molecules involved and knowing the product(s) of a reaction we assign the most probable reactants and sort out the remaining reagents. Our approach was validated using two different data sets: one hand-curated data set comprising about 680 diverse reactions extracted from patents which span more than 200 different reaction types and include up to 18 different reactants. A second set consists of 50 000 randomly picked reactions from US patents. The results of the second data set were compared to results obtained using two different atom-to-atom mapping algorithms. For both data sets our method assigns the reaction roles correctly for the vast majority of the reactions, achieving an accuracy of 88% and 97% respectively. The median time needed, about 8 ms, indicates that the algorithm is fast enough to be applied to large collections. The new method is available as part of the RDKit toolkit and the data sets and Jupyter notebooks used for evaluation of the new method are available in the Supporting Information of this publication.
Affiliation(s)
- Nadine Schneider
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
|
30
|
Swain MC, Cole JM. ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. J Chem Inf Model 2016; 56:1894-1904. [PMID: 27669338 DOI: 10.1021/acs.jcim.6b00207] [Citation(s) in RCA: 167] [Impact Index Per Article: 18.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
The emergence of "big data" initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars that are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe document-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of chemical databases since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in the text. The performance of the toolkit to correctly extract various types of data was evaluated, affording an F-score of 93.4%, 86.8%, and 91.5% for extracting chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively; set against the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All tools have been released under the MIT license and are available to download from http://www.chemdataextractor.org.
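To make the idea of a rule-based extraction grammar concrete, here is a deliberately tiny sketch using only Python's standard `re` module. It is not ChemDataExtractor's API or grammar: the pattern `MP_PATTERN` and the function `extract_melting_points` are hypothetical names introduced for illustration, and the rule covers only a few simple phrasings of melting points in degrees Celsius.

```python
import re

# Hypothetical toy rule: a melting-point cue, an optional linking word,
# a value or a value range, and a degrees-Celsius unit.
MP_PATTERN = re.compile(
    r"\b(?:melting point|m\.p\.|mp)\s*(?:of|is|was|:|=)?\s*"
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return (low, high) tuples in °C; single values give low == high."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else low
        results.append((low, high))
    return results


sample = "White solid, mp 152-154 °C. The melting point was 80.5 °C."
print(extract_melting_points(sample))  # [(152.0, 154.0), (80.5, 80.5)]
```

A single regular expression like this breaks quickly on real text (other units, qualifiers such as "dec.", values in tables), which is one reason the abstract describes separate grammars per document domain plus document-level resolution.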
Affiliation(s)
- Matthew C Swain
- Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K
| | - Jacqueline M Cole
- Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K
|
31
|
Akhondi SA, Pons E, Afzal Z, van Haagen H, Becker BFH, Hettne KM, van Mulligen EM, Kors JA. Chemical entity recognition in patents by combining dictionary-based and statistical approaches. Database (Oxford) 2016; 2016:baw061. [PMID: 27141091 PMCID: PMC4852402 DOI: 10.1093/database/baw061] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 12/03/2015] [Accepted: 04/03/2016] [Indexed: 11/13/2022]
Abstract
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents
Affiliation(s)
- Saber A Akhondi
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
| | - Ewoud Pons
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
| | - Zubair Afzal
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
| | - Herman van Haagen
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
| | - Benedikt F H Becker
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
| | - Kristina M Hettne
- Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
| | - Erik M van Mulligen
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
| | - Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
|
32
|
Zhang Y, Xu J, Chen H, Wang J, Wu Y, Prakasam M, Xu H. Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning. Database (Oxford) 2016; 2016:baw049. [PMID: 27087307 PMCID: PMC4834204 DOI: 10.1093/database/baw049] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Accepted: 03/14/2016] [Indexed: 11/13/2022]
Abstract
Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, limited numbers of patent mining systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in the two subtasks of CEMP and CPD using machine learning-based systems. Our machine learning-based systems employed the algorithms of conditional random fields (CRF) and structured support vector machines (SSVMs), respectively. To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features of dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms to generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was yielded based on the patent titles and abstracts with chemicals recognized in the CEMP task. The effects of the proposed feature strategies on both the machine learning-based systems were investigated.
Our best system achieved the second best performance among 21 participating teams in CEMP with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94% and was the top performing system among nine participating teams in CPD with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and unsupervised learning algorithms significantly improved the chemical NER task on patents. Database URL: http://database.oxfordjournals.org/content/2016/baw049.
Affiliation(s)
- Yaoyun Zhang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Jun Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Hui Chen
- School of Biomedical Engineering, Capital Medical University, Beijing 100069, China
| | - Jingqi Wang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Yonghui Wu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | | | - Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
33
|
Lowe DM, O'Boyle NM, Sayle RA. Efficient chemical-disease identification and relationship extraction using Wikipedia to improve recall. Database (Oxford) 2016; 2016:baw039. [PMID: 27060160 PMCID: PMC4825350 DOI: 10.1093/database/baw039] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Revised: 02/29/2016] [Accepted: 03/02/2016] [Indexed: 11/13/2022]
Abstract
Awareness of the adverse effects of chemicals is important in biomedical research and healthcare. Text mining can allow timely and low-cost extraction of this knowledge from the biomedical literature. We extended our text mining solution, LeadMine, to identify diseases and chemical-induced disease relationships (CIDs). LeadMine is a dictionary/grammar-based entity recognizer and was used to recognize and normalize both chemicals and diseases to Medical Subject Headings (MeSH) IDs. The disease lexicon was obtained from three sources: MeSH, the Disease Ontology and Wikipedia. The Wikipedia dictionary was derived from pages with a disease/symptom box, or those where the page title appeared in the lexicon. Composite entities (e.g. heart and lung disease) were detected and mapped to their composite MeSH IDs. For CIDs, we developed a simple pattern-based system to find relationships within the same sentence. Our system was evaluated in the BioCreative V Chemical-Disease Relation task and achieved very good results for both disease concept ID recognition (F1-score: 86.12%) and CIDs (F1-score: 52.20%) on the test set. As our system was over an order of magnitude faster than other solutions evaluated on the task, we were able to apply the same system to the entirety of MEDLINE allowing us to extract a collection of over 250 000 distinct CIDs.
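The combination of dictionary lookup and same-sentence relation finding described in this abstract can be illustrated with a small standard-library sketch. This is not LeadMine and does not reproduce its patterns: it uses plain sentence-level co-occurrence as a stand-in for the authors' pattern-based rules, and the lexicons `CHEMICALS` and `DISEASES` and all function names are hypothetical. A real system would also normalize each term to a MeSH ID.

```python
import re
from itertools import product

# Hypothetical toy lexicons standing in for MeSH-derived dictionaries.
CHEMICALS = {"cisplatin", "lithium"}
DISEASES = {"nephrotoxicity", "tremor"}


def split_sentences(text):
    # Naive splitter on sentence-final punctuation; adequate for a sketch only.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence, lexicon):
    # Exact, case-insensitive single-token dictionary matching.
    tokens = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
    return sorted(tokens & lexicon)


def same_sentence_pairs(text):
    """Return sorted (chemical, disease) pairs that co-occur in one sentence."""
    pairs = set()
    for sentence in split_sentences(text):
        chemicals = find_terms(sentence, CHEMICALS)
        diseases = find_terms(sentence, DISEASES)
        pairs.update(product(chemicals, diseases))
    return sorted(pairs)


text = ("Cisplatin caused nephrotoxicity in rats. "
        "Lithium was well tolerated. Tremor resolved.")
print(same_sentence_pairs(text))  # [('cisplatin', 'nephrotoxicity')]
```

Lithium and tremor are not paired because they appear in different sentences. Pure co-occurrence over-generates candidate pairs, so pattern constraints such as those the authors describe are needed to filter them.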
Affiliation(s)
- Daniel M Lowe
- NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, United Kingdom
| | - Noel M O'Boyle
- NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, United Kingdom
| | - Roger A Sayle
- NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, United Kingdom
|
34
|
Schneider N, Lowe DM, Sayle RA, Tarselli MA, Landrum GA. Big Data from Pharmaceutical Patents: A Computational Analysis of Medicinal Chemists’ Bread and Butter. J Med Chem 2016; 59:4385-402. [DOI: 10.1021/acs.jmedchem.6b00153] [Citation(s) in RCA: 225] [Impact Index Per Article: 25.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Affiliation(s)
- Nadine Schneider
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Daniel M. Lowe
- NextMove Software Ltd., Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge CB4 0EY, U.K
| | - Roger A. Sayle
- NextMove Software Ltd., Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge CB4 0EY, U.K
| | - Michael A. Tarselli
- Novartis Institutes for BioMedical Research, 186 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Gregory A. Landrum
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
|
35
|
Xu J, Wu Y, Zhang Y, Wang J, Lee HJ, Xu H. CD-REST: a system for extracting chemical-induced disease relation in literature. Database (Oxford) 2016; 2016:baw036. [PMID: 27016700 PMCID: PMC4808251 DOI: 10.1093/database/baw036] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Revised: 02/04/2016] [Accepted: 03/01/2016] [Indexed: 11/24/2022]
Abstract
Mining chemical-induced disease relations embedded in the vast biomedical literature could facilitate a wide range of computational biomedical applications, such as pharmacovigilance. BioCreative V organized a Chemical Disease Relation (CDR) Track regarding chemical-induced disease relation extraction from biomedical literature in 2015. We participated in all subtasks of this challenge. In this article, we present our participation system Chemical Disease Relation Extraction SysTem (CD-REST), an end-to-end system for extracting chemical-induced disease relations in biomedical literature. CD-REST consists of two main components: (1) a chemical and disease named entity recognition and normalization module, which employs the Conditional Random Fields algorithm for entity recognition and a Vector Space Model-based approach for normalization; and (2) a relation extraction module that classifies both sentence-level and document-level candidate drug-disease pairs by support vector machines. Our system achieved the best performance on the chemical-induced disease relation extraction subtask in the BioCreative V CDR Track, demonstrating the effectiveness of our proposed machine learning-based approaches for automatic extraction of chemical-induced disease relations in biomedical literature. The CD-REST system provides web services via HTTP POST requests. The web services can be accessed from http://clinicalnlptool.com/cdr. The online CD-REST demonstration system is available at http://clinicalnlptool.com/cdr/cdr.html. Database URL: http://clinicalnlptool.com/cdr; http://clinicalnlptool.com/cdr/cdr.html.
Affiliation(s)
- Jun Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Yonghui Wu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Jingqi Wang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Hee-Jin Lee
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
36
|
Tetko IV, Lowe DM, Williams AJ. The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS. J Cheminform 2016; 8:2. [PMID: 26807157 PMCID: PMC4724158 DOI: 10.1186/s13321-016-0113-y] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2015] [Accepted: 01/08/2016] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Melting point (MP) is an important property in regards to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure-activity relationship studies. Success in this area of research critically depends on the availability of high quality MP data as well as accurate chemical structure representations in order to develop models. Currently, available datasets for MP predictions have been limited to around 50k molecules while lots more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if it were available in the appropriate form, could potentially be used to develop predictive models. RESULTS We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (http://ochem.eu). A number of technical challenges were simultaneously solved to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task using 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to the highly curated data used in previous publications. The separation of data for chemicals that decomposed rather than melting, from compounds that did undergo a normal melting transition, was performed and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C.
Last but not least, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed. CONCLUSIONS We have shown that automated tools for the analysis of chemical information have reached a mature stage allowing for the extraction and collection of high quality data to enable the development of structure-activity relationship models. The developed models and data are publicly available at http://ochem.eu/article/99826.
Affiliation(s)
- Igor V. Tetko
  - Institute of Structural Biology, Helmholtz Zentrum München für Gesundheit und Umwelt (HMGU), Ingolstädter Landstraße 1, b. 60w, 85764 Neuherberg, Germany
  - BigChem GmbH, 85764 Neuherberg, Germany
- Daniel M. Lowe
  - NextMove Software Limited, Innovation Centre (Unit 23), Cambridge Science Park, Cambridge, CB4 0EY UK
37
38
Akhondi SA, Hettne KM, van der Horst E, van Mulligen EM, Kors JA. Recognition of chemical entities: combining dictionary-based and grammar-based approaches. J Cheminform 2015; 7:S10. [PMID: 25810767 PMCID: PMC4331686 DOI: 10.1186/1758-2946-7-s1-s10] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND The past decade has seen an upsurge in the number of publications in chemistry. The ever-swelling volume of available documents makes it increasingly hard to extract relevant new information from such unstructured texts. The BioCreative CHEMDNER challenge invites the development of systems for the automatic recognition of chemicals in text (CEM task) and for ranking the recognized compounds at the document level (CDI task). We investigated an ensemble approach where dictionary-based named entity recognition is used along with grammar-based recognizers to extract compounds from text. We assessed the performance of ten different commercial and publicly available lexical resources using an open source indexing system (Peregrine), in combination with three different chemical compound recognizers and a set of regular expressions to recognize chemical database identifiers. The effect of different stop-word lists, case-sensitivity matching, and use of chunking information was also investigated. We focused on lexical resources that provide chemical structure information. To rank the different compounds found in a text, we used a term confidence score based on the normalized ratio of the term frequencies in chemical and non-chemical journals. RESULTS The use of stop-word lists greatly improved the performance of the dictionary-based recognition, but there was no additional benefit from using chunking information. A combination of ChEBI and HMDB as lexical resources, the LeadMine tool for grammar-based recognition, and the regular expressions outperformed any of the individual systems. On the test set, the F-scores were 77.8% (recall 71.2%, precision 85.8%) for the CEM task and 77.6% (recall 71.7%, precision 84.6%) for the CDI task. Missed terms were mainly due to tokenization issues, poor recognition of formulas, and term conjunctions. CONCLUSIONS We developed an ensemble system that combines dictionary-based and grammar-based approaches for chemical named entity recognition, outperforming any of the individual systems that we considered. The system is able to provide structure information for most of the compounds that are found. Improved tokenization and better recognition of specific entity types are likely to further improve system performance.
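The F-scores quoted in this abstract can be reproduced from the reported precision and recall. The following is a minimal sketch, assuming the balanced F-score (harmonic mean of precision and recall); the abstract does not state which F-measure variant was used, and `f1_score` is a hypothetical helper name introduced here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall.

    Inputs and output share the same unit (here: percent).
    Assumption: the abstract's "F-score" is the balanced F1 variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs (in percent) as reported in the abstract.
cem = f1_score(precision=85.8, recall=71.2)  # chemical entity mention task
cdi = f1_score(precision=84.6, recall=71.7)  # chemical document indexing task

print(f"CEM F-score: {cem:.1f}%")  # prints "CEM F-score: 77.8%"
print(f"CDI F-score: {cdi:.1f}%")  # prints "CDI F-score: 77.6%"
```

Both values agree with the figures given in the abstract, which suggests the balanced definition is the one intended.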
Affiliation(s)
- Saber A Akhondi
  - Department of Medical Informatics, Erasmus University Medical Center, P.O. Box 2040, Rotterdam, CA 3000, The Netherlands
- Kristina M Hettne
  - Department of Human Genetics, Leiden University Medical Center, P.O. Box 9600, Leiden, RC 2300, The Netherlands
- Eelke van der Horst
  - Department of Human Genetics, Leiden University Medical Center, P.O. Box 9600, Leiden, RC 2300, The Netherlands
- Erik M van Mulligen
  - Department of Medical Informatics, Erasmus University Medical Center, P.O. Box 2040, Rotterdam, CA 3000, The Netherlands
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, P.O. Box 2040, Rotterdam, CA 3000, The Netherlands
39
Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A. CHEMDNER: The drugs and chemical names extraction challenge. J Cheminform 2015; 7:S1. [PMID: 25810766 PMCID: PMC4331685 DOI: 10.1186/1758-2946-7-s1-s1] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Indexed: 01/26/2023]
Abstract
Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improve the access and integration of information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of manually labeled text prepared by specially trained chemists as Gold Standard data. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing - CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition - CEM task). 27 teams (23 academic and 4 commercial, a total of 87 researchers) returned results for the CHEMDNER tasks: 26 teams for CEM and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact in future developments of chemical text mining applications and will form the basis to find related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Calle Melchor Fernández Almagro, 3, Madrid, Spain
- Florian Leitner
  - Computational Intelligence Group, Department of Artificial Intelligence, Universidad Politecnica de Madrid, Calle Ramiro de Maeztu, 7, Madrid, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Center for Applied Medical Research (CIMA), University of Navarra, Avenida de Pio XII, 55, Pamplona, Spain
- Miguel Vazquez
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Calle Melchor Fernández Almagro, 3, Madrid, Spain
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Center for Applied Medical Research (CIMA), University of Navarra, Avenida de Pio XII, 55, Pamplona, Spain
- Alfonso Valencia
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Calle Melchor Fernández Almagro, 3, Madrid, Spain
40
Usié A, Cruz J, Comas J, Solsona F, Alves R. CheNER: a tool for the identification of chemical entities and their classes in biomedical literature. J Cheminform 2015; 7:S15. [PMID: 25810772 PMCID: PMC4331691 DOI: 10.1186/1758-2946-7-s1-s15] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
BACKGROUND Small chemical molecules regulate biological processes at the molecular level. Those molecules are often involved in causing or treating pathological states. Automatically identifying such molecules in biomedical text is difficult due to both the diverse morphology of chemical names and the alternative types of nomenclature that are simultaneously used to describe them. To address these issues, the last BioCreAtIvE challenge proposed a CHEMDNER task, which is a Named Entity Recognition (NER) challenge that aims at labelling different types of chemical names in biomedical text. METHODS To address this challenge we tested various approaches to recognizing chemical entities in biomedical documents. These approaches range from linear Conditional Random Fields (CRFs) to a combination of CRFs with regular expression and dictionary matching, followed by a post-processing step to tag those chemical names in a corpus of Medline abstracts. We named our best performing systems CheNER. RESULTS We evaluate the performance of the various approaches using the F-score statistic. Higher F-scores indicate better performance. The highest F-score we obtain in identifying unique chemical entities is 72.88%. The highest F-score we obtain in identifying all chemical entities is 73.07%. We also evaluate the F-score of combining our system with ChemSpot, and find an increase from 72.88% to 73.83%. CONCLUSIONS CheNER presents a valid alternative for automated annotation of chemical entities in biomedical documents. In addition, CheNER may be used to derive new features to train newer methods for tagging chemical entities. CheNER can be downloaded from http://metres.udl.cat and included in text annotation pipelines.
Affiliation(s)
- Anabel Usié
  - Departament Ciències Mèdiques Bàsiques, Universitat de Lleida, Av. Rovira Roure nº 80, 25298 Lleida, Spain
  - Departament d'Informàtica i Enginyeria Industrial, Universitat de Lleida, C/Jaume II nº 69, 25001, Lleida, Spain
  - Centro de Biotecnologia Agricola e Agro-Alimentar do Baixo Alentejo (CEBAL), Rua. Pedro Soares s/n, Campus IPBeja, 6158 7801-908 Beja, Portugal
- Joaquim Cruz
  - Departament Ciències Mèdiques Bàsiques, Universitat de Lleida, Av. Rovira Roure nº 80, 25298 Lleida, Spain
- Jorge Comas
  - Departament Ciències Mèdiques Bàsiques, Universitat de Lleida, Av. Rovira Roure nº 80, 25298 Lleida, Spain
- Francesc Solsona
  - Departament d'Informàtica i Enginyeria Industrial, Universitat de Lleida, C/Jaume II nº 69, 25001, Lleida, Spain
- Rui Alves
  - Departament Ciències Mèdiques Bàsiques, Universitat de Lleida, Av. Rovira Roure nº 80, 25298 Lleida, Spain
41
Abstract
BACKGROUND Chemical entity recognition has traditionally been performed by machine learning approaches. Here we describe an approach using grammars and dictionaries. This approach has the advantage that the entities found can be directly related to a given grammar or dictionary, which allows the type of an entity to be known and, if an entity is misannotated, indicates which resource should be corrected. As recognition is driven by what is expected, if spelling errors occur, they can be corrected. Correcting such errors is highly useful when attempting to look up an entity in a database or, in the case of chemical names, converting them to structures. RESULTS Our system uses a mixture of expertly curated grammars and dictionaries, as well as dictionaries automatically derived from public resources. We show that the heuristics developed to filter our dictionary of trivial chemical names (from PubChem) yield a better performing dictionary than the previously published Jochem dictionary. Our final system performs post-processing steps to modify the boundaries of entities and to detect abbreviations. These steps are shown to significantly improve performance (2.6% and 4.0% F1-score respectively). Our complete system, with incremental post-BioCreative workshop improvements, achieves 89.9% precision and 85.4% recall (87.6% F1-score) on the CHEMDNER test set. CONCLUSIONS Grammar and dictionary approaches can produce results at least as good as the current state of the art in machine learning approaches. While machine learning approaches are commonly thought of as "black box" systems, our approach directly links the output entities to the input dictionaries and grammars. Our approach also allows correction of errors in detected entities, which can assist with entity resolution.
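The traceability described in this abstract, where every recognized entity points back to the resource that produced it, can be illustrated with a toy recognizer. This is only a sketch under stated assumptions: the dictionaries, the CAS-number-style pattern, and all names (`DICTIONARIES`, `PATTERNS`, `Mention`, `find_mentions`) are hypothetical and do not reflect the authors' actual system, grammars, or data.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    text: str
    start: int
    end: int
    source: str  # name of the dictionary or pattern that produced the match


# Hypothetical toy resources; a real system would use curated dictionaries.
DICTIONARIES = {
    "trivial_names": {"aspirin", "caffeine", "acetone"},
    "solvents": {"ethanol", "toluene"},
}
# Hypothetical pattern standing in for a grammar: CAS-number-like identifiers.
PATTERNS = {
    "cas_number": re.compile(r"\b\d{2,7}-\d{2}-\d\b"),
}


def find_mentions(text: str) -> list[Mention]:
    """Return mentions in text order, each labelled with its originating resource."""
    mentions = []
    # Dictionary lookup on simple word-like tokens (case-insensitive).
    for token in re.finditer(r"[A-Za-z][A-Za-z0-9-]*", text):
        for name, terms in DICTIONARIES.items():
            if token.group().lower() in terms:
                mentions.append(Mention(token.group(), token.start(), token.end(), name))
    # Pattern-based recognition.
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append(Mention(match.group(), match.start(), match.end(), name))
    return sorted(mentions, key=lambda m: m.start)


for mention in find_mentions("Aspirin (50-78-2) was dissolved in ethanol."):
    print(mention.text, "<-", mention.source)
```

Because each `Mention` carries its `source`, a misannotated entity immediately indicates which dictionary or pattern needs correcting, which is the property the abstract contrasts with "black box" approaches.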
Affiliation(s)
- Daniel M Lowe
  - NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, UK
- Roger A Sayle
  - NextMove Software Ltd, Innovation Centre, Unit 23, Science Park, Milton Road, Cambridge, UK
42
Akhondi SA, Klenner AG, Tyrchan C, Manchala AK, Boppana K, Lowe D, Zimmermann M, Jagarlapudi SARP, Sayle R, Kors JA, Muresan S. Annotated chemical patent corpus: a gold standard for text mining. PLoS One 2014; 9:e107477. [PMID: 25268232 PMCID: PMC4182036 DOI: 10.1371/journal.pone.0107477] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 04/22/2014] [Accepted: 08/10/2014] [Indexed: 11/19/2022]
Abstract
Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups, each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.
Affiliation(s)
- Saber A. Akhondi
  - Department of Medical Informatics, Erasmus University Medical Centre, Rotterdam, The Netherlands
- Alexander G. Klenner
  - Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Fraunhofer-Gesellschaft, Sankt Augustin, Germany
- Marc Zimmermann
  - Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Fraunhofer-Gesellschaft, Sankt Augustin, Germany
- Jan A. Kors
  - Department of Medical Informatics, Erasmus University Medical Centre, Rotterdam, The Netherlands
- Sorel Muresan
  - Chemistry Innovation Centre, AstraZeneca R&D Mölndal, Mölndal, Sweden
43
Eltyeb S, Salim N. Chemical named entities recognition: a review on approaches and applications. J Cheminform 2014; 6:17. [PMID: 24834132 PMCID: PMC4022577 DOI: 10.1186/1758-2946-6-17] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.1] [Received: 11/15/2013] [Accepted: 03/25/2014] [Indexed: 12/03/2022]
Abstract
The rapid increase in the flow rate of published digital information in all disciplines has resulted in a pressing need for techniques that can simplify the use of this information. The chemistry literature is very rich with information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature to "text mine" these extracted data and determine contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, the authors briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based and machine learning, as well as hybrid chemical named entity recognition approaches with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
Affiliation(s)
- Safaa Eltyeb
  - Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
  - College of Computer Science and Information Technology, Sudan University of Science and Technology, Khartoum, Sudan
- Naomie Salim
  - Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia