1
|
Ohms J. Validity of PubChem compounds supplied by Patentscope or SureChEMBL. WORLD PATENT INFORMATION 2022. [DOI: 10.1016/j.wpi.2022.102134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
2
|
Naderi N, Knafou J, Copara J, Ruch P, Teodoro D. Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora. Front Res Metr Anal 2021; 6:689803. [PMID: 34870074 PMCID: PMC8640190 DOI: 10.3389/frma.2021.689803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2021] [Accepted: 10/11/2021] [Indexed: 11/13/2022] Open
Abstract
The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual and ensemble of deep masked language models perform across corpora of different health and life science domains-biology, chemistry, and medicine-available in different languages-English and French. Individual deep masked language models, pretrained on external corpora, are fined-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that the ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
Collapse
Affiliation(s)
- Nona Naderi
- Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Julien Knafou
- Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland.,Computer Science Department, University of Geneva, Geneva, Switzerland
| | - Jenny Copara
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland.,Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Patrick Ruch
- Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Douglas Teodoro
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland.,Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| |
Collapse
|
3
|
Ohms J. Current methodologies for chemical compound searching in patents: A case study. WORLD PATENT INFORMATION 2021. [DOI: 10.1016/j.wpi.2021.102055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
4
|
Artificial intelligence in drug design: algorithms, applications, challenges and ethics. FUTURE DRUG DISCOVERY 2021. [DOI: 10.4155/fdd-2020-0028] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
The discovery paradigm of drugs is rapidly growing due to advances in machine learning (ML) and artificial intelligence (AI). This review covers myriad faces of AI and ML in drug design. There is a plethora of AI algorithms, the most common of which are summarized in this review. In addition, AI is fraught with challenges that are highlighted along with plausible solutions to them. Examples are provided to illustrate the use of AI and ML in drug discovery and in predicting drug properties such as binding affinities and interactions, solubility, toxicology, blood–brain barrier permeability and chemical properties. The review also includes examples depicting the implementation of AI and ML in tackling intractable diseases such as COVID-19, cancer and Alzheimer’s disease. Ethical considerations and future perspectives of AI are also covered in this review.
Collapse
|
5
|
Ševa J, Wiegandt DL, Götze J, Lamping M, Rieke D, Schäfer R, Jähnichen P, Kittner M, Pallarz S, Starlinger J, Keilholz U, Leser U. VIST - a Variant-Information Search Tool for precision oncology. BMC Bioinformatics 2019; 20:429. [PMID: 31419935 PMCID: PMC6697931 DOI: 10.1186/s12859-019-2958-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2019] [Accepted: 06/18/2019] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Diagnosis and treatment decisions in cancer increasingly depend on a detailed analysis of the mutational status of a patient's genome. This analysis relies on previously published information regarding the association of variations to disease progression and possible interventions. Clinicians to a large degree use biomedical search engines to obtain such information; however, the vast majority of scientific publications focus on basic science and have no direct clinical impact. We develop the Variant-Information Search Tool (VIST), a search engine designed for the targeted search of clinically relevant publications given an oncological mutation profile. RESULTS VIST indexes all PubMed abstracts and content from ClinicalTrials.gov. It applies advanced text mining to identify mentions of genes, variants and drugs and uses machine learning based scoring to judge the clinical relevance of indexed abstracts. Its functionality is available through a fast and intuitive web interface. We perform several evaluations, showing that VIST's ranking is superior to that of PubMed or a pure vector space model with regard to the clinical relevance of a document's content. CONCLUSION Different user groups search repositories of scientific publications with different intentions. This diversity is not adequately reflected in the standard search engines, often leading to poor performance in specialized settings. We develop a search engine for the specific case of finding documents that are clinically relevant in the course of cancer treatment. We believe that the architecture of our engine, heavily relying on machine learning algorithms, can also act as a blueprint for search engines in other, equally specific domains. VIST is freely available at https://vist.informatik.hu-berlin.de/.
Collapse
Affiliation(s)
- Jurica Ševa
- Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
| | - David Luis Wiegandt
- Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
| | - Julian Götze
- University Hospital Tübingen, Hoppe-Seyler-Straße 3, Tübingen, 72076, Germany
| | - Mario Lamping
- Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany
| | - Damian Rieke
- Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany
- Department of Hematology and Medical Oncology, Campus Benjamin Franklin, Charité Unviersitätsmedizin Berlin, Hindenburgdamm 30, Berlin, 12203, Germany
- Berlin Institute of Health, Kapelle-Ufer 2, Berlin, 10117, Germany
| | - Reinhold Schäfer
- Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany
- German Cancer Consortium (DKTK), DKFZ Heidelberg, Im Neuenheimer Feld 280, Heidelberg, 69120, Germany
| | - Patrick Jähnichen
- Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
| | - Madeleine Kittner
- Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
| | - Steffen Pallarz
- Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
| | - Johannes Starlinger
- Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany
| | - Ulrich Keilholz
- Charité Comprehensive Cancer Center, Charitéplatz 1, Berlin, 10117, Germany
| | - Ulf Leser
- Knowledge Management in Bioinformatics, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, Berlin, 12489, Germany.
| |
Collapse
|
6
|
Barth A. A deep analysis of chemical structure-based patent searching in the Derwent index space. WORLD PATENT INFORMATION 2018. [DOI: 10.1016/j.wpi.2018.05.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/16/2022]
|
7
|
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017; 33:i37-i48. [PMID: 28881963 PMCID: PMC5870729 DOI: 10.1093/bioinformatics/btx228] [Citation(s) in RCA: 192] [Impact Index Per Article: 27.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
MOTIVATION Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . CONTACT habibima@informatik.hu-berlin.de.
Collapse
Affiliation(s)
- Maryam Habibi
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Leon Weber
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Mariana Neves
- Enterprise Platform and Integration Concepts, Hasso-Plattner-Institute, Potsdam, Germany
| | - David Luis Wiegandt
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
| |
Collapse
|
8
|
Lupu M. Information retrieval, machine learning, and Natural Language Processing for intellectual property information. WORLD PATENT INFORMATION 2017. [DOI: 10.1016/j.wpi.2017.06.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
9
|
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Collapse
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre , C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
| | - Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra , Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo , Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain.,Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia) , Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain.,CEB-Centre of Biological Engineering, University of Minho , Campus de Gualtar, Braga 4710-057, Portugal
| | - Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra , Avenida Pio XII 55, Pamplona E-31008, Spain
| | - Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS) , C/Jordi Girona, 29-31, Barcelona E-08034, Spain.,Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona , C/ Baldiri Reixac 10, Barcelona E-08028, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA) , Passeig de Lluís Companys 23, Barcelona E-08010, Spain
| |
Collapse
|