1
Iscoe M, Socrates V, Gilson A, Chi L, Li H, Huang T, Kearns T, Perkins R, Khandjian L, Taylor RA. Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models. Acad Emerg Med 2024; 31:599-610. [PMID: 38567658 DOI: 10.1111/acem.14883] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 01/24/2024] [Accepted: 01/24/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND: Natural language processing (NLP) tools including recently developed large language models (LLMs) have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large-scale projects that rely on variables that are not typically recorded in a structured form, such as patient signs and symptoms.
OBJECTIVES: This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large-scale research, public health surveillance, and EHR-based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes.
METHODS: The study population consisted of patients aged ≥ 18 who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task-specific LLMs to perform the task of named entity recognition: a convolutional neural network-based model (SpaCy) and a transformer-based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level.
RESULTS: A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. Overall F1 measure for note-level symptom identification weighted by entity frequency was 0.84 for the SpaCy model and 0.88 for the Longformer model. F1 measure for identifying presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the SpaCy model and 0.98 (240/250 correctly classified) for the Longformer model.
CONCLUSIONS: The study demonstrated the utility of LLMs and transformer-based models in particular for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom on the note level, with variable performance for individual symptoms.
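The note-level metrics named in this abstract (precision, recall, F1, and a frequency-weighted overall F1) can be illustrated with a minimal Python sketch. It assumes each note is reduced to a set of symptom labels, once from the gold annotations and once from the model. The function names, the toy labels, and the choice to weight each symptom's F1 by the number of gold notes that contain it are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Set, Tuple


def note_level_prf(gold: List[Set[str]], pred: List[Set[str]],
                   symptom: str) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one symptom, counted once per note."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        in_gold, in_pred = symptom in g, symptom in p
        if in_gold and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def weighted_f1(gold: List[Set[str]], pred: List[Set[str]],
                symptoms: List[str]) -> float:
    """Average of per-symptom F1, weighted by gold note count (an assumption)."""
    total_support = 0
    weighted_sum = 0.0
    for s in symptoms:
        support = sum(1 for g in gold if s in g)
        weighted_sum += support * note_level_prf(gold, pred, s)[2]
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


# Toy data: four notes, two hypothetical symptom labels.
gold_notes = [{"dysuria"}, {"dysuria", "fever"}, set(), {"fever"}]
pred_notes = [{"dysuria"}, {"fever"}, {"dysuria"}, {"fever"}]
print(note_level_prf(gold_notes, pred_notes, "dysuria"))
print(weighted_f1(gold_notes, pred_notes, ["dysuria", "fever"]))
```

Counting each symptom at most once per note is what distinguishes note-level scoring from span-level NER scoring, where every mention would be matched individually.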
Affiliation(s)
- Mark Iscoe
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
  - Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Vimig Socrates
  - Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
  - Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Aidan Gilson
  - Yale School of Medicine, New Haven, Connecticut, USA
- Ling Chi
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Huan Li
  - Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Thomas Huang
  - Yale School of Medicine, New Haven, Connecticut, USA
- Thomas Kearns
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Rachelle Perkins
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Laura Khandjian
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- R Andrew Taylor
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
  - Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
2
Tiemann JKS, Szczuka M, Bouarroudj L, Oussaren M, Garcia S, Howard RJ, Delemotte L, Lindahl E, Baaden M, Lindorff-Larsen K, Chavent M, Poulain P. MDverse: Shedding Light on the Dark Matter of Molecular Dynamics Simulations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.05.02.538537. [PMID: 37205542 PMCID: PMC10187166 DOI: 10.1101/2023.05.02.538537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations have led to the accumulation of MD files in generalist data repositories, constituting the dark matter of MD: data that is technically accessible, but neither indexed, curated, nor easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2,000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. We identified systems with specific molecular composition, characterized essential parameters of MD simulation such as temperature and simulation length, and identified model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and to report and standardize metadata to reuse this valuable matter.
3
Xie T, Wan Y, Wang H, Østrøm I, Wang S, He M, Deng R, Wu X, Grazian C, Kit C, Hoex B. Opinion Mining by Convolutional Neural Networks for Maximizing Discoverability of Nanomaterials. J Chem Inf Model 2024; 64:2746-2759. [PMID: 37982753 DOI: 10.1021/acs.jcim.3c00746] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 11/21/2023]
Abstract
The scientific literature contains valuable information that can be used for future applications, but manual analysis presents challenges due to its size and disciplinary boundaries. The prevailing solution involves natural language processing (NLP) techniques such as information retrieval. Nonetheless, existing automated systems primarily provide either statistically based shallow information or deep information without traceability, thereby falling short of delivering high-quality and reliable insights. To address this, we propose an innovative approach of leveraging sentiment information embedded within the literature to track the opinions toward materials. In this study, we integrated material knowledge into text representation and constructed opinion data sets to hierarchically train deep learning models, named as Scientific Sentiment Network (SSNet). SSNet can effectively extract knowledge from the energy material literature and accurately categorize expert opinions into challenges and opportunities (94% and 92% accuracy, respectively). By incorporating sentiment features determined by SSNet, we can predict the ranking of emerging thermoelectric materials with a 70% correlation to experimental outcomes. Furthermore, our model achieves a commendable 68% accuracy in predicting suitable nanomaterials for atomic layer deposition (ALD) over time. These promising results offer a practical framework to extract and synthesize knowledge from the scientific literature, thereby accelerating research in the field of nanomaterials.
Affiliation(s)
- Tong Xie
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
  - GreenDynamics Pty. Ltd., Kensington, NSW 2052, Australia
- Yuwei Wan
  - Department of Linguistics and Translation, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong
  - GreenDynamics Pty. Ltd., Kensington, NSW 2052, Australia
- Haoran Wang
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
- Ina Østrøm
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
- Shaozhou Wang
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
  - GreenDynamics Pty. Ltd., Kensington, NSW 2052, Australia
- Mingrui He
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
- Rong Deng
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
- Xinyuan Wu
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
- Clara Grazian
  - DARE ARC Training Centre in Data Analytics for Resources and Environments, South Eveleigh, NSW 2015, Australia
  - School of Mathematics and Statistics, University of Sydney, Camperdown, NSW 2006, Australia
- Chunyu Kit
  - Department of Linguistics and Translation, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong
- Bram Hoex
  - School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
4
Emmert-Streib F. Can ChatGPT understand genetics? Eur J Hum Genet 2024; 32:371-372. [PMID: 37407734 PMCID: PMC10999414 DOI: 10.1038/s41431-023-01419-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Accepted: 06/19/2023] [Indexed: 07/07/2023] Open
Affiliation(s)
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland.
5
Ngo DH, Koopman B. From Free-text Drug Labels to Structured Medication Terminology with BERT and GPT. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:540-549. [PMID: 38222391 PMCID: PMC10785872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024]
Abstract
We present a method to enrich controlled medication terminology from free-text drug labels. This is important because, while controlled medication terminologies capture well-structured medication information, much of the information pertaining to medications is still found in free text. First, we compared different Named Entity Recognition (NER) models, including rule-based, feature-based, and deep learning-based models with Transformers, as well as ChatGPT and few-shot and fine-tuned GPT-3, to find the most suitable model that accurately extracts medication entities (ingredients, brand, dose, etc.) from free text. Then, a rule-based Relation Extraction algorithm transforms NER results into a well-structured medication knowledge graph. Finally, a Medication Searching method takes the knowledge graph and matches it to relevant medications in the terminology server. An empirical evaluation on real-world drug labels shows that BERT-CRF was the most effective NER model, with an F-measure of 95%. After term normalization, the Medication Searching method achieved an accuracy of 77% when matching a label to the relevant medication in the terminology server. The NER and Medication Searching models could be deployed as a web service capable of accepting free-text queries and returning structured medication information, thus providing a useful means of better managing medication information found in different health systems.
Affiliation(s)
- Duy-Hoa Ngo
  - The Australian E-Health Research Centre, CSIRO, Australia
- Bevan Koopman
  - The Australian E-Health Research Centre, CSIRO, Australia
6
Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023; 18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023] Open
Abstract
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high-quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improves bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach a performance on par with training on all data), from 6% to 38%, depending on the data set, with Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
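The two uncertainty-based strategies named in this abstract, Least-Confident and Margin Sampling, can be sketched in a few lines of Python. The sketch assumes the model returns a class-probability vector for every unlabelled example; the function names and the toy probabilities are hypothetical and are not taken from the authors' repository.

```python
from typing import List, Sequence


def least_confident_scores(probs: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty = 1 - probability of the most likely class (higher = more uncertain)."""
    return [1.0 - max(p) for p in probs]


def margin_scores(probs: Sequence[Sequence[float]]) -> List[float]:
    """Margin = gap between the two most likely classes (smaller = more uncertain)."""
    scores = []
    for p in probs:
        top = sorted(p, reverse=True)
        scores.append(top[0] - top[1])
    return scores


def select_batch(probs: Sequence[Sequence[float]], k: int,
                 strategy: str = "margin") -> List[int]:
    """Return the indices of the k examples to send to annotators next."""
    indices = range(len(probs))
    if strategy == "least_confident":
        scores = least_confident_scores(probs)
        order = sorted(indices, key=lambda i: -scores[i])  # most uncertain first
    elif strategy == "margin":
        scores = margin_scores(probs)
        order = sorted(indices, key=lambda i: scores[i])   # smallest margin first
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return order[:k]


# Toy pool of two unlabelled examples with three classes each.
pool = [[0.5, 0.5, 0.0], [0.4, 0.3, 0.3]]
print(select_batch(pool, 1, "least_confident"))
print(select_batch(pool, 1, "margin"))
```

On this toy pool the two strategies disagree: the first example has the smaller margin, while the second has the lower top-class probability. The two criteria coincide for binary problems and can diverge only when there are three or more classes.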
Affiliation(s)
- Charlotte Nachtegael
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Jacopo De Stefani
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
  - Technology, Policy and Management Faculty, Technische Universiteit Delft, Delft, Netherlands
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Bruxelles, Belgium
7
Lera-Ramírez M, Bähler J, Mata J, Rutherford K, Hoffman CS, Lambert S, Oliferenko S, Martin SG, Gould KL, Du LL, Sabatinos SA, Forsburg SL, Nielsen O, Nurse P, Wood V. Revised fission yeast gene and allele nomenclature guidelines for machine readability. Genetics 2023; 225:iyad143. [PMID: 37758508 PMCID: PMC10627252 DOI: 10.1093/genetics/iyad143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 07/24/2023] [Indexed: 09/30/2023] Open
Abstract
Standardized nomenclature for genes, gene products, and isoforms is crucial to prevent ambiguity and enable clear communication of scientific data, facilitating efficient biocuration and data sharing. Standardized genotype nomenclature, which describes alleles present in a specific strain that differ from those in the wild-type reference strain, is equally essential to maximize research impact and ensure that results linking genotypes to phenotypes are Findable, Accessible, Interoperable, and Reusable (FAIR). In this publication, we extend the fission yeast clade gene nomenclature guidelines to support the curation efforts at PomBase (www.pombase.org), the Schizosaccharomyces pombe Model Organism Database. This update introduces nomenclature guidelines for noncoding RNA genes, following those set forth by the Human Genome Organisation Gene Nomenclature Committee. Additionally, we provide a significant update to the allele and genotype nomenclature guidelines originally published in 1987, to standardize the diverse range of genetic modifications enabled by the fission yeast genetic toolbox. These updated guidelines reflect a community consensus between numerous fission yeast researchers. Adoption of these rules will improve consistency in gene and genotype nomenclature, and facilitate machine-readability and automated entity recognition of fission yeast genes and alleles in publications or datasets. In conclusion, our updated guidelines provide a valuable resource for the fission yeast research community, promoting consistency, clarity, and FAIRness in genetic data sharing and interpretation.
Affiliation(s)
- Manuel Lera-Ramírez
  - University College London, Department of Genetics Evolution and Environment, Darwin Building, 99-105 Gower Street, London WC1E 6BT, UK
- Jürg Bähler
  - University College London, Department of Genetics Evolution and Environment, Darwin Building, 99-105 Gower Street, London WC1E 6BT, UK
- Juan Mata
  - University of Cambridge, Department of Biochemistry, Cambridge CB2 1GA, UK
- Kim Rutherford
  - University of Cambridge, Department of Biochemistry, Cambridge CB2 1GA, UK
- Sarah Lambert
  - Institut Curie, Université Paris-Saclay, CNRS UMR3348, Orsay 91400, France
- Snezhana Oliferenko
  - The Francis Crick Institute, London NW1 1AT, UK
  - Randall Centre for Cell and Molecular Biophysics, School of Basic and Medical Biosciences, King’s College London, London SE1 1UL, UK
- Sophie G Martin
  - University of Geneva, Department of Molecular and Cellular Biology, Geneva 1211, Switzerland
- Kathleen L Gould
  - Vanderbilt University School of Medicine, Department of Cell and Developmental Biology, Nashville, TN 37232, USA
- Li-Lin Du
  - National Institute of Biological Sciences, Beijing 102206, China
- Sarah A Sabatinos
  - Toronto Metropolitan University, Department of Chemistry & Biology, Toronto M5B 2K3, Canada
- Susan L Forsburg
  - Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
- Olaf Nielsen
  - Department of Biology, Cell cycle and genome stability Group, University of Copenhagen, Copenhagen N DK2100, Denmark
- Paul Nurse
  - The Francis Crick Institute, London NW1 1AT, UK
- Valerie Wood
  - University of Cambridge, Department of Biochemistry, Cambridge CB2 1GA, UK
8
Sun H, Song Z, Chen Q, Wang M, Tang F, Dou L, Zou Q, Yang F. MMiKG: a knowledge graph-based platform for path mining of microbiota-mental diseases interactions. Brief Bioinform 2023; 24:bbad340. [PMID: 37779250 DOI: 10.1093/bib/bbad340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 08/21/2023] [Accepted: 09/12/2023] [Indexed: 10/03/2023] Open
Abstract
The microbiota-gut-brain axis denotes a two-way system of interactions between the gut and the brain, comprising three key components: (1) gut microbiota, (2) intermediates and (3) mental ailments. These constituents communicate with one another to induce changes in the host's mood, cognition and demeanor. Knowledge concerning the regulation of the host central nervous system by gut microbiota is fragmented and mostly confined to disorganized or semi-structured unrestricted texts. Such a format hinders the exploration and comprehension of unknown territories and the further advancement of artificial intelligence systems. Hence, we collated crucial information by scrutinizing an extensive body of literature, amalgamated the extant knowledge of the microbiota-gut-brain axis and depicted it in the form of a knowledge graph named MMiKG, which can be visualized on the GraphXR platform and in the Neo4j database. By merging various associated resources and deducing prospective connections between gut microbiota and the central nervous system through MMiKG, users can acquire a more comprehensive perception of the pathogenesis of mental disorders and generate novel insights for advancing therapeutic measures. As a free and open-source platform, MMiKG can be accessed at http://yangbiolab.cn:8501/ with no login requirement.
Affiliation(s)
- Haoran Sun
  - School of Medical Imaging, Fujian Medical University, Fuzhou 350122, China
- Zhaoqi Song
  - Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
- Qiuming Chen
  - School of Medical Imaging, Fujian Medical University, Fuzhou 350122, China
- Meiling Wang
  - Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
- Furong Tang
  - Department of Basic Medical Sciences, School of Medicine, Tsinghua University, Beijing 100084, China
- Lijun Dou
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, USA
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Fenglong Yang
  - Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350122, China
9
Vaškevičius M, Kapočiūtė-Dzikienė J, Vaškevičius A, Šlepikas L. Deep learning-based automatic action extraction from structured chemical synthesis procedures. PeerJ Comput Sci 2023; 9:e1511. [PMID: 37705639 PMCID: PMC10495970 DOI: 10.7717/peerj-cs.1511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 07/07/2023] [Indexed: 09/15/2023]
Abstract
This article proposes a methodology that uses machine learning algorithms to extract actions from structured chemical synthesis procedures, thereby bridging the gap between chemistry and natural language processing. The proposed pipeline combines ML algorithms and scripts to extract relevant data from USPTO and EPO patents, which helps transform experimental procedures into structured actions. This pipeline includes two primary tasks: classifying patent paragraphs to select chemical procedures and converting chemical procedure sentences into a structured, simplified format. We employ artificial neural networks such as long short-term memory, bidirectional LSTMs, transformers, and fine-tuned T5. Our results show that the bidirectional LSTM classifier achieved the highest accuracy of 0.939 in the first task, while the Transformer model attained the highest BLEU score of 0.951 in the second task. The developed pipeline enables the creation of a dataset of chemical reactions and their procedures in a structured format, facilitating the application of AI-based approaches to streamline synthetic pathways, predict reaction outcomes, and optimize experimental conditions. Furthermore, the developed pipeline allows for creating a structured dataset of chemical reactions and procedures, making it easier for researchers to access and utilize the valuable information in synthesis procedures.
Affiliation(s)
- Mantas Vaškevičius
  - Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania
  - JSC Synhet, Kaunas, Lithuania
- Arnas Vaškevičius
  - Faculty of Mechanical Engineering and Design, Kaunas University of Technology, Kaunas, Lithuania
10
Raza S, Schwartz B, Lakamana S, Ge Y, Sarker A. A framework for multi-faceted content analysis of social media chatter regarding non-medical use of prescription medications. BMC DIGITAL HEALTH 2023; 1:29. [PMID: 37680768 PMCID: PMC10483682 DOI: 10.1186/s44247-023-00029-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Accepted: 07/17/2023] [Indexed: 09/09/2023]
Abstract
Background: Substance use, including the non-medical use of prescription medications, is a global health problem resulting in hundreds of thousands of overdose deaths and other health problems. Social media has emerged as a potent source of information for studying substance use-related behaviours and their consequences. Mining large-scale social media data on the topic requires the development of natural language processing (NLP) and machine learning frameworks customized for this problem. Our objective in this research is to develop a framework for conducting a content analysis of Twitter chatter about the non-medical use of a set of prescription medications.
Methods: We collected Twitter data for four medications: fentanyl and morphine (opioids), alprazolam (benzodiazepine), and Adderall® (stimulant). We identified posts that indicated non-medical use using an automatic machine learning classifier. In our NLP framework, we applied supervised named entity recognition (NER) to identify other substances mentioned, symptoms, and adverse events. We applied unsupervised topic modelling to identify latent topics associated with the chatter for each medication.
Results: The quantitative analysis demonstrated the performance of the proposed NER approach in identifying substance-related entities from data with a high degree of accuracy compared to the baseline methods. The performance evaluation of the topic modelling was also notable. The qualitative analysis revealed knowledge about the use, non-medical use, and side effects of these medications in individuals and communities.
Conclusions: NLP-based analyses of Twitter chatter associated with prescription medications belonging to different categories provide multi-faceted insights about their use and consequences. Our developed framework can be applied to chatter about other substances. Further research can validate the predictive value of this information on the prevention, assessment, and management of these disorders.
Affiliation(s)
- Shaina Raza
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, Canada
- Brian Schwartz
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Sahithi Lakamana
  - Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
- Yao Ge
  - Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
- Abeed Sarker
  - Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
11
Liang T, Xia C, Zhao Z, Jiang Y, Yin Y, Yu PS. Transferring From Textual Entailment to Biomedical Named Entity Recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2577-2586. [PMID: 37018664 DOI: 10.1109/tcbb.2023.3236477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Biomedical Named Entity Recognition (BioNER) aims at identifying biomedical entities such as genes, proteins, diseases, and chemical compounds in the given textual data. However, due to issues of ethics, privacy, and the high specialization of biomedical data, BioNER suffers from a more severe shortage of quality labeled data than the general domain, especially at the token level. Given the extremely limited labeled biomedical data, this work studies the problem of gazetteer-based BioNER, which aims at building a BioNER system from scratch. The system must identify the entities in the given sentences with zero token-level annotations for training. Previous works usually use sequential labeling models to solve the NER or BioNER task and obtain weakly labeled data from gazetteers when full annotations are unavailable. However, these labeled data are quite noisy, since labels are needed for each token and the entity coverage of the gazetteers is limited. Here we propose to formulate the BioNER task as a Textual Entailment problem and solve the task via Textual Entailment with Dynamic Contrastive learning (TEDC). TEDC not only alleviates the noisy labeling issue, but also transfers the knowledge from pre-trained textual entailment models. Additionally, the dynamic contrastive learning framework contrasts the entities and non-entities in the same sentence and improves the model's discrimination ability. Experiments on two real-world biomedical datasets show that TEDC can achieve state-of-the-art performance for gazetteer-based BioNER.
12
Cutforth M, Watson H, Brown C, Wang C, Thomson S, Fell D, Dilys V, Scrimgeour M, Schrempf P, Lesh J, Muir K, Weir A, O’Neil AQ. Acute stroke CDS: automatic retrieval of thrombolysis contraindications from unstructured clinical letters. Front Digit Health 2023; 5:1186516. [PMID: 37388253 PMCID: PMC10305776 DOI: 10.3389/fdgth.2023.1186516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Accepted: 05/15/2023] [Indexed: 07/01/2023] Open
Abstract
Introduction: Thrombolysis treatment for acute ischaemic stroke can lead to better outcomes if administered early enough. However, contraindications exist which put the patient at greater risk of a bleed (e.g. recent major surgery, anticoagulant medication). Therefore, clinicians must check a patient's past medical history before proceeding with treatment. In this work we present a machine learning approach for accurate automatic detection of this information in unstructured text documents such as discharge letters or referral letters, to support the clinician in making a decision about whether to administer thrombolysis.
Methods: We consulted local and national guidelines for thrombolysis eligibility, identifying 86 entities which are relevant to the thrombolysis decision. A total of 8,067 documents from 2,912 patients were manually annotated with these entities by medical students and clinicians. Using this data, we trained and validated several transformer-based named entity recognition (NER) models, focusing on transformer models which have been pre-trained on a biomedical corpus as these have shown most promise in the biomedical NER literature.
Results: Our best model was a PubMedBERT-based approach, which obtained a lenient micro/macro F1 score of 0.829/0.723. Ensembling 5 variants of this model gave a significant boost to precision, obtaining micro/macro F1 of 0.846/0.734 which approaches the human annotator performance of 0.847/0.839. We further propose numeric definitions for the concepts of name regularity (similarity of all spans which refer to an entity) and context regularity (similarity of all context surrounding mentions of an entity), using these to analyse the types of errors made by the system and finding that the name regularity of an entity is a stronger predictor of model performance than raw training set frequency.
Discussion: Overall, this work shows the potential of machine learning to provide clinical decision support (CDS) for the time-critical decision of thrombolysis administration in ischaemic stroke by quickly surfacing relevant information, leading to prompt treatment and hence to better patient outcomes.
Affiliation(s)
| | - Hannah Watson
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Cameron Brown
- Institute of Neuroscience & Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Chaoyang Wang
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Stuart Thomson
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Dickon Fell
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | | | | | | | - James Lesh
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Keith Muir
- Institute of Neuroscience & Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Alexander Weir
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Alison Q O’Neil
- Canon Medical Research Europe, Edinburgh, United Kingdom
- School of Engineering, University of Edinburgh, Edinburgh, United Kingdom
|
13
|
Jeong M, Kang J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023; 39:btad361. [PMID: 37261870 PMCID: PMC10272703 DOI: 10.1093/bioinformatics/btad361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 04/17/2023] [Accepted: 05/31/2023] [Indexed: 06/02/2023] Open
Abstract
SUMMARY Biomedical named entity recognition (NER) plays a crucial role in extracting information from documents in biomedical applications. However, many of these applications require NER models to operate at a document level, rather than just a sentence level. This presents a challenge, as the extension from a sentence model to a document model is not always straightforward. Although document NER models that make consistent predictions exist, they still fall short of the expectations of researchers and practitioners in the field. To address this issue, we investigated the underlying causes of inconsistent predictions. Our research has led us to believe that the use of adjectives and prepositions within entities may be contributing to low label consistency. In this article, we present our method, ConNER, which enhances the label consistency of modifiers such as adjectives and prepositions. By refining the labels of these modifiers, ConNER is able to improve representations of biomedical entities. The effectiveness of our method is demonstrated on four popular biomedical NER datasets. On three datasets, we achieve a higher F1 score than the previous state-of-the-art model. Our method shows its efficacy on two datasets, resulting in 7.5%-8.6% absolute improvements in the F1 score. Our findings suggest that our ConNER method is effective on datasets with intrinsically low label consistency. Through qualitative analysis, we demonstrate how our approach helps the NER model generate more consistent predictions. AVAILABILITY AND IMPLEMENTATION Our code and resources are available at https://github.com/dmis-lab/ConNER/.
Affiliation(s)
- Minbyul Jeong
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
- AIGEN Sciences, Seoul 04778, Republic of Korea
|
14
|
Raza S, Schwartz B. Constructing a disease database and using natural language processing to capture and standardize free text clinical information. Sci Rep 2023; 13:8591. [PMID: 37237101 DOI: 10.1038/s41598-023-35482-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Accepted: 05/18/2023] [Indexed: 05/28/2023] Open
Abstract
The ability to extract key information about an infectious disease in a timely manner is critical for population health research. The lack of procedures for mining large amounts of health data is a major impediment. The goal of this research is to use natural language processing (NLP) to extract key information (clinical factors, social determinants of health) from free text. The proposed framework describes database construction, NLP modules for locating clinical and non-clinical (social determinants) information, and a detailed protocol for evaluating results and demonstrating the effectiveness of the framework. The use of COVID-19 case reports is demonstrated for data construction and pandemic surveillance. The proposed approach outperforms benchmark methods in F1-score by about 1-3%. A thorough examination reveals the disease's presence as well as the frequency of symptoms in patients. The findings suggest that prior knowledge gained through transfer learning can be useful when researching infectious diseases with similar presentations in order to accurately predict patient outcomes.
Affiliation(s)
- Shaina Raza
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
| | - Brian Schwartz
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
|
15
|
Moezzi SAR, Ghaedi A, Rahmanian M, Mousavi SZ, Sami A. Application of Deep Learning in Generating Structured Radiology Reports: A Transformer-Based Technique. J Digit Imaging 2023; 36:80-90. [PMID: 36002778 PMCID: PMC9984654 DOI: 10.1007/s10278-022-00692-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2021] [Revised: 06/20/2022] [Accepted: 07/27/2022] [Indexed: 11/29/2022] Open
Abstract
Since radiology reports needed for clinical practice and research are written and stored as free-text narratives, extraction of relevant information for further analysis is difficult. In these circumstances, natural language processing (NLP) techniques can facilitate automatic information extraction and transformation of free-text formats to structured data. In recent years, deep learning (DL)-based models have been adapted for NLP experiments with promising results. Despite the significant potential of DL models based on artificial neural networks (ANN) and convolutional neural networks (CNN), these models face limitations when implemented in clinical practice. Transformers, a newer DL architecture, have been increasingly applied to improve the process. Therefore, in this study, we propose a transformer-based fine-grained named entity recognition (NER) architecture for clinical information extraction. We collected 88 abdominopelvic sonography reports in free-text formats and annotated them based on our developed information schema. The text-to-text transfer transformer model (T5) and SciFive, a pre-trained domain-specific adaptation of the T5 model, were fine-tuned to extract entities and relations and transform the input into a structured format. Our transformer-based model in this study outperformed previously applied approaches such as ANN and CNN models based on ROUGE-1, ROUGE-2, ROUGE-L, and BLEU scores of 0.816, 0.668, 0.528, and 0.743, respectively, while providing an interpretable structured report.
Affiliation(s)
- Seyed Ali Reza Moezzi
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
| | - Abdolrahman Ghaedi
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
| | - Mojdeh Rahmanian
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
| | | | - Ashkan Sami
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
|
16
|
Yew ANJ, Schraagen M, Otte WM, van Diessen E. Transforming epilepsy research: A systematic review on natural language processing applications. Epilepsia 2023; 64:292-305. [PMID: 36462150 PMCID: PMC10108221 DOI: 10.1111/epi.17474] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 07/11/2022] [Revised: 11/23/2022] [Accepted: 12/01/2022] [Indexed: 12/05/2022]
Abstract
Despite improved ancillary investigations in epilepsy care, patients' narratives remain indispensable for diagnosis and treatment monitoring. This wealth of information is typically stored in electronic health records and accumulated in medical journals in an unstructured manner, thereby restricting complete utilization in clinical decision-making. To this end, clinical researchers increasingly apply natural language processing (NLP), a branch of artificial intelligence, as it removes ambiguity, derives context, and extracts standardized meaning from free-narrative clinical texts. This systematic review presents an overview of the current NLP applications in epilepsy and discusses the opportunities and drawbacks of NLP alongside its future implications. We searched the PubMed and Embase databases with a "natural language processing" and "epilepsy" query (March 4, 2022) and included original research articles describing the application of NLP techniques for textual analysis in epilepsy. Twenty-six studies were included. Fifty-eight percent of these studies used NLP to classify clinical records into predefined categories, improving patient identification and treatment decisions. Other applications of NLP included structured clinical information retrieval from electronic health records, scientific papers, and online posts of patients. Challenges and opportunities of NLP applications for enhancing epilepsy care and research are discussed. The field could further benefit from NLP by replicating successes in other health care domains, such as NLP-aided quality evaluation for clinical decision-making, outcome prediction, and clinical record summarization.
Affiliation(s)
- Arister N J Yew
- University College Utrecht, Utrecht University, Utrecht, The Netherlands
| | - Marijn Schraagen
- Department of Information and Computing Sciences, Faculty of Science, Utrecht University, Utrecht, The Netherlands
| | - Willem M Otte
- Department of Child Neurology, Brain Center, University Medical Center Utrecht and Utrecht University, Utrecht, The Netherlands
| | - Eric van Diessen
- Department of Child Neurology, Brain Center, University Medical Center Utrecht and Utrecht University, Utrecht, The Netherlands
|
17
|
Raza S, Schwartz B. Entity and relation extraction from clinical case reports of COVID-19: a natural language processing approach. BMC Med Inform Decis Mak 2023; 23:20. [PMID: 36703154 PMCID: PMC9879259 DOI: 10.1186/s12911-023-02117-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2022] [Accepted: 01/20/2023] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND Extracting relevant information about infectious diseases is an essential task. However, a significant obstacle in supporting public health research is the lack of methods for effectively mining large amounts of health data. OBJECTIVE This study aims to use natural language processing (NLP) to extract the key information (clinical factors, social determinants of health) from published cases in the literature. METHODS The proposed framework integrates a data layer for preparing a data cohort from clinical case reports; an NLP layer to find the clinical and demographic named entities and relations in the texts; and an evaluation layer for benchmarking performance and analysis. The focus of this study is to extract valuable information from COVID-19 case reports. RESULTS The named entity recognition implementation in the NLP layer achieves a performance gain of about 1-3% compared to benchmark methods. Furthermore, even without extensive data labeling, the relation extraction method outperforms benchmark methods in terms of accuracy (by 1-8%). A thorough examination reveals the disease's presence and symptom prevalence in patients. CONCLUSIONS A similar approach can be generalized to other infectious diseases. It is worthwhile to use prior knowledge acquired through transfer learning when researching other infectious diseases.
Affiliation(s)
- Shaina Raza
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
| | - Brian Schwartz
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
|
18
|
Cenikj G, Valenčič E, Ispirova G, Ogrinc M, Stojanov R, Korošec P, Cavalli E, Seljak BK, Eftimov T. CafeteriaSA corpus: scientific abstracts annotated across different food semantic resources. Database (Oxford) 2022; 2022:6918707. [PMID: 36526439 PMCID: PMC9757992 DOI: 10.1093/database/baac107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Revised: 10/30/2022] [Accepted: 11/23/2022] [Indexed: 12/23/2022]
Abstract
In the last decades, a great amount of work has been done in predictive modeling of issues related to human and environmental health. Resolution of issues related to healthcare is made possible by the existence of several biomedical vocabularies and standards, which play a crucial role in understanding the health information, together with a large amount of health-related data. However, despite a large number of available resources and work done in the health and environmental domains, there is a lack of semantic resources that can be utilized in the food and nutrition domain, as well as their interconnections. For this purpose, in the European Food Safety Authority-funded project CAFETERIA, we have developed the first annotated corpus of 500 scientific abstracts that consists of 6407 annotated food entities with regard to Hansard taxonomy, 4299 for FoodOn and 3623 for SNOMED-CT. The CafeteriaSA corpus will enable the further development of natural language processing methods for extracting food information from scientific textual data. Database URL: https://zenodo.org/record/6683798#.Y49wIezMJJF.
Affiliation(s)
| | - Eva Valenčič
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- School of Health Sciences, College of Health, Medicine and Wellbeing, University of Newcastle, University Drive, Callaghan Campus, Newcastle, NSW 2308, Australia
- Food and Nutrition Program, Hunter Medical Research Institute, Lot 1 Kookaburra Circuit, New Lambton Heights, Newcastle, NSW 2305, Australia
| | - Gordana Ispirova
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
| | - Matevž Ogrinc
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
| | - Riste Stojanov
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Ruger Boshkovikj 16, Skopje 1000, North Macedonia
| | - Peter Korošec
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
| | - Ermanno Cavalli
- European Food Safety Authority, Via Carlo Magno 1A, Parma 43126, Italy
| | - Barbara Koroušić Seljak
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
| | - Tome Eftimov
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
|
19
|
Bashir SR, Raza S, Kocaman V, Qamar U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses 2022; 14:v14122761. [PMID: 36560764 PMCID: PMC9781729 DOI: 10.3390/v14122761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022] Open
Abstract
The clinical application of detecting COVID-19 factors is a challenging task. Existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical factors, non-clinical factors such as social determinants of health (SDoH) are also important for studying infectious diseases. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1-5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.
Affiliation(s)
- Syed Raza Bashir
- Department of Computer Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
| | - Shaina Raza
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
| | | | - Urooj Qamar
- Institute of Business & Information Technology, University of the Punjab, Lahore 54590, Pakistan
|
20
|
Raza S, Reji DJ, Shajan F, Bashir SR. Large-scale application of named entity recognition to biomedicine and epidemiology. PLOS Digit Health 2022; 1:e0000152. [PMID: 36812589 PMCID: PMC9931203 DOI: 10.1371/journal.pdig.0000152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2022] [Accepted: 11/01/2022] [Indexed: 12/13/2022]
Abstract
BACKGROUND Despite significant advancements in biomedical named entity recognition methods, the clinical application of these systems continues to face many challenges: (1) most of the methods are trained on a limited set of clinical entities; (2) these methods are heavily reliant on a large amount of data for both pre-training and prediction, making their use in production impractical; (3) they do not consider non-clinical entities, which are also related to a patient's health, such as social, economic or demographic factors. METHODS In this paper, we develop Bio-Epidemiology-NER (https://pypi.org/project/Bio-Epidemiology-NER/), an open-source Python package for detecting biomedical named entities in text. This approach is based on a Transformer-based system and trained on a dataset that is annotated with many named entities (medical, clinical, biomedical, and epidemiological). This approach improves on previous efforts in three ways: (1) it recognizes many clinical entity types, such as medical risk factors, vital signs, drugs, and biological functions; (2) it is easily configurable, reusable, and can scale up for training and inference; (3) it also considers non-clinical factors (age, gender, race, social history, and so on) that influence health outcomes. At a high level, it consists of four phases: pre-processing, data parsing, named entity recognition, and named entity enhancement. RESULTS Experimental results show that our pipeline outperforms other methods on three benchmark datasets with macro- and micro-average F1 scores around 90 percent and above. CONCLUSION This package is made publicly available for researchers, doctors, clinicians, and anyone to extract biomedical named entities from unstructured biomedical texts.
Affiliation(s)
- Shaina Raza
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
| | | | - Femi Shajan
- Environmental Resources Management, Bangalore, India
| | - Syed Raza Bashir
- Toronto Metropolitan University, Toronto, Ontario, Canada
|
21
|
Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022; 23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, this method often underutilizes syntactic features such as dependencies and topology of sentences. Therefore, integrating semantic and syntactic features into the BioNER model is an urgent problem. RESULTS In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation can introduce more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we use periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy. A graph attention network is then used to generate a fused representation considering both the contextual features and syntactic features. Last, a softmax function is used to calculate the probabilities and get the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, and achieves F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, 90.99%, respectively.
CONCLUSION The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task into a node classification problem and combining syntactic features into the graph attention networks can significantly improve model performance.
Affiliation(s)
- Xiangwen Zheng
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Haijian Du
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Xiaowei Luo
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Fan Tong
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Wei Song
- Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China
| | - Dongsheng Zhao
- Academy of Military Medical Sciences, Beijing, 100039, China
|
22
|
Su Y, Wang M, Wang P, Zheng C, Liu Y, Zeng X. Deep learning joint models for extracting entities and relations in biomedical: a survey and comparison. Brief Bioinform 2022; 23:6686739. [PMID: 36125190 DOI: 10.1093/bib/bbac342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 07/20/2022] [Accepted: 07/25/2022] [Indexed: 12/14/2022] Open
Abstract
The rapid development of biomedicine has produced a large number of biomedical written materials. These unstructured text data make it seriously challenging for biomedical researchers to find information. Biomedical named entity recognition (BioNER) and biomedical relation extraction (BioRE) are the two most fundamental tasks of biomedical text mining. Accurately and efficiently identifying entities and extracting relations have become very important. Methods that perform the two tasks separately are called pipeline models, and they have shortcomings such as insufficient interaction, low extraction quality and redundancy. To overcome these shortcomings, many deep learning-based joint named entity recognition and relation extraction models have been proposed, and they have achieved advanced performance. This paper comprehensively summarizes deep learning models for joint named entity recognition and relation extraction for biomedicine. The joint BioNER and BioRE models are discussed in the light of the challenges existing in the BioNER and BioRE tasks. Five joint BioNER and BioRE models and one pipeline model are selected for comparative experiments on four biomedical public datasets, and the experimental results are analyzed. Finally, we discuss the opportunities for future development of deep learning-based joint BioNER and BioRE models.
Affiliation(s)
- Yansen Su
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Minglu Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Pengpeng Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Chunhou Zheng
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Yuansheng Liu
- College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
| | - Xiangxiang Zeng
- College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
|
23
|
Wang SY, Huang J, Hwang H, Hu W, Tao S, Hernandez-Boussard T. Leveraging weak supervision to perform named entity recognition in electronic health records progress notes to identify the ophthalmology exam. Int J Med Inform 2022; 167:104864. [PMID: 36179600 PMCID: PMC9901505 DOI: 10.1016/j.ijmedinf.2022.104864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Revised: 08/11/2022] [Accepted: 09/05/2022] [Indexed: 02/08/2023]
Abstract
OBJECTIVE To develop deep learning models to recognize ophthalmic examination components from clinical notes in electronic health records (EHR) using a weak supervision approach. METHODS A corpus of 39,099 ophthalmology notes weakly labeled for 24 examination entities was assembled from the EHR of one academic center. Four pre-trained transformer-based language models (DistilBert, BioBert, BlueBert, and ClinicalBert) were fine-tuned to this named entity recognition task and compared to a baseline regular expression model. Models were evaluated on the weakly labeled test dataset, a human-labeled sample of that set, and a human-labeled independent dataset. RESULTS On the weakly labeled test set, all transformer-based models had recall > 0.93, with precision varying from 0.815 to 0.843. The baseline model had lower recall (0.769) and precision (0.682). On the human-annotated sample, the baseline model had high recall (0.962, 95% CI 0.955-0.067) with variable precision across entities (0.081-0.999). Bert models had recall ranging from 0.771 to 0.831, and precision >= 0.973. On the independent dataset, precision was 0.926 and recall 0.458 for BlueBert. The baseline model had better recall (0.708, 95% CI 0.674-0.738) but worse precision (0.399, 95% CI 0.352-0.451). CONCLUSION We developed the first deep learning system to recognize eye examination components from clinical notes, leveraging a novel opportunity for weak supervision. Transformer-based models had high precision on human-annotated labels, whereas the baseline model had poor precision but higher recall. This system may be used to improve cohort and feature identification using free-text notes. Our weakly supervised approach may help amass large datasets of domain-specific entities from EHRs in many fields.
Affiliation(s)
- Sophia Y Wang
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
| | - Justin Huang
- Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Hannah Hwang
- Department of Ophthalmology, Weill Cornell Medicine, New York, NY, USA
| | - Wendeng Hu
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
| | - Shiqi Tao
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
|
24
|
Review on knowledge extraction from text and scope in agriculture domain. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10239-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
25
|
Morandini P, Laino ME, Paoletti G, Carlucci A, Tommasini T, Angelotti G, Pepys J, Canonica GW, Heffler E, Savevski V, Puggioni F. Artificial intelligence processing electronic health records to identify commonalities and comorbidities cluster at Immuno Center Humanitas. Clin Transl Allergy 2022; 12:e12144. [PMID: 35702725 PMCID: PMC9175261 DOI: 10.1002/clt2.12144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2021] [Revised: 02/24/2022] [Accepted: 03/28/2022] [Indexed: 11/12/2022] Open
Abstract
Background Comorbidities are common in chronic inflammatory conditions, requiring a multidisciplinary treatment approach. Understanding the link between a single disease and its comorbidities is important for appropriate treatment and management. We evaluate the ability of an NLP-based process for knowledge discovery to detect information about pathologies, patients' phenotypes, doctors' prescriptions and commonalities in electronic medical records by extracting information from free narrative text written by clinicians during medical visits, thereby enriching real-world evidence data from a multidisciplinary setting. Methods We collected clinical notes from the Allergy Department of Humanitas Research Hospital written in the last 3 years and used them to look for diseases that cluster together as comorbidities associated with the main pathology of our patients, and for the extent of prescription of systemic corticosteroids, thus evaluating the ability of NLP-based tools for knowledge discovery to extract structured information from free text. Results We found that the 3 most frequent comorbidities to appear in our clusters were asthma, rhinitis, and urticaria, and that 991 (of 2057) patients suffered from at least one of these comorbidities. The clusters that co-occur particularly often are oral allergy syndrome and urticaria (131 patients), angioedema and urticaria (105 patients), and rhinitis and asthma (227 patients). With regard to systemic corticosteroid prescription volume by our clinicians, we found it was lower than in the therapy the patients followed before coming to our attention, with the exception of two diseases: chronic obstructive pulmonary disease and angioedema. Conclusions This analysis seems to be valid and is confirmed by data from the literature. This means that NLP tools could have a significant role in many other research fields of medicine, as they may help identify other important, and possibly previously neglected, clusters of patients with comorbidities and commonalities. Another potential benefit of this approach lies in its ability to foster a multidisciplinary approach, using the same drugs to treat pathologies normally treated by physicians in different branches of medicine, thus saving resources and improving the pharmacological management of patients.
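The pairwise comorbidity counts reported in this abstract (for example, rhinitis and asthma in 227 patients) are the kind of figure that a simple co-occurrence tally produces once conditions have been extracted per patient. The sketch below is only an illustration under that assumption: the function name `pair_counts` and the toy records are hypothetical and are not taken from the study or its pipeline.

```python
from collections import Counter
from itertools import combinations


def pair_counts(patients):
    """Count how many patients carry each unordered pair of conditions."""
    counts = Counter()
    for conditions in patients.values():
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(conditions)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy records, not data from the study.
toy_patients = {
    "p1": ["asthma", "rhinitis"],
    "p2": ["rhinitis", "asthma", "urticaria"],
    "p3": ["urticaria", "angioedema"],
    "p4": ["asthma"],
}

counts = pair_counts(toy_patients)
for pair, n in counts.most_common():
    print(pair, n)
```

Sorting each patient's condition set before pairing keeps the tally symmetric, so a pair is counted once per patient regardless of the order in which the conditions appear in the notes.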
Affiliation(s)
- Maria Elena Laino
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Giovanni Paoletti
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
- Tobia Tommasini
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Giovanni Angelotti
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Jack Pepys
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Giorgio Walter Canonica
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
- Enrico Heffler
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
- Victor Savevski
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Francesca Puggioni
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
26
|
Yang H, Lee N, Park B, Park J, Lee J, Jang HS, Yoo H. Hierarchical network analysis of co-occurring bioentities in literature. Sci Rep 2022; 12:7885. [PMID: 35550589 PMCID: PMC9098521 DOI: 10.1038/s41598-022-12093-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2021] [Accepted: 05/03/2022] [Indexed: 11/09/2022] Open
Abstract
Biomedical databases grow by more than a thousand new publications every day. The large volume of biomedical literature that is being published at an unprecedented rate hinders the discovery of relevant knowledge from keywords of interest to gather new insights and form hypotheses. A text-mining tool, PubTator, helps to automatically annotate bioentities, such as species, chemicals, genes, and diseases, from PubMed abstracts and full-text articles. However, the manual re-organization and analysis of bioentities is a non-trivial and highly time-consuming task. ChexMix was designed to extract the unique identifiers of bioentities from query results. Herein, ChexMix was used to construct a taxonomic tree with allied species among Korean native plants and to extract the medical subject headings unique identifier of the bioentities, which co-occurred with the keywords in the same literature. ChexMix discovered the allied species related to a keyword of interest and experimentally proved its usefulness for multi-species analysis.
Affiliation(s)
- Heejung Yang
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea; Bionsight, Inc., Chuncheon, 24341, Republic of Korea
- Namgil Lee
- Bionsight, Inc., Chuncheon, 24341, Republic of Korea; Department of Information Statistics, Kangwon National University, Gangwondaehak-gil 1, Chuncheon, Gangwon, 24341, Republic of Korea
- Beomjun Park
- Bionsight, Inc., Chuncheon, 24341, Republic of Korea
- Jinyoung Park
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Jiho Lee
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Hyeon Seok Jang
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Hojin Yoo
- Bionsight, Inc., Chuncheon, 24341, Republic of Korea
27
|
Comparison of Text Mining Models for Food and Dietary Constituent Named-Entity Recognition. Machine Learning and Knowledge Extraction 2022. [DOI: 10.3390/make4010012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
Biomedical Named-Entity Recognition (BioNER) has become an essential part of text mining due to the continuously increasing digital archives of biological and medical articles. While there are many well-performing BioNER tools for entities such as genes, proteins, diseases or species, there is very little research into food and dietary constituent named-entity recognition. For this reason, in this paper, we study seven BioNER models for food and dietary constituents recognition. Specifically, we study a dictionary-based model, a conditional random fields (CRF) model and a new hybrid model, called FooDCoNER (Food and Dietary Constituents Named-Entity Recognition), which we introduce combining the former two models. In addition, we study deep language models including BERT, BioBERT, RoBERTa and ELECTRA. As a result, we find that FooDCoNER does not only lead to the overall best results, comparable with the deep language models, but FooDCoNER is also much more efficient with respect to run time and sample size requirements of the training data. The latter has been identified via the study of learning curves. Overall, our results not only provide a new tool for food and dietary constituent NER but also shed light on the difference between classical machine learning models and recent deep language models.
28
|
Borchert F, Meister L, Langer T, Follmann M, Arnrich B, Schapranow MP. Controversial Trials First: Identifying Disagreement Between Clinical Guidelines and New Evidence. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2022; 2021:237-246. [PMID: 35308948 PMCID: PMC8861732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Clinical guidelines integrate latest evidence to support clinical decision-making. As new research findings are published at an increasing rate, it would be helpful to detect when such results disagree with current guideline recommendations. In this work, we describe a software system for the automatic identification of disagreement between clinical guidelines and published research. A critical feature of the system is the extraction and cross-lingual normalization of information through natural language processing. The initial version focuses on the detection of cancer treatments in clinical trial reports that are not addressed in oncology guidelines. We evaluate the relevance of trials retrieved by our system retrospectively by comparison with historic guideline updates and also prospectively through manual evaluation by guideline experts. The system improves precision over state-of-the-art literature research strategies while maintaining near-total recall. Detailed error analysis highlights challenges for fine-grained clinical information extraction, in particular when extracting population definitions for tumor-agnostic therapies.
Affiliation(s)
- Florian Borchert
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Germany
- Laura Meister
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Germany
- Thomas Langer
- German Guideline Program in Oncology, German Cancer Society, Berlin, Germany
- Markus Follmann
- German Guideline Program in Oncology, German Cancer Society, Berlin, Germany
- Bert Arnrich
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Germany
29
|
Walker VR, Schmitt CP, Wolfe MS, Nowak AJ, Kulesza K, Williams AR, Shin R, Cohen J, Burch D, Stout MD, Shipkowski KA, Rooney AA. Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr. Environment International 2022; 159:107025. [PMID: 34920276 DOI: 10.1016/j.envint.2021.107025] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/01/2021] [Revised: 10/07/2021] [Accepted: 12/03/2021] [Indexed: 06/14/2023]
Abstract
INTRODUCTION There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxicology or environmental health questions due to unique data needs to support reviews in these fields. OBJECTIVES To develop and evaluate a flexible, web-based tool for semi-automated data extraction that: 1) makes data extraction predictions with user verification, 2) integrates token-level annotations, and 3) connects extracted entities to support hierarchical data extraction. METHODS Dextr was developed with Agile software methodology using a two-team approach. The development team outlined proposed features and coded the software. The advisory team guided developers and evaluated Dextr's performance on precision, recall, and extraction time by comparing a manual extraction workflow to a semi-automated extraction workflow using a dataset of 51 environmental health animal studies. RESULTS The semi-automated workflow did not appear to affect precision rate (96.0% vs. 95.4% manual, p = 0.38), resulted in a small reduction in recall rate (91.8% vs. 97.0% manual, p < 0.01), and substantially reduced the median extraction time (436 s vs. 933 s per study manual, p < 0.01) compared to a manual workflow. DISCUSSION Dextr provides similar performance to manual extraction in terms of recall and precision and greatly reduces data extraction time. Unlike other tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study), properly connect the extracted elements within a study, and effectively limit the work required by researchers to generate machine-readable, annotated exports. 
The Dextr tool addresses data-extraction challenges associated with environmental health sciences literature with a simple user interface, incorporates the key capabilities of user verification and entity connecting, provides a platform for further automation developments, and has the potential to improve data extraction for literature reviews in this and other fields.
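The precision and recall rates quoted in this abstract compare extracted data elements against a reference extraction. As a minimal sketch of how those two measures are computed, assuming each extraction can be represented as a set of (field, value) pairs, the example below uses a hypothetical helper `precision_recall` and made-up toy values; it does not reproduce Dextr's own evaluation procedure.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of an extracted set against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy extraction for one study, not data from the paper.
gold = {
    ("species", "rat"),
    ("dose", "10 mg/kg"),
    ("route", "oral"),
    ("duration", "28 d"),
}
extracted = {
    ("species", "rat"),
    ("dose", "10 mg/kg"),
    ("route", "gavage"),
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this toy case two of three extracted items are correct and two of four gold items are recovered, which mirrors the pattern reported in the abstract: a workflow can keep precision high while missing some items, which lowers recall.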
Affiliation(s)
- Vickie R Walker
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Charles P Schmitt
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Mary S Wolfe
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Rob Shin
- ICF, Research Triangle Park, NC, USA
- Matthew D Stout
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Kelly A Shipkowski
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Andrew A Rooney
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
30
|
Analyzing COVID-19 Medical Papers Using Artificial Intelligence: Insights for Researchers and Medical Professionals. Big Data and Cognitive Computing 2022. [DOI: 10.3390/bdcc6010004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023]
Abstract
Since the beginning of the COVID-19 pandemic almost two years ago, there have been more than 700,000 scientific papers published on the subject. An individual researcher cannot possibly get acquainted with such a huge text corpus and, therefore, some help from artificial intelligence (AI) is highly needed. We propose an AI-based tool to help researchers navigate collections of medical papers in a meaningful way and extract knowledge from scientific COVID-19 papers. The main idea of our approach is to get as much semi-structured information from the text corpus as possible, using named entity recognition (NER) with a model called PubMedBERT and the Text Analytics for Health service, and then store the data in a NoSQL database for further fast processing and insight generation. Additionally, the contexts in which the entities were used (neutral or negative) are determined. Application of NLP and text-based emotion detection (TBED) methods to the COVID-19 text corpus allows us to gain insights on important issues of diagnosis and treatment (such as changes in medical treatment over time, joint treatment strategies using several medications, and the connection between signs and symptoms of coronavirus).
31
|
Abdulkadhar S, Natarajan J. A Text Mining Protocol for Mining Biological Pathways and Regulatory Networks from Biomedical Literature. Methods Mol Biol 2022; 2496:141-157. [PMID: 35713863 DOI: 10.1007/978-1-0716-2305-3_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
A biological pathway or regulatory network is a collection of molecular regulators that can activate changes in cellular processes, leading to the assembly of new molecules through a series of actions among the molecules. There are three important kinds of pathway in systems biology studies, namely signaling pathways, metabolic pathways, and genetic pathways (or gene regulatory networks). Recently, biological pathway construction from scientific literature has been given much attention, as the scientific literature contains a rich set of linguistic features from which to extract biological associations between genes and proteins. These associations can be combined to construct biological networks. Here, we present a brief overview of various biological pathways, biomedical text resources/corpora for network construction, and state-of-the-art existing methods for network construction, followed by our hybrid text mining protocol for extracting pathways and regulatory networks from biomedical literature.
Affiliation(s)
- Sabenabanu Abdulkadhar
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
- Jeyakumar Natarajan
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
32
|
Using semantics to scale up evidence-based chemical risk-assessments. PLoS One 2021; 16:e0260712. [PMID: 34910747 PMCID: PMC8673667 DOI: 10.1371/journal.pone.0260712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 11/15/2021] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND The manual processes used for risk assessments are not scaling to the amount of data available. Although automated approaches appear promising, they must be transparent in a public policy setting. OBJECTIVE Our goal is to create an automated approach that moves beyond retrieval to the extraction step of the information synthesis process, where evidence is characterized as supporting, refuting, or neutral with respect to a given outcome. METHODS We combine knowledge resources and natural language processing to resolve coordinated ellipses and thus avoid surface level differences between concepts in an ontology and outcomes in an abstract. As with a systematic review, the search criterion, and inclusion and exclusion criterion are explicit. RESULTS The system scales to 482K abstracts on 27 chemicals. Results for three endpoints that are critical for cancer risk assessments show that refuting evidence (where the outcome decreased) was higher for cell proliferation (45.9%), and general cell changes (37.7%) than for cell death (25.0%). Moreover, cell death was the only end point where supporting claims were the majority (61.3%). If the number of abstracts that measure an outcome was used as a proxy for association there would be a stronger association with cell proliferation than cell death (20/27 chemicals). However, if the amount of supporting evidence was used (where the outcome increased) the conclusion would change for 21/27 chemicals (20 from proliferation to death and 1 from death to proliferation). CONCLUSIONS We provide decision makers with a visual representation of supporting, neutral, and refuting evidence whilst maintaining the reproducibility and transparency needed for public policy. Our findings show that results from the retrieval step where the number of abstracts that measure an outcome are reported can be misleading if not accompanied with results from the extraction step where the directionality of the outcome is established.
33
|
Le Guillarme N, Thuiller W. TaxoNERD: Deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13778] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
Affiliation(s)
- Nicolas Le Guillarme
- CNRS, LECA, Laboratoire d'Ecologie Alpine, Université Grenoble Alpes, University Savoie Mont Blanc, Grenoble, France
- Wilfried Thuiller
- CNRS, LECA, Laboratoire d'Ecologie Alpine, Université Grenoble Alpes, University Savoie Mont Blanc, Grenoble, France
34
|
Yim WWY, Kurikawa Y, Mizushima N. An exploratory text analysis of the autophagy research field. Autophagy 2021; 18:1648-1661. [PMID: 34812110 PMCID: PMC9298454 DOI: 10.1080/15548627.2021.1995151] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/20/2022] Open
Abstract
After its discovery in the 1950s, the autophagy research field has seen its annual number of publications climb from tens to thousands. The ever-growing number of autophagy publications is a wealth of information but presents a challenge to researchers, especially those new to the field, who are looking for a general overview of the field to, for example, determine current topics of the field or formulate new hypotheses. Here, we employed text mining tools to extract research trends in the autophagy field, including those of genes, terms, and topics. The publication trend of the field can be separated into three phases. The exponential rise in publication number began in the last phase and is most likely spurred by a series of highly cited research papers published in previous phases. The exponential increase in papers has resulted in a larger variety of research topics, with the majority involving those that are directly physiologically relevant, such as disease and modulating autophagy. Our findings provide researchers a summary of the history of the autophagy research field and perhaps hints of what is to come. Abbreviations: 5Y-IF: 5-year impact factor; AIS: article influence score; EM: electron microscopy; HGNC: HUGO gene nomenclature committee; LDA: latent Dirichlet allocation; MeSH: medical subject headings; ncRNA: non-coding RNA.
Affiliation(s)
- Willa Wen-You Yim
- Department of Biochemistry and Molecular Biology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yoshitaka Kurikawa
- Department of Biochemistry and Molecular Biology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Noboru Mizushima
- Department of Biochemistry and Molecular Biology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
35
|
Green NL. Argumentation schemes: From genetics to international relations to environmental science policy to AI ethics. Argument & Computation 2021. [DOI: 10.3233/aac-210551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Argumentation schemes have played a key role in our research projects on computational models of natural argument over the last decade. The catalogue of schemes in Walton, Reed and Macagno’s 2008 book, Argumentation Schemes, served as our starting point for analysis of the naturally occurring arguments in written text, i.e., text in different genres having different types of author, audience, and subject domain (genetics, international relations, environmental science policy, AI ethics), for different argument goals, and for different possible future applications. We would often first attempt to analyze the arguments in our corpora in terms of those schemes, then adapt schemes as needed for the goals of the project, and in some cases implement them for use in computational models. Among computational researchers, the main interest in argumentation schemes has been for use in argument mining by applying machine learning methods to existing argument corpora. In contrast, a primary goal of our research has been to learn more about written arguments themselves in various contemporary fields. Our approach has been to manually analyze semantics, discourse structure, argumentation, and rhetoric in texts. Another goal has been to create sharable digital corpora containing the results of our studies. Our approach has been to define argument schemes for use by human corpus annotators or for use in logic programs for argument mining. The third goal is to design useful computer applications based upon our studies, such as argument diagramming systems that provide argument schemes as building blocks. This paper describes each of the various projects: the methods, the argument schemes that were identified, and how they were used. Then a synthesis of the results is given with a discussion of open issues.
Affiliation(s)
- Nancy L. Green
- University of North Carolina Greensboro, Greensboro, NC 27402, USA. E-mail:
36
|
Baltoumas FA, Zafeiropoulou S, Karatzas E, Paragkamian S, Thanati F, Iliopoulos I, Eliopoulos AG, Schneider R, Jensen LJ, Pafilis E, Pavlopoulos GA. OnTheFly 2.0: a text-mining web application for automated biomedical entity recognition, document annotation, network and functional enrichment analysis. NAR Genom Bioinform 2021; 3:lqab090. [PMID: 34632381 PMCID: PMC8494211 DOI: 10.1093/nargab/lqab090] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/14/2021] [Revised: 09/09/2021] [Accepted: 09/20/2021] [Indexed: 02/06/2023] Open
Abstract
Extracting and processing information from documents is of great importance as lots of experimental results and findings are stored in local files. Therefore, extracting and analyzing biomedical terms from such files in an automated way is absolutely necessary. In this article, we present OnTheFly2.0, a web application for extracting biomedical entities from individual files such as plain texts, office documents, PDF files or images. OnTheFly2.0 can generate informative summaries in popup windows containing knowledge related to the identified terms along with links to various databases. It uses the EXTRACT tagging service to perform named entity recognition (NER) for genes/proteins, chemical compounds, organisms, tissues, environments, diseases, phenotypes and gene ontology terms. Multiple files can be analyzed, whereas identified terms such as proteins or genes can be explored through functional enrichment analysis or be associated with diseases and PubMed entries. Finally, protein-protein and protein-chemical networks can be generated with the use of STRING and STITCH services. To demonstrate its capacity for knowledge discovery, we interrogated published meta-analyses of clinical biomarkers of severe COVID-19 and uncovered inflammatory and senescence pathways that impact disease pathogenesis. OnTheFly2.0 currently supports 197 species and is available at http://bib.fleming.gr:3838/OnTheFly/ and http://onthefly.pavlopouloslab.info.
Affiliation(s)
- Fotis A Baltoumas
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Sofia Zafeiropoulou
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Evangelos Karatzas
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Savvas Paragkamian
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Foteini Thanati
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Ioannis Iliopoulos
- Department of Basic Sciences, School of Medicine, University of Crete, Heraklion 71003, Crete, Greece
- Aristides G Eliopoulos
- Department of Biology, School of Medicine, National and Kapodistrian University of Athens, Athens, 70013, Greece
- Reinhard Schneider
- University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, L-4365, Luxembourg
- Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, 2200, Denmark
- Evangelos Pafilis
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Georgios A Pavlopoulos
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
37
|
Qin X, Li L, Sun X. Reply to letter to the editor by Kharawala S, et al: Artificial intelligence for assisting systematic reviews: Opportunities with continuing challenges. J Clin Epidemiol 2021; 138:245-246. [PMID: 33753226 DOI: 10.1016/j.jclinepi.2021.03.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2021] [Accepted: 03/11/2021] [Indexed: 02/08/2023]
Affiliation(s)
- Xuan Qin
- Chinese Evidence-based Medicine Center, Cochrane China Center and National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, Sichuan, China
- Ling Li
- Chinese Evidence-based Medicine Center, Cochrane China Center and National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, Sichuan, China
- Xin Sun
- Chinese Evidence-based Medicine Center, Cochrane China Center and National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, Sichuan, China; Evidence-based Medicine Research Center, School of Basic Science, Jiangxi University of Traditional Chinese Medicine, Nanchang 330004, Jiangxi, China
38
|
Su J, Wu Y, Ting HF, Lam TW, Luo R. RENET2: high-performance full-text gene-disease relation extraction with iterative training data expansion. NAR Genom Bioinform 2021; 3:lqab062. [PMID: 34235433 PMCID: PMC8256824 DOI: 10.1093/nargab/lqab062] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/21/2021] [Revised: 06/16/2021] [Accepted: 06/23/2021] [Indexed: 01/06/2023] Open
Abstract
Relation extraction (RE) is a fundamental task for extracting gene–disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene–disease associations only from single sentences or abstract texts. A few studies have explored extracting gene–disease associations from full-text articles, but there exists a large room for improvements. In this work, we propose RENET2, a deep learning-based RE method, which implements Section Filtering and ambiguous relations modeling to extract gene–disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset to resolve the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene–disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central and found ∼3.72M gene–disease associations; and (ii) the LitCovid articles and ranked the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene–disease association extraction. The source-code, manually curated abstract/full-text training data, and results of RENET2 are available at GitHub.
Affiliation(s)
- Junhao Su
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Ye Wu
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Hing-Fung Ting
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
39
|
Mining Proteome Research Reports: A Bird's Eye View. Proteomes 2021; 9:proteomes9020029. [PMID: 34200663 PMCID: PMC8293458 DOI: 10.3390/proteomes9020029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/22/2021] [Revised: 05/27/2021] [Accepted: 06/08/2021] [Indexed: 01/25/2023] Open
Abstract
The complexity of data has burgeoned to such an extent that scientists of every realm are encountering the incessant challenge of data management. Modern-day analytical approaches, with the help of free source tools and programming languages, have facilitated access to the context of the various domains as well as specific works reported. Here, an attempt has been made to provide a systematic analysis of all the available reports at PubMed on Proteome using text mining. The work comprises scientometrics as well as information extraction to provide the publication trends as well as frequent keywords, bioconcepts and, most importantly, a gene–gene co-occurrence network. Out of 33,028 PMIDs collected initially, the segregation of 24,350 articles under 28 Medical Subject Headings (MeSH) was analyzed and plotted. Keyword link network and density visualizations were provided for the top 1000 frequent MeSH keywords. PubTator was used, and 322,026 bioconcepts were extracted under 10 classes (such as Gene, Disease, CellLine, etc.). Co-occurrence networks were constructed for PMID-bioconcept as well as bioconcept–bioconcept associations. Further, for creation of a subnetwork with respect to gene–gene co-occurrence, a total of 11,100 unique genes participated, with mTOR and AKT showing the highest (64) number of connections. The gene p53 was the most popular one in the network in accordance with both the degree and weighted degree centrality, which were 425 and 1414, respectively. The present study is an amalgam of bibliometrics and scientific data mining methods looking deeper into the whole-scale analysis of available literature on proteome.
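The degree and weighted degree figures mentioned in this abstract can be illustrated with a small gene–gene co-occurrence network. The sketch below assumes, hypothetically, that each article has already been reduced to a list of gene mentions (for example by an annotation tool); the function names `cooccurrence_network` and `degree_stats` and the toy articles are illustrative only and are not the authors' code. Degree is reported here as a raw neighbor count; a normalized degree centrality would divide by the number of other nodes.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(docs):
    """Map each unordered gene pair to the number of docs mentioning both."""
    weights = defaultdict(int)
    for genes in docs:
        # Deduplicate within a document, then pair in a fixed order.
        for a, b in combinations(sorted(set(genes)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def degree_stats(weights):
    """Return (degree, weighted_degree) dictionaries keyed by gene."""
    degree = defaultdict(int)
    weighted = defaultdict(int)
    for (a, b), w in weights.items():
        for gene in (a, b):
            degree[gene] += 1    # one more distinct neighbor
            weighted[gene] += w  # sum of co-occurrence counts
    return dict(degree), dict(weighted)


# Hypothetical toy articles, not data from the study.
toy_docs = [
    ["TP53", "MTOR", "AKT1"],
    ["TP53", "MTOR"],
    ["TP53", "EGFR"],
]

edges = cooccurrence_network(toy_docs)
degree, weighted = degree_stats(edges)
print(degree)
print(weighted)
```

The two measures can rank genes differently: degree counts distinct partners, while weighted degree also rewards repeated co-mentions with the same partner.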
|
40
|
From the Digital Data Revolution toward a Digital Society: Pervasiveness of Artificial Intelligence. Machine Learning and Knowledge Extraction 2021. [DOI: 10.3390/make3010014] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/01/2023]
Abstract
Technological progress has led to powerful computers and communication technologies that now penetrate all areas of science, industry and our private lives. As a consequence, all these areas are generating digital traces of data amounting to big data resources. This opens unprecedented opportunities but also challenges for the analysis, management, interpretation and responsible usage of such data. In this paper, we discuss these developments and the fields that have been particularly affected by the digital revolution. Our discussion is AI-centered, showing domain-specific prospects but also intricacies for method development in artificial intelligence. For instance, we discuss recent breakthroughs in deep learning algorithms and artificial intelligence as well as advances in text mining and natural language processing, e.g., word-embedding methods that enable the processing of large amounts of text data from diverse sources such as governmental reports, blog entries in social media or clinical health records of patients. Furthermore, we discuss the necessity of further improving general artificial intelligence approaches and of utilizing advanced learning paradigms. This leads to arguments for the establishment of statistical artificial intelligence. Finally, we provide an outlook on future challenges that are of crucial importance for the development of all fields, including ethical AI and the influence of bias on AI systems. As a potential end point of this development, we define digital society as the asymptotic limiting state of the digital economy that emerges from fully connected information and communication technologies enabling the pervasiveness of AI. Overall, our discussion provides a perspective on the elaborate relatedness of digital data and AI systems.
|
41
|
Biziukova N, Tarasova O, Ivanov S, Poroikov V. Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies. Front Genet 2021; 11:618862. [PMID: 33414815 PMCID: PMC7783389 DOI: 10.3389/fgene.2020.618862] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/18/2020] [Accepted: 11/26/2020] [Indexed: 12/16/2022]
Abstract
Text analysis can help to identify named entities (NEs) of small molecules, proteins, and genes. Such data are very important for the analysis of molecular mechanisms of disease progression and the development of new strategies for the treatment of various diseases and pathological conditions. The texts of publications represent a primary source of information, and they are especially important for collecting data of the highest quality because the information becomes available immediately, in comparison with databases. In our study, we aimed to develop and test an approach to named entity recognition in the abstracts of publications. More specifically, we have developed and tested an algorithm based on conditional random fields, which recognizes NEs of (i) genes and proteins and (ii) chemicals. Careful selection of abstracts strictly related to the subject of interest makes it possible to extract the NEs strongly associated with that subject. To test the applicability of our approach, we applied it to the extraction of (i) potential HIV inhibitors and (ii) a set of proteins and genes potentially responsible for viremic control in HIV-positive patients. The computational experiments performed provide estimates of the accuracy of recognition of chemical NEs and proteins (genes). The precision of chemical NE recognition is over 0.91, recall is 0.86, and the F1-score (harmonic mean of precision and recall) is 0.89; the precision of recognition of protein and gene names is over 0.86, recall is 0.83, and the F1-score is above 0.85. Evaluation of the algorithm on two case studies related to HIV treatment confirms our suggestion that it is possible to extract the NEs strongly relevant to (i) HIV inhibitors and (ii) a group of patients, i.e., HIV-positive individuals able to maintain an undetectable HIV-1 viral load over time in the absence of antiretroviral therapy. Analysis of the results obtained provides insights into the function of proteins that may be responsible for viremic control. Our study demonstrated the applicability of the developed approach for the extraction of useful data on HIV treatment.
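The F1-score mentioned in the abstract is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch of the calculation follows; the function name is hypothetical, and the inputs are the rounded figures quoted in the abstract. Because the abstract gives the precision values only as lower bounds ("over 0.91", "over 0.86"), the results computed from them can fall slightly below the reported F1-scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract; precision is a lower bound there.
chemical_f1 = f1_score(0.91, 0.86)
protein_gene_f1 = f1_score(0.86, 0.83)

print(f"chemical NEs:     F1 >= {chemical_f1:.3f}")
print(f"protein/gene NEs: F1 >= {protein_gene_f1:.3f}")
```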
Affiliation(s)
- Nadezhda Biziukova: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Olga Tarasova: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Sergey Ivanov: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia; Department of Bioinformatics, Faculty of Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russia
- Vladimir Poroikov: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
|