1
Mojibian A, Jaskolka J, Ching G, Lee B, Myers R, Devine C, Nicolaou S, Parker W. The Efficacy of a Named Entity Recognition AI Model for Identifying Incidental Pulmonary Nodules in CT Reports. Can Assoc Radiol J 2025; 76:68-75. [PMID: 39066637 DOI: 10.1177/08465371241266785] [Indexed: 07/30/2024] Open
Abstract
Purpose: This study evaluates the efficacy of a commercial medical Named Entity Recognition (NER) model combined with a post-processing protocol in identifying incidental pulmonary nodules from CT reports. Methods: We analyzed 9165 anonymized CT reports and classified them into 3 categories: no nodules, nodules present, and nodules >6 mm. For each report, a generic medical NER model annotated entities and their relations, which were then filtered through inclusion/exclusion criteria selected to identify pulmonary nodules. Ground truth was established by manual review. To better understand the relationship between model performance and nodule prevalence, a subset of the data was programmatically balanced to equalize the number of reports in each class category. Results: In the unbalanced subset of the data, the model achieved a sensitivity of 97%, specificity of 99%, and accuracy of 99% in detecting pulmonary nodules mentioned in the reports. For nodules >6 mm, sensitivity was 95%, specificity was 100%, and accuracy was 100%. In the balanced subset of the data, sensitivity was 99%, specificity 96%, and accuracy 97% for nodule detection; for larger nodules, sensitivity was 94%, specificity 99%, and accuracy 98%. Conclusions: The NER model demonstrated high sensitivity and specificity in detecting pulmonary nodules reported in CT scans, including those >6 mm which are potentially clinically significant. The results were consistent across both unbalanced and balanced datasets indicating that the model performance is independent of nodule prevalence. Implementing this technology in hospital systems could automate the identification of at-risk patients, ensuring timely follow-up and potentially reducing missed or late-stage cancer diagnoses.
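The report-level metrics quoted in this abstract can be computed from a confusion matrix. The following is a minimal sketch, not the study's evaluation code: the `Confusion` container and all counts are hypothetical, chosen only to show that sensitivity and specificity stay fixed when the class balance changes while accuracy shifts with prevalence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Report-level confusion counts for one class (hypothetical container)."""
    tp: int  # reports with a nodule, flagged by the model
    fp: int  # reports without a nodule, flagged by the model
    tn: int  # reports without a nodule, not flagged
    fn: int  # reports with a nodule, missed by the model


def sensitivity(c: Confusion) -> float:
    # Fraction of truly positive reports that were detected.
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    # Fraction of truly negative reports that were left unflagged.
    return c.tn / (c.tn + c.fp)


def accuracy(c: Confusion) -> float:
    # Fraction of all reports classified correctly.
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Illustrative counts only (NOT taken from the study):
# 10% prevalence versus a balanced 50/50 split, same per-class error rates.
unbalanced = Confusion(tp=97, fp=9, tn=891, fn=3)
balanced = Confusion(tp=485, fp=5, tn=495, fn=15)

for name, c in (("unbalanced", unbalanced), ("balanced", balanced)):
    print(name, round(sensitivity(c), 3), round(specificity(c), 3), round(accuracy(c), 3))
```

With these made-up counts, sensitivity (0.97) and specificity (0.99) are identical in both sets, while accuracy moves from 0.988 to 0.980 because it weights the two classes by their prevalence.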
Affiliation(s)
- Alireza Mojibian
  - Sapien Machine Learning Corporation (SapienML), Vancouver, BC, Canada
- Jeff Jaskolka
  - Radiology Department, Brampton Civic Hospital, Brampton, ON, Canada
  - Faculty of Medicine - Medical Imaging, University of Toronto, Toronto, ON, Canada
- Geoffrey Ching
  - Schulich School of Medicine & Dentistry - University of Western Ontario, London, ON, Canada
- Brian Lee
  - Sapien Machine Learning Corporation (SapienML), Vancouver, BC, Canada
- Renelle Myers
  - Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
  - BC Cancer Agency, Provincial Health Services Authority, Vancouver, BC, Canada
  - Respirology, Vancouver General Hospital, Vancouver, BC, Canada
- Chloe Devine
  - Sapien Machine Learning Corporation (SapienML), Vancouver, BC, Canada
- Savvas Nicolaou
  - Sapien Machine Learning Corporation (SapienML), Vancouver, BC, Canada
  - Faculty of Medicine, University of British Columbia, Vancouver, BC, Canada
  - Radiology Department, Vancouver General Hospital, Vancouver, BC, Canada
- William Parker
  - Sapien Machine Learning Corporation (SapienML), Vancouver, BC, Canada
  - Radiology Department, Vancouver General Hospital, Vancouver, BC, Canada
  - Radiology Department, Nanaimo Regional General Hospital, Nanaimo, BC, Canada
2
Pinard CJ, Poon AC, Lagree A, Wu KC, Li J, Tran WT. Precision in Parsing: Evaluation of an Open-Source Named Entity Recognizer (NER) in Veterinary Oncology. Vet Comp Oncol 2024. [PMID: 39711253 DOI: 10.1111/vco.13035] [Received: 07/15/2024] [Revised: 11/14/2024] [Accepted: 12/02/2024] [Indexed: 12/24/2024]
Abstract
Integrating Artificial Intelligence (AI) through Natural Language Processing (NLP) can improve veterinary medical oncology clinical record analytics. Named Entity Recognition (NER), a critical component of NLP, can facilitate efficient data extraction and automated labelling for research and clinical decision-making. This study assesses the efficacy of the Bio-Epidemiology-NER (BioEN), an open-source NER developed using human epidemiological and medical data, on veterinary medical oncology records. The NER's performance was compared with manual annotations by a veterinary medical oncologist and a veterinary intern. Evaluation metrics included Jaccard similarity, intra-rater reliability, ROUGE scores, and standard NER performance metrics (precision, recall, F1-score). Results indicate poor direct translatability to veterinary medical oncology record text and room for improvement in the NER's performance, with precision, recall, and F1-score suggesting a marginally better alignment with the oncologist than the intern. While challenges remain, these insights contribute to the ongoing development of AI tools tailored for veterinary healthcare and highlight the need for veterinary-specific models.
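Two of the evaluation metrics named in this abstract, Jaccard similarity and precision/recall/F1, can be illustrated by comparing a model's entity set with an annotator's entity set for one record. This is a minimal sketch under stated assumptions: entities are compared by exact match on (text, label) pairs, and the example entities and labels are hypothetical rather than drawn from the study or from BioEN's label set.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two entity collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 of predicted entities against gold."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one record (illustration only).
gold = {("lymphoma", "Disease"), ("prednisone", "Medication"), ("vomiting", "Sign")}
predicted = {("lymphoma", "Disease"), ("vomiting", "Sign"),
             ("dog", "Species"), ("CHOP", "Medication")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} jaccard={jaccard(predicted, gold):.3f}")
```

Exact matching is the strictest choice; a relaxed variant that credits overlapping spans would yield higher scores on the same annotations.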
Affiliation(s)
- Christopher J Pinard
  - Department of Clinical Studies, Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada
  - Department of Oncology, Lakeshore Animal Health Partners, Mississauga, Ontario, Canada
  - Centre for Advancing Responsible & Ethical Artificial Intelligence, University of Guelph, Guelph, Ontario, Canada
  - Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
  - ANI.ML Research, ANI.ML Health Inc., Toronto, Ontario, Canada
- Andrew C Poon
  - VCA Mississauga Oakville Veterinary Emergency Hospital, Mississauga, Ontario, Canada
- Andrew Lagree
  - Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
  - ANI.ML Research, ANI.ML Health Inc., Toronto, Ontario, Canada
  - Odette Cancer Program, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- Kuan-Chuen Wu
  - ANI.ML Research, ANI.ML Health Inc., Toronto, Ontario, Canada
- Jiaxu Li
  - Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
- William T Tran
  - Radiogenomics Laboratory, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
  - Odette Cancer Program, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada
  - Department of Radiation Oncology, University of Toronto, Toronto, Ontario, Canada
  - Temerty Centre for AI Research and Education in Medicine, University of Toronto, Toronto, Ontario, Canada
3
Nourani E, Koutrouli M, Xie Y, Vagiaki D, Pyysalo S, Nastou K, Brunak S, Jensen LJ. Lifestyle factors in the biomedical literature: an ontology and comprehensive resources for named entity recognition. Bioinformatics 2024; 40:btae613. [PMID: 39412443 PMCID: PMC11543612 DOI: 10.1093/bioinformatics/btae613] [Received: 05/22/2024] [Revised: 09/26/2024] [Accepted: 10/15/2024] [Indexed: 11/09/2024] Open
Abstract
MOTIVATION Despite lifestyle factors (LSFs) being increasingly acknowledged in shaping individual health trajectories, particularly in chronic diseases, they have still not been systematically described in the biomedical literature. This is in part because no named entity recognition (NER) system exists, which can comprehensively detect all types of LSFs in text. The task is challenging due to their inherent diversity, lack of a comprehensive LSF classification for dictionary-based NER, and lack of a corpus for deep learning-based NER. RESULTS We present a novel lifestyle factor ontology (LSFO), which we used to develop a dictionary-based system for recognition and normalization of LSFs. Additionally, we introduce a manually annotated corpus for LSFs (LSF200) suitable for training and evaluation of NER systems, and use it to train a transformer-based system. Evaluating the performance of both NER systems on the corpus revealed an F-score of 64% for the dictionary-based system and 76% for the transformer-based system. Large-scale application of these systems on PubMed abstracts and PMC Open Access articles identified over 300 million mentions of LSF in the biomedical literature. AVAILABILITY AND IMPLEMENTATION LSFO, the annotated LSF200 corpus, and the detected LSFs in PubMed and PMC-OA articles using both NER systems, are available under open licenses via the following GitHub repository: https://github.com/EsmaeilNourani/LSFO-expansion. This repository contains links to two associated GitHub repositories and a Zenodo project related to the study. LSFO is also available at BioPortal: https://bioportal.bioontology.org/ontologies/LSFO.
Affiliation(s)
- Esmaeil Nourani
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
  - Faculty of Information Technology and Computer Engineering, Azarbaijan Shahid Madani University, Tabriz, Iran
- Mikaela Koutrouli
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Yijia Xie
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Danai Vagiaki
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Sampo Pyysalo
  - TurkuNLP Group, Department of Computing, Faculty of Technology, University of Turku, Turku 20014, Finland
- Katerina Nastou
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Søren Brunak
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
4
Martín-Noguerol T, López-Úbeda P, Paulano-Godino F, Luna A. Natural language processing-based analysis of the level of adoption by expert radiologists of the ASSR, ASNR and NASS version 2.0 of lumbar disc nomenclature: an eight-year survey. Quant Imaging Med Surg 2024; 14:7780-7790. [PMID: 39544464 PMCID: PMC11558493 DOI: 10.21037/qims-23-1294] [Received: 09/17/2023] [Accepted: 12/26/2023] [Indexed: 11/17/2024]
Abstract
Background The American Society of Spine Radiology (ASSR), American Society of Neuroradiology (ASNR), and North American Spine Society (NASS) published a consensus paper with recommendations for lumbar disc nomenclature reports in 2014. We aimed to evaluate the degree of adoption in our radiology department of the ASSR, ASNR, and NASS 2.0 lumbar spine consensus paper using natural language processing (NLP). Methods In March 2015, we gave a lecture in our radiology department at HT Medica in Jaén (Spain) detailing the changes proposed in the ASSR, ASNR, and NASS consensus on lumbar disc nomenclature, version 2.0. We analyzed 34,064 lumbar spine magnetic resonance imaging (MRI) reports from three different expert radiologists (A, B, and C) performed from May 2010 to February 2015 (15,813 studies) and from March 2015 to February 2022 (18,251 studies). Using an NLP algorithm, we evaluated 29 old and new terms related to 4 different categories: disc with fissures of the annulus, degenerated disc, herniated disc, and location of the disc. Results A relevant decrease in the percentage of use of old terms was found for the degenerated disc category (44.63% for radiologist B and 18.95% for radiologist C) and disc localization (18.86% for radiologist A and 27.73% for radiologist C). Relevant increases in the percentage of use of the new lexicon were observed for terms related to degenerated disc (32.48% for radiologist C), herniated disc (7.27% for radiologist A) and disc localization (36.53% for radiologist C). Conclusions NLP algorithms may help to manage large radiological report datasets to evaluate the impact and degree of adherence of radiologists to recommendations for the use of ASSR, ASNR and NASS lumbar disc nomenclature version 2.0.
Affiliation(s)
- Antonio Luna
  - MRI Unit, Radiology Department, HT Medica, Jaén, Spain
5
Jia Y, Wang H, Yuan Z, Zhu L, Xiang ZL. Biomedical relation extraction method based on ensemble learning and attention mechanism. BMC Bioinformatics 2024; 25:333. [PMID: 39425010 PMCID: PMC11488084 DOI: 10.1186/s12859-024-05951-y] [Received: 06/29/2024] [Accepted: 10/07/2024] [Indexed: 10/21/2024] Open
Abstract
BACKGROUND Relation extraction (RE) plays a crucial role in biomedical research as it is essential for uncovering complex semantic relationships between entities in textual data. Given the significance of RE in biomedical informatics and the increasing volume of literature, there is an urgent need for advanced computational models capable of accurately and efficiently extracting these relationships on a large scale. RESULTS This paper proposes a novel approach, SARE, combining ensemble learning Stacking and attention mechanisms to enhance the performance of biomedical relation extraction. By leveraging multiple pre-trained models, SARE demonstrates improved adaptability and robustness across diverse domains. The attention mechanisms enable the model to capture and utilize key information in the text more accurately. SARE achieved performance improvements of 4.8, 8.7, and 0.8 percentage points on the PPI, DDI, and ChemProt datasets, respectively, compared to the original BERT variant and the domain-specific PubMedBERT model. CONCLUSIONS SARE offers a promising solution for improving the accuracy and efficiency of relation extraction tasks in biomedical research, facilitating advancements in biomedical informatics. The results suggest that combining ensemble learning with attention mechanisms is effective for extracting complex relationships from biomedical texts. Our code and data are publicly available at: https://github.com/GS233/Biomedical .
Affiliation(s)
- Yaxun Jia
  - Department of Radiation Oncology, Shanghai East Hospital, Tongji University School of Medicine, Shanghai, China
- Haoyang Wang
  - Department of Computer College, Beijing Information Science and Technology University, Beijing, China
- Zhu Yuan
  - Department of Information Management, The National Police University for Criminal Justice, Baoding, China
- Lian Zhu
  - Department of Radiation Oncology, Shanghai East Hospital Ji'an hospital, Jian, China
- Zuo-Lin Xiang
  - Department of Radiation Oncology, Shanghai East Hospital, Tongji University School of Medicine, Shanghai, China
  - Department of Radiation Oncology, Shanghai East Hospital Ji'an hospital, Jian, China
6
Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. Improving dictionary-based named entity recognition with deep learning. Bioinformatics 2024; 40:ii45-ii52. [PMID: 39230709 PMCID: PMC11373323 DOI: 10.1093/bioinformatics/btae402] [Indexed: 09/05/2024] Open
Abstract
MOTIVATION Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. RESULTS In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%). AVAILABILITY AND IMPLEMENTATION All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.
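The role of corpus-wide and document-specific block lists in dictionary-based NER can be sketched as a filter applied after dictionary lookup. This is an illustrative toy, not the authors' tagger: the `tag` function, the single-token matching, and the identifiers in the dictionary are all assumptions made for the example.

```python
import re


def tag(text, dictionary, corpus_blocked=frozenset(), doc_blocked=frozenset()):
    """Return (start, end, surface form, identifier) for dictionary hits.

    Names found in either block list are skipped. Matching is limited to
    single alphanumeric tokens and is case-insensitive (simplifying assumption).
    """
    matches = []
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        key = m.group(0).lower()
        if key not in dictionary:
            continue
        if key in corpus_blocked or key in doc_blocked:
            continue  # ambiguous name: blocked everywhere or in this document
        matches.append((m.start(), m.end(), m.group(0), dictionary[key]))
    return matches


# Hypothetical dictionary; the identifiers are placeholders, not database IDs.
dictionary = {"tp53": "GENE:TP53", "was": "GENE:WAS", "cat": "GENE:CAT"}
text = "TP53 was mutated in the cat model."

unfiltered = tag(text, dictionary)
filtered = tag(text, dictionary, corpus_blocked={"was"}, doc_blocked={"cat"})
print(len(unfiltered), filtered)
```

Here "was" is treated as a name that is almost never a gene mention in any document, so it sits on the corpus-wide list, while "cat" is blocked only for this document, where it refers to an animal rather than a gene.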
Affiliation(s)
- Katerina Nastou
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
- Mikaela Koutrouli
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
- Sampo Pyysalo
  - TurkuNLP Group, Department of Computing, University of Turku, Turku, 20014, Finland
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
7
Tiemann JKS, Szczuka M, Bouarroudj L, Oussaren M, Garcia S, Howard RJ, Delemotte L, Lindahl E, Baaden M, Lindorff-Larsen K, Chavent M, Poulain P. MDverse, shedding light on the dark matter of molecular dynamics simulations. eLife 2024; 12:RP90061. [PMID: 39212001 PMCID: PMC11364437 DOI: 10.7554/elife.90061] [Indexed: 09/04/2024] Open
Abstract
The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations has led to the accumulation of MD files in generalist data repositories, constituting the dark matter of MD - data that is technically accessible, but neither indexed, curated, or easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. We identified systems with specific molecular composition and were able to characterize essential parameters of MD simulation such as temperature and simulation length, and could identify model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and to report and standardize metadata to reuse this valuable matter.
Affiliation(s)
- Johanna KS Tiemann
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Magdalena Szczuka
  - Institut de Pharmacologie et Biologie Structurale, CNRS, Université de Toulouse, Toulouse, France
- Lisa Bouarroudj
  - Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
- Rebecca J Howard
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Lucie Delemotte
  - Department of applied physics, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
- Erik Lindahl
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
  - Department of applied physics, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
- Marc Baaden
  - Laboratoire de Biochimie Théorique, CNRS, Université Paris Cité, Paris, France
- Kresten Lindorff-Larsen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Matthieu Chavent
  - Institut de Pharmacologie et Biologie Structurale, CNRS, Université de Toulouse, Toulouse, France
- Pierre Poulain
  - Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
8
Albashayreh A, Bandyopadhyay A, Zeinali N, Zhang M, Fan W, Gilbertson White S. Natural Language Processing Accurately Differentiates Cancer Symptom Information in Electronic Health Record Narratives. JCO Clin Cancer Inform 2024; 8:e2300235. [PMID: 39116379 DOI: 10.1200/cci.23.00235] [Received: 11/10/2023] [Revised: 04/29/2024] [Accepted: 05/30/2024] [Indexed: 08/10/2024] Open
Abstract
PURPOSE Identifying cancer symptoms in electronic health record (EHR) narratives is feasible with natural language processing (NLP). However, more efficient NLP systems are needed to detect various symptoms and distinguish observed symptoms from negated symptoms and medication-related side effects. We evaluated the accuracy of NLP in (1) detecting 14 symptom groups (ie, pain, fatigue, swelling, depressed mood, anxiety, nausea/vomiting, pruritus, headache, shortness of breath, constipation, numbness/tingling, decreased appetite, impaired memory, disturbed sleep) and (2) distinguishing observed symptoms in EHR narratives among patients with cancer. METHODS We extracted 902,508 notes for 11,784 unique patients diagnosed with cancer and developed a gold standard corpus of 1,112 notes labeled for presence or absence of 14 symptom groups. We trained an embeddings-augmented NLP system integrating human and machine intelligence and conventional machine learning algorithms. NLP metrics were calculated on a gold standard corpus subset for testing. RESULTS The interannotator agreement for labeling the gold standard corpus was excellent at 92%. The embeddings-augmented NLP model achieved the best performance (F1 score = 0.877). The highest NLP accuracy was observed in pruritus (F1 score = 0.937) while the lowest accuracy was in swelling (F1 score = 0.787). After classifying the entire data set with embeddings-augmented NLP, we found that 41% of the notes included symptom documentation. Pain was the most documented symptom (29% of all notes) while impaired memory was the least documented (0.7% of all notes). CONCLUSION We illustrated the feasibility of detecting 14 symptom groups in EHR narratives and showed that an embeddings-augmented NLP system outperforms conventional machine learning algorithms in detecting symptom information and differentiating observed symptoms from negated symptoms and medication-related side effects.
Affiliation(s)
- Min Zhang
  - School of Economics and Management, Communication University of China, Beijing, China
- Weiguo Fan
  - Tippie College of Business, University of Iowa, Iowa City, IA
9
Farrell MJ, Le Guillarme N, Brierley L, Hunter B, Scheepens D, Willoughby A, Yates A, Mideo N. The changing landscape of text mining: a review of approaches for ecology and evolution. Proc Biol Sci 2024; 291:20240423. [PMID: 39082244 PMCID: PMC11289731 DOI: 10.1098/rspb.2024.0423] [Received: 02/20/2024] [Revised: 06/20/2024] [Accepted: 06/20/2024] [Indexed: 08/02/2024] Open
Abstract
In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.
Affiliation(s)
- Maxwell J. Farrell
  - Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
  - School of Biodiversity, One Health & Veterinary Medicine, University of Glasgow, Glasgow, UK
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
- Nicolas Le Guillarme
  - Université Grenoble Alpes, CNRS, LECA, Laboratoire d'Ecologie Alpine, Grenoble, France
- Liam Brierley
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
  - Department of Health Data Science, University of Liverpool, Liverpool, UK
- Bronwen Hunter
  - School of Life Sciences, University of Sussex, Brighton, UK
- Daan Scheepens
  - Division of Biosciences, University College London, London, UK
- Andrew Yates
  - Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
- Nicole Mideo
  - Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
10
Iscoe M, Socrates V, Gilson A, Chi L, Li H, Huang T, Kearns T, Perkins R, Khandjian L, Taylor RA. Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models. Acad Emerg Med 2024; 31:599-610. [PMID: 38567658 DOI: 10.1111/acem.14883] [Received: 10/12/2023] [Revised: 01/24/2024] [Accepted: 01/24/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND Natural language processing (NLP) tools including recently developed large language models (LLMs) have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large-scale projects that rely on variables that are not typically recorded in a structured form, such as patient signs and symptoms. OBJECTIVES This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large-scale research, public health surveillance, and EHR-based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes. METHODS The study population consisted of patients aged ≥ 18 who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task-specific LLMs to perform the task of named entity recognition: a convolutional neural network-based model (SpaCy) and a transformer-based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level. RESULTS A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. Overall F1 measure for note-level symptom identification weighted by entity frequency was 0.84 for the SpaCy model and 0.88 for the Longformer model. F1 measure for identifying presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the SpaCy model and 0.98 (240/250 correctly classified) for the Longformer model. CONCLUSIONS The study demonstrated the utility of LLMs and transformer-based models in particular for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom on the note level, with variable performance for individual symptoms.
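An F1 measure "weighted by entity frequency" is commonly computed as a support-weighted mean of per-symptom F1 scores. The sketch below assumes that definition; the abstract does not spell out the exact formula, and the symptom names, scores, and counts are hypothetical.

```python
def weighted_f1(per_label):
    """Support-weighted mean F1.

    `per_label` maps a label to a (f1, support) pair, where support is the
    number of gold occurrences of that label.
    """
    total_support = sum(support for _, support in per_label.values())
    if total_support == 0:
        raise ValueError("at least one label must have non-zero support")
    return sum(f1 * support for f1, support in per_label.values()) / total_support


# Hypothetical per-symptom results (illustration only, not study data).
scores = {
    "dysuria": (0.90, 300),  # frequent symptom, high F1
    "fever": (0.80, 100),    # rarer symptom, lower F1
}

print(round(weighted_f1(scores), 3))
```

Because frequent symptoms dominate the weighted mean, a strong overall score can coexist with weak performance on rare symptoms, which is consistent with the abstract's note about variable performance for individual symptoms.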
Affiliation(s)
- Mark Iscoe
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
  - Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Vimig Socrates
  - Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
  - Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Aidan Gilson
  - Yale School of Medicine, New Haven, Connecticut, USA
- Ling Chi
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Huan Li
  - Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Thomas Huang
  - Yale School of Medicine, New Haven, Connecticut, USA
- Thomas Kearns
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Rachelle Perkins
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Laura Khandjian
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- R Andrew Taylor
  - Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
  - Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
11
Tiemann JKS, Szczuka M, Bouarroudj L, Oussaren M, Garcia S, Howard RJ, Delemotte L, Lindahl E, Baaden M, Lindorff-Larsen K, Chavent M, Poulain P. MDverse: Shedding Light on the Dark Matter of Molecular Dynamics Simulations. bioRxiv 2024:2023.05.02.538537. [PMID: 37205542 PMCID: PMC10187166 DOI: 10.1101/2023.05.02.538537] [Indexed: 05/21/2023]
Abstract
The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations have led to the accumulation of MD files in generalist data repositories, constituting the dark matter of MD: data that is technically accessible but neither indexed, curated, nor easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2,000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. We identified systems with specific molecular composition and were able to characterize essential parameters of MD simulation such as temperature and simulation length, and could identify model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and to report and standardize metadata to reuse this valuable matter.
|
12
|
Xie T, Wan Y, Wang H, Østrøm I, Wang S, He M, Deng R, Wu X, Grazian C, Kit C, Hoex B. Opinion Mining by Convolutional Neural Networks for Maximizing Discoverability of Nanomaterials. J Chem Inf Model 2024; 64:2746-2759. [PMID: 37982753 DOI: 10.1021/acs.jcim.3c00746] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 11/21/2023]
Abstract
The scientific literature contains valuable information that can be used for future applications, but manual analysis presents challenges due to its size and disciplinary boundaries. The prevailing solution involves natural language processing (NLP) techniques such as information retrieval. Nonetheless, existing automated systems primarily provide either statistically based shallow information or deep information without traceability, thereby falling short of delivering high-quality and reliable insights. To address this, we propose an innovative approach that leverages sentiment information embedded within the literature to track opinions toward materials. In this study, we integrated material knowledge into text representation and constructed opinion data sets to hierarchically train deep learning models, named Scientific Sentiment Network (SSNet). SSNet can effectively extract knowledge from the energy material literature and accurately categorize expert opinions into challenges and opportunities (94% and 92% accuracy, respectively). By incorporating sentiment features determined by SSNet, we can predict the ranking of emerging thermoelectric materials with a 70% correlation to experimental outcomes. Furthermore, our model achieves a commendable 68% accuracy in predicting suitable nanomaterials for atomic layer deposition (ALD) over time. These promising results offer a practical framework to extract and synthesize knowledge from the scientific literature, thereby accelerating research in the field of nanomaterials.
Affiliation(s)
- Tong Xie
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
- GreenDynamics Pty. Ltd., Kensington, NSW 2052, Australia
| | - Yuwei Wan
- Department of Linguistics and Translation, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong
- GreenDynamics Pty. Ltd., Kensington, NSW 2052, Australia
| | - Haoran Wang
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
| | - Ina Østrøm
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
| | - Shaozhou Wang
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
- GreenDynamics Pty. Ltd., Kensington, NSW 2052, Australia
| | - Mingrui He
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
| | - Rong Deng
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
| | - Xinyuan Wu
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
| | - Clara Grazian
- DARE ARC Training Centre in Data Analytics for Resources and Environments, South Eveleigh, NSW 2015, Australia
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW 2006, Australia
| | - Chunyu Kit
- Department of Linguistics and Translation, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong
| | - Bram Hoex
- School of Photovoltaic and Renewable Energy Engineering, University of New South Wales, Kensington, NSW 2052, Australia
| |
|
13
|
Emmert-Streib F. Can ChatGPT understand genetics? Eur J Hum Genet 2024; 32:371-372. [PMID: 37407734 PMCID: PMC10999414 DOI: 10.1038/s41431-023-01419-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Accepted: 06/19/2023] [Indexed: 07/07/2023]
Affiliation(s)
- Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland.
| |
|
14
|
Abstract
Smart healthcare has achieved significant progress in recent years. Emerging artificial intelligence (AI) technologies enable various smart applications across various healthcare scenarios. As an essential technology powered by AI, natural language processing (NLP) plays a key role in smart healthcare due to its capability of analysing and understanding human language. In this work, we review existing studies that concern NLP for smart healthcare from the perspectives of technique and application. We first elaborate on different NLP approaches and the NLP pipeline for smart healthcare from the technical point of view. Then, in the context of smart healthcare applications employing NLP techniques, we introduce representative smart healthcare scenarios, including clinical practice, hospital management, personal care, public health, and drug development. We further discuss two specific medical issues, i.e., the coronavirus disease 2019 (COVID-19) pandemic and mental health, in which NLP-driven smart healthcare plays an important role. Finally, we discuss the limitations of current works and identify the directions for future works.
|
15
|
Ngo DH, Koopman B. From Free-text Drug Labels to Structured Medication Terminology with BERT and GPT. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:540-549. [PMID: 38222391 PMCID: PMC10785872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024]
Abstract
We present a method to enrich controlled medication terminology from free-text drug labels. This is important because, while controlled medication terminology captures well-structured medication information, much of the information pertaining to medications is still found in free text. First, we compared different Named Entity Recognition (NER) models, including rule-based, feature-based, and deep learning-based models with Transformers, as well as ChatGPT and few-shot and fine-tuned GPT-3, to find the most suitable model that accurately extracts medication entities (ingredients, brand, dose, etc.) from free text. Then, a rule-based Relation Extraction algorithm transforms NER results into a well-structured medication knowledge graph. Finally, a Medication Searching method takes the knowledge graph and matches it to relevant medications in the terminology server. An empirical evaluation on real-world drug labels shows that BERT-CRF was the most effective NER model, with an F-measure of 95%. After term normalization, the Medication Searching method achieved an accuracy of 77% when matching a label to the relevant medication in the terminology server. The NER and Medication Searching models could be deployed as a web service capable of accepting free-text queries and returning structured medication information, thus providing a useful means of better managing medication information found in different health systems.
Affiliation(s)
- Duy-Hoa Ngo
- The Australian E-Health Research Centre, CSIRO, Australia
| | - Bevan Koopman
- The Australian E-Health Research Centre, CSIRO, Australia
| |
|
16
|
Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023; 18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023]
Abstract
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high-quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on the time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improves bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach a performance on par with training on all data) by 6% to 38%, depending on the data set, with Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
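As a rough illustration of the two uncertainty-based strategies named in this abstract, the sketch below ranks unlabelled examples by Least-Confident and Margin Sampling scores. It is not the authors' code (that lives in the repository linked above); the function names and the toy probability pool are assumptions made only for this example.

```python
from typing import List, Sequence


def least_confident_score(probs: Sequence[float]) -> float:
    """Uncertainty as 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin_score(probs: Sequence[float]) -> float:
    """Gap between the two most likely classes; a small gap means high uncertainty."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def select_batch(pool_probs: List[Sequence[float]], k: int, strategy: str) -> List[int]:
    """Return indices of the k pool examples the model is least sure about."""
    indices = range(len(pool_probs))
    if strategy == "least_confident":
        # Highest uncertainty first.
        ranked = sorted(indices, key=lambda i: least_confident_score(pool_probs[i]), reverse=True)
    elif strategy == "margin":
        # Smallest margin first.
        ranked = sorted(indices, key=lambda i: margin_score(pool_probs[i]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]


# Hypothetical class probabilities predicted for four unlabelled examples.
pool = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.70, 0.20, 0.10],
]

print(select_batch(pool, 1, "least_confident"))
print(select_batch(pool, 1, "margin"))
```

On this toy pool the two strategies pick different examples: Least-Confident favours the example whose top probability is lowest, while Margin Sampling favours the one whose top two classes are nearly tied.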
Affiliation(s)
- Charlotte Nachtegael
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
| | - Jacopo De Stefani
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Technology, Policy and Management Faculty, Technische Universiteit Delft, Delft, Netherlands
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Bruxelles, Belgium
| |
|
17
|
Lera-Ramírez M, Bähler J, Mata J, Rutherford K, Hoffman CS, Lambert S, Oliferenko S, Martin SG, Gould KL, Du LL, Sabatinos SA, Forsburg SL, Nielsen O, Nurse P, Wood V. Revised fission yeast gene and allele nomenclature guidelines for machine readability. Genetics 2023; 225:iyad143. [PMID: 37758508 PMCID: PMC10627252 DOI: 10.1093/genetics/iyad143] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/16/2023] [Accepted: 07/24/2023] [Indexed: 09/30/2023]
Abstract
Standardized nomenclature for genes, gene products, and isoforms is crucial to prevent ambiguity and enable clear communication of scientific data, facilitating efficient biocuration and data sharing. Standardized genotype nomenclature, which describes alleles present in a specific strain that differ from those in the wild-type reference strain, is equally essential to maximize research impact and ensure that results linking genotypes to phenotypes are Findable, Accessible, Interoperable, and Reusable (FAIR). In this publication, we extend the fission yeast clade gene nomenclature guidelines to support the curation efforts at PomBase (www.pombase.org), the Schizosaccharomyces pombe Model Organism Database. This update introduces nomenclature guidelines for noncoding RNA genes, following those set forth by the Human Genome Organisation Gene Nomenclature Committee. Additionally, we provide a significant update to the allele and genotype nomenclature guidelines originally published in 1987, to standardize the diverse range of genetic modifications enabled by the fission yeast genetic toolbox. These updated guidelines reflect a community consensus between numerous fission yeast researchers. Adoption of these rules will improve consistency in gene and genotype nomenclature, and facilitate machine-readability and automated entity recognition of fission yeast genes and alleles in publications or datasets. In conclusion, our updated guidelines provide a valuable resource for the fission yeast research community, promoting consistency, clarity, and FAIRness in genetic data sharing and interpretation.
Affiliation(s)
- Manuel Lera-Ramírez
- University College London, Department of Genetics Evolution and Environment, Darwin Building, 99-105 Gower Street, London WC1E 6BT, UK
| | - Jürg Bähler
- University College London, Department of Genetics Evolution and Environment, Darwin Building, 99-105 Gower Street, London WC1E 6BT, UK
| | - Juan Mata
- University of Cambridge, Department of Biochemistry, Cambridge CB2 1GA, UK
| | - Kim Rutherford
- University of Cambridge, Department of Biochemistry, Cambridge CB2 1GA, UK
| | | | - Sarah Lambert
- Institut Curie, Université Paris-Saclay, CNRS UMR3348, Orsay 91400, France
| | - Snezhana Oliferenko
- The Francis Crick Institute, London NW1 1AT, UK
- Randall Centre for Cell and Molecular Biophysics, School of Basic and Medical Biosciences, King’s College London, London SE1 1UL, UK
| | - Sophie G Martin
- University of Geneva, Department of Molecular and Cellular Biology, Geneva 1211, Switzerland
| | - Kathleen L Gould
- Vanderbilt University School of Medicine, Department of Cell and Developmental Biology, Nashville, TN 37232, USA
| | - Li-Lin Du
- National Institute of Biological Sciences, Beijing 102206, China
| | - Sarah A Sabatinos
- Toronto Metropolitan University, Department of Chemistry & Biology, Toronto M5B 2K3, Canada
| | - Susan L Forsburg
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | - Olaf Nielsen
- Department of Biology, Cell cycle and genome stability Group, University of Copenhagen, Copenhagen N DK2100, Denmark
| | - Paul Nurse
- The Francis Crick Institute, London NW1 1AT, UK
| | - Valerie Wood
- University of Cambridge, Department of Biochemistry, Cambridge CB2 1GA, UK
| |
|
18
|
Sun H, Song Z, Chen Q, Wang M, Tang F, Dou L, Zou Q, Yang F. MMiKG: a knowledge graph-based platform for path mining of microbiota-mental diseases interactions. Brief Bioinform 2023; 24:bbad340. [PMID: 37779250 DOI: 10.1093/bib/bbad340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 08/21/2023] [Accepted: 09/12/2023] [Indexed: 10/03/2023]
Abstract
The microbiota-gut-brain axis denotes a two-way system of interactions between the gut and the brain, comprising three key components: (1) gut microbiota, (2) intermediates and (3) mental ailments. These constituents communicate with one another to induce changes in the host's mood, cognition and demeanor. Knowledge concerning the regulation of the host central nervous system by gut microbiota is fragmented and mostly confined to disorganized or semi-structured unrestricted texts. Such a format hinders the exploration and comprehension of unknown territories or the further advancement of artificial intelligence systems. Hence, we collated crucial information by scrutinizing an extensive body of literature, amalgamated the extant knowledge of the microbiota-gut-brain axis and depicted it in the form of a knowledge graph named MMiKG, which can be visualized on the GraphXR platform and in the Neo4j database. By merging various associated resources and deducing prospective connections between gut microbiota and the central nervous system through MMiKG, users can acquire a more comprehensive perception of the pathogenesis of mental disorders and generate novel insights for advancing therapeutic measures. As a free and open-source platform, MMiKG can be accessed at http://yangbiolab.cn:8501/ with no login requirement.
Affiliation(s)
- Haoran Sun
- School of Medical Imaging, Fujian Medical University, Fuzhou 350122, China
| | - Zhaoqi Song
- Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
| | - Qiuming Chen
- School of Medical Imaging, Fujian Medical University, Fuzhou 350122, China
| | - Meiling Wang
- Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
| | - Furong Tang
- Department of Basic Medical Sciences, School of Medicine, Tsinghua University, Beijing 100084, China
| | - Lijun Dou
- Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, USA
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fenglong Yang
- Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
- Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350122, China
| |
|
19
|
Vaškevičius M, Kapočiūtė-Dzikienė J, Vaškevičius A, Šlepikas L. Deep learning-based automatic action extraction from structured chemical synthesis procedures. PeerJ Comput Sci 2023; 9:e1511. [PMID: 37705639 PMCID: PMC10495970 DOI: 10.7717/peerj-cs.1511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 07/07/2023] [Indexed: 09/15/2023]
Abstract
This article proposes a methodology that uses machine learning algorithms to extract actions from structured chemical synthesis procedures, thereby bridging the gap between chemistry and natural language processing. The proposed pipeline combines ML algorithms and scripts to extract relevant data from USPTO and EPO patents, which helps transform experimental procedures into structured actions. This pipeline includes two primary tasks: classifying patent paragraphs to select chemical procedures and converting chemical procedure sentences into a structured, simplified format. We employ artificial neural networks such as long short-term memory, bidirectional LSTMs, transformers, and fine-tuned T5. Our results show that the bidirectional LSTM classifier achieved the highest accuracy of 0.939 in the first task, while the Transformer model attained the highest BLEU score of 0.951 in the second task. The developed pipeline enables the creation of a dataset of chemical reactions and their procedures in a structured format, facilitating the application of AI-based approaches to streamline synthetic pathways, predict reaction outcomes, and optimize experimental conditions. Furthermore, the developed pipeline allows for creating a structured dataset of chemical reactions and procedures, making it easier for researchers to access and utilize the valuable information in synthesis procedures.
Affiliation(s)
- Mantas Vaškevičius
- Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania
- JSC Synhet, Kaunas, Lithuania
| | | | - Arnas Vaškevičius
- Faculty of Mechanical Engineering and Design, Kaunas University of Technology, Kaunas, Lithuania
| | | |
|
20
|
Raza S, Schwartz B, Lakamana S, Ge Y, Sarker A. A framework for multi-faceted content analysis of social media chatter regarding non-medical use of prescription medications. BMC DIGITAL HEALTH 2023; 1:29. [PMID: 37680768 PMCID: PMC10483682 DOI: 10.1186/s44247-023-00029-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Accepted: 07/17/2023] [Indexed: 09/09/2023]
Abstract
Background: Substance use, including the non-medical use of prescription medications, is a global health problem resulting in hundreds of thousands of overdose deaths and other health problems. Social media has emerged as a potent source of information for studying substance use-related behaviours and their consequences. Mining large-scale social media data on the topic requires the development of natural language processing (NLP) and machine learning frameworks customized for this problem. Our objective in this research is to develop a framework for conducting a content analysis of Twitter chatter about the non-medical use of a set of prescription medications. Methods: We collected Twitter data for four medications: fentanyl and morphine (opioids), alprazolam (benzodiazepine), and Adderall® (stimulant). We identified posts that indicated non-medical use using an automatic machine learning classifier. In our NLP framework, we applied supervised named entity recognition (NER) to identify other substances mentioned, symptoms, and adverse events. We applied unsupervised topic modelling to identify latent topics associated with the chatter for each medication. Results: The quantitative analysis demonstrated the performance of the proposed NER approach in identifying substance-related entities from data with a high degree of accuracy compared to the baseline methods. The performance evaluation of the topic modelling was also notable. The qualitative analysis revealed knowledge about the use, non-medical use, and side effects of these medications in individuals and communities. Conclusions: NLP-based analyses of Twitter chatter associated with prescription medications belonging to different categories provide multi-faceted insights about their use and consequences. Our developed framework can be applied to chatter about other substances. Further research can validate the predictive value of this information on the prevention, assessment, and management of these disorders.
Affiliation(s)
- Shaina Raza
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Vector Institute for Artificial Intelligence, Toronto, ON, Canada
| | - Brian Schwartz
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
| | - Sahithi Lakamana
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
| | - Yao Ge
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
| | - Abeed Sarker
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, USA
| |
|
21
|
Liang T, Xia C, Zhao Z, Jiang Y, Yin Y, Yu PS. Transferring From Textual Entailment to Biomedical Named Entity Recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2577-2586. [PMID: 37018664 DOI: 10.1109/tcbb.2023.3236477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Biomedical Named Entity Recognition (BioNER) aims at identifying biomedical entities such as genes, proteins, diseases, and chemical compounds in the given textual data. However, due to issues of ethics, privacy, and the high specialization of biomedical data, BioNER suffers from a more severe lack of quality labeled data than the general domain, especially at the token level. Facing extremely limited labeled biomedical data, this work studies the problem of gazetteer-based BioNER, which aims at building a BioNER system from scratch. The system needs to identify the entities in the given sentences with zero token-level annotations available for training. Previous works usually use sequential labeling models to solve the NER or BioNER task and obtain weakly labeled data from gazetteers when full annotations are unavailable. However, these labeled data are quite noisy, since labels are needed for each token and the entity coverage of the gazetteers is limited. Here we propose to formulate the BioNER task as a Textual Entailment problem and solve the task via Textual Entailment with Dynamic Contrastive learning (TEDC). TEDC not only alleviates the noisy labeling issue, but also transfers the knowledge from pre-trained textual entailment models. Additionally, the dynamic contrastive learning framework contrasts the entities and non-entities in the same sentence and improves the model's discrimination ability. Experiments on two real-world biomedical datasets show that TEDC can achieve state-of-the-art performance for gazetteer-based BioNER.
|
22
|
Cutforth M, Watson H, Brown C, Wang C, Thomson S, Fell D, Dilys V, Scrimgeour M, Schrempf P, Lesh J, Muir K, Weir A, O’Neil AQ. Acute stroke CDS: automatic retrieval of thrombolysis contraindications from unstructured clinical letters. Front Digit Health 2023; 5:1186516. [PMID: 37388253 PMCID: PMC10305776 DOI: 10.3389/fdgth.2023.1186516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Accepted: 05/15/2023] [Indexed: 07/01/2023]
Abstract
Introduction Thrombolysis treatment for acute ischaemic stroke can lead to better outcomes if administered early enough. However, contraindications exist which put the patient at greater risk of a bleed (e.g. recent major surgery, anticoagulant medication). Therefore, clinicians must check a patient's past medical history before proceeding with treatment. In this work we present a machine learning approach for accurate automatic detection of this information in unstructured text documents such as discharge letters or referral letters, to support the clinician in making a decision about whether to administer thrombolysis. Methods We consulted local and national guidelines for thrombolysis eligibility, identifying 86 entities which are relevant to the thrombolysis decision. A total of 8,067 documents from 2,912 patients were manually annotated with these entities by medical students and clinicians. Using this data, we trained and validated several transformer-based named entity recognition (NER) models, focusing on transformer models which have been pre-trained on a biomedical corpus as these have shown most promise in the biomedical NER literature. Results Our best model was a PubMedBERT-based approach, which obtained a lenient micro/macro F1 score of 0.829/0.723. Ensembling 5 variants of this model gave a significant boost to precision, obtaining micro/macro F1 of 0.846/0.734 which approaches the human annotator performance of 0.847/0.839. We further propose numeric definitions for the concepts of name regularity (similarity of all spans which refer to an entity) and context regularity (similarity of all context surrounding mentions of an entity), using these to analyse the types of errors made by the system and finding that the name regularity of an entity is a stronger predictor of model performance than raw training set frequency. 
Discussion Overall, this work shows the potential of machine learning to provide clinical decision support (CDS) for the time-critical decision of thrombolysis administration in ischaemic stroke by quickly surfacing relevant information, leading to prompt treatment and hence to better patient outcomes.
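The abstract above reports paired micro/macro F1 scores. The sketch below shows, under simplifying assumptions, why the two can diverge when some entity types are rare: micro F1 pools counts across all entity types, while macro F1 averages per-type scores. The entity names and counts are invented for illustration and are not taken from the study, and the sketch does not model the lenient span matching the authors used.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_entity: Dict[str, Counts]) -> float:
    """Pool counts over all entity types, then compute a single F1."""
    tp = sum(c[0] for c in per_entity.values())
    fp = sum(c[1] for c in per_entity.values())
    fn = sum(c[2] for c in per_entity.values())
    return f1(tp, fp, fn)


def macro_f1(per_entity: Dict[str, Counts]) -> float:
    """Compute F1 per entity type, then take the unweighted mean."""
    scores = [f1(*c) for c in per_entity.values()]
    return sum(scores) / len(scores)


# Hypothetical counts: one frequent entity type and one rare one.
counts = {
    "anticoagulant_medication": (90, 5, 5),
    "recent_major_surgery": (2, 1, 5),
}

print(round(micro_f1(counts), 3))
print(round(macro_f1(counts), 3))
```

Because the frequent type dominates the pooled counts, micro F1 stays high here while macro F1 is pulled down by the rare type, which mirrors the gap between the micro and macro figures reported in the abstract.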
Affiliation(s)
| | - Hannah Watson
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Cameron Brown
- Institute of Neuroscience & Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Chaoyang Wang
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Stuart Thomson
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Dickon Fell
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | | | | | | | - James Lesh
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Keith Muir
- Institute of Neuroscience & Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Alexander Weir
- Canon Medical Research Europe, Edinburgh, United Kingdom
| | - Alison Q O’Neil
- Canon Medical Research Europe, Edinburgh, United Kingdom
- School of Engineering, University of Edinburgh, Edinburgh, United Kingdom
| |
|
23
|
Jeong M, Kang J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023; 39:btad361. [PMID: 37261870 PMCID: PMC10272703 DOI: 10.1093/bioinformatics/btad361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 04/17/2023] [Accepted: 05/31/2023] [Indexed: 06/02/2023]
Abstract
SUMMARY Biomedical named entity recognition (NER) plays a crucial role in extracting information from documents in biomedical applications. However, many of these applications require NER models to operate at a document level, rather than just a sentence level. This presents a challenge, as the extension from a sentence model to a document model is not always straightforward. Despite the existence of document NER models that are able to make consistent predictions, they still fall short of meeting the expectations of researchers and practitioners in the field. To address this issue, we have undertaken an investigation into the underlying causes of inconsistent predictions. Our research has led us to believe that the use of adjectives and prepositions within entities may be contributing to low label consistency. In this article, we present our method, ConNER, to enhance a label consistency of modifiers such as adjectives and prepositions. By refining the labels of these modifiers, ConNER is able to improve representations of biomedical entities. The effectiveness of our method is demonstrated on four popular biomedical NER datasets. On three datasets, we achieve a higher F1 score than the previous state-of-the-art model. Our method shows its efficacy on two datasets, resulting in 7.5%-8.6% absolute improvements in the F1 score. Our findings suggest that our ConNER method is effective on datasets with intrinsically low label consistency. Through qualitative analysis, we demonstrate how our approach helps the NER model generate more consistent predictions. AVAILABILITY AND IMPLEMENTATION Our code and resources are available at https://github.com/dmis-lab/ConNER/.
Affiliation(s)
- Minbyul Jeong
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
- AIGEN Sciences, Seoul 04778, Republic of Korea
24
Raza S, Schwartz B. Constructing a disease database and using natural language processing to capture and standardize free text clinical information. Sci Rep 2023; 13:8591. [PMID: 37237101 DOI: 10.1038/s41598-023-35482-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Accepted: 05/18/2023] [Indexed: 05/28/2023] Open
Abstract
The ability to extract critical information about an infectious disease in a timely manner is critical for population health research. The lack of procedures for mining large amounts of health data is a major impediment. The goal of this research is to use natural language processing (NLP) to extract key information (clinical factors, social determinants of health) from free text. The proposed framework describes database construction, NLP modules for locating clinical and non-clinical (social determinants) information, and a detailed evaluation protocol for evaluating results and demonstrating the effectiveness of the proposed framework. The use of COVID-19 case reports is demonstrated for data construction and pandemic surveillance. The proposed approach outperforms benchmark methods in F1-score by about 1-3%. A thorough examination reveals the disease's presence as well as the frequency of symptoms in patients. The findings suggest that prior knowledge gained through transfer learning can be useful when researching infectious diseases with similar presentations in order to accurately predict patient outcomes.
Affiliation(s)
- Shaina Raza
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Brian Schwartz
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
25
Moezzi SAR, Ghaedi A, Rahmanian M, Mousavi SZ, Sami A. Application of Deep Learning in Generating Structured Radiology Reports: A Transformer-Based Technique. J Digit Imaging 2023; 36:80-90. [PMID: 36002778 PMCID: PMC9984654 DOI: 10.1007/s10278-022-00692-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2021] [Revised: 06/20/2022] [Accepted: 07/27/2022] [Indexed: 11/29/2022] Open
Abstract
Since radiology reports needed for clinical practice and research are written and stored as free-text narratives, extracting relevant information for further analysis is difficult. In these circumstances, natural language processing (NLP) techniques can facilitate automatic information extraction and the transformation of free text into structured data. In recent years, deep learning (DL)-based models have been adapted for NLP experiments with promising results. Despite the significant potential of DL models based on artificial neural networks (ANN) and convolutional neural networks (CNN), these models face limitations when implemented in clinical practice. Transformers, a newer DL architecture, have been increasingly applied to improve the process. Therefore, in this study, we propose a transformer-based fine-grained named entity recognition (NER) architecture for clinical information extraction. We collected 88 abdominopelvic sonography reports in free-text format and annotated them based on our developed information schema. The text-to-text transfer transformer model (T5) and SciFive, a pre-trained domain-specific adaptation of the T5 model, were fine-tuned to extract entities and relations and transform the input into a structured format. Our transformer-based model outperformed previously applied approaches such as ANN and CNN models, with ROUGE-1, ROUGE-2, ROUGE-L, and BLEU scores of 0.816, 0.668, 0.528, and 0.743, respectively, while providing an interpretable structured report.
Affiliation(s)
- Seyed Ali Reza Moezzi
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
- Abdolrahman Ghaedi
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
- Mojdeh Rahmanian
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
- Ashkan Sami
- Department of Computer Science and Engineering and IT, Shiraz University, Shiraz, Iran
26
Yew ANJ, Schraagen M, Otte WM, van Diessen E. Transforming epilepsy research: A systematic review on natural language processing applications. Epilepsia 2023; 64:292-305. [PMID: 36462150 PMCID: PMC10108221 DOI: 10.1111/epi.17474] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 07/11/2022] [Revised: 11/23/2022] [Accepted: 12/01/2022] [Indexed: 12/05/2022]
Abstract
Despite improved ancillary investigations in epilepsy care, patients' narratives remain indispensable for diagnosis and treatment monitoring. This wealth of information is typically stored in electronic health records and accumulated in medical journals in an unstructured manner, thereby restricting complete utilization in clinical decision-making. To this end, clinical researchers increasingly apply natural language processing (NLP), a branch of artificial intelligence, as it removes ambiguity, derives context, and imbues standardized meaning from free-narrative clinical texts. This systematic review presents an overview of the current NLP applications in epilepsy and discusses the opportunities and drawbacks of NLP alongside its future implications. We searched the PubMed and Embase databases with a "natural language processing" and "epilepsy" query (March 4, 2022) and included original research articles describing the application of NLP techniques for textual analysis in epilepsy. Twenty-six studies were included. Fifty-eight percent of these studies used NLP to classify clinical records into predefined categories, improving patient identification and treatment decisions. Other applications of NLP included structured clinical information retrieval from electronic health records, scientific papers, and online posts of patients. Challenges and opportunities of NLP applications for enhancing epilepsy care and research are discussed. The field could further benefit from NLP by replicating successes in other health care domains, such as NLP-aided quality evaluation for clinical decision-making, outcome prediction, and clinical record summarization.
Affiliation(s)
- Arister N J Yew
- University College Utrecht, Utrecht University, Utrecht, The Netherlands
- Marijn Schraagen
- Department of Information and Computing Sciences, Faculty of Science, Utrecht University, Utrecht, The Netherlands
- Willem M Otte
- Department of Child Neurology, Brain Center, University Medical Center Utrecht and Utrecht University, Utrecht, The Netherlands
- Eric van Diessen
- Department of Child Neurology, Brain Center, University Medical Center Utrecht and Utrecht University, Utrecht, The Netherlands
27
Raza S, Schwartz B. Entity and relation extraction from clinical case reports of COVID-19: a natural language processing approach. BMC Med Inform Decis Mak 2023; 23:20. [PMID: 36703154 PMCID: PMC9879259 DOI: 10.1186/s12911-023-02117-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/25/2022] [Accepted: 01/20/2023] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND: Extracting relevant information about infectious diseases is an essential task. However, a significant obstacle in supporting public health research is the lack of methods for effectively mining large amounts of health data. OBJECTIVE: This study aims to use natural language processing (NLP) to extract key information (clinical factors, social determinants of health) from published cases in the literature. METHODS: The proposed framework integrates a data layer for preparing a data cohort from clinical case reports; an NLP layer to find the clinical and demographic named entities and relations in the texts; and an evaluation layer for benchmarking performance and analysis. The focus of this study is to extract valuable information from COVID-19 case reports. RESULTS: The named entity recognition implementation in the NLP layer achieves a performance gain of about 1-3% compared to benchmark methods. Furthermore, even without extensive data labeling, the relation extraction method outperforms benchmark methods in accuracy by 1-8%. A thorough examination reveals the disease's presence and symptom prevalence in patients. CONCLUSIONS: A similar approach can be generalized to other infectious diseases. It is worthwhile to use prior knowledge acquired through transfer learning when researching other infectious diseases.
Affiliation(s)
- Shaina Raza
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Brian Schwartz
- Public Health Ontario (PHO), Toronto, ON, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
28
Cenikj G, Valenčič E, Ispirova G, Ogrinc M, Stojanov R, Korošec P, Cavalli E, Seljak BK, Eftimov T. CafeteriaSA corpus: scientific abstracts annotated across different food semantic resources. Database (Oxford) 2022; 2022:6918707. [PMID: 36526439 PMCID: PMC9757992 DOI: 10.1093/database/baac107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Revised: 10/30/2022] [Accepted: 11/23/2022] [Indexed: 12/23/2022]
Abstract
In the last decades, a great amount of work has been done in predictive modeling of issues related to human and environmental health. Resolution of issues related to healthcare is made possible by the existence of several biomedical vocabularies and standards, which play a crucial role in understanding the health information, together with a large amount of health-related data. However, despite a large number of available resources and work done in the health and environmental domains, there is a lack of semantic resources that can be utilized in the food and nutrition domain, as well as their interconnections. For this purpose, in a European Food Safety Authority-funded project CAFETERIA, we have developed the first annotated corpus of 500 scientific abstracts that consists of 6407 annotated food entities with regard to Hansard taxonomy, 4299 for FoodOn and 3623 for SNOMED-CT. The CafeteriaSA corpus will enable the further development of natural language processing methods for food information extraction from textual data that will allow extracting food information from scientific textual data. Database URL: https://zenodo.org/record/6683798#.Y49wIezMJJF.
Affiliation(s)
- Eva Valenčič
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- School of Health Sciences, College of Health, Medicine and Wellbeing, University of Newcastle, University Drive, Callaghan Campus, Newcastle, NSW 2308, Australia
- Food and Nutrition Program, Hunter Medical Research Institute, Lot 1 Kookaburra Circuit, New Lambton Heights, Newcastle, NSW 2305, Australia
- Gordana Ispirova
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- Matevž Ogrinc
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- Riste Stojanov
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Ruger Boshkovikj 16, Skopje 1000, North Macedonia
- Peter Korošec
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Ermanno Cavalli
- European Food Safety Authority, Via Carlo Magno 1A, Parma 43126, Italy
- Barbara Koroušić Seljak
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
- Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana 1000, Slovenia
- Tome Eftimov
- Department of Computer Systems, Jožef Stefan Institute, Jamova cesta 39, Ljubljana 1000, Slovenia
29
Bashir SR, Raza S, Kocaman V, Qamar U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses 2022; 14:2761. [PMID: 36560764 PMCID: PMC9781729 DOI: 10.3390/v14122761] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/28/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022] Open
Abstract
The clinical application of detecting COVID-19 factors is a challenging task. Existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical factors, non-clinical factors such as social determinants of health (SDoH) are also important for studying infectious disease. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in its combination of several deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. Compared to other methods, the proposed approach achieves a performance gain of about 1-5% in macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.
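Because this abstract reports gains in both macro- and micro-average F1, a short sketch of how the two averages differ may help. It is a generic illustration, not the authors' evaluation code; the entity types and the true-positive, false-positive, and false-negative counts are invented for the example.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: dict mapping entity type -> (tp, fp, fn).

    Micro-average pools the counts over all types before computing F1,
    so frequent types dominate. Macro-average computes F1 per type and
    then takes the unweighted mean, so rare types weigh as much as common ones.
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    return micro, macro


# Hypothetical counts: a frequent clinical type and a rare SDoH type.
counts = {"Symptom": (90, 10, 10), "Occupation": (1, 1, 3)}
micro, macro = micro_macro_f1(counts)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

With these counts the micro-average stays close to the score of the frequent type, while the macro-average is pulled down by the rare type, which is why papers covering many sparse entity types tend to report both.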
Affiliation(s)
- Syed Raza Bashir
- Department of Computer Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
- Shaina Raza
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
- Urooj Qamar
- Institute of Business & Information Technology, University of the Punjab, Lahore 54590, Pakistan
30
Raza S, Reji DJ, Shajan F, Bashir SR. Large-scale application of named entity recognition to biomedicine and epidemiology. PLOS Digit Health 2022; 1:e0000152. [PMID: 36812589 PMCID: PMC9931203 DOI: 10.1371/journal.pdig.0000152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2022] [Accepted: 11/01/2022] [Indexed: 12/13/2022]
Abstract
BACKGROUND: Despite significant advancements in biomedical named entity recognition methods, the clinical application of these systems continues to face many challenges: (1) most of the methods are trained on a limited set of clinical entities; (2) these methods are heavily reliant on a large amount of data for both pre-training and prediction, making their use in production impractical; (3) they do not consider non-clinical entities that are also related to a patient's health, such as social, economic or demographic factors. METHODS: In this paper, we develop Bio-Epidemiology-NER (https://pypi.org/project/Bio-Epidemiology-NER/), an open-source Python package for detecting biomedical named entities in text. This approach is based on a Transformer-based system and trained on a dataset that is annotated with many named entities (medical, clinical, biomedical, and epidemiological). This approach improves on previous efforts in three ways: (1) it recognizes many clinical entity types, such as medical risk factors, vital signs, drugs, and biological functions; (2) it is easily configurable, reusable, and can scale up for training and inference; (3) it also considers non-clinical factors (such as age, gender, race, and social history) that influence health outcomes. At a high level, it consists of four phases: pre-processing, data parsing, named entity recognition, and named entity enhancement. RESULTS: Experimental results show that our pipeline outperforms other methods on three benchmark datasets, with macro- and micro-average F1 scores around 90 percent and above. CONCLUSION: This package is made publicly available for researchers, doctors, clinicians, and anyone to extract biomedical named entities from unstructured biomedical texts.
Affiliation(s)
- Shaina Raza
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Femi Shajan
- Environmental Resources Management, Bangalore, India
- Syed Raza Bashir
- Toronto Metropolitan University, Toronto, Ontario, Canada
31
Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022; 23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND: Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, this method often underutilizes syntactic features such as dependencies and topology of sentences. Integrating semantic and syntactic features into the BioNER model is therefore an urgent problem. RESULTS: In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation can introduce more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we use periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy. A graph attention network is then used to generate a fused representation considering both the contextual features and syntactic features. Last, a softmax function is used to calculate the probabilities and get the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, and achieves F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, 90.99%, respectively. CONCLUSION: The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task as a node classification problem and combining syntactic features into the graph attention networks can significantly improve model performance.
Affiliation(s)
- Xiangwen Zheng
- Academy of Military Medical Sciences, Beijing, 100039, China
- Haijian Du
- Academy of Military Medical Sciences, Beijing, 100039, China
- Xiaowei Luo
- Academy of Military Medical Sciences, Beijing, 100039, China
- Fan Tong
- Academy of Military Medical Sciences, Beijing, 100039, China
- Wei Song
- Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China
- Dongsheng Zhao
- Academy of Military Medical Sciences, Beijing, 100039, China
32
Su Y, Wang M, Wang P, Zheng C, Liu Y, Zeng X. Deep learning joint models for extracting entities and relations in biomedical: a survey and comparison. Brief Bioinform 2022; 23:6686739. [PMID: 36125190 DOI: 10.1093/bib/bbac342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 07/20/2022] [Accepted: 07/25/2022] [Indexed: 12/14/2022] Open
Abstract
The rapid development of biomedicine has produced a large number of biomedical written materials. These unstructured text data create serious challenges for biomedical researchers trying to find information. Biomedical named entity recognition (BioNER) and biomedical relation extraction (BioRE) are the two most fundamental tasks of biomedical text mining. Accurately and efficiently identifying entities and extracting relations have become very important. Methods that perform the two tasks separately are called pipeline models, and they have shortcomings such as insufficient interaction, low extraction quality and a tendency toward redundancy. To overcome these shortcomings, many deep learning-based joint named entity recognition and relation extraction models have been proposed, and they have achieved advanced performance. This paper comprehensively summarizes deep learning models for joint named entity recognition and relation extraction in biomedicine. The joint BioNER and BioRE models are discussed in the light of the challenges existing in the BioNER and BioRE tasks. Five joint BioNER and BioRE models and one pipeline model are selected for comparative experiments on four biomedical public datasets, and the experimental results are analyzed. Finally, we discuss the opportunities for future development of deep learning-based joint BioNER and BioRE models.
Affiliation(s)
- Yansen Su
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Minglu Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Pengpeng Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Chunhou Zheng
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
- Yuansheng Liu
- College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Xiangxiang Zeng
- College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
33
Wang SY, Huang J, Hwang H, Hu W, Tao S, Hernandez-Boussard T. Leveraging weak supervision to perform named entity recognition in electronic health records progress notes to identify the ophthalmology exam. Int J Med Inform 2022; 167:104864. [PMID: 36179600 PMCID: PMC9901505 DOI: 10.1016/j.ijmedinf.2022.104864] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/16/2022] [Revised: 08/11/2022] [Accepted: 09/05/2022] [Indexed: 02/08/2023]
Abstract
OBJECTIVE: To develop deep learning models to recognize ophthalmic examination components from clinical notes in electronic health records (EHR) using a weak supervision approach. METHODS: A corpus of 39,099 ophthalmology notes weakly labeled for 24 examination entities was assembled from the EHR of one academic center. Four pre-trained transformer-based language models (DistilBert, BioBert, BlueBert, and ClinicalBert) were fine-tuned to this named entity recognition task and compared to a baseline regular expression model. Models were evaluated on the weakly labeled test dataset, a human-labeled sample of that set, and a human-labeled independent dataset. RESULTS: On the weakly labeled test set, all transformer-based models had recall > 0.93, with precision varying from 0.815 to 0.843. The baseline model had lower recall (0.769) and precision (0.682). On the human-annotated sample, the baseline model had high recall (0.962, 95 % CI 0.955-0.067) with variable precision across entities (0.081-0.999). Bert models had recall ranging from 0.771 to 0.831, and precision >=0.973. On the independent dataset, precision was 0.926 and recall 0.458 for BlueBert. The baseline model had better recall (0.708, 95 % CI 0.674-0.738) but worse precision (0.399, 95 % CI -0.352-0.451). CONCLUSION: We developed the first deep learning system to recognize eye examination components from clinical notes, leveraging a novel opportunity for weak supervision. Transformer-based models had high precision on human-annotated labels, whereas the baseline model had poor precision but higher recall. This system may be used to improve cohort and feature identification using free-text notes. Our weakly supervised approach may help amass large datasets of domain-specific entities from EHRs in many fields.
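The regular-expression baseline and the precision and recall figures in this abstract can be illustrated with a minimal sketch. The two patterns, the entity names, the sample note, and the gold spans below are all hypothetical; the abstract does not give the study's 24 entities or its actual expressions, so this only shows the general shape of rule-based weak labeling scored by exact span match.

```python
import re

# Hypothetical patterns for two exam entities (assumptions, not the study's rules).
PATTERNS = {
    "visual_acuity": re.compile(r"\b20/\d{2,3}\b"),
    "intraocular_pressure": re.compile(r"\bIOP\s*:?\s*(\d{1,2})\b", re.IGNORECASE),
}


def weak_label(note):
    """Return the set of (entity_type, start, end) spans matched by the patterns."""
    spans = set()
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(note):
            spans.add((name, match.start(), match.end()))
    return spans


def precision_recall(predicted, gold):
    """Exact-match precision and recall over two sets of spans."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


note = "VA 20/40 OD, 20/25 OS. IOP: 15"
predicted = weak_label(note)

# Hypothetical human annotation that marks only the pressure value, not "IOP: 15".
gold = {
    ("visual_acuity", 3, 8),
    ("visual_acuity", 13, 18),
    ("intraocular_pressure", 28, 30),
}
precision, recall = precision_recall(predicted, gold)
print(sorted(predicted), precision, recall)
```

In this toy case the pressure span from the pattern covers the label text as well as the value, so it does not match the human span exactly and both precision and recall drop. Boundary disagreements of this kind are one reason rule-based labels and human labels can diverge.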
Affiliation(s)
- Sophia Y Wang
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
- Justin Huang
- Johns Hopkins School of Medicine, Baltimore, MD, USA
- Hannah Hwang
- Department of Ophthalmology, Weill Cornell Medicine, New York, NY, USA
- Wendeng Hu
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
- Shiqi Tao
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
34
Review on knowledge extraction from text and scope in agriculture domain. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10239-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
35
Morandini P, Laino ME, Paoletti G, Carlucci A, Tommasini T, Angelotti G, Pepys J, Canonica GW, Heffler E, Savevski V, Puggioni F. Artificial intelligence processing electronic health records to identify commonalities and comorbidities cluster at Immuno Center Humanitas. Clin Transl Allergy 2022; 12:e12144. [PMID: 35702725 PMCID: PMC9175261 DOI: 10.1002/clt2.12144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2021] [Revised: 02/24/2022] [Accepted: 03/28/2022] [Indexed: 11/12/2022] Open
Abstract
Background: Comorbidities are common in chronic inflammatory conditions, requiring a multidisciplinary treatment approach. Understanding the link between a single disease and its comorbidities is important for appropriate treatment and management. We evaluate the ability of an NLP-based process for knowledge discovery to detect information about pathologies, patients' phenotype, doctors' prescriptions and commonalities in electronic medical records, by extracting information from free narrative text written by clinicians during medical visits, resulting in the extraction of valuable information and enriching real world evidence data from a multidisciplinary setting. Methods: We collected clinical notes from the Allergy Department of Humanitas Research Hospital written in the last 3 years and used them to look for diseases that cluster together as comorbidities associated with the main pathology of our patients, and for the extent of prescription of systemic corticosteroids, thus evaluating the ability of NLP-based tools for knowledge discovery to extract structured information from free text. Results: We found that the 3 most frequent comorbidities to appear in our clusters were asthma, rhinitis, and urticaria, and that 991 (of 2057) patients suffered from at least one of these comorbidities. The clusters that co-occur particularly often are oral allergy syndrome and urticaria (131 patients), angioedema and urticaria (105 patients), and rhinitis and asthma (227 patients). With regard to systemic corticosteroid prescription volume by our clinicians, we found it was lower than in the therapy the patients followed before coming to our attention, with the exception of two diseases: chronic obstructive pulmonary disease and angioedema. Conclusions: This analysis seems to be valid and is confirmed by the data from the literature. This means that NLP tools could have a significant role in many other research fields of medicine, as they may help identify other important, and possibly previously neglected, clusters of patients with comorbidities and commonalities. Another potential benefit of this approach lies in its potential ability to foster a multidisciplinary approach, using the same drugs to treat pathologies normally treated by physicians in different branches of medicine, thus saving resources and improving the pharmacological management of patients.
Collapse
Affiliation(s)
- Maria Elena Laino
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Giovanni Paoletti
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
- Tobia Tommasini
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Giovanni Angelotti
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Jack Pepys
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Giorgio Walter Canonica
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
- Enrico Heffler
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
- Victor Savevski
- Artificial Intelligence Center, IRCCS Humanitas Research Hospital, Milan, Italy
- Francesca Puggioni
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- Personalized Medicine, Asthma and Allergy, IRCCS Humanitas Research Hospital, Milan, Italy
|
36
|
Yang H, Lee N, Park B, Park J, Lee J, Jang HS, Yoo H. Hierarchical network analysis of co-occurring bioentities in literature. Sci Rep 2022; 12:7885. [PMID: 35550589 PMCID: PMC9098521 DOI: 10.1038/s41598-022-12093-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2021] [Accepted: 05/03/2022] [Indexed: 11/09/2022] Open
Abstract
Biomedical databases grow by more than a thousand new publications every day. The large volume of biomedical literature that is being published at an unprecedented rate hinders the discovery of relevant knowledge from keywords of interest to gather new insights and form hypotheses. A text-mining tool, PubTator, helps to automatically annotate bioentities, such as species, chemicals, genes, and diseases, from PubMed abstracts and full-text articles. However, the manual re-organization and analysis of bioentities is a non-trivial and highly time-consuming task. ChexMix was designed to extract the unique identifiers of bioentities from query results. Herein, ChexMix was used to construct a taxonomic tree with allied species among Korean native plants and to extract the medical subject headings unique identifier of the bioentities, which co-occurred with the keywords in the same literature. ChexMix discovered the allied species related to a keyword of interest and experimentally proved its usefulness for multi-species analysis.
Affiliation(s)
- Heejung Yang
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea; Bionsight, Inc., Chuncheon, 24341, Republic of Korea
- Namgil Lee
- Bionsight, Inc., Chuncheon, 24341, Republic of Korea; Department of Information Statistics, Kangwon National University, Gangwondaehak-gil 1, Chuncheon, Gangwon, 24341, Republic of Korea
- Beomjun Park
- Bionsight, Inc., Chuncheon, 24341, Republic of Korea
- Jinyoung Park
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Jiho Lee
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Hyeon Seok Jang
- Department of Pharmacy, Kangwon National University, Chuncheon, 24341, Republic of Korea
- Hojin Yoo
- Bionsight, Inc., Chuncheon, 24341, Republic of Korea
|
37
|
Comparison of Text Mining Models for Food and Dietary Constituent Named-Entity Recognition. MACHINE LEARNING AND KNOWLEDGE EXTRACTION 2022. [DOI: 10.3390/make4010012] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Biomedical Named-Entity Recognition (BioNER) has become an essential part of text mining due to the continuously increasing digital archives of biological and medical articles. While there are many well-performing BioNER tools for entities such as genes, proteins, diseases or species, there is very little research into food and dietary constituent named-entity recognition. For this reason, in this paper, we study seven BioNER models for food and dietary constituents recognition. Specifically, we study a dictionary-based model, a conditional random fields (CRF) model and a new hybrid model, called FooDCoNER (Food and Dietary Constituents Named-Entity Recognition), which we introduce combining the former two models. In addition, we study deep language models including BERT, BioBERT, RoBERTa and ELECTRA. As a result, we find that FooDCoNER does not only lead to the overall best results, comparable with the deep language models, but FooDCoNER is also much more efficient with respect to run time and sample size requirements of the training data. The latter has been identified via the study of learning curves. Overall, our results not only provide a new tool for food and dietary constituent NER but also shed light on the difference between classical machine learning models and recent deep language models.
|
38
|
Borchert F, Meister L, Langer T, Follmann M, Arnrich B, Schapranow MP. Controversial Trials First: Identifying Disagreement Between Clinical Guidelines and New Evidence. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2021:237-246. [PMID: 35308948 PMCID: PMC8861732] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Clinical guidelines integrate latest evidence to support clinical decision-making. As new research findings are published at an increasing rate, it would be helpful to detect when such results disagree with current guideline recommendations. In this work, we describe a software system for the automatic identification of disagreement between clinical guidelines and published research. A critical feature of the system is the extraction and cross-lingual normalization of information through natural language processing. The initial version focuses on the detection of cancer treatments in clinical trial reports that are not addressed in oncology guidelines. We evaluate the relevance of trials retrieved by our system retrospectively by comparison with historic guideline updates and also prospectively through manual evaluation by guideline experts. The system improves precision over state-of-the-art literature research strategies while maintaining near-total recall. Detailed error analysis highlights challenges for fine-grained clinical information extraction, in particular when extracting population definitions for tumor-agnostic therapies.
Affiliation(s)
- Florian Borchert
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Germany
- Laura Meister
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Germany
- Thomas Langer
- German Guideline Program in Oncology, German Cancer Society, Berlin, Germany
- Markus Follmann
- German Guideline Program in Oncology, German Cancer Society, Berlin, Germany
- Bert Arnrich
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Germany
|
39
|
Walker VR, Schmitt CP, Wolfe MS, Nowak AJ, Kulesza K, Williams AR, Shin R, Cohen J, Burch D, Stout MD, Shipkowski KA, Rooney AA. Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr. ENVIRONMENT INTERNATIONAL 2022; 159:107025. [PMID: 34920276 DOI: 10.1016/j.envint.2021.107025] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/01/2021] [Revised: 10/07/2021] [Accepted: 12/03/2021] [Indexed: 06/14/2023]
Abstract
INTRODUCTION There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxicology or environmental health questions due to unique data needs to support reviews in these fields. OBJECTIVES To develop and evaluate a flexible, web-based tool for semi-automated data extraction that: 1) makes data extraction predictions with user verification, 2) integrates token-level annotations, and 3) connects extracted entities to support hierarchical data extraction. METHODS Dextr was developed with Agile software methodology using a two-team approach. The development team outlined proposed features and coded the software. The advisory team guided developers and evaluated Dextr's performance on precision, recall, and extraction time by comparing a manual extraction workflow to a semi-automated extraction workflow using a dataset of 51 environmental health animal studies. RESULTS The semi-automated workflow did not appear to affect precision rate (96.0% vs. 95.4% manual, p = 0.38), resulted in a small reduction in recall rate (91.8% vs. 97.0% manual, p < 0.01), and substantially reduced the median extraction time (436 s vs. 933 s per study manual, p < 0.01) compared to a manual workflow. DISCUSSION Dextr provides similar performance to manual extraction in terms of recall and precision and greatly reduces data extraction time. Unlike other tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study), properly connect the extracted elements within a study, and effectively limit the work required by researchers to generate machine-readable, annotated exports. 
The Dextr tool addresses data-extraction challenges associated with environmental health sciences literature with a simple user interface, incorporates the key capabilities of user verification and entity connecting, provides a platform for further automation developments, and has the potential to improve data extraction for literature reviews in this and other fields.
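The precision and recall figures reported in this abstract follow the standard definitions for comparing extracted items against a reference set. The sketch below is illustrative only and is not Dextr's code: the function names and the toy `gold` and `extracted` sets are hypothetical, and it assumes each extracted data point can be compared to the reference by exact string match.

```python
def extraction_counts(extracted: set[str], gold: set[str]) -> tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = len(extracted & gold)   # extracted and present in the reference
    fp = len(extracted - gold)   # extracted but absent from the reference
    fn = len(gold - extracted)   # in the reference but missed by extraction
    return tp, fp, fn


def precision(tp: int, fp: int) -> float:
    """Fraction of extracted items that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of reference items that were extracted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data points for a single study (not taken from the paper).
gold = {"species:rat", "dose:10 mg/kg", "route:oral", "duration:28 d"}
extracted = {"species:rat", "dose:10 mg/kg", "route:oral", "sex:male"}

tp, fp, fn = extraction_counts(extracted, gold)
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
```

Exact matching is the simplest possible criterion; an evaluation of real extractions would likely need a more tolerant comparison, which this sketch does not attempt.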
Affiliation(s)
- Vickie R Walker
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Charles P Schmitt
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Mary S Wolfe
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Rob Shin
- ICF, Research Triangle Park, NC, USA
- Matthew D Stout
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Kelly A Shipkowski
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
- Andrew A Rooney
- Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
|
40
|
Analyzing COVID-19 Medical Papers Using Artificial Intelligence: Insights for Researchers and Medical Professionals. BIG DATA AND COGNITIVE COMPUTING 2022. [DOI: 10.3390/bdcc6010004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/25/2023]
Abstract
Since the beginning of the COVID-19 pandemic almost two years ago, more than 700,000 scientific papers have been published on the subject. An individual researcher cannot possibly become acquainted with such a huge text corpus, so help from artificial intelligence (AI) is highly needed. We propose an AI-based tool to help researchers navigate collections of medical papers in a meaningful way and extract knowledge from scientific COVID-19 papers. The main idea of our approach is to obtain as much semi-structured information from the text corpus as possible, using named entity recognition (NER) with a model called PubMedBERT and the Text Analytics for Health service, and then to store the data in a NoSQL database for fast further processing and insight generation. Additionally, the contexts in which the entities were used (neutral or negative) are determined. Applying NLP and text-based emotion detection (TBED) methods to the COVID-19 text corpus allows us to gain insights into important issues of diagnosis and treatment (such as changes in medical treatment over time, joint treatment strategies using several medications, and the connection between signs and symptoms of coronavirus).
|
41
|
Abdulkadhar S, Natarajan J. A Text Mining Protocol for Mining Biological Pathways and Regulatory Networks from Biomedical Literature. Methods Mol Biol 2022; 2496:141-157. [PMID: 35713863 DOI: 10.1007/978-1-0716-2305-3_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
A biological pathway or regulatory network is a collection of molecular regulators that can activate changes in cellular processes, leading to the assembly of new molecules through a series of actions among the molecules. There are three important types of pathways in systems biology studies: signaling pathways, metabolic pathways, and genetic pathways (or gene regulatory networks). Recently, the construction of biological pathways from scientific literature has received much attention, as the literature contains a rich set of linguistic features for extracting biological associations between genes and proteins. These associations can be combined to construct biological networks. Here, we present a brief overview of various biological pathways, biomedical text resources/corpora for network construction, and state-of-the-art methods for network construction, followed by our hybrid text mining protocol for extracting pathways and regulatory networks from biomedical literature.
Affiliation(s)
- Sabenabanu Abdulkadhar
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
- Jeyakumar Natarajan
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
|
42
|
Using semantics to scale up evidence-based chemical risk-assessments. PLoS One 2021; 16:e0260712. [PMID: 34910747 PMCID: PMC8673667 DOI: 10.1371/journal.pone.0260712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Accepted: 11/15/2021] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND The manual processes used for risk assessments are not scaling to the amount of data available. Although automated approaches appear promising, they must be transparent in a public policy setting. OBJECTIVE Our goal is to create an automated approach that moves beyond retrieval to the extraction step of the information synthesis process, where evidence is characterized as supporting, refuting, or neutral with respect to a given outcome. METHODS We combine knowledge resources and natural language processing to resolve coordinated ellipses and thus avoid surface-level differences between concepts in an ontology and outcomes in an abstract. As with a systematic review, the search criteria and the inclusion and exclusion criteria are explicit. RESULTS The system scales to 482K abstracts on 27 chemicals. Results for three endpoints that are critical for cancer risk assessments show that refuting evidence (where the outcome decreased) was higher for cell proliferation (45.9%) and general cell changes (37.7%) than for cell death (25.0%). Moreover, cell death was the only endpoint where supporting claims were the majority (61.3%). If the number of abstracts that measure an outcome were used as a proxy for association, there would be a stronger association with cell proliferation than with cell death (20/27 chemicals). However, if the amount of supporting evidence were used (where the outcome increased), the conclusion would change for 21/27 chemicals (20 from proliferation to death and 1 from death to proliferation). CONCLUSIONS We provide decision makers with a visual representation of supporting, neutral, and refuting evidence whilst maintaining the reproducibility and transparency needed for public policy. Our findings show that results from the retrieval step, where the number of abstracts that measure an outcome is reported, can be misleading if not accompanied by results from the extraction step, where the directionality of the outcome is established.
|
43
|
Le Guillarme N, Thuiller W. TaxoNERD: Deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13778] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Affiliation(s)
- Nicolas Le Guillarme
- CNRS, LECA, Laboratoire d'Ecologie Alpine, Université Grenoble Alpes, University Savoie Mont Blanc, Grenoble, France
- Wilfried Thuiller
- CNRS, LECA, Laboratoire d'Ecologie Alpine, Université Grenoble Alpes, University Savoie Mont Blanc, Grenoble, France
|
44
|
Yim WWY, Kurikawa Y, Mizushima N. An exploratory text analysis of the autophagy research field. Autophagy 2021; 18:1648-1661. [PMID: 34812110 PMCID: PMC9298454 DOI: 10.1080/15548627.2021.1995151] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/20/2022] Open
Abstract
After its discovery in the 1950s, the autophagy research field has seen its annual number of publications climb from tens to thousands. The ever-growing number of autophagy publications is a wealth of information but presents a challenge to researchers, especially those new to the field, who are looking for a general overview of the field to, for example, determine current topics of the field or formulate new hypotheses. Here, we employed text mining tools to extract research trends in the autophagy field, including those of genes, terms, and topics. The publication trend of the field can be separated into three phases. The exponential rise in publication number began in the last phase and is most likely spurred by a series of highly cited research papers published in previous phases. The exponential increase in papers has resulted in a larger variety of research topics, with the majority involving those that are directly physiologically relevant, such as disease and modulating autophagy. Our findings provide researchers a summary of the history of the autophagy research field and perhaps hints of what is to come. Abbreviations: 5Y-IF: 5-year impact factor; AIS: article influence score; EM: electron microscopy; HGNC: HUGO gene nomenclature committee; LDA: latent Dirichlet allocation; MeSH: medical subject headings; ncRNA: non-coding RNA.
Affiliation(s)
- Willa Wen-You Yim
- Department of Biochemistry and Molecular Biology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yoshitaka Kurikawa
- Department of Biochemistry and Molecular Biology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Noboru Mizushima
- Department of Biochemistry and Molecular Biology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
|
45
|
Green NL. Argumentation schemes: From genetics to international relations to environmental science policy to AI ethics. ARGUMENT & COMPUTATION 2021. [DOI: 10.3233/aac-210551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Argumentation schemes have played a key role in our research projects on computational models of natural argument over the last decade. The catalogue of schemes in Walton, Reed and Macagno’s 2008 book, Argumentation Schemes, served as our starting point for analysis of the naturally occurring arguments in written text, i.e., text in different genres having different types of author, audience, and subject domain (genetics, international relations, environmental science policy, AI ethics), for different argument goals, and for different possible future applications. We would often first attempt to analyze the arguments in our corpora in terms of those schemes, then adapt schemes as needed for the goals of the project, and in some cases implement them for use in computational models. Among computational researchers, the main interest in argumentation schemes has been for use in argument mining by applying machine learning methods to existing argument corpora. In contrast, a primary goal of our research has been to learn more about written arguments themselves in various contemporary fields. Our approach has been to manually analyze semantics, discourse structure, argumentation, and rhetoric in texts. Another goal has been to create sharable digital corpora containing the results of our studies. Our approach has been to define argument schemes for use by human corpus annotators or for use in logic programs for argument mining. The third goal is to design useful computer applications based upon our studies, such as argument diagramming systems that provide argument schemes as building blocks. This paper describes each of the various projects: the methods, the argument schemes that were identified, and how they were used. Then a synthesis of the results is given with a discussion of open issues.
Affiliation(s)
- Nancy L. Green
- University of North Carolina Greensboro, Greensboro, NC 27402, USA
|
46
|
Baltoumas FA, Zafeiropoulou S, Karatzas E, Paragkamian S, Thanati F, Iliopoulos I, Eliopoulos AG, Schneider R, Jensen LJ, Pafilis E, Pavlopoulos GA. OnTheFly 2.0: a text-mining web application for automated biomedical entity recognition, document annotation, network and functional enrichment analysis. NAR Genom Bioinform 2021; 3:lqab090. [PMID: 34632381 PMCID: PMC8494211 DOI: 10.1093/nargab/lqab090] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/14/2021] [Revised: 09/09/2021] [Accepted: 09/20/2021] [Indexed: 02/06/2023] Open
Abstract
Extracting and processing information from documents is of great importance as lots of experimental results and findings are stored in local files. Therefore, extracting and analyzing biomedical terms from such files in an automated way is absolutely necessary. In this article, we present OnTheFly2.0, a web application for extracting biomedical entities from individual files such as plain texts, office documents, PDF files or images. OnTheFly2.0 can generate informative summaries in popup windows containing knowledge related to the identified terms along with links to various databases. It uses the EXTRACT tagging service to perform named entity recognition (NER) for genes/proteins, chemical compounds, organisms, tissues, environments, diseases, phenotypes and gene ontology terms. Multiple files can be analyzed, whereas identified terms such as proteins or genes can be explored through functional enrichment analysis or be associated with diseases and PubMed entries. Finally, protein-protein and protein-chemical networks can be generated with the use of STRING and STITCH services. To demonstrate its capacity for knowledge discovery, we interrogated published meta-analyses of clinical biomarkers of severe COVID-19 and uncovered inflammatory and senescence pathways that impact disease pathogenesis. OnTheFly2.0 currently supports 197 species and is available at http://bib.fleming.gr:3838/OnTheFly/ and http://onthefly.pavlopouloslab.info.
Affiliation(s)
- Fotis A Baltoumas
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Sofia Zafeiropoulou
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Evangelos Karatzas
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Savvas Paragkamian
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Foteini Thanati
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
- Ioannis Iliopoulos
- Department of Basic Sciences, School of Medicine, University of Crete, Heraklion 71003, Crete, Greece
- Aristides G Eliopoulos
- Department of Biology, School of Medicine, National and Kapodistrian University of Athens, Athens, 70013, Greece
- Reinhard Schneider
- University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, L-4365, Luxembourg
- Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, 2200, Denmark
- Evangelos Pafilis
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Georgios A Pavlopoulos
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming", Vari 16672, Greece
|
47
|
Qin X, Li L, Sun X. Reply to letter to the editor by Kharawala S, et al: Artificial intelligence for assisting systematic reviews: Opportunities with continuing challenges. J Clin Epidemiol 2021; 138:245-246. [PMID: 33753226 DOI: 10.1016/j.jclinepi.2021.03.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2021] [Accepted: 03/11/2021] [Indexed: 02/08/2023]
Affiliation(s)
- Xuan Qin
- Chinese Evidence-based Medicine Center, Cochrane China Center and National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, Sichuan, China
- Ling Li
- Chinese Evidence-based Medicine Center, Cochrane China Center and National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, Sichuan, China
- Xin Sun
- Chinese Evidence-based Medicine Center, Cochrane China Center and National Clinical Research Center for Geriatrics, West China Hospital, Sichuan University, Chengdu 610041, Sichuan, China; Evidence-based Medicine Research Center, School of Basic Science, Jiangxi University of Traditional Chinese Medicine, Nanchang 330004, Jiangxi, China
|
48
|
Emmert-Streib F. Grand Challenges for Artificial Intelligence in Molecular Medicine. FRONTIERS IN MOLECULAR MEDICINE 2021; 1:734659. [PMID: 39087080 PMCID: PMC11285658 DOI: 10.3389/fmmed.2021.734659] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2021] [Accepted: 07/08/2021] [Indexed: 08/02/2024]
Affiliation(s)
- Frank Emmert-Streib
- Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Institute of Biosciences and Medical Technology, Tampere, Finland
|
49
|
Su J, Wu Y, Ting HF, Lam TW, Luo R. RENET2: high-performance full-text gene-disease relation extraction with iterative training data expansion. NAR Genom Bioinform 2021; 3:lqab062. [PMID: 34235433 PMCID: PMC8256824 DOI: 10.1093/nargab/lqab062] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/21/2021] [Revised: 06/16/2021] [Accepted: 06/23/2021] [Indexed: 01/06/2023] Open
Abstract
Relation extraction (RE) is a fundamental task for extracting gene–disease associations from biomedical text. Many state-of-the-art tools have limited capacity, as they can extract gene–disease associations only from single sentences or abstract texts. A few studies have explored extracting gene–disease associations from full-text articles, but considerable room for improvement remains. In this work, we propose RENET2, a deep learning-based RE method, which implements Section Filtering and ambiguous relations modeling to extract gene–disease associations from full-text articles. We designed a novel iterative training data expansion strategy to build an annotated full-text dataset to resolve the scarcity of labels on full-text articles. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene–disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central and found ∼3.72M gene–disease associations; and (ii) the LitCovid articles and ranked the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene–disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available at GitHub.
Affiliation(s)
- Junhao Su
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Ye Wu
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Hing-Fung Ting
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Tak-Wah Lam
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
- Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, 999077, China
|
50
|
Mining Proteome Research Reports: A Bird's Eye View. Proteomes 2021; 9:proteomes9020029. [PMID: 34200663 PMCID: PMC8293458 DOI: 10.3390/proteomes9020029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/22/2021] [Revised: 05/27/2021] [Accepted: 06/08/2021] [Indexed: 01/25/2023] Open
Abstract
The complexity of data has burgeoned to such an extent that scientists of every realm are encountering the incessant challenge of data management. Modern-day analytical approaches with the help of free source tools and programming languages have facilitated access to the context of the various domains as well as specific works reported. Here, with this article, an attempt has been made to provide a systematic analysis of all the available reports at PubMed on Proteome using text mining. The work comprises scientometrics as well as information extraction to provide the publication trends as well as frequent keywords, bioconcepts and, most importantly, a gene–gene co-occurrence network. Out of 33,028 PMIDs collected initially, the segregation of 24,350 articles under 28 Medical Subject Headings (MeSH) was analyzed and plotted. Keyword link network and density visualizations were provided for the top 1000 frequent MeSH keywords. PubTator was used, and 322,026 bioconcepts were extracted under 10 classes (such as Gene, Disease, CellLine, etc.). Co-occurrence networks were constructed for PMID-bioconcept as well as bioconcept–bioconcept associations. Further, to create a subnetwork of gene–gene co-occurrence, a total of 11,100 unique genes participated, with mTOR and AKT showing the highest (64) number of connections. The gene p53 was the most popular one in the network in accordance with both the degree and weighted degree centrality, which were 425 and 1414, respectively. The present piece of study is an amalgam of bibliometrics and scientific data mining methods looking deeper into the whole scale analysis of available literature on proteome.
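The degree and weighted degree figures quoted for the gene–gene co-occurrence network can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names and the toy `docs` list are hypothetical, and it assumes that two genes are linked when they are annotated in the same article, that degree counts distinct partner genes, and that weighted degree sums the number of shared articles.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs: list[list[str]]) -> Counter:
    """Count, for each unordered gene pair, the documents mentioning both."""
    edges: Counter = Counter()
    for genes in docs:
        # Deduplicate and sort so each pair has one canonical key per document.
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return edges


def degrees(edges: Counter) -> tuple[Counter, Counter]:
    """Return (degree, weighted degree) for every gene in the network."""
    degree: Counter = Counter()
    weighted: Counter = Counter()
    for (a, b), weight in edges.items():
        degree[a] += 1          # one more distinct neighbour
        degree[b] += 1
        weighted[a] += weight   # add the number of shared documents
        weighted[b] += weight
    return degree, weighted


# Toy gene annotations for three hypothetical articles.
docs = [["TP53", "MTOR", "AKT1"], ["TP53", "MTOR"], ["TP53", "EGFR"]]

edges = cooccurrence_edges(docs)
degree, weighted = degrees(edges)
print(degree["TP53"], weighted["TP53"])
```

In this toy network TP53 has three distinct partners but four shared-article links, because it co-occurs with MTOR in two articles; the same distinction would explain how a gene's weighted degree can exceed its degree.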
|