1
Ge Y, Guo Y, Yang YC, Al-Garadi MA, Sarker A. A comparison of few-shot and traditional named entity recognition models for medical text. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS 2022; 2022:84-89. [PMID: 37641590 PMCID: PMC10462421 DOI: 10.1109/ichi54592.2022.00024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Many research problems involving medical texts have limited amounts of annotated data available (e.g., expressions of rare diseases). Traditional supervised machine learning algorithms, particularly those based on deep neural networks, require large volumes of annotated data, and they underperform when only small amounts of labeled data are available. Few-shot learning (FSL) is a category of machine learning models designed to solve problems for which only small annotated datasets are available. However, no current study compares the performance of FSL models with that of traditional models (e.g., conditional random fields) for medical text at different training set sizes. In this paper, we attempted to fill this gap by comparing multiple FSL models with traditional models for the task of named entity recognition (NER) from medical texts. Using five health-related annotated NER datasets, we benchmarked three traditional NER models based on BERT: BERT-Linear Classifier (BLC), BERT-CRF (BC), and SANER; and three FSL NER models: StructShot & NNShot, Few-Shot Slot Tagging (FS-ST), and ProtoNER. Our benchmarking results show that almost all models, whether traditional or FSL, perform significantly below the state of the art when trained on small amounts of data. For the NER experiments we executed, the F1-scores were very low with small training sets, typically below 30%. FSL models that were reported to perform well on non-medical texts significantly underperformed, compared to their reported best, on medical texts. Our experiments also suggest that FSL methods tend to perform worse on datasets from noisy sources of medical text, such as social media (which includes misspellings and colloquial expressions), than on less noisy sources such as medical literature. Our experiments demonstrate that the current state-of-the-art FSL systems are not yet suitable for effective NER in medical natural language processing tasks, and further research is needed to improve their performance. Creation of specialized, standardized datasets replicating real-world scenarios may help to move this category of methods forward.
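The F1-scores mentioned in this abstract are, in NER work, commonly computed at the entity level: a predicted entity counts as correct only if its span and type both match the gold annotation. The following is a minimal sketch of that strict matching, assuming BIO-style tags; the function names and the toy tag sequences are illustrative and are not taken from the paper's evaluation code.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded in a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues_span = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues_span:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_tags, pred_tags):
    """Strict entity-level F1: span boundaries and type must both match."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up tags): one of two gold entities is recovered exactly.
gold = ["B-DRUG", "I-DRUG", "O", "B-SYMPTOM", "O"]
pred = ["B-DRUG", "I-DRUG", "O", "O", "B-SYMPTOM"]
f1 = entity_f1(gold, pred)
```

In this toy case the misplaced SYMPTOM span counts as both a false positive and a false negative, which is why strict entity-level scores can drop quickly when boundaries are slightly off.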
Affiliation(s)
- Yao Ge
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA
- Yuting Guo
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA
- Yuan-Chi Yang
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA
- Abeed Sarker
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA
2
Wang H, Zang Y, Zhao Y, Hao D, Kang Y, Zhang J, Zhang Z, Zhang L, Yang Z, Zhang S. Sequence Matching between Hemagglutinin and Neuraminidase through Sequence Analysis Using Machine Learning. Viruses 2022; 14:v14030469. [PMID: 35336876 PMCID: PMC8950662 DOI: 10.3390/v14030469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2022] [Revised: 02/15/2022] [Accepted: 02/17/2022] [Indexed: 01/27/2023] Open
Abstract
To date, many experiments have revealed that the functional balance between hemagglutinin (HA) and neuraminidase (NA) plays a crucial role in viral mobility, production, and transmission. However, whether and how HA and NA maintain balance at the sequence level needs further investigation. Here, we applied principal component analysis and hierarchical clustering analysis on thousands of HA and NA sequences of A/H1N1 and A/H3N2. We discovered significant coevolution between HA and NA at the sequence level, which is closely related to the type of host species and virus epidemic years. Furthermore, we propose a sequence-to-sequence transformer model (S2STM), which mainly consists of an encoder and a decoder that adopts a multi-head attention mechanism for establishing the mapping relationship between HA and NA sequences. The training results reveal that the S2STM can effectively realize the “translation” from HA to NA or vice versa, thereby building a relationship network between them. Our work combines unsupervised and supervised machine learning methods to identify the sequence matching between HA and NA, which will advance our understanding of IAVs’ evolution and also provide a novel idea for sequence analysis methods.
Affiliation(s)
- Zhiwei Yang
- Correspondence: (Z.Y.); (S.Z.); Tel.: +86-029-8266-8634 (Z.Y.); +86-029-8266-0915 (S.Z.)
- Shengli Zhang
- Correspondence: (Z.Y.); (S.Z.); Tel.: +86-029-8266-8634 (Z.Y.); +86-029-8266-0915 (S.Z.)
3
Peterson KS, Lewis J, Patterson OV, Chapman AB, Denhalter DW, Lye PA, Stevens VW, Gamage SD, Roselle GA, Wallace KS, Jones M. Automated Travel History Extraction From Clinical Notes for Informing the Detection of Emergent Infectious Disease Events: Algorithm Development and Validation. JMIR Public Health Surveill 2021; 7:e26719. [PMID: 33759790 PMCID: PMC7993087 DOI: 10.2196/26719] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/23/2020] [Revised: 02/05/2021] [Accepted: 02/12/2021] [Indexed: 02/02/2023] Open
Abstract
Background: Patient travel history can be crucial in evaluating evolving infectious disease events. Such information can be challenging to acquire in electronic health records, as it is often available only in unstructured text.
Objective: This study aims to assess the feasibility of annotating and automatically extracting travel history mentions from unstructured clinical documents in the Department of Veterans Affairs across disparate health care facilities and among millions of patients. Information about travel exposure augments existing surveillance applications for increased preparedness in responding quickly to public health threats.
Methods: Clinical documents related to arboviral disease were annotated following selection using a semiautomated bootstrapping process. Using annotated instances as training data, models were developed to extract from unstructured clinical text any mention of affirmed travel locations outside of the continental United States. Automated text processing models, involving machine learning and neural language models, were evaluated for extraction accuracy.
Results: Among 4584 annotated instances, 2659 (58%) contained an affirmed mention of travel history, while 347 (7.6%) were negated. Interannotator agreement resulted in a document-level Cohen kappa of 0.776. Automated text processing accuracy (F1 85.6, 95% CI 82.5-87.9) and computational burden were acceptable such that the system can provide a rapid screen for public health events.
Conclusions: Automated extraction of patient travel history from clinical documents is feasible for enhanced passive surveillance public health systems. Without such a system, it would usually be necessary to manually review charts to identify recent travel or lack of travel, use an electronic health record that enforces travel history documentation, or ignore this potential source of information altogether. The development of this tool was initially motivated by emergent arboviral diseases. More recently, this system was used in the early phases of response to COVID-19 in the United States, although its utility was limited to a relatively brief window due to the rapid domestic spread of the virus. Such systems may aid future efforts to prevent and contain the spread of infectious diseases.
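The interannotator agreement reported in this abstract is a Cohen kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The sketch below shows the standard two-annotator formula on made-up document-level labels; the label names and data are illustrative assumptions and are not drawn from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (made up): ten documents, the annotators disagree on one of them.
annotator_1 = ["travel"] * 5 + ["no_travel"] * 5
annotator_2 = ["travel"] * 4 + ["no_travel"] * 6
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here raw agreement is 0.9 but chance agreement is 0.5, so kappa comes out at about 0.8, which illustrates why kappa is usually lower than simple percent agreement.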
Affiliation(s)
- Kelly S Peterson
- VA Salt Lake City Health Care System, US Department of Veterans Affairs, Salt Lake City, UT, United States; Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT, United States
- Julia Lewis
- VA Salt Lake City Health Care System, US Department of Veterans Affairs, Salt Lake City, UT, United States; Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT, United States
- Olga V Patterson
- VA Salt Lake City Health Care System, US Department of Veterans Affairs, Salt Lake City, UT, United States; Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT, United States
- Alec B Chapman
- VA Salt Lake City Health Care System, US Department of Veterans Affairs, Salt Lake City, UT, United States; Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT, United States
- Daniel W Denhalter
- VA Salt Lake City Health Care System, US Department of Veterans Affairs, Salt Lake City, UT, United States; Department of Rocky Mountain Cancer Data Systems, University of Utah, Salt Lake City, UT, United States
- Patricia A Lye
- National Infectious Diseases Service, Specialty Care Services, US Department of Veterans Affairs, Cincinnati, OH, United States
- Vanessa W Stevens
- VA Salt Lake City Health Care System, US Department of Veterans Affairs, Salt Lake City, UT, United States; Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT, United States
- Shantini D Gamage
- National Infectious Diseases Service, Specialty Care Services, US Department of Veterans Affairs, Cincinnati, OH, United States; Division of Infectious Diseases, Department of Internal Medicine, University of Cincinnati College of Medicine, Cincinnati, OH, United States
- Gary A Roselle
- National Infectious Diseases Service, Specialty Care Services, US Department of Veterans Affairs, Cincinnati, OH, United States; Division of Infectious Diseases, Department of Internal Medicine, University of Cincinnati College of Medicine, Cincinnati, OH, United States; Cincinnati VA Medical Center, US Department of Veterans Affairs, Cincinnati, OH, United States
- Katherine S Wallace
- Office of Biosurveillance, Veterans Affairs Central Office, US Department of Veterans Affairs, Washington, DC, United States; National Biosurveillance Integration Center, Countering Weapons of Mass Destruction, Department of Homeland Security, Washington, DC, United States
- Makoto Jones
- VA Salt Lake City Health Care System, US Department of Veterans Affairs, Salt Lake City, UT, United States; Division of Epidemiology, Department of Internal Medicine, University of Utah, Salt Lake City, UT, United States
4
Magge A, Weissenbacher D, O'Connor K, Tahsin T, Gonzalez-Hernandez G, Scotch M. GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography. Bioinformatics 2021; 36:5120-5121. [PMID: 32683454 PMCID: PMC7755405 DOI: 10.1093/bioinformatics/btaa647] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/27/2020] [Revised: 07/03/2020] [Accepted: 07/13/2020] [Indexed: 12/27/2022] Open
Abstract
Summary: We present GeoBoost2, a natural language processing pipeline for extracting the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information’s GenBank, for downstream analysis including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and between borders. In this article, we describe the enhancements over our earlier release, including improved end-to-end extraction performance and speed, a fully functional web interface, and state-of-the-art methods for location extraction using deep learning.
Availability and implementation: The application is freely available on the web at https://zodo.asu.edu/geoboost2. Source code, usage examples and annotated data for GeoBoost2 are freely available at https://github.com/ZooPhy/geoboost2.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Arjun Magge
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA; Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Karen O'Connor
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Tasnia Tahsin
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Matthew Scotch
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA; Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
5
Vaiente MA, Scotch M. Going back to the roots: Evaluating Bayesian phylogeographic models with discrete trait uncertainty. INFECTION, GENETICS AND EVOLUTION: JOURNAL OF MOLECULAR EPIDEMIOLOGY AND EVOLUTIONARY GENETICS IN INFECTIOUS DISEASES 2020; 85:104501. [PMID: 32798768 PMCID: PMC7686256 DOI: 10.1016/j.meegid.2020.104501] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/27/2020] [Revised: 08/06/2020] [Accepted: 08/09/2020] [Indexed: 01/14/2023]
Abstract
Phylogeography is a popular way to analyze virus sequences annotated with discrete, epidemiologically relevant trait data. For applied public health surveillance, a key quantity of interest is often the state at the root of the inferred phylogeny. In epidemiological terms, this represents the geographic origin of the observed outbreak. Since determining the origin of an outbreak is often critical for public health intervention, it is prudent to understand how well phylogeographic models perform this root state classification task under various analytical scenarios. Specifically, we investigate how the size of the discrete state space and of the sequence data set influence root state classification accuracy. We performed phylogeographic inference on several simulated DNA data sets while i) increasing the number of sequences and ii) increasing the total number of possible discrete trait values. We show that phylogeographic models tend to perform best at intermediate sequence data set sizes. Further, we demonstrate that a popular metric used for evaluation of phylogeographic models, the Kullback-Leibler (KL) divergence, increases with both discrete state space and data set sizes. Moreover, by modeling phylogeographic root state classification accuracy using logistic regression, we show that KL is not supported as a predictor of model accuracy, indicating its limited utility for assessing phylogeographic model performance on empirical data. These results suggest that relying solely on the KL metric may lead to artificially inflated support for models with finer discretization schemes and larger data set sizes. These results will be important for public health practitioners seeking to use phylogeographic models for applied infectious disease surveillance.
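One way to see why a KL-based metric can grow with the number of discrete states is to compute the divergence of a root state posterior from a uniform prior: its largest possible value is log K for K states, whether or not the favored state is the true origin. The sketch below assumes a uniform prior over states, which is an assumption made here for illustration; the posterior vectors are made up and do not come from the paper's simulations.

```python
import math


def kl_from_uniform(posterior):
    """KL divergence (in nats) of a root state posterior from a uniform prior."""
    k = len(posterior)
    if k == 0 or abs(sum(posterior) - 1.0) > 1e-9:
        raise ValueError("posterior must be a non-empty probability vector")
    prior = 1.0 / k
    # Terms with zero posterior mass contribute nothing to the sum.
    return sum(p * math.log(p / prior) for p in posterior if p > 0)


# A posterior concentrated on a single state reaches the maximum, log K,
# so the attainable KL value rises as the state space gets finer.
kl_4_states = kl_from_uniform([1.0, 0.0, 0.0, 0.0])
kl_16_states = kl_from_uniform([1.0] + [0.0] * 15)

# A posterior equal to the prior carries no information: KL is zero.
kl_uninformative = kl_from_uniform([0.25, 0.25, 0.25, 0.25])
```

Because the value depends only on how peaked the posterior is, a confident but wrong root state scores as highly as a confident and correct one, which is consistent with the abstract's caution about using KL alone to judge model performance.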
Affiliation(s)
- Matteo A Vaiente
- Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ 85281, USA; College of Health Solutions, Arizona State University, 500 N 3rd St, Phoenix, AZ 85004, USA
- Matthew Scotch
- Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ 85281, USA; College of Health Solutions, Arizona State University, 500 N 3rd St, Phoenix, AZ 85004, USA.
6
Junge A, Jensen LJ. CoCoScore: context-aware co-occurrence scoring for text mining applications using distant supervision. Bioinformatics 2020; 36:264-271. [PMID: 31199464 PMCID: PMC6956794 DOI: 10.1093/bioinformatics/btz490] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/18/2018] [Revised: 05/30/2019] [Accepted: 06/10/2019] [Indexed: 12/18/2022] Open
Abstract
Motivation: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence.
Results: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications.
Availability and implementation: CoCoScore is available at: https://github.com/JungeAlexander/cocoscore.
Supplementary information: Supplementary data are available at Bioinformatics online.
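The distant supervision step described in this abstract, labeling each co-mention by whether its entity pair appears in a gold standard, can be sketched in a few lines. This is an illustrative sketch only and does not use the CoCoScore package; the function name, the tuple layout, the treatment of pairs as unordered, and the toy sentences and gold pairs are all assumptions made for the example.

```python
def label_co_mentions(co_mentions, gold_pairs):
    """Attach a 1/0 label to each co-mention based on gold-standard membership.

    co_mentions: iterable of (sentence, entity_a, entity_b) tuples.
    gold_pairs: iterable of (entity_a, entity_b) pairs, treated as unordered here.
    """
    gold = {frozenset(pair) for pair in gold_pairs}
    labeled = []
    for sentence, entity_a, entity_b in co_mentions:
        label = 1 if frozenset((entity_a, entity_b)) in gold else 0
        labeled.append((sentence, entity_a, entity_b, label))
    return labeled


# Toy inputs (made up): placeholder entity names, not real associations.
gold_pairs = [("GENE_A", "DISEASE_X")]
co_mentions = [
    ("DISEASE_X is linked to variants in GENE_A.", "DISEASE_X", "GENE_A"),
    ("GENE_B was measured in DISEASE_X samples.", "GENE_B", "DISEASE_X"),
]
labeled = label_co_mentions(co_mentions, gold_pairs)
labels = [row[3] for row in labeled]
```

Labels produced this way are noisy by construction: a sentence can mention a gold-standard pair without asserting the association, which is the kind of case a context-aware sentence scorer is meant to handle.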
Affiliation(s)
- Alexander Junge
- Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen N 2200, Denmark
- Lars Juhl Jensen
- Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen N 2200, Denmark
7
Magge A, Weissenbacher D, Sarker A, Scotch M, Gonzalez-Hernandez G. Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2019; 24:100-111. [PMID: 30864314 PMCID: PMC6417823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. Insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) in the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F1 score of 0.88 that are significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
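The disambiguation step mentioned in this abstract relies on population heuristics. A common form of such a heuristic is to resolve an ambiguous place name to the gazetteer candidate with the largest population. The sketch below illustrates that idea under stated assumptions: the gazetteer structure, the function name, and every entry (names, coordinates, populations) are hypothetical placeholders, and the paper's actual heuristic may differ.

```python
def disambiguate_by_population(toponym, gazetteer):
    """Pick the gazetteer candidate with the largest population, or None."""
    candidates = gazetteer.get(toponym.lower(), [])
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate["population"])


# Hypothetical gazetteer: one name shared by two made-up places.
gazetteer = {
    "exampleville": [
        {"country": "Country A", "lat": 10.0, "lon": 20.0, "population": 120000},
        {"country": "Country B", "lat": -5.0, "lon": 40.0, "population": 3000},
    ],
}

resolved = disambiguate_by_population("Exampleville", gazetteer)
unknown = disambiguate_by_population("Nowhere", gazetteer)
```

The heuristic is cheap and needs no context, but it will always favor the larger place, so it is usually treated as a baseline or a tie-breaker rather than a complete resolver.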
Affiliation(s)
- Arjun Magge
- College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Abeed Sarker
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Matthew Scotch
- College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA