1
|
O’Connor K, Weissenbacher D, Elyaderani A, Lautenbach E, Scotch M, Gonzalez-Hernandez G. Patient-Related Metadata Reported in Sequencing Studies of SARS-CoV-2: Protocol for a Scoping Review and Bibliometric Analysis. medRxiv 2024:2023.07.14.23292681. [PMID: 37503241 PMCID: PMC10371180 DOI: 10.1101/2023.07.14.23292681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Background There has been an unprecedented effort to sequence the SARS-CoV-2 virus and examine its molecular evolution. This has been facilitated by the availability of publicly accessible databases, the Global Initiative on Sharing All Influenza Data (GISAID) and GenBank, which collectively hold millions of SARS-CoV-2 sequence records. Genomic epidemiology, however, seeks to go beyond phylogenetic analysis by linking genetic information to patient characteristics and disease outcomes, enabling a comprehensive understanding of transmission dynamics and disease impact. While these repositories include fields reflecting patient-related metadata for a given sequence, inclusion of these demographic and clinical details is scarce. The extent to which patient-related metadata is reported in published sequencing studies, and the quality of that reporting, remain largely unexplored. Methods The NIH's LitCovid collection will be used for automated classification of articles that report depositing SARS-CoV-2 sequences in public repositories, while an independent search will be conducted in PubMed for validation. Data extraction will be conducted using Covidence. The extracted data will be synthesized and summarized to quantify the availability of patient metadata in the published literature of SARS-CoV-2 sequencing studies. For the bibliometric analysis, relevant data points, such as author affiliations and citation metrics, will be extracted. Discussion This scoping review will report on the extent and types of patient-related metadata reported in genomic viral sequencing studies of SARS-CoV-2, identify gaps in this reporting, and make recommendations for improving the quality and consistency of reporting in this area. The bibliometric analysis will uncover trends and patterns in the reporting of patient-related metadata, including differences in reporting based on study types or geographic regions. Co-occurrence networks of author keywords will also be presented.
The insights gained from this study may help improve the quality and consistency of reporting patient metadata, enhancing the utility of sequence metadata and facilitating future research on infectious diseases.
Affiliation(s)
- Karen O’Connor
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Davy Weissenbacher
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, USA
| | - Amir Elyaderani
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ, USA
| | - Ebbing Lautenbach
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Division of Infectious Diseases, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Matthew Scotch
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ, USA
- College of Health Solutions, Arizona State University, Tempe, AZ, USA
| | | |
|
2
|
Jimeno Yepes AJ, Verspoor K. Classifying literature mentions of biological pathogens as experimentally studied using natural language processing. J Biomed Semantics 2023; 14:1. [PMID: 36721225 PMCID: PMC9889128 DOI: 10.1186/s13326-023-00282-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 01/17/2023] [Indexed: 02/02/2023]
Abstract
BACKGROUND Information pertaining to mechanisms, management and treatment of disease-causing pathogens including viruses and bacteria is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, important for understanding of the molecular basis of diseases caused by these agents, requires sifting through a large number of articles to exclude incidental mentions of the pathogens, or references to pathogens in other non-experimental contexts such as public health. OBJECTIVE In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves the experimental study of a pathogen in an experimental context. There are no manually annotated pathogen corpora available for this purpose, while such resources are necessary to support the development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE under some simplifying assumptions for the task definition, and using it to explore automatic methods that specifically support the detection of experimentally studied pathogen mentions in research publications. METHODS We developed a pathogen mention characterisation literature data set -READBiomed-Pathogens- automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens, more specifically using MeSH to link to MEDLINE citations including titles and abstracts with experimentally researched pathogens. 
We experiment with several machine learning-based natural language processing (NLP) algorithms leveraging this data set as training data, to model the task of detecting papers that specifically describe experimental study of a pathogen. RESULTS We show that our data set READBiomed-Pathogens can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria, viruses, and a small number of toxins and other disease-causing agents. CONCLUSIONS We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. Performance of the pathogen mention identification and characterisation algorithms was additionally evaluated on a small manually annotated data set, which shows that the data set we generated allows characterising pathogens of interest. TRIAL REGISTRATION N/A.
Affiliation(s)
- Antonio Jose Jimeno Yepes
- School of Computing Technologies, RMIT University, Melbourne, Australia.
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, Melbourne, Australia
- School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
| |
|
3
|
Folk RA, Siniscalchi CM. Biodiversity at the global scale: the synthesis continues. American Journal of Botany 2021; 108:912-924. [PMID: 34181762 DOI: 10.1002/ajb2.1694] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/04/2021] [Accepted: 04/14/2021] [Indexed: 06/13/2023]
Abstract
Traditionally, the generation and use of biodiversity data and their associated specimen objects have been primarily the purview of individuals and small research groups. While deposition of data and specimens in herbaria and other repositories has long been the norm, throughout most of their history, these resources have been accessible only to a small community of specialists. Through recent concerted efforts, primarily at the level of national and international governmental agencies over the last two decades, the pace of biodiversity data accumulation has accelerated, and a wider array of biodiversity scientists has gained access to this massive accumulation of resources, applying them to an ever-widening compass of research pursuits. We review how these new resources and increasing access to them are affecting the landscape of biodiversity research in plants today, focusing on new applications across evolution, ecology, and other fields that have been enabled specifically by the availability of these data and the global scope that was previously beyond the reach of individual investigators. We give an overview of recent advances organized along three lines: broad-scale analyses of distributional data and spatial information, phylogenetic research circumscribing large clades with comprehensive taxon sampling, and data sets derived from improved accessibility of biodiversity literature. We also review synergies between large data resources and more traditional data collection paradigms, describe shortfalls and how to overcome them, and reflect on the future of plant biodiversity analyses in light of increasing linkages between data types and scientists in our field.
Affiliation(s)
- Ryan A Folk
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
| | - Carolina M Siniscalchi
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
| |
|
4
|
Magge A, Weissenbacher D, O'Connor K, Tahsin T, Gonzalez-Hernandez G, Scotch M. GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography. Bioinformatics 2021; 36:5120-5121. [PMID: 32683454 PMCID: PMC7755405 DOI: 10.1093/bioinformatics/btaa647] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/27/2020] [Revised: 07/03/2020] [Accepted: 07/13/2020] [Indexed: 12/27/2022]
Abstract
Summary We present GeoBoost2, a natural language processing pipeline that extracts the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information’s GenBank for downstream analysis, including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and between borders. In this article, we describe the enhancements over our earlier release, including improvements in end-to-end extraction performance and speed, the availability of a fully functional web interface, and state-of-the-art methods for location extraction using deep learning. Availability and implementation The application is freely available on the web at https://zodo.asu.edu/geoboost2. Source code, usage examples and annotated data for GeoBoost2 are freely available at https://github.com/ZooPhy/geoboost2. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Arjun Magge
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA; Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Davy Weissenbacher
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Karen O'Connor
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Tasnia Tahsin
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA
| | - Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Matthew Scotch
- College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA; Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
| |
|
5
|
Folk RA, Kates HR, LaFrance R, Soltis DE, Soltis PS, Guralnick RP. High-throughput methods for efficiently building massive phylogenies from natural history collections. Applications in Plant Sciences 2021; 9:e11410. [PMID: 33680581 PMCID: PMC7910806 DOI: 10.1002/aps3.11410] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 07/23/2020] [Accepted: 12/20/2020] [Indexed: 05/10/2023]
Abstract
PREMISE Large phylogenetic data sets have often been restricted to small numbers of loci from GenBank, and a vetted sampling-to-sequencing phylogenomic protocol scaling to thousands of species is not yet available. Here, we report a high-throughput collections-based approach that empowers researchers to explore more branches of the tree of life with numerous loci. METHODS We developed an integrated Specimen-to-Laboratory Information Management System (SLIMS), connecting sampling and wet lab efforts with progress tracking at each stage. Using unique identifiers encoded in QR codes and a taxonomic database, a research team can sample herbarium specimens, efficiently record the sampling event, and capture specimen images. After sampling in herbaria, images are uploaded to a citizen science platform for metadata generation, and tissue samples are moved through a simple, high-throughput, plate-based herbarium DNA extraction and sequencing protocol. RESULTS We applied this sampling-to-sequencing workflow to ~15,000 species, producing for the first time a data set with ~50% taxonomic representation of the "nitrogen-fixing clade" of angiosperms. DISCUSSION The approach we present is appropriate at any taxonomic scale and is extensible to other collection types. The widespread use of large-scale sampling strategies repositions herbaria as accessible but largely untapped resources for broad taxonomic sampling with thousands of species.
Affiliation(s)
- Ryan A. Folk
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
| | - Heather R. Kates
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
| | - Raphael LaFrance
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
| | - Douglas E. Soltis
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
- Department of Biology, University of Florida, Gainesville, Florida, USA
- Genetics Institute, University of Florida, Gainesville, Florida, USA
- Biodiversity Institute, University of Florida, Gainesville, Florida, USA
| | - Pamela S. Soltis
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
- Genetics Institute, University of Florida, Gainesville, Florida, USA
- Biodiversity Institute, University of Florida, Gainesville, Florida, USA
| | - Robert P. Guralnick
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
- Biodiversity Institute, University of Florida, Gainesville, Florida, USA
| |
|
6
|
Webb TJ, Vanhoorne B. Linking dimensions of data on global marine animal diversity. Philos Trans R Soc Lond B Biol Sci 2020; 375:20190445. [PMID: 33131434 DOI: 10.1098/rstb.2019.0445] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
Recent decades have seen an explosion in the amount of data available on all aspects of biodiversity, which has led to data-driven approaches to understand how and why diversity varies in time and space. Global repositories facilitate access to various classes of species-level data including biogeography, genetics and conservation status, which are in turn required to study different dimensions of diversity. Ensuring that these different data sources are interoperable is a challenge as we aim to create synthetic data products to monitor the state of the world's biodiversity. One way to approach this is to link data of different classes, and to inventory the availability of data across multiple sources. Here, we use a comprehensive list of more than 200 000 marine animal species, and quantify the availability of data on geographical occurrences, genetic sequences, conservation assessments and DNA barcodes across all phyla and broad functional groups. This reveals a very uneven picture: 44% of species are represented by no record other than their taxonomy, but some species are rich in data. Although these data-rich species are concentrated into a few taxonomic and functional groups, especially vertebrates, data are spread widely across marine animals, with members of all 32 phyla represented in at least one database. By highlighting gaps in current knowledge, our census of marine diversity data helps to prioritize future data collection activities, as well as emphasizing the importance of ongoing sustained observations and archiving of existing data into global repositories. This article is part of the theme issue 'Integrative research perspectives on marine conservation'.
Affiliation(s)
- Thomas J Webb
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield S10 2TN, UK
| | | |
|
7
|
Vaiente MA, Scotch M. Going back to the roots: Evaluating Bayesian phylogeographic models with discrete trait uncertainty. Infection, Genetics and Evolution 2020; 85:104501. [PMID: 32798768 PMCID: PMC7686256 DOI: 10.1016/j.meegid.2020.104501] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/27/2020] [Revised: 08/06/2020] [Accepted: 08/09/2020] [Indexed: 01/14/2023]
Abstract
Phylogeography is a popular way to analyze virus sequences annotated with discrete, epidemiologically relevant trait data. For applied public health surveillance, a key quantity of interest is often the state at the root of the inferred phylogeny. In epidemiological terms, this represents the geographic origin of the observed outbreak. Since determining the origin of an outbreak is often critical for public health intervention, it is prudent to understand how well phylogeographic models perform this root state classification task under various analytical scenarios. Specifically, we investigate how the size of the discrete state space and of the sequence data set influence root state classification accuracy. We performed phylogeographic inference on several simulated DNA data sets while i) increasing the number of sequences and ii) increasing the total number of possible discrete trait values. We show that phylogeographic models tend to perform best at intermediate sequence data set sizes. Further, we demonstrate that a popular metric used for evaluation of phylogeographic models, the Kullback-Leibler (KL) divergence, increases with both discrete state space size and data set size. Further, by modeling phylogeographic root state classification accuracy using logistic regression, we show that KL is not supported as a predictor of model accuracy, indicating its limited utility for assessing phylogeographic model performance on empirical data. These results suggest that relying solely on the KL metric may lead to artificially inflated support for models with finer discretization schemes and larger data set sizes. These results will be important for public health practitioners seeking to use phylogeographic models for applied infectious disease surveillance.
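As an illustration of why KL can grow with the number of discrete states, here is a minimal Python sketch. It assumes the KL divergence is taken from a uniform prior over the K root states to the posterior root-state distribution; the abstract does not state the exact formulation, and the function names are hypothetical.

```python
import math


def kl_from_uniform(posterior):
    """KL(posterior || uniform prior) in nats over K discrete root states.

    Assumption: the prior over root states is uniform (1/K each).
    """
    k = len(posterior)
    if k == 0 or abs(sum(posterior) - 1.0) > 1e-9:
        raise ValueError("posterior must be a non-empty probability vector")
    # States with zero posterior mass contribute 0 by convention.
    return sum(p * math.log(p * k) for p in posterior if p > 0)


def max_kl(k):
    """Upper bound of KL against a uniform prior: all mass on one state."""
    return math.log(k)


# The same 90% posterior mass on one state yields a larger KL when the
# remaining 10% is spread over more alternative states.
two_states = [0.9, 0.1]
ten_states = [0.9] + [0.1 / 9] * 9
print(kl_from_uniform(two_states), kl_from_uniform(ten_states))
```

Under this assumption the attainable maximum is log K, so a finer discretization raises the ceiling of the metric regardless of whether the inferred root state is correct.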
Affiliation(s)
- Matteo A Vaiente
- Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ 85281, USA; College of Health Solutions, Arizona State University, 500 N 3rd St, Phoenix, AZ 85004, USA
| | - Matthew Scotch
- Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ 85281, USA; College of Health Solutions, Arizona State University, 500 N 3rd St, Phoenix, AZ 85004, USA.
| |
|
8
|
Scotch M, Tahsin T, Weissenbacher D, O'Connor K, Magge A, Vaiente M, Suchard MA, Gonzalez-Hernandez G. Incorporating sampling uncertainty in the geospatial assignment of taxa for virus phylogeography. Virus Evol 2019; 5:vey043. [PMID: 30838129 PMCID: PMC6395475 DOI: 10.1093/ve/vey043] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023]
Abstract
Discrete phylogeography using software such as BEAST treats the sampling location of each taxon as fixed, often to a single location without uncertainty. When studying viruses, this implies that there is no possibility that the location of the infected host for that taxon is somewhere else. Here, we relaxed this strong assumption and allowed for analytic integration of uncertainty for discrete virus phylogeography. We used automatic language processing methods to find and assign uncertainty to alternative potential locations. We considered two influenza case studies: H5N1 in Egypt and H1N1 pdm09 in North America. For each, we implemented scenarios in which 25 per cent of the taxa had different amounts of sampling uncertainty, including 10, 30, and 50 per cent uncertainty, and varied how it was distributed for each taxon. This includes scenarios that: (i) placed a specific amount of uncertainty on one location while uniformly distributing the remaining amount across all other candidate locations (correspondingly labeled 10, 30, and 50); (ii) assigned the remaining uncertainty to just one other location, thus ‘splitting’ the uncertainty among two locations (i.e. 10/90, 30/70, and 50/50); and (iii) eliminated uncertainty via two predefined heuristic approaches: assignment to a centroid location (CNTR) or the largest population in the country (POP). We compared all scenarios to a reference standard (RS) in which all taxa had known (absolutely certain) locations. From this, we implemented five random selections of 25 per cent of the taxa and used these for specifying uncertainty. We performed posterior analyses for each scenario, including: (a) virus persistence, (b) migration rates, (c) trunk rewards, and (d) the posterior probability of the root state. The scenarios with sampling uncertainty were closer to the RS than CNTR and POP.
For H5N1, the absolute error of virus persistence had a median range of 0.005–0.047 for scenarios with sampling uncertainty—(i) and (ii) above—versus a range of 0.063–0.075 for CNTR and POP. Persistence for the pdm09 case study followed a similar trend as did our analyses of migration rates across scenarios (i) and (ii). When considering the posterior probability of the root state, we found all but one of the H5N1 scenarios with sampling uncertainty had agreement with the RS on the origin of the outbreak whereas both CNTR and POP disagreed. Our results suggest that assigning geospatial uncertainty to taxa benefits estimation of virus phylogeography as compared to ad-hoc heuristics. We also found that, in general, there was limited difference in results regardless of how the sampling uncertainty was assigned; uniform distribution or split between two locations did not greatly impact posterior results. This framework is available in BEAST v.1.10. In future work, we will explore viruses beyond influenza. We will also develop a web interface for researchers to use our language processing methods to find and assign uncertainty to alternative potential locations for virus phylogeography.
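Scenarios (i) and (ii) can be sketched as per-taxon probability vectors over candidate locations. The Python below is a minimal illustration, not the authors' implementation: the function names and the location labels A to D are hypothetical, and the abstract does not specify to which location the labeled percentage attaches, so the caller passes the probability for the chosen location explicitly.

```python
def spread_uniform(locations, focal, p_focal):
    """Scenario (i): put p_focal on one location and spread the
    remainder uniformly across all other candidate locations."""
    if focal not in locations or len(locations) < 2:
        raise ValueError("need the focal location plus at least one alternative")
    if not 0.0 < p_focal < 1.0:
        raise ValueError("p_focal must lie strictly between 0 and 1")
    rest = (1.0 - p_focal) / (len(locations) - 1)
    return {loc: (p_focal if loc == focal else rest) for loc in locations}


def split_two(focal, other, p_focal):
    """Scenario (ii): split all probability between exactly two locations."""
    if not 0.0 < p_focal < 1.0:
        raise ValueError("p_focal must lie strictly between 0 and 1")
    return {focal: p_focal, other: 1.0 - p_focal}


# Hypothetical candidate locations for one taxon.
candidates = ["A", "B", "C", "D"]
uniform_case = spread_uniform(candidates, "A", 0.7)
split_case = split_two("A", "B", 0.3)
print(uniform_case)
print(split_case)
```

Each dictionary sums to 1 and could serve as the location prior for a single taxon; the heuristics in (iii) correspond instead to a vector with all mass on one location.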
Affiliation(s)
- Matthew Scotch
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ, USA
| | - Tasnia Tahsin
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA
| | - Davy Weissenbacher
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 423 Guardian Drive, Philadelphia, PA, USA
| | - Karen O'Connor
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 423 Guardian Drive, Philadelphia, PA, USA
| | - Arjun Magge
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ, USA
| | - Matteo Vaiente
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ, USA
| | - Marc A Suchard
- Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, 621 Charles E. Young Dr. South, Los Angeles, CA, USA; Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, 695 Charles E. Young Dr. South, Los Angeles, CA, USA; Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, 650 Charles E Young Dr. South, Los Angeles, CA, USA
| | - Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 423 Guardian Drive, Philadelphia, PA, USA
| |
|
9
|
Magge A, Weissenbacher D, Sarker A, Scotch M, Gonzalez-Hernandez G. Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature. Pacific Symposium on Biocomputing 2019; 24:100-111. [PMID: 30864314 PMCID: PMC6417823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91% and an overall resolution F1 score of 0.88, which are significantly higher than those of previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
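For readers unfamiliar with the metric, a strict detection F1 score can be sketched as follows. This is a minimal Python illustration that assumes "strict" means a predicted toponym counts only when its character span matches a gold span exactly; the abstract does not define the scoring procedure, and the spans below are made-up examples rather than data from the paper.

```python
def strict_f1(gold_spans, predicted_spans):
    """Precision, recall and F1 where only exact (start, end) span
    matches count as true positives (assumed meaning of 'strict')."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical character offsets of toponyms in one article.
gold = [(0, 5), (10, 15), (20, 28)]
predicted = [(0, 5), (10, 14), (20, 28)]  # the second span is off by one
precision, recall, f1 = strict_f1(gold, predicted)
print(precision, recall, f1)
```

Under this definition the off-by-one span counts as both a false positive and a false negative, which is why strict scores are typically lower than overlap-based ones.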
Affiliation(s)
- Arjun Magge
- College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
| | - Davy Weissenbacher
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Abeed Sarker
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Matthew Scotch
- College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
| | - Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
|
10
|
Beard R, Wentz E, Scotch M. A systematic review of spatial decision support systems in public health informatics supporting the identification of high risk areas for zoonotic disease outbreaks. Int J Health Geogr 2018; 17:38. [PMID: 30376842 PMCID: PMC6208014 DOI: 10.1186/s12942-018-0157-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 07/26/2018] [Accepted: 10/19/2018] [Indexed: 12/13/2022]
Abstract
BACKGROUND Zoonotic diseases account for a substantial portion of infectious disease outbreaks and place a burden on public health programs to maintain surveillance and preventative measures. Taking advantage of new modeling approaches and data sources has become necessary in an interconnected global community. To facilitate data collection, analysis, and decision-making, the number of spatial decision support systems reported in the last 10 years has increased. This systematic review aims to describe characteristics of spatial decision support systems developed to assist public health officials in the management of zoonotic disease outbreaks. METHODS A systematic search of the Google Scholar database was undertaken for published articles written between 2008 and 2018, with no language restriction. A manual search of titles and abstracts using Boolean logic and keyword search terms was undertaken using predefined inclusion and exclusion criteria. Data extraction included items such as spatial database management, visualizations, and report generation. RESULTS For this review we screened 34 full text articles. Design and reporting quality were assessed, resulting in a final set of 12 articles, which were evaluated on proposed interventions and whose identifying characteristics were described. Multisource data integration and user-centered design were inconsistently applied, though the articles indicated diverse utilization of modeling techniques. CONCLUSIONS The characteristics, data sources, development and modeling techniques implemented in the design of recent SDSS that target zoonotic disease outbreaks were described. There are still many challenges to address during the design process to effectively utilize the value of emerging data sources and modeling methods. In the future, development should adhere to comparable standards for functionality and system development, such as user input for system requirements and flexible interfaces to visualize data that exist on different scales.
PROSPERO registration number: CRD42018110466.
Affiliation(s)
- Rachel Beard
- College of Health Solutions, Arizona State University, Phoenix, AZ USA
- Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ USA
| | - Elizabeth Wentz
- School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ USA
| | - Matthew Scotch
- College of Health Solutions, Arizona State University, Phoenix, AZ USA
- Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ USA
| |
|