1
Vaughan AL, Parvizi E, Matheson P, McGaughran A, Dhami MK. Current stewardship practices in invasion biology limit the value and secondary use of genomic data. Mol Ecol Resour 2023. [PMID: 37647021 DOI: 10.1111/1755-0998.13858] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/19/2023] [Revised: 07/09/2023] [Accepted: 08/14/2023] [Indexed: 09/01/2023]
Abstract
Invasive species threaten native biota, putting fragile ecosystems at risk and having a large-scale impact on primary industries. Growing trade networks and the popularity of personal travel make incursions a more frequent risk, one only compounded by global climate change. With increasing publication of whole-genome sequences lies an opportunity for cross-species assessment of invasive potential. However, the degree to which published sequences are accompanied by satisfactory spatiotemporal data is unclear. We assessed the metadata associated with 199 whole-genome assemblies of 89 invasive terrestrial invertebrate species and found that only 38% of these were derived from field-collected samples. Seventy-six assemblies (38%) reported an 'undescribed' sample origin and, while further examination of associated literature closed this gap to 23.6%, an absence of spatial data remained for 47 of the total assemblies. Of the 76 assemblies that were ultimately determined to be field-collected, associated metadata relevant for invasion studies was predominantly lacking: only 35% (27 assemblies) provided granular location data, and 33% (n = 25) lacked sufficient collection date information. Our results support recent calls for standardized metadata in genome sequencing data submissions, highlighting the impact of missing metadata on current research in invasion biology (and likely other fields). Notably, large-scale consortia tended to provide the most complete metadata submissions in our analysis; such cross-institutional collaborations can foster a culture of increased adherence to improved metadata submission standards and a standard of metadata stewardship that enables reuse of genomes in invasion science.
Affiliation(s)
- Amy L Vaughan
  - Biocontrol & Molecular Ecology, Manaaki Whenua Landcare Research, Lincoln, New Zealand
- Elahe Parvizi
  - Te Aka Mātuatua/School of Science, University of Waikato, Hamilton, New Zealand
- Paige Matheson
  - Te Aka Mātuatua/School of Science, University of Waikato, Hamilton, New Zealand
- Angela McGaughran
  - Te Aka Mātuatua/School of Science, University of Waikato, Hamilton, New Zealand
- Manpreet K Dhami
  - Biocontrol & Molecular Ecology, Manaaki Whenua Landcare Research, Lincoln, New Zealand
2
Bernasconi A, Canakoglu A, Masseroli M, Pinoli P, Ceri S. A review on viral data sources and search systems for perspective mitigation of COVID-19. Brief Bioinform 2021; 22:664-675. [PMID: 33348368 PMCID: PMC7799334 DOI: 10.1093/bib/bbaa359] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/01/2020] [Revised: 10/09/2020] [Accepted: 11/09/2020] [Indexed: 12/26/2022] [Open Access]
Abstract
With the outbreak of the COVID-19 disease, the research community is making unprecedented efforts to better understand and mitigate the effects of the pandemic. In this context, we review the data integration efforts required for accessing and searching genome sequences and metadata of SARS-CoV-2, the virus responsible for the COVID-19 disease, which have been deposited into the most important repositories of viral sequences. Organizations that were already present in the virus domain are now dedicating special interest to the emergence of the COVID-19 pandemic, by emphasizing specific SARS-CoV-2 data and services. At the same time, novel organizations and resources were born in this critical period to serve specifically the purposes of COVID-19 mitigation while setting the research ground for countering possible future pandemics. Accessibility and integration of viral sequence data, possibly in conjunction with the human host genotype and clinical data, are paramount to better understand the COVID-19 disease and mitigate its effects. Few examples of host-pathogen integrated datasets exist so far, but we expect them to grow together with the knowledge of COVID-19 disease; once such datasets are available, useful integrative surveillance mechanisms can be put in place by observing how common variants distribute in time and space, relating them to the phenotypic impact evidenced in the literature.
3
Dellicour S, Lemey P, Artois J, Lam TT, Fusaro A, Monne I, Cattoli G, Kuznetsov D, Xenarios I, Dauphin G, Kalpravidh W, Von Dobschuetz S, Claes F, Newman SH, Suchard MA, Baele G, Gilbert M. Incorporating heterogeneous sampling probabilities in continuous phylogeographic inference - Application to H5N1 spread in the Mekong region. Bioinformatics 2020; 36:2098-2104. [PMID: 31790143 DOI: 10.1093/bioinformatics/btz882] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/27/2019] [Revised: 11/01/2019] [Accepted: 11/22/2019] [Indexed: 12/25/2022] [Open Access]
Abstract
MOTIVATION: The potentially low precision associated with the geographic origin of sampled sequences represents an important limitation for spatially explicit (i.e. continuous) phylogeographic inference of fast-evolving pathogens such as RNA viruses. A substantial proportion of publicly available sequences is geo-referenced at a broad spatial scale, such as the administrative unit of origin, rather than at more precise locations (e.g. geographic coordinates). Most frequently, such sequences are either discarded prior to continuous phylogeographic inference or arbitrarily assigned to the geographic coordinates of the centroid of their administrative area of origin for lack of a better alternative. RESULTS: We here implement and describe a new approach that allows the incorporation of heterogeneous prior sampling probabilities over a geographic area. External data, such as outbreak locations, are used to specify these prior sampling probabilities over a collection of sub-polygons. We apply this new method to the analysis of highly pathogenic avian influenza H5N1 clade data in the Mekong region. Our method makes it possible to properly include, in continuous phylogeographic analyses, H5N1 sequences that are only associated with large administrative areas of origin and to assign them more accurate locations. Finally, we use continuous phylogeographic reconstructions to analyse the dispersal dynamics of different H5N1 clades and investigate the impact of environmental factors on lineage dispersal velocities. AVAILABILITY AND IMPLEMENTATION: Our new method allowing heterogeneous sampling priors for continuous phylogeographic inference is implemented in the open-source multi-platform software package BEAST 1.10. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Simon Dellicour
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, 3000 Leuven, Belgium
  - Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, 1050 Bruxelles, Belgium
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, 3000 Leuven, Belgium
- Jean Artois
  - Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, 1050 Bruxelles, Belgium
- Tommy T Lam
  - State Key Laboratory of Emerging Infectious Diseases, School of Public Health, The University of Hong Kong, Hong Kong SAR, China
- Alice Fusaro
  - Department of Comparative Biomedical Sciences, Istituto Zooprofilattico Sperimentale delle Venezie (IZSVe), Legnaro, Italy
- Isabella Monne
  - Department of Comparative Biomedical Sciences, Istituto Zooprofilattico Sperimentale delle Venezie (IZSVe), Legnaro, Italy
- Giovanni Cattoli
  - Department of Comparative Biomedical Sciences, Istituto Zooprofilattico Sperimentale delle Venezie (IZSVe), Legnaro, Italy
  - Animal Production and Health Laboratory, Joint FAO/IAEA Division, 2444 Seibersdorf, Austria
- Ioannis Xenarios
  - Center for Integrative Genomics, University of Lausanne, 1005 Lausanne, Switzerland
- Wantanee Kalpravidh
  - Food and Agriculture Organization of the United Nations, Regional Office for Asia and the Pacific, Emergency Center of the Transboundary Animal Diseases, Bangkok 10200, Thailand
- Filip Claes
  - Food and Agriculture Organization of the United Nations, Regional Office for Asia and the Pacific, Emergency Center of the Transboundary Animal Diseases, Bangkok 10200, Thailand
- Scott H Newman
  - Food and Agriculture Organization of the United Nations, Regional Office for Africa, Accra, Ghana
- Marc A Suchard
  - Department of Biomathematics, David Geffen School of Medicine, Los Angeles, CA, USA
  - Department of Biostatistics, Fielding School of Public Health, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Guy Baele
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, 3000 Leuven, Belgium
- Marius Gilbert
  - Spatial Epidemiology Lab (SpELL), Université Libre de Bruxelles, 1050 Bruxelles, Belgium
4
Poulin R, Hay E, Jorge F. Taxonomic and geographic bias in the genetic study of helminth parasites. Int J Parasitol 2019; 49:429-435. [PMID: 30797772 DOI: 10.1016/j.ijpara.2018.12.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/08/2018] [Revised: 12/19/2018] [Accepted: 12/19/2018] [Indexed: 11/25/2022]
Abstract
The use of genetic information is now fundamental in parasite taxonomy and systematics, for resolving parasite phylogenies, discovering cryptic species, and elucidating patterns of gene flow among parasite populations. The accumulation of available gene sequences per geographical area or per parasite taxonomic group is likely proportional to species richness, but not without some biases. Certain areas and certain taxonomic groups receive more research effort than others, possibly causing a deficit in the relative number of parasite species being characterized genetically in some areas or taxonomic groups. Here, we use data on the number of parasite records per country or helminth family from the London Natural History Museum host-parasite database, and matching data on the number of gene sequences available from the National Center for Biotechnology Information (NCBI) GenBank database, to determine how available gene sequences scale with species richness across countries or parasitic helminth families. Our quantitative analysis identified countries/regions of the world and helminth families that have received the most effort in genetic research. More importantly, it allowed us to generate lists (based on residuals from the statistical model) of the 20 countries/regions and the 20 helminth families with the largest deficit in available gene sequences relative to their helminth species richness. We propose these lists as useful guides toward future allocation of effort to maximise advances in parasite biodiscovery, systematics and population structure.
Affiliation(s)
- Robert Poulin
  - Department of Zoology, University of Otago, P.O. Box 56, Dunedin 9054, New Zealand
- Eleanor Hay
  - Department of Zoology, University of Otago, P.O. Box 56, Dunedin 9054, New Zealand
- Fátima Jorge
  - Department of Zoology, University of Otago, P.O. Box 56, Dunedin 9054, New Zealand
5
Magge A, Weissenbacher D, Sarker A, Scotch M, Gonzalez-Hernandez G. Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature. Pacific Symposium on Biocomputing 2019; 24:100-111. [PMID: 30864314 PMCID: PMC6417823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence, and disambiguating the locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91% and an overall resolution F1 score of 0.88, all significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
Affiliation(s)
- Arjun Magge
  - College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Abeed Sarker
  - Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Matthew Scotch
  - College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA