1
Finkbeiner A, Khatib A, Upham N, Sterner B. A Systematic Review of the Distribution and Prevalence of Viruses Detected in the Peromyscus maniculatus Species Complex (Rodentia: Cricetidae). bioRxiv: The Preprint Server for Biology 2024:2024.07.04.602117. [PMID: 39026800 PMCID: PMC11257420 DOI: 10.1101/2024.07.04.602117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
The North American Deermouse, Peromyscus maniculatus, is one of the most widespread and abundant mammals on the continent. It is of public health interest as a known host of several viruses that are transmissible to humans and can cause illness, including the acute respiratory disease Hantavirus Pulmonary Syndrome (HPS). However, recent taxonomic studies indicate that P. maniculatus is a complex of multiple species, raising questions about how to identify and interpret three decades of hantavirus monitoring data. We conducted a systematic review investigating the prevalence and spatial distribution of viral taxa detected in wild populations allocated to P. maniculatus. From the 46 relevant studies published from 2000 to 2022, we extracted and analyzed spatial occurrence data to calculate weighted populational prevalences for hantaviruses. We found that detection efforts have been concentrated in the Western United States and Mexico with a focus on the spread of Sin Nombre virus, the primary causative agent of HPS. There are significant gaps in the existing literature both geographically and in regard to the types of hantaviruses being sampled. These results are significantly impacted by a recent taxonomic split of P. maniculatus into four species, which results in the relabeling of 92% of hantavirus observations. Considering the uncertain, and likely multiple, phylogenetic histories of these viral hosts should be a key emphasis of future modeling efforts.
Affiliation(s)
- Ahmad Khatib
- School of Life Sciences, Arizona State University
- Nathan Upham
- School of Life Sciences, Arizona State University
2
Cho MH, Cho KH, No KT. PhyloSophos: a high-throughput scientific name mapping algorithm augmented with explicit consideration of taxonomic science, and its application on natural product (NP) occurrence database processing. BMC Bioinformatics 2023; 24:475. [PMID: 38097955 PMCID: PMC10722791 DOI: 10.1186/s12859-023-05588-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] Open
Abstract
BACKGROUND The standardization of biological data using unique identifiers is vital for seamless data integration, comprehensive interpretation, and reproducibility of research findings, contributing to advancements in bioinformatics and systems biology. Despite being widely accepted as a universal identifier, scientific names for biological species have inherent limitations, including lack of stability, uniqueness, and convertibility, hindering their effective use as identifiers in databases, particularly in natural product (NP) occurrence databases, posing a substantial obstacle to utilizing this valuable data for large-scale research applications. RESULT To address these challenges and facilitate high-throughput analysis of biological data involving scientific names, we developed PhyloSophos, a Python package that considers the properties of scientific names and taxonomic systems to accurately map name inputs to entries within a chosen reference database. We illustrate the importance of assessing multiple taxonomic databases and considering taxonomic syntax-based pre-processing using NP occurrence databases as an example, with the ultimate goal of integrating heterogeneous information into a single, unified dataset. CONCLUSIONS We anticipate PhyloSophos to significantly aid in the systematic processing of poorly digitized and curated biological data, such as biodiversity information and ethnopharmacological resources, enabling full-scale bioinformatics analysis using these valuable data resources.
Affiliation(s)
- Min Hyung Cho
- Bioinformatics and Molecular Design Research Center (BMDRC), 209, Veritas A Hall, Yonsei University, 85 Songdogwahak-ro, Yeonsu-gu, Incheon, 21983, Republic of Korea
- Kwang-Hwi Cho
- School of Systems Biomedical Science, Soongsil University, Seoul, 06978, South Korea
- Kyoung Tai No
- Bioinformatics and Molecular Design Research Center (BMDRC), 209, Veritas A Hall, Yonsei University, 85 Songdogwahak-ro, Yeonsu-gu, Incheon, 21983, Republic of Korea
- Department of Integrative Biotechnology and Translational Medicine, 214, Veritas A Hall, Yonsei University, 85 Songdogwahak-ro, Yeonsu-gu, Incheon, 21983, Republic of Korea
3
Seah BKB. Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers. Biodivers Data J 2023; 11:e114076. [PMID: 38312332 PMCID: PMC10838036 DOI: 10.3897/bdj.11.e114076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 11/06/2023] [Indexed: 02/06/2024] Open
Abstract
Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility, yet the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
Affiliation(s)
- Brandon Kwee Boon Seah
- Thünen Institute for Biodiversity, Braunschweig, Germany
4
Sterner B, Elliott S, Gilbert EE, Franz NM. Unified and pluralistic ideals for data sharing and reuse in biodiversity. Database (Oxford) 2023; 2023:baad048. [PMID: 37465916 PMCID: PMC10354506 DOI: 10.1093/database/baad048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 05/30/2023] [Accepted: 06/27/2023] [Indexed: 07/20/2023]
Abstract
How should billions of species observations worldwide be shared and made reusable? Many biodiversity scientists assume the ideal solution is to standardize all datasets according to a single, universal classification and aggregate them into a centralized, global repository. This ideal has known practical and theoretical limitations, however, which justifies investigating alternatives. To support better community deliberation and normative evaluation, we develop a novel conceptual framework showing how different organizational models, regulative ideals and heuristic strategies are combined to form shared infrastructures supporting data reuse. The framework is anchored in a general definition of data pooling as an activity of making a taxonomically standardized body of information available for community reuse via digital infrastructure. We describe and illustrate unified and pluralistic ideals for biodiversity data pooling and show how communities may advance toward these ideals using different heuristic strategies. We present evidence for the strengths and limitations of the unification and pluralistic ideals based on systemic relationships of power, responsibility and benefit they establish among stakeholders, and we conclude the pluralistic ideal is better suited for biodiversity data.
Affiliation(s)
- Beckett Sterner
- School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ 85281, USA
- Steve Elliott
- School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ 85281, USA
- Edward E Gilbert
- School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ 85281, USA
- Nico M Franz
- School of Life Sciences, Arizona State University, 427 E Tyler Mall, Tempe, AZ 85281, USA
5
Tam J, Lagisz M, Cornwell W, Nakagawa S. Quantifying research interests in 7,521 mammalian species with h-index: a case study. Gigascience 2022; 11:6665406. [PMID: 35962776 PMCID: PMC9375528 DOI: 10.1093/gigascience/giac074] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/07/2021] [Revised: 04/11/2022] [Accepted: 06/27/2022] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND Taxonomic bias is a known issue within the field of biology, causing scientific knowledge to be unevenly distributed across species. However, a systematic quantification of the research interest that the scientific community has allocated to individual species remains a big data problem. Scalable approaches are needed to integrate biodiversity data sets and bibliometric methods across large numbers of species. The outputs of these analyses are important for identifying understudied species and directing future research to fill these gaps. FINDINGS In this study, we used the species h-index to quantify the research interest in 7,521 species of mammals. We tested factors potentially driving species h-index, by using a Bayesian phylogenetic generalized linear mixed model (GLMM). We found that a third of the mammals had a species h-index of zero, while a select few had inflated research interest. Further, mammals with higher species h-index had larger body masses; were found in temperate latitudes; had their human uses documented, including domestication; and were in lower-risk International Union for Conservation of Nature Red List categories. These results surprisingly suggested that critically endangered mammals are understudied. A higher interest in domesticated species suggested that human use is a major driver and focus in mammalian scientific literature. CONCLUSIONS Our study has demonstrated a scalable workflow and systematically identified understudied species of mammals, as well as identified the likely drivers of this taxonomic bias in the literature. This case study can become a benchmark for future research that asks similar biological and meta-research questions for other taxa.
Affiliation(s)
- Jessica Tam
- Evolution & Ecology Research Centre and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney 2052, Australia
- Malgorzata Lagisz
- Evolution & Ecology Research Centre and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney 2052, Australia
- Will Cornwell
- Evolution & Ecology Research Centre and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney 2052, Australia
- Shinichi Nakagawa
- Evolution & Ecology Research Centre and School of Biological, Earth and Environmental Sciences, University of New South Wales, Sydney 2052, Australia
6
Sterner B, Upham N, Gupta P, Powell C, Franz N. Wanted: Standards for FAIR taxonomic concept representations and relationships. Biodiversity Information Science and Standards 2021; 5. [PMID: 35462676 PMCID: PMC9028594 DOI: 10.3897/biss.5.75587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Abstract
Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones.
7
Folk RA, Siniscalchi CM. Biodiversity at the global scale: the synthesis continues. American Journal of Botany 2021; 108:912-924. [PMID: 34181762 DOI: 10.1002/ajb2.1694] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/04/2021] [Accepted: 04/14/2021] [Indexed: 06/13/2023]
Abstract
Traditionally, the generation and use of biodiversity data and their associated specimen objects have been primarily the purview of individuals and small research groups. While deposition of data and specimens in herbaria and other repositories has long been the norm, throughout most of their history, these resources have been accessible only to a small community of specialists. Through recent concerted efforts, primarily at the level of national and international governmental agencies over the last two decades, the pace of biodiversity data accumulation has accelerated, and a wider array of biodiversity scientists has gained access to this massive accumulation of resources, applying them to an ever-widening compass of research pursuits. We review how these new resources and increasing access to them are affecting the landscape of biodiversity research in plants today, focusing on new applications across evolution, ecology, and other fields that have been enabled specifically by the availability of these data and the global scope that was previously beyond the reach of individual investigators. We give an overview of recent advances organized along three lines: broad-scale analyses of distributional data and spatial information, phylogenetic research circumscribing large clades with comprehensive taxon sampling, and data sets derived from improved accessibility of biodiversity literature. We also review synergies between large data resources and more traditional data collection paradigms, describe shortfalls and how to overcome them, and reflect on the future of plant biodiversity analyses in light of increasing linkages between data types and scientists in our field.
Affiliation(s)
- Ryan A Folk
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
- Carolina M Siniscalchi
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
8
Bourgoin T, Bailly N, Zaragueta R, Vignes-Lebbe R. Complete formalization of taxa with their names, contents and descriptions improves taxonomic databases and access to the taxonomic knowledge they support. Syst Biodivers 2021. [DOI: 10.1080/14772000.2021.1915895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Affiliation(s)
- Thierry Bourgoin
- Muséum national d'Histoire naturelle, Institut Systématique, Évolution, Biodiversité (ISYEB), UMR 7205 MNHN-CNRS-Sorbonne Université-EPHE-Université des Antilles, Paris, 75005 France
- Nicolas Bailly
- Beaty Biodiversity Museum - Department of Zoology, University of British Columbia, Vancouver, Canada
- René Zaragueta
- Sorbonne Université, Muséum national d’Histoire naturelle, CNRS, EPHE, Université des Antilles, Institut de Systématique Évolution Biodiversité (ISYEB), Paris, 75005 France
- Régine Vignes-Lebbe
- Sorbonne Université, Muséum national d’Histoire naturelle, CNRS, EPHE, Université des Antilles, Institut de Systématique Évolution Biodiversité (ISYEB), Paris, 75005 France
9
Conti M, Nimis PL, Martellos S. Match Algorithms for Scientific Names in FlorItaly, the Portal to the Flora of Italy. Plants 2021; 10:974. [PMID: 34068389 PMCID: PMC8153551 DOI: 10.3390/plants10050974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2021] [Revised: 05/06/2021] [Accepted: 05/08/2021] [Indexed: 11/21/2022]
Abstract
Scientific names are not part of everyday language in any modern country, and their input as strings in a query system can be easily associated with typographical errors. While globally unique identifiers univocally address a taxon name, they can hardly be used for querying a database manually. Thus, matching algorithms are often used to overcome misspelled names in query systems in several data repositories worldwide. In order to improve users’ experience in the use of FlorItaly, the Portal to the Flora of Italy, a near match algorithm to resolve misspelled scientific names has been integrated in the query systems. In addition, a novel tool in FlorItaly, capable of rapidly aligning any list of names to the nomenclatural backbone provided by the national checklists, has been developed. This manuscript aims at describing the potential of these new tools.
10
Norman KEA, Chamberlain S, Boettiger C. taxadb: A high‐performance local taxonomic database interface. Methods Ecol Evol 2020. [DOI: 10.1111/2041-210x.13440] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
Affiliation(s)
- Kari E. A. Norman
- Department of Environmental Science, Policy, and Management, University of California Berkeley, Berkeley, CA, USA
- Scott Chamberlain
- The rOpenSci Project, University of California Berkeley, Berkeley, CA, USA
- Carl Boettiger
- Department of Environmental Science, Policy, and Management, University of California Berkeley, Berkeley, CA, USA
11
Campbell DL, Thessen AE, Ries L. A novel curation system to facilitate data integration across regional citizen science survey programs. PeerJ 2020; 8:e9219. [PMID: 32821528 PMCID: PMC7395600 DOI: 10.7717/peerj.9219] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/30/2019] [Accepted: 04/28/2020] [Indexed: 11/20/2022] Open
Abstract
Integrative modeling methods can now enable macrosystem-level understandings of biodiversity patterns, such as range changes resulting from shifts in climate or land use, by aggregating species-level data across multiple monitoring sources. This requires ensuring that taxon interpretations match up across different sources. While encouraging checklist standardization is certainly an option, coercing programs to change species lists they have used consistently for decades is rarely successful. Here we demonstrate a novel approach for tracking equivalent names and concepts, applied to a network of 10 regional programs that use the same protocols (so-called “Pollard walks”) to monitor butterflies across America north of Mexico. Our system involves, for each monitoring program, associating the taxonomic authority (in this case one of three North American butterfly fauna treatments: Pelham, 2014; North American Butterfly Association, Inc., 2016; Opler & Warren, 2003) that shares the most similar overall taxonomic interpretation to the program’s working species list. This allows us to define each term on each program’s list in the context of the appropriate authority’s species concept and curate the term alongside its authoritative concept. We then aligned the names representing equivalent taxonomic concepts among the three authorities. These stepping stones allow us to bridge a species concept from one program’s species list to the name of the equivalent in any other program, through the intermediary scaffolding of aligned authoritative taxon concepts. Using a software tool we developed to access our curation system, a user can link equivalent species concepts between data collecting agencies with no specialized knowledge of taxonomic complexities.
Affiliation(s)
- Dana L Campbell
- Division of Biological Sciences, School of STEM, University of Washington, Bothell, WA, USA
- Anne E Thessen
- The Ronin Institute for Independent Scholarship, Montclair, NJ, USA; Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
- Leslie Ries
- Department of Biology, Georgetown University, Washington, DC, USA
12
Walton S, Livermore L, Bánki O, Cubey R, Drinkwater R, Englund M, Goble C, Groom Q, Kermorvant C, Rey I, Santos C, Scott B, Williams A, Wu Z. Landscape Analysis for the Specimen Data Refinery. Research Ideas and Outcomes 2020. [DOI: 10.3897/rio.6.e57602] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022] Open
Abstract
This report reviews the current state-of-the-art applied approaches on automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems; and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development teams and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens.
13
Santos JW, Correia RA, Malhado ACM, Campos‐Silva JV, Teles D, Jepson P, Ladle RJ. Drivers of taxonomic bias in conservation research: a global analysis of terrestrial mammals. Anim Conserv 2020. [DOI: 10.1111/acv.12586] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Indexed: 01/20/2023]
Affiliation(s)
- J. W. Santos
- Institute of Biological Science and Health, Federal University of Alagoas, Maceió, Brazil
- R. A. Correia
- Institute of Biological Science and Health, Federal University of Alagoas, Maceió, Brazil
- Department of Geosciences and Geography, Helsinki Lab of Interdisciplinary Conservation Science, University of Helsinki, Helsinki, Finland
- Helsinki Institute for Sustainability Science, University of Helsinki, Helsinki, Finland
- A. C. M. Malhado
- Institute of Biological Science and Health, Federal University of Alagoas, Maceió, Brazil
- J. V. Campos‐Silva
- Institute of Biological Science and Health, Federal University of Alagoas, Maceió, Brazil
- Faculty of Environmental Sciences and Natural Resource Management, Norwegian University of Life Sciences, Ås, Norway
- D. Teles
- Institute of Biological Science and Health, Federal University of Alagoas, Maceió, Brazil
- R. J. Ladle
- Institute of Biological Science and Health, Federal University of Alagoas, Maceió, Brazil
14
Sterner B, Witteveen J, Franz N. Coordinating dissent as an alternative to consensus classification: insights from systematics for bio-ontologies. History and Philosophy of the Life Sciences 2020; 42:8. [PMID: 32030540 DOI: 10.1007/s40656-020-0300-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/10/2019] [Accepted: 01/17/2020] [Indexed: 06/10/2023]
Abstract
The collection and classification of data into meaningful categories is a key step in the process of knowledge making. In the life sciences, the design of data discovery and integration tools has relied on the premise that a formal classificatory system for expressing a body of data should be grounded in consensus definitions for classifications. On this approach, exemplified by the realist program of the Open Biomedical Ontologies Foundry, progress is maximized by grounding the representation and aggregation of data on settled knowledge. We argue that historical practices in systematic biology provide an important and overlooked alternative approach to classifying and disseminating data, based on a principle of coordinative rather than definitional consensus. Systematists have developed a robust system for referring to taxonomic entities that can deliver high quality data discovery and integration without invoking consensus about reality or "settled" science.
Affiliation(s)
- Beckett Sterner
- School of Life Sciences, Arizona State University, Tempe, USA
- Joeri Witteveen
- Department of Science Education, Section for History and Philosophy of Science, University of Copenhagen, Copenhagen, Denmark
- Nico Franz
- School of Life Sciences, Arizona State University, Tempe, USA
15
OpenBiodiv: A Knowledge Graph for Literature-Extracted Linked Open Data in Biodiversity Science. Publications 2019. [DOI: 10.3390/publications7020038] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/16/2022] Open
Abstract
Hundreds of years of biodiversity research have resulted in the accumulation of a substantial pool of communal knowledge; however, most of it is stored in silos isolated from each other, such as published articles or monographs. The need for a system to store and manage collective biodiversity knowledge in a community-agreed and interoperable open format has evolved into the concept of the Open Biodiversity Knowledge Management System (OBKMS). This paper presents OpenBiodiv: An OBKMS that utilizes semantic publishing workflows, text and data mining, common standards, ontology modelling and graph database technologies to establish a robust infrastructure for managing biodiversity knowledge. It is presented as a Linked Open Dataset generated from scientific literature. OpenBiodiv encompasses data extracted from more than 5000 scholarly articles published by Pensoft and many more taxonomic treatments extracted by Plazi from journals of other publishers. The data from both sources are converted to Resource Description Framework (RDF) and integrated in a graph database using the OpenBiodiv-O ontology and an RDF version of the Global Biodiversity Information Facility (GBIF) taxonomic backbone. Through the application of semantic technologies, the project showcases the value of open publishing of Findable, Accessible, Interoperable, Reusable (FAIR) data towards the establishment of open science practices in the biodiversity domain.
16
Stucky BJ, Balhoff JP, Barve N, Barve V, Brenskelle L, Brush MH, Dahlem GA, Gilbert JDJ, Kawahara AY, Keller O, Lucky A, Mayhew PJ, Plotkin D, Seltmann KC, Talamas E, Vaidya G, Walls R, Yoder M, Zhang G, Guralnick R. Developing a vocabulary and ontology for modeling insect natural history data: example data, use cases, and competency questions. Biodivers Data J 2019; 7:e33303. [PMID: 30918448 PMCID: PMC6426826 DOI: 10.3897/bdj.7.e33303] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/24/2019] [Accepted: 02/28/2019] [Indexed: 11/12/2022] Open
Abstract
Insects are possibly the most taxonomically and ecologically diverse class of multicellular organisms on Earth. Consequently, they provide nearly unlimited opportunities to develop and test ecological and evolutionary hypotheses. Currently, however, large-scale studies of insect ecology, behavior, and trait evolution are impeded by the difficulty in obtaining and analyzing data derived from natural history observations of insects. These data are typically highly heterogeneous and widely scattered among many sources, which makes developing robust information systems to aggregate and disseminate them a significant challenge. As a step towards this goal, we report initial results of a new effort to develop a standardized vocabulary and ontology for insect natural history data. In particular, we describe a new database of representative insect natural history data derived from multiple sources (but focused on data from specimens in biological collections), an analysis of the abstract conceptual areas required for a comprehensive ontology of insect natural history data, and a database of use cases and competency questions to guide the development of data systems for insect natural history data. We also discuss data modeling and technology-related challenges that must be overcome to implement robust integration of insect natural history data.
Collapse
Affiliation(s)
- Brian J. Stucky
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | - James P. Balhoff
- Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC, United States of America
| | - Narayani Barve
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | - Vijay Barve
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | - Laura Brenskelle
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | - Matthew H. Brush
- Oregon Health and Science University, Portland, OR, United States of America
| | - Gregory A Dahlem
- Department of Biological Sciences, Northern Kentucky University, Highland Heights, KY, United States of America
| | - James D. J. Gilbert
- Department of Biological and Marine Sciences, University of Hull, Hull, United Kingdom
| | - Akito Y. Kawahara
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- Entomology and Nematology Department, University of Florida, Gainesville, FL, United States of America
| | - Oliver Keller
- Entomology and Nematology Department, University of Florida, Gainesville, FL, United States of America
| | - Andrea Lucky
- Entomology and Nematology Department, University of Florida, Gainesville, FL, United States of America
| | - Peter J. Mayhew
- Department of Biology, University of York, York, United Kingdom
| | - David Plotkin
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | | | - Elijah Talamas
- Florida Department of Agriculture and Consumer Services, Gainesville, FL, United States of America
| | - Gaurav Vaidya
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | - Ramona Walls
- Bio5 and CyVerse, University of Arizona, Tucson, AZ, United States of America
| | - Matt Yoder
- Species File Group, Illinois Natural History Survey, University of Illinois, Champaign, IL, United States of America
| | - Guanyang Zhang
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | - Rob Guralnick
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
|
17
|
Franz NM, Musher LJ, Brown JW, Yu S, Ludäscher B. Verbalizing phylogenomic conflict: Representation of node congruence across competing reconstructions of the neoavian explosion. PLoS Comput Biol 2019; 15:e1006493. [PMID: 30768597 PMCID: PMC6395011 DOI: 10.1371/journal.pcbi.1006493] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/13/2017] [Revised: 02/28/2019] [Accepted: 09/10/2018] [Indexed: 11/24/2022]
Abstract
Phylogenomic research is accelerating the publication of landmark studies that aim to resolve deep divergences of major organismal groups. Meanwhile, systems for identifying and integrating the products of phylogenomic inference, such as newly supported clade concepts, have not kept pace. However, the ability to verbalize node concept congruence and conflict across multiple, in effect simultaneously endorsed phylogenomic hypotheses, is a prerequisite for building synthetic data environments for biological systematics and other domains impacted by these conflicting inferences. Here we develop a novel solution to the conflict verbalization challenge, based on a logic representation and reasoning approach that utilizes the language of Region Connection Calculus (RCC-5) to produce consistent alignments of node concepts endorsed by incongruent phylogenomic studies. The approach employs clade concept labels to individuate concepts used by each source, even if these carry identical names. Indirect RCC-5 modeling of intensional (property-based) node concept definitions, facilitated by the local relaxation of coverage constraints, allows parent concepts to attain congruence in spite of their differentially sampled children. To demonstrate the feasibility of this approach, we align two recent phylogenomic reconstructions of higher-level avian groups that entail strong conflict in the "neoavian explosion" region. According to our representations, this conflict is constituted by 26 instances of input "whole concept" overlap. These instances are further resolvable in the output labeling schemes and visualizations as "split concepts", which provide the labels and relations needed to build truly synthetic phylogenomic data environments.
Because the RCC-5 alignments fundamentally reflect the trained, logic-enabled judgments of systematic experts, future designs for such environments need to promote a culture where experts routinely assess the intensionalities of node concepts published by our peers, even and especially when we are not in agreement with each other.
Affiliation(s)
- Nico M. Franz
- School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
| | - Lukas J. Musher
- Richard Gilder Graduate School and Department of Ornithology, American Museum of Natural History, New York, New York, United States of America
| | - Joseph W. Brown
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, United Kingdom
| | - Shizhuo Yu
- Department of Computer Science, University of California at Davis, Davis, California, United States of America
| | - Bertram Ludäscher
- School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
|
18
|
Johnston MA, Aalbu RL, Franz NM. An updated checklist of the Tenebrionidae sec. Bousquet et al. 2018 of the Algodones Dunes of California, with comments on checklist data practices. Biodivers Data J 2018:e24927. [PMID: 29942173 PMCID: PMC6013544 DOI: 10.3897/bdj.6.e24927] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/09/2018] [Accepted: 06/11/2018] [Indexed: 11/12/2022]
Abstract
Generating regional checklists for insects is frequently based on combining data sources ranging from literature and expert assertions that merely imply the existence of an occurrence to aggregated, standard-compliant data of uniquely identified specimens. The increasing diversity of data sources also means that checklist authors are faced with new responsibilities, effectively acting as filterers to select and utilize an expert-validated subset of all available data. Authors are also faced with the technical obstacle to bring more occurrences into Darwin Core-based data aggregation, even if the corresponding specimens belong to external institutions. We illustrate these issues based on a partial update of the Kimsey et al. 2017 checklist of darkling beetles - Tenebrionidae sec. Bousquet et al. 2018 - inhabiting the Algodones Dunes of California. Our update entails 54 species-level concepts for this group and region, of which 31 concepts were found to be represented in three specimen-data aggregator portals, based on our interpretations of the aggregators' data. We reassess the distributions and biogeographic affinities of these species, focusing on taxa that are precinctive (highly geographically restricted) to the Lower Colorado River Valley in the context of recent dune formation from the Colorado River. Throughout, we apply taxonomic concept labels (taxonomic name according to source) to contextualize preferred name usages, but also show that the identification data of aggregated occurrences are very rarely well-contextualized or annotated. Doing so is a pre-requisite for publishing open, dynamic checklist versions that finely accredit incremental expert efforts spent to improve the quality of checklists and aggregated occurrence data.
Affiliation(s)
- M Andrew Johnston
- Biodiversity Knowledge Integration Center, Arizona State University, Tempe, AZ, United States of America
| | - Rolf L Aalbu
- California Academy of Sciences, San Francisco, CA, United States of America
| | - Nico M Franz
- Biodiversity Knowledge Integration Center, Arizona State University, Tempe, AZ, United States of America
|
19
|
Franz NM, Zhang C, Lee J. A logic approach to modelling nomenclatural change. Cladistics 2018; 34:336-357. [PMID: 34645079 DOI: 10.1111/cla.12201] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Accepted: 03/10/2017] [Indexed: 11/27/2022]
Abstract
We utilize an Answer Set Programming (ASP) approach to show that the principles of nomenclature are tractable in computational logic. To this end we design a hypothetical, 20 nomenclatural taxon use case, with starting conditions that embody several overarching principles of the International Code of Zoological Nomenclature, including Binomial Nomenclature, Priority, Coordination, Homonymy, Typification and the structural requirement of Gender Agreement. The use case ending conditions are triggered by the reinterpretation of the diagnostic features of one of 12 type specimens anchoring the corresponding species-level epithets. Permutations of this child-to-parent reassignment action lead to 36 alternative scenarios, where each scenario requires a set of 1-14 logically contingent nomenclatural emendations. We show that an ASP transition system approach can correctly infer the Code-mandated changes for each scenario, and visually output the ending conditions. The results provide a foundation for further developing logic-based nomenclatural change optimization and validation services, which could be applied in global nomenclatural registries. More generally, logic explorations of nomenclatural and taxonomic change scenarios provide a novel means of assessing design biases inherent in the principles of nomenclature, and can therefore inform the design of future, big data-compatible identifier systems that recognize and mitigate these constraints.
Affiliation(s)
- Nico M Franz
- School of Life Sciences, Arizona State University, PO Box 874501, Tempe, AZ, 85287-4501, USA
| | - Chao Zhang
- School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, PO Box 878809, Tempe, AZ, 85287-8809, USA
| | - Joohyung Lee
- School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, PO Box 878809, Tempe, AZ, 85287-8809, USA
|
20
|
Vaidya G, Lepage D, Guralnick R. The tempo and mode of the taxonomic correction process: How taxonomists have corrected and recorrected North American bird species over the last 127 years. PLoS One 2018; 13:e0195736. [PMID: 29672539 PMCID: PMC5909608 DOI: 10.1371/journal.pone.0195736] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/06/2017] [Accepted: 03/28/2018] [Indexed: 11/19/2022]
Abstract
While studies of taxonomy usually focus on species description, there is also a taxonomic correction process that retests and updates existing species circumscriptions on the basis of new evidence. These corrections may themselves be subsequently retested and recorrected. We studied this correction process by using the Check-List of North and Middle American Birds, a well-known taxonomic checklist that spans 130 years. We identified 142 lumps and 95 splits across sixty-three versions of the Check-List and found that while lumping rates have markedly decreased since the 1970s, splitting rates are accelerating. We found that 74% of North American bird species recognized today have never been corrected (i.e., lumped or split) over the period of the checklist, while 16% have been corrected exactly once and 10% have been corrected twice or more. Since North American bird species are known to have been extensively lumped in the first half of the 20th century with the advent of the biological species concept, we determined whether most splits seen today were the result of those lumps being recorrected. We found that 5% of lumps and 23% of splits fully reverted previous corrections, while a further 3% of lumps and 13% of splits are partial reversions. These results show a taxonomic correction process with moderate levels of recorrection, particularly of previous lumps. However, 81% of corrections do not revert any previous corrections, suggesting that the majority result in novel circumscriptions not previously recognized by the Check-List. We could find no order or family with a significantly higher rate of correction than any other, but twenty-two genera as currently recognized by the AOU do have significantly higher rates than others. 
Given the currently accelerating rate of splitting, prediction of the end-point of the taxonomic recorrection process is difficult, and many entirely new taxonomic concepts are still being, and likely will continue to be, proposed and further tested.
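The reversion figures in this abstract can be cross-checked with a little arithmetic. The sketch below assumes the rounded percentages apply directly to the 142 lumps and 95 splits reported above; the variable names are illustrative and not taken from the paper.

```python
# Counts reported in the abstract.
lumps, splits = 142, 95

# Rounded shares from the abstract: full plus partial reversions.
lump_reversion_share = 0.05 + 0.03
split_reversion_share = 0.23 + 0.13

# Assumption: the percentages apply to the 142 lumps and 95 splits above.
reverting = lumps * lump_reversion_share + splits * split_reversion_share
novel_share = 1 - reverting / (lumps + splits)

print(f"about {reverting:.1f} of {lumps + splits} corrections revert an earlier one")
print(f"share that revert nothing: {novel_share:.0%}")
```

Under these assumptions the share of corrections that revert nothing comes out close to the 81% stated in the abstract, so the reported figures are mutually consistent up to rounding.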
Affiliation(s)
- Gaurav Vaidya
- Department of Ecology and Evolutionary Biology, University of Colorado Boulder, Boulder, Colorado, United States of America
| | - Denis Lepage
- Bird Studies Canada, Port Rowan, Ontario, Canada
| | - Robert Guralnick
- Department of Natural History and the Florida Museum of Natural History, University of Florida, Gainesville, Florida, United States of America
|
21
|
Peterson KJ, Jiang G, Brue SM, Shen F, Liu H. Mining Hierarchies and Similarity Clusters from Value Set Repositories. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2017:1372-1381. [PMID: 29854206 PMCID: PMC5977603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
A value set is a collection of permissible values used to describe a specific conceptual domain for a given purpose. By helping to establish a shared semantic understanding across use cases, these artifacts are important enablers of interoperability and data standardization. As the size of repositories cataloging these value sets expands, knowledge management challenges become more pronounced. Specifically, discovering value sets applicable to a given use case may be challenging in a large repository. In this study, we describe methods to extract implicit relationships between value sets, and utilize these relationships to overlay organizational structure onto value set repositories. We successfully extract two different structurings, hierarchy and clustering, and show how tooling can leverage these structures to enable more effective value set discovery.
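The abstract does not say how the hierarchy and clustering structures are extracted. As a rough illustration only, the sketch below assumes that hierarchy is read off proper-subset containment and that similarity is measured with the Jaccard index. The function names, value set names, and codes are all hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def subset_edges(value_sets: dict) -> list:
    """(child, parent) pairs where the child's codes are a proper subset of the parent's."""
    return sorted(
        (child, parent)
        for child in value_sets
        for parent in value_sets
        if value_sets[child] < value_sets[parent]
    )


def similar_pairs(value_sets: dict, threshold: float) -> list:
    """Pairs of value set names whose Jaccard similarity meets the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(value_sets), 2)
        if jaccard(value_sets[x], value_sets[y]) >= threshold
    ]


# Hypothetical value sets, invented for illustration.
value_sets = {
    "diabetes_all": {"E10", "E11", "E13"},
    "diabetes_common": {"E10", "E11"},
    "diabetes_type2": {"E11"},
}

edges = subset_edges(value_sets)
pairs = similar_pairs(value_sets, threshold=0.5)
```

Subset edges give a browsable parent/child structure, while thresholded similarity pairs could feed any standard clustering step. Either choice may differ from what the authors actually implemented.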
Affiliation(s)
- Kevin J Peterson
- Division of Information Management and Analytics, Mayo Clinic, Rochester, MN
| | - Guoqian Jiang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
| | - Scott M Brue
- Division of Information Management and Analytics, Mayo Clinic, Rochester, MN
| | - Feichen Shen
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
| | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
|
22
|
Senderov V, Simov K, Franz N, Stoev P, Catapano T, Agosti D, Sautter G, Morris RA, Penev L. OpenBiodiv-O: ontology of the OpenBiodiv knowledge management system. J Biomed Semantics 2018; 9:5. [PMID: 29347997 PMCID: PMC5774086 DOI: 10.1186/s13326-017-0174-5] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 08/18/2017] [Accepted: 12/28/2017] [Indexed: 11/16/2022]
Abstract
BACKGROUND The biodiversity domain, and in particular biological taxonomy, is moving in the direction of semantization of its research outputs. The present work introduces OpenBiodiv-O, the ontology that serves as the basis of the OpenBiodiv Knowledge Management System. Our intent is to provide an ontology that fills the gaps between ontologies for biodiversity resources, such as DarwinCore-based ontologies, and semantic publishing ontologies, such as the SPAR Ontologies. We bridge this gap by providing an ontology focusing on biological taxonomy. RESULTS OpenBiodiv-O introduces classes, properties, and axioms in the domains of scholarly biodiversity publishing and biological taxonomy and aligns them with several important domain ontologies (FaBiO, DoCO, DwC, Darwin-SW, NOMEN, ENVO). By doing so, it bridges the ontological gap across scholarly biodiversity publishing and biological taxonomy and allows for the creation of a Linked Open Dataset (LOD) of biodiversity information (a biodiversity knowledge graph) and enables the creation of the OpenBiodiv Knowledge Management System. A key feature of the ontology is that it is an ontology of the scientific process of biological taxonomy and not of any particular state of knowledge. This feature allows it to express a multiplicity of scientific opinions. The resulting OpenBiodiv knowledge system may gain a high level of trust in the scientific community as it does not force a scientific opinion on its users (e.g. practicing taxonomists, library researchers, etc.), but rather provides the tools for experts to encode different views as science progresses. CONCLUSIONS OpenBiodiv-O provides a conceptual model of the structure of a biodiversity publication and the development of related taxonomic concepts. It also serves as the basis for the OpenBiodiv Knowledge Management System.
Affiliation(s)
- Viktor Senderov
- Pensoft Publishers, Prof. Georgi Zlatarski 12, Sofia, 1700 Bulgaria
- Institute of Biodiversity and Ecosystems Research, Bulgarian Academy of Sciences, Sofia, Bulgaria
| | - Kiril Simov
- Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
| | - Nico Franz
- Arizona State University, School of Life Sciences, Tempe Campus, Tempe, 4501 AZ USA
| | - Pavel Stoev
- Pensoft Publishers, Prof. Georgi Zlatarski 12, Sofia, 1700 Bulgaria
- National Museum of Natural History, 1 Tsar Osvoboditel Blvd., Sofia, 1000 Bulgaria
| | | | | | | | | | - Lyubomir Penev
- Pensoft Publishers, Prof. Georgi Zlatarski 12, Sofia, 1700 Bulgaria
- Institute of Biodiversity and Ecosystems Research, Bulgarian Academy of Sciences, Sofia, Bulgaria
|
23
|
|
24
|
Franz N, Gilbert E, Ludäscher B, Weakley A. Controlling the taxonomic variable: Taxonomic concept resolution for a southeastern United States herbarium portal. RESEARCH IDEAS AND OUTCOMES 2016. [DOI: 10.3897/rio.2.e10610] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
Overview. Taxonomic names are imperfect identifiers of specific and sometimes conflicting taxonomic perspectives in aggregated biodiversity data environments. The inherent ambiguities of names can be mitigated using syntactic and semantic conventions developed under the taxonomic concept approach. These include: (1) representation of taxonomic concept labels (TCLs: name sec. source) to precisely identify name usages and meanings, (2) use of parent/child relationships to assemble separate taxonomic perspectives, and (3) expert provision of Region Connection Calculus articulations (RCC–5: congruence, [inverse] inclusion, overlap, exclusion) that specify how data identified to different-sourced TCLs can be integrated. Application of these conventions greatly increases trust in biodiversity data networks, most of which promote unitary taxonomic 'syntheses' that obscure the actual diversity of expert-held views. Better design solutions allow users to control the taxonomic variable and thereby assess the robustness of their biological inferences under different perspectives. A unique constellation of prior efforts – including the powerful Symbiota collections software platform, the Euler/X multi-taxonomy alignment toolkit, and the "Weakley Flora" which entails 7,000 concepts and more than 75,000 RCC–5 articulations – provides the opportunity to build a first full-scale concept resolution service for SERNEC, the SouthEast Regional Network of Expertise and Collections, currently with 60 member herbaria and 2 million occurrence records.
Intellectual merit. We have developed a multi-dimensional, step-wise plan to transition SERNEC's data culture from name- to concept-based practices. (1) We will engage SERNEC experts through annual, regional workshops and follow-up interactions that will foster buy-in and ultimately the completion of 12 community-identified use cases. (2) We will leverage RCC–5 data from the Weakley Flora and further development of the Euler/X logic reasoning toolkit to provide comprehensive genus- to variety-level concept alignments for at least 10 major flora treatments with highest relevance to SERNEC. The visualizations and estimated > 1 billion inferred concept-to-concept relations will effectively drive specimen data integration in the transformed portal. (3) We will expand Symbiota's taxonomy and occurrence schemas and related user interfaces to support the new concept data, including novel batch and map-based specimen determination modules, with easy output options in Darwin Core Archive format. (4) Through combinations of the new technology, enlisted taxonomic expertise, and SERNEC's large image resources, we will upgrade minimally 80% of all SERNEC specimen identifications from names to the narrowest suitable TCLs, or add "uncertainty" flags to specimens needing further study. (5) We will utilize the novel tools and data to demonstrate how controlling for the taxonomic variable in 12 use cases variously drives the outcomes of evolutionary, ecological, and conservation-based research hypotheses.
Broader impacts. Our project is focused on just one herbarium network, but the potential impact is as wide as Darwin Core or even comparative biology. We believe that trust in networked biodiversity data depends on open and dynamic system designs, allowing expert access and resolution of multiple conflicting views that reflect the complex realities of ongoing taxonomic research. Taking well over 1 million SERNEC records from name- to TCL-resolution will show that "big" specimen data can pass the credibility threshold needed to validate the substantive data mobilization investment. We will mentor one postdoctoral researcher (UNC), two Ph.D. students (ASU, UIUC), and at least 15 undergraduate students (ASU). Each of our workshops will capacitate 10-15 SERNEC experts, who in turn can recruit colleagues and students at their home collections. We will incorporate the project theme and use cases into undergraduate courses taught at six institutions and reaching an estimated 300-500 students annually (10-40% minority students). At each institution, project members will make a systematic effort to recruit new students from underrepresented groups. Our group's leadership of Symbiota (with close ties to iDigBio), SERNEC, and local biodiversity projects and centers will further promote the new data culture. We will create a feature story "Where do plant species occur?" for ASU's popular "Ask A Biologist" website, and a series of undergraduate student-led "How-To" videos that illustrate the use case workflows, including the creation of multi-taxonomy alignments.
|
25
|
Franz NM, Pier NM, Reeder DM, Chen M, Yu S, Kianmajd P, Bowers S, Ludäscher B. Two Influential Primate Classifications Logically Aligned. Syst Biol 2016; 65:561-82. [PMID: 27009895 PMCID: PMC4911943 DOI: 10.1093/sysbio/syw023] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 02/16/2015] [Revised: 03/11/2016] [Accepted: 03/17/2016] [Indexed: 01/02/2023]
Abstract
Classifications and phylogenies of perceived natural entities change in the light of new evidence. Taxonomic changes, translated into Code-compliant names, frequently lead to name:meaning dissociations across succeeding treatments. Classification standards such as the Mammal Species of the World (MSW) may experience significant levels of taxonomic change from one edition to the next, with potential costs to long-term, large-scale information integration. This circumstance challenges the biodiversity and phylogenetic data communities to express taxonomic congruence and incongruence in ways that both humans and machines can process, that is, to logically represent taxonomic alignments across multiple classifications. We demonstrate that such alignments are feasible for two classifications of primates corresponding to the second and third MSW editions. Our approach has three main components: (i) use of taxonomic concept labels, that is name sec. author (where sec. means according to), to assemble each concept hierarchy separately via parent/child relationships; (ii) articulation of select concepts across the two hierarchies with user-provided Region Connection Calculus (RCC-5) relationships; and (iii) the use of an Answer Set Programming toolkit to infer and visualize logically consistent alignments of these input constraints. Our use case entails the Primates sec. Groves (1993; MSW2: 317 taxonomic concepts; 233 at the species level) and Primates sec. Groves (2005; MSW3: 483 taxonomic concepts; 376 at the species level). Using 402 RCC-5 input articulations, the reasoning process yields a single, consistent alignment and 153,111 Maximally Informative Relations that constitute a comprehensive meaning resolution map for every concept pair in the Primates sec. MSW2/MSW3.
The complete alignment, and various partitions thereof, facilitate quantitative analyses of name:meaning dissociation, revealing that nearly one in three taxonomic names are not reliable across treatments, in the sense of the same name identifying congruent taxonomic meanings. The RCC-5 alignment approach is potentially widely applicable in systematics and can achieve scalable, precise resolution of semantically evolving name usages in synthetic, next-generation biodiversity, and phylogeny data platforms.
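To make the five RCC-5 relations and the "name sec. author" labels concrete, here is a minimal sketch. It assumes each taxonomic concept can be approximated by the set of entities it contains. In the study itself the articulations are provided by experts and reasoned over with Answer Set Programming, so this set comparison is only an illustration. The class names, symbols, and example taxa are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class RCC5(Enum):
    CONGRUENT = "=="
    INCLUDES = ">"       # first concept properly includes the second
    INCLUDED_IN = "<"    # first concept is properly included in the second
    OVERLAPS = "><"
    EXCLUDES = "!"


@dataclass(frozen=True)
class TaxonConcept:
    name: str
    source: str
    members: frozenset

    @property
    def label(self) -> str:
        """Taxonomic concept label: name sec. source."""
        return f"{self.name} sec. {self.source}"


def rcc5(a: TaxonConcept, b: TaxonConcept) -> RCC5:
    """Classify the relation between two concepts from their member sets."""
    if a.members == b.members:
        return RCC5.CONGRUENT
    if a.members > b.members:
        return RCC5.INCLUDES
    if a.members < b.members:
        return RCC5.INCLUDED_IN
    if a.members & b.members:
        return RCC5.OVERLAPS
    return RCC5.EXCLUDES


# Hypothetical concepts: the same name used with different meanings.
old = TaxonConcept("Aus", "Smith 1993", frozenset({"sp1", "sp2"}))
new = TaxonConcept("Aus", "Smith 2005", frozenset({"sp1", "sp2", "sp3"}))

relation = rcc5(old, new)
```

Here the two concepts share the name "Aus" but are not congruent, which is the kind of name:meaning dissociation the abstract quantifies.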
Affiliation(s)
- Nico M Franz
- School of Life Sciences, PO Box 874501, Arizona State University, Tempe, AZ 85287, USA;
| | - Naomi M Pier
- School of Life Sciences, PO Box 874501, Arizona State University, Tempe, AZ 85287, USA
| | - Deeann M Reeder
- Department of Biology, Bucknell University, 1 Dent Drive, Lewisburg, PA 17837, USA
| | - Mingmin Chen
- Department of Computer Science, 2063 Kemper Hall, 1 Shields Avenue, University of California at Davis, CA 95616, USA
| | - Shizhuo Yu
- Department of Computer Science, 2063 Kemper Hall, 1 Shields Avenue, University of California at Davis, CA 95616, USA
| | - Parisa Kianmajd
- Department of Computer Science, 2063 Kemper Hall, 1 Shields Avenue, University of California at Davis, CA 95616, USA
| | - Shawn Bowers
- Department of Computer Science, 502 East Boone Avenue, AD Box 26, Gonzaga University, Spokane, WA 99258, USA
| | - Bertram Ludäscher
- Graduate School of Library and Information Science, 510 East Daniel Street, University of Illinois at Urbana-Champaign, Champaign, IL 61820
|
26
|
Patterson D, Mozzherin D, Shorthouse DP, Thessen A. Challenges with using names to link digital biodiversity information. Biodivers Data J 2016; 4:e8080. [PMID: 27346955 PMCID: PMC4910497 DOI: 10.3897/bdj.4.e8080] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 02/09/2016] [Accepted: 05/19/2016] [Indexed: 01/05/2023]
Affiliation(s)
| | - Dmitry Mozzherin
- Illinois Natural History Survey, Champaign, IL, United States of America
| | | | - Anne Thessen
- The Data Detektive, Waltham, United States of America
- The Ronin Institute for Independent Scholarship, Montclair, United States of America
|
27
|
Pilsk SC, Kalfatovic MR, Richard JM. Unlocking Index Animalium: From paper slips to bytes and bits. Zookeys 2016:153-71. [PMID: 26877657 PMCID: PMC4741219 DOI: 10.3897/zookeys.550.9673] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/23/2015] [Accepted: 03/25/2015] [Indexed: 11/28/2022]
Abstract
In 1996 Smithsonian Libraries (SIL) embarked on the digitization of its collections. By 1999, a full-scale digitization center was in place and rare volumes from the natural history collections, often of high illustrative value, were the focus for the first years of the program. The resulting beautiful books made available for online display were successful to a certain extent, but it soon became clear that the data locked within the texts needed to be converted to a more usable and re-purposable form via digitization methods that went beyond simple page imaging and included text conversion elements. Library staff met with researchers from the taxonomic community to understand their path to the literature and identified tools (indexes and bibliographies) used to connect to the library holdings. The traditional library metadata describing the titles, which made them easily retrievable from the shelves of libraries, was not meeting the needs of the researcher looking for more detailed and granular data within the texts. The result was to identify proper print tools that could potentially assist researchers in digital form. This paper outlines the project undertaken to convert Charles Davies Sherborn’s Index Animalium into a tool to connect researchers to the library holdings: from a print index to a database to eventually a dataset. Sherborn’s microcitation of a species name and his bibliographies help bridge the gap between taxonomist and literature holdings of libraries. In 2004, SIL received funding from the Smithsonian’s Atherton Seidell Endowment to create an online version of Sherborn’s Index Animalium. The initial project was to digitize the page images and re-key the data into a simple data structure. As the project evolved, a more complex database was developed which enabled quality field searching to retrieve species names and to search the bibliography.
Problems with inconsistent abbreviations and styling of his bibliographies made the parsing of the data difficult. Coinciding with the development of the Biodiversity Heritage Library (BHL) in 2005, it became obvious there was a need to integrate the database-converted Index Animalium, BHL’s scanned taxonomic literature, and taxonomic intelligence (the algorithmic identification of binomial, Latinate name-strings). The challenges of working with legacy taxonomic citation, computer matching algorithms, and making connections have brought us to today’s goal of making Sherborn available and linked to other datasets. In partnership with others to allow machine-to-machine communications, the data is being examined for possible transformation into RDF markup and meeting the standards of Linked Open Data. SIL staff have partnered with Thomson Reuters and the Global Names Initiative to further enhance the Index Animalium data set. Thomson Reuters’ staff is now working on integrating the species microcitation and species name in the ION: Index to Organism Names project; Richard Pyle (The Bishop Museum) is also working on further parsing of the text. The Index Animalium collaborative project’s ultimate goal is to have researchers go seamlessly from the species name in either ION or the scanned pages of Index Animalium to the digitized original description in BHL, connecting taxonomic researchers to original authored species descriptions with just a click.
|