1
Folk RA, Guralnick RP, LaFrance RT. FloraTraiter: Automated parsing of traits from descriptive biodiversity literature. APPLICATIONS IN PLANT SCIENCES 2024; 12:e11563. [PMID: 38369975 PMCID: PMC10873814 DOI: 10.1002/aps3.11563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 09/13/2023] [Accepted: 10/01/2023] [Indexed: 02/20/2024]
Abstract
Premise: Plant trait data are essential for quantifying biodiversity and function across Earth, but these data are challenging to acquire for large studies. Diverse strategies are needed, including the liberation of heritage data locked within specialist literature such as floras and taxonomic monographs. Here we report FloraTraiter, a novel approach using rule-based natural language processing (NLP) to parse computable trait data from biodiversity literature. Methods: FloraTraiter was implemented through collaborative work between programmers and botanical experts and customized for both online floras and scanned literature. We report a strategy spanning optical character recognition, recognition of taxa, iterative building of traits, and establishing linkages among all of these, as well as curational tools and code for turning these results into standard morphological matrices. Results: Over 95% of treatment content was successfully parsed for traits with <1% error. Data for more than 700 taxa are reported, including a demonstration of common downstream uses. Conclusions: We identify strategies, applications, tips, and challenges that we hope will facilitate future similar efforts to produce large open-source trait data sets for broad community reuse. Largely automated tools like FloraTraiter will be an important addition to the toolkit for assembling trait data at scale.
Affiliation(s)
- Ryan A. Folk
  - Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
- Robert P. Guralnick
  - Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
  - Biodiversity Institute, University of Florida, Gainesville, Florida, USA
2
Campbell DL, Thessen AE, Ries L. A novel curation system to facilitate data integration across regional citizen science survey programs. PeerJ 2020; 8:e9219. [PMID: 32821528 PMCID: PMC7395600 DOI: 10.7717/peerj.9219] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/30/2019] [Accepted: 04/28/2020] [Indexed: 11/20/2022] Open
Abstract
Integrative modeling methods can now enable macrosystem-level understandings of biodiversity patterns, such as range changes resulting from shifts in climate or land use, by aggregating species-level data across multiple monitoring sources. This requires ensuring that taxon interpretations match up across different sources. While encouraging checklist standardization is certainly an option, coercing programs to change species lists they have used consistently for decades is rarely successful. Here we demonstrate a novel approach for tracking equivalent names and concepts, applied to a network of 10 regional programs that use the same protocols (so-called “Pollard walks”) to monitor butterflies across America north of Mexico. Our system involves, for each monitoring program, associating the taxonomic authority (in this case one of three North American butterfly fauna treatments: Pelham, 2014; North American Butterfly Association, Inc., 2016; Opler & Warren, 2003) that shares the most similar overall taxonomic interpretation to the program’s working species list. This allows us to define each term on each program’s list in the context of the appropriate authority’s species concept and curate the term alongside its authoritative concept. We then aligned the names representing equivalent taxonomic concepts among the three authorities. These stepping stones allow us to bridge a species concept from one program’s species list to the name of the equivalent in any other program, through the intermediary scaffolding of aligned authoritative taxon concepts. Using a software tool we developed to access our curation system, a user can link equivalent species concepts between data collecting agencies with no specialized knowledge of taxonomic complexities.
Affiliation(s)
- Dana L Campbell
  - Division of Biological Sciences, School of STEM, University of Washington, Bothell, WA, USA
- Anne E Thessen
  - The Ronin Institute for Independent Scholarship, Montclair, NJ, USA
  - Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
- Leslie Ries
  - Department of Biology, Georgetown University, Washington, DC, USA
3
Walton S, Livermore L, Bánki O, Cubey R, Drinkwater R, Englund M, Goble C, Groom Q, Kermorvant C, Rey I, Santos C, Scott B, Williams A, Wu Z. Landscape Analysis for the Specimen Data Refinery. RESEARCH IDEAS AND OUTCOMES 2020. [DOI: 10.3897/rio.6.e57602] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022] Open
Abstract
This report reviews the current state-of-the-art applied approaches on automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems; and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development teams and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens.
4
The Spider Anatomy Ontology (SPD)—A Versatile Tool to Link Anatomy with Cross-Disciplinary Data. DIVERSITY 2019. [DOI: 10.3390/d11100202] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/20/2023]
Abstract
Spiders are a diverse group with a high eco-morphological diversity, which complicates anatomical descriptions, especially with regard to terminology. New terms are constantly proposed, and definitions and limits of anatomical concepts are regularly updated. Therefore, it is often challenging to find the correct terms, even for trained scientists, especially when the terminology has obstacles such as synonyms, disputed definitions, ambiguities, or homonyms. Here, we present the Spider Anatomy Ontology (SPD), which we developed by combining the functionality of a glossary (a controlled defined vocabulary) with a network of formalized relations between terms that can be used to compute inferences. The SPD follows the guidelines of the Open Biomedical Ontologies and is available through the NCBO BioPortal (ver. 1.1). It consists of 757 valid terms and definitions, is rooted with the Common Anatomy Reference Ontology (CARO), and has cross references to other ontologies, especially of arthropods. The SPD offers a wealth of anatomical knowledge that can be used as a resource for any scientific study, for example, to link images to phylogenetic datasets, compute structural complexity over phylogenies, and produce ancestral ontologies. By using a common reference in a standardized way, the SPD will help bridge diverse disciplines, such as genomics, taxonomy, systematics, evolution, ecology, and behavior.
5
Endara L, Thessen AE, Cole HA, Walls R, Gkoutos G, Cao Y, Chong SS, Cui H. Modifier Ontologies for frequency, certainty, degree, and coverage phenotype modifier. Biodivers Data J 2018; 6:e29232. [PMID: 30532623 PMCID: PMC6281706 DOI: 10.3897/bdj.6.e29232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2018] [Accepted: 11/20/2018] [Indexed: 11/21/2022] Open
Abstract
Background: When phenotypic characters are described in the literature, they may be constrained or clarified with additional information such as the location or degree of expression; these terms are called "modifiers". With effort underway to convert narrative character descriptions to computable data, ontologies for such modifiers are needed. Such ontologies can also be used to guide term usage in future publications. Spatial and method modifiers are the subjects of ontologies that already have been developed or are under development. In this work, frequency (e.g., rarely, usually), certainty (e.g., probably, definitely), degree (e.g., slightly, extremely), and coverage modifiers (e.g., sparsely, entirely) are collected, reviewed, and used to create two modifier ontologies with different design considerations. The basic goal is to express the sequential relationships within a type of modifier, for example, usually is more frequent than rarely, in order to allow data annotated with ontology terms to be classified accordingly. Method: Two designs are proposed for the ontology, both using the list pattern: a closed ordered list (i.e., five-bin design) and an open ordered list design. The five-bin design puts the modifier terms into a set of 5 fixed bins with interval object properties, for example, one_level_more/less_frequently_than, where new terms can only be added as synonyms to existing classes. The open list approach starts with 5 bins, but supports the extensibility of the list via ordinal properties, for example, more/less_frequently_than, allowing new terms to be inserted as a new class anywhere in the list. The consequences of the different design decisions are discussed in the paper. CharaParser was used to extract modifiers from plant, ant, and other taxonomic descriptions. After a manual screening, 130 modifier words were selected as the candidate terms for the modifier ontologies. 
Four curators/experts (three biologists and one information scientist specialized in biosemantics) reviewed and categorized the terms into 20 bins using the Ontology Term Organizer (OTO) (http://biosemantics.arizona.edu/OTO). Inter-curator variations were reviewed and expressed in the final ontologies. Results: Frequency, certainty, degree, and coverage terms with complete agreement among all curators were used as class labels or exact synonyms. Terms with different interpretations were either excluded or included using "broader synonym" or "not recommended" annotation properties. These annotations explicitly allow for the user to be aware of the semantic ambiguity associated with the terms and whether they should be used with caution or avoided. Expert categorization results showed that 16 out of 20 bins contained terms with full agreements, suggesting differentiating the modifiers into 5 levels/bins balances the need to differentiate modifiers and the need for the ontology to reflect user consensus. Two ontologies, developed using the Protege ontology editor, are made available as OWL files and can be downloaded from https://github.com/biosemantics/ontologies. Contribution: We built the first two modifier ontologies following a consensus-based approach with terms commonly used in taxonomic literature. The five-bin ontology has been used in the Explorer of Taxon Concepts web toolkit to compute the similarity between characters extracted from literature to facilitate taxon concepts alignments. The two ontologies will also be used in an ontology-informed authoring tool for taxonomists to facilitate consistency in modifier term usage.
Affiliation(s)
- Lorena Endara
  - University of Florida, Gainesville, United States of America
- Anne E Thessen
  - The Ronin Institute for Independent Scholarship, Montclair, NJ, United States of America
- Heather A Cole
  - Science and Technology Branch, Agriculture and Agri-Food Canada, Government of Canada, Ottawa, Canada
- Ramona Walls
  - CyVerse, Tucson, United States of America
- Georgios Gkoutos
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, United Kingdom
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, United Kingdom
- Yujie Cao
  - Center for Studies of Information Resources, Wuhan University, Wuhan, China
- Steven S. Chong
  - National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, Santa Barbara, United States of America
  - University of Arizona, Tucson, United States of America
- Hong Cui
  - University of Arizona, Tucson, United States of America
6
Cui H, Macklin JA, Sachs J, Reznicek A, Starr J, Ford B, Penev L, Chen HL. Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production. Biodivers Data J 2018; 6:e29616. [PMID: 30473620 PMCID: PMC6235995 DOI: 10.3897/bdj.6.e29616] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 09/07/2018] [Accepted: 10/23/2018] [Indexed: 01/17/2023] Open
Abstract
Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of this phenotypic data is published in free-text narratives that are not computable. This means that the complex relationship between the genome, the environment and phenotypes is largely inaccessible to analysis and important questions related to the evolution of organisms, their diseases or their response to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution to produce computable phenotypic data for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future and, most importantly, 3) It does not solve the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other). Our empirical studies have shown that inter-curator variation is as high as 40% even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof of concept project on this idea was funded by NSF ABI in July 2017. 
We seek readers' input or critique of the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.
Affiliation(s)
- Hong Cui
  - University of Arizona, Tucson, United States of America
- James A. Macklin
  - Agriculture and Agri-Food Canada, Ottawa, Canada
- Joel Sachs
  - Agriculture and Agri-Food Canada, Ottawa, Canada
- Anton Reznicek
  - University of Michigan, Ann Arbor, United States of America
- Julian Starr
  - University of Ottawa, Ottawa, Canada
- Bruce Ford
  - University of Manitoba, Winnipeg, Canada
- Lyubomir Penev
  - Pensoft Publishers & Bulgarian Academy of Sciences, Sofia, Bulgaria
- Hsin-Liang Chen
  - University of Massachusetts at Boston, Boston, United States of America
7
Xu D, Chong SS, Rodenhausen T, Cui H. Resolving "orphaned" non-specific structures using machine learning and natural language processing methods. Biodivers Data J 2018:e26659. [PMID: 30393454 PMCID: PMC6207837 DOI: 10.3897/bdj.6.e26659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2018] [Accepted: 07/27/2018] [Indexed: 11/12/2022] Open
Abstract
Scholarly publications of biodiversity literature contain a vast amount of information in human readable format. The detailed morphological descriptions in these publications contain rich information that can be extracted to facilitate analysis and computational biology research. However, the idiosyncrasies of morphological descriptions still pose a number of challenges to machines. In this work, we demonstrate the use of two different approaches to resolve meronym (i.e. part-of) relations between anatomical parts and their anchor organs: a syntactic rule-based approach and an SVM-based (support vector machine) method. Both methods made use of domain ontologies. We compared the two approaches with two other baseline methods, and the evaluation results show that the syntactic methods (92.1% F1 score) outperformed the SVM methods (80.7% F1 score) and that the part-of ontologies were valuable knowledge sources for the task. It is notable that the mistakes made by the two approaches rarely overlapped. Additional tests will be conducted on the development version of the Explorer of Taxon Concepts toolkit before we make the functionality publicly available. Meanwhile, we will further investigate and leverage the complementary nature of the two approaches to further drive down the error rate, as in practical application, even a 1% error rate could lead to hundreds of errors.
Affiliation(s)
- Dongfang Xu
  - University of Arizona, Tucson, United States of America
- Steven S Chong
  - University of Arizona, Tucson, United States of America
  - National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, United States of America
- Thomas Rodenhausen
  - University of Arizona, Tucson, United States of America
- Hong Cui
  - University of Arizona, Tucson, United States of America
8
Vaidya G, Lepage D, Guralnick R. The tempo and mode of the taxonomic correction process: How taxonomists have corrected and recorrected North American bird species over the last 127 years. PLoS One 2018; 13:e0195736. [PMID: 29672539 PMCID: PMC5909608 DOI: 10.1371/journal.pone.0195736] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/06/2017] [Accepted: 03/28/2018] [Indexed: 11/19/2022] Open
Abstract
While studies of taxonomy usually focus on species description, there is also a taxonomic correction process that retests and updates existing species circumscriptions on the basis of new evidence. These corrections may themselves be subsequently retested and recorrected. We studied this correction process by using the Check-List of North and Middle American Birds, a well-known taxonomic checklist that spans 130 years. We identified 142 lumps and 95 splits across sixty-three versions of the Check-List and found that while lumping rates have markedly decreased since the 1970s, splitting rates are accelerating. We found that 74% of North American bird species recognized today have never been corrected (i.e., lumped or split) over the period of the checklist, while 16% have been corrected exactly once and 10% have been corrected twice or more. Since North American bird species are known to have been extensively lumped in the first half of the 20th century with the advent of the biological species concept, we determined whether most splits seen today were the result of those lumps being recorrected. We found that 5% of lumps and 23% of splits fully reverted previous corrections, while a further 3% of lumps and 13% of splits are partial reversions. These results show a taxonomic correction process with moderate levels of recorrection, particularly of previous lumps. However, 81% of corrections do not revert any previous corrections, suggesting that the majority result in novel circumscriptions not previously recognized by the Check-List. We could find no order or family with a significantly higher rate of correction than any other, but twenty-two genera as currently recognized by the AOU do have significantly higher rates than others. 
Given the currently accelerating rate of splitting, prediction of the end-point of the taxonomic recorrection process is difficult, and many entirely new taxonomic concepts are still being, and likely will continue to be, proposed and further tested.
Affiliation(s)
- Gaurav Vaidya
  - Department of Ecology and Evolutionary Biology, University of Colorado Boulder, Boulder, Colorado, United States of America
- Denis Lepage
  - Bird Studies Canada, Port Rowan, Ontario, Canada
- Robert Guralnick
  - Department of Natural History and the Florida Museum of Natural History, University of Florida, Gainesville, Florida, United States of America
9
Folk RA, Sun M, Soltis PS, Smith SA, Soltis DE, Guralnick RP. Challenges of comprehensive taxon sampling in comparative biology: Wrestling with rosids. AMERICAN JOURNAL OF BOTANY 2018; 105:433-445. [PMID: 29665035 DOI: 10.1002/ajb2.1059] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 08/16/2017] [Accepted: 12/19/2017] [Indexed: 06/08/2023]
Abstract
Using phylogenetic approaches to test hypotheses on a large scale, in terms of both species sampling and associated species traits and occurrence data, and doing this with rigor despite all the attendant challenges, is critical for addressing many broad questions in evolution and ecology. However, application of such approaches to empirical systems is hampered by a lingering series of theoretical and practical bottlenecks. The community is still wrestling with the challenges of how to develop species-level, comprehensively sampled phylogenies and associated geographic and phenotypic resources that enable global-scale analyses. We illustrate difficulties and opportunities using the rosids as a case study, arguing that assembly of biodiversity data that is scale-appropriate, and therefore comprehensive and global in scope, is required to test global-scale hypotheses. Synthesizing comprehensive biodiversity data sets in clades such as the rosids will be key to understanding the origin and present-day evolutionary and ecological dynamics of the angiosperms.
Affiliation(s)
- Ryan A Folk
  - Florida Museum of Natural History, Gainesville, FL, 32611, USA
- Miao Sun
  - Florida Museum of Natural History, Gainesville, FL, 32611, USA
- Pamela S Soltis
  - Florida Museum of Natural History, Gainesville, FL, 32611, USA
  - Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
- Stephen A Smith
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, 48109, USA
- Douglas E Soltis
  - Florida Museum of Natural History, Gainesville, FL, 32611, USA
  - Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
  - Department of Biology, University of Florida, Gainesville, FL, 32611, USA
10
Endara L, Cui H, Burleigh JG. Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing. APPLICATIONS IN PLANT SCIENCES 2018; 6:e1035. [PMID: 29732265 PMCID: PMC5895189 DOI: 10.1002/aps3.1035] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 10/05/2017] [Accepted: 01/31/2018] [Indexed: 05/09/2023]
Abstract
PREMISE OF THE STUDY: Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi-automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms. METHODS AND RESULTS: Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon-by-character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae. CONCLUSIONS: The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.
Affiliation(s)
- Lorena Endara
  - Department of Biology, University of Florida, Gainesville, Florida 32611, USA
- Hong Cui
  - School of Information, University of Arizona, Tucson, Arizona 85719, USA