1
|
Folk RA, Guralnick RP, LaFrance RT. FloraTraiter: Automated parsing of traits from descriptive biodiversity literature. APPLICATIONS IN PLANT SCIENCES 2024; 12:e11563. [PMID: 38369975 PMCID: PMC10873814 DOI: 10.1002/aps3.11563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 09/13/2023] [Accepted: 10/01/2023] [Indexed: 02/20/2024]
Abstract
Premise: Plant trait data are essential for quantifying biodiversity and function across Earth, but these data are challenging to acquire for large studies. Diverse strategies are needed, including the liberation of heritage data locked within specialist literature such as floras and taxonomic monographs. Here we report FloraTraiter, a novel approach using rule-based natural language processing (NLP) to parse computable trait data from biodiversity literature. Methods: FloraTraiter was implemented through collaborative work between programmers and botanical experts and customized for both online floras and scanned literature. We report a strategy spanning optical character recognition, recognition of taxa, iterative building of traits, and establishing linkages among all of these, as well as curational tools and code for turning these results into standard morphological matrices. Results: Over 95% of treatment content was successfully parsed for traits with <1% error. Data for more than 700 taxa are reported, including a demonstration of common downstream uses. Conclusions: We identify strategies, applications, tips, and challenges that we hope will facilitate future similar efforts to produce large open-source trait data sets for broad community reuse. Largely automated tools like FloraTraiter will be an important addition to the toolkit for assembling trait data at scale.
Affiliation(s)
- Ryan A. Folk
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
| | - Robert P. Guralnick
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
- Biodiversity Institute, University of Florida, Gainesville, Florida, USA
| | | |
|
2
|
Zhao H, Wu H, Wang X. OIAE: Overall Improved Autoencoder with Powerful Image Reconstruction and Discriminative Feature Extraction. Cognit Comput 2022. [DOI: 10.1007/s12559-022-10000-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
3
|
Djemiel C, Maron PA, Terrat S, Dequiedt S, Cottin A, Ranjard L. Inferring microbiota functions from taxonomic genes: a review. Gigascience 2022; 11:giab090. [PMID: 35022702 PMCID: PMC8756179 DOI: 10.1093/gigascience/giab090] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 10/04/2021] [Revised: 12/02/2021] [Accepted: 12/02/2021] [Indexed: 12/13/2022] Open
Abstract
Deciphering microbiota functions is crucial to predict ecosystem sustainability in response to global change. High-throughput sequencing at the individual or community level has revolutionized our understanding of microbial ecology, leading to the big data era and improving our ability to link microbial diversity with microbial functions. Recent advances in bioinformatics have been key for developing functional prediction tools based on DNA metabarcoding data and using taxonomic gene information. This cheaper approach in every aspect serves as an alternative to shotgun sequencing. Although these tools are increasingly used by ecologists, an objective evaluation of their modularity, portability, and robustness is lacking. Here, we reviewed 100 scientific papers on functional inference and ecological trait assignment to rank the advantages, specificities, and drawbacks of these tools, using a scientific benchmarking. To date, inference tools have been mainly devoted to bacterial functions, and ecological trait assignment tools, to fungal functions. A major limitation is the lack of reference genomes, compared with the human microbiota, especially for complex ecosystems such as soils. Finally, we explore applied research prospects. These tools are promising and already provide relevant information on ecosystem functioning, but standardized indicators and corresponding repositories are still lacking that would enable them to be used for operational diagnosis.
Affiliation(s)
- Christophe Djemiel
- Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
| | - Pierre-Alain Maron
- Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
| | - Sébastien Terrat
- Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
| | - Samuel Dequiedt
- Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
| | - Aurélien Cottin
- Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
| | - Lionel Ranjard
- Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
| |
|
4
|
Spaulding SA, Potapova MG, Bishop IW, Lee SS, Gasperak TS, Jovanoska E, Furey PC, Edlund MB. Diatoms.org: supporting taxonomists, connecting communities. DIATOM RESEARCH : THE JOURNAL OF THE INTERNATIONAL SOCIETY FOR DIATOM RESEARCH 2022; 36:291-304. [PMID: 35958044 PMCID: PMC9359083 DOI: 10.1080/0269249x.2021.2006790] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 01/27/2021] [Accepted: 09/22/2021] [Indexed: 05/23/2023]
Abstract
Consistent identification of diatoms is a prerequisite for studying their ecology, biogeography, and successful application as environmental indicators. However, taxonomic consistency among observers has been difficult to achieve, because taxonomic information is scattered across numerous literature sources, presenting challenges to the diatomist. First, literature is often inaccessible because of cost, or its location in journals that are not widely circulated. Second, taxonomic revisions of diatoms are taking place faster than floras can be updated. Finally, taxonomic information is often contradictory across literature sources. These issues can be addressed by developing a content creation community dedicated to making taxonomic, ecological, and image-based data freely available for diatom researchers. Diatoms.org represents such a content curation community, providing open, online access to a vast amount of recent and historical information on North American diatom taxonomy and ecology. The content curation community aggregates existing taxonomic information, creates new content, and provides feedback in the form of corrections and notice of literature with nomenclatural changes. The website not only addresses the needs of experienced diatom scientists for consistent identification, but is also designed to meet users at their level of expertise, including engaging the lay public in the importance of diatom science. The website now contains over 1000 species pages contributed by over 100 content contributors, from students to established scientists. The project began with the intent to provide accurate information on diatom identification, ecology, and distribution using an approach that incorporates engaging design, user feedback, and advanced data access technology. 
In retrospect, the project that began as an "extended electronic book" has emerged not only as a means to support taxonomists, but for practitioners to communicate and collaborate, expanding the size of and benefits to the content curation community. In this paper, we outline the development of diatoms.org, document key elements of the project, examine ongoing challenges, and consider the unexpected emergent properties, including the value of diatoms.org as a source of data. Ultimately, if the field of diatom taxonomy, ecology, and biodiversity is to be relevant, a new generation of taxonomists needs to be trained and employed using new tools. We propose that diatoms.org is in a key position to serve as a hub of training and continuity for the study of diatom biodiversity and aquatic conditions.
Affiliation(s)
- Sarah A Spaulding
- U.S. Geological Survey/INSTAAR, 4001 Discovery Drive, Boulder, CO 80309
| | - Marina G Potapova
- The Academy of Natural Sciences of Drexel University, 1900 Benjamin Franklin Parkway, Philadelphia PA 19103
| | - Ian W Bishop
- Graduate School of Oceanography, University of Rhode Island, 215 S. Ferry Rd, Narragansett, RI 02882
| | - Sylvia S Lee
- U.S. Environmental Protection Agency, Office of Research and Development, Center for Public Health and Environmental Assessment, 1200 Pennsylvania Ave. NW, Mail code 8623-P, Washington, D.C. 20460
| | | | - Elena Jovanoska
- Department of Palaeoanthropology, Senckenberg Research Institute, Senckenberganlage 25, 60325, Frankfurt am Main, Germany
| | - Paula C Furey
- Department of Biology, St. Catherine University, 2004 Randolph Ave., St. Paul, MN 55105
| | - Mark B Edlund
- St. Croix Watershed Res. Station, Science Museum of Minnesota, Marine on St. Croix MN 55047
| |
|
5
|
Folk RA, Siniscalchi CM. Biodiversity at the global scale: the synthesis continues. AMERICAN JOURNAL OF BOTANY 2021; 108:912-924. [PMID: 34181762 DOI: 10.1002/ajb2.1694] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/04/2021] [Accepted: 04/14/2021] [Indexed: 06/13/2023]
Abstract
Traditionally, the generation and use of biodiversity data and their associated specimen objects have been primarily the purview of individuals and small research groups. While deposition of data and specimens in herbaria and other repositories has long been the norm, throughout most of their history, these resources have been accessible only to a small community of specialists. Through recent concerted efforts, primarily at the level of national and international governmental agencies over the last two decades, the pace of biodiversity data accumulation has accelerated, and a wider array of biodiversity scientists has gained access to this massive accumulation of resources, applying them to an ever-widening compass of research pursuits. We review how these new resources and increasing access to them are affecting the landscape of biodiversity research in plants today, focusing on new applications across evolution, ecology, and other fields that have been enabled specifically by the availability of these data and the global scope that was previously beyond the reach of individual investigators. We give an overview of recent advances organized along three lines: broad-scale analyses of distributional data and spatial information, phylogenetic research circumscribing large clades with comprehensive taxon sampling, and data sets derived from improved accessibility of biodiversity literature. We also review synergies between large data resources and more traditional data collection paradigms, describe shortfalls and how to overcome them, and reflect on the future of plant biodiversity analyses in light of increasing linkages between data types and scientists in our field.
Affiliation(s)
- Ryan A Folk
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
| | - Carolina M Siniscalchi
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
| |
|
6
|
Singh G, Papoutsoglou EA, Keijts-Lalleman F, Vencheva B, Rice M, Visser RG, Bachem CW, Finkers R. Extracting knowledge networks from plant scientific literature: potato tuber flesh color as an exemplary trait. BMC PLANT BIOLOGY 2021; 21:198. [PMID: 33894758 PMCID: PMC8070292 DOI: 10.1186/s12870-021-02943-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/16/2020] [Accepted: 03/29/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND Scientific literature carries a wealth of information crucial for research, but only a fraction of it is present as structured information in databases and therefore can be analyzed using traditional data analysis tools. Natural language processing (NLP) is often and successfully employed to support humans by distilling relevant information from large corpora of free text and structuring it in a way that lends itself to further computational analyses. For this pilot, we developed a pipeline that uses NLP on biological literature to produce knowledge networks. We focused on the flesh color of potato, a well-studied trait with known associations, and we investigated whether these knowledge networks can assist us in formulating new hypotheses on the underlying biological processes. RESULTS We trained an NLP model based on a manually annotated corpus of 34 full-text potato articles, to recognize relevant biological entities and relationships between them in text (genes, proteins, metabolites and traits). This model detected the number of biological entities with a precision of 97.65% and a recall of 88.91% on the training set. We conducted a time series analysis on 4023 PubMed abstracts of plant genetics-based articles which focus on 4 major Solanaceous crops (tomato, potato, eggplant and capsicum), to determine that the networks contained both previously known and contemporaneously unknown leads to subsequently discovered biological phenomena relating to flesh color. A novel time-based analysis of these networks indicates a connection between our trait and a candidate gene (zeaxanthin epoxidase) already two years prior to explicit statements of that connection in the literature. CONCLUSIONS Our time-based analysis indicates that network-assisted hypothesis generation shows promise for knowledge discovery, data integration and hypothesis generation in scientific research.
Affiliation(s)
- Gurnoor Singh
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
| | | | | | | | - Mark Rice
- IBM Netherlands, Amsterdam, The Netherlands
| | - Richard G.F. Visser
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
| | - Christian W.B. Bachem
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
| | - Richard Finkers
- Plant Breeding, Wageningen University & Research, PO Box 386, Wageningen, 6700AJ The Netherlands
| |
|
7
|
Walton S, Livermore L, Bánki O, Cubey R, Drinkwater R, Englund M, Goble C, Groom Q, Kermorvant C, Rey I, Santos C, Scott B, Williams A, Wu Z. Landscape Analysis for the Specimen Data Refinery. RESEARCH IDEAS AND OUTCOMES 2020. [DOI: 10.3897/rio.6.e57602] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022] Open
Abstract
This report reviews the current state-of-the-art applied approaches on automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems; and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development teams and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens.
|
8
|
Eliason CM, Edwards SV, Clarke JA. phenotools: An R package for visualizing and analysing phenomic datasets. Methods Ecol Evol 2019. [DOI: 10.1111/2041-210x.13217] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Affiliation(s)
- Chad M. Eliason
- Department of Geological Sciences, University of Texas Austin, Austin, Texas
- Grainger Bioinformatics Center, Field Museum of Natural History, Chicago, Illinois
| | - Scott V. Edwards
- Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
| | - Julia A. Clarke
- Department of Geological Sciences, University of Texas Austin, Austin, Texas
| |
|
9
|
König C, Weigelt P, Schrader J, Taylor A, Kattge J, Kreft H. Biodiversity data integration-the significance of data resolution and domain. PLoS Biol 2019; 17:e3000183. [PMID: 30883539 PMCID: PMC6445469 DOI: 10.1371/journal.pbio.3000183] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 06/25/2018] [Revised: 04/02/2019] [Indexed: 11/19/2022] Open
Abstract
Recent years have seen an explosion in the availability of biodiversity data describing the distribution, function, and evolutionary history of life on earth. Integrating these heterogeneous data remains a challenge due to large variations in observational scales, collection purposes, and terminologies. Here, we conceptualize widely used biodiversity data types according to their domain (what aspect of biodiversity is described?) and informational resolution (how specific is the description?). Applying this framework to major data providers in biodiversity research reveals a strong focus on the disaggregated end of the data spectrum, whereas aggregated data types remain largely underutilized. We discuss the implications of this imbalance for the scope and representativeness of current macroecological research and highlight the synergies arising from a tighter integration of biodiversity data across domains and resolutions. We lay out effective strategies for data collection, mobilization, imputation, and sharing and summarize existing frameworks for scalable and integrative biodiversity research. Finally, we use two case studies to demonstrate how the explicit consideration of data domain and resolution helps to identify biases and gaps in global data sets and achieve unprecedented taxonomic and geographical data coverage in macroecological analyses. This Essay highlights data resolution as a central property of biodiversity data that affects the precision and representativeness of macroecological inferences. It also discusses ways to maximize synergies among data types and showcases the potential of cross-resolution, cross-domain data integration.
Affiliation(s)
- Christian König
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
| | - Patrick Weigelt
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
| | - Julian Schrader
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
| | - Amanda Taylor
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
| | - Jens Kattge
- Research Group Functional Biogeography, Max Planck Institute for Biogeochemistry, Jena, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Holger Kreft
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
- Centre of Biodiversity and Sustainable Land Use (CBL), University of Goettingen, Goettingen, Germany
| |
|
10
|
Xu D, Chong SS, Rodenhausen T, Cui H. Resolving "orphaned" non-specific structures using machine learning and natural language processing methods. Biodivers Data J 2018:e26659. [PMID: 30393454 PMCID: PMC6207837 DOI: 10.3897/bdj.6.e26659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2018] [Accepted: 07/27/2018] [Indexed: 11/12/2022] Open
Abstract
Scholarly publications of biodiversity literature contain a vast amount of information in human readable format. The detailed morphological descriptions in these publications contain rich information that can be extracted to facilitate analysis and computational biology research. However, the idiosyncrasies of morphological descriptions still pose a number of challenges to machines. In this work, we demonstrate the use of two different approaches to resolve meronym (i.e. part-of) relations between anatomical parts and their anchor organs, including a syntactic rule-based approach and a SVM-based (support vector machine) method. Both methods made use of domain ontologies. We compared the two approaches with two other baseline methods and the evaluation results show the syntactic methods (92.1% F1 score) outperformed the SVM methods (80.7% F1 score) and the part-of ontologies were valuable knowledge sources for the task. It is notable that the mistakes made by the two approaches rarely overlapped. Additional tests will be conducted on the development version of the Explorer of Taxon Concepts toolkit before we make the functionality publicly available. Meanwhile, we will further investigate and leverage the complementary nature of the two approaches to further drive down the error rate, as in practical application, even a 1% error rate could lead to hundreds of errors.
Affiliation(s)
- Dongfang Xu
- University of Arizona, Tucson, United States of America
| | - Steven S Chong
- University of Arizona, Tucson, United States of America; National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, United States of America
| | - Thomas Rodenhausen
- University of Arizona, Tucson, United States of America
| | - Hong Cui
- University of Arizona, Tucson, United States of America
| |
|
11
|
Gitzendanner MA, Yang Y, Wickett NJ, McKain M, Beaulieu JM. Methods for exploring the plant tree of life. APPLICATIONS IN PLANT SCIENCES 2018; 6:e1039. [PMCID: PMC5895194 DOI: 10.1002/aps3.1039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/13/2018] [Accepted: 03/15/2018] [Indexed: 05/24/2023]
Affiliation(s)
| | - Ya Yang
- Department of Plant and Microbial Biology, University of Minnesota, St. Paul, Minnesota 55108, USA
| | - Norman J. Wickett
- Department of Plant Science, Chicago Botanic Garden, Glencoe, Illinois 60022, USA
- Plant Biology and Conservation, Northwestern University, Evanston, Illinois 60208, USA
| | - Michael McKain
- Department of Biological Sciences, University of Alabama, Tuscaloosa, Alabama 35487, USA
| | - Jeremy M. Beaulieu
- Department of Biological Sciences, University of Arkansas, Fayetteville, Arkansas 72701, USA
| |
|
12
|
Folk RA, Sun M, Soltis PS, Smith SA, Soltis DE, Guralnick RP. Challenges of comprehensive taxon sampling in comparative biology: Wrestling with rosids. AMERICAN JOURNAL OF BOTANY 2018; 105:433-445. [PMID: 29665035 DOI: 10.1002/ajb2.1059] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 08/16/2017] [Accepted: 12/19/2017] [Indexed: 06/08/2023]
Abstract
Using phylogenetic approaches to test hypotheses on a large scale, in terms of both species sampling and associated species traits and occurrence data, and doing this with rigor despite all the attendant challenges, is critical for addressing many broad questions in evolution and ecology. However, application of such approaches to empirical systems is hampered by a lingering series of theoretical and practical bottlenecks. The community is still wrestling with the challenges of how to develop species-level, comprehensively sampled phylogenies and associated geographic and phenotypic resources that enable global-scale analyses. We illustrate difficulties and opportunities using the rosids as a case study, arguing that assembly of biodiversity data that is scale-appropriate (and therefore comprehensive and global in scope) is required to test global-scale hypotheses. Synthesizing comprehensive biodiversity data sets in clades such as the rosids will be key to understanding the origin and present-day evolutionary and ecological dynamics of the angiosperms.
Affiliation(s)
- Ryan A Folk
- Florida Museum of Natural History, Gainesville, FL, 32611, USA
| | - Miao Sun
- Florida Museum of Natural History, Gainesville, FL, 32611, USA
| | - Pamela S Soltis
- Florida Museum of Natural History, Gainesville, FL, 32611, USA
- Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
| | - Stephen A Smith
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Douglas E Soltis
- Florida Museum of Natural History, Gainesville, FL, 32611, USA
- Genetics Institute, University of Florida, Gainesville, FL, 32610, USA
- Department of Biology, University of Florida, Gainesville, FL, 32611, USA
| | | |
|