1
Dumschott K, Dörpholz H, Laporte MA, Brilhaus D, Schrader A, Usadel B, Neumann S, Arnaud E, Kranz A. Ontologies for increasing the FAIRness of plant research data. Frontiers in Plant Science 2023; 14:1279694. [PMID: 38098789 PMCID: PMC10720748 DOI: 10.3389/fpls.2023.1279694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 11/15/2023] [Indexed: 12/17/2023]
Abstract
The importance of improving the FAIRness (findability, accessibility, interoperability, reusability) of research data is undeniable, especially in the face of large, complex datasets currently being produced by omics technologies. Facilitating the integration of a dataset with other types of data increases the likelihood of reuse, and the potential of answering novel research questions. Ontologies are a useful tool for semantically tagging datasets as adding relevant metadata increases the understanding of how data was produced and increases its interoperability. Ontologies provide concepts for a particular domain as well as the relationships between concepts. By tagging data with ontology terms, data becomes both human- and machine- interpretable, allowing for increased reuse and interoperability. However, the task of identifying ontologies relevant to a particular research domain or technology is challenging, especially within the diverse realm of fundamental plant research. In this review, we outline the ontologies most relevant to the fundamental plant sciences and how they can be used to annotate data related to plant-specific experiments within metadata frameworks, such as Investigation-Study-Assay (ISA). We also outline repositories and platforms most useful for identifying applicable ontologies or finding ontology terms.
Affiliation(s)
- Kathryn Dumschott
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
- Hannah Dörpholz
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
- Marie-Angélique Laporte
  - Digital Solutions Team, Digital Inclusion Lever, Bioversity International, Montpellier Office, Montpellier, France
- Dominik Brilhaus
  - Data Science and Management & Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Andrea Schrader
  - Data Science and Management & Cluster of Excellence on Plant Sciences (CEPLAS), University of Cologne, Cologne, Germany
- Björn Usadel
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
  - Institute for Biological Data Science & Cluster of Excellence on Plant Sciences (CEPLAS), Faculty of Mathematics and Life Sciences, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Steffen Neumann
  - Program Center MetaCom, Leibniz Institute of Plant Biochemistry, Halle, Germany
  - German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
- Elizabeth Arnaud
  - Digital Solutions Team, Digital Inclusion Lever, Bioversity International, Montpellier Office, Montpellier, France
- Angela Kranz
  - Institute of Bio- and Geosciences (IBG-4: Bioinformatics) & Bioeconomy Science Center (BioSC), CEPLAS, Forschungszentrum Jülich, Jülich, Germany
2
Stefancsik R, Balhoff JP, Balk MA, Ball RL, Bello SM, Caron AR, Chesler EJ, de Souza V, Gehrke S, Haendel M, Harris LW, Harris NL, Ibrahim A, Koehler S, Matentzoglu N, McMurry JA, Mungall CJ, Munoz-Torres MC, Putman T, Robinson P, Smedley D, Sollis E, Thessen AE, Vasilevsky N, Walton DO, Osumi-Sutherland D. The Ontology of Biological Attributes (OBA)-computational traits for the life sciences. Mamm Genome 2023; 34:364-378. [PMID: 37076585 PMCID: PMC10382347 DOI: 10.1007/s00335-023-09992-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 01/27/2023] [Accepted: 04/06/2023] [Indexed: 04/21/2023]
Abstract
Existing phenotype ontologies were originally developed to represent phenotypes that manifest as a character state in relation to a wild-type or other reference. However, these do not include the phenotypic trait or attribute categories required for the annotation of genome-wide association studies (GWAS), Quantitative Trait Loci (QTL) mappings or any population-focussed measurable trait data. The integration of trait and biological attribute information with an ever increasing body of chemical, environmental and biological data greatly facilitates computational analyses and it is also highly relevant to biomedical and clinical applications. The Ontology of Biological Attributes (OBA) is a formalised, species-independent collection of interoperable phenotypic trait categories that is intended to fulfil a data integration role. OBA is a standardised representational framework for observable attributes that are characteristics of biological entities, organisms, or parts of organisms. OBA has a modular design which provides several benefits for users and data integrators, including an automated and meaningful classification of trait terms computed on the basis of logical inferences drawn from domain-specific ontologies for cells, anatomical and other relevant entities. The logical axioms in OBA also provide a previously missing bridge that can computationally link Mendelian phenotypes with GWAS and quantitative traits. The term components in OBA provide semantic links and enable knowledge and data integration across specialised research community boundaries, thereby breaking silos.
Affiliation(s)
- Ray Stefancsik
  - European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- James P Balhoff
  - Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC, 27517, USA
- Meghan A Balk
  - Natural History Museum, University of Oslo, Oslo, Norway
- Robyn L Ball
  - The Jackson Laboratory, Bar Harbor, ME, 04609, USA
- Anita R Caron
  - European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Vinicius de Souza
  - European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Sarah Gehrke
  - Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Melissa Haendel
  - Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Laura W Harris
  - European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Nomi L Harris
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Arwa Ibrahim
  - European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Julie A McMurry
  - Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Christopher J Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Tim Putman
  - Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Damian Smedley
  - William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, EC1M 6BQ, UK
- Elliot Sollis
  - European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Anne E Thessen
  - Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Nicole Vasilevsky
  - Data Collaboration Center, Critical Path Institute, Tucson, AZ, 85718, USA
3
Senft M, Stahl U, Svoboda N. Research data management in agricultural sciences in Germany: We are not yet where we want to be. PLoS One 2022; 17:e0274677. [PMID: 36178887 PMCID: PMC9524626 DOI: 10.1371/journal.pone.0274677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2021] [Accepted: 09/01/2022] [Indexed: 11/19/2022] Open
Abstract
To meet the future challenges and foster integrated and holistic research approaches in agricultural sciences, new and sustainable methods in research data management (RDM) are needed. The involvement of scientific users is a critical success factor for their development. We conducted an online survey in 2020 among different user groups in agricultural sciences about their RDM practices and needs. In total, the questionnaire contained 52 questions on information about produced and (re-)used data, data quality aspects, information about the use of standards, publication practices and legal aspects of agricultural research data, the current situation in RDM in regards to awareness, consulting and curricula as well as needs of the agricultural community in respect to future developments. We received 196 (partially) completed questionnaires from data providers, data users, infrastructure and information service providers. In addition to the diversity in the research data landscape of agricultural sciences in Germany, the study reveals challenges, deficits and uncertainties in handling research data in agricultural sciences standing in the way of access and efficient reuse of valuable research data. However, the study also suggests and discusses potential solutions to enhance data publications, facilitate and secure data re-use, ensure data quality and develop services (i.e. training, support and bundling services). Therefore, our research article provides the basis for the development of common RDM, future infrastructures and services needed to foster the cultural change in handling research data across agricultural sciences in Germany and beyond.
Affiliation(s)
- Matthias Senft
  - Leibniz Institute for Agricultural Engineering and Bioeconomy (ATB), Potsdam, Germany
- Ulrike Stahl
  - Julius Kühn Institute (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
- Nikolai Svoboda
  - Leibniz Centre for Agricultural Landscape Research (ZALF), Müncheberg, Germany
4
Salim JA, Saraiva AM, Zermoglio PF, Agostini K, Wolowski M, Drucker DP, Soares FM, Bergamo PJ, Varassin IG, Freitas L, Maués MM, Rech AR, Veiga AK, Acosta AL, Araujo AC, Nogueira A, Blochtein B, Freitas BM, Albertini BC, Maia-Silva C, Nunes CEP, Pires CSS, dos Santos CF, Queiroz EP, Cartolano EA, de Oliveira FF, Amorim FW, Fontúrbel FE, da Silva GV, Consolaro H, Alves-dos-Santos I, Machado IC, Silva JS, Aleixo KP, Carvalheiro LG, Rocca MA, Pinheiro M, Hrncir M, Streher NS, Ferreira PA, de Albuquerque PMC, Maruyama PK, Borges RC, Giannini TC, Brito VLG. Data standardization of plant-pollinator interactions. Gigascience 2022; 11:giac043. [PMID: 35639882 PMCID: PMC9154084 DOI: 10.1093/gigascience/giac043] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Animal pollination is an important ecosystem function and service, ensuring both the integrity of natural systems and human well-being. Although many knowledge shortfalls remain, some high-quality data sets on biological interactions are now available. The development and adoption of standards for biodiversity data and metadata has promoted great advances in biological data sharing and aggregation, supporting large-scale studies and science-based public policies. However, these standards are currently not suitable to fully support interaction data sharing. RESULTS Here we present a vocabulary of terms and a data model for sharing plant-pollinator interactions data based on the Darwin Core standard. The vocabulary introduces 48 new terms targeting several aspects of plant-pollinator interactions and can be used to capture information from different approaches and scales. Additionally, we provide solutions for data serialization using RDF, XML, and DwC-Archives and recommendations of existing controlled vocabularies for some of the terms. Our contribution supports open access to standardized data on plant-pollinator interactions. CONCLUSIONS The adoption of the vocabulary would facilitate data sharing to support studies ranging from the spatial and temporal distribution of interactions to the taxonomic, phenological, functional, and phylogenetic aspects of plant-pollinator interactions. We expect to fill data and knowledge gaps, thus further enabling scientific research on the ecology and evolution of plant-pollinator communities, biodiversity conservation, ecosystem services, and the development of public policies. The proposed data model is flexible and can be adapted for sharing other types of interactions data by developing discipline-specific vocabularies of terms.
Affiliation(s)
- José A Salim
  - Escola Politécnica, Universidade de São Paulo, São Paulo, SP, 05508-010, Brazil
- Antonio M Saraiva
  - Escola Politécnica, Universidade de São Paulo, São Paulo, SP, 05508-010, Brazil
- Paula F Zermoglio
  - Departamento de Ecología, Genética y Evolución, Instituto IEGEBA (CONICET-UBA), Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Buenos Aires, Argentina
- Kayna Agostini
  - Departamento de Ciências da Natureza, Matemática e Educação, Universidade Federal de São Carlos, Rodovia Anhanguera km 174, Araras, São Paulo, Caixa Postal 153. CEP 13600-970, Brazil
- Marina Wolowski
  - Instituto de Ciências da Natureza, Universidade Federal de Alfenas, Rua Gabriel Monteiro da Silva 700, Alfenas, Minas Gerais, 37130-001, Brazil
- Debora P Drucker
  - Embrapa Agricultura Digital, Empresa Brasileira de Pesquisa Agropecuária (Embrapa), Campinas, SP, Brazil
- Filipi M Soares
  - Escola Politécnica, Universidade de São Paulo, São Paulo, SP, 05508-010, Brazil
- Pedro J Bergamo
  - Jardim Botânico do Rio de Janeiro, R. Pacheco Leão 915, Rio de Janeiro, Rio de Janeiro, 22460-030, Brazil
- Isabela G Varassin
  - Departamento de Botânica, Universidade Federal do Paraná, Curitiba, Paraná, Brazil
- Leandro Freitas
  - Jardim Botânico do Rio de Janeiro, R. Pacheco Leão 915, Rio de Janeiro, Rio de Janeiro, 22460-030, Brazil
- Márcia M Maués
  - Laboratório de Entomologia, Embrapa Amazônia Oriental, Trav. Dr. Enéas Pinheiro, s/n°, Bairro do Marco, Belém, Pará, 66095-903, Brazil
- Andre R Rech
  - Faculdade Interdisciplinar de Humanidades, Centro Multiusuário de Pesquisa em Ciência Florestal (MULTIFLOR), Universidade Federal dos Vales do Jequitinhonha e Mucuri, Diamantina, Minas Gerais, 39100-000, Brazil
- Allan K Veiga
  - Escola Politécnica, Universidade de São Paulo, São Paulo, SP, 05508-010, Brazil
- Andre L Acosta
  - Instituto Tecnológico Vale. Rua Boaventura da Silva, 955, 66055-900, Belém, Pará, Brazil
- Andréa C Araujo
  - Instituto de Biociências, Universidade Federal de Mato Grosso do Sul, Campo Grande, Mato Grosso do Sul, Brazil
- Anselmo Nogueira
  - Laboratório de Interações Plant-Animal (LIPA), Centro de Ciências Naturais e Humanas (CCNH), Universidade Federal do ABC, Alameda da Universidade, s/nº, Anchieta, São Bernardo do Campo, São Paulo, Brazil
- Betina Blochtein
  - Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, RS, 90619-900, Brazil
- Breno M Freitas
  - Departamento de Zootecnia, Campus Universitário do Pici, Universidade Federal do Ceará, Centro de Ciências Agrárias, Fortaleza, CE, Brazil
- Bruno C Albertini
  - Escola Politécnica, Universidade de São Paulo, São Paulo, SP, 05508-010, Brazil
- Camila Maia-Silva
  - Departamento de Biociências, Universidade Federal Rural do Semi-Árido, Av. Francisco Mota, n° 572, Presidente Costa e Silva, Mossoró, RN, 59625-900, Brazil
- Carlos E P Nunes
  - Department of Biological and Environmental Sciences, Cottrell Building, University of Stirling, Stirling FK9 4LA, Scotland, United Kingdom
- Carmen S S Pires
  - Embrapa Recursos Genéticos e Biotecnologia, Brasília, Distrito Federal, Brazil
- Charles F dos Santos
  - Escola de Ciências da Saúde e da Vida, Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, RS, 90619-900, Brazil
- Elisa P Queiroz
  - Departamento de Ecologia, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
- Etienne A Cartolano
  - Escola Politécnica, Universidade de São Paulo, São Paulo, SP, 05508-010, Brazil
- Favízia F de Oliveira
  - Laboratório de Bionomia, Biogeografia e Sistemática de Insetos (BIOSIS), Instituto de Biologia (IBIO), Universidade Federal da Bahia, 40170-115 Salvador, Bahia, Brazil
- Felipe W Amorim
  - Laboratório de Ecologia da Polinização e Interações (LEPI), Programa de Pós-graduação em Botânica, Programa de Pós-graduação em Zoologia, Instituto de Biociências, Universidade Estadual Paulista, Botucatu, SP, Brazil
- Francisco E Fontúrbel
  - Instituto de Biología, Facultad de Ciencias, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
- Gleycon V da Silva
  - Programa de Pós-Graduação em Ecologia / INPA-V8 - Instituto Nacional de Pesquisas da Amazônia, Av. André Araújo 2936, Petrópolis, 69067-375, Manaus - AM, Brazil
- Hélder Consolaro
  - Instituto de Biotecnologia, Universidade Federal de Catalão, Catalão, Goiás, Brazil
- Isabel Alves-dos-Santos
  - Departamento de Ecologia, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
- Isabel C Machado
  - Programa de Pós-Graduação em Biologia Vegetal, Departamento de Botânica, Universidade Federal de Pernambuco, Recife, PE 50670-901, Brazil
- Juliana S Silva
  - Instituto Federal de Educação Ciência e Tecnologia de Mato Grosso, Avenida Sen. Filinto Müller, 953 - CEP: 78043-400 - Cuiabá, MT, Brazil
- Kátia P Aleixo
  - Associação Brasileira de Estudos das Abelhas (A.B.E.L.H.A.), São Paulo, SP, 04535-001, Brazil
- Luísa G Carvalheiro
  - Departamento de Ecologia, Universidade Federal de Goiás, Campus Samambaia, Goiânia, Brazil
  - Centre for Ecology, Evolution and Environmental Changes (cE3c), University of Lisboa, Lisbon, Portugal
- Márcia A Rocca
  - Departamento de Ecologia, Centro de Ciências Biológicas e da Saúde, Universidade Federal de Sergipe, Avenida Marechal Rondon s/n, São Cristóvão, Sergipe, 49100-000, Brazil
- Mardiore Pinheiro
  - Universidade Federal da Fronteira Sul, R. Major Antônio Cardoso 590, Cerro Largo, Rio Grande do Sul, 97900-000, Brazil
- Michael Hrncir
  - Departamento de Fisiologia, Instituto de Biociências, Universidade de São Paulo, Rua do Matão, 321, Travessa 14, São Paulo, São Paulo, 05508-900, Brazil
- Nathália S Streher
  - Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, 15260, United States of America
- Patricia A Ferreira
  - Environmental Sciences Department, Federal University of São Carlos, São Paulo, Brazil
- Pietro K Maruyama
  - Centro de Síntese Ecológica e Conservação, Departamento de Genética, Ecologia e Evolução, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Rafael C Borges
  - Instituto Tecnológico Vale. Rua Boaventura da Silva, 955, 66055-900, Belém, Pará, Brazil
- Tereza C Giannini
  - Instituto Tecnológico Vale. Rua Boaventura da Silva, 955, 66055-900, Belém, Pará, Brazil
- Vinícius L G Brito
  - Instituto de Biologia, Universidade Federal de Uberlândia, Rua Ceará sn, Uberlândia, Minas Gerais, 38.405-302, Brazil
5
An Ontology-Driven Personalized Faceted Search for Exploring Knowledge Bases of Capsicum. Future Internet 2021. [DOI: 10.3390/fi13070172] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022] Open
Abstract
Capsicum is a genus of flowering plants in the Solanaceae family in which the members are well known to have a high economic value. The Capsicum fruits, which are popularly known as peppers or chili, have been widely used by people worldwide. It serves as a spice and raw material for many products such as sauce, food coloring, and medicine. For many years, scientists have studied this plant to optimize its production. A tremendous amount of knowledge has been obtained and shared, as reflected in multiple knowledge-based systems, databases, or information systems. An approach to knowledge-sharing is through the adoption of a common ontology to eliminate knowledge understanding discrepancy. Unfortunately, most of the knowledge-sharing solutions are intended for scientists who are familiar with the subject. On the other hand, there are groups of potential users that could benefit from such systems but have minimal knowledge of the subject. For these non-expert users, finding relevant information from a less familiar knowledge base would be daunting. More than that, users have various degrees of understanding of the available content in the knowledge base. This understanding discrepancy raises a personalization problem. In this paper, we introduce a solution to overcome this challenge. First, we developed an ontology to facilitate knowledge-sharing about Capsicum to non-expert users. Second, we developed a personalized faceted search algorithm that provides multiple structured ways to explore the knowledge base. The algorithm addresses the personalization problem by identifying the degree of understanding about the subject from each user. In this way, non-expert users could explore a knowledge base of Capsicum efficiently. Our solution characterized users into four groups. As a result, our faceted search algorithm defines four types of matching mechanisms, including three ranking mechanisms as the core of our solution. 
In order to evaluate the proposed method, we measured the predictability degree of produced list of facets. Our findings indicated that the proposed matching mechanisms could tolerate various query types, and a high degree of predictability can be achieved by combining multiple ranking mechanisms. Furthermore, it demonstrates that our approach has a high potential contribution to biodiversity science in general, where many knowledge-based systems have been developed with limited access to users outside of the domain.
6
Owen D, Groom Q, Hardisty A, Leegwater T, Livermore L, van Walsum M, Wijkamp N, Spasić I. Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections. Research Ideas and Outcomes 2020. [DOI: 10.3897/rio.6.e58030] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on state-of-the-art technologies.
Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.
Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text.
Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0.
We have highlighted the main recommendations for potential pipeline components. The paper also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
7
Owen D, Livermore L, Groom Q, Hardisty A, Leegwater T, van Walsum M, Wijkamp N, Spasić I. Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections. Research Ideas and Outcomes 2020. [DOI: 10.3897/rio.6.e55789] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022] Open
Abstract
We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies.
Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.
Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed for significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets, and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text.
Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library to map the named entities to entities in the biodiversity literature (https://www.biodiversitylibrary.org/docs/api3.html).
We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions. These include automatic language identification, terminology extraction, and integrating all pipeline components into a scientific workflow to automate the overall digitisation process.
8
Quantifying Changes in Plant Species Diversity in a Savanna Ecosystem Through Observed and Remotely Sensed Data. Sustainability 2020. [DOI: 10.3390/su12062345] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
Abstract
This study examined the impact of climate change on plant species diversity of a savanna ecosystem, through an assessment of climatic trends over a period of forty years (1974–2014) using Masvingo Province, Zimbabwe, as a case study. The normalised difference vegetation index (NDVI) was used as a proxy for plant species diversity to cover for the absence of long-term historical plant diversity data. Observed precipitation and temperature data collected over the review period were compared with the trends in NDVI to understand the impact of climate change on plant species diversity over time. The nonaligned block sampling design was used as the sampling framework, from which 198 sampling plots were identified. Data sources included satellite images, field measurements, and direct observations. Temperature and precipitation had significant (p < 0.05) trends over the period under study. However, the trend for seasonal total precipitation was not significant but declining. Significant correlations (p < 0.001) were identified between various climate variables and the Shannon index of diversity. NDVI was also significantly correlated to the Shannon index of diversity. The declining trend of plant species in savanna ecosystems is directly linked to the decreasing precipitation and increasing temperatures.
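The two indices named in this abstract have standard textbook definitions, which can be sketched as follows. This is a minimal illustration and not the study's own processing chain: the function names, the reflectance values, and the species counts are all hypothetical. NDVI is computed as (NIR - Red) / (NIR + Red), and the Shannon index as H' = -Σ p_i ln p_i over species proportions p_i.

```python
import math


def ndvi(nir, red):
    """Normalised difference vegetation index from near-infrared and red reflectance."""
    denom = nir + red
    if denom == 0:
        # Undefined for a pixel with no signal; return 0.0 by convention here (an assumption).
        return 0.0
    return (nir - red) / denom


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over species with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical values for one sampling plot (not data from the study).
plot_ndvi = ndvi(nir=0.5, red=0.1)
plot_diversity = shannon_index([10, 10])
```

A plot with two equally abundant species gives H' = ln 2, the maximum for two species; a single-species plot gives 0.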
9
Abstract
Biodiversity research studies the variability and diversity of organisms, including variability within and between species with particular focus on the functional diversity of traits and their relationship to environment. Managing biodiversity data implies dealing with its heterogeneous nature using semantics and tailored ontologies. These are themselves differently conceived, and combining them in semantically enabled applications necessitates an effective alignment between their concepts. This paper describes the ontology matching of biodiversity- and ecology-related ontologies. We illustrate diverse challenges introduced by this kind of ontologies to ontology matching in general. Real use cases requiring pairwise alignments between environment and trait ontologies are introduced. We describe our experience creating a new track at the Ontology Alignment Evaluation Initiative designed for this specific domain and report on the results obtained by state-of-the-art participating systems. The biodiversity and ecology use case turns out to be a strong one for ontology matching, introducing new interesting challenges. Even if most of the matching systems perform relatively well in the proposed matching tasks, there is still room for improvement. We highlight possible directions in that matter and elaborate on our plan to further progress with the track.
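To make the idea of a pairwise alignment concrete, the sketch below shows a naive label-based baseline: each source concept is paired with the target concept whose label shares the most tokens. It is an illustration only, under stated assumptions. It is not one of the matching systems evaluated in the track, and the identifiers, labels, function names, and threshold are all hypothetical.

```python
def tokens(label):
    """Lower-case a label and split it into a set of word tokens."""
    return set(label.lower().replace("-", " ").split())


def jaccard(a, b):
    """Token-level Jaccard similarity between two labels."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def align(source, target, threshold=0.5):
    """Return (source_id, target_id, score) for each source concept whose
    best-scoring target label reaches the threshold."""
    matches = []
    for sid, slabel in source.items():
        best = max(
            ((tid, jaccard(slabel, tlabel)) for tid, tlabel in target.items()),
            key=lambda pair: pair[1],
            default=None,
        )
        if best is not None and best[1] >= threshold:
            matches.append((sid, best[0], best[1]))
    return matches


# Hypothetical environment and trait concepts (not taken from any real ontology).
env = {"E:1": "soil moisture", "E:2": "air temperature"}
trait = {"T:1": "soil moisture content", "T:2": "leaf area"}
alignment = align(env, trait)
```

Purely lexical matching of this kind misses synonyms and structural context, which is part of why the abstract describes the domain as a challenging one for matching systems.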
10
Wegrzyn JL, Falk T, Grau E, Buehler S, Ramnath R, Herndon N. Cyberinfrastructure and resources to enable an integrative approach to studying forest trees. Evol Appl 2020; 13:228-241. [PMID: 31892954 PMCID: PMC6935593 DOI: 10.1111/eva.12860] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/12/2019] [Revised: 08/11/2019] [Accepted: 08/14/2019] [Indexed: 12/19/2022] Open
Abstract
Sequencing technologies and bioinformatic approaches are now available to resolve the challenges associated with complex and heterozygous genomes. Increased access to less expensive and more effective instrumentation will contribute to a wealth of high-quality plant genomes in the next few years. In the meantime, more than 370 tree species are associated with public projects in primary repositories that are interrogating expression profiles, identifying variants, or analyzing targeted capture without a high-quality reference genome. Genomic data from these projects generates sequences that represent intermediate assemblies for transcriptomes and genomes. These data contribute to forest tree biology, but the associated sequence remains trapped in supplemental files that are poorly integrated in plant community databases and comparative genomic platforms. Successful implementation of life science cyberinfrastructure is improving data standards, ontologies, analytic workflows, and integrated database platforms for both model and non-model plant species. Unique to forest trees with large populations that are long-lived, outcrossing, and genetically diverse, the phenotypic and environmental metrics associated with georeferenced populations are just as important as the genomic data sampled for each individual. To address questions related to forest health and productivity, cyberinfrastructure must keep pace with the magnitude of genomic and phenomic sampling of larger populations. This review examines the current landscape of cyberinfrastructure, with an emphasis on best practices and resources to align community data with the Findable, Accessible, Interoperable, and Reusable (FAIR) guidelines.
Affiliation(s)
- Jill L. Wegrzyn
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut
| | - Taylor Falk
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut
| | - Emily Grau
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut
| | - Sean Buehler
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut
| | - Risharde Ramnath
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut
| | - Nic Herndon
- Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, Connecticut
| |
|
11
|
Schneider FD, Fichtmueller D, Gossner MM, Güntsch A, Jochum M, König‐Ries B, Le Provost G, Manning P, Ostrowski A, Penone C, Simons NK. Towards an ecological trait‐data standard. Methods Ecol Evol 2019. [DOI: 10.1111/2041-210x.13288] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Indexed: 11/27/2022]
Affiliation(s)
- Florian D. Schneider
- unaffiliated, c/o Birgitta König‐Ries, Department of Mathematics and Computer Science, Friedrich‐Schiller‐Universität Jena, Jena, Germany
| | - David Fichtmueller
- Botanic Garden and Botanical Museum Berlin, Freie Universität Berlin, Berlin, Germany
| | - Martin M. Gossner
- Forest Entomology, Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
| | - Anton Güntsch
- Botanic Garden and Botanical Museum Berlin, Freie Universität Berlin, Berlin, Germany
| | - Malte Jochum
- Institute of Plant Sciences, University of Bern, Bern, Switzerland
- German Centre for Integrative Biodiversity Research (iDiv) Halle‐Jena‐Leipzig, Leipzig, Germany
- Institute of Biology, Leipzig University, Leipzig, Germany
| | - Birgitta König‐Ries
- Department of Mathematics and Computer Science, Friedrich‐Schiller‐Universität Jena, Jena, Germany
| | - Gaëtane Le Provost
- Senckenberg Biodiversity and Climate Research Centre (BiK‐F), Frankfurt am Main, Germany
| | - Peter Manning
- Senckenberg Biodiversity and Climate Research Centre (BiK‐F), Frankfurt am Main, Germany
| | - Andreas Ostrowski
- Department of Mathematics and Computer Science, Friedrich‐Schiller‐Universität Jena, Jena, Germany
| | - Caterina Penone
- Institute of Plant Sciences, University of Bern, Bern, Switzerland
| | - Nadja K. Simons
- Department of Ecology and Ecosystem Management, Technische Universität München, Freising, Germany
- Ecological Networks, Department of Biology, Technische Universität Darmstadt, Darmstadt, Germany
| |
|
12
|
König C, Weigelt P, Schrader J, Taylor A, Kattge J, Kreft H. Biodiversity data integration-the significance of data resolution and domain. PLoS Biol 2019; 17:e3000183. [PMID: 30883539 PMCID: PMC6445469 DOI: 10.1371/journal.pbio.3000183] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 06/25/2018] [Revised: 04/02/2019] [Indexed: 11/19/2022] Open
Abstract
Recent years have seen an explosion in the availability of biodiversity data describing the distribution, function, and evolutionary history of life on earth. Integrating these heterogeneous data remains a challenge due to large variations in observational scales, collection purposes, and terminologies. Here, we conceptualize widely used biodiversity data types according to their domain (what aspect of biodiversity is described?) and informational resolution (how specific is the description?). Applying this framework to major data providers in biodiversity research reveals a strong focus on the disaggregated end of the data spectrum, whereas aggregated data types remain largely underutilized. We discuss the implications of this imbalance for the scope and representativeness of current macroecological research and highlight the synergies arising from a tighter integration of biodiversity data across domains and resolutions. We lay out effective strategies for data collection, mobilization, imputation, and sharing and summarize existing frameworks for scalable and integrative biodiversity research. Finally, we use two case studies to demonstrate how the explicit consideration of data domain and resolution helps to identify biases and gaps in global data sets and achieve unprecedented taxonomic and geographical data coverage in macroecological analyses. This Essay highlights data resolution as a central property of biodiversity data that affects the precision and representativeness of macroecological inferences. It also discusses ways to maximize synergies among data types and showcases the potential of cross-resolution, cross-domain data integration.
Affiliation(s)
- Christian König
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
- * E-mail:
| | - Patrick Weigelt
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
| | - Julian Schrader
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
| | - Amanda Taylor
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
| | - Jens Kattge
- Research Group Functional Biogeography, Max Planck Institute for Biogeochemistry, Jena, Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Holger Kreft
- Biodiversity, Macroecology & Biogeography, University of Goettingen, Goettingen, Germany
- Centre of Biodiversity and Sustainable Land Use (CBL), University of Goettingen, Goettingen, Germany
| |
|
13
|
Endara L, Thessen AE, Cole HA, Walls R, Gkoutos G, Cao Y, Chong SS, Cui H. Modifier Ontologies for frequency, certainty, degree, and coverage phenotype modifier. Biodivers Data J 2018; 6:e29232. [PMID: 30532623 PMCID: PMC6281706 DOI: 10.3897/bdj.6.e29232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2018] [Accepted: 11/20/2018] [Indexed: 11/21/2022] Open
Abstract
Background: When phenotypic characters are described in the literature, they may be constrained or clarified with additional information such as the location or degree of expression; these terms are called "modifiers". With efforts underway to convert narrative character descriptions to computable data, ontologies for such modifiers are needed. Such ontologies can also be used to guide term usage in future publications. Spatial and method modifiers are the subjects of ontologies that already have been developed or are under development. In this work, frequency (e.g., rarely, usually), certainty (e.g., probably, definitely), degree (e.g., slightly, extremely), and coverage modifiers (e.g., sparsely, entirely) are collected, reviewed, and used to create two modifier ontologies with different design considerations. The basic goal is to express the sequential relationships within a type of modifier, for example, usually is more frequent than rarely, in order to allow data annotated with ontology terms to be classified accordingly. Method: Two designs are proposed for the ontology, both using the list pattern: a closed ordered list (i.e., five-bin design) and an open ordered list design. The five-bin design puts the modifier terms into a set of 5 fixed bins with interval object properties, for example, one_level_more/less_frequently_than, where new terms can only be added as synonyms to existing classes. The open list approach starts with 5 bins, but supports the extensibility of the list via ordinal properties, for example, more/less_frequently_than, allowing new terms to be inserted as a new class anywhere in the list. The consequences of the different design decisions are discussed in the paper. CharaParser was used to extract modifiers from plant, ant, and other taxonomic descriptions. After a manual screening, 130 modifier words were selected as the candidate terms for the modifier ontologies. 
Four curators/experts (three biologists and one information scientist specialized in biosemantics) reviewed and categorized the terms into 20 bins using the Ontology Term Organizer (OTO) (http://biosemantics.arizona.edu/OTO). Inter-curator variations were reviewed and expressed in the final ontologies. Results: Frequency, certainty, degree, and coverage terms with complete agreement among all curators were used as class labels or exact synonyms. Terms with different interpretations were either excluded or included using "broader synonym" or "not recommended" annotation properties. These annotations explicitly allow for the user to be aware of the semantic ambiguity associated with the terms and whether they should be used with caution or avoided. Expert categorization results showed that 16 out of 20 bins contained terms with full agreements, suggesting differentiating the modifiers into 5 levels/bins balances the need to differentiate modifiers and the need for the ontology to reflect user consensus. Two ontologies, developed using the Protege ontology editor, are made available as OWL files and can be downloaded from https://github.com/biosemantics/ontologies. Contribution: We built the first two modifier ontologies following a consensus-based approach with terms commonly used in taxonomic literature. The five-bin ontology has been used in the Explorer of Taxon Concepts web toolkit to compute the similarity between characters extracted from literature to facilitate taxon concepts alignments. The two ontologies will also be used in an ontology-informed authoring tool for taxonomists to facilitate consistency in modifier term usage.
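The difference between the closed five-bin design and the open ordered list described in this abstract can be illustrated with a small data-structure sketch. This is not the authors' OWL implementation: the class `OrderedModifierList`, its methods, and the bin contents other than "rarely" and "usually" are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class OrderedModifierList:
    """Bins of modifier terms ordered from least to most frequent (illustrative only)."""
    bins: list[list[str]] = field(default_factory=list)
    closed: bool = False  # True mimics the five-bin design: no new classes allowed

    def rank(self, term: str) -> int:
        """Return the position of the bin that contains the term."""
        for position, terms in enumerate(self.bins):
            if term in terms:
                return position
        raise KeyError(term)

    def add_synonym(self, existing: str, new_term: str) -> None:
        """Add a term to an existing bin; allowed in both designs."""
        self.bins[self.rank(existing)].append(new_term)

    def insert_class(self, position: int, new_term: str) -> None:
        """Insert a new bin anywhere in the list; only the open design permits this."""
        if self.closed:
            raise ValueError("closed list: new terms may only be added as synonyms")
        self.bins.insert(position, [new_term])

    def more_than(self, a: str, b: str) -> bool:
        """Ordinal comparison, analogous to a more_frequently_than relation."""
        return self.rank(a) > self.rank(b)


# Hypothetical starting bins for frequency modifiers.
start = [["never"], ["rarely"], ["sometimes"], ["usually"], ["always"]]
open_list = OrderedModifierList([list(b) for b in start])
closed_list = OrderedModifierList([list(b) for b in start], closed=True)

open_list.insert_class(2, "occasionally")     # new class between existing bins
closed_list.add_synonym("usually", "commonly")  # closed design: synonym only
```

In this sketch the open list grows to six bins after the insertion, while the closed list keeps exactly five and absorbs the new term as a synonym.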
Affiliation(s)
- Lorena Endara
- University of Florida, Gainesville, United States of America
| | - Anne E Thessen
- The Ronin Institute for Independent Scholarship, Montclair, NJ, United States of America
| | - Heather A Cole
- Science and Technology Branch, Agriculture and Agri-Food Canada, Government of Canada, Ottawa, Canada
| | - Ramona Walls
- CyVerse, Tucson, United States of America
| | - Georgios Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, United Kingdom
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, United Kingdom
| | - Yujie Cao
- Center for Studies of Information Resources, Wuhan University, Wuhan, China
| | - Steven S. Chong
- National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, Santa Barbara, United States of America
- University of Arizona, Tucson, United States of America
| | - Hong Cui
- University of Arizona, Tucson, United States of America
| |
|
14
|
Gkoutos GV, Schofield PN, Hoehndorf R. The anatomy of phenotype ontologies: principles, properties and applications. Brief Bioinform 2018; 19:1008-1021. [PMID: 28387809 PMCID: PMC6169674 DOI: 10.1093/bib/bbx035] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Received: 01/01/2017] [Revised: 02/05/2017] [Indexed: 12/14/2022] Open
Abstract
The past decade has seen an explosion in the collection of genotype data in domains as diverse as medicine, ecology, livestock and plant breeding. Along with this comes the challenge of dealing with the related phenotype data, which is not only large but also highly multidimensional. Computational analysis of phenotypes has therefore become critical for our ability to understand the biological meaning of genomic data in the biological sciences. At the heart of computational phenotype analysis are the phenotype ontologies. A large number of these ontologies have been developed across many domains, and we are now at a point where the knowledge captured in the structure of these ontologies can be used for the integration and analysis of large interrelated data sets. The Phenotype And Trait Ontology framework provides a method for formal definitions of phenotypes and associated data sets and has proved to be key to our ability to develop methods for the integration and analysis of phenotype data. Here, we describe the development and products of the ontological approach to phenotype capture, the formal content of phenotype ontologies and how their content can be used computationally.
Affiliation(s)
| | | | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal
| |
|
15
|
Kissling WD, Walls R, Bowser A, Jones MO, Kattge J, Agosti D, Amengual J, Basset A, van Bodegom PM, Cornelissen JHC, Denny EG, Deudero S, Egloff W, Elmendorf SC, Alonso García E, Jones KD, Jones OR, Lavorel S, Lear D, Navarro LM, Pawar S, Pirzl R, Rüger N, Sal S, Salguero-Gómez R, Schigel D, Schulz KS, Skidmore A, Guralnick RP. Towards global data products of Essential Biodiversity Variables on species traits. Nat Ecol Evol 2018; 2:1531-1540. [PMID: 30224814 DOI: 10.1038/s41559-018-0667-3] [Citation(s) in RCA: 82] [Impact Index Per Article: 13.7] [Received: 02/25/2018] [Accepted: 07/16/2018] [Indexed: 02/03/2023]
Abstract
Essential Biodiversity Variables (EBVs) allow observation and reporting of global biodiversity change, but a detailed framework for the empirical derivation of specific EBVs has yet to be developed. Here, we re-examine and refine the previous candidate set of species traits EBVs and show how traits related to phenology, morphology, reproduction, physiology and movement can contribute to EBV operationalization. The selected EBVs express intra-specific trait variation and allow monitoring of how organisms respond to global change. We evaluate the societal relevance of species traits EBVs for policy targets and demonstrate how open, interoperable and machine-readable trait data enable the building of EBV data products. We outline collection methods, meta(data) standardization, reproducible workflows, semantic tools and licence requirements for producing species traits EBVs. An operationalization is critical for assessing progress towards biodiversity conservation and sustainable development goals and has wide implications for data-intensive science in ecology, biogeography, conservation and Earth observation.
Affiliation(s)
- W Daniel Kissling
- Department of Theoretical and Computational Ecology, Institute for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, Amsterdam, The Netherlands.
| | | | - Anne Bowser
- Woodrow Wilson International Center for Scholars, Washington DC, USA
| | - Matthew O Jones
- University of Montana, W. A. Franke Department of Forestry and Conservation, Missoula, MT, USA
| | - Jens Kattge
- Max Planck Institute for Biogeochemistry, Jena, Germany; German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | | | - Josep Amengual
- Area de Conservacion, Seguimiento y Programas de la Red, Organismo Autonomo Parques Nacionales, Ministerio de Agricultura y Pesca, Madrid, Spain
| | - Alberto Basset
- Department of Biological and Environmental Sciences and Technologies, University of Salento, Lecce, Italy
| | - Peter M van Bodegom
- Institute of Environmental Sciences, Leiden University, Leiden, The Netherlands
| | - Johannes H C Cornelissen
- Systems Ecology, Department of Ecological Science, Vrije Universiteit, Amsterdam, The Netherlands
| | - Ellen G Denny
- USA National Phenology Network, University of Arizona, Tucson, AZ, USA
| | - Salud Deudero
- Instituto Español de Oceanografía, Centro Oceanográfico de Baleares, Palma de Mallorca, Spain
| | | | - Sarah C Elmendorf
- National Ecological Observatory Network, Battelle Ecology, Boulder, CO, USA; Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, CO, USA
| | | | - Katherine D Jones
- National Ecological Observatory Network, Battelle Ecology, Boulder, CO, USA
| | - Owen R Jones
- Department of Biology, University of Southern Denmark, Odense M, Denmark
| | - Sandra Lavorel
- Laboratoire d'Ecologie Alpine, CNRS - Université Grenoble Alpes, Grenoble, France
| | - Dan Lear
- Marine Biological Association of the United Kingdom, Plymouth, Devon, UK
| | - Laetitia M Navarro
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany; Institute of Biology, Martin Luther University Halle Wittenberg, Halle (Saale), Germany
| | - Samraat Pawar
- Department of Life Sciences, Imperial College London, Ascot, Berkshire, UK
| | - Rebecca Pirzl
- CSIRO and Atlas of Living Australia, Canberra, Australian Capital Territory, Australia
| | - Nadja Rüger
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany; Smithsonian Tropical Research Institute, Ancon, Panama
| | - Sofia Sal
- Department of Life Sciences, Imperial College London, Ascot, Berkshire, UK
| | - Roberto Salguero-Gómez
- Department of Zoology, Oxford University, Oxford, UK; Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK; Centre for Biodiversity and Conservation Science, University of Queensland, St Lucia, Queensland, Australia; Evolutionary Demography Laboratory, Max Planck Institute for Demographic Research, Rostock, Germany
| | - Dmitry Schigel
- Global Biodiversity Information Facility (GBIF), Secretariat, Copenhagen, Denmark
| | - Katja-Sabine Schulz
- Smithsonian Institution, National Museum of Natural History, Washington DC, USA
| | - Andrew Skidmore
- Department of Natural Resources, Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, Enschede, The Netherlands; Department of Environmental Science, Macquarie University, New South Wales, Australia
| | - Robert P Guralnick
- Florida Museum of Natural History, University of Florida, Gainesville, FL, USA
| |
|
16
|
Xu D, Chong SS, Rodenhausen T, Cui H. Resolving "orphaned" non-specific structures using machine learning and natural language processing methods. Biodivers Data J 2018:e26659. [PMID: 30393454 PMCID: PMC6207837 DOI: 10.3897/bdj.6.e26659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2018] [Accepted: 07/27/2018] [Indexed: 11/12/2022] Open
Abstract
Scholarly publications of biodiversity literature contain a vast amount of information in human readable format. The detailed morphological descriptions in these publications contain rich information that can be extracted to facilitate analysis and computational biology research. However, the idiosyncrasies of morphological descriptions still pose a number of challenges to machines. In this work, we demonstrate the use of two different approaches to resolve meronym (i.e. part-of) relations between anatomical parts and their anchor organs, including a syntactic rule-based approach and an SVM-based (support vector machine) method. Both methods made use of domain ontologies. We compared the two approaches with two other baseline methods and the evaluation results show the syntactic methods (92.1% F1 score) outperformed the SVM methods (80.7% F1 score) and the part-of ontologies were valuable knowledge sources for the task. It is notable that the mistakes made by the two approaches rarely overlapped. Additional tests will be conducted on the development version of the Explorer of Taxon Concepts toolkit before we make the functionality publicly available. Meanwhile, we will further investigate and leverage the complementary nature of the two approaches to further drive down the error rate, as in practical application, even a 1% error rate could lead to hundreds of errors.
Affiliation(s)
- Dongfang Xu
- University of Arizona, Tucson, United States of America
| | - Steven S Chong
- University of Arizona, Tucson, United States of America; National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, United States of America
| | - Thomas Rodenhausen
- University of Arizona, Tucson, United States of America
| | - Hong Cui
- University of Arizona, Tucson, United States of America
| |
|
17
|
Endara L, Cui H, Burleigh JG. Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing. APPLICATIONS IN PLANT SCIENCES 2018; 6:e1035. [PMID: 29732265 PMCID: PMC5895189 DOI: 10.1002/aps3.1035] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 10/05/2017] [Accepted: 01/31/2018] [Indexed: 05/09/2023]
Abstract
PREMISE OF THE STUDY Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi-automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms. METHODS AND RESULTS Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon-by-character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae. CONCLUSIONS The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.
Affiliation(s)
- Lorena Endara
- Department of Biology, University of Florida, Gainesville, Florida 32611, USA
| | - Hong Cui
- School of Information, University of Arizona, Tucson, Arizona 85719, USA
| | | |
|
18
|
|
19
|
Salhi A, Negrão S, Essack M, Morton MJL, Bougouffa S, Razali R, Radovanovic A, Marchand B, Kulmanov M, Hoehndorf R, Tester M, Bajic VB. DES-TOMATO: A Knowledge Exploration System Focused On Tomato Species. Sci Rep 2017; 7:5968. [PMID: 28729549 PMCID: PMC5519719 DOI: 10.1038/s41598-017-05448-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 01/16/2017] [Accepted: 05/25/2017] [Indexed: 12/29/2022] Open
Abstract
Tomato is the most economically important horticultural crop used as a model to study plant biology and particularly fruit development. Knowledge obtained from tomato research initiated improvements in tomato and, being transferrable to other such economically important crops, has led to a surge of tomato-related research and published literature. We developed the DES-TOMATO knowledgebase (KB) for exploration of information related to tomato. Information exploration is enabled through terms from 26 dictionaries and combinations of these terms. To illustrate the utility of DES-TOMATO, we provide several examples of how one can efficiently use this KB to retrieve known or potentially novel information. DES-TOMATO is free for academic and nonprofit users and can be accessed at http://cbrc.kaust.edu.sa/des_tomato/, using any of the mainstream web browsers, including Firefox, Safari and Chrome.
Affiliation(s)
- Adil Salhi
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Sónia Negrão
- King Abdullah University of Science and Technology (KAUST), Division of Biological and Environmental Sciences and Engineering, Thuwal, 23955-6900, Saudi Arabia
| | - Magbubah Essack
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Mitchell J L Morton
- King Abdullah University of Science and Technology (KAUST), Division of Biological and Environmental Sciences and Engineering, Thuwal, 23955-6900, Saudi Arabia
| | - Salim Bougouffa
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Rozaimi Razali
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Aleksandar Radovanovic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | | | - Maxat Kulmanov
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Robert Hoehndorf
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- King Abdullah University of Science and Technology (KAUST), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Thuwal, 23955-6900, Saudi Arabia
| | - Mark Tester
- King Abdullah University of Science and Technology (KAUST), Division of Biological and Environmental Sciences and Engineering, Thuwal, 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.
- King Abdullah University of Science and Technology (KAUST), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Thuwal, 23955-6900, Saudi Arabia.
| |
|