1
|
Thompson KM, Turnbull R, Fitzgerald E, Birch JL. Identification of herbarium specimen sheet components from high-resolution images using deep learning. Ecol Evol 2023; 13:e10395. [PMID: 37589042 PMCID: PMC10425611 DOI: 10.1002/ece3.10395] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/17/2023] [Revised: 07/10/2023] [Accepted: 07/24/2023] [Indexed: 08/18/2023] Open
Abstract
Advanced computer vision techniques hold the potential to mobilise vast quantities of biodiversity data by facilitating the rapid extraction of text- and trait-based data from herbarium specimen digital images, and to increase the efficiency and accuracy of downstream data capture during digitisation. This investigation developed an object detection model using YOLOv5 and digitised collection images from the University of Melbourne Herbarium (MELU). The MELU-trained 'sheet-component' model (trained on 3371 annotated images, validated on 1000 annotated images, and run using the 'large' model type at 640 pixels for 200 epochs) successfully identified most of the 11 component types of the digital specimen images, with an overall model precision of 0.983, recall of 0.969 and mean average precision (mAP0.5-0.95) of 0.847. Specifically, 'institutional' and 'annotation' labels were predicted with mAP0.5-0.95 of 0.970 and 0.878 respectively. Annotating at least 2000 images was required to train an adequate model, likely due to the heterogeneity of specimen sheets. The full model was then applied to selected specimens from nine global herbaria (Biodiversity Data Journal, 7, 2019) to quantify its generalisability: for example, the 'institutional label' was identified with mAP0.5-0.95 of between 0.68 and 0.89 across the various herbaria. Further detailed study demonstrated that starting with the MELU-model weights and retraining for as few as 50 epochs on 30 additional annotated images was sufficient to enable the prediction of a previously unseen component. As many herbaria are resource-constrained, the MELU-trained 'sheet-component' model weights are made available and their application is encouraged.
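For readers unfamiliar with the metric, mAP0.5-0.95 averages the average precision over intersection-over-union (IoU) thresholds from 0.5 to 0.95 in steps of 0.05. The following is a minimal sketch of the IoU computation and the threshold grid only, assuming boxes given as (x1, y1, x2, y2) corner coordinates. The function names are illustrative and are not taken from the paper or from YOLOv5, and class matching and the precision-recall integration are left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def map_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matches_at(pred, truth, threshold):
    """A predicted box counts as a hit at a threshold if its IoU reaches it."""
    return iou(pred, truth) >= threshold
```

A prediction that overlaps its ground-truth box loosely can therefore count as correct at the 0.5 threshold and as a miss at 0.95, which is why mAP0.5-0.95 is typically lower than precision and recall reported at a single threshold.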
|
2
|
Tarride S, Maarand M, Boillet M, McGrath J, Capel E, Vézina H, Kermorvant C. Large-scale genealogical information extraction from handwritten Quebec parish records. INT J DOC ANAL RECOG 2023. [DOI: 10.1007/s10032-023-00427-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2023]
|
3
|
Woolland O, Brack P, Soiland-Reyes S, Scott B, Livermore L. Incrementally building FAIR Digital Objects with Specimen Data Refinery workflows. RESEARCH IDEAS AND OUTCOMES 2022. [DOI: 10.3897/rio.8.e94349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Specimen Data Refinery (SDR) is a developing platform for automating transcription of specimens from natural history collections (Hardisty et al. 2022). SDR is based on computational workflows and digital twins using FAIR Digital Objects.
We show our recent experiences with building SDR using the Galaxy workflow system and combining two FDO methodologies with open digital specimens (openDS) and RO-Crate data packaging. We suggest FDO improvements for incremental building of digital objects in computational workflows.
SDR workflows
SDR is realised as the workflow system Galaxy (Afgan et al. 2018) with SDR tools installed. An Open Research challenge is that some tools have machine learning models with a commercial licence. This complicates publishing to the Galaxy toolshed; however, we created Ansible scripts to install equivalent Galaxy servers, including tools and dependencies, accounts and workflows. SDR workflows are published in WorkflowHub as FDOs.
We implemented the use case De novo digitization in Galaxy (Brack et al. 2022). As shown in Fig. 1, the workflow steps exchange openDS JSON (Hardisty et al. 2019) for incremental completion of a digital specimen. Initial stages build a template openDS from a CSV with metadata and image references; subsequent analysis completes the rest of the JSON with regions of interest, text digitised from handwriting, and recognized named entities.
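The initial stage can be pictured as a small CSV-to-JSON conversion. The sketch below is only an illustration: the column names (`catalog_number`, `image_url`) and JSON keys (`catalogNumber`, `images`) are hypothetical and are not taken from the openDS schema or from the SDR tools.

```python
import csv
import io
import json


def template_from_row(row):
    """Build a partial openDS-like dict from one CSV row.

    Only the initial metadata is set here; keys such as regions, text or
    entities are deliberately absent until a later step adds them.
    All key names are hypothetical.
    """
    return {
        "catalogNumber": row["catalog_number"],
        "images": [{"url": row["image_url"]}],
    }


def templates_from_csv(csv_text):
    """Return one template object per data row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [template_from_row(row) for row in reader]


if __name__ == "__main__":
    sample = "catalog_number,image_url\nX1,https://example.org/x1.jpg\n"
    print(json.dumps(templates_from_csv(sample), indent=2))
```

Leaving later keys out, rather than setting them to empty values, lets downstream steps tell "not yet processed" apart from "processed, nothing found".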
Galaxy can visualise the outputs of each step (Fig. 2), which is important for making the FDOs understandable to domain experts and for verifying the accuracy of SDR.
We are adding workflows for partial stages, e.g. detection of regions (Livermore and Woolland 2022a) and hand-written text recognition (Livermore and Woolland 2022b), which we'll combine with scalability testing and wider testing by project users. Additional workflows will enhance existing FDOs and use new tools such as barcode detection of museums’ internal identifiers.
We are now ready to publish digital specimens as FAIR Digital Objects, with registration into DiSSCo repositories, PID assignment and workflow provenance. However, even at this early stage we have identified several challenges that need to be addressed.
FDO lessons
We highlight the De novo use case because this workflow is exchanging partial FDOs – openDS objects which are not fully completed and not yet assigned persistent identifiers. openDS schemas are still in development, therefore SDR uses a more flexible JSON schema where only the initial metadata (populated from CSV) are required. Each step validates the partial FDO before passing it to the underlying command line tool.
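The validate-then-run pattern described above could look roughly like the following. This is a sketch under stated assumptions, not the SDR implementation: the required key names are hypothetical, the check is a plain key-presence test rather than full JSON Schema validation, and `tool` stands in for the wrapped command line tool as any Python callable.

```python
# Hypothetical names for the initial metadata populated from the CSV.
REQUIRED_INITIAL_KEYS = ("catalogNumber", "images")


def validate_partial(fdo, extra_required=()):
    """Return a list of problems; an empty list means the object is acceptable."""
    if not isinstance(fdo, dict):
        return ["object is not a JSON object"]
    problems = []
    for key in tuple(REQUIRED_INITIAL_KEYS) + tuple(extra_required):
        if key not in fdo:
            problems.append(f"missing required key: {key}")
    return problems


def run_step(fdo, tool, extra_required=()):
    """Validate the partial object first, then pass it to the wrapped tool."""
    problems = validate_partial(fdo, extra_required)
    if problems:
        raise ValueError("; ".join(problems))
    return tool(fdo)
```

Only the initial metadata is mandatory for every step; a step that needs more (for example digitised text) passes the additional keys through `extra_required`.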
Although workflow steps exchange openDS objects, they cannot be combined in any order. For instance, named entity recognition requires digitised text in the FDO. We can consider these intermediate steps as sub-profiles of an FDO Type. Unlike hierarchical subclasses, these FDO profiles are more like duck typing. For instance, a text detection step may only require the regions key, but semantically there is no requirement for an OpenDSWithText to be a subclass of OpenDSWithRegion, as text can also be transcribed manually without regions.
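One way to picture these duck-typed profiles is to have each step declare the keys it needs and the keys it adds, and to check a proposed ordering against those declarations. The step names and keys below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical step declarations: keys each step needs and keys it adds.
STEPS = {
    "detect_regions": {"requires": {"images"}, "provides": {"regions"}},
    "recognise_text": {"requires": {"regions"}, "provides": {"text"}},
    "find_entities": {"requires": {"text"}, "provides": {"entities"}},
}


def check_order(initial_keys, order, steps=STEPS):
    """Return the first step whose required keys are not yet available, or None."""
    available = set(initial_keys)
    for name in order:
        missing = steps[name]["requires"] - available
        if missing:
            return name
        available |= steps[name]["provides"]
    return None
```

Because the check looks only at which keys are present, an object whose text was transcribed manually satisfies `find_entities` even though it never passed through region detection, which matches the point that OpenDSWithText need not be a subclass of OpenDSWithRegion.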
Similarly, we found that some steps can be executed in parallel, but this requires merging of partial FDOs. This can be achieved by combining JSON queries and JSON Schemas, but indicates that it may be more beneficial to have FDO fragments as separate objects. Adding openDS fragment steps would however complicate workflows.
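To make the merging problem concrete, the sketch below combines two partial objects produced in parallel from the same starting object. It is a plain shallow dictionary merge written for illustration, not the JSON query and JSON Schema approach the authors describe, and the key names in the usage example are hypothetical.

```python
_MISSING = object()  # sentinel for "key not present in the base object"


def merge_partial(base, left, right):
    """Shallow-merge two objects derived in parallel from the same base.

    A key counts as changed by a branch if its value differs from the base.
    If both branches changed the same key to different values, raise.
    """
    merged = dict(base)
    for key in set(left) | set(right):
        base_value = base.get(key, _MISSING)
        left_changed = key in left and left[key] != base_value
        right_changed = key in right and right[key] != base_value
        if left_changed and right_changed and left[key] != right[key]:
            raise ValueError(f"conflicting values for key: {key}")
        if left_changed:
            merged[key] = left[key]
        elif right_changed:
            merged[key] = right[key]
    return merged
```

For example, merging a branch that added `regions` with a branch that added `barcode` yields an object carrying both keys, while two branches that wrote different `text` values are reported as a conflict rather than silently overwritten.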
Several of our tools process the referenced images, which are currently https URLs in openDS. We added a caching layer to avoid repeated image downloading, coupled with local file-path wiring in the workflow. A similar challenge occurs when accessing image data using DOIP, which, unlike HTTP, has no caching mechanisms.
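A caching layer of this kind can be sketched as a small class that maps each URL to a local file and downloads only on a miss. This is an assumption-laden illustration rather than the SDR code: the class name is hypothetical, and the download function is injected as a callable so that the sketch does not depend on any particular HTTP or DOIP client.

```python
import hashlib
from pathlib import Path


class ImageCache:
    """Fetch each URL at most once and hand back a local file path."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # injected callable: url -> bytes (hypothetical)

    def path_for(self, url):
        """Derive a stable file name from the URL."""
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def get(self, url):
        """Return the local path for a URL, downloading only on a cache miss."""
        path = self.path_for(url)
        if not path.exists():
            path.write_bytes(self.fetch(url))
        return path
```

Workflow steps can then be wired to the returned local paths instead of the original URLs, so several tools reading the same image trigger a single download.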
RO-Crate lessons
Galaxy is developing support for importing and exporting Workflow Run Crates, a profile of RO-Crate (Soiland-Reyes et al. 2022b) that captures the execution history of a workflow, including its definition and intermediate data (De Geest et al. 2022). SDR is adopting this support to combine openDS FDOs with workflow provenance, as envisioned by Walton et al. (2020).
Our prototype de novo workflow returns results as a ZIP file of openDS objects. End-users should also get copies of the referenced images and generated visualisations, along with workflow execution metadata. We are investigating ways to embed the preliminary Galaxy workflow history before the final step, so that this result can be an enriched RO-Crate.
Conclusions
SDR is an example of machine-assisted construction of FDOs, which highlights the need for intermediate digital objects that are not yet FDO compliant. The passing of such “local FDOs” is beneficial not just for efficiency and visual inspection, but also to simplify workflow composition of canonical workflow building blocks. At the same time we see that it is insufficient to only pass FDOs as JSON objects, as they also have references to other data such as images, which should not need to be re-downloaded.
Further work will investigate the use of RO-Crate as a wrapper of partial FDOs, but this needs to be coupled with more flexible FDO types as profiles, in order to restrict “impossible” ordering of steps depending on particular inner FDO fragments. A distinction needs to be made between open digital specimens that are in “draft” state and those that can be pushed to DiSSCo registries.
We are experimenting with changing the SDR components into Canonical Workflow Building Blocks (Soiland-Reyes et al. 2022a) using the Common Workflow Language (Crusoe et al. 2022). This gives flexibility to scalably execute SDR workflows on different compute backends such as HPC or local cluster, without the additional setup of Galaxy servers.
|
4
|
Groom Q, Bräuchler C, Cubey RWN, Dillen M, Huybrechts P, Kearney N, Klazenga N, Leachman S, Paul DL, Rogers H, Santos J, Shorthouse DP, Vaughan A, von Mering S, Haston EM. The disambiguation of people names in biological collections. Biodivers Data J 2022; 10:e86089. [PMID: 36761559 PMCID: PMC9836581 DOI: 10.3897/bdj.10.e86089] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/05/2022] [Accepted: 08/26/2022] [Indexed: 11/12/2022] Open
Abstract
Scientific collections have been built by people. For hundreds of years, people have collected, studied, identified, preserved, documented and curated collection specimens. Understanding who those people are is of interest to historians, but much more can be made of these data by other stakeholders once they have been linked to the people's identities and their biographies. Knowing who people are helps us attribute work correctly, validate data and understand the scientific contribution of people and institutions. We can evaluate the work they have done, the interests they have, the places they have worked and what they have created from the specimens they have collected. The problem is that all we know about most of the people associated with collections are their names written on specimens. Disambiguating these people is the challenge that this paper addresses. Disambiguation of people often proves difficult in isolation and can result in staff or researchers independently trying to determine the identity of specific individuals over and over again. By sharing biographical data and building an open, collectively maintained dataset with shared knowledge, expertise and resources, it is possible to collectively deduce the identities of individuals, aggregate biographical information for each person, reduce duplication of effort and share the information locally and globally. The authors of this paper aspire to disambiguate all person names efficiently and fully in all their variations across the entirety of the biological sciences, starting with collections. 
Towards that vision, this paper has three key aims: to improve the linking, validation, enhancement and valorisation of person-related information within and between collections, databases and publications; to suggest good practice for identifying people involved in biological collections; and to promote coordination amongst all stakeholders, including individuals, natural history collections, institutions, learned societies, government agencies and data aggregators.
Affiliation(s)
- Quentin Groom: Meise Botanic Garden, Meise, Belgium
- Christian Bräuchler: Naturhistorisches Museum Wien, Wien, Austria
- Robert W. N. Cubey: Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom
- Mathias Dillen: Meise Botanic Garden, Meise, Belgium
- Pieter Huybrechts: Meise Botanic Garden, Meise, Belgium
- Nicole Kearney: Biodiversity Heritage Library (BHL) Australia, Melbourne, Australia
- Niels Klazenga: Royal Botanic Gardens Victoria, Melbourne, Australia
- Siobhan Leachman: Independent Researcher, Wellington, New Zealand
- Deborah L Paul: University of Illinois, Champaign, United States of America; Florida State University, Tallahassee, United States of America
- Heather Rogers: McGill University, Montreal, Canada
- Joaquim Santos: Centre for Functional Ecology, Department of Life Sciences, University of Coimbra, Coimbra, Portugal
- David Peter Shorthouse: Agriculture & Agri-Food Canada, Ottawa, Canada
- Alison Vaughan: Royal Botanic Gardens Victoria, Melbourne, Australia
- Sabine von Mering: Museum für Naturkunde, Leibniz Institute for Evolution and Biodiversity Science, Berlin, Germany
- Elspeth M Haston: Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom
|
5
|
Hussein BR, Malik OA, Ong WH, Slik JWF. Applications of computer vision and machine learning techniques for digitized herbarium specimens: A systematic literature review. ECOL INFORM 2022. [DOI: 10.1016/j.ecoinf.2022.101641] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/03/2022]
|
6
|
Hardisty A, Brack P, Goble C, Livermore L, Scott B, Groom Q, Owen S, Soiland-Reyes S. The Specimen Data Refinery: A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections. DATA INTELLIGENCE 2022. [DOI: 10.1162/dint_a_00134] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/04/2022] Open
Abstract
A key limiting factor in organising and using information from physical specimens curated in natural science collections is making that information computable, with institutional digitization tending to focus more on imaging the specimens themselves than on efficiently capturing computable data about them. Label data are still largely transcribed manually, at high cost and low throughput, which limits what many collection-holding institutions can achieve at current funding levels. We show how computer vision, optical character recognition, handwriting recognition, named entity recognition and language translation technologies can be implemented into canonical workflow component libraries with findable, accessible, interoperable, and reusable (FAIR) characteristics. These libraries are being developed in a cloud-based workflow platform, the ‘Specimen Data Refinery’ (SDR), founded on the Galaxy workflow engine, Common Workflow Language, Research Object Crates (RO-Crate) and WorkflowHub technologies. The SDR can be applied to specimens’ labels and other artefacts, offering the prospect of greatly accelerated and more accurate data capture in computable form. Two kinds of FAIR Digital Objects (FDO) are created by packaging outputs of SDR workflows and workflow components as digital objects with metadata, a persistent identifier, and a specific type definition. The first kind of FDO are computable Digital Specimen (DS) objects that can be consumed and produced by workflows and other applications. A single DS is the input data structure submitted to a workflow; it is modified by each workflow component in turn to produce a refined DS at the end. The Specimen Data Refinery provides a library of such components that can be used individually or in series. To work together, each library component describes the fields it requires from the DS and the fields it will in turn populate or enrich.
The second kind of FDO, RO-Crates, gather and archive the diverse set of digital and real-world resources, configurations, and actions (the provenance) contributing to a unit of research work, allowing that work to be faithfully recorded and reproduced. Here we describe the Specimen Data Refinery with its motivating requirements, focusing on what is essential in the creation of canonical workflow component libraries and its conformance with the requirements of an emerging FDO Core Specification being developed by the FDO Forum.
Affiliation(s)
- Alex Hardisty: School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK
- Paul Brack: The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK
- Carole Goble: The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK
- Ben Scott: The Natural History Museum, London SW7 5BD, UK
- Stuart Owen: The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK
- Stian Soiland-Reyes: The Department of Computer Science, The University of Manchester, Manchester M13 9PL, UK; Informatics Institute, Faculty of Science, University of Amsterdam, 1090 GH Amsterdam, The Netherlands
|
7
|
Greeff M, Caspers M, Kalkman V, Willemse L, Sunderland B, Bánki O, Hogeweg L. Sharing taxonomic expertise between natural history collections using image recognition. RESEARCH IDEAS AND OUTCOMES 2022. [DOI: 10.3897/rio.8.e79187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Natural history collections play a vital role in biodiversity research and conservation by providing a window to the past. The usefulness of the vast amount of historical data depends on their quality, with correct taxonomic identifications being the most critical. The identification of many of the objects of natural history collections, however, is wanting, doubtful or outdated. Providing correct identifications is difficult given the sheer number of objects and the scarcity of expertise. Here we outline the construction of an ecosystem for the collaborative development and exchange of image recognition algorithms designed to support the identification of objects. Such an ecosystem will facilitate sharing taxonomic expertise among institutions by offering image datasets that are correctly identified by their in-house taxonomic experts. Together with openly accessible machine learning algorithms and easy to use workbenches, this will allow other institutes to train image recognition algorithms and thereby compensate for the lacking expertise.
|
8
|
Albani Rocchetti G, Armstrong CG, Abeli T, Orsenigo S, Jasper C, Joly S, Bruneau A, Zytaruk M, Vamosi JC. Reversing extinction trends: new uses of (old) herbarium specimens to accelerate conservation action on threatened species. THE NEW PHYTOLOGIST 2021; 230:433-450. [PMID: 33280123 DOI: 10.1111/nph.17133] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 08/06/2020] [Accepted: 11/22/2020] [Indexed: 05/29/2023]
Abstract
Although often not collected specifically for the purposes of conservation, herbarium specimens offer sufficient information to reconstruct parameters that are needed to designate a species as 'at-risk' of extinction. While such designations should prompt quick and efficient legal action towards species recovery, such action often lags far behind and is mired in bureaucratic procedure. The increase in online digitization of natural history collections has now led to a surge in the number of new studies on the uses of machine learning. These repositories of species occurrences are now equipped with advances that allow for the identification of rare species. The increase in attention devoted to estimating the scope and severity of the threats that lead to the decline of such species will increase our ability to mitigate these threats and reverse the declines, overcoming a current barrier to the recovery of many threatened plant species. Thus far, collected specimens have been used to fill gaps in systematics, range extent, and past genetic diversity. We find that they also offer material with which it is possible to foster species recovery, ecosystem restoration, and de-extinction, and these elements should be used in conjunction with machine learning and citizen science initiatives to mobilize as large a force as possible to counter current extinction trends.
Affiliation(s)
- Thomas Abeli: Department of Science, University Roma Tre, Viale G. Marconi 446, Roma, 00154, Italy
- Simone Orsenigo: Department of Earth and Environmental Sciences, University of Pavia, Pavia, 27100, Italy
- Caroline Jasper: Department of Biological Sciences, University of Calgary, Calgary, AB, T2N 1N4, Canada
- Simon Joly: Montreal Botanical Garden, Montréal, QC, H1X 2B2, Canada; Département de Sciences Biologiques and Institut de Recherche en Biologie Végétale, Université de Montréal, Montréal, QC, H1X 2B2, Canada
- Anne Bruneau: Département de Sciences Biologiques and Institut de Recherche en Biologie Végétale, Université de Montréal, Montréal, QC, H1X 2B2, Canada
- Maria Zytaruk: Department of English, University of Calgary, Calgary, AB, T2N 1N4, Canada
- Jana C Vamosi: Department of Biological Sciences, University of Calgary, Calgary, AB, T2N 1N4, Canada
|
9
|
Hardy H, van Walsum M, Livermore L, Walton S. Research and development in robotics with potential to automate handling of biological collections. RESEARCH IDEAS AND OUTCOMES 2020. [DOI: 10.3897/rio.6.e61366] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open
Abstract
This report investigates the current state of physical (mechanical) robotics, automated warehousing approaches and assistive technologies in relation to the storage, handling and processing (particularly digitisation) of natural history collections.
Robotics can sound futuristic; however, we provide case studies that show many and growing examples of physical automation in the natural history and cultural heritage sectors, including barcodes and conveyor belts for digitisation; robots that handle multiple vials for molecular and genetic work; robots for use in display or exhibition contexts; and automated warehousing of library collections. We provide a non-exhaustive example of an end-to-end workflow of storage, retrieval and processing and discuss aspects of the tools and challenges relevant to these stages. The Distributed System of Scientific Collections (DiSSCo), a new Research Infrastructure for natural science collections, should build on this, leading a future programme of pilots that develop understanding of independent stages, and can be connected to make progress towards end-to-end solutions.
Robots, or automated systems, excel at repetitive tasks, and are developing rapidly to be able to handle more complex object types, at lower cost. High volume, high variety of objects, and considerations such as fragility are not unique to the natural history sector - they apply for example to major retail operations - however natural history collections do offer some of the more extreme examples of these challenges, and in particular are not replaceable. Increased consistency of storage units is likely to be a critical factor in enabling automated handling in future, as well as looking at automation possibilities when new collections storage spaces are developed and built. Engagement with industry and subject matter experts has been patchy and again we recommend that DiSSCo help to ensure a joined up engagement with the right incentives in place, and with clear communication of requirements and challenges for shared R&D.
When examining return on investment for particular automation, collections-holding institutions need to consider not only time and cost of automation compared to human labour, but wider factors including: health and safety such as physical environment and repetitive strain injury; security; quality and consistency of outputs; degree of criticality in response times (e.g. if digitising on demand); effective use of spaces; and freeing up staff to conduct other tasks.
Purely software-based automation is outside the scope of this report, but is also in increasing use and has enormous potential, for example to transform the extraction of label and specimen data at scale from images. The challenges of managing and digitising collections at scale under DiSSCo are likely to require a combination of hardware and software automation approaches.
|