1. Soiland-Reyes S, Goble C, Groth P. Evaluating FAIR Digital Object and Linked Data as distributed object systems. PeerJ Comput Sci 2024; 10:e1781. PMID: 38855229; PMCID: PMC11157569; DOI: 10.7717/peerj-cs.1781.
Abstract
FAIR Digital Object (FDO) is an emerging concept that is highlighted by the European Open Science Cloud (EOSC) as a potential candidate for building an ecosystem of machine-actionable research outputs. In this work we systematically evaluate FDO and its implementations as a global distributed object system, using five different conceptual frameworks that cover interoperability, middleware, FAIR principles, EOSC requirements and the FDO guidelines themselves. We compare the FDO approach with established Linked Data practices and the existing Web architecture, and provide a brief history of the Semantic Web while discussing why these technologies may have been difficult to adopt for FDO purposes. We conclude with recommendations for both the Linked Data and FDO communities to further their adaptation and alignment.
Affiliation(s)
- Stian Soiland-Reyes: Department of Computer Science, The University of Manchester, Manchester, UK; Informatics Institute, University of Amsterdam, Amsterdam, Netherlands
- Carole Goble: Department of Computer Science, The University of Manchester, Manchester, UK
- Paul Groth: Informatics Institute, University of Amsterdam, Amsterdam, Netherlands
2. de Koning K, Broekhuijsen J, Kühn I, Ovaskainen O, Taubert F, Endresen D, Schigel D, Grimm V. Digital twins: dynamic model-data fusion for ecology. Trends Ecol Evol 2023; 38:916-926. PMID: 37208222; DOI: 10.1016/j.tree.2023.04.010.
Abstract
Digital twins (DTs) are an emerging phenomenon in the public and private sectors as a new tool to monitor and understand systems and processes. DTs have the potential to change the status quo in ecology as part of its digital transformation. However, it is important to avoid misguided developments by managing expectations about DTs. We stress that DTs are not just big models of everything, containing big data and machine learning. Rather, the strength of DTs is in combining data, models, and domain knowledge, and their continuous alignment with the real world. We suggest that researchers and stakeholders exercise caution in DT development, keeping in mind that many of the strengths and challenges of computational modelling in ecology also apply to DTs.
Affiliation(s)
- Koen de Koning: Wageningen University and Research, Environmental Systems Analysis Group, P.O. Box 47, 6700 AA Wageningen, The Netherlands
- Jeroen Broekhuijsen: Nederlandse Organisatie voor toegepast-natuurwetenschappelijk onderzoek (TNO), Department of Monitoring & Control Services, Eemsgolaan 3, 9727 DW Groningen, The Netherlands
- Ingolf Kühn: Helmholtz Centre for Environmental Research - UFZ, Department of Community Ecology, Theodor-Lieser-Strasse 4, 06120 Halle, Germany; Martin Luther University Halle-Wittenberg, Institute for Biology/Geobotany & Botanical Garden, Große Steinstraße 79/80, 06108 Halle, Germany; German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Puschstrasse 4, 04103 Leipzig, Germany
- Otso Ovaskainen: Department of Biological and Environmental Science, University of Jyväskylä, P.O. Box 35 (Survontie 9C), FI-40014 Jyväskylä, Finland; Organismal and Evolutionary Biology Research Programme, Faculty of Biological and Environmental Sciences, University of Helsinki, P.O. Box 65, 00014 Helsinki, Finland; Department of Biology, Centre for Biodiversity Dynamics, Norwegian University of Science and Technology, N-7491 Trondheim, Norway
- Franziska Taubert: Helmholtz Centre for Environmental Research - UFZ, Department of Ecological Modelling, Permoserstr. 15, 04318 Leipzig, Germany
- Dag Endresen: University of Oslo, Natural History Museum, Sars gate 1, NO-0562 Oslo, Norway
- Dmitry Schigel: Global Biodiversity Information Facility (GBIF) Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
- Volker Grimm: German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Puschstrasse 4, 04103 Leipzig, Germany; Helmholtz Centre for Environmental Research - UFZ, Department of Ecological Modelling, Permoserstr. 15, 04318 Leipzig, Germany; University of Potsdam, Plant Ecology and Nature Conservation, Am Mühlenberg 3, 14476 Potsdam, Germany
3. Woolland O, Brack P, Soiland-Reyes S, Scott B, Livermore L. Incrementally building FAIR Digital Objects with Specimen Data Refinery workflows. Research Ideas and Outcomes 2022; 8:e94349. DOI: 10.3897/rio.8.e94349. Open access.
Abstract
Specimen Data Refinery (SDR) is a developing platform for automating transcription of specimens from natural history collections (Hardisty et al. 2022). SDR is based on computational workflows and digital twins using FAIR Digital Objects.
We show our recent experiences with building SDR using the Galaxy workflow system and combining two FDO methodologies with open digital specimens (openDS) and RO-Crate data packaging. We suggest FDO improvements for incremental building of digital objects in computational workflows.
SDR workflows
SDR is realised as the workflow system Galaxy (Afgan et al. 2018) with SDR tools installed. An Open Research challenge is that some tools depend on machine-learning models under a commercial licence. This complicates publishing to the Galaxy ToolShed; instead, we created Ansible scripts that install equivalent Galaxy servers, including tools, dependencies, accounts and workflows. SDR workflows are published in WorkflowHub as FDOs.
We implemented the use case De novo digitization in Galaxy (Brack et al. 2022). As shown in Fig. 1, the workflow steps exchange openDS JSON (Hardisty et al. 2019) for incremental completion of a digital specimen. Initial stages build a template openDS from a CSV with metadata and image references; subsequent analysis completes the rest of the JSON with regions of interest, text digitised from handwriting, and recognised named entities.
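A minimal sketch of this template-building stage in Python; the `ods:` key names and CSV columns are illustrative placeholders of our own, not the official openDS schema (which is still in development):

```python
import csv
import io
import json

def template_opends(row):
    """Build a minimal partial openDS object from one CSV row.

    Key names are illustrative, not the official openDS schema."""
    return {
        "ods:specimenName": row["name"],
        "ods:institution": row["institution"],
        "ods:mediaUrls": [u for u in row["images"].split(";") if u],
        # later workflow steps add regions of interest, digitised
        # text and recognised named entities
    }

csv_text = "name,institution,images\nLuzula campestris,NHM,https://example.org/img1.jpg\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
specimen = template_opends(rows[0])
print(json.dumps(specimen, indent=2))
```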
Galaxy can visualise the outputs of each step (Fig. 2), which is important for making the FDOs understandable by domain experts and for verifying the accuracy of SDR.
We are adding workflows for partial stages, e.g. detection of regions (Livermore and Woolland 2022a) and handwritten-text recognition (Livermore and Woolland 2022b), which we will combine with scalability testing and wider testing by project users. Additional workflows will enhance existing FDOs and use new tools, such as barcode detection of museums’ internal identifiers.
We are now ready to publish digital specimens as FAIR Digital Objects, with registration into DiSSCo repositories, PID assignment and workflow provenance. However, even at this early stage we have identified several challenges that need to be addressed.
FDO lessons
We highlight the De novo use case because this workflow exchanges partial FDOs: openDS objects that are not fully completed and have not yet been assigned persistent identifiers. openDS schemas are still in development; therefore SDR uses a more flexible JSON Schema in which only the initial metadata (populated from the CSV) is required. Each step validates the partial FDO before passing it to the underlying command-line tool.
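As a stand-in for that flexible JSON Schema, the relaxed validation can be illustrated with a hand-rolled required-keys check (key names again illustrative, not the real schema):

```python
# Only the initial metadata (populated from the CSV) is mandatory;
# keys added by later workflow steps may be absent in a partial FDO.
REQUIRED_INITIAL = {"ods:specimenName", "ods:institution"}

def validate_partial_fdo(obj):
    """Raise if a partial FDO lacks the mandatory initial metadata."""
    missing = REQUIRED_INITIAL - obj.keys()
    if missing:
        raise ValueError(f"partial FDO missing initial metadata: {sorted(missing)}")
    return obj

partial = {"ods:specimenName": "Luzula campestris", "ods:institution": "NHM"}
validate_partial_fdo(partial)  # passes even though regions/text are absent
```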
Although workflow steps exchange openDS objects, they cannot be combined in any order. For instance, named entity recognition requires digitised text in the FDO. We can consider these intermediate steps as sub-profiles of an FDO Type. Unlike hierarchical subclasses, these FDO profiles behave more like duck typing. For instance, a text detection step may only require the regions key, but semantically there is no requirement for an OpenDSWithText to be a subclass of OpenDSWithRegion, as text can also be transcribed manually, without regions.
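This duck-typed notion of step profiles can be sketched as a per-step set of required keys, with no subclass relation between profiles (step and key names are illustrative):

```python
# Each step declares which openDS keys it needs (duck typing);
# there is deliberately no hierarchy between "has text" and "has regions".
STEP_REQUIRES = {
    "region_detection": {"ods:mediaUrls"},
    "text_recognition": {"ods:regions"},
    "named_entity_recognition": {"ods:text"},
}

def can_run(step, obj):
    """A step may run iff the partial FDO already has the keys it needs."""
    return STEP_REQUIRES[step] <= obj.keys()

# Text may have been transcribed manually, so this object has text
# but no regions:
obj = {"ods:text": "Leg. A. Smith, 1932"}
print(can_run("named_entity_recognition", obj))  # True
print(can_run("text_recognition", obj))          # False
```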
Similarly, we found that some steps can be executed in parallel, but this requires merging the resulting partial FDOs. This can be achieved by combining JSON queries and JSON Schemas, but it indicates that it may be more beneficial to have FDO fragments as separate objects. Adding openDS fragment steps would, however, complicate workflows.
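The merge of fragments produced by parallel steps can be sketched as a recursive dictionary merge, under the assumption that parallel steps never produce contradictory values for the same key (key names illustrative):

```python
def merge_partial_fdos(a, b):
    """Recursively merge two partial openDS objects from parallel steps.

    Conflicting scalar values raise, since parallel steps should not
    produce contradictory fragments."""
    merged = dict(a)
    for key, value in b.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_partial_fdos(merged[key], value)
        elif merged[key] != value:
            raise ValueError(f"conflicting values for {key!r}")
    return merged

regions_result = {"ods:specimenName": "Luzula campestris",
                  "ods:regions": [[10, 20, 200, 80]]}
barcode_result = {"ods:specimenName": "Luzula campestris",
                  "ods:barcode": "NHM-0012345"}
merged = merge_partial_fdos(regions_result, barcode_result)
```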
Several of our tools process the referenced images, currently https URLs in openDS. We added a caching layer to avoid repeated image downloads, coupled with local file-path wiring in the workflow. A similar challenge occurs when accessing image data using DOIP, which, unlike HTTP, has no caching mechanism.
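The caching layer can be sketched as a content-addressed mapping from URL to local path, with an injectable transport so the sketch runs without network access (the real SDR wiring differs in detail):

```python
import hashlib
import os
import urllib.request

CACHE_DIR = "image_cache"

def cached_image_path(url, fetch=urllib.request.urlretrieve):
    """Map an https image URL from openDS onto a local file path,
    downloading at most once. `fetch` is injectable so workflow tools
    (and tests) can substitute their own transport."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    name = hashlib.sha256(url.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, name)
    if not os.path.exists(path):
        fetch(url, path)  # only on a cache miss
    return path

# Demo with a stub transport, so this runs without network access:
downloads = []
def stub_fetch(url, path):
    downloads.append(url)
    with open(path, "wb") as f:
        f.write(b"fake image bytes")

p1 = cached_image_path("https://example.org/img1.jpg", fetch=stub_fetch)
p2 = cached_image_path("https://example.org/img1.jpg", fetch=stub_fetch)
print(p1 == p2, len(downloads))  # True 1 -- second call hit the cache
```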
RO-Crate lessons
Galaxy is developing support for importing and exporting Workflow Run Crates, a profile of RO-Crate (Soiland-Reyes et al. 2022b) that captures the execution history of a workflow, including its definition and intermediate data (De Geest et al. 2022). SDR is adopting this support to combine openDS FDOs with workflow provenance, as envisioned by Walton et al. (2020).
Our prototype de novo workflow returns results as a ZIP file of openDS objects. End-users should also get copies of the referenced images and generated visualisations, along with workflow execution metadata. We are investigating ways to embed the preliminary Galaxy workflow history before the final step, so that this result can be an enriched RO-Crate.
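An enriched result crate's `ro-crate-metadata.json` descriptor could be assembled along these lines, following the RO-Crate 1.1 structure; the file names are illustrative, and a real deployment would rely on Galaxy's own RO-Crate export rather than hand-building the descriptor:

```python
import json

# Hand-built RO-Crate 1.1 descriptor for a result bundle that packages
# the openDS objects together with their referenced images.
parts = ["specimen-001.json", "images/img1.jpg"]

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "De novo digitisation results",
            "hasPart": [{"@id": p} for p in parts],
        },
    ] + [{"@id": p, "@type": "File"} for p in parts],
}
print(json.dumps(crate, indent=2))
```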
Conclusions
SDR is an example of machine-assisted construction of FDOs, which highlights the need for intermediate digital objects that are not yet FDO compliant. Passing such “local FDOs” is beneficial not just for efficiency and visual inspection, but also for simplifying workflow composition from canonical workflow building blocks. At the same time, we see that it is insufficient to pass FDOs only as JSON objects, as they also reference other data, such as images, which should not need to be re-downloaded.
Further work will investigate the use of RO-Crate as a wrapper of partial FDOs, but this needs to be coupled with more flexible FDO types as profiles, in order to restrict “impossible” ordering of steps depending on particular inner FDO fragments. A distinction needs to be made between open digital specimens that are in “draft” state and those that can be pushed to DiSSCo registries.
We are experimenting with changing the SDR components into Canonical Workflow Building Blocks (Soiland-Reyes et al. 2022a) using the Common Workflow Language (Crusoe et al. 2022). This gives flexibility to scalably execute SDR workflows on different compute backends such as HPC or local cluster, without the additional setup of Galaxy servers.
4. Islam S, Weiland C, Addink W. From data pipelines to FAIR data infrastructures: a vision for the new horizons of bio- and geodiversity data for scientific research. Research Ideas and Outcomes 2022. DOI: 10.3897/rio.8.e93816. Open access.
Abstract
Natural science collections are vast repositories of bio- and geodiversity specimens. These collections, originating from natural history cabinets or expeditions, are increasingly becoming unparalleled sources of data facilitating multidisciplinary research (Meineke et al. 2018, Heberling et al. 2019, Cook et al. 2020, Thompson et al. 2021). Due to various global data mobilization and digitisation efforts (Blagoderov et al. 2012, Nelson and Ellis 2018), this digitised information about specimens includes database records along with two/three-dimensional images, sonograms, sound or video recordings, computerised tomography scans, machine-readable texts from labels on the specimens as well as media items and notes related to the discovery sites and acquisition (Hedrick et al. 2020, Phillipson 2022).
The scope and practice of specimen gathering are also evolving. The term extended specimen was coined to refer to the specimen and associated data extending beyond the singular physical object to other physical or digital entities such as chemical composition, genetic sequence data or species data. Thus the specimen becomes an interconnected network of data resources that has incredible potential to enhance integrative and data-driven research (Webster 2017, Lendemer et al. 2019, Hardisty et al. 2022). These practices also reflect the role of data and the curatorial data life-cycle, starting from the initial material sampling process through to downstream analysis. We are also seeing growing acknowledgement that disparate and domain-specific data elements prevent interdisciplinarity, which is crucial for a holistic understanding of the biodiversity and climate crises (Hicks et al. 2010, Craven et al. 2019, Folk and Siniscalchi 2021).
Thus the data elements are not just records or rows in a database, or data pipelines going from one repository to another. They have the potential to become self-describing digital artefacts that can revolutionise how machines interpret and work with specimen data. Within this context, the Distributed System of Scientific Collections (DiSSCo), a new European Research Infrastructure for natural science collections, envisions an infrastructure based on FAIR Digital Objects (FDO) that can unify more than 170 European natural science collections under common, FAIR-compliant (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016) access and curation policies and practices. DiSSCo’s key element in achieving FAIR is the implementation of the Digital Specimen (a domain-specific FDO) that closely aligns with extended specimen practices. The idea behind the Digital Specimen — an FDO that acts as a digital surrogate for a specific physical specimen in a natural science collection — was influenced by global conversations around the implementation of the Digital Object Architecture for biodiversity data (De Smedt et al. 2020, Islam et al. 2020, Hardisty et al. 2020).
The main purpose of this talk is to explain the vision of how FAIR and FDO can create a data infrastructure that can not only take advantage of existing databases and repositories but also provide support for innovative services such as AI and digital twinning. With scientific use cases in mind, the talk will highlight a few key FAIR and FDO components (persistent identifiers, metadata, ontologies) within the collaborative modelling activity of the Digital Specimen specification. These components provide the template for specifying how a Digital Specimen should look, so that DiSSCo can build a FAIR service ecosystem based on FDOs (Addink et al. 2021). We will also give examples of envisioned services that can help with image feature extraction and model training (Grieb et al. 2021, Hardisty et al. 2022) and digital twinning (Schultes et al. 2022). We believe this is an exciting new paradigm powered by FAIR and FDO that can help both humans and machines to accelerate the use of specimen data. From physical objects curated over hundreds of years, we have developed data pipelines, aggregators and repositories (Barberousse 2021). Now is the time to look for solutions where these data records can become FAIR Digital Objects to enable wider access and multidisciplinary research.