1
Wright A, Wilkinson MD, Mungall C, Cain S, Richards S, Sternberg P, Provin E, Jacobs JL, Geib S, Raciti D, Yook K, Stein L, Molik DC. FAIR Header Reference genome: a TRUSTworthy standard. Brief Bioinform 2024; 25:bbae122. [PMID: 38555475 PMCID: PMC10981671 DOI: 10.1093/bib/bbae122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 02/16/2024] [Accepted: 02/22/2024] [Indexed: 04/02/2024] Open
Abstract
The lack of interoperable data standards among reference genome data-sharing platforms inhibits cross-platform analysis while increasing the risk of data provenance loss. Here, we describe the FAIR bioHeaders Reference genome (FHR), a metadata standard guided by the principles of Findability, Accessibility, Interoperability and Reuse (FAIR) in addition to the principles of Transparency, Responsibility, User focus, Sustainability and Technology (TRUST). The objective of FHR is to provide an extensive set of data serialisation methods and minimum data field requirements while still maintaining extensibility, flexibility and expressivity in an increasingly decentralised genomic data ecosystem. The effort needed to implement FHR is low; FHR's design philosophy ensures easy implementation while retaining the benefits gained from recording both machine and human-readable provenance.
Affiliation(s)
- Adam Wright
  - Adaptive Oncology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 500, Toronto, ON M5G 0A3, Canada
- Mark D Wilkinson
  - Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA/CSIC), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA/CSIC), Pozuelo de Alarcón, Madrid, ES, Spain
- Christopher Mungall
  - Biosystems Data Science, Lawrence Berkeley National Laboratory, Building: 977, 1 Cyclotron Rd, Berkeley, CA 94720, USA
- Scott Cain
  - Adaptive Oncology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 500, Toronto, ON M5G 0A3, Canada
- Stephen Richards
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, MS: BCM226, Houston, TX 77030, USA
- Paul Sternberg
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Ellen Provin
  - Department of Horticultural Studies, Texas A&M University, HFSB 204, TAMU 2133, College Station, TX 77848, USA
- Jonathan L Jacobs
  - American Type Culture Collection, 10801 University Blvd, Manassas, VA 20110, USA
- Scott Geib
  - Tropical Pest Genetics and Molecular Biology Research Unit, Daniel K. Inouye U.S. Pacific Basin Agricultural Research Center, United States Department of Agriculture, Agricultural Research Service, 64 Nowelo St, Hilo HI 96720, USA
- Daniela Raciti
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Karen Yook
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Lincoln Stein
  - Adaptive Oncology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 500, Toronto, ON M5G 0A3, Canada
- David C Molik
  - Arthropod-borne Animal Diseases Research Unit, Center for Grain and Animal Health Research, United States Department of Agriculture, Agricultural Research Service, 1515 College Ave, Manhattan, KS 66502, USA
2
Wright A, Wilkinson MD, Mungall C, Cain S, Richards S, Sternberg P, Provin E, Jacobs JL, Geib S, Raciti D, Yook K, Stein L, Molik DC. DATA RESOURCES AND ANALYSES FAIR Header Reference genome: A TRUSTworthy standard. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.29.569306. [PMID: 38076838 PMCID: PMC10705436 DOI: 10.1101/2023.11.29.569306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2023]
Abstract
The lack of interoperable data standards among reference genome data-sharing platforms inhibits cross-platform analysis while increasing the risk of data provenance loss. Here, we describe the FAIR-bioHeaders Reference genome (FHR), a metadata standard guided by the principles of Findability, Accessibility, Interoperability, and Reuse (FAIR) in addition to the principles of Transparency, Responsibility, User focus, Sustainability, and Technology (TRUST). The objective of FHR is to provide an extensive set of data serialisation methods and minimum data field requirements while still maintaining extensibility, flexibility, and expressivity in an increasingly decentralised genomic data ecosystem. The effort needed to implement FHR is low; FHR's design philosophy ensures easy implementation while retaining the benefits gained from recording both machine and human-readable provenance.
Affiliation(s)
- Adam Wright
  - Adaptive Oncology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 500, Toronto, ON M5G 0A3, Canada
- Mark D Wilkinson
  - Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA/CSIC), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA/CSIC), Pozuelo de Alarcón, Madrid, ES, Spain
- Chris Mungall
  - Biosystems Data Science, Lawrence Berkeley National Laboratory, Building: 977, 1 Cyclotron Rd, Berkeley, CA 94720, USA
- Scott Cain
  - Adaptive Oncology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 500, Toronto, ON M5G 0A3, Canada
- Stephen Richards
  - Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, MS: BCM226, Houston, TX 77030, USA
- Paul Sternberg
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Ellen Provin
  - Department of Horticultural Studies, Texas A&M University, HFSB 204, TAMU 2133, College Station, TX 77848, USA
- Jonathan L Jacobs
  - American Type Culture Collection, 10801 University Blvd, Manassas, VA 20110, USA
- Scott Geib
  - Tropical Pest Genetics and Molecular Biology Research Unit, Daniel K. Inouye U.S. Pacific Basin Agricultural Research Center, United States Department of Agriculture, Agricultural Research Service, 64 Nowelo St, Hilo HI 96720, USA
- Daniela Raciti
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Karen Yook
  - Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
- Lincoln Stein
  - Adaptive Oncology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 500, Toronto, ON M5G 0A3, Canada
- David C Molik
  - Arthropod-borne Animal Diseases Research Unit, Center for Grain and Animal Health Research, United States Department of Agriculture, Agricultural Research Service, 1515 College Ave, Manhattan, KS 66502, USA
3
Vujčić V, Marinković BP, Srećković VA, Tošić S, Jevremović D, Ignjatović LM, Rabasović MS, Šević D, Simonović N, Mason NJ. Current stage and future development of Belgrade collisional and radiative databases/datasets of importance for molecular dynamics. Phys Chem Chem Phys 2023; 25:26972-26985. [PMID: 37791414 DOI: 10.1039/d3cp03752e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
Atomic and molecular (A&M) databases that contain information about species, their identities and radiative/collisional processes are essential and helpful tools that are utilized in many fields of physics, chemistry, and chem/phys-informatics. Errors or inconsistencies in the datasets are a serious issue since they can lead to inaccurate predictions and generate problems with the modeling. This demonstrates that data curation efforts around A&M databases are still indispensable and that in the curation process studious attention is required. Therefore, we herein present research activities around Belgrade "nodes" - datasets of collision/radiative cross-sections and rates needed for spectroscopy analysis in various A&M, optical and plasma physics fields. Methodologies of our research and both present and future aspects of the applications are explained. We explored the possibility to extend our nodes towards building a new database on Judd-Ofelt parameters by using machine learning in order to predict optical properties of luminescence materials. In addition, we hope that public availability of our datasets and their graphical representations will also motivate others to investigate the potential of these data.
Affiliation(s)
- Veljko Vujčić
  - Astronomical Observatory Belgrade, Volgina 7, 11000 Belgrade, Serbia
- Sanja Tošić
  - Institute of Physics Belgrade, University of Belgrade, 11080 Belgrade, Serbia
- Darko Jevremović
  - Astronomical Observatory Belgrade, Volgina 7, 11000 Belgrade, Serbia
- Maja S Rabasović
  - Institute of Physics Belgrade, University of Belgrade, 11080 Belgrade, Serbia
- Dragutin Šević
  - Institute of Physics Belgrade, University of Belgrade, 11080 Belgrade, Serbia
- Nenad Simonović
  - Institute of Physics Belgrade, University of Belgrade, 11080 Belgrade, Serbia
- Nigel J Mason
  - School of Physics and Astronomy, University of Kent, Canterbury CT2 7NH, UK
4
Bremer PL, Fiehn O. SMetaS: A Sample Metadata Standardizer for Metabolomics. Metabolites 2023; 13:941. [PMID: 37623884 PMCID: PMC10456726 DOI: 10.3390/metabo13080941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 07/26/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023] Open
Abstract
Metabolomics has advanced to an extent where it is desired to standardize and compare data across individual studies. While past work in standardization has focused on data acquisition, data processing, and data storage aspects, metabolomics databases are useless without ontology-based descriptions of biological samples and study designs. We introduce here a user-centric tool to automatically standardize sample metadata. Using such a tool in frontends for metabolomic databases will dramatically increase the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of data, specifically for data reuse and for finding datasets that share comparable sets of metadata, e.g., study meta-analyses, cross-species analyses or large scale metabolomic atlases. SMetaS (Sample Metadata Standardizer) combines a classic database with an API and frontend and is provided in a containerized environment. The tool has two user-centric components. In the first component, the user designs a sample metadata matrix and fills the cells using natural language terminology. In the second component, the tool transforms the completed matrix by replacing freetext terms with terms from fixed vocabularies. This transformation process is designed to maximize simplicity and is guided by, among other strategies, synonym matching and typographical fixing in an n-grams/nearest neighbors model approach. The tool enables downstream analysis of submitted studies and samples via string equality for FAIR retrospective use.
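The abstract above does not give implementation details, so the following is only a minimal sketch of how free-text metadata terms might be mapped onto a fixed vocabulary with character n-grams and a nearest-neighbour lookup. The function names, the small `VOCAB` synonym table, the use of Jaccard similarity and the `min_score` threshold are all assumptions made for illustration, not the SMetaS implementation.

```python
from typing import Dict, Optional, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.strip().lower()} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two n-gram sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def standardize(
    term: str, synonyms: Dict[str, str], min_score: float = 0.3
) -> Tuple[Optional[str], float]:
    """Map a free-text term to the label of its nearest known surface form.

    `synonyms` maps surface forms (including synonyms) to vocabulary labels.
    Returns (None, score) when no surface form is similar enough.
    """
    query = char_ngrams(term)
    best_label: Optional[str] = None
    best_score = 0.0
    for surface, label in synonyms.items():
        score = jaccard(query, char_ngrams(surface))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_score:
        return None, best_score
    return best_label, best_score


# Hypothetical vocabulary: surface form -> controlled term.
VOCAB = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "mus musculus": "Mus musculus",
    "mouse": "Mus musculus",
    "blood plasma": "plasma",
    "plasma": "plasma",
}

if __name__ == "__main__":
    for raw in ["Humans", "mus muscullus", "zebrafish liver"]:
        print(raw, "->", standardize(raw, VOCAB))
```

In this sketch a synonym ("Humans") and a typo ("mus muscullus") both resolve to a controlled term, while a term with no close neighbour ("zebrafish liver") is returned as unmatched so that a curator could review it. Once every cell holds a controlled term, downstream comparison by string equality becomes possible.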
Affiliation(s)
- Parker Ladd Bremer
  - Department of Chemistry, University of California, Davis, CA 95616, USA
- Oliver Fiehn
  - West Coast Metabolomics Center for Compound Identification, UC Davis Genome Center, University of California, Davis, CA 95616, USA
5
Chicco D, Cumbo F, Angione C. Ten quick tips for avoiding pitfalls in multi-omics data integration analyses. PLoS Comput Biol 2023; 19:e1011224. [PMID: 37410704 DOI: 10.1371/journal.pcbi.1011224] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/08/2023] Open
Abstract
Data are the most important elements of bioinformatics: Computational analysis of bioinformatics data, in fact, can help researchers infer new knowledge about biology, chemistry, biophysics, and sometimes even medicine, influencing treatments and therapies for patients. Bioinformatics and high-throughput biological data coming from different sources can be even more helpful, because each of these different data chunks can provide alternative, complementary information about a specific biological phenomenon, similar to multiple photos of the same subject taken from different angles. In this context, the integration of bioinformatics and high-throughput biological data plays a pivotal role in running a successful bioinformatics study. In the last decades, data originating from proteomics, metabolomics, metagenomics, phenomics, transcriptomics, and epigenomics have been labelled -omics data, as a unique name to refer to them, and the integration of these omics data has gained importance in all biological areas. Even if this omics data integration is useful and relevant, due to its heterogeneity, it is not uncommon to make mistakes during the integration phases. We therefore decided to present these ten quick tips to perform an omics data integration correctly, avoiding common mistakes we experienced or noticed in published studies in the past. Even if we designed our ten guidelines for beginners, using simple language that (we hope) can be understood by anyone, we believe our ten recommendations should be taken into account by all the bioinformaticians performing omics data integration, including experts.
Affiliation(s)
- Davide Chicco
  - Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Fabio Cumbo
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, Ohio, United States of America
- Claudio Angione
  - School of Computing Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom
6
Welter D, Juty N, Rocca-Serra P, Xu F, Henderson D, Gu W, Strubel J, Giessmann RT, Emam I, Gadiya Y, Abbassi-Daloii T, Alharbi E, Gray AJG, Courtot M, Gribbon P, Ioannidis V, Reilly DS, Lynch N, Boiten JW, Satagopam V, Goble C, Sansone SA, Burdett T. FAIR in action - a flexible framework to guide FAIRification. Sci Data 2023; 10:291. [PMID: 37208349 PMCID: PMC10199076 DOI: 10.1038/s41597-023-02167-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Accepted: 03/28/2023] [Indexed: 05/21/2023] Open
Abstract
The COVID-19 pandemic has highlighted the need for FAIR (Findable, Accessible, Interoperable, and Reusable) data more than any other scientific challenge to date. We developed a flexible, multi-level, domain-agnostic FAIRification framework, providing practical guidance to improve the FAIRness for both existing and future clinical and molecular datasets. We validated the framework in collaboration with several major public-private partnership projects, demonstrating and delivering improvements across all aspects of FAIR and across a variety of datasets and their contexts. We therefore managed to establish the reproducibility and far-reaching applicability of our approach to FAIRification tasks.
Affiliation(s)
- Danielle Welter
  - Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367, Belval, Luxembourg
- Nick Juty
  - University of Manchester, Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Philippe Rocca-Serra
  - Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, OX1 3QG, Oxford, UK
- Fuqi Xu
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- David Henderson
  - Bayer AG, Business Development & Licensing & OI, Muellerstrasse 178, 13353, Berlin, Germany
- Wei Gu
  - Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367, Belval, Luxembourg
- Jolanda Strubel
  - The Hyve BV, Arthur van Schendelstraat 650, 3511 MJ, Utrecht, The Netherlands
- Robert T Giessmann
  - Bayer AG, Business Development & Licensing & OI, Muellerstrasse 178, 13353, Berlin, Germany
  - Institute for Globally Distributed Open Research and Education (IGDORE), Gothenburg, Sweden
- Ibrahim Emam
  - Data Science Institute, Imperial College, London, UK
- Yojana Gadiya
  - Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP) and Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590, Frankfurt, Germany
- Tooba Abbassi-Daloii
  - Department of Bioinformatics (BiGCaT), NUTRIM, FHML, Maastricht University, Maastricht, The Netherlands
- Ebtisam Alharbi
  - College of Computer and Information Systems, Umm Al-Qura University, Mecca, Saudi Arabia
- Alasdair J G Gray
  - Department of Computer Science, Heriot-Watt University, Edinburgh, EH14 4AS, Scotland, UK
- Melanie Courtot
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
  - Ontario Institute for Cancer Research MaRS Centre, 661 University Avenue, Suite 510, Toronto, Ontario, M5G 0A3, Canada
- Philip Gribbon
  - Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP) and Fraunhofer Cluster of Excellence for Immune Mediated Diseases (CIMD), Schnackenburgallee 114, 22525 Hamburg, and Theodor Stern Kai 7, 60590, Frankfurt, Germany
- Vassilios Ioannidis
  - Vital-IT Group, SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Dorothy S Reilly
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Basel, Switzerland
- Venkata Satagopam
  - Luxembourg Centre for Systems Biomedicine, ELIXIR Luxembourg, University of Luxembourg, L-4367, Belval, Luxembourg
- Carole Goble
  - University of Manchester, Department of Computer Science, The University of Manchester, Manchester, M13 9PL, UK
- Susanna-Assunta Sansone
  - Oxford e-Research Centre, Department of Engineering Science, University of Oxford, 7 Keble Road, OX1 3QG, Oxford, UK
- Tony Burdett
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
7
Viviani M, Montemurro M, Trusolino L, Bertotti A, Urgese G, Grassi E. EGAsubmitter: A software to automate submission of nucleic acid sequencing data to the European Genome-phenome Archive. FRONTIERS IN BIOINFORMATICS 2023; 3:1143014. [PMID: 37063647 PMCID: PMC10098081 DOI: 10.3389/fbinf.2023.1143014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 03/14/2023] [Indexed: 04/03/2023] Open
Abstract
Making raw data available to the research community is one of the pillars of Findability, Accessibility, Interoperability, and Reuse (FAIR) research. However, the submission of raw data to public databases still involves many manually operated procedures that are intrinsically time-consuming and error-prone, which raises potential reliability issues for both the data themselves and the ensuing metadata. For example, submitting sequencing data to the European Genome-phenome Archive (EGA) is estimated to take 1 month overall, and mainly relies on a web interface for metadata management that requires manual completion of forms and the upload of several comma-separated values (CSV) files, which are not structured from a formal point of view. To tackle these limitations, here we present EGAsubmitter, a Snakemake-based pipeline that guides the user across all the submission steps, ranging from file encryption and upload to metadata submission. EGAsubmitter is expected to streamline the automated submission of sequencing data to EGA, minimizing user errors and ensuring higher end product fidelity.
Affiliation(s)
- Marco Viviani
  - Candiolo Cancer Institute—FPO IRCCS, Candiolo, Italy
  - Department of Oncology, University of Torino, Candiolo, Italy
- Livio Trusolino
  - Candiolo Cancer Institute—FPO IRCCS, Candiolo, Italy
  - Department of Oncology, University of Torino, Candiolo, Italy
- Andrea Bertotti
  - Candiolo Cancer Institute—FPO IRCCS, Candiolo, Italy
  - Department of Oncology, University of Torino, Candiolo, Italy
- Elena Grassi
  - Candiolo Cancer Institute—FPO IRCCS, Candiolo, Italy
  - Department of Oncology, University of Torino, Candiolo, Italy
  - *Correspondence: Elena Grassi,
8
Modeling community standards for metadata as templates makes data FAIR. Sci Data 2022; 9:696. [DOI: 10.1038/s41597-022-01815-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Accepted: 11/01/2022] [Indexed: 11/13/2022] Open
Abstract
It is challenging to determine whether datasets are findable, accessible, interoperable, and reusable (FAIR) because the FAIR Guiding Principles refer to highly idiosyncratic criteria regarding the metadata used to annotate datasets. Specifically, the FAIR principles require metadata to be “rich” and to adhere to “domain-relevant” community standards. Scientific communities should be able to define their own machine-actionable templates for metadata that encode these “rich,” discipline-specific elements. We have explored this template-based approach in the context of two software systems. One system is the CEDAR Workbench, which investigators use to author new metadata. The other is the FAIRware Workbench, which evaluates the metadata of archived datasets for their adherence to community standards. Benefits accrue when templates for metadata become central elements in an ecosystem of tools to manage online datasets—both because the templates serve as a community reference for what constitutes FAIR data, and because they embody that perspective in a form that can be distributed among a variety of software applications to assist with data stewardship and data sharing.