1
Lusa L, Proust-Lima C, Schmidt CO, Lee KJ, le Cessie S, Baillie M, Lawrence F, Huebner M. Initial data analysis for longitudinal studies to build a solid foundation for reproducible analysis. PLoS One 2024; 19:e0295726. [PMID: 38809844 PMCID: PMC11135704 DOI: 10.1371/journal.pone.0295726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Accepted: 03/13/2024] [Indexed: 05/31/2024] [Open Access]
Abstract
Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings are an important step towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan, which is another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts with a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.
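Two of the screening explorations named in the abstract, participation profiles over time and missing outcome data per wave, can be illustrated with a minimal sketch. It is written in Python rather than the R used by the authors, it is not their code, and the records, variable names, and function names below are all hypothetical.

```python
from collections import Counter

# Hypothetical long-format records: (participant_id, wave, grip_strength_kg).
# None marks a participant who attended the wave but has no outcome value.
records = [
    ("p1", 1, 34.0), ("p1", 2, 33.1), ("p1", 3, None),
    ("p2", 1, 41.5), ("p2", 3, 39.0),
    ("p3", 1, None), ("p3", 2, 28.4), ("p3", 3, 27.9),
]
waves = [1, 2, 3]


def participation_profiles(records, waves):
    """Map each participant to a pattern such as 'X.X' (X = present at that wave)."""
    attended = {}
    for pid, wave, _ in records:
        attended.setdefault(pid, set()).add(wave)
    return {
        pid: "".join("X" if w in seen else "." for w in waves)
        for pid, seen in attended.items()
    }


def missing_outcome_by_wave(records, waves):
    """Among participants present at a wave, return the share with a missing outcome."""
    present = Counter()
    missing = Counter()
    for _, wave, value in records:
        present[wave] += 1
        if value is None:
            missing[wave] += 1
    return {w: (missing[w] / present[w] if present[w] else None) for w in waves}


profiles = participation_profiles(records, waves)
profile_counts = Counter(profiles.values())   # how often each pattern occurs
missing_share = missing_outcome_by_wave(records, waves)
```

Tabulating `profile_counts` separates non-participation (a `.` in the pattern) from item-level missingness (`None` at an attended wave), two sources of incomplete data that are counted separately here.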
Affiliation(s)
- Lara Lusa
- Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Koper, Capodistria, Slovenia
- Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Cécile Proust-Lima
- Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, UMR1219, Bordeaux, France
- Carsten O. Schmidt
- Institute for Community Medicine, SHIP-KEF University Medicine of Greifswald, Greifswald, Germany
- Katherine J. Lee
- Clinical Epidemiology and Biostatistics Unit, Murdoch Children’s Research Institute, Melbourne, Australia
- University of Melbourne, Melbourne, Australia
- Saskia le Cessie
- Department of Clinical Epidemiology and Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
- Frank Lawrence
- Center for Statistical Training and Consulting, Michigan State University, East Lansing, MI, United States of America
- Marianne Huebner
- Center for Statistical Training and Consulting, Michigan State University, East Lansing, MI, United States of America
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States of America
2
Ljujić J, Vujisić L, Tešević V, Sofrenić I, Ivanović S, Simić K, Anđelković B. Critical Review of Selected Analytical Platforms for GC-MS Metabolomics Profiling-Case Study: HS-SPME/GC-MS Analysis of Blackberry's Aroma. Foods 2024; 13:1222. [PMID: 38672895 PMCID: PMC11049629 DOI: 10.3390/foods13081222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Revised: 04/08/2024] [Accepted: 04/09/2024] [Indexed: 04/28/2024] [Open Access]
Abstract
Data processing and data extraction are the first, and most often crucial, steps in metabolomics and multivariate data analysis in general. There are several software solutions for these purposes in GC-MS metabolomics. It is often unclear which platform offers what kind of data and how that information influences the conclusions of the analysis. In this study, selected analytical platforms for GC-MS metabolomics profiling, SpectConnect and XCMS as well as MestReNova software, were used to process the results of the HS-SPME/GC-MS aroma analyses of several blackberry varieties. In addition, a detailed analysis of the identification of the individual components of the blackberry aroma club varieties was performed. In total, 72 components were detected in the XCMS platform, 119 in SpectConnect, and 87 and 167 in MestReNova, with automatic integral and manual correction, respectively, as well as 219 aroma components after manual analysis of GC-MS chromatograms. The obtained datasets were fed into SIMCA software for multivariate data analysis and used to create PCA, OPLS, and OPLS-DA models. The results of the validation tests and VIP-pred. scores were analyzed in detail.
Affiliation(s)
- Jovana Ljujić
- Faculty of Chemistry, University of Belgrade, Studentski trg 12–16, 11000 Belgrade, Serbia
- Ljubodrag Vujisić
- Faculty of Chemistry, University of Belgrade, Studentski trg 12–16, 11000 Belgrade, Serbia
- Vele Tešević
- Faculty of Chemistry, University of Belgrade, Studentski trg 12–16, 11000 Belgrade, Serbia
- Ivana Sofrenić
- Faculty of Chemistry, University of Belgrade, Studentski trg 12–16, 11000 Belgrade, Serbia
- Stefan Ivanović
- Institute of Chemistry, Technology and Metallurgy, National Institute of the Republic of Serbia, University of Belgrade, Njegoševa 12, 11000 Belgrade, Serbia
- Katarina Simić
- Institute of Chemistry, Technology and Metallurgy, National Institute of the Republic of Serbia, University of Belgrade, Njegoševa 12, 11000 Belgrade, Serbia
- Boban Anđelković
- Faculty of Chemistry, University of Belgrade, Studentski trg 12–16, 11000 Belgrade, Serbia
3
Zhou Y, Kathiresan N, Yu Z, Rivera LF, Yang Y, Thimma M, Manickam K, Chebotarov D, Mauleon R, Chougule K, Wei S, Gao T, Green CD, Zuccolo A, Xie W, Ware D, Zhang J, McNally KL, Wing RA. A high-performance computational workflow to accelerate GATK SNP detection across a 25-genome dataset. BMC Biol 2024; 22:13. [PMID: 38273258 PMCID: PMC10809545 DOI: 10.1186/s12915-024-01820-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Accepted: 01/09/2024] [Indexed: 01/27/2024] [Open Access]
Abstract
BACKGROUND Single-nucleotide polymorphisms (SNPs) are the most widely used form of molecular genetic variation studies. As reference genomes and resequencing data sets expand exponentially, tools must be in place to call SNPs at a similar pace. The genome analysis toolkit (GATK) is one of the most widely used SNP calling software tools publicly available, but unfortunately, high-performance computing versions of this tool have yet to become widely available and affordable. RESULTS Here we report an open-source high-performance computing genome variant calling workflow (HPC-GVCW) for GATK that can run on multiple computing platforms from supercomputers to desktop machines. We benchmarked HPC-GVCW on multiple crop species for performance and accuracy, with results comparable to previously published reports (using GATK alone). Finally, we used HPC-GVCW in production mode to call SNPs on a "subpopulation aware" 16-genome rice reference panel with ~3000 resequenced rice accessions. The entire process took ~16 weeks and resulted in the identification of an average of 27.3 M SNPs/genome and the discovery of ~2.3 million novel SNPs that were not present in the flagship reference genome for rice (i.e., IRGSP RefSeq). CONCLUSIONS This study developed an open-source pipeline (HPC-GVCW) to run GATK on HPC platforms, which significantly improved the speed at which SNPs can be called. The workflow is widely applicable as demonstrated successfully for four major crop species with genomes ranging in size from 400 Mb to 2.4 Gb. Using HPC-GVCW in production mode to call SNPs on a 25 multi-crop-reference genome data set produced over 1.1 billion SNPs that were publicly released for functional and breeding studies. For rice, many novel SNPs were identified and were found to reside within genes and open chromatin regions that are predicted to have functional consequences. Combined, our results demonstrate the usefulness of combining a high-performance SNP calling architecture solution with a subpopulation-aware reference genome panel for rapid SNP discovery and public deployment.
Affiliation(s)
- Yong Zhou
- Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Arizona Genomics Institute (AGI), School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, USA
- Nagarajan Kathiresan
- KAUST Supercomputing Laboratory (KSL), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Zhichao Yu
- Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Luis F Rivera
- Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Yujian Yang
- National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Manjula Thimma
- Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Keerthana Manickam
- Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Dmytro Chebotarov
- International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines
- Ramil Mauleon
- International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines
- Kapeel Chougule
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Sharon Wei
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Tingting Gao
- National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Carl D Green
- Information Technology Department, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Andrea Zuccolo
- Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Crop Science Research Center (CSRC), Scuola Superiore Sant'Anna, Pisa, 56127, Italy
- Weibo Xie
- National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Doreen Ware
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- USDA ARS NEA Plant, Soil & Nutrition Laboratory Research Unit, Ithaca, NY, 14853, USA
- Jianwei Zhang
- National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan, 430070, China
- Kenneth L McNally
- International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines
- Rod A Wing
- Center for Desert Agriculture (CDA), Biological and Environmental Sciences & Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia.
- Arizona Genomics Institute (AGI), School of Plant Sciences, University of Arizona, Tucson, AZ, 85721, USA.
- International Rice Research Institute (IRRI), Los Baños, Laguna, 4031, Philippines.
4
Fredston AL, Lowndes JSS. Welcoming More Participation in Open Data Science for the Oceans. Annual Review of Marine Science 2024; 16:537-549. [PMID: 37418835 DOI: 10.1146/annurev-marine-041723-094741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/09/2023]
Abstract
Open science is a global movement happening across all research fields. Enabled by technology and the open web, it builds on years of efforts by individuals, grassroots organizations, institutions, and agencies. The goal is to share knowledge and broaden participation in science, from early ideation to making research outputs openly accessible to all (open access). With an emphasis on transparency and collaboration, the open science movement dovetails with efforts to increase diversity, equity, inclusion, and belonging in science and society. The US Biden-Harris Administration and many other US government agencies have declared 2023 the Year of Open Science, providing a great opportunity to boost participation in open science for the oceans. For researchers day-to-day, open science is a critical piece of modern analytical workflows with increasing amounts of data. Therefore, we focus this article on open data science (the tooling and people enabling reproducible, transparent, inclusive practices for data-intensive research) and its intersection with the marine sciences. We discuss the state of various dimensions of open science and argue that technical advancements have outpaced our field's culture change to incorporate them. Increasing inclusivity and technical skill building are interlinked and must be prioritized within the marine science community to find collaborative solutions for responding to climate change and other threats to marine biodiversity and society.
Affiliation(s)
- Alexa L Fredston
- Department of Ocean Sciences, University of California, Santa Cruz, California, USA
- Julia S Stewart Lowndes
- National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, California, USA
5
Niehues A, de Visser C, Hagenbeek FA, Kulkarni P, Pool R, Karu N, Kindt ASD, Singh G, Vermeiren RRJM, Boomsma DI, van Dongen J, 't Hoen PAC, van Gool AJ. A multi-omics data analysis workflow packaged as a FAIR Digital Object. Gigascience 2024; 13:giad115. [PMID: 38217405 PMCID: PMC10787363 DOI: 10.1093/gigascience/giad115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 11/14/2023] [Accepted: 12/10/2023] [Indexed: 01/15/2024] [Open Access]
Abstract
BACKGROUND Applying good data management and FAIR (Findable, Accessible, Interoperable, and Reusable) data principles in research projects can help disentangle knowledge discovery, study result reproducibility, and data reuse in future studies. Based on the concepts of the original FAIR principles for research data, FAIR principles for research software were recently proposed. FAIR Digital Objects enable discovery and reuse of Research Objects, including computational workflows for both humans and machines. Practical examples can help promote the adoption of FAIR practices for computational workflows in the research community. We developed a multi-omics data analysis workflow implementing FAIR practices to share it as a FAIR Digital Object. FINDINGS We conducted a case study investigating shared patterns between multi-omics data and childhood externalizing behavior. The analysis workflow was implemented as a modular pipeline in the workflow manager Nextflow, including containers with software dependencies. We adhered to software development practices like version control, documentation, and licensing. Finally, the workflow was described with rich semantic metadata, packaged as a Research Object Crate, and shared via WorkflowHub. CONCLUSIONS Along with the packaged multi-omics data analysis workflow, we share our experiences adopting various FAIR practices and creating a FAIR Digital Object. We hope our experiences can help other researchers who develop omics data analysis workflows to turn FAIR principles into practice.
Affiliation(s)
- Anna Niehues
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, the Netherlands
- Casper de Visser
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Fiona A Hagenbeek
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Purva Kulkarni
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, the Netherlands
- Department of Human Genetics, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- René Pool
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Naama Karu
- Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
- Alida S D Kindt
- Metabolomics and Analytics Centre, Leiden Academic Centre for Drug Research, Leiden University, 2333 AL Leiden, The Netherlands
- Gurnoor Singh
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Robert R J M Vermeiren
- Department of Child and Adolescent Psychiatry, LUMC-Curium, Leiden University Medical Center, 2342 AK Oegstgeest, The Netherlands
- Dorret I Boomsma
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
- Jenny van Dongen
- Department of Biological Psychology, Vrije Universiteit Amsterdam, 1081 BT Amsterdam, The Netherlands
- Amsterdam Public Health Research Institute, 1081 BT Amsterdam, The Netherlands
- Amsterdam Reproduction & Development (AR&D) Research Institute, 1081 BT Amsterdam, The Netherlands
- Peter A C 't Hoen
- Department of Medical BioSciences, Radboud University Medical Center, 6525 GA Nijmegen, The Netherlands
- Alain J van Gool
- Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, 6525 GA Nijmegen, the Netherlands
6
Gomes DGE, Pottier P, Crystal-Ornelas R, Hudgins EJ, Foroughirad V, Sánchez-Reyes LL, Turba R, Martinez PA, Moreau D, Bertram MG, Smout CA, Gaynor KM. Why don't we share data and code? Perceived barriers and benefits to public archiving practices. Proc Biol Sci 2022; 289:20221113. [PMID: 36416041 PMCID: PMC9682438 DOI: 10.1098/rspb.2022.1113] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/08/2022] [Accepted: 11/02/2022] [Indexed: 08/10/2023] [Open Access]
Abstract
The biological sciences community is increasingly recognizing the value of open, reproducible and transparent research practices for science and society at large. Despite this recognition, many researchers fail to share their data and code publicly. This pattern may arise from knowledge barriers about how to archive data and code, concerns about its reuse, and misaligned career incentives. Here, we define, categorize and discuss barriers to data and code sharing that are relevant to many research fields. We explore how real and perceived barriers might be overcome or reframed in the light of the benefits relative to costs. By elucidating these barriers and the contexts in which they arise, we can take steps to mitigate them and align our actions with the goals of open science, both as individual scientists and as a scientific community.
Affiliation(s)
- Dylan G. E. Gomes
- NRC Research Associate, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, WA 98112, USA
- Cooperative Institute for Marine Resources Studies, Hatfield Marine Science Center, Oregon State University, Newport, OR 97365, USA
- Patrice Pottier
- Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, The University of New South Wales, Sydney, New South Wales 2052, Australia
- Robert Crystal-Ornelas
- Earth and Environmental Sciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Emma J. Hudgins
- Department of Biology, Carleton University, Ottawa, Canada, K1S 5B6
- Rachel Turba
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095-7239, USA
- Paula Andrea Martinez
- Australian Research Data Commons, The University of Queensland, Brisbane 4072, Australia
- David Moreau
- School of Psychology and Centre for Brain Research, University of Auckland, Auckland 1010, New Zealand
- Michael G. Bertram
- Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Umeå, SE-907 36, Sweden
- Cooper A. Smout
- Institute for Globally Distributed Open Research and Education (IGDORE), Brisbane 4001, Australia
- Kaitlyn M. Gaynor
- Departments of Zoology and Botany, University of British Columbia, Vancouver, Canada, BC V6T 1Z4
- National Center for Ecological Analysis and Synthesis, Santa Barbara, CA 93101, USA
7
Niehues A, de Visser C, Hagenbeek F, Karu N, Kindt A, Kulkarni P, Pool R, Boomsma D, van Dongen J, van Gool A, 't Hoen P. A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object. Research Ideas and Outcomes 2022. [DOI: 10.3897/rio.8.e94042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] [Open Access]
Abstract
In current biomedical and complex trait research, increasing numbers of large molecular profiling (omics) data sets are being generated. At the same time, many studies fail to be reproduced (Baker 2016, Kim 2018). In order to improve study reproducibility and data reuse, including integration of data sets of different types and origins, it is imperative to work with omics data that is findable, accessible, interoperable, and reusable (FAIR, Wilkinson 2016) at the source. The data analysis, integration and stewardship pillar of the Netherlands X-omics Initiative aims to facilitate multi-omics research by providing tools to create, analyze and integrate FAIR omics data. We here report a joint activity of X-omics and the Netherlands Twin Register demonstrating the FAIRification of a multi-omics data set and the development of a FAIR multi-omics data analysis workflow.
The implementation of FAIR principles (Wilkinson 2016) can improve scientific transparency and facilitate data reuse. However, Kim (2018) showed in a case study that the availability of data and code are required but not sufficient to reproduce data analyses. They highlighted the importance of interoperable and open formats, and structured metadata. In order to increase research reproducibility on the data analysis level, additional practices such as version-control, code licensing, and documentation have been proposed. These include recommendations for FAIR software by the Netherlands eScience Center and the Dutch Data Archiving and Networked Services (DANS), and FAIR principles for research software proposed by the Research Data Alliance (Chue Hong 2022). Data analysis in biomedical research usually comprises multiple steps often resulting in complex data analysis workflows and requiring additional practices, such as containerization, to ensure transparency and reproducibility (Goble 2020, Stoudt 2021).
We apply these practices to a multi-omics data set that comprises genome-wide DNA methylation profiles, targeted metabolomics, and behavioral data of two cohorts that participated in the ACTION Biomarker Study (ACTION, Aggression in Children: Unraveling gene-environment interplay to inform Treatment and InterventiON strategies, see consortium members in Suppl. material 1) (Boomsma 2015, Bartels 2018, Hagenbeek 2020, van Dongen 2021, Hagenbeek 2022). The ACTION-NTR cohort consists of twins that are either longitudinally concordant or discordant for childhood aggression. The ACTION-Curium-LUMC cohort consists of children referred to the Dutch LUMC Curium academic center for child and youth psychiatry. With the joint analysis of multi-omics data and behavioral data, we aim to identify substructures in the ACTION-NTR cohort and link them to aggressive behavior. First, the individuals are clustered using Similarity Network Fusion (SNF, Wang 2014), and latent feature dimensions are uncovered using different unsupervised methods including Multi-Omics Factor Analysis (MOFA) (Argelaguet 2018) and Multiple Correspondence Analysis (MCA, Lê 2008, Husson 2017). In a second step, we determine correlations between -omics and phenotype dimensions, and use them to explain the subgroups of individuals from the ACTION-NTR cohort. In order to validate the results, we project data of the ACTION-Curium-LUMC cohort onto the latent dimensions and determine if correlations between omics and phenotype data can be reproduced.
Integration of data across cohorts and across data types requires interoperability. We applied different practices to make the data FAIR, including conversion of files to community-standard formats, and capturing experimental metadata using the ISA (Investigation, Study, Assay) metadata framework (Johnson 2021) and ontology-based annotations. All data analysis steps including pre-processing of different omics data types were implemented in either R or Python and combined in a modular Nextflow (Di Tommaso 2017) workflow, where the environment for each step is provided as a Singularity (Kurtzer 2017) container. The analysis workflow is packaged in a Research Object Crate (RO-Crate) (Soiland-Reyes 2022). The RO-Crate is a FAIR digital object that contains the Nextflow workflow including ontology-based annotations of each analysis step. Since omics data is considered to be potentially personally identifiable, the packaged workflow contains a minimal synthetic data set resembling the original data structure. Finally, the code is made available on GitHub and the workflow is registered at WorkflowHub (Goble 2021). Since our Nextflow workflow is set up in a modular manner, the individual analysis steps can be reused in other workflows. We demonstrate this replicability by applying different sub-workflows to data from two different cohorts.
8
Hunter-Zinck H, de Siqueira AF, Vásquez VN, Barnes R, Martinez CC. Ten simple rules on writing clean and reliable open-source scientific software. PLoS Comput Biol 2021; 17:e1009481. [PMID: 34762641 PMCID: PMC8584773 DOI: 10.1371/journal.pcbi.1009481] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/04/2022] [Open Access]
Abstract
Functional, usable, and maintainable open-source software is increasingly essential to scientific research, but there is a large variation in formal training for software development and maintainability. Here, we propose 10 “rules” centered on 2 best practice components: clean code and testing. These 2 areas are relatively straightforward and provide substantial utility relative to the learning investment. Adopting clean code practices helps to standardize and organize software code in order to enhance readability and reduce cognitive load for both the initial developer and subsequent contributors; this allows developers to concentrate on core functionality and reduce errors. Clean coding styles make software code more amenable to testing, including unit tests that work best with modular and consistent software code. Unit tests interrogate specific and isolated coding behavior to reduce coding errors and ensure intended functionality, especially as code increases in complexity; unit tests also implicitly provide example usages of code. Other forms of testing are geared to discover erroneous behavior arising from unexpected inputs or emerging from the interaction of complex codebases. Although conforming to coding styles and designing tests can add time to the software development project in the short term, these foundational tools can help to improve the correctness, quality, usability, and maintainability of open-source scientific software code. They also advance the principal point of scientific research: producing accurate results in a reproducible way. In addition to suggesting several tips for getting started with clean code and testing practices, we recommend numerous tools for the popular open-source scientific software languages Python, R, and Julia.
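The pairing of clean code and unit tests described in the abstract can be made concrete with a minimal sketch. The example is not taken from the paper; the function `normalize_sample_id`, its behavior, and the test names are hypothetical, and only Python's standard `unittest` module is used.

```python
import unittest


def normalize_sample_id(raw_id):
    """Return a sample ID with surrounding whitespace removed and letters upper-cased."""
    if not isinstance(raw_id, str):
        raise TypeError("sample ID must be a string")
    cleaned = raw_id.strip().upper()
    if not cleaned:
        raise ValueError("sample ID must not be empty")
    return cleaned


class TestNormalizeSampleId(unittest.TestCase):
    """Each test checks one isolated behavior and doubles as a usage example."""

    def test_strips_whitespace_and_uppercases(self):
        self.assertEqual(normalize_sample_id("  ab12 "), "AB12")

    def test_rejects_empty_input(self):
        with self.assertRaises(ValueError):
            normalize_sample_id("   ")

    def test_rejects_non_string_input(self):
        with self.assertRaises(TypeError):
            normalize_sample_id(12)


def run_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalizeSampleId)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the function small and single-purpose is what makes the three tests short. The two rejection tests cover the unexpected-input cases that the abstract assigns to other forms of testing.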
Affiliation(s)
- Haley Hunter-Zinck
- Berkeley Institute for Data Science, University of California, Berkeley, Berkeley, California, United States of America
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, United States of America
- VA Boston Healthcare System, Boston, Massachusetts, United States of America
- VA St. Louis Health Care System, St. Louis, Missouri, United States of America
- Váleri N. Vásquez
- Berkeley Institute for Data Science, University of California, Berkeley, Berkeley, California, United States of America
- Energy and Resources Group, Rausser College of Natural Resources, University of California, Berkeley, Berkeley, California, United States of America
- Richard Barnes
- Berkeley Institute for Data Science, University of California, Berkeley, Berkeley, California, United States of America
- Energy and Resources Group, Rausser College of Natural Resources, University of California, Berkeley, Berkeley, California, United States of America
- Ciera C. Martinez
- Berkeley Institute for Data Science, University of California, Berkeley, Berkeley, California, United States of America