1. Weberpals J, Wang SV. The FAIRification of research in real-world evidence: A practical introduction to reproducible analytic workflows using Git and R. Pharmacoepidemiol Drug Saf 2024; 33:e5740. [PMID: 38173166 DOI: 10.1002/pds.5740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 11/29/2023] [Accepted: 11/30/2023] [Indexed: 01/05/2024]
Abstract
Transparency and reproducibility are major prerequisites for conducting meaningful real-world evidence (RWE) studies that are fit for decision-making. Many advances have been made in the documentation and reporting of study protocols and results, but the principles for version control and sharing of analytic code in RWE are not yet as established as in other quantitative disciplines like computational biology and health informatics. In this practical tutorial, we aim to give an introduction to distributed version control systems (VCS) tailored toward the FAIR (Findable, Accessible, Interoperable, and Reproducible) implementation of RWE studies. To ease adoption, we provide detailed step-by-step instructions with practical examples on how the Git VCS and R programming language can be implemented into RWE study workflows to facilitate reproducible analyses. We further discuss and showcase how these tools can be used to track changes, collaborate, disseminate, and archive RWE studies through dedicated project repositories that maintain a complete audit trail of all relevant study documents.
Affiliation(s)
- Janick Weberpals
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Shirley V Wang
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA

2. Mallya P, Stevens LM, Zhao J, Hong C, Henao R, Economou-Zavlanos N, Wojdyla DM, Schibler T, Manchanda V, Pencina MJ, Hall JL. Facilitating Harmonization of Variables in Framingham, MESA, ARIC, and REGARDS Studies Through a Metadata Repository. Circ Cardiovasc Qual Outcomes 2023; 16:e009938. [PMID: 37850400 PMCID: PMC10841164 DOI: 10.1161/circoutcomes.123.009938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2023]
Abstract
BACKGROUND: High-quality research in cardiovascular prevention, as in other fields, requires inclusion of a broad range of data sets from different sources. Integrating and harmonizing different data sources is essential to increase generalizability, sample size, and representation of understudied populations, strengthening the evidence for the scientific questions being addressed. METHODS: Here, we describe an effort to build an open-access repository and interactive online portal for researchers to access the metadata and code harmonizing data from 4 well-known cohort studies: the REGARDS (Reasons for Geographic and Racial Differences in Stroke) study, FHS (Framingham Heart Study), MESA (Multi-Ethnic Study of Atherosclerosis), and ARIC (Atherosclerosis Risk in Communities) study. We introduce a methodology and a framework used for preprocessing and harmonizing variables from multiple studies. RESULTS: We provide a real-case study and step-by-step guidance to demonstrate the practical utility of our repository and interactive web page. In addition to our successful development of such an open-access repository and interactive web page, this exercise in harmonizing data from multiple cohort studies has revealed several key themes. These themes include the importance of careful preprocessing and harmonization of variables, the value of creating an open-access repository to facilitate collaboration and reproducibility, and the potential for using harmonized data to address important scientific questions and disparities in cardiovascular disease research. CONCLUSIONS: By integrating and harmonizing these large-scale cohort studies, such a repository may improve the statistical power and representation of understudied cohorts, enabling development and validation of risk prediction models, identification and investigation of risk factors, and creation of a platform for racial disparities research. REGISTRATION: URL: https://precision.heart.org/duke-ninds.
Affiliation(s)
- Pratheek Mallya
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
- Laura M Stevens
  - University of Colorado Anschutz Medical School, Aurora (L.M.S.)
- Juan Zhao
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
- Chuan Hong
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC (C.H., R.H., M.P.)
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Ricardo Henao
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC (C.H., R.H., M.P.)
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Daniel M Wojdyla
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Tony Schibler
  - Duke Clinical Research Institute, Durham, NC (C.H., R.H., D.W., T.S.)
- Vihaan Manchanda
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
- Michael J Pencina
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC (C.H., R.H., M.P.)
- Jennifer L Hall
  - American Heart Association, Dallas, TX (P.M., J.Z., V.M., J.L.H.)
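
The variable harmonization described in the abstract of entry 2 can be illustrated with a minimal sketch. It is not the repository's actual metadata or code: the cohort labels, column names, units, and mapping rules below are hypothetical placeholders, chosen only to show the idea of mapping each cohort's source variables onto a shared name and unit before pooling.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical metadata: for each cohort, map a source column to a
# (harmonized name, conversion function) pair. None of these column
# names or units are taken from the real FHS/MESA/ARIC/REGARDS data.
Rule = Tuple[str, Callable[[float], float]]

HARMONIZATION_RULES: Dict[str, Dict[str, Rule]] = {
    "cohort_a": {
        "sbp_mmhg": ("systolic_bp_mmhg", lambda v: v),
        # Total cholesterol: mg/dL -> mmol/L (divide by 38.67).
        "chol_mgdl": ("total_chol_mmol_l", lambda v: v / 38.67),
    },
    "cohort_b": {
        "SYSBP": ("systolic_bp_mmhg", lambda v: v),
        "TOTCHOL_MMOL": ("total_chol_mmol_l", lambda v: v),
    },
}


def harmonize(cohort: str, rows: List[Dict[str, float]]) -> List[Dict[str, object]]:
    """Rename and convert one cohort's rows according to its rules."""
    rules = HARMONIZATION_RULES[cohort]
    harmonized = []
    for row in rows:
        record: Dict[str, object] = {"cohort": cohort}
        for source_name, (target_name, convert) in rules.items():
            if source_name in row:
                record[target_name] = round(convert(row[source_name]), 2)
        harmonized.append(record)
    return harmonized


# Pool two toy cohorts into one table with shared variable names.
pooled = (
    harmonize("cohort_a", [{"sbp_mmhg": 128.0, "chol_mgdl": 193.35}])
    + harmonize("cohort_b", [{"SYSBP": 135.0, "TOTCHOL_MMOL": 5.2}])
)
```

Keeping the rules in a plain data structure, separate from the function that applies them, mirrors the abstract's emphasis on sharing metadata alongside code: the mapping table can be reviewed and versioned without reading the transformation logic.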

3. Serret-Larmande A, Kaltman JR, Avillach P. Streamlining statistical reproducibility: NHLBI ORCHID clinical trial results reproduction. JAMIA Open 2022; 5:ooac001. [PMID: 35156003 PMCID: PMC8826998 DOI: 10.1093/jamiaopen/ooac001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2021] [Revised: 12/23/2021] [Accepted: 01/07/2022] [Indexed: 11/24/2022] Open
Abstract
Reproducibility in medical research has been a long-standing issue. More recently, the COVID-19 pandemic has publicly underlined this fact as the retraction of several studies reached general media audiences. A significant number of these retractions occurred after in-depth scrutiny of the methodology and results by the scientific community. Consequently, these retractions have undermined confidence in the peer-review process, which is not considered sufficiently reliable to generate trust in the published results. This partly stems from opacity in published results, the practical implementation of the statistical analysis often remaining undisclosed. We present a workflow that uses a combination of informatics tools to foster statistical reproducibility: an open-source programming language, Jupyter Notebook, cloud-based data repository, and an application programming interface can streamline an analysis and help to kick-start new analyses. We illustrate this principle by (1) reproducing the results of the ORCHID clinical trial, which evaluated the efficacy of hydroxychloroquine in COVID-19 patients, and (2) expanding on the analyses conducted in the original trial by investigating the association of premedication with biological laboratory results. Such workflows will be encouraged for future publications from National Heart, Lung, and Blood Institute-funded studies.

The COVID-19 pandemic has seen several articles published in high-profile journals being retracted. These retractions further undermined confidence in the peer-review process, which is not considered sufficiently reliable to generate trust in the published results. A significant number of these retractions occurred after in-depth scrutiny of the methodology and results by the scientific community. This partly stems from opacity in published results, the practical implementation of the statistical analysis often remaining undisclosed. This article presents a simple workflow that leverages a combination of preexisting and newly developed biomedical informatics tools to promote transparent statistical analysis in biomedical research, which relies on the National Heart, Lung, and Blood Institute (NHLBI) BioData Catalyst platform. By streamlining access to data and analysis source code, it eases results reproduction and accelerates supplemental analyses. Such workflows will be encouraged for future publications from NHLBI-funded studies. We illustrate it by reproducing the results of the ORCHID clinical trial, which evaluated the efficacy of hydroxychloroquine in COVID-19 patients.
Affiliation(s)
- Arnaud Serret-Larmande
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA
- Jonathan R Kaltman
  - Division of Cardiovascular Sciences, National Heart, Lung, and Blood Institute, NIH, Bethesda, Maryland, USA
- Paul Avillach
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA

4. Vesteghem C, Brøndum RF, Sønderkær M, Sommer M, Schmitz A, Bødker JS, Dybkær K, El-Galaly TC, Bøgsted M. Implementing the FAIR Data Principles in precision oncology: review of supporting initiatives. Brief Bioinform 2021; 21:936-945. [PMID: 31263868 PMCID: PMC7299292 DOI: 10.1093/bib/bbz044] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 02/11/2019] [Revised: 03/13/2019] [Accepted: 03/21/2019] [Indexed: 12/26/2022] Open
Abstract
Compelling research has recently shown that cancer is so heterogeneous that single research centres cannot produce enough data to fit prognostic and predictive models of sufficient accuracy. Data sharing in precision oncology is therefore of utmost importance. The Findable, Accessible, Interoperable and Reusable (FAIR) Data Principles have been developed to define good practices in data sharing. Motivated by the ambition of applying the FAIR Data Principles to our own clinical precision oncology implementations and research, we have performed a systematic literature review of potentially relevant initiatives. For clinical data, we suggest using the Genomic Data Commons model as a reference as it provides a field-tested and well-documented solution. Regarding classification of diagnosis, morphology and topography and drugs, we chose to follow the World Health Organization standards, i.e. ICD10, ICD-O-3 and Anatomical Therapeutic Chemical classifications, respectively. For the bioinformatics pipeline, the Genome Analysis ToolKit Best Practices using Docker containers offer a coherent solution and have therefore been selected. Regarding the naming of variants, we follow the Human Genome Variation Society's standard. For the IT infrastructure, we have built a centralized solution to participate in data sharing through federated solutions such as the Beacon Networks.
Affiliation(s)
- Charles Vesteghem
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
- Mads Sønderkær
  - Department of Haematology, Aalborg University Hospital, Denmark
- Mia Sommer
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
- Karen Dybkær
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
  - Clinical Cancer Research Center, Aalborg University Hospital, Denmark
- Tarec Christoffer El-Galaly
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
  - Clinical Cancer Research Center, Aalborg University Hospital, Denmark
- Martin Bøgsted
  - Department of Clinical Medicine, Aalborg University, Denmark
  - Department of Haematology, Aalborg University Hospital, Denmark
  - Clinical Cancer Research Center, Aalborg University Hospital, Denmark

5. Nishiwaki H, Hamaguchi T, Ito M, Ishida T, Maeda T, Kashihara K, Tsuboi Y, Ueyama J, Shimamura T, Mori H, Kurokawa K, Katsuno M, Hirayama M, Ohno K. Short-Chain Fatty Acid-Producing Gut Microbiota Is Decreased in Parkinson's Disease but Not in Rapid-Eye-Movement Sleep Behavior Disorder. mSystems 2020; 5:e00797-20. [PMID: 33293403 PMCID: PMC7771407 DOI: 10.1128/msystems.00797-20] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Received: 08/11/2020] [Accepted: 11/16/2020] [Indexed: 02/07/2023] Open
Abstract
Gut dysbiosis has been repeatedly reported in Parkinson's disease (PD) but only once in idiopathic rapid-eye-movement sleep behavior disorder (iRBD) from Germany. Abnormal aggregation of α-synuclein fibrils causing PD possibly starts from the intestine, although this is still currently under debate. iRBD patients frequently develop PD. Early-stage gut dysbiosis that is causally associated with PD is thus expected to be observed in iRBD. We analyzed gut microbiota in 26 iRBD patients and 137 controls by 16S rRNA sequencing (16S rRNA-seq). Our iRBD data set was meta-analyzed with the German iRBD data set and was compared with gut microbiota in 223 PD patients. Unsupervised clustering of gut microbiota by LIGER, a topic model-based tool for single-cell RNA sequencing (RNA-seq) analysis, revealed four enterotypes in controls, iRBD, and PD. Short-chain fatty acid (SCFA)-producing bacteria were conserved in an enterotype observed in controls and iRBD, whereas they were less conserved in enterotypes observed in PD. Genus Akkermansia and family Akkermansiaceae were consistently increased in both iRBD in two countries and PD in five countries. Short-chain fatty acid (SCFA)-producing bacteria were not significantly decreased in iRBD in two countries. In contrast, we previously reported that recognized or putative SCFA-producing genera Faecalibacterium, Roseburia, and Lachnospiraceae ND3007 group were consistently decreased in PD in five countries. In α-synucleinopathy, increase of mucin-layer-degrading genus Akkermansia is observed at the stage of iRBD, whereas decrease of SCFA-producing genera becomes obvious with development of PD.

IMPORTANCE: Twenty studies on gut microbiota in PD have been reported, whereas only one study has been reported on iRBD from Germany. iRBD has the highest likelihood ratio to develop PD. Our meta-analysis of iRBD in Japan and Germany revealed increased mucin-layer-degrading genus Akkermansia in iRBD. Genus Akkermansia may increase the intestinal permeability, as we previously observed in PD patients, and may make the intestinal neural plexus exposed to oxidative stress, which can lead to abnormal aggregation of prion-like α-synuclein fibrils in the intestine. In contrast to PD, SCFA-producing bacteria were not decreased in iRBD. As SCFA induces regulatory T (Treg) cells, a decrease of SCFA-producing bacteria may be a prerequisite for the development of PD. We propose that prebiotic and/or probiotic therapeutic strategies to increase the intestinal mucin layer and to increase intestinal SCFA potentially retard the development of iRBD and PD.
Affiliation(s)
- Hiroshi Nishiwaki
  - Division of Neurogenetics, Center for Neurological Diseases and Cancer, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Tomonari Hamaguchi
  - Division of Neurogenetics, Center for Neurological Diseases and Cancer, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Mikako Ito
  - Division of Neurogenetics, Center for Neurological Diseases and Cancer, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Tomohiro Ishida
  - Department of Pathophysiological Laboratory Sciences, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Tetsuya Maeda
  - Division of Neurology and Gerontology, Department of Internal Medicine, School of Medicine, Iwate Medical University, Iwate, Japan
- Yoshio Tsuboi
  - Department of Neurology, Fukuoka University, Fukuoka, Japan
- Jun Ueyama
  - Department of Pathophysiological Laboratory Sciences, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Teppei Shimamura
  - Division of Systems Biology, Center for Neurological Diseases and Cancer, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Hiroshi Mori
  - Genome Evolution Laboratory, Department of Informatics, National Institute of Genetics, Mishima, Japan
- Ken Kurokawa
  - Genome Evolution Laboratory, Department of Informatics, National Institute of Genetics, Mishima, Japan
- Masahisa Katsuno
  - Department of Neurology, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Masaaki Hirayama
  - Department of Pathophysiological Laboratory Sciences, Nagoya University Graduate School of Medicine, Nagoya, Japan
- Kinji Ohno
  - Division of Neurogenetics, Center for Neurological Diseases and Cancer, Nagoya University Graduate School of Medicine, Nagoya, Japan

6. Virkus S, Garoufallou E. Data science and its relationship to library and information science: a content analysis. Data Technologies and Applications 2020. [DOI: 10.1108/dta-07-2020-0167] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/17/2022]
Abstract
Purpose: The purpose of this paper is to present the results of a study exploring the emerging field of data science from the library and information science (LIS) perspective. Design/methodology/approach: A content analysis was conducted of research publications on data science indexed in the Web of Science database to identify the main themes discussed in the publications from the LIS perspective. Findings: A content analysis of 80 publications is presented. The articles belonged to six broad categories: data science education and training; knowledge and skills of the data professional; the role of libraries and librarians in the data science movement; tools, techniques and applications of data science; data science from the knowledge management perspective; and data science from the perspective of health sciences. The category of tools, techniques and applications of data science was most addressed by the authors, followed by data science from the perspective of health sciences, data science education and training, and knowledge and skills of the data professional. However, several publications fell into more than one category because these topics were closely related. Research limitations/implications: Only publications recorded in the Web of Science database and with the term “data science” in the topic area were analyzed. Therefore, several relevant studies are not discussed in this paper that either were related to other keywords such as “e-science”, “e-research”, “data service”, “data curation”, “research data management” or “scientific data management” or were not present in the Web of Science database. Originality/value: The paper provides the first exploration by content analysis of the field of data science from the perspective of LIS.

7. Stevens L, Kao D, Hall J, Görg C, Abdo K, Linstead E. ML-MEDIC: A Preliminary Study of an Interactive Visual Analysis Tool Facilitating Clinical Applications of Machine Learning for Precision Medicine. Applied Sciences (Basel, Switzerland) 2020; 10:3309. [PMID: 33664984 PMCID: PMC7928533 DOI: 10.3390/app10093309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Accessible interactive tools that integrate machine learning methods with clinical research and reduce the programming experience required are needed to move science forward. Here, we present Machine Learning for Medical Exploration and Data-Inspired Care (ML-MEDIC), a point-and-click, interactive tool with a visual interface for facilitating machine learning and statistical analyses in clinical research. We deployed ML-MEDIC in the American Heart Association (AHA) Precision Medicine Platform to provide secure internet access and facilitate collaboration. ML-MEDIC's efficacy for facilitating the adoption of machine learning was evaluated through two case studies in collaboration with clinical domain experts. A domain expert review was also conducted to obtain an impression of the usability and potential limitations.
Affiliation(s)
- Laura Stevens
  - Department of Cardiology, University of Colorado Medical School, Aurora, CO 80045, USA
  - Cardiovascular Medicine, Institute for Precision Cardiovascular Medicine at the American Heart Association, Dallas, TX 75231, USA
- David Kao
  - Department of Cardiology, University of Colorado Medical School, Aurora, CO 80045, USA
- Jennifer Hall
  - Cardiovascular Medicine, Institute for Precision Cardiovascular Medicine at the American Heart Association, Dallas, TX 75231, USA
- Carsten Görg
  - Department of Cardiology, University of Colorado Medical School, Aurora, CO 80045, USA
- Kaitlyn Abdo
  - Electrical Engineering and Computer Science, Chapman University, Orange, CA 92866, USA
- Erik Linstead
  - Electrical Engineering and Computer Science, Chapman University, Orange, CA 92866, USA

8. Ulfenborg B. Vertical and horizontal integration of multi-omics data with miodin. BMC Bioinformatics 2019; 20:649. [PMID: 31823712 PMCID: PMC6902525 DOI: 10.1186/s12859-019-3224-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 03/14/2019] [Accepted: 11/14/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Studies on multiple modalities of omics data such as transcriptomics, genomics and proteomics are growing in popularity, since they allow us to investigate complex mechanisms across molecular layers. It is widely recognized that integrative omics analysis holds the promise to unlock novel and actionable biological insights into health and disease. Integration of multi-omics data remains challenging, however, and requires combination of several software tools and extensive technical expertise to account for the properties of heterogeneous data. RESULTS: This paper presents the miodin R package, which provides a streamlined workflow-based syntax for multi-omics data analysis. The package allows users to perform analysis of omics data either across experiments on the same samples (vertical integration), or across studies on the same variables (horizontal integration). Workflows have been designed to promote transparent data analysis and reduce the technical expertise required to perform low-level data import and processing. CONCLUSIONS: The miodin package is implemented in R and is freely available for use and extension under the GPL-3 license. Package source, reference documentation and user manual are available at https://gitlab.com/algoromics/miodin.
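
The distinction the abstract of entry 8 draws between vertical and horizontal integration can be made concrete with a small sketch. This is not the miodin API (miodin is an R package); the function names, sample identifiers, and values below are hypothetical and only illustrate the two definitions given in the abstract: vertical integration joins different modalities on shared samples, while horizontal integration stacks different studies on shared variables.

```python
from typing import Dict, Optional, Set

# Each table maps a sample ID to {variable name: value}.
# All identifiers and values are illustrative toy data.
Table = Dict[str, Dict[str, float]]


def integrate_vertical(*modalities: Table) -> Table:
    """Merge modalities measured on the same samples (join on sample ID)."""
    shared_samples = set(modalities[0])
    for table in modalities[1:]:
        shared_samples &= set(table)
    merged: Table = {}
    for sample in sorted(shared_samples):
        merged[sample] = {}
        for table in modalities:
            merged[sample].update(table[sample])
    return merged


def integrate_horizontal(*studies: Table) -> Table:
    """Stack studies measuring the same variables (keep shared variables)."""
    shared_vars: Optional[Set[str]] = None
    for table in studies:
        for values in table.values():
            names = set(values)
            shared_vars = names if shared_vars is None else shared_vars & names
    stacked: Table = {}
    for index, table in enumerate(studies, start=1):
        for sample, values in table.items():
            # Prefix with the study number so sample IDs cannot collide.
            stacked[f"study{index}:{sample}"] = {
                name: values[name] for name in sorted(shared_vars or ())
            }
    return stacked


rna = {"s1": {"geneA_expr": 2.1}, "s2": {"geneA_expr": 1.4}}
protein = {"s1": {"protA_abund": 0.8}, "s3": {"protA_abund": 0.5}}
vertical = integrate_vertical(rna, protein)  # only s1 is in both modalities

study1 = {"p1": {"geneA_expr": 2.0, "geneB_expr": 0.3}}
study2 = {"q1": {"geneA_expr": 1.7}}
horizontal = integrate_horizontal(study1, study2)  # only geneA_expr is shared
```

In this toy version both operations are intersections followed by a merge; what differs is the axis on which the intersection is taken (samples versus variables).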

9. Coiera E, Ammenwerth E, Georgiou A, Magrabi F. Does health informatics have a replication crisis? J Am Med Inform Assoc 2019; 25:963-968. [PMID: 29669066 PMCID: PMC6077781 DOI: 10.1093/jamia/ocy028] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Received: 12/22/2017] [Accepted: 03/13/2018] [Indexed: 01/27/2023] Open
Abstract
Objective: Many research fields, including psychology and basic medical sciences, struggle with poor reproducibility of reported studies. Biomedical and health informatics is unlikely to be immune to these challenges. This paper explores replication in informatics and the unique challenges the discipline faces. Methods: Narrative review of recent literature on research replication challenges. Results: While there is growing interest in re-analysis of existing data, experimental replication studies appear uncommon in informatics. Context effects are a particular challenge as they make ensuring replication fidelity difficult, and the same intervention will never quite reproduce the same result in different settings. Replication studies take many forms, trading off testing validity of past findings against testing generalizability. Exact and partial replication designs emphasize testing validity while quasi and conceptual studies test generalizability of an underlying model or hypothesis with different methods or in a different setting. Conclusions: The cost of poor replication is a weakening in the quality of published research and the evidence-based foundation of health informatics. The benefits of replication include increased rigor in research, and the development of evaluation methods that distinguish the impact of context and the nonreproducibility of research. Taking replication seriously is essential if biomedical and health informatics is to be an evidence-based discipline.
Affiliation(s)
- Enrico Coiera
  - Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Elske Ammenwerth
  - University for Health Sciences, Medical Informatics and Technology, Austria
- Andrew Georgiou
  - Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia
- Farah Magrabi
  - Australian Institute of Health Innovation, Macquarie University, NSW 2109, Australia

10. Building Containerized Workflows Using the BioDepot-Workflow-Builder. Cell Syst 2019; 9:508-514.e3. [PMID: 31521606 DOI: 10.1016/j.cels.2019.08.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 12/29/2018] [Revised: 05/21/2019] [Accepted: 08/16/2019] [Indexed: 11/22/2022]
Abstract
We present the BioDepot-workflow-builder (Bwb), a software tool that allows users to create and execute reproducible bioinformatics workflows using a drag-and-drop interface. Graphical widgets represent Docker containers executing a modular task. Widgets are linked graphically to build bioinformatics workflows that can be reproducibly deployed across different local and cloud platforms. Each widget contains a form-based user interface to facilitate parameter entry and a console to display intermediate results. Bwb provides tools for rapid customization of widgets, containers, and workflows. Saved workflows can be shared using Bwb's native format or exported as shell scripts.

11. Rodríguez-Pérez H, Hernández-Beeftink T, Lorenzo-Salazar JM, Roda-García JL, Pérez-González CJ, Colebrook M, Flores C. NanoDJ: a Dockerized Jupyter notebook for interactive Oxford Nanopore MinION sequence manipulation and genome assembly. BMC Bioinformatics 2019; 20:234. [PMID: 31072312 PMCID: PMC6509807 DOI: 10.1186/s12859-019-2860-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/22/2018] [Accepted: 04/29/2019] [Indexed: 12/23/2022] Open
Abstract
Background: The Oxford Nanopore Technologies (ONT) MinION portable sequencer makes it possible to use cutting-edge genomic technologies in the field and the academic classroom. Results: We present NanoDJ, a Jupyter notebook integration of tools for simplified manipulation and assembly of DNA sequences produced by ONT devices. It integrates basecalling, read trimming and quality control, simulation and plotting routines with a variety of widely used aligners and assemblers, including procedures for hybrid assembly. Conclusions: Through Jupyter-facilitated access to self-explanatory application contents, interactive visualization of results, and distribution as a Docker software container, NanoDJ aims to simplify ONT DNA sequence analysis and make it more reproducible. The NanoDJ package code, documentation and installation instructions are freely available at https://github.com/genomicsITER/NanoDJ.
Affiliation(s)
- Héctor Rodríguez-Pérez
  - Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
- Tamara Hernández-Beeftink
  - Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
- José M Lorenzo-Salazar
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
- José L Roda-García
  - Departamento de Ingeniería Informática y de Sistemas, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
- Carlos J Pérez-González
  - Departamento de Matemáticas, Estadística e Investigación Operativa, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
- Marcos Colebrook
  - Departamento de Ingeniería Informática y de Sistemas, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
- Carlos Flores
  - Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Universidad de La Laguna, Santa Cruz de Tenerife, Spain
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
  - CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain

12. Brennan PF, Chiang MF, Ohno-Machado L. Biomedical informatics and data science: evolving fields with significant overlap. J Am Med Inform Assoc 2019; 25:2-3. [PMID: 29267964 DOI: 10.1093/jamia/ocx146] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/14/2022] Open
Affiliation(s)
- Patricia Flatley Brennan
  - 9500 Gilman Dr, MC 0728, La Jolla, CA 92093, USA. Phone: 858-822-4931; Fax: 858-822-7685
- Michael F Chiang
  - 9500 Gilman Dr, MC 0728, La Jolla, CA 92093, USA. Phone: 858-822-4931; Fax: 858-822-7685
- Lucila Ohno-Machado
  - 9500 Gilman Dr, MC 0728, La Jolla, CA 92093, USA. Phone: 858-822-4931; Fax: 858-822-7685

13. Kulkarni N, Alessandrì L, Panero R, Arigoni M, Olivero M, Ferrero G, Cordero F, Beccuti M, Calogero RA. Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines. BMC Bioinformatics 2018; 19:349. [PMID: 30367595 PMCID: PMC6191970 DOI: 10.1186/s12859-018-2296-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Indexed: 12/11/2022] Open
Abstract
Background: Reproducibility of research is a key element of modern science and is mandatory for any industrial application. It represents the ability to replicate an experiment independently of the location and the operator. Therefore, a study can be considered reproducible only if all the data used are available and the computational analysis workflow is clearly described. However, for reproducing a complex bioinformatics analysis today, the raw data and the list of tools used in the workflow may not be enough to guarantee the reproducibility of the results obtained. Indeed, different releases of the same tools and/or of the system libraries used by such tools might lead to subtle reproducibility issues. Results: To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit and open-source project whose aim is to provide a schema and an infrastructure, based on Docker images and R packages, to deliver reproducible results in bioinformatics. One or more Docker images are defined for a workflow (typically one for each task), while the workflow implementation is handled via R functions embedded in a package available in a GitHub repository. Thus, a bioinformatician participating in the project first has to integrate her/his workflow modules into Docker image(s), using an Ubuntu Docker image developed ad hoc by RBP to make this task easier. Second, the workflow implementation must be realized in R according to an R skeleton function made available by RBP to guarantee homogeneity and reusability among different RBP functions. Moreover, she/he has to provide an R vignette explaining the package functionality, together with an example dataset that can be used to improve user confidence in using the workflow. Conclusions: The Reproducible Bioinformatics Project provides a general schema and an infrastructure to distribute robust and reproducible workflows. Thus, it guarantees end users the ability to consistently repeat any analysis independently of the UNIX-like architecture used.
Affiliation(s)
- Neha Kulkarni
  - Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Luca Alessandrì
  - Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Riccardo Panero
  - Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Maddalena Arigoni
  - Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Martina Olivero
  - Department of Oncology, University of Torino, Candiolo, Italy
- Giulio Ferrero
  - Department of Computer Sciences, University of Torino, Torino, Italy
- Francesca Cordero
  - Department of Computer Sciences, University of Torino, Torino, Italy
- Marco Beccuti
  - Department of Computer Sciences, University of Torino, Torino, Italy
- Raffaele A Calogero
  - Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy