1. Zulfiqar M, Crusoe MR, König-Ries B, Steinbeck C, Peters K, Gadelha L. Implementation of FAIR Practices in Computational Metabolomics Workflows-A Case Study. Metabolites 2024; 14:118. PMID: 38393009; PMCID: PMC10891576; DOI: 10.3390/metabo14020118.
Abstract
Scientific workflows facilitate the automation of data analysis tasks by integrating various software tools executed in a particular order. To make workflows transparent and reusable, it is essential to implement the FAIR principles. Here, we describe our experiences implementing the FAIR principles for metabolomics workflows, using the Metabolome Annotation Workflow (MAW) as a case study. MAW is specified in the Common Workflow Language (CWL), allowing the workflow to be executed subsequently on different workflow engines. MAW is registered on WorkflowHub using its CWL description; during submission, the CWL description is used to package MAW with the Workflow RO-Crate profile, which includes metadata in Bioschemas. Researchers can use this narrative discussion as a guideline for adopting FAIR practices in their own bioinformatics or cheminformatics workflows, with amendments specific to their research area.
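A CWL description like the one MAW registers on WorkflowHub is, at its core, a declarative document. The sketch below builds a minimal single-step workflow in CWL's JSON syntax (CWL files are more commonly YAML, but JSON is equally valid CWL); the step, tool, and file names are hypothetical illustrations, not MAW's actual steps.

```python
import json

# Minimal CWL v1.2 workflow in JSON syntax. One step ("annotate") consumes
# the workflow input and produces the workflow output. All names here are
# hypothetical placeholders, not taken from MAW.
def make_workflow():
    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        "inputs": {"spectra": "File"},
        "outputs": {
            "annotations": {"type": "File",
                            "outputSource": "annotate/result"},
        },
        "steps": {
            "annotate": {
                "run": "annotate_tool.cwl",   # hypothetical tool description
                "in": {"input_file": "spectra"},
                "out": ["result"],
            },
        },
    }

doc = json.dumps(make_workflow(), indent=2)
```

A description in this shape can then be executed unchanged by any CWL-aware engine, which is what makes the workflow portable across execution environments.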
Affiliation(s)
- Mahnoor Zulfiqar
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
- Michael R. Crusoe
  - ELIXIR (The European Life-Sciences Infrastructure for Biological Information) Germany, Institute of Bio- and Geosciences (IBG-5)—Computational Metagenomics, Forschungszentrum Jülich GmbH, 52428 Jülich, Germany
- Birgitta König-Ries
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Institute for Informatics, Friedrich Schiller University Jena, 07743 Jena, Germany
  - iDiv—German Centre for Integrative Biodiversity Research, Halle-Jena-Leipzig, 04103 Leipzig, Germany
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
- Kristian Peters
  - iDiv—German Centre for Integrative Biodiversity Research, Halle-Jena-Leipzig, 04103 Leipzig, Germany
  - Geobotany and Botanical Gardens, Martin-Luther University of Halle-Wittenberg, 06108 Halle, Germany
  - Leibniz Institute of Plant Biochemistry, 06120 Halle, Germany
- Luiz Gadelha
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Institute for Informatics, Friedrich Schiller University Jena, 07743 Jena, Germany
  - German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
2. Parvizpour S, Beyrampour-Basmenj H, Razmara J, Farhadi F, Shamsir MS. Cancer treatment comes of age: from one-size-fits-all to next-generation sequencing (NGS) technologies. Bioimpacts 2023; 14:29957. PMID: 39104623; PMCID: PMC11298019; DOI: 10.34172/bi.2023.29957.
Abstract
Cancer is one of the leading causes of death worldwide and one of the greatest challenges in extending life expectancy. The paradigm of one-size-fits-all medicine has already given way to the stratification of patients by disease subtype, clinical characteristics, and biomarkers (stratified medicine). The introduction of next-generation sequencing (NGS) in clinical oncology has made it possible to tailor therapy to each cancer patient's molecular profile. NGS is expected to lead the transition to precision medicine (PM), in which the right therapeutic approach is chosen for each patient based on their characteristics and mutations. Here, we highlight how NGS technology facilitates cancer treatment. First, precision medicine and NGS technology are reviewed, and the NGS revolution in precision medicine is described. Next, the role of NGS in oncology and its existing limitations are discussed. The databases, bioinformatics tools, and online servers used in NGS data analysis are also reviewed. The review ends with concluding remarks.
Affiliation(s)
- Sepideh Parvizpour
  - Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran
  - Department of Medical Biotechnology, School of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Hanieh Beyrampour-Basmenj
  - Department of Medical Biotechnology, School of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Jafar Razmara
  - Department of Computer Science, Faculty of Mathematics, Statistics and Computer Science, University of Tabriz, Tabriz, Iran
- Farhad Farhadi
  - Food and Drug Administration, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohd Shahir Shamsir
  - Bioinformatics Research Group, Faculty of Science, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
3. Du X, Dastmalchi F, Diller MA, Brochhausen M, Garrett TJ, Hogan WR, Lemas DJ. An Automated Workflow Composition System for Liquid Chromatography-Mass Spectrometry Metabolomics Data Processing. J Am Soc Mass Spectrom 2023; 34:2857-2863. PMID: 37874901; DOI: 10.1021/jasms.3c00248.
Abstract
Liquid chromatography-mass spectrometry (LC-MS) metabolomics studies produce high-dimensional data that must be processed by a complex network of informatics tools to generate analysis-ready data sets. As the first computational step in metabolomics, data processing poses an increasing challenge: researchers must develop customized computational workflows applicable to LC-MS metabolomics analysis. Ontology-based automated workflow composition (AWC) systems provide a feasible approach to developing computational workflows that consume high-dimensional molecular data. We used the Automated Pipeline Explorer (APE) to create an AWC system for LC-MS metabolomics data processing across three use cases. APE predicted 145 data processing workflows across the three use cases, among which we identified six traditional and six novel workflows. Through manual review, we found that one-third of the novel workflows were executable, meaning the data processing completed without error. When selecting the top six workflows from each use case, the computationally viable rate of the predicted workflows reached 45%. Collectively, our study demonstrates the feasibility of developing an AWC system for LC-MS metabolomics data processing.
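The reported viability rate follows from a simple calculation: the top six predicted workflows from each of the three use cases were reviewed, and roughly 45% of them were computationally viable. The sketch below reproduces that arithmetic; the viable count is an assumed illustration consistent with the reported rate, not a figure from the paper.

```python
# Back-of-envelope check of the reported viability rate. The viable count
# below is assumed for illustration (8/18 is approximately 45%).
use_cases = 3
top_per_use_case = 6
reviewed = use_cases * top_per_use_case   # 18 workflows reviewed in total
viable = 8                                # assumed count, not from the paper
rate = viable / reviewed
print(f"{reviewed} reviewed, viability rate {rate:.0%}")
```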
Affiliation(s)
- Xinsong Du
  - Division of General Internal Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, United States
  - Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, United States
- Farhad Dastmalchi
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Matthew A Diller
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Mathias Brochhausen
  - Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205, United States
- Timothy J Garrett
  - Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- William R Hogan
  - Data Science Institute, Medical College of Wisconsin, Milwaukee, Wisconsin 53226, United States
- Dominick J Lemas
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
  - Department of Obstetrics and Gynecology, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
  - Center for Perinatal Outcomes Research, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
4. Johns M, Meurers T, Wirth FN, Haber AC, Müller A, Halilovic M, Balzer F, Prasser F. Data Provenance in Biomedical Research: Scoping Review. J Med Internet Res 2023; 25:e42289. PMID: 36972116; PMCID: PMC10132013; DOI: 10.2196/42289.
Abstract
BACKGROUND Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility and quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest in data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research. OBJECTIVE The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities and design of the provenance technologies used; and identifying gaps in the literature that could provide opportunities for future research on technologies suited to wider adoption. METHODS Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures. RESULTS We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes. We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. An important gap we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV. CONCLUSIONS The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.
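The PROV standard mentioned in the results models provenance with three core node types (entities, activities, agents) and relations among them. A minimal sketch of that pattern follows, serialized as plain JSON rather than a formal PROV-JSON document; the identifiers are hypothetical.

```python
import json

# Core W3C PROV pattern: an activity "used" an input entity, an output
# entity "wasGeneratedBy" that activity, and the activity
# "wasAssociatedWith" an agent. All "ex:" identifiers are hypothetical.
record = {
    "entity": {
        "ex:raw_data": {"prov:label": "raw measurements"},
        "ex:clean_data": {"prov:label": "cleaned measurements"},
    },
    "activity": {
        "ex:cleaning": {"prov:label": "data cleaning step"},
    },
    "agent": {
        "ex:analyst": {"prov:type": "prov:Person"},
    },
    "used": {
        "_u1": {"prov:activity": "ex:cleaning", "prov:entity": "ex:raw_data"},
    },
    "wasGeneratedBy": {
        "_g1": {"prov:entity": "ex:clean_data", "prov:activity": "ex:cleaning"},
    },
    "wasAssociatedWith": {
        "_a1": {"prov:activity": "ex:cleaning", "prov:agent": "ex:analyst"},
    },
}

serialized = json.dumps(record, indent=2)
```

Even this small structure lets a consumer answer the questions the review cares about: which inputs produced a given output, and who ran the step that produced it.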
Affiliation(s)
- Marco Johns
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Thierry Meurers
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Felix N Wirth
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Anna C Haber
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Armin Müller
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Mehmed Halilovic
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Felix Balzer
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Fabian Prasser
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
5. Shao D, Kellogg GD, Nematbakhsh A, Kuntala PK, Mahony S, Pugh BF, Lai WKM. PEGR: a flexible management platform for reproducible epigenomic and genomic research. Genome Biol 2022; 23:99. PMID: 35440038; PMCID: PMC9016988; DOI: 10.1186/s13059-022-02671-5.
Abstract
Reproducibility is a significant challenge in (epi)genomic research due to the complexity of experiments composed of traditional biochemistry and informatics. Recent advances have exacerbated this as high-throughput sequencing data is generated at an unprecedented pace. Here, we report the development of a Platform for Epi-Genomic Research (PEGR), a web-based project management platform that tracks and quality controls experiments from conception to publication-ready figures, compatible with multiple assays and bioinformatic pipelines. It supports rigor and reproducibility for biochemists working at the bench, while fully supporting reproducibility and reliability for bioinformaticians through integration with the Galaxy platform.
Affiliation(s)
- Danying Shao
  - Institute for Computational and Data Sciences, Pennsylvania State University, University Park, PA, 16802, USA
- Gretta D Kellogg
  - Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, 14850, USA
- Ali Nematbakhsh
  - Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, 14850, USA
- Prashant K Kuntala
  - Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, 16802, USA
- Shaun Mahony
  - Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, 16802, USA
- B Franklin Pugh
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, 14850, USA
- William K M Lai
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, 14850, USA
  - Department of Computational Biology, Cornell University, Ithaca, NY, 14850, USA
6. RESCRIPt: Reproducible sequence taxonomy reference database management. PLoS Comput Biol 2021; 17:e1009581. PMID: 34748542; PMCID: PMC8601625; DOI: 10.1371/journal.pcbi.1009581.
Abstract
Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

Generating and managing sequence and taxonomy reference data presents a bottleneck to many researchers, whether they are generating custom databases or attempting to format existing, curated reference databases for use with standard sequence analysis tools. Evaluating database quality and choosing the “best” database can be an equally formidable challenge. We developed RESCRIPt to alleviate this bottleneck, supporting reproducible, streamlined generation, curation, and evaluation of reference sequence databases. RESCRIPt uses QIIME 2 artifact file formats, which store all processing steps as data provenance within each file, allowing researchers to retrace the computational steps used to generate any given file. We used RESCRIPt to benchmark several commonly used marker-gene sequence databases for 16S rRNA genes, ITS, and COI sequences, demonstrating both its utility for streamlining the use of these databases and its ability to evaluate several qualitative and quantitative characteristics of each database. We show that larger databases are not always best, and that curation steps to reduce redundancy and filter out noisy sequences may be beneficial for some applications. We anticipate that RESCRIPt will streamline the use, management, and evaluation and selection of reference database materials for microbiomics, diet metabarcoding, eDNA, and other diverse applications.
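The curation (quality filtering) step the authors find beneficial can be illustrated with a generic sketch: drop reference sequences that are too short or contain too many ambiguous bases. This is not RESCRIPt's API, only the kind of filter it applies; the thresholds and the toy database are assumptions.

```python
# Generic reference-database quality filter (illustrative, not RESCRIPt's
# implementation): keep sequences meeting a minimum length and a maximum
# number of ambiguous (non-ACGT) bases. Thresholds are assumed defaults.
def quality_filter(db, min_len=1200, max_ambiguous=5):
    """Return the entries of db (id -> sequence) that pass both thresholds."""
    kept = {}
    for seq_id, seq in db.items():
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if len(seq) >= min_len and ambiguous <= max_ambiguous:
            kept[seq_id] = seq
    return kept

# Toy database: one clean full-length entry, one short entry, one noisy entry.
db = {"good": "ACGT" * 400,     # 1600 nt, unambiguous
      "short": "ACGT" * 10,     # 40 nt, too short
      "noisy": "ACGN" * 400}    # 1600 nt, 400 ambiguous bases
print(sorted(quality_filter(db)))   # ['good']
```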
7. Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat Commun 2021; 12:5797. PMID: 34608132; PMCID: PMC8490371; DOI: 10.1038/s41467-021-25974-w.
Abstract
Reproducibility is essential to open science, as findings that cannot be reproduced by independent research groups have limited relevance, regardless of their validity. It is therefore crucial for scientists to describe their experiments in sufficient detail that they can be reproduced, scrutinized, challenged, and built upon. However, the intrinsic complexity and continuous growth of biomedical data make it increasingly difficult to process, analyze, and share with the community in a FAIR (findable, accessible, interoperable, and reusable) manner. To overcome these issues, we created a cloud-based platform called ORCESTRA (orcestra.ca), which provides a flexible framework for the reproducible processing of multimodal biomedical data. It enables processing of clinical, genomic, and perturbation profiles of cancer samples through automated, user-customizable processing pipelines. ORCESTRA creates integrated and fully documented data objects with persistent identifiers (DOIs) and manages multiple dataset versions, which can be shared for future studies.
8. Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. PMID: 34556866; DOI: 10.1038/s41592-021-01254-9.
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
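The core service such workflow managers provide can be shown with a toy dependency resolver: each task runs only after its prerequisites, and tasks already marked done are skipped. This is a deliberately simplified sketch with hypothetical task names; real managers layer containerized software, resource scheduling, and result caching on top of exactly this idea.

```python
# Toy workflow runner: execute callables in dependency order, skipping
# anything already in `done`. Task names below are hypothetical.
def run_workflow(tasks, deps, done=None):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    done = set() if done is None else done
    order = []

    def visit(name):
        if name in done:
            return
        for prerequisite in deps.get(name, []):
            visit(prerequisite)          # run prerequisites first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    return order

log = []
tasks = {"align": lambda: log.append("align"),
         "call_variants": lambda: log.append("call_variants"),
         "qc": lambda: log.append("qc")}
deps = {"align": ["qc"], "call_variants": ["align"]}

order_out = run_workflow(tasks, deps)
print(order_out)   # ['qc', 'align', 'call_variants']
```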
Collapse
Affiliation(s)
| | | | - Jonathan Göke
- Genome Institute of Singapore, Singapore, Singapore.
| |
9. Kohls M, Saremi B, Muchsin I, Fischer N, Becher P, Jung K. A resampling strategy for studying robustness in virus detection pipelines. Comput Biol Chem 2021; 94:107555. PMID: 34364046; DOI: 10.1016/j.compbiolchem.2021.107555.
Abstract
Next-generation sequencing is regularly used to identify viral sequences in DNA or RNA samples of infected hosts. A major step in most pipelines for virus detection is to map sequence reads against known virus genomes. Due to small differences between the sequences of related viruses, and due to various biological or technical errors, mapping is subject to uncertainty. As a consequence, the resulting list of detected viruses can lack robustness. We propose a new approach for generating artificial sequencing reads, together with a strategy of resampling from the original findings, that helps assess the robustness of the originally identified list of viruses. From the original mapping result, in the form of a SAM file, a set of statistical distributions is derived. These are used in the resampling pipeline to generate new artificial reads, which are again mapped against the reference genomes. By summarizing the resampling procedure, the analyst learns whether the evidence for the presence of a particular virus in the sample gains or loses strength, and thus about the robustness of the original mapping list as well as of individual viruses in this list. To judge robustness, several indicators are derived from the resampling procedure, such as the correlation between original and resampled read counts, or the statistical detection of outliers in the differences of read counts. Graphical illustrations of read count shifts via Sankey diagrams are also provided. To demonstrate the new approach, we apply it to three real-world data samples, one of them with laboratory-confirmed influenza sequences, and to artificially generated data in which virus sequences have been spiked into the sequencing data of a host. Applying the resampling pipeline, several viruses drop from the original list while new viruses emerge, demonstrating the robustness of the viruses that remain in the list. Our evaluation shows that the resampling approach is helpful for analyzing the viral content of a biological sample, rating the robustness of the original findings, and better showing the overall distribution of findings. The method is also applicable to other virus detection pipelines based on read mapping.
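The resampling idea can be sketched without any read mapping (this is not the authors' code): treat the original per-virus read counts as a multinomial distribution, redraw the same number of reads, and correlate resampled against original counts. Consistently high correlation across replicates suggests the original mapping is robust; the counts below are made up.

```python
import random

# Resampling sketch: redraw reads from the empirical per-virus frequencies
# and compare counts. The virus names and counts are invented examples.
random.seed(7)

original = {"virus_A": 900, "virus_B": 80, "virus_C": 20}
viruses = list(original)
total = sum(original.values())
weights = [original[v] / total for v in viruses]

def pearson(xs, ys):
    """Plain Pearson correlation, avoiding any library dependency."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

def resample_counts():
    draw = random.choices(viruses, weights=weights, k=total)
    return [draw.count(v) for v in viruses]

orig_counts = [original[v] for v in viruses]
corrs = [pearson(orig_counts, resample_counts()) for _ in range(5)]
print(min(corrs))   # close to 1 for these strongly separated counts
```

The paper's indicators go further (outlier detection on count differences, Sankey diagrams of count shifts), but this captures the correlation indicator in miniature.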
Affiliation(s)
- Moritz Kohls
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, 30559 Hannover, Germany
- Babak Saremi
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, 30559 Hannover, Germany
- Ihsan Muchsin
  - Institute for Virology and Immunobiology, University of Würzburg, Versbacher Straße 7, 97078 Würzburg, Germany
- Nicole Fischer
  - Institute of Medical Microbiology, Virology and Hygiene, University Medical Center Hamburg-Eppendorf (UKE), Martinistraße 52, 20251 Hamburg, Germany
- Paul Becher
  - Institute of Virology, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17, 30559 Hannover, Germany
- Klaus Jung
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, 30559 Hannover, Germany
10. John A, Muenzen K, Ausmees K. Evaluation of serverless computing for scalable execution of a joint variant calling workflow. PLoS One 2021; 16:e0254363. PMID: 34242357; PMCID: PMC8270184; DOI: 10.1371/journal.pone.0254363.
Abstract
Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.
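The reported scaling is roughly linear: about 3 hours and $2 for 2 samples, up to nearly 13 hours and $70 for 62 samples. A straight line through those two endpoints gives a crude estimator for intermediate cohort sizes; this is an extrapolation of the reported numbers, not a figure from the paper.

```python
# Linear interpolation between the two reported (samples, value) endpoints.
# Intermediate estimates are our extrapolation, not measurements.
def linear(x0, y0, x1, y1):
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)

runtime_h = linear(2, 3.0, 62, 13.0)    # ~3 h at 2 samples, ~13 h at 62
cost_usd = linear(2, 2.0, 62, 70.0)     # ~$2 at 2 samples, ~$70 at 62

# Estimated runtime and cost for a hypothetical 32-sample cohort.
print(round(runtime_h(32), 1), round(cost_usd(32), 1))   # 8.0 36.0
```

The authors also note runtime is dominated by a single task at larger problem sizes, so a per-task model would fit better than this single line; the sketch only illustrates the headline trend.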
Affiliation(s)
- Aji John
  - Department of Biology, University of Washington, Seattle, Washington, United States of America
- Kathleen Muenzen
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, United States of America
- Kristiina Ausmees
  - Department of Information Technology, Uppsala University, Uppsala, Sweden
11. Melendrez MC, Shaw S, Brown CT, Goodner BW, Kvaal C. Editorial: Curriculum Applications in Microbiology: Bioinformatics in the Classroom. Front Microbiol 2021; 12:705233. PMID: 34276638; PMCID: PMC8281245; DOI: 10.3389/fmicb.2021.705233.
Affiliation(s)
- Sophie Shaw
  - Centre for Genome Enabled Biology and Medicine, University of Aberdeen, Aberdeen, United Kingdom
- C Titus Brown
  - Department of Population Health and Reproduction, University of California, Davis, Davis, CA, United States
- Christopher Kvaal
  - Department of Biology, St. Cloud State University, St. Cloud, MN, United States
12. Westbrook A, Varki E, Thomas WK. RepeatFS: a file system providing reproducibility through provenance and automation. Bioinformatics 2021; 37:1292-1296. PMID: 33230554; PMCID: PMC8189677; DOI: 10.1093/bioinformatics/btaa950.
Abstract
Motivation: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions, and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates, and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features that promote analytical transparency and reproducibility, including provenance visualization and task automation. Results: We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations, with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences. Availability and implementation: RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs. Supplementary information: Supplementary data are available at Bioinformatics online.
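RepeatFS itself operates at the file-system level; the sketch below only mimics its verification idea in miniature: fingerprint each output by content hash so that a replicated run can be checked against the recorded provenance. Function and file names are hypothetical.

```python
import hashlib

# Minimal replication check (illustrative, not RepeatFS's implementation):
# record a SHA-256 digest per output, then flag outputs whose content no
# longer matches the recorded digest after a rerun.
def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(provenance: dict, outputs: dict) -> list:
    """Return names of outputs whose hash does not match the record."""
    return [name for name, digest in provenance.items()
            if fingerprint(outputs.get(name, b"")) != digest]

outputs = {"result.txt": b"42\n"}
provenance = {"result.txt": fingerprint(b"42\n")}
print(verify(provenance, outputs))        # [] -> replication matches

outputs["result.txt"] = b"43\n"           # simulate a drifted rerun
print(verify(provenance, outputs))        # ['result.txt']
```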
Affiliation(s)
- W Kelley Thomas
  - Hubbard Center for Genome Studies
  - Department of Molecular Cellular and Biomedical Sciences, University of New Hampshire, Durham, NH 03824, USA
13. Patel JA, Dean DA, King CH, Xiao N, Koc S, Minina E, Golikov A, Brooks P, Kahsay R, Navelkar R, Ray M, Roberson D, Armstrong C, Mazumder R, Keeney J. Bioinformatics tools developed to support BioCompute Objects. Database (Oxford) 2021; 2021:baab008. PMID: 33784373; PMCID: PMC8009203; DOI: 10.1093/database/baab008.
Abstract
Developments in high-throughput sequencing (HTS) result in an exponential increase in the amount of data generated by sequencing experiments, an increase in the complexity of bioinformatics analysis reporting and an increase in the types of data generated. These increases in volume, diversity and complexity of the data generated and their analysis expose the necessity of a structured and standardized reporting template. BioCompute Objects (BCOs) provide the requisite support for communication of HTS data analysis that includes support for workflow, as well as data, curation, accessibility and reproducibility of communication. BCOs standardize how researchers report provenance and the established verification and validation protocols used in workflows while also being robust enough to convey content integration or curation in knowledge bases. BCOs that encapsulate tools, platforms, datasets and workflows are FAIR (findable, accessible, interoperable and reusable) compliant. Providing operational workflow and data information facilitates interoperability between platforms and incorporation of future datasets within an HTS analysis for use within industrial, academic and regulatory settings. Cloud-based platforms, including High-performance Integrated Virtual Environment (HIVE), Cancer Genomics Cloud (CGC) and Galaxy, support BCO generation for users. Given the 100K+ userbase between these platforms, BioCompute can be leveraged for workflow documentation. In this paper, we report the availability of platform-dependent and platform-independent BCO tools: HIVE BCO App, CGC BCO App, Galaxy BCO API Extension and BCO Portal. Community engagement was utilized to evaluate tool efficacy. We demonstrate that these tools further advance BCO creation from text editing approaches used in earlier releases of the standard. Moreover, we demonstrate that integrating BCO generation within existing analysis platforms greatly streamlines BCO creation while capturing granular workflow details. We also demonstrate that the BCO tools described in the paper provide an approach to solve the long-standing challenge of standardizing workflow descriptions that are both human and machine readable while accommodating manual and automated curation with evidence tagging. Database URL: https://www.biocomputeobject.org/resources.
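Under the hood, a BCO is a JSON record organized into named domains. The Python sketch below is illustrative only: it loosely follows the IEEE 2791-2020 domain layout (provenance, description and execution domains, plus an etag-style digest) but is not a validated BCO, and the pipeline step names and tool URI are hypothetical, not taken from the paper.

```python
import hashlib
import json

def make_minimal_bco(name, version, steps, tool_uris):
    """Assemble a minimal BioCompute-Object-like record.

    Illustrative sketch: the keys mirror the IEEE 2791-2020 domains
    at a high level, but this is not a standards-compliant BCO.
    """
    bco = {
        "provenance_domain": {"name": name, "version": version},
        "description_domain": {
            "pipeline_steps": [
                {"step_number": i, "name": s} for i, s in enumerate(steps, 1)
            ]
        },
        "execution_domain": {"software_prerequisites": tool_uris},
    }
    # A content digest makes the record self-verifying, in the spirit
    # of the BCO "etag" field.
    digest = hashlib.sha256(
        json.dumps(bco, sort_keys=True).encode()
    ).hexdigest()
    bco["etag"] = digest
    return bco

bco = make_minimal_bco(
    "HTS variant calling", "1.0",
    ["trim reads", "align", "call variants"],
    ["https://example.org/tools/aligner"],  # hypothetical URI
)
```

Because the digest is computed over a canonical (sorted-key) serialization, any platform can recompute it to check that a shared record has not drifted.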
Affiliation(s)
- Janisha A Patel
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Charles Hadley King
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- The McCormick Genomic and Proteomic Center, The George Washington University, Washington, DC 20037, USA
- Nan Xiao
- Seven Bridges, Charlestown, MA 02129, USA
- Soner Koc
- Seven Bridges, Charlestown, MA 02129, USA
- Ekaterina Minina
- CBER-HIVE, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20993, USA
- Anton Golikov
- CBER-HIVE, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20993, USA
- Robel Kahsay
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Rahi Navelkar
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Chris Armstrong
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Raja Mazumder
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- The McCormick Genomic and Proteomic Center, The George Washington University, Washington, DC 20037, USA
- Jonathon Keeney
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
14
Balagurunathan Y, Mitchell R, El Naqa I. Requirements and reliability of AI in the medical context. Phys Med 2021; 83:72-78. [PMID: 33721700 PMCID: PMC8915137 DOI: 10.1016/j.ejmp.2021.02.024] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/07/2020] [Revised: 02/04/2021] [Accepted: 02/23/2021] [Indexed: 12/12/2022] Open
Abstract
The digital information age has been a catalyst in creating a renewed interest in Artificial Intelligence (AI) approaches, especially the subclass of computer algorithms that are popularly grouped into Machine Learning (ML). These methods have allowed us to go beyond limited human cognitive ability in understanding the complexity of high-dimensional data. Medical sciences have seen steady use of these methods but have been slow to adopt them to improve patient care. Some significant impediments have diluted this effort, including the availability of curated, diverse data sets for model building, reliable human-level interpretation of these models, and reliable reproducibility of these methods for routine clinical use. Each of these aspects has several limiting conditions that need to be balanced against the data/model-building effort, clinical implementation, and the integration cost of the translational effort with minimal patient-level harm, all of which may directly impact future clinical adoption. In this review paper, we assess each aspect of the problem in the context of reliable use of ML methods in oncology, as a representative case study, with the goal of safeguarding utility and improving patient care in medicine in general.
Affiliation(s)
- Ross Mitchell
- Department of Machine Learning, H. Lee Moffitt Cancer Center, Tampa, FL, USA; Health Data Services, H. Lee Moffitt Cancer Center, Tampa, FL, USA.
- Issam El Naqa
- Department of Machine Learning, H. Lee Moffitt Cancer Center, Tampa, FL, USA.
15
Moossavi S, Fehr K, Khafipour E, Azad MB. Repeatability and reproducibility assessment in a large-scale population-based microbiota study: case study on human milk microbiota. Microbiome 2021; 9:41. [PMID: 33568231 PMCID: PMC7877029 DOI: 10.1186/s40168-020-00998-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/20/2020] [Accepted: 12/29/2020] [Indexed: 06/12/2023]
Abstract
BACKGROUND Quality control, including assessment of batch variability and confirmation of repeatability and reproducibility, is an integral component of high-throughput omics studies, including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. RESULTS In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. CONCLUSION This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.
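Stage (2) of the framework hinges on prevalence: reagent contaminants tend to appear more consistently in negative controls than in biological samples. The Python sketch below illustrates only that prevalence idea; it is not the decontam algorithm (which scores taxa with a chi-squared statistic), and the taxon names and counts are invented.

```python
def flag_prevalence_contaminants(sample_counts, control_counts):
    """Flag taxa more prevalent in negative controls than in samples.

    Simplified stand-in for prevalence-based contaminant detection:
    a taxon is flagged when its presence rate in blanks exceeds its
    presence rate in biological samples.
    """
    def prevalence(rows, taxon):
        # Fraction of rows in which the taxon was detected at all.
        return sum(1 for r in rows if r.get(taxon, 0) > 0) / len(rows)

    taxa = set().union(*sample_counts, *control_counts)
    return sorted(
        t for t in taxa
        if prevalence(control_counts, t) > prevalence(sample_counts, t)
    )

# Invented per-sample taxon count tables.
samples = [{"Lactobacillus": 120, "Ralstonia": 0},
           {"Lactobacillus": 90, "Ralstonia": 2}]
blanks = [{"Ralstonia": 40}, {"Ralstonia": 33, "Lactobacillus": 0}]
flags = flag_prevalence_contaminants(samples, blanks)  # ['Ralstonia']
```

Here Ralstonia is detected in every blank but only half the samples, so it is flagged, while Lactobacillus, absent from the blanks, is not.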
Affiliation(s)
- Shirin Moossavi
- Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB, Canada.
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada.
- Developmental Origins of Chronic Diseases in Children Network (DEVOTION), Winnipeg, MB, Canada.
- Digestive Oncology Research Center, Digestive Disease Research Institute, Tehran University of Medical Sciences, Tehran, Iran.
- Department of Physiology and Pharmacology & Mechanical and Manufacturing Engineering, University of Calgary, Calgary, AB, Canada.
- Kelsey Fehr
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada
- Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada
- Ehsan Khafipour
- Department of Animal Science, University of Manitoba, Winnipeg, MB, Canada
- Microbiome Research and Technical Support, Cargill Animal Nutrition, Diamond V brand, Cedar Rapids, USA
- Meghan B Azad
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada.
- Developmental Origins of Chronic Diseases in Children Network (DEVOTION), Winnipeg, MB, Canada.
- Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada.
16
Lee H, Shuaibi A, Bell JM, Pavlichin DS, Ji HP. Unique k-mer sequences for validating cancer-related substitution, insertion and deletion mutations. NAR Cancer 2020; 2:zcaa034. [PMID: 33345188 PMCID: PMC7727745 DOI: 10.1093/narcan/zcaa034] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/23/2020] [Revised: 10/23/2020] [Accepted: 11/12/2020] [Indexed: 12/26/2022] Open
Abstract
Cancer genome sequencing has led to important discoveries such as the identification of cancer genes. However, challenges remain in the analysis of cancer genome sequencing. One significant issue is that mutations identified by multiple variant callers are frequently discordant even when using the same genome sequencing data. For insertion and deletion mutations, oftentimes there is no agreement among different callers. Identifying somatic mutations involves read mapping and variant calling, a complicated process that uses many parameters and model tuning. To validate the identification of true mutations, we developed a method using k-mer sequences. First, we characterized the landscape of unique versus non-unique k-mers in the human genome. Second, we developed a software package, KmerVC, to validate the given somatic mutations from sequencing data. Our program validates the occurrence of a mutation based on statistically significant difference in frequency of k-mers with and without a mutation from matched normal and tumor sequences. Third, we tested our method on both simulated and cancer genome sequencing data. Counting k-mers involving mutations effectively validated true positive mutations including insertions and deletions across different individual samples in a reproducible manner. Thus, we demonstrated a straightforward approach for rapidly validating mutations from cancer genome sequencing data.
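KmerVC's statistical test is not reproduced here, but the underlying idea, counting k-mers that can only arise from the alternate allele, can be sketched in Python under toy assumptions (a tiny invented reference, short invented reads, and no significance testing):

```python
def kmers_spanning(seq, pos, k):
    """All k-mers of seq that cover position pos (0-based)."""
    start = max(0, pos - k + 1)
    end = min(len(seq) - k, pos)
    return {seq[i:i + k] for i in range(start, end + 1)}

def reads_supporting_substitution(reads, ref, pos, alt, k=5):
    """Count reads containing a k-mer unique to the alternate allele.

    Toy version of k-mer-based validation: build the k-mers spanning
    the substitution in the mutated reference, discard any that also
    occur around the reference base, then count reads carrying one of
    the remaining alt-specific k-mers.
    """
    mutated = ref[:pos] + alt + ref[pos + 1:]
    alt_only = kmers_spanning(mutated, pos, k) - kmers_spanning(ref, pos, k)
    return sum(1 for read in reads if any(km in read for km in alt_only))

ref = "ACGTACGTACGT"                     # toy reference sequence
tumor = ["ACGTAAGTAC", "TACGTACGTA"]     # first read carries the C>A change
normal = ["ACGTACGTAC", "TACGTACGTA"]
tumor_support = reads_supporting_substitution(tumor, ref, 5, "A")    # 1
normal_support = reads_supporting_substitution(normal, ref, 5, "A")  # 0
```

A real validation would compare such counts between matched tumor and normal samples and require the difference to be statistically significant.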
Affiliation(s)
- HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Ahmed Shuaibi
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- John M Bell
- Stanford Genome Technology Center, Stanford University, Palo Alto, CA 94304, USA
- Dmitri S Pavlichin
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
17
Abstract
Genomics is both a data- and compute-intensive discipline. The success of genomics depends on an adequate informatics infrastructure that can address growing data demands and enable a diverse range of resource-intensive computational activities. Designing a suitable infrastructure is a challenging task, and its success largely depends on its adoption by users. In this article, we take a user-centric view of genomics, where users are bioinformaticians, computational biologists, and data scientists. We try to take their point of view on how traditional computational activities for genomics are expanding due to data growth, as well as the introduction of big data and cloud technologies. The changing landscape of computational activities and new user requirements will influence the design of future genomics infrastructures.
Affiliation(s)
- Ritesh Krishna
- IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK
- Vadim Elisseev
- IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK
18
Kanzi AM, San JE, Chimukangara B, Wilkinson E, Fish M, Ramsuran V, de Oliveira T. Next Generation Sequencing and Bioinformatics Analysis of Family Genetic Inheritance. Front Genet 2020; 11:544162. [PMID: 33193618 PMCID: PMC7649788 DOI: 10.3389/fgene.2020.544162] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 03/20/2020] [Accepted: 09/21/2020] [Indexed: 12/29/2022] Open
Abstract
Mendelian and complex genetic trait diseases continue to burden and affect society both socially and economically. The lack of effective tests has hampered diagnosis; thus, the affected lack a proper prognosis. Mendelian diseases are caused by genetic mutations in a single gene, while complex trait diseases are caused by the accumulation of mutations in either linked or unlinked genomic regions. Significant advances have been made in identifying novel disease-associated mutations, especially with the introduction of next generation and third generation sequencing. Regardless, some diseases are still without diagnosis, as most tests rely on SNP genotyping panels developed from population-based genetic analyses. Analysis of family genetic inheritance using whole genomes, whole exomes or a panel of genes has been shown to be effective in identifying disease-causing mutations. In this review, we discuss next generation and third generation sequencing platforms, bioinformatic tools and genetic resources commonly used to analyze family-based genomic data with a focus on identifying inherited or novel disease-causing mutations. Additionally, we also highlight the analytical, ethical and regulatory challenges associated with analyzing personal genomes, which constitute the data used for family genetic inheritance.
Affiliation(s)
- Aquillah M. Kanzi
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
19
Perez-Riverol Y, Moreno P. Scalable Data Analysis in Proteomics and Metabolomics Using BioContainers and Workflows Engines. Proteomics 2020; 20:e1900147. [PMID: 31657527 PMCID: PMC7613303 DOI: 10.1002/pmic.201900147] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/12/2019] [Revised: 09/30/2019] [Indexed: 12/29/2022]
Abstract
The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming increasingly complex and convoluted, involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics during recent years, and this trend is likely to continue. However, most of the computational proteomics and metabolomics tools are designed as single-tiered software applications in which the analytics tasks cannot be distributed, limiting the scalability and reproducibility of the data analysis. In this paper, the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis, are summarized. The combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis is discussed. Finally, a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments, Galaxy and Nextflow, is introduced to the proteomics and metabolomics communities.
Affiliation(s)
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Pablo Moreno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
20
Schaduangrat N, Lampa S, Simeon S, Gleeson MP, Spjuth O, Nantasenamat C. Towards reproducible computational drug discovery. J Cheminform 2020; 12:9. [PMID: 33430992 PMCID: PMC6988305 DOI: 10.1186/s13321-020-0408-x] [Citation(s) in RCA: 85] [Impact Index Per Article: 21.3] [Received: 07/17/2019] [Accepted: 01/02/2020] [Indexed: 12/11/2022] Open
Abstract
The reproducibility of experiments has been a long-standing impediment to further scientific progress. Computational methods have been instrumental in drug discovery efforts owing to their multifaceted utilization for data collection, pre-processing, analysis and inference. This article provides an in-depth coverage of the reproducibility of computational drug discovery. This review explores the following topics: (1) the current state-of-the-art on reproducible research, (2) research documentation (e.g. electronic laboratory notebook, Jupyter notebook, etc.), (3) the science of reproducible research (i.e. comparison and contrast with related concepts such as replicability, reusability and reliability), (4) model development in computational drug discovery, (5) computational issues on model development and deployment, and (6) use case scenarios for streamlining the computational drug discovery protocol. In computational disciplines, it has become common practice to share data and programming code used for numerical calculations, not only to facilitate reproducibility, but also to foster collaborations (i.e. to drive the project further by introducing new ideas, growing the data, augmenting the code, etc.). It is therefore inevitable that the field of computational drug design would adopt an open approach towards the collection, curation and sharing of data/code.
Affiliation(s)
- Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, 10700, Bangkok, Thailand
- Samuel Lampa
- Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Saw Simeon
- Interdisciplinary Graduate Program in Bioscience, Faculty of Science, Kasetsart University, 10900, Bangkok, Thailand
- Matthew Paul Gleeson
- Department of Biomedical Engineering, Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, 10520, Bangkok, Thailand
- Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, 10700, Bangkok, Thailand
21
Ulfenborg B. Vertical and horizontal integration of multi-omics data with miodin. BMC Bioinformatics 2019; 20:649. [PMID: 31823712 PMCID: PMC6902525 DOI: 10.1186/s12859-019-3224-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 03/14/2019] [Accepted: 11/14/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Studies on multiple modalities of omics data such as transcriptomics, genomics and proteomics are growing in popularity, since they allow us to investigate complex mechanisms across molecular layers. It is widely recognized that integrative omics analysis holds the promise to unlock novel and actionable biological insights into health and disease. Integration of multi-omics data remains challenging, however, and requires combination of several software tools and extensive technical expertise to account for the properties of heterogeneous data. RESULTS This paper presents the miodin R package, which provides a streamlined workflow-based syntax for multi-omics data analysis. The package allows users to perform analysis of omics data either across experiments on the same samples (vertical integration), or across studies on the same variables (horizontal integration). Workflows have been designed to promote transparent data analysis and reduce the technical expertise required to perform low-level data import and processing. CONCLUSIONS The miodin package is implemented in R and is freely available for use and extension under the GPL-3 license. Package source, reference documentation and user manual are available at https://gitlab.com/algoromics/miodin.
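The vertical/horizontal distinction can be made concrete with a small sketch. The Python below is not miodin's API (the package itself is implemented in R), and the sample IDs and features are invented: vertical integration joins omics layers measured on the same samples, while horizontal integration stacks studies that measured the same variables.

```python
def integrate_vertical(layers):
    """Merge omics layers measured on the same samples (vertical).

    layers: {layer_name: {sample_id: {feature: value}}}.
    Keeps only samples present in every layer and prefixes feature
    names with their layer of origin.
    """
    shared = set.intersection(*(set(d) for d in layers.values()))
    return {
        s: {f"{layer}:{feat}": val
            for layer, d in layers.items()
            for feat, val in d[s].items()}
        for s in sorted(shared)
    }

def integrate_horizontal(studies, variables):
    """Stack studies that measured the same variables (horizontal)."""
    return {
        sample: {v: feats[v] for v in variables}
        for study in studies
        for sample, feats in study.items()
    }

rna = {"s1": {"TP53": 5.1}, "s2": {"TP53": 4.8}}
protein = {"s1": {"TP53": 0.9}}       # sample s2 missing from this layer
vertical = integrate_vertical({"rna": rna, "protein": protein})
horizontal = integrate_horizontal(
    [{"a1": {"TP53": 2.0}}, {"b1": {"TP53": 3.5}}], ["TP53"]
)
```

Note how vertical integration silently drops s2 because it lacks proteomics data; handling such missing layers explicitly is exactly the kind of low-level bookkeeping a workflow package abstracts away.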
22
Wercelens P, da Silva W, Hondo F, Castro K, Walter ME, Araújo A, Lifschitz S, Holanda M. Bioinformatics Workflows With NoSQL Database in Cloud Computing. Evol Bioinform Online 2019; 15:1176934319889974. [PMID: 31839702 PMCID: PMC6896126 DOI: 10.1177/1176934319889974] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/27/2019] [Accepted: 10/29/2019] [Indexed: 12/29/2022] Open
Abstract
Scientific workflows can be understood as arrangements of managed activities executed by different processing entities. Applying workflows to solve problems in Molecular Biology, notably those related to sequence analyses, is a regular Bioinformatics approach. Due to the nature of the raw data and the in silico environment of Molecular Biology experiments, apart from the research subject, 2 practical and closely related problems have been studied: reproducibility and computational environment. When aiming to enhance the reproducibility of Bioinformatics experiments, various aspects should be considered. The reproducibility requirements comprise the data provenance, which enables the acquisition of knowledge about the trajectory of data over a defined workflow, the settings of the programs, and the entire computational environment. Cloud computing is a booming alternative that can provide this computational environment, hiding technical details, and delivering a more affordable, accessible, and configurable on-demand environment for researchers. Considering this specific scenario, we proposed a solution to improve the reproducibility of Bioinformatics workflows in a cloud computing environment using both Infrastructure as a Service (IaaS) and Not only SQL (NoSQL) database systems. To meet the goal, we have built 3 typical Bioinformatics workflows and ran them on 1 private and 2 public clouds, using different types of NoSQL database systems to persist the provenance data according to the Provenance Data Model (PROV-DM). We present here the results and a guide for the deployment of a cloud environment for Bioinformatics exploring the characteristics of various NoSQL database systems to persist provenance data.
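PROV-DM is serialization-agnostic, so a workflow step's provenance can be persisted as a plain JSON document in a NoSQL store. The Python sketch below is a loose illustration: the keys mirror core PROV relations (used, wasGeneratedBy, wasAssociatedWith) but this is not the standard PROV-JSON serialization, and the activity, agent, and file names are hypothetical.

```python
import json

def prov_record(activity, agent, inputs, outputs):
    """Build a PROV-DM-style provenance record for one workflow step.

    Illustrative document-store payload: keys loosely mirror the core
    PROV relations, not a standards-compliant PROV-JSON document.
    """
    return {
        "activity": activity,
        "wasAssociatedWith": agent,
        "used": [{"entity": e} for e in inputs],
        "wasGeneratedBy": [{"entity": e, "activity": activity}
                           for e in outputs],
    }

record = prov_record(
    "align-reads",        # hypothetical workflow activity
    "aligner-1.0",        # hypothetical software agent
    ["sample1.fastq"],
    ["sample1.bam"],
)
payload = json.dumps(record)  # what a document database would persist
```

Chaining such records (each step's outputs becoming the next step's inputs) is what lets the data's trajectory over a defined workflow be reconstructed later.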
Affiliation(s)
- Polyane Wercelens
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Waldeyr da Silva
- Department of Computer Science, University of Brasília, Brasília, Brazil
- NEPBIO (Group of Biological Studies and Research on Cerrado), Federal Institute of Goiás (IFG), Formosa, Goiás, Brazil
- Fernanda Hondo
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Klayton Castro
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Aletéia Araújo
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Sergio Lifschitz
- Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro, Brazil
- Maristela Holanda
- Department of Computer Science, University of Brasília, Brasília, Brazil
23
Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. Gigascience 2019; 8:giz095. [PMID: 31675414 PMCID: PMC6824458 DOI: 10.1093/gigascience/giz095] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 12/04/2018] [Revised: 05/23/2019] [Accepted: 07/17/2019] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. RESULTS Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. CONCLUSIONS The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.
Affiliation(s)
- Farah Zaib Khan
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Common Workflow Language Project
- Richard O Sinnott
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Andrew Lonie
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
24
Montemayor C, Brunker PAR, Keller MA. Banking with precision: transfusion medicine as a potential universal application in clinical genomics. Curr Opin Hematol 2019; 26:480-487. [PMID: 31490317 PMCID: PMC7302862 DOI: 10.1097/moh.0000000000000536] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/19/2022]
Abstract
PURPOSE OF REVIEW To summarize the most recent scientific progress in transfusion medicine genomics and discuss its role within the broad genomic precision medicine model, with a focus on the unique computational and bioinformatic aspects of this emergent field. RECENT FINDINGS Recent publications continue to validate the feasibility of using next-generation sequencing (NGS) for blood group prediction with three distinct approaches: exome sequencing, whole genome sequencing, and PCR-based targeted NGS methods. The reported correlation of NGS with serologic and alternative genotyping methods ranges from 92 to 99%. NGS has demonstrated improved detection of weak antigens, structural changes, copy number variations, novel genomic variants, and microchimerism. Addition of a transfusion medicine interpretation to any clinically sequenced genome is proposed as a strategy to enhance the cost-effectiveness of precision genomic medicine. Interpretation of NGS in the blood group antigen context requires not only advanced immunohematology knowledge, but also specialized software and hardware resources, and a bioinformatics-trained workforce. SUMMARY Blood transfusions are a common inpatient procedure, making blood group genomics a promising facet of precision medicine research. Further efforts are needed to embrace transfusion bioinformatic challenges and evaluate its clinical utility.
Affiliation(s)
- Celina Montemayor
- Department of Transfusion Medicine, National Institutes of Health Clinical Center, Bethesda, MD
- Patricia A. R. Brunker
- Division of Transfusion Medicine, Department of Pathology, The Johns Hopkins Hospital, Baltimore, MD
- American Red Cross, Greater Chesapeake and Potomac Region, Baltimore, MD
25
Review of Issues and Solutions to Data Analysis Reproducibility and Data Quality in Clinical Proteomics. Methods Mol Biol 2019. [PMID: 31552637 DOI: 10.1007/978-1-4939-9744-2_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2023]
Abstract
In any analytical discipline, data analysis reproducibility is closely interlinked with data quality. In this book chapter focused on mass spectrometry-based proteomics approaches, we introduce how both data analysis reproducibility and data quality can influence each other and how data quality and data analysis designs can be used to increase robustness and improve reproducibility. We first introduce methods and concepts to design and maintain robust data analysis pipelines such that reproducibility can be increased in parallel. The technical aspects related to data analysis reproducibility are challenging, and current ways to increase the overall robustness are multifaceted. Software containerization and cloud infrastructures play an important part. We will also show how quality control (QC) and quality assessment (QA) approaches can be used to spot analytical issues, reduce the experimental variability, and increase confidence in the analytical results of (clinical) proteomics studies, since experimental variability plays a substantial role in analysis reproducibility. Therefore, we give an overview on existing solutions for QC/QA, including different quality metrics, and methods for longitudinal monitoring. The efficient use of both types of approaches undoubtedly provides a way to improve the experimental reliability, reproducibility, and level of consistency in proteomics analytical measurements.
Collapse
|
26
|
Silliman K. Population structure, genetic connectivity, and adaptation in the Olympia oyster ( Ostrea lurida) along the west coast of North America. Evol Appl 2019; 12:923-939. [PMID: 31080505 PMCID: PMC6503834 DOI: 10.1111/eva.12766] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2018] [Revised: 11/28/2018] [Accepted: 12/02/2018] [Indexed: 01/02/2023] Open
Abstract
Effective management of threatened and exploited species requires an understanding of both the genetic connectivity among populations and local adaptation. The Olympia oyster (Ostrea lurida), patchily distributed from Baja California to the central coast of Canada, has a long history of population declines due to anthropogenic stressors. For such coastal marine species, population structure could follow a continuous isolation-by-distance model, contain regional blocks of genetic similarity separated by barriers to gene flow, or be consistent with a null model of no population structure. To distinguish between these hypotheses in O. lurida, 13,424 single nucleotide polymorphisms (SNPs) were used to characterize rangewide population structure, genetic connectivity, and adaptive divergence. Samples were collected across the species range on the west coast of North America, from southern California to Vancouver Island. A conservative approach for detecting putative loci under selection identified 235 SNPs across 129 GBS loci, which were functionally annotated and analyzed separately from the remaining neutral loci. While strong population structure was observed on a regional scale in both neutral and outlier markers, neutral markers had greater power to detect fine-scale structure. Geographic regions of reduced gene flow aligned with known marine biogeographic barriers, such as Cape Mendocino, Monterey Bay, and the currents around Cape Flattery. The outlier loci identified as under putative selection included genes involved in developmental regulation, sensory information processing, energy metabolism, immune response, and muscle contraction. These loci are excellent candidates for future research and may provide targets for genetic monitoring programs. 
Beyond specific applications for restoration and management of the Olympia oyster, this study adds to the growing body of evidence for both population structure and adaptive differentiation across a range of marine species with the potential for panmixia. Computational notebooks are available to facilitate reproducibility and future open-sourced research on the population structure of O. lurida.
Collapse
|
27
|
Juanillas V, Dereeper A, Beaume N, Droc G, Dizon J, Mendoza JR, Perdon JP, Mansueto L, Triplett L, Lang J, Zhou G, Ratharanjan K, Plale B, Haga J, Leach JE, Ruiz M, Thomson M, Alexandrov N, Larmande P, Kretzschmar T, Mauleon RP. Rice Galaxy: an open resource for plant science. Gigascience 2019; 8:giz028. [PMID: 31107941 PMCID: PMC6527052 DOI: 10.1093/gigascience/giz028] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2018] [Revised: 08/29/2018] [Accepted: 02/12/2019] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND Rice molecular genetics, breeding, genetic diversity, and allied research (such as rice-pathogen interaction) have adopted sequencing technologies and high-density genotyping platforms for genome variation analysis and gene discovery. Germplasm collections representing rice diversity, improved varieties, and elite breeding materials are accessible through rice gene banks for use in research and breeding, with many having genome sequences and high-density genotype data available. Combining phenotypic and genotypic information on these accessions enables genome-wide association analysis, which is driving quantitative trait loci discovery and molecular marker development. Comparative sequence analyses across quantitative trait loci regions facilitate the discovery of novel alleles. Analyses involving DNA sequences and large genotyping matrices for thousands of samples, however, pose a challenge to non-computer savvy rice researchers. FINDINGS The Rice Galaxy resource has shared datasets that include high-density genotypes from the 3,000 Rice Genomes project and sequences with corresponding annotations from 9 published rice genomes. The Rice Galaxy web server and deployment installer includes tools for designing single-nucleotide polymorphism assays, analyzing genome-wide association studies, population diversity, rice-bacterial pathogen diagnostics, and a suite of published genomic prediction methods. A prototype Rice Galaxy compliant to Open Access, Open Data, and Findable, Accessible, Interoperable, and Reproducible principles is also presented. CONCLUSIONS Rice Galaxy is a freely available resource that empowers the plant research community to perform state-of-the-art analyses and utilize publicly available big datasets for both fundamental and applied science.
Collapse
Affiliation(s)
- Venice Juanillas
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Alexis Dereeper
- Institut de recherche pour le développement (IRD), University of Montpellier, DIADE, IPME, Montpellier, France
| | - Nicolas Beaume
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Gaetan Droc
- CIRAD, UMR AGAP, F-34398 Montpellier, France
| | - Joshua Dizon
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - John Robert Mendoza
- Advanced Science and Technology Institute, Department of Science and Technology, Quezon City, Philippines
| | - Jon Peter Perdon
- Advanced Science and Technology Institute, Department of Science and Technology, Quezon City, Philippines
| | - Locedie Mansueto
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Lindsay Triplett
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO 80523-1177, USA
| | - Jillian Lang
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO 80523-1177, USA
| | - Gabriel Zhou
- Indiana University, 107 S Indiana Ave, Bloomington, IN 47405, USA
| | | | - Beth Plale
- Indiana University, 107 S Indiana Ave, Bloomington, IN 47405, USA
| | - Jason Haga
- National Institute of Advanced Industrial Science and Technology, AIST Tsukuba Central 1,1-1-1 Umezono, Tsukuba, Ibaraki 305-8560, Japan
| | - Jan E Leach
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO 80523-1177, USA
| | - Manuel Ruiz
- CIRAD, UMR AGAP, F-34398 Montpellier, France
| | - Michael Thomson
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
- Department of Soil and Crop Sciences, Texas A&M University, Houston, TX, USA
| | - Nickolai Alexandrov
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Pierre Larmande
- Institut de recherche pour le développement (IRD), University of Montpellier, DIADE, IPME, Montpellier, France
| | - Tobias Kretzschmar
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
- Southern Cross Plant Science, Southern Cross University, Lismore, Australia
| | - Ramil P Mauleon
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
- Southern Cross Plant Science, Southern Cross University, Lismore, Australia
| |
Collapse
|
28
|
Wang G, Peng B. Script of Scripts: A pragmatic workflow system for daily computational research. PLoS Comput Biol 2019; 15:e1006843. [PMID: 30811390 PMCID: PMC6411228 DOI: 10.1371/journal.pcbi.1006843] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Revised: 03/11/2019] [Accepted: 01/29/2019] [Indexed: 01/22/2023] Open
Abstract
Computationally intensive disciplines such as computational biology often require use of a variety of tools implemented in different scripting languages and analysis of large data sets using high-performance computing systems. Although scientific workflow systems can powerfully organize and execute large-scale data-analysis processes, creating and maintaining such workflows usually comes with nontrivial learning curves and engineering overhead, making them cumbersome to use for everyday data exploration and prototyping. To bridge the gap between interactive analysis and workflow systems, we developed Script of Scripts (SoS), an interactive data-analysis platform and workflow system with a strong emphasis on readability, practicality, and reproducibility in daily computational research. For exploratory analysis, SoS has a multilanguage scripting format that centralizes otherwise-scattered scripts and creates dynamic reports for publication and sharing. As a workflow engine, SoS provides an intuitive syntax for creating workflows in process-oriented, outcome-oriented, and mixed styles, as well as a unified interface for executing and managing tasks on a variety of computing platforms with automatic synchronization of files among isolated file systems. As illustrated herein by real-world examples, SoS is both an interactive analysis tool and pipeline platform suitable for different stages of method development and data-analysis projects. In particular, SoS can be easily adopted in existing data analysis routines to substantially improve organization, readability, and cross-platform computation management of research projects.
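SoS workflows are written in SoS's own multilanguage notation (see the SoS documentation for the actual syntax). As a language-neutral illustration of the outcome-oriented style the abstract mentions, here is a toy make-like runner in Python: each target declares its inputs and an action, and a step runs only when its target does not yet exist. All step and file names are invented:

```python
def outcome_run(steps, fs):
    """Outcome-oriented execution over a toy in-memory 'filesystem'
    (a dict mapping path -> content): run a step only when its target
    is missing, recursively satisfying dependencies first."""
    executed = []

    def need(target):
        if target in fs:
            return                      # outcome already satisfied
        inputs, action = steps[target]
        for dep in inputs:              # build dependencies first
            need(dep)
        fs[target] = action(fs)
        executed.append(target)

    for target in steps:
        need(target)
    return executed

# Hypothetical two-step pipeline: raw -> clean -> report.
steps = {
    "clean.txt": (["raw.txt"], lambda fs: fs["raw.txt"].strip()),
    "report.txt": (["clean.txt"], lambda fs: "n=%d" % len(fs["clean.txt"])),
}
fs = {"raw.txt": "  hello  "}
print(outcome_run(fs=fs, steps=steps))  # both steps run the first time
print(outcome_run(fs=fs, steps=steps))  # nothing to do on a re-run
```

The process-oriented style, by contrast, simply executes steps in declared order; SoS lets the two styles be mixed within one workflow.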
Collapse
Affiliation(s)
- Gao Wang
- Department of Human Genetics, The University of Chicago, Chicago, IL, United States of America
| | - Bo Peng
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States of America
- * E-mail:
| |
Collapse
|
29
|
Das S, Lecours Boucher X, Rogers C, Makowski C, Chouinard-Decorte F, Oros Klein K, Beck N, Rioux P, Brown ST, Mohaddes Z, Zweber C, Foing V, Forest M, O'Donnell KJ, Clark J, Meaney MJ, Greenwood CMT, Evans AC. Integration of "omics" Data and Phenotypic Data Within a Unified Extensible Multimodal Framework. Front Neuroinform 2018; 12:91. [PMID: 30631270 PMCID: PMC6315165 DOI: 10.3389/fninf.2018.00091] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2018] [Accepted: 11/16/2018] [Indexed: 12/11/2022] Open
Abstract
Analysis of “omics” data is often a long and segmented process, encompassing multiple stages from initial data collection to processing, quality control and visualization. The cross-modal nature of recent genomic analyses renders this process challenging to both automate and standardize; consequently, users often resort to manual interventions that compromise data reliability and reproducibility. This in turn can produce multiple versions of datasets across storage systems. As a result, scientists can lose significant time and resources trying to execute and monitor their analytical workflows and encounter difficulties sharing versioned data. In 2015, the Ludmer Centre for Neuroinformatics and Mental Health at McGill University brought together expertise from the Douglas Mental Health University Institute, the Lady Davis Institute and the Montreal Neurological Institute (MNI) to form a genetics/epigenetics working group. The objectives of this working group are to: (i) design an automated and seamless process for (epi)genetic data that consolidates heterogeneous datasets into the LORIS open-source data platform; (ii) streamline data analysis; (iii) integrate results with provenance information; and (iv) facilitate structured and versioned sharing of pipelines for optimized reproducibility using high-performance computing (HPC) environments via the CBRAIN processing portal. 
This article outlines the resulting generalizable “omics” framework and its benefits, specifically, the ability to: (i) integrate multiple types of biological and multi-modal datasets (imaging, clinical, demographics and behavioral); (ii) automate the process of launching analysis pipelines on HPC platforms; (iii) remove the bioinformatic barriers that are inherent to this process; (iv) ensure standardization and transparent sharing of processing pipelines to improve computational consistency; (v) store results in a queryable web interface; (vi) offer visualization tools to better view the data; and (vii) provide the mechanisms to ensure usability and reproducibility. This framework for workflows facilitates brain research discovery by reducing human error through automation of analysis pipelines and seamless linking of multimodal data, allowing investigators to focus on research instead of data handling.
Collapse
Affiliation(s)
- Samir Das
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Xavier Lecours Boucher
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Christine Rogers
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Carolina Makowski
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada.,Douglas Hospital Research Centre, McGill University, Montreal, QC, Canada
| | - François Chouinard-Decorte
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Kathleen Oros Klein
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada.,Lady Davis Institute, Jewish General Hospital, McGill University, Montreal, QC, Canada
| | - Natacha Beck
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Pierre Rioux
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Shawn T Brown
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Zia Mohaddes
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Cole Zweber
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Victoria Foing
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Marie Forest
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada.,Lady Davis Institute, Jewish General Hospital, McGill University, Montreal, QC, Canada
| | - Kieran J O'Donnell
- Douglas Hospital Research Centre, McGill University, Montreal, QC, Canada.,Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada
| | - Joanne Clark
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada
| | - Michael J Meaney
- Douglas Hospital Research Centre, McGill University, Montreal, QC, Canada.,Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada
| | - Celia M T Greenwood
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada.,Lady Davis Institute, Jewish General Hospital, McGill University, Montreal, QC, Canada
| | - Alan C Evans
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| |
Collapse
|
30
|
Goodstadt MN, Marti-Renom MA. Communicating Genome Architecture: Biovisualization of the Genome, from Data Analysis and Hypothesis Generation to Communication and Learning. J Mol Biol 2018; 431:1071-1087. [PMID: 30419242 DOI: 10.1016/j.jmb.2018.11.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2018] [Revised: 10/29/2018] [Accepted: 11/01/2018] [Indexed: 01/07/2023]
Abstract
Genome discoveries at the core of biology are made by visual description and exploration of the cell, from microscopic sketches and biochemical mapping to computational analysis and spatial modeling. We outline the recently developed experimental and visualization techniques that capture the three-dimensional interactions regulating how genes are expressed. We detail the challenges faced in integrating these data to portray the components, their organization, and their dynamic landscape. The goal is more than a single data-driven representation: interactive visualization for de novo research is paramount for deciphering insights into genome organization in space.
Collapse
Affiliation(s)
- Mike N Goodstadt
- CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Baldiri Reixac 4, Barcelona 08028, Spain; Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain.
| | - Marc A Marti-Renom
- CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Baldiri Reixac 4, Barcelona 08028, Spain; Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Pg. Lluis Companys 23, Barcelona 08010, Spain.
| |
Collapse
|
31
|
Kulkarni N, Alessandrì L, Panero R, Arigoni M, Olivero M, Ferrero G, Cordero F, Beccuti M, Calogero RA. Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines. BMC Bioinformatics 2018; 19:349. [PMID: 30367595 PMCID: PMC6191970 DOI: 10.1186/s12859-018-2296-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open
Abstract
Background Reproducibility is a key element of modern science and is mandatory for any industrial application. It denotes the ability to replicate an experiment independently of location and operator; a study can therefore be considered reproducible only if all of the data used are available and the computational analysis workflow is clearly described. For a complex bioinformatics analysis, however, the raw data and the list of tools used in the workflow may not be enough to guarantee reproducible results: different releases of the same tools, or of the system libraries they depend on, can lead to subtle reproducibility issues. Results To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit, open-source project whose aim is to provide a schema and an infrastructure, based on Docker images and R packages, for reproducible results in bioinformatics. One or more Docker images are defined for each workflow (typically one per task), while the workflow implementation is handled via R functions embedded in a package available in a GitHub repository. A bioinformatician joining the project first integrates their workflow modules into Docker image(s), building on an Ubuntu Docker image developed by RBP to ease this task. Second, the workflow implementation must be realized in R following an R skeleton function provided by RBP, guaranteeing homogeneity and reusability across RBP functions. The contributor also provides an R vignette explaining the package functionality, together with an example dataset that users can run to build confidence in the workflow. Conclusions The Reproducible Bioinformatics Project provides a general schema and infrastructure for distributing robust and reproducible workflows, guaranteeing that end users can consistently repeat any analysis regardless of the UNIX-like architecture used.
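RBP's actual wrappers are R functions that launch Docker containers; as a hypothetical Python analogue of the same pattern, the sketch below only assembles the `docker run` invocation for one containerized task (the image name and tool arguments are invented, and the command is constructed rather than executed):

```python
def docker_command(image, workdir, tool_args):
    """Build the `docker run` invocation for one containerized
    workflow task: remove the container on exit (--rm), mount the
    analysis directory at /data, and pass tool arguments through.
    Executing it would require e.g. subprocess.run(cmd, check=True)."""
    return (["docker", "run", "--rm",
             "-v", f"{workdir}:/data", "-w", "/data", image]
            + list(tool_args))

# Hypothetical image and tool call, for illustration only.
cmd = docker_command("rbp/ubuntu-star:1.0", "/tmp/run1",
                     ["STAR", "--runThreadN", "4"])
print(" ".join(cmd))
```

Pinning the image tag (here `:1.0`) is what fixes the tool and system-library versions, which is precisely the class of "subtle reproducibility issues" the abstract describes.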
Collapse
Affiliation(s)
- Neha Kulkarni
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Luca Alessandrì
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Riccardo Panero
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Maddalena Arigoni
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Martina Olivero
- Department of Oncology, University of Torino, Candiolo, Italy
| | - Giulio Ferrero
- Department of Computer Sciences, University of Torino, Torino, Italy
| | - Francesca Cordero
- Department of Computer Sciences, University of Torino, Torino, Italy.
| | - Marco Beccuti
- Department of Computer Sciences, University of Torino, Torino, Italy
| | - Raffaele A Calogero
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy.
| |
Collapse
|
32
|
Mondelli ML, Magalhães T, Loss G, Wilde M, Foster I, Mattoso M, Katz D, Barbosa H, de Vasconcelos ATR, Ocaña K, Gadelha LMR. BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments. PeerJ 2018; 6:e5551. [PMID: 30186700 PMCID: PMC6119457 DOI: 10.7717/peerj.5551] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2018] [Accepted: 08/07/2018] [Indexed: 11/20/2022] Open
Abstract
Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, using queries to a provenance and annotation database, some of which are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
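The idea of abstracting queries over a provenance database can be sketched with an in-memory SQLite store; the schema and the values below are invented for illustration (BioWorkbench's actual provenance model is far richer), only the workflow names come from the abstract:

```python
import sqlite3

# Toy provenance store: one row per task execution.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE task_exec (workflow TEXT, task TEXT, seconds REAL)")
con.executemany("INSERT INTO task_exec VALUES (?, ?, ?)", [
    ("SwiftPhylo", "align", 120.0),
    ("SwiftPhylo", "tree", 300.0),
    ("SwiftGECKO", "compare", 45.0),
])

# Example pre-built provenance query: total runtime per workflow,
# slowest workflow first.
rows = con.execute("""
    SELECT workflow, SUM(seconds) FROM task_exec
    GROUP BY workflow ORDER BY SUM(seconds) DESC
""").fetchall()
print(rows)   # [('SwiftPhylo', 420.0), ('SwiftGECKO', 45.0)]
```

A web front end like the one described would wrap queries of this kind behind named, parameterized reports.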
Collapse
Affiliation(s)
- Maria Luiza Mondelli
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Thiago Magalhães
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Guilherme Loss
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Michael Wilde
- Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
| | - Ian Foster
- Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
| | - Marta Mattoso
- Computer and Systems Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
| | - Daniel Katz
- National Center for Supercomputing Applications, University of Illinois, Urbana, IL, USA
| | - Helio Barbosa
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil.,Federal University of Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
| | | | - Kary Ocaña
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Luiz M R Gadelha
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| |
Collapse
|
33
|
Gruenstaeudl M, Gerschler N, Borsch T. Bioinformatic Workflows for Generating Complete Plastid Genome Sequences-An Example from Cabomba (Cabombaceae) in the Context of the Phylogenomic Analysis of the Water-Lily Clade. Life (Basel) 2018; 8:E25. [PMID: 29933597 PMCID: PMC6160935 DOI: 10.3390/life8030025] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2018] [Revised: 06/11/2018] [Accepted: 06/19/2018] [Indexed: 12/13/2022] Open
Abstract
The sequencing and comparison of plastid genomes are becoming a standard method in plant genomics, and many researchers are using this approach to infer plant phylogenetic relationships. Due to the widespread availability of next-generation sequencing, plastid genome sequences are being generated at breakneck pace. This trend towards massive sequencing of plastid genomes highlights the need for standardized bioinformatic workflows. In particular, documentation and dissemination of the details of genome assembly, annotation, alignment and phylogenetic tree inference are needed, as these processes are highly sensitive to the choice of software and the precise settings used. Here, we present the procedure and results of sequencing, assembling, annotating and quality-checking of three complete plastid genomes of the aquatic plant genus Cabomba as well as subsequent gene alignment and phylogenetic tree inference. We accompany our findings by a detailed description of the bioinformatic workflow employed. Importantly, we share a total of eleven software scripts for each of these bioinformatic processes, enabling other researchers to evaluate and replicate our analyses step by step. The results of our analyses illustrate that the plastid genomes of Cabomba are highly conserved in both structure and gene content.
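Because assembly, annotation, alignment, and tree inference are, as noted above, highly sensitive to the choice of software and precise settings, a replicable workflow must record each step together with its exact parameters. A generic Python sketch of such a step log follows; the stage names and settings are illustrative stand-ins, not the authors' eleven actual scripts:

```python
import json

def run_pipeline(stages, data):
    """Run ordered pipeline stages, recording each stage's name and
    exact settings so the analysis can be replicated step by step."""
    log = []
    for name, settings, fn in stages:
        data = fn(data, **settings)
        log.append({"stage": name, "settings": settings})
    return data, json.dumps(log)

# Illustrative stand-ins for quality-trimming and read-counting steps.
stages = [
    ("trim", {"min_qual": 20},
     lambda reads, min_qual: [r for r in reads if r["qual"] >= min_qual]),
    ("count", {}, lambda reads: len(reads)),
]
reads = [{"qual": 30}, {"qual": 10}, {"qual": 25}]
result, provenance = run_pipeline(stages, reads)
print(result)        # 2 reads survive trimming
```

Sharing the `provenance` log alongside the scripts is what lets other researchers evaluate and replicate an analysis step by step.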
Collapse
Affiliation(s)
- Michael Gruenstaeudl
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
| | - Nico Gerschler
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
| | - Thomas Borsch
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
- Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, 14195 Berlin, Germany.
- Berlin Center for Genomics in Biodiversity Research (BeGenDiv), 14195 Berlin, Germany.
| |
Collapse
|
34
|
Gryk MR, Ludäscher B. Semantic Mediation to Improve Reproducibility for Biomolecular NMR Analysis. TRANSFORMING DIGITAL WORLDS : 13TH INTERNATIONAL CONFERENCE, ICONFERENCE 2018, SHEFFIELD, UK, MARCH 25-28, 2018, PROCEEDINGS. INTERNATIONAL CONFERENCE ON TRANSFORMING DIGITAL WORLDS (13TH : 2018 : SHEFFIELD, ENGLAND) 2018; 10766:620-625. [PMID: 30334020 PMCID: PMC6186436 DOI: 10.1007/978-3-319-78105-1_70] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Two barriers to computational reproducibility are recording the critical metadata required for rerunning a computation and translating the semantics of that metadata so that alternate approaches can easily be configured for verifying computational reproducibility. We are addressing this problem in the context of biomolecular NMR computational analysis by developing a series of linked ontologies that define the semantics of the various software tools used by researchers for data transformation and analysis. Building from a core ontology representing the primary observational data of NMR, the linked-data approach allows metadata to be translated in order to configure alternate software approaches for given computational tasks. In this paper, we illustrate the utility of this approach with a small sample of the core ontology as well as tool-specific semantics for two third-party software tools. This approach to semantic mediation will help support an automated approach to validating the reliability of computation in which the same processing workflow is implemented with different software tools. In addition, the detailed semantics of both the data and the processing functionalities will provide a method for classifying software tools.
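The mediation pattern (tool-specific terms mapped through shared core-ontology terms, so that one tool's metadata can configure another) can be sketched with plain dictionaries; every term and tool name below is invented for illustration and does not come from the actual NMR ontologies:

```python
# Map each tool's parameter names onto shared core-ontology terms
# (all names here are hypothetical placeholders).
TOOL_TO_CORE = {
    "toolA": {"sw_ppm": "core:spectral_width", "npts": "core:num_points"},
    "toolB": {"spec_width": "core:spectral_width", "size": "core:num_points"},
}

def translate(metadata, src, dst):
    """Re-express one tool's metadata in another tool's vocabulary
    by pivoting through the shared core terms."""
    core = {TOOL_TO_CORE[src][k]: v for k, v in metadata.items()}
    dst_inv = {v: k for k, v in TOOL_TO_CORE[dst].items()}
    return {dst_inv[term]: v for term, v in core.items()}

print(translate({"sw_ppm": 12.0, "npts": 2048}, "toolA", "toolB"))
```

In the linked-data setting the same pivot is expressed with RDF/OWL class and property mappings rather than dictionaries, but the translation step is conceptually identical.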
Collapse
Affiliation(s)
- Michael R Gryk
- University of Illinois, Urbana-Champaign, Champaign IL 61820, USA
- UCONN Health, Farmington, CT 06030, USA
| | | |
Collapse
|
35
|
Al Kawam A, Sen A, Datta A, Dickey N. Understanding the Bioinformatics Challenges of Integrating Genomics into Healthcare. IEEE J Biomed Health Inform 2017; 22:1672-1683. [PMID: 29990071 DOI: 10.1109/jbhi.2017.2778263] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Genomic data is paving the way towards personalized healthcare. By unveiling genetic disease-contributing factors, genomic data can aid in the detection, diagnosis, and treatment of a wide range of complex diseases. Integrating genomic data into healthcare is riddled with a wide range of challenges spanning social, ethical, legal, educational, economic, and technical aspects. Bioinformatics is a core integration aspect presenting an overwhelming number of unaddressed challenges. In this paper we tackle the fundamental bioinformatics integration concerns including: genomic data generation, storage, representation, and utilization in conjunction with clinical data. We divide the bioinformatics challenges into a series of seven intertwined integration aspects spanning the areas of informatics, knowledge management, and communication. For each aspect, we provide a detailed discussion of the current research directions, outstanding challenges, and possible resolutions. This paper seeks to help narrow the gap between the genomic applications, which are being predominantly utilized in research settings, and the clinical adoption of these applications.
Collapse
|