1. Zulfiqar M, Crusoe MR, König-Ries B, Steinbeck C, Peters K, Gadelha L. Implementation of FAIR Practices in Computational Metabolomics Workflows-A Case Study. Metabolites 2024; 14:118. PMID: 38393009; PMCID: PMC10891576; DOI: 10.3390/metabo14020118.
Abstract
Scientific workflows facilitate the automation of data analysis tasks by integrating various software tools executed in a particular order. To make workflows transparent and reusable, it is essential to implement the FAIR principles. Here, we describe our experiences implementing the FAIR principles for metabolomics workflows, using the Metabolome Annotation Workflow (MAW) as a case study. MAW is specified in the Common Workflow Language (CWL), allowing the workflow to be executed subsequently on different workflow engines. MAW is registered on WorkflowHub using its CWL description; during submission, the CWL description is used to package MAW with the Workflow RO-Crate profile, which includes metadata in Bioschemas. Researchers can use this narrative discussion as a guideline for adopting FAIR practices in their own bioinformatics or cheminformatics workflows, with amendments specific to their research area.
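A CWL description like the one MAW registers on WorkflowHub is, at its core, a declarative document. The sketch below builds a minimal single-step workflow in CWL's JSON syntax (CWL files are more commonly YAML, but JSON is equally valid CWL); the step, tool, and file names are hypothetical illustrations, not MAW's actual steps.

```python
import json

# Minimal CWL v1.2 workflow in JSON syntax. One step ("annotate") consumes
# the workflow input and produces the workflow output. All names here are
# hypothetical placeholders, not taken from MAW.
def make_workflow():
    return {
        "cwlVersion": "v1.2",
        "class": "Workflow",
        "inputs": {"spectra": "File"},
        "outputs": {
            "annotations": {"type": "File",
                            "outputSource": "annotate/result"},
        },
        "steps": {
            "annotate": {
                "run": "annotate_tool.cwl",   # hypothetical tool description
                "in": {"input_file": "spectra"},
                "out": ["result"],
            },
        },
    }

doc = json.dumps(make_workflow(), indent=2)
```

A description in this shape can then be executed unchanged by any CWL-aware engine, which is what makes the workflow portable across execution environments.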
Affiliation(s)
- Mahnoor Zulfiqar
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
- Michael R. Crusoe
  - ELIXIR (The European Life-Sciences Infrastructure for Biological Information) Germany, Institute of Bio- and Geosciences (IBG-5)—Computational Metagenomics, Forschungszentrum Jülich GmbH, 52428 Jülich, Germany
- Birgitta König-Ries
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Institute for Informatics, Friedrich Schiller University Jena, 07743 Jena, Germany
  - iDiv—German Centre for Integrative Biodiversity Research, Halle-Jena-Leipzig, 04103 Leipzig, Germany
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
- Kristian Peters
  - iDiv—German Centre for Integrative Biodiversity Research, Halle-Jena-Leipzig, 04103 Leipzig, Germany
  - Geobotany and Botanical Gardens, Martin-Luther University of Halle-Wittenberg, 06108 Halle, Germany
  - Leibniz Institute of Plant Biochemistry, 06120 Halle, Germany
- Luiz Gadelha
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
  - Institute for Informatics, Friedrich Schiller University Jena, 07743 Jena, Germany
  - German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
2. Parvizpour S, Beyrampour-Basmenj H, Razmara J, Farhadi F, Shamsir MS. Cancer treatment comes of age: from one-size-fits-all to next-generation sequencing (NGS) technologies. Bioimpacts 2023; 14:29957. PMID: 39104623; PMCID: PMC11298019; DOI: 10.34172/bi.2023.29957.
Abstract
Cancer is one of the leading causes of death worldwide and one of the greatest challenges in extending life expectancy. The paradigm of one-size-fits-all medicine has already given way to the stratification of patients by disease subtype, clinical characteristics, and biomarkers (stratified medicine). The introduction of next-generation sequencing (NGS) in clinical oncology has made it possible to tailor therapy to each cancer patient's molecular profile. NGS is expected to lead the transition to precision medicine (PM), in which the right therapeutic approach is chosen for each patient based on their characteristics and mutations. Here, we highlight how NGS technology facilitates cancer treatment. First, precision medicine and NGS technology are reviewed, and the NGS revolution in precision medicine is described. Next, the role of NGS in oncology and its existing limitations are discussed. The databases, bioinformatics tools, and online servers used in NGS data analysis are also reviewed. The review ends with concluding remarks.
Affiliation(s)
- Sepideh Parvizpour
  - Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran
  - Department of Medical Biotechnology, School of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Hanieh Beyrampour-Basmenj
  - Department of Medical Biotechnology, School of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Jafar Razmara
  - Department of Computer Science, Faculty of Mathematics, Statistics and Computer Science, University of Tabriz, Tabriz, Iran
- Farhad Farhadi
  - Food and Drug Administration, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohd Shahir Shamsir
  - Bioinformatics Research Group, Faculty of Science, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
3. Du X, Dastmalchi F, Diller MA, Brochhausen M, Garrett TJ, Hogan WR, Lemas DJ. An Automated Workflow Composition System for Liquid Chromatography-Mass Spectrometry Metabolomics Data Processing. J Am Soc Mass Spectrom 2023; 34:2857-2863. PMID: 37874901; DOI: 10.1021/jasms.3c00248.
Abstract
Liquid chromatography-mass spectrometry (LC-MS) metabolomics studies produce high-dimensional data that must be processed by a complex network of informatics tools to generate analysis-ready data sets. As the first computational step in metabolomics, data processing poses an increasing challenge: researchers must develop customized computational workflows applicable to LC-MS metabolomics analysis. Ontology-based automated workflow composition (AWC) systems provide a feasible approach to developing computational workflows that consume high-dimensional molecular data. We used the Automated Pipeline Explorer (APE) to create an AWC system for LC-MS metabolomics data processing across three use cases. APE predicted 145 data processing workflows across the three use cases, among which we identified six traditional and six novel workflows. Through manual review, we found that one-third of the novel workflows were executable, meaning the data processing completed without error. When selecting the top six workflows from each use case, the computationally viable rate of the predicted workflows reached 45%. Collectively, our study demonstrates the feasibility of developing an AWC system for LC-MS metabolomics data processing.
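The reported viability rate follows from a simple calculation: the top six predicted workflows from each of the three use cases were reviewed, and roughly 45% of them were computationally viable. The sketch below reproduces that arithmetic; the viable count is an assumed illustration consistent with the reported rate, not a figure from the paper.

```python
# Back-of-envelope check of the reported viability rate. The viable count
# below is assumed for illustration (8/18 is approximately 45%).
use_cases = 3
top_per_use_case = 6
reviewed = use_cases * top_per_use_case   # 18 workflows reviewed in total
viable = 8                                # assumed count, not from the paper
rate = viable / reviewed
print(f"{reviewed} reviewed, viability rate {rate:.0%}")
```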
Affiliation(s)
- Xinsong Du
  - Division of General Internal Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts 02115, United States
  - Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, United States
- Farhad Dastmalchi
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Matthew A Diller
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- Mathias Brochhausen
  - Department of Biomedical Informatics, College of Medicine, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205, United States
- Timothy J Garrett
  - Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
- William R Hogan
  - Data Science Institute, Medical College of Wisconsin, Milwaukee, Wisconsin 53226, United States
- Dominick J Lemas
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
  - Department of Obstetrics and Gynecology, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
  - Center for Perinatal Outcomes Research, College of Medicine, University of Florida, Gainesville, Florida 32610, United States
4. Johns M, Meurers T, Wirth FN, Haber AC, Müller A, Halilovic M, Balzer F, Prasser F. Data Provenance in Biomedical Research: Scoping Review. J Med Internet Res 2023; 25:e42289. PMID: 36972116; PMCID: PMC10132013; DOI: 10.2196/42289.
Abstract
BACKGROUND Data provenance refers to the origin, processing, and movement of data. Reliable and precise knowledge about data provenance has great potential to improve reproducibility and quality in biomedical research and, therefore, to foster good scientific practice. However, despite the increasing interest in data provenance technologies in the literature and their implementation in other disciplines, these technologies have not yet been widely adopted in biomedical research. OBJECTIVE The aim of this scoping review was to provide a structured overview of the body of knowledge on provenance methods in biomedical research by systematizing articles covering data provenance technologies developed for or used in this application area; describing and comparing the functionalities and design of the provenance technologies used; and identifying gaps in the literature that could provide opportunities for future research on technologies suited to wider adoption. METHODS Following a methodological framework for scoping studies and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines, articles were identified by searching the PubMed, IEEE Xplore, and Web of Science databases and subsequently screened for eligibility. We included original articles covering software-based provenance management for scientific research published between 2010 and 2021. A set of data items was defined along five axes: publication metadata, application scope, provenance aspects covered, data representation, and functionalities. The data items were extracted from the articles, stored in a charting spreadsheet, and summarized in tables and figures. RESULTS We identified 44 original articles published between 2010 and 2021. We found that the solutions described were heterogeneous along all axes. We also identified relationships among motivations for the use of provenance information, feature sets (capture, storage, retrieval, visualization, and analysis), and implementation details such as the data models and technologies used. An important gap we identified is that only a few publications address the analysis of provenance data or use established provenance standards, such as PROV. CONCLUSIONS The heterogeneity of provenance methods, models, and implementations found in the literature points to the lack of a unified understanding of provenance concepts for biomedical data. Providing a common framework, a biomedical reference, and benchmarking data sets could foster the development of more comprehensive provenance solutions.
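The PROV standard mentioned in the results models provenance with three core node types (entities, activities, agents) and relations among them. A minimal sketch of that pattern follows, serialized as plain JSON rather than a formal PROV-JSON document; the identifiers are hypothetical.

```python
import json

# Core W3C PROV pattern: an activity "used" an input entity, an output
# entity "wasGeneratedBy" that activity, and the activity
# "wasAssociatedWith" an agent. All "ex:" identifiers are hypothetical.
record = {
    "entity": {
        "ex:raw_data": {"prov:label": "raw measurements"},
        "ex:clean_data": {"prov:label": "cleaned measurements"},
    },
    "activity": {
        "ex:cleaning": {"prov:label": "data cleaning step"},
    },
    "agent": {
        "ex:analyst": {"prov:type": "prov:Person"},
    },
    "used": {
        "_u1": {"prov:activity": "ex:cleaning", "prov:entity": "ex:raw_data"},
    },
    "wasGeneratedBy": {
        "_g1": {"prov:entity": "ex:clean_data", "prov:activity": "ex:cleaning"},
    },
    "wasAssociatedWith": {
        "_a1": {"prov:activity": "ex:cleaning", "prov:agent": "ex:analyst"},
    },
}

serialized = json.dumps(record, indent=2)
```

Even this small structure lets a consumer answer the questions the review cares about: which inputs produced a given output, and who ran the step that produced it.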
Affiliation(s)
- Marco Johns
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Thierry Meurers
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Felix N Wirth
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Anna C Haber
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Armin Müller
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Mehmed Halilovic
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
- Felix Balzer
  - Institute of Medical Informatics, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Fabian Prasser
  - Medical Informatics Group, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany
5. Shao D, Kellogg GD, Nematbakhsh A, Kuntala PK, Mahony S, Pugh BF, Lai WKM. PEGR: a flexible management platform for reproducible epigenomic and genomic research. Genome Biol 2022; 23:99. PMID: 35440038; PMCID: PMC9016988; DOI: 10.1186/s13059-022-02671-5.
Abstract
Reproducibility is a significant challenge in (epi)genomic research due to the complexity of experiments composed of traditional biochemistry and informatics. Recent advances have exacerbated this as high-throughput sequencing data is generated at an unprecedented pace. Here, we report the development of a Platform for Epi-Genomic Research (PEGR), a web-based project management platform that tracks and quality controls experiments from conception to publication-ready figures, compatible with multiple assays and bioinformatic pipelines. It supports rigor and reproducibility for biochemists working at the bench, while fully supporting reproducibility and reliability for bioinformaticians through integration with the Galaxy platform.
Affiliation(s)
- Danying Shao
  - Institute for Computational and Data Sciences, Pennsylvania State University, University Park, PA, 16802, USA
- Gretta D Kellogg
  - Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, 14850, USA
- Ali Nematbakhsh
  - Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, 14850, USA
- Prashant K Kuntala
  - Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, 16802, USA
- Shaun Mahony
  - Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, 16802, USA
- B Franklin Pugh
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, 14850, USA
- William K M Lai
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, 14850, USA
  - Department of Computational Biology, Cornell University, Ithaca, NY, 14850, USA
6. RESCRIPt: Reproducible sequence taxonomy reference database management. PLoS Comput Biol 2021; 17:e1009581. PMID: 34748542; PMCID: PMC8601625; DOI: 10.1371/journal.pcbi.1009581.
Abstract
Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

Generating and managing sequence and taxonomy reference data presents a bottleneck to many researchers, whether they are generating custom databases or attempting to format existing, curated reference databases for use with standard sequence analysis tools. Evaluating database quality and choosing the “best” database can be an equally formidable challenge. We developed RESCRIPt to alleviate this bottleneck, supporting reproducible, streamlined generation, curation, and evaluation of reference sequence databases. RESCRIPt uses QIIME 2 artifact file formats, which store all processing steps as data provenance within each file, allowing researchers to retrace the computational steps used to generate any given file. We used RESCRIPt to benchmark several commonly used marker-gene sequence databases for 16S rRNA genes, ITS, and COI sequences, demonstrating both its utility for streamlining the use of these databases and its ability to evaluate several qualitative and quantitative characteristics of each database. We show that larger databases are not always best, and that curation steps to reduce redundancy and filter out noisy sequences may be beneficial for some applications. We anticipate that RESCRIPt will streamline the use, management, and evaluation and selection of reference database materials for microbiomics, diet metabarcoding, eDNA, and other diverse applications.
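The curation (quality filtering) step the authors find beneficial can be illustrated with a generic sketch: drop reference sequences that are too short or contain too many ambiguous bases. This is not RESCRIPt's API, only the kind of filter it applies; the thresholds and the toy database are assumptions.

```python
# Generic reference-database quality filter (illustrative, not RESCRIPt's
# implementation): keep sequences meeting a minimum length and a maximum
# number of ambiguous (non-ACGT) bases. Thresholds are assumed defaults.
def quality_filter(db, min_len=1200, max_ambiguous=5):
    """Return the entries of db (id -> sequence) that pass both thresholds."""
    kept = {}
    for seq_id, seq in db.items():
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if len(seq) >= min_len and ambiguous <= max_ambiguous:
            kept[seq_id] = seq
    return kept

# Toy database: one clean full-length entry, one short entry, one noisy entry.
db = {"good": "ACGT" * 400,     # 1600 nt, unambiguous
      "short": "ACGT" * 10,     # 40 nt, too short
      "noisy": "ACGN" * 400}    # 1600 nt, 400 ambiguous bases
print(sorted(quality_filter(db)))   # ['good']
```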
7. Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat Commun 2021; 12:5797. PMID: 34608132; PMCID: PMC8490371; DOI: 10.1038/s41467-021-25974-w.
Abstract
Reproducibility is essential to open science, as findings that cannot be reproduced by independent research groups have limited relevance, regardless of their validity. It is therefore crucial for scientists to describe their experiments in sufficient detail that they can be reproduced, scrutinized, challenged, and built upon. However, the intrinsic complexity and continuous growth of biomedical data make it increasingly difficult to process, analyze, and share with the community in a FAIR (findable, accessible, interoperable, and reusable) manner. To overcome these issues, we created a cloud-based platform called ORCESTRA (orcestra.ca), which provides a flexible framework for the reproducible processing of multimodal biomedical data. It enables processing of clinical, genomic, and perturbation profiles of cancer samples through automated, user-customizable processing pipelines. ORCESTRA creates integrated and fully documented data objects with persistent identifiers (DOIs) and manages multiple dataset versions, which can be shared for future studies.
8. Wratten L, Wilm A, Göke J. Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers. Nat Methods 2021; 18:1161-1168. PMID: 34556866; DOI: 10.1038/s41592-021-01254-9.
Abstract
The rapid growth of high-throughput technologies has transformed biomedical research. With the increasing amount and complexity of data, scalability and reproducibility have become essential not just for experiments, but also for computational analysis. However, transforming data into information involves running a large number of tools, optimizing parameters, and integrating dynamically changing reference data. Workflow managers were developed in response to such challenges. They simplify pipeline development, optimize resource usage, handle software installation and versions, and run on different compute platforms, enabling workflow portability and sharing. In this Perspective, we highlight key features of workflow managers, compare commonly used approaches for bioinformatics workflows, and provide a guide for computational and noncomputational users. We outline community-curated pipeline initiatives that enable novice and experienced users to perform complex, best-practice analyses without having to manually assemble workflows. In sum, we illustrate how workflow managers contribute to making computational analysis in biomedical research shareable, scalable, and reproducible.
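The core service such workflow managers provide can be shown with a toy dependency resolver: each task runs only after its prerequisites, and tasks already marked done are skipped. This is a deliberately simplified sketch with hypothetical task names; real managers layer containerized software, resource scheduling, and result caching on top of exactly this idea.

```python
# Toy workflow runner: execute callables in dependency order, skipping
# anything already in `done`. Task names below are hypothetical.
def run_workflow(tasks, deps, done=None):
    """tasks: name -> callable; deps: name -> list of prerequisite names."""
    done = set() if done is None else done
    order = []

    def visit(name):
        if name in done:
            return
        for prerequisite in deps.get(name, []):
            visit(prerequisite)          # run prerequisites first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        visit(name)
    return order

log = []
tasks = {"align": lambda: log.append("align"),
         "call_variants": lambda: log.append("call_variants"),
         "qc": lambda: log.append("qc")}
deps = {"align": ["qc"], "call_variants": ["align"]}

order_out = run_workflow(tasks, deps)
print(order_out)   # ['qc', 'align', 'call_variants']
```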
Collapse
Affiliation(s)
| | | | - Jonathan Göke
- Genome Institute of Singapore, Singapore, Singapore.
| |
9. Kohls M, Saremi B, Muchsin I, Fischer N, Becher P, Jung K. A resampling strategy for studying robustness in virus detection pipelines. Comput Biol Chem 2021; 94:107555. PMID: 34364046; DOI: 10.1016/j.compbiolchem.2021.107555.
Abstract
Next-generation sequencing is regularly used to identify viral sequences in DNA or RNA samples of infected hosts. A major step in most pipelines for virus detection is to map sequence reads against known virus genomes. Due to small differences between the sequences of related viruses, and due to various biological or technical errors, mapping is subject to uncertainty. As a consequence, the resulting list of detected viruses can lack robustness. We propose a new approach for generating artificial sequencing reads, together with a strategy of resampling from the original findings, that helps assess the robustness of the originally identified list of viruses. From the original mapping result, in the form of a SAM file, a set of statistical distributions is derived. These are used in the resampling pipeline to generate new artificial reads, which are again mapped against the reference genomes. By summarizing the resampling procedure, the analyst learns whether the evidence for the presence of a particular virus in the sample gains or loses strength, and thus about the robustness of the original mapping list as well as of individual viruses in this list. To judge robustness, several indicators are derived from the resampling procedure, such as the correlation between original and resampled read counts, or the statistical detection of outliers in the differences of read counts. Graphical illustrations of read count shifts via Sankey diagrams are also provided. To demonstrate the new approach, we apply it to three real-world data samples, one of them with laboratory-confirmed influenza sequences, and to artificially generated data in which virus sequences have been spiked into the sequencing data of a host. Applying the resampling pipeline, several viruses drop from the original list while new viruses emerge, demonstrating the robustness of the viruses that remain in the list. Our evaluation shows that the resampling approach is helpful for analyzing the viral content of a biological sample, rating the robustness of the original findings, and better showing the overall distribution of findings. The method is also applicable to other virus detection pipelines based on read mapping.
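The resampling idea can be sketched without any read mapping (this is not the authors' code): treat the original per-virus read counts as a multinomial distribution, redraw the same number of reads, and correlate resampled against original counts. Consistently high correlation across replicates suggests the original mapping is robust; the counts below are made up.

```python
import random

# Resampling sketch: redraw reads from the empirical per-virus frequencies
# and compare counts. The virus names and counts are invented examples.
random.seed(7)

original = {"virus_A": 900, "virus_B": 80, "virus_C": 20}
viruses = list(original)
total = sum(original.values())
weights = [original[v] / total for v in viruses]

def pearson(xs, ys):
    """Plain Pearson correlation, avoiding any library dependency."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

def resample_counts():
    draw = random.choices(viruses, weights=weights, k=total)
    return [draw.count(v) for v in viruses]

orig_counts = [original[v] for v in viruses]
corrs = [pearson(orig_counts, resample_counts()) for _ in range(5)]
print(min(corrs))   # close to 1 for these strongly separated counts
```

The paper's indicators go further (outlier detection on count differences, Sankey diagrams of count shifts), but this captures the correlation indicator in miniature.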
Affiliation(s)
- Moritz Kohls
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, 30559 Hannover, Germany
- Babak Saremi
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, 30559 Hannover, Germany
- Ihsan Muchsin
  - Institute for Virology and Immunobiology, University of Würzburg, Versbacher Straße 7, 97078 Würzburg, Germany
- Nicole Fischer
  - Institute of Medical Microbiology, Virology and Hygiene, University Medical Center Hamburg-Eppendorf (UKE), Martinistraße 52, 20251 Hamburg, Germany
- Paul Becher
  - Institute of Virology, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17, 30559 Hannover, Germany
- Klaus Jung
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Foundation, Bünteweg 17p, 30559 Hannover, Germany
10. John A, Muenzen K, Ausmees K. Evaluation of serverless computing for scalable execution of a joint variant calling workflow. PLoS One 2021; 16:e0254363. PMID: 34242357; PMCID: PMC8270184; DOI: 10.1371/journal.pone.0254363.
Abstract
Advances in whole-genome sequencing have greatly reduced the cost and time of obtaining raw genetic information, but the computational requirements of analysis remain a challenge. Serverless computing has emerged as an alternative to using dedicated compute resources, but its utility has not been widely evaluated for standardized genomic workflows. In this study, we define and execute a best-practice joint variant calling workflow using the SWEEP workflow management system. We present an analysis of performance and scalability, and discuss the utility of the serverless paradigm for executing workflows in the field of genomics research. The GATK best-practice short germline joint variant calling pipeline was implemented as a SWEEP workflow comprising 18 tasks. The workflow was executed on Illumina paired-end read samples from the European and African super populations of the 1000 Genomes project phase III. Cost and runtime increased linearly with increasing sample size, although runtime was driven primarily by a single task for larger problem sizes. Execution took a minimum of around 3 hours for 2 samples, up to nearly 13 hours for 62 samples, with costs ranging from $2 to $70.
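The reported scaling is roughly linear: about 3 hours and $2 for 2 samples, up to nearly 13 hours and $70 for 62 samples. A straight line through those two endpoints gives a crude estimator for intermediate cohort sizes; this is an extrapolation of the reported numbers, not a figure from the paper.

```python
# Linear interpolation between the two reported (samples, value) endpoints.
# Intermediate estimates are our extrapolation, not measurements.
def linear(x0, y0, x1, y1):
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: y0 + slope * (x - x0)

runtime_h = linear(2, 3.0, 62, 13.0)    # ~3 h at 2 samples, ~13 h at 62
cost_usd = linear(2, 2.0, 62, 70.0)     # ~$2 at 2 samples, ~$70 at 62

# Estimated runtime and cost for a hypothetical 32-sample cohort.
print(round(runtime_h(32), 1), round(cost_usd(32), 1))   # 8.0 36.0
```

The authors also note runtime is dominated by a single task at larger problem sizes, so a per-task model would fit better than this single line; the sketch only illustrates the headline trend.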
Affiliation(s)
- Aji John
  - Department of Biology, University of Washington, Seattle, Washington, United States of America
- Kathleen Muenzen
  - Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, United States of America
- Kristiina Ausmees
  - Department of Information Technology, Uppsala University, Uppsala, Sweden
11. Melendrez MC, Shaw S, Brown CT, Goodner BW, Kvaal C. Editorial: Curriculum Applications in Microbiology: Bioinformatics in the Classroom. Front Microbiol 2021; 12:705233. PMID: 34276638; PMCID: PMC8281245; DOI: 10.3389/fmicb.2021.705233.
Affiliation(s)
- Sophie Shaw
  - Centre for Genome Enabled Biology and Medicine, University of Aberdeen, Aberdeen, United Kingdom
- C Titus Brown
  - Department of Population Health and Reproduction, University of California, Davis, Davis, CA, United States
- Christopher Kvaal
  - Department of Biology, St. Cloud State University, St. Cloud, MN, United States
12. Westbrook A, Varki E, Thomas WK. RepeatFS: a file system providing reproducibility through provenance and automation. Bioinformatics 2021; 37:1292-1296. PMID: 33230554; PMCID: PMC8189677; DOI: 10.1093/bioinformatics/btaa950.
Abstract
Motivation: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions, and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates, and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features that promote analytical transparency and reproducibility, including provenance visualization and task automation. Results: We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations, with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences. Availability and implementation: RepeatFS is implemented in Python 3. Its source code and documentation are available at https://github.com/ToniWestbrook/repeatfs. Supplementary information: Supplementary data are available at Bioinformatics online.
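RepeatFS itself operates at the file-system level; the sketch below only mimics its verification idea in miniature: fingerprint each output by content hash so that a replicated run can be checked against the recorded provenance. Function and file names are hypothetical.

```python
import hashlib

# Minimal replication check (illustrative, not RepeatFS's implementation):
# record a SHA-256 digest per output, then flag outputs whose content no
# longer matches the recorded digest after a rerun.
def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(provenance: dict, outputs: dict) -> list:
    """Return names of outputs whose hash does not match the record."""
    return [name for name, digest in provenance.items()
            if fingerprint(outputs.get(name, b"")) != digest]

outputs = {"result.txt": b"42\n"}
provenance = {"result.txt": fingerprint(b"42\n")}
print(verify(provenance, outputs))        # [] -> replication matches

outputs["result.txt"] = b"43\n"           # simulate a drifted rerun
print(verify(provenance, outputs))        # ['result.txt']
```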
Affiliation(s)
- W Kelley Thomas
  - Hubbard Center for Genome Studies
  - Department of Molecular Cellular and Biomedical Sciences, University of New Hampshire, Durham, NH 03824, USA
13. Patel JA, Dean DA, King CH, Xiao N, Koc S, Minina E, Golikov A, Brooks P, Kahsay R, Navelkar R, Ray M, Roberson D, Armstrong C, Mazumder R, Keeney J. Bioinformatics tools developed to support BioCompute Objects. Database (Oxford) 2021; 2021:baab008. PMID: 33784373; PMCID: PMC8009203; DOI: 10.1093/database/baab008.
Abstract
Developments in high-throughput sequencing (HTS) result in an exponential increase in the amount of data generated by sequencing experiments, an increase in the complexity of bioinformatics analysis reporting and an increase in the types of data generated. These increases in volume, diversity and complexity of the data generated and their analysis expose the necessity of a structured and standardized reporting template. BioCompute Objects (BCOs) provide the requisite support for communication of HTS data analysis that includes support for workflow, as well as data, curation, accessibility and reproducibility of communication. BCOs standardize how researchers report provenance and the established verification and validation protocols used in workflows while also being robust enough to convey content integration or curation in knowledge bases. BCOs that encapsulate tools, platforms, datasets and workflows are FAIR (findable, accessible, interoperable and reusable) compliant. Providing operational workflow and data information facilitates interoperability between platforms and incorporation of future datasets within an HTS analysis for use within industrial, academic and regulatory settings. Cloud-based platforms, including High-performance Integrated Virtual Environment (HIVE), Cancer Genomics Cloud (CGC) and Galaxy, support BCO generation for users. Given the 100K+ userbase between these platforms, BioCompute can be leveraged for workflow documentation. In this paper, we report the availability of platform-dependent and platform-independent BCO tools: HIVE BCO App, CGC BCO App, Galaxy BCO API Extension and BCO Portal. Community engagement was utilized to evaluate tool efficacy. We demonstrate that these tools further advance BCO creation from text editing approaches used in earlier releases of the standard. Moreover, we demonstrate that integrating BCO generation within existing analysis platforms greatly streamlines BCO creation while capturing granular workflow details. We also demonstrate that the BCO tools described in the paper provide an approach to solve the long-standing challenge of standardizing workflow descriptions that are both human and machine readable while accommodating manual and automated curation with evidence tagging. Database URL: https://www.biocomputeobject.org/resources.
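Under the hood, a BCO is a JSON record organized into named domains. The Python sketch below is illustrative only: it loosely follows the IEEE 2791-2020 domain layout (provenance, description and execution domains, plus an etag-style digest) but is not a validated BCO, and the pipeline step names and tool URI are hypothetical, not taken from the paper.

```python
import hashlib
import json

def make_minimal_bco(name, version, steps, tool_uris):
    """Assemble a minimal BioCompute-Object-like record.

    Illustrative sketch: the keys mirror the IEEE 2791-2020 domains
    at a high level, but this is not a standards-compliant BCO.
    """
    bco = {
        "provenance_domain": {"name": name, "version": version},
        "description_domain": {
            "pipeline_steps": [
                {"step_number": i, "name": s} for i, s in enumerate(steps, 1)
            ]
        },
        "execution_domain": {"software_prerequisites": tool_uris},
    }
    # A content digest makes the record self-verifying, in the spirit
    # of the BCO "etag" field.
    digest = hashlib.sha256(
        json.dumps(bco, sort_keys=True).encode()
    ).hexdigest()
    bco["etag"] = digest
    return bco

bco = make_minimal_bco(
    "HTS variant calling", "1.0",
    ["trim reads", "align", "call variants"],
    ["https://example.org/tools/aligner"],  # hypothetical URI
)
```

Because the digest is computed over a canonical (sorted-key) serialization, any platform can recompute it to check that a shared record has not drifted.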
Affiliation(s)
- Janisha A Patel
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Charles Hadley King
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- The McCormick Genomic and Proteomic Center, The George Washington University, Washington, DC 20037, USA
- Nan Xiao
- Seven Bridges, Charlestown, MA 02129, USA
- Soner Koc
- Seven Bridges, Charlestown, MA 02129, USA
- Ekaterina Minina
- CBER-HIVE, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20993, USA
- Anton Golikov
- CBER-HIVE, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, MD 20993, USA
- Robel Kahsay
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Rahi Navelkar
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Chris Armstrong
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- Raja Mazumder
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
- The McCormick Genomic and Proteomic Center, The George Washington University, Washington, DC 20037, USA
- Jonathon Keeney
- The Department of Biochemistry & Molecular Medicine, The George Washington University School of Medicine and Health Sciences, Washington, DC 20037, USA
14
Balagurunathan Y, Mitchell R, El Naqa I. Requirements and reliability of AI in the medical context. Phys Med 2021; 83:72-78. [PMID: 33721700 PMCID: PMC8915137 DOI: 10.1016/j.ejmp.2021.02.024] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/07/2020] [Revised: 02/04/2021] [Accepted: 02/23/2021] [Indexed: 12/12/2022] Open
Abstract
The digital information age has been a catalyst in creating a renewed interest in Artificial Intelligence (AI) approaches, especially the subclass of computer algorithms that are popularly grouped into Machine Learning (ML). These methods have allowed us to go beyond limited human cognitive ability in understanding the complexity of high-dimensional data. Medical sciences have seen steady use of these methods but have been slow to adopt them to improve patient care. Some significant impediments have diluted this effort, including the availability of curated, diverse data sets for model building, reliable human-level interpretation of these models, and reliable reproducibility of these methods for routine clinical use. Each of these aspects has several limiting conditions that need to be balanced against the data/model-building effort, clinical implementation, and the integration cost of the translational effort with minimal patient-level harm, all of which may directly impact future clinical adoption. In this review paper, we assess each aspect of the problem in the context of reliable use of ML methods in oncology, as a representative case study, with the goal of safeguarding utility and improving patient care in medicine in general.
Affiliation(s)
- Ross Mitchell
- Department of Machine Learning, H. Lee Moffitt Cancer Center, Tampa, FL, USA; Health Data Services, H. Lee Moffitt Cancer Center, Tampa, FL, USA.
- Issam El Naqa
- Department of Machine Learning, H. Lee Moffitt Cancer Center, Tampa, FL, USA.
15
Moossavi S, Fehr K, Khafipour E, Azad MB. Repeatability and reproducibility assessment in a large-scale population-based microbiota study: case study on human milk microbiota. Microbiome 2021; 9:41. [PMID: 33568231 PMCID: PMC7877029 DOI: 10.1186/s40168-020-00998-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/20/2020] [Accepted: 12/29/2020] [Indexed: 06/12/2023]
Abstract
BACKGROUND Quality control, including assessment of batch variability and confirmation of repeatability and reproducibility, is an integral component of high-throughput omics studies, including microbiome research. Batch effects can mask true biological results and/or result in irreproducible conclusions and interpretations. Low biomass samples in microbiome research are prone to reagent contamination; yet, quality control procedures for low biomass samples in large-scale microbiome studies are not well established. RESULTS In this study, we have proposed a framework for an in-depth step-by-step approach to address this gap. The framework consists of three independent stages: (1) verification of sequencing accuracy by assessing technical repeatability and reproducibility of the results using mock communities and biological controls; (2) contaminant removal and batch variability correction by applying a two-tier strategy using statistical algorithms (e.g. decontam) followed by comparison of the data structure between batches; and (3) corroborating the repeatability and reproducibility of microbiome composition and downstream statistical analysis. Using this approach on the milk microbiota data from the CHILD Cohort generated in two batches (extracted and sequenced in 2016 and 2019), we were able to identify potential reagent contaminants that were missed with standard algorithms and substantially reduce contaminant-induced batch variability. Additionally, we confirmed the repeatability and reproducibility of our results in each batch before merging them for downstream analysis. CONCLUSION This study provides important insight to advance quality control efforts in low biomass microbiome research. Within-study quality control that takes advantage of the data structure (i.e. differential prevalence of contaminants between batches) would enhance the overall reliability and reproducibility of research in this field.
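Stage (2) of the framework hinges on prevalence: reagent contaminants tend to appear more consistently in negative controls than in biological samples. The Python sketch below illustrates only that prevalence idea; it is not the decontam algorithm (which scores taxa with a chi-squared statistic), and the taxon names and counts are invented.

```python
def flag_prevalence_contaminants(sample_counts, control_counts):
    """Flag taxa more prevalent in negative controls than in samples.

    Simplified stand-in for prevalence-based contaminant detection:
    a taxon is flagged when its presence rate in blanks exceeds its
    presence rate in biological samples.
    """
    def prevalence(rows, taxon):
        # Fraction of rows in which the taxon was detected at all.
        return sum(1 for r in rows if r.get(taxon, 0) > 0) / len(rows)

    taxa = set().union(*sample_counts, *control_counts)
    return sorted(
        t for t in taxa
        if prevalence(control_counts, t) > prevalence(sample_counts, t)
    )

# Invented per-sample taxon count tables.
samples = [{"Lactobacillus": 120, "Ralstonia": 0},
           {"Lactobacillus": 90, "Ralstonia": 2}]
blanks = [{"Ralstonia": 40}, {"Ralstonia": 33, "Lactobacillus": 0}]
flags = flag_prevalence_contaminants(samples, blanks)  # ['Ralstonia']
```

Here Ralstonia is detected in every blank but only half the samples, so it is flagged, while Lactobacillus, absent from the blanks, is not.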
Affiliation(s)
- Shirin Moossavi
- Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB, Canada.
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada.
- Developmental Origins of Chronic Diseases in Children Network (DEVOTION), Winnipeg, MB, Canada.
- Digestive Oncology Research Center, Digestive Disease Research Institute, Tehran University of Medical Sciences, Tehran, Iran.
- Department of Physiology and Pharmacology & Mechanical and Manufacturing Engineering, University of Calgary, Calgary, AB, Canada.
- Kelsey Fehr
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada
- Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada
- Ehsan Khafipour
- Department of Animal Science, University of Manitoba, Winnipeg, MB, Canada
- Microbiome Research and Technical Support, Cargill Animal Nutrition, Diamond V brand, Cedar Rapids, USA
- Meghan B Azad
- Children's Hospital Research Institute of Manitoba, Winnipeg, MB, Canada.
- Developmental Origins of Chronic Diseases in Children Network (DEVOTION), Winnipeg, MB, Canada.
- Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, MB, Canada.
16
Lee H, Shuaibi A, Bell JM, Pavlichin DS, Ji HP. Unique k-mer sequences for validating cancer-related substitution, insertion and deletion mutations. NAR Cancer 2020; 2:zcaa034. [PMID: 33345188 PMCID: PMC7727745 DOI: 10.1093/narcan/zcaa034] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/23/2020] [Revised: 10/23/2020] [Accepted: 11/12/2020] [Indexed: 12/26/2022] Open
Abstract
Cancer genome sequencing has led to important discoveries such as the identification of cancer genes. However, challenges remain in the analysis of cancer genome sequencing. One significant issue is that mutations identified by multiple variant callers are frequently discordant even when using the same genome sequencing data. For insertion and deletion mutations, oftentimes there is no agreement among different callers. Identifying somatic mutations involves read mapping and variant calling, a complicated process that uses many parameters and model tuning. To validate the identification of true mutations, we developed a method using k-mer sequences. First, we characterized the landscape of unique versus non-unique k-mers in the human genome. Second, we developed a software package, KmerVC, to validate the given somatic mutations from sequencing data. Our program validates the occurrence of a mutation based on statistically significant difference in frequency of k-mers with and without a mutation from matched normal and tumor sequences. Third, we tested our method on both simulated and cancer genome sequencing data. Counting k-mers involving mutations effectively validated true positive mutations including insertions and deletions across different individual samples in a reproducible manner. Thus, we demonstrated a straightforward approach for rapidly validating mutations from cancer genome sequencing data.
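KmerVC's statistical test is not reproduced here, but the underlying idea, counting k-mers that can only arise from the alternate allele, can be sketched in Python under toy assumptions (a tiny invented reference, short invented reads, and no significance testing):

```python
def kmers_spanning(seq, pos, k):
    """All k-mers of seq that cover position pos (0-based)."""
    start = max(0, pos - k + 1)
    end = min(len(seq) - k, pos)
    return {seq[i:i + k] for i in range(start, end + 1)}

def reads_supporting_substitution(reads, ref, pos, alt, k=5):
    """Count reads containing a k-mer unique to the alternate allele.

    Toy version of k-mer-based validation: build the k-mers spanning
    the substitution in the mutated reference, discard any that also
    occur around the reference base, then count reads carrying one of
    the remaining alt-specific k-mers.
    """
    mutated = ref[:pos] + alt + ref[pos + 1:]
    alt_only = kmers_spanning(mutated, pos, k) - kmers_spanning(ref, pos, k)
    return sum(1 for read in reads if any(km in read for km in alt_only))

ref = "ACGTACGTACGT"                     # toy reference sequence
tumor = ["ACGTAAGTAC", "TACGTACGTA"]     # first read carries the C>A change
normal = ["ACGTACGTAC", "TACGTACGTA"]
tumor_support = reads_supporting_substitution(tumor, ref, 5, "A")    # 1
normal_support = reads_supporting_substitution(normal, ref, 5, "A")  # 0
```

A real validation would compare such counts between matched tumor and normal samples and require the difference to be statistically significant.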
Affiliation(s)
- HoJoon Lee
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Ahmed Shuaibi
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- John M Bell
- Stanford Genome Technology Center, Stanford University, Palo Alto, CA 94304, USA
- Dmitri S Pavlichin
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
- Hanlee P Ji
- Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA
17
Abstract
Genomics is both a data- and compute-intensive discipline. The success of genomics depends on an adequate informatics infrastructure that can address growing data demands and enable a diverse range of resource-intensive computational activities. Designing a suitable infrastructure is a challenging task, and its success largely depends on its adoption by users. In this article, we take a user-centric view of genomics, where users are bioinformaticians, computational biologists, and data scientists. We try to take their point of view on how traditional computational activities for genomics are expanding due to data growth, as well as the introduction of big data and cloud technologies. The changing landscape of computational activities and new user requirements will influence the design of future genomics infrastructures.
Affiliation(s)
- Ritesh Krishna
- IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK
- Vadim Elisseev
- IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK
18
Kanzi AM, San JE, Chimukangara B, Wilkinson E, Fish M, Ramsuran V, de Oliveira T. Next Generation Sequencing and Bioinformatics Analysis of Family Genetic Inheritance. Front Genet 2020; 11:544162. [PMID: 33193618 PMCID: PMC7649788 DOI: 10.3389/fgene.2020.544162] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 03/20/2020] [Accepted: 09/21/2020] [Indexed: 12/29/2022] Open
Abstract
Mendelian and complex genetic trait diseases continue to burden and affect society both socially and economically. The lack of effective tests has hampered diagnosis; thus, the affected lack a proper prognosis. Mendelian diseases are caused by genetic mutations in a single gene, while complex trait diseases are caused by the accumulation of mutations in either linked or unlinked genomic regions. Significant advances have been made in identifying novel disease-associated mutations, especially with the introduction of next generation and third generation sequencing. Regardless, some diseases are still without diagnosis, as most tests rely on SNP genotyping panels developed from population-based genetic analyses. Analysis of family genetic inheritance using whole genomes, whole exomes or a panel of genes has been shown to be effective in identifying disease-causing mutations. In this review, we discuss next generation and third generation sequencing platforms, bioinformatic tools and genetic resources commonly used to analyze family-based genomic data with a focus on identifying inherited or novel disease-causing mutations. Additionally, we also highlight the analytical, ethical and regulatory challenges associated with analyzing personal genomes, which constitute the data used for family genetic inheritance.
Affiliation(s)
- Aquillah M. Kanzi
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), School of Laboratory Medicine and Medical Sciences, College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
19
Perez-Riverol Y, Moreno P. Scalable Data Analysis in Proteomics and Metabolomics Using BioContainers and Workflows Engines. Proteomics 2020; 20:e1900147. [PMID: 31657527 PMCID: PMC7613303 DOI: 10.1002/pmic.201900147] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/12/2019] [Revised: 09/30/2019] [Indexed: 12/29/2022]
Abstract
The recent improvements in mass spectrometry instruments and new analytical methods are increasing the intersection between proteomics and big data science. In addition, bioinformatics analysis is becoming increasingly complex and convoluted, involving multiple algorithms and tools. A wide variety of methods and software tools have been developed for computational proteomics and metabolomics during recent years, and this trend is likely to continue. However, most of the computational proteomics and metabolomics tools are designed as single-tiered software applications in which the analytics tasks cannot be distributed, limiting the scalability and reproducibility of the data analysis. In this paper, the key steps of metabolomics and proteomics data processing, including the main tools and software used to perform the data analysis, are summarized. The combination of software containers with workflow environments for large-scale metabolomics and proteomics analysis is discussed. Finally, a new approach for reproducible and large-scale data analysis based on BioContainers and two of the most popular workflow environments, Galaxy and Nextflow, is introduced to the proteomics and metabolomics communities.
Affiliation(s)
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Pablo Moreno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
20
Schaduangrat N, Lampa S, Simeon S, Gleeson MP, Spjuth O, Nantasenamat C. Towards reproducible computational drug discovery. J Cheminform 2020; 12:9. [PMID: 33430992 PMCID: PMC6988305 DOI: 10.1186/s13321-020-0408-x] [Citation(s) in RCA: 85] [Impact Index Per Article: 21.3] [Received: 07/17/2019] [Accepted: 01/02/2020] [Indexed: 12/11/2022] Open
Abstract
The reproducibility of experiments has been a long-standing impediment to further scientific progress. Computational methods have been instrumental in drug discovery efforts owing to their multifaceted utilization for data collection, pre-processing, analysis and inference. This article provides an in-depth coverage of the reproducibility of computational drug discovery. This review explores the following topics: (1) the current state-of-the-art on reproducible research, (2) research documentation (e.g. electronic laboratory notebook, Jupyter notebook, etc.), (3) the science of reproducible research (i.e. comparison and contrast with related concepts such as replicability, reusability and reliability), (4) model development in computational drug discovery, (5) computational issues on model development and deployment, and (6) use case scenarios for streamlining the computational drug discovery protocol. In computational disciplines, it has become common practice to share data and programming code used for numerical calculations, not only to facilitate reproducibility, but also to foster collaborations (i.e. to drive the project further by introducing new ideas, growing the data, augmenting the code, etc.). It is therefore inevitable that the field of computational drug design would adopt an open approach towards the collection, curation and sharing of data/code.
Affiliation(s)
- Nalini Schaduangrat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, 10700, Bangkok, Thailand
- Samuel Lampa
- Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Saw Simeon
- Interdisciplinary Graduate Program in Bioscience, Faculty of Science, Kasetsart University, 10900, Bangkok, Thailand
- Matthew Paul Gleeson
- Department of Biomedical Engineering, Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, 10520, Bangkok, Thailand
- Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, 10700, Bangkok, Thailand
21
Ulfenborg B. Vertical and horizontal integration of multi-omics data with miodin. BMC Bioinformatics 2019; 20:649. [PMID: 31823712 PMCID: PMC6902525 DOI: 10.1186/s12859-019-3224-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 03/14/2019] [Accepted: 11/14/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Studies on multiple modalities of omics data such as transcriptomics, genomics and proteomics are growing in popularity, since they allow us to investigate complex mechanisms across molecular layers. It is widely recognized that integrative omics analysis holds the promise to unlock novel and actionable biological insights into health and disease. Integration of multi-omics data remains challenging, however, and requires combination of several software tools and extensive technical expertise to account for the properties of heterogeneous data. RESULTS This paper presents the miodin R package, which provides a streamlined workflow-based syntax for multi-omics data analysis. The package allows users to perform analysis of omics data either across experiments on the same samples (vertical integration), or across studies on the same variables (horizontal integration). Workflows have been designed to promote transparent data analysis and reduce the technical expertise required to perform low-level data import and processing. CONCLUSIONS The miodin package is implemented in R and is freely available for use and extension under the GPL-3 license. Package source, reference documentation and user manual are available at https://gitlab.com/algoromics/miodin.
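The vertical/horizontal distinction can be made concrete with a small sketch. The Python below is not miodin's API (the package itself is implemented in R), and the sample IDs and features are invented: vertical integration joins omics layers measured on the same samples, while horizontal integration stacks studies that measured the same variables.

```python
def integrate_vertical(layers):
    """Merge omics layers measured on the same samples (vertical).

    layers: {layer_name: {sample_id: {feature: value}}}.
    Keeps only samples present in every layer and prefixes feature
    names with their layer of origin.
    """
    shared = set.intersection(*(set(d) for d in layers.values()))
    return {
        s: {f"{layer}:{feat}": val
            for layer, d in layers.items()
            for feat, val in d[s].items()}
        for s in sorted(shared)
    }

def integrate_horizontal(studies, variables):
    """Stack studies that measured the same variables (horizontal)."""
    return {
        sample: {v: feats[v] for v in variables}
        for study in studies
        for sample, feats in study.items()
    }

rna = {"s1": {"TP53": 5.1}, "s2": {"TP53": 4.8}}
protein = {"s1": {"TP53": 0.9}}       # sample s2 missing from this layer
vertical = integrate_vertical({"rna": rna, "protein": protein})
horizontal = integrate_horizontal(
    [{"a1": {"TP53": 2.0}}, {"b1": {"TP53": 3.5}}], ["TP53"]
)
```

Note how vertical integration silently drops s2 because it lacks proteomics data; handling such missing layers explicitly is exactly the kind of low-level bookkeeping a workflow package abstracts away.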
22
Wercelens P, da Silva W, Hondo F, Castro K, Walter ME, Araújo A, Lifschitz S, Holanda M. Bioinformatics Workflows With NoSQL Database in Cloud Computing. Evol Bioinform Online 2019; 15:1176934319889974. [PMID: 31839702 PMCID: PMC6896126 DOI: 10.1177/1176934319889974] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/27/2019] [Accepted: 10/29/2019] [Indexed: 12/29/2022] Open
Abstract
Scientific workflows can be understood as arrangements of managed activities executed by different processing entities. Applying workflows to solve problems in Molecular Biology, notably those related to sequence analyses, is a regular Bioinformatics approach. Due to the nature of the raw data and the in silico environment of Molecular Biology experiments, apart from the research subject, 2 practical and closely related problems have been studied: reproducibility and computational environment. When aiming to enhance the reproducibility of Bioinformatics experiments, various aspects should be considered. The reproducibility requirements comprise the data provenance, which enables the acquisition of knowledge about the trajectory of data over a defined workflow, the settings of the programs, and the entire computational environment. Cloud computing is a booming alternative that can provide this computational environment, hiding technical details, and delivering a more affordable, accessible, and configurable on-demand environment for researchers. Considering this specific scenario, we proposed a solution to improve the reproducibility of Bioinformatics workflows in a cloud computing environment using both Infrastructure as a Service (IaaS) and Not only SQL (NoSQL) database systems. To meet the goal, we have built 3 typical Bioinformatics workflows and ran them on 1 private and 2 public clouds, using different types of NoSQL database systems to persist the provenance data according to the Provenance Data Model (PROV-DM). We present here the results and a guide for the deployment of a cloud environment for Bioinformatics exploring the characteristics of various NoSQL database systems to persist provenance data.
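PROV-DM is serialization-agnostic, so a workflow step's provenance can be persisted as a plain JSON document in a NoSQL store. The Python sketch below is a loose illustration: the keys mirror core PROV relations (used, wasGeneratedBy, wasAssociatedWith) but this is not the standard PROV-JSON serialization, and the activity, agent, and file names are hypothetical.

```python
import json

def prov_record(activity, agent, inputs, outputs):
    """Build a PROV-DM-style provenance record for one workflow step.

    Illustrative document-store payload: keys loosely mirror the core
    PROV relations, not a standards-compliant PROV-JSON document.
    """
    return {
        "activity": activity,
        "wasAssociatedWith": agent,
        "used": [{"entity": e} for e in inputs],
        "wasGeneratedBy": [{"entity": e, "activity": activity}
                           for e in outputs],
    }

record = prov_record(
    "align-reads",        # hypothetical workflow activity
    "aligner-1.0",        # hypothetical software agent
    ["sample1.fastq"],
    ["sample1.bam"],
)
payload = json.dumps(record)  # what a document database would persist
```

Chaining such records (each step's outputs becoming the next step's inputs) is what lets the data's trajectory over a defined workflow be reconstructed later.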
Affiliation(s)
- Polyane Wercelens
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Waldeyr da Silva
- Department of Computer Science, University of Brasília, Brasília, Brazil
- NEPBIO (Group of Biological Studies and Research on Cerrado), Federal Institute of Goiás (IFG), Formosa, Goiás, Brazil
- Fernanda Hondo
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Klayton Castro
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Aletéia Araújo
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Sergio Lifschitz
- Pontifical Catholic University of Rio de Janeiro, Rio de Janeiro, Brazil
- Maristela Holanda
- Department of Computer Science, University of Brasília, Brasília, Brazil
23
Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. Gigascience 2019; 8:giz095. [PMID: 31675414 PMCID: PMC6824458 DOI: 10.1093/gigascience/giz095] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 12/04/2018] [Revised: 05/23/2019] [Accepted: 07/17/2019] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. RESULTS Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. CONCLUSIONS The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.
Affiliation(s)
- Farah Zaib Khan
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Common Workflow Language Project
- Richard O Sinnott
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
- Andrew Lonie
- The University of Melbourne, School of Computing and Information System, Doug Mcdonnell Building, Parkville, Australia, 3052
24
Montemayor C, Brunker PAR, Keller MA. Banking with precision: transfusion medicine as a potential universal application in clinical genomics. Curr Opin Hematol 2019; 26:480-487. [PMID: 31490317 PMCID: PMC7302862 DOI: 10.1097/moh.0000000000000536] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/19/2022]
Abstract
PURPOSE OF REVIEW To summarize the most recent scientific progress in transfusion medicine genomics and discuss its role within the broad genomic precision medicine model, with a focus on the unique computational and bioinformatic aspects of this emergent field. RECENT FINDINGS Recent publications continue to validate the feasibility of using next-generation sequencing (NGS) for blood group prediction with three distinct approaches: exome sequencing, whole genome sequencing, and PCR-based targeted NGS methods. The reported correlation of NGS with serologic and alternative genotyping methods ranges from 92 to 99%. NGS has demonstrated improved detection of weak antigens, structural changes, copy number variations, novel genomic variants, and microchimerism. Addition of a transfusion medicine interpretation to any clinically sequenced genome is proposed as a strategy to enhance the cost-effectiveness of precision genomic medicine. Interpretation of NGS in the blood group antigen context requires not only advanced immunohematology knowledge, but also specialized software and hardware resources, and a bioinformatics-trained workforce. SUMMARY Blood transfusions are a common inpatient procedure, making blood group genomics a promising facet of precision medicine research. Further efforts are needed to embrace transfusion bioinformatic challenges and evaluate its clinical utility.
Affiliation(s)
- Celina Montemayor
- Department of Transfusion Medicine, National Institutes of Health Clinical Center, Bethesda, MD
- Patricia A. R. Brunker
- Division of Transfusion Medicine, Department of Pathology, The Johns Hopkins Hospital, Baltimore, MD
- American Red Cross, Greater Chesapeake and Potomac Region, Baltimore, MD
25
Review of Issues and Solutions to Data Analysis Reproducibility and Data Quality in Clinical Proteomics. Methods Mol Biol 2019. [PMID: 31552637 DOI: 10.1007/978-1-4939-9744-2_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2023]
Abstract
In any analytical discipline, data analysis reproducibility is closely interlinked with data quality. In this book chapter focused on mass spectrometry-based proteomics approaches, we introduce how both data analysis reproducibility and data quality can influence each other and how data quality and data analysis designs can be used to increase robustness and improve reproducibility. We first introduce methods and concepts to design and maintain robust data analysis pipelines such that reproducibility can be increased in parallel. The technical aspects related to data analysis reproducibility are challenging, and current ways to increase the overall robustness are multifaceted. Software containerization and cloud infrastructures play an important part. We will also show how quality control (QC) and quality assessment (QA) approaches can be used to spot analytical issues, reduce the experimental variability, and increase confidence in the analytical results of (clinical) proteomics studies, since experimental variability plays a substantial role in analysis reproducibility. Therefore, we give an overview on existing solutions for QC/QA, including different quality metrics, and methods for longitudinal monitoring. The efficient use of both types of approaches undoubtedly provides a way to improve the experimental reliability, reproducibility, and level of consistency in proteomics analytical measurements.
Collapse
|
26
|
Silliman K. Population structure, genetic connectivity, and adaptation in the Olympia oyster ( Ostrea lurida) along the west coast of North America. Evol Appl 2019; 12:923-939. [PMID: 31080505 PMCID: PMC6503834 DOI: 10.1111/eva.12766] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2018] [Revised: 11/28/2018] [Accepted: 12/02/2018] [Indexed: 01/02/2023] Open
Abstract
Effective management of threatened and exploited species requires an understanding of both the genetic connectivity among populations and local adaptation. The Olympia oyster (Ostrea lurida), patchily distributed from Baja California to the central coast of Canada, has a long history of population declines due to anthropogenic stressors. For such coastal marine species, population structure could follow a continuous isolation-by-distance model, contain regional blocks of genetic similarity separated by barriers to gene flow, or be consistent with a null model of no population structure. To distinguish between these hypotheses in O. lurida, 13,424 single nucleotide polymorphisms (SNPs) were used to characterize rangewide population structure, genetic connectivity, and adaptive divergence. Samples were collected across the species range on the west coast of North America, from southern California to Vancouver Island. A conservative approach for detecting putative loci under selection identified 235 SNPs across 129 GBS loci, which were functionally annotated and analyzed separately from the remaining neutral loci. While strong population structure was observed on a regional scale in both neutral and outlier markers, neutral markers had greater power to detect fine-scale structure. Geographic regions of reduced gene flow aligned with known marine biogeographic barriers, such as Cape Mendocino, Monterey Bay, and the currents around Cape Flattery. The outlier loci identified as under putative selection included genes involved in developmental regulation, sensory information processing, energy metabolism, immune response, and muscle contraction. These loci are excellent candidates for future research and may provide targets for genetic monitoring programs. 
Beyond specific applications for restoration and management of the Olympia oyster, this study adds to the growing body of evidence for both population structure and adaptive differentiation across a range of marine species with the potential for panmixia. Computational notebooks are available to facilitate reproducibility and future open-sourced research on the population structure of O. lurida.
Collapse
|
27
|
Juanillas V, Dereeper A, Beaume N, Droc G, Dizon J, Mendoza JR, Perdon JP, Mansueto L, Triplett L, Lang J, Zhou G, Ratharanjan K, Plale B, Haga J, Leach JE, Ruiz M, Thomson M, Alexandrov N, Larmande P, Kretzschmar T, Mauleon RP. Rice Galaxy: an open resource for plant science. Gigascience 2019; 8:giz028. [PMID: 31107941 PMCID: PMC6527052 DOI: 10.1093/gigascience/giz028] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2018] [Revised: 08/29/2018] [Accepted: 02/12/2019] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND Rice molecular genetics, breeding, genetic diversity, and allied research (such as rice-pathogen interaction) have adopted sequencing technologies and high-density genotyping platforms for genome variation analysis and gene discovery. Germplasm collections representing rice diversity, improved varieties, and elite breeding materials are accessible through rice gene banks for use in research and breeding, with many having genome sequences and high-density genotype data available. Combining phenotypic and genotypic information on these accessions enables genome-wide association analysis, which is driving quantitative trait loci discovery and molecular marker development. Comparative sequence analyses across quantitative trait loci regions facilitate the discovery of novel alleles. Analyses involving DNA sequences and large genotyping matrices for thousands of samples, however, pose a challenge to non-computer savvy rice researchers. FINDINGS The Rice Galaxy resource has shared datasets that include high-density genotypes from the 3,000 Rice Genomes project and sequences with corresponding annotations from 9 published rice genomes. The Rice Galaxy web server and deployment installer includes tools for designing single-nucleotide polymorphism assays, analyzing genome-wide association studies, population diversity, rice-bacterial pathogen diagnostics, and a suite of published genomic prediction methods. A prototype Rice Galaxy compliant to Open Access, Open Data, and Findable, Accessible, Interoperable, and Reproducible principles is also presented. CONCLUSIONS Rice Galaxy is a freely available resource that empowers the plant research community to perform state-of-the-art analyses and utilize publicly available big datasets for both fundamental and applied science.
Collapse
Affiliation(s)
- Venice Juanillas
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Alexis Dereeper
- Institut de recherche pour le développement (IRD), University of Montpellier, DIADE, IPME, Montpellier, France
| | - Nicolas Beaume
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Gaetan Droc
- CIRAD, UMR AGAP, F-34398 Montpellier, France
| | - Joshua Dizon
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - John Robert Mendoza
- Advanced Science and Technology Institute, Department of Science and Technology, Quezon City, Philippines
| | - Jon Peter Perdon
- Advanced Science and Technology Institute, Department of Science and Technology, Quezon City, Philippines
| | - Locedie Mansueto
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Lindsay Triplett
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO 80523-1177, USA
| | - Jillian Lang
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO 80523-1177, USA
| | - Gabriel Zhou
- Indiana University, 107 S Indiana Ave, Bloomington, IN 47405, USA
| | | | - Beth Plale
- Indiana University, 107 S Indiana Ave, Bloomington, IN 47405, USA
| | - Jason Haga
- National Institute of Advanced Industrial Science and Technology, AIST Tsukuba Central 1,1-1-1 Umezono, Tsukuba, Ibaraki 305-8560, Japan
| | - Jan E Leach
- Department of Bioagricultural Sciences and Pest Management, Colorado State University, Fort Collins, CO 80523-1177, USA
| | - Manuel Ruiz
- CIRAD, UMR AGAP, F-34398 Montpellier, France
| | - Michael Thomson
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
- Department of Soil and Crop Sciences, Texas A&M University, Houston, TX, USA
| | - Nickolai Alexandrov
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
| | - Pierre Larmande
- Institut de recherche pour le développement (IRD), University of Montpellier, DIADE, IPME, Montpellier, France
| | - Tobias Kretzschmar
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
- Southern Cross Plant Science, Southern Cross University, Lismore, Australia
| | - Ramil P Mauleon
- International Rice Research Institute, DAPO Box 7777, Metro Manila 1301, Philippines
- Southern Cross Plant Science, Southern Cross University, Lismore, Australia
| |
Collapse
|
28
|
Wang G, Peng B. Script of Scripts: A pragmatic workflow system for daily computational research. PLoS Comput Biol 2019; 15:e1006843. [PMID: 30811390 PMCID: PMC6411228 DOI: 10.1371/journal.pcbi.1006843] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Revised: 03/11/2019] [Accepted: 01/29/2019] [Indexed: 01/22/2023] Open
Abstract
Computationally intensive disciplines such as computational biology often require use of a variety of tools implemented in different scripting languages and analysis of large data sets using high-performance computing systems. Although scientific workflow systems can powerfully organize and execute large-scale data-analysis processes, creating and maintaining such workflows usually comes with nontrivial learning curves and engineering overhead, making them cumbersome to use for everyday data exploration and prototyping. To bridge the gap between interactive analysis and workflow systems, we developed Script of Scripts (SoS), an interactive data-analysis platform and workflow system with a strong emphasis on readability, practicality, and reproducibility in daily computational research. For exploratory analysis, SoS has a multilanguage scripting format that centralizes otherwise-scattered scripts and creates dynamic reports for publication and sharing. As a workflow engine, SoS provides an intuitive syntax for creating workflows in process-oriented, outcome-oriented, and mixed styles, as well as a unified interface for executing and managing tasks on a variety of computing platforms with automatic synchronization of files among isolated file systems. As illustrated herein by real-world examples, SoS is both an interactive analysis tool and pipeline platform suitable for different stages of method development and data-analysis projects. In particular, SoS can be easily adopted in existing data analysis routines to substantially improve organization, readability, and cross-platform computation management of research projects.
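SoS workflows are written in SoS's own multilanguage notation (see the SoS documentation for the actual syntax). As a language-neutral illustration of the outcome-oriented style the abstract mentions, here is a toy make-like runner in Python: each target declares its inputs and an action, and a step runs only when its target does not yet exist. All step and file names are invented:

```python
def outcome_run(steps, fs):
    """Outcome-oriented execution over a toy in-memory 'filesystem'
    (a dict mapping path -> content): run a step only when its target
    is missing, recursively satisfying dependencies first."""
    executed = []

    def need(target):
        if target in fs:
            return                      # outcome already satisfied
        inputs, action = steps[target]
        for dep in inputs:              # build dependencies first
            need(dep)
        fs[target] = action(fs)
        executed.append(target)

    for target in steps:
        need(target)
    return executed

# Hypothetical two-step pipeline: raw -> clean -> report.
steps = {
    "clean.txt": (["raw.txt"], lambda fs: fs["raw.txt"].strip()),
    "report.txt": (["clean.txt"], lambda fs: "n=%d" % len(fs["clean.txt"])),
}
fs = {"raw.txt": "  hello  "}
print(outcome_run(fs=fs, steps=steps))  # both steps run the first time
print(outcome_run(fs=fs, steps=steps))  # nothing to do on a re-run
```

The process-oriented style, by contrast, simply executes steps in declared order; SoS lets the two styles be mixed within one workflow.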
Collapse
Affiliation(s)
- Gao Wang
- Department of Human Genetics, The University of Chicago, Chicago, IL, United States of America
| | - Bo Peng
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States of America
- * E-mail:
| |
Collapse
|
29
|
Das S, Lecours Boucher X, Rogers C, Makowski C, Chouinard-Decorte F, Oros Klein K, Beck N, Rioux P, Brown ST, Mohaddes Z, Zweber C, Foing V, Forest M, O'Donnell KJ, Clark J, Meaney MJ, Greenwood CMT, Evans AC. Integration of "omics" Data and Phenotypic Data Within a Unified Extensible Multimodal Framework. Front Neuroinform 2018; 12:91. [PMID: 30631270 PMCID: PMC6315165 DOI: 10.3389/fninf.2018.00091] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2018] [Accepted: 11/16/2018] [Indexed: 12/11/2022] Open
Abstract
Analysis of “omics” data is often a long and segmented process, encompassing multiple stages from initial data collection to processing, quality control and visualization. The cross-modal nature of recent genomic analyses renders this process challenging to both automate and standardize; consequently, users often resort to manual interventions that compromise data reliability and reproducibility. This in turn can produce multiple versions of datasets across storage systems. As a result, scientists can lose significant time and resources trying to execute and monitor their analytical workflows and encounter difficulties sharing versioned data. In 2015, the Ludmer Centre for Neuroinformatics and Mental Health at McGill University brought together expertise from the Douglas Mental Health University Institute, the Lady Davis Institute and the Montreal Neurological Institute (MNI) to form a genetics/epigenetics working group. The objectives of this working group are to: (i) design an automated and seamless process for (epi)genetic data that consolidates heterogeneous datasets into the LORIS open-source data platform; (ii) streamline data analysis; (iii) integrate results with provenance information; and (iv) facilitate structured and versioned sharing of pipelines for optimized reproducibility using high-performance computing (HPC) environments via the CBRAIN processing portal. 
This article outlines the resulting generalizable “omics” framework and its benefits, specifically, the ability to: (i) integrate multiple types of biological and multi-modal datasets (imaging, clinical, demographics and behavioral); (ii) automate the process of launching analysis pipelines on HPC platforms; (iii) remove the bioinformatic barriers that are inherent to this process; (iv) ensure standardization and transparent sharing of processing pipelines to improve computational consistency; (v) store results in a queryable web interface; (vi) offer visualization tools to better view the data; and (vii) provide the mechanisms to ensure usability and reproducibility. This framework for workflows facilitates brain research discovery by reducing human error through automation of analysis pipelines and seamless linking of multimodal data, allowing investigators to focus on research instead of data handling.
Collapse
Affiliation(s)
- Samir Das
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Xavier Lecours Boucher
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Christine Rogers
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Carolina Makowski
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada.,Douglas Hospital Research Centre, McGill University, Montreal, QC, Canada
| | - François Chouinard-Decorte
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Kathleen Oros Klein
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada.,Lady Davis Institute, Jewish General Hospital, McGill University, Montreal, QC, Canada
| | - Natacha Beck
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Pierre Rioux
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Shawn T Brown
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Zia Mohaddes
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Cole Zweber
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Victoria Foing
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| | - Marie Forest
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada.,Lady Davis Institute, Jewish General Hospital, McGill University, Montreal, QC, Canada
| | - Kieran J O'Donnell
- Douglas Hospital Research Centre, McGill University, Montreal, QC, Canada.,Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada
| | - Joanne Clark
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada
| | - Michael J Meaney
- Douglas Hospital Research Centre, McGill University, Montreal, QC, Canada.,Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada
| | - Celia M T Greenwood
- Ludmer Centre for Neuroinformatics & Mental Health, McGill University, Montreal, QC, Canada.,Lady Davis Institute, Jewish General Hospital, McGill University, Montreal, QC, Canada
| | - Alan C Evans
- McGill Centre for Integrative Neuroscience, Montreal Neurological Institute, Montreal, QC, Canada.,Montreal Neurological Institute, McGill University, Montreal, QC, Canada
| |
Collapse
|
30
|
Goodstadt MN, Marti-Renom MA. Communicating Genome Architecture: Biovisualization of the Genome, from Data Analysis and Hypothesis Generation to Communication and Learning. J Mol Biol 2018; 431:1071-1087. [PMID: 30419242 DOI: 10.1016/j.jmb.2018.11.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2018] [Revised: 10/29/2018] [Accepted: 11/01/2018] [Indexed: 01/07/2023]
Abstract
Genome discoveries at the core of biology are made by visual description and exploration of the cell, from microscopic sketches and biochemical mapping to computational analysis and spatial modeling. We outline the recently developed experimental and visualization techniques that capture the three-dimensional interactions regulating how genes are expressed. We detail the challenges faced in integrating these data to portray the components, their organization, and their dynamic landscape. The goal is more than a single data-driven representation: interactive visualization for de novo research is paramount for deciphering insights into genome organization in space.
Collapse
Affiliation(s)
- Mike N Goodstadt
- CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Baldiri Reixac 4, Barcelona 08028, Spain; Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain.
| | - Marc A Marti-Renom
- CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Baldiri Reixac 4, Barcelona 08028, Spain; Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Pg. Lluis Companys 23, Barcelona 08010, Spain.
| |
Collapse
|
31
|
Kulkarni N, Alessandrì L, Panero R, Arigoni M, Olivero M, Ferrero G, Cordero F, Beccuti M, Calogero RA. Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines. BMC Bioinformatics 2018; 19:349. [PMID: 30367595 PMCID: PMC6191970 DOI: 10.1186/s12859-018-2296-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open
Abstract
Background Reproducibility is a key element of modern science and is mandatory for any industrial application. It denotes the ability to replicate an experiment independently of location and operator; a study can therefore be considered reproducible only if all of the data used are available and the computational analysis workflow is clearly described. For a complex bioinformatics analysis, however, the raw data and the list of tools used in the workflow may not be enough to guarantee reproducible results: different releases of the same tools, or of the system libraries they depend on, can lead to subtle reproducibility issues. Results To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit, open-source project whose aim is to provide a schema and an infrastructure, based on Docker images and R packages, for reproducible results in bioinformatics. One or more Docker images are defined for each workflow (typically one per task), while the workflow implementation is handled via R functions embedded in a package available in a GitHub repository. A bioinformatician joining the project first integrates their workflow modules into Docker image(s), building on an Ubuntu Docker image developed by RBP to ease this task. Second, the workflow implementation must be realized in R following an R skeleton function provided by RBP, guaranteeing homogeneity and reusability across RBP functions. The contributor also provides an R vignette explaining the package functionality, together with an example dataset that users can run to build confidence in the workflow. Conclusions The Reproducible Bioinformatics Project provides a general schema and infrastructure for distributing robust and reproducible workflows, guaranteeing that end users can consistently repeat any analysis regardless of the UNIX-like architecture used.
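RBP's actual wrappers are R functions that launch Docker containers; as a hypothetical Python analogue of the same pattern, the sketch below only assembles the `docker run` invocation for one containerized task (the image name and tool arguments are invented, and the command is constructed rather than executed):

```python
def docker_command(image, workdir, tool_args):
    """Build the `docker run` invocation for one containerized
    workflow task: remove the container on exit (--rm), mount the
    analysis directory at /data, and pass tool arguments through.
    Executing it would require e.g. subprocess.run(cmd, check=True)."""
    return (["docker", "run", "--rm",
             "-v", f"{workdir}:/data", "-w", "/data", image]
            + list(tool_args))

# Hypothetical image and tool call, for illustration only.
cmd = docker_command("rbp/ubuntu-star:1.0", "/tmp/run1",
                     ["STAR", "--runThreadN", "4"])
print(" ".join(cmd))
```

Pinning the image tag (here `:1.0`) is what fixes the tool and system-library versions, which is precisely the class of "subtle reproducibility issues" the abstract describes.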
Collapse
Affiliation(s)
- Neha Kulkarni
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Luca Alessandrì
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Riccardo Panero
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Maddalena Arigoni
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
| | - Martina Olivero
- Department of Oncology, University of Torino, Candiolo, Italy
| | - Giulio Ferrero
- Department of Computer Sciences, University of Torino, Torino, Italy
| | - Francesca Cordero
- Department of Computer Sciences, University of Torino, Torino, Italy.
| | - Marco Beccuti
- Department of Computer Sciences, University of Torino, Torino, Italy
| | - Raffaele A Calogero
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy.
| |
Collapse
|
32
|
Mondelli ML, Magalhães T, Loss G, Wilde M, Foster I, Mattoso M, Katz D, Barbosa H, de Vasconcelos ATR, Ocaña K, Gadelha LMR. BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments. PeerJ 2018; 6:e5551. [PMID: 30186700 PMCID: PMC6119457 DOI: 10.7717/peerj.5551] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2018] [Accepted: 08/07/2018] [Indexed: 11/20/2022] Open
Abstract
Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, using queries to a provenance and annotation database, some of which are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
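The idea of abstracting queries over a provenance database can be sketched with an in-memory SQLite store; the schema and the values below are invented for illustration (BioWorkbench's actual provenance model is far richer), only the workflow names come from the abstract:

```python
import sqlite3

# Toy provenance store: one row per task execution.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE task_exec (workflow TEXT, task TEXT, seconds REAL)")
con.executemany("INSERT INTO task_exec VALUES (?, ?, ?)", [
    ("SwiftPhylo", "align", 120.0),
    ("SwiftPhylo", "tree", 300.0),
    ("SwiftGECKO", "compare", 45.0),
])

# Example pre-built provenance query: total runtime per workflow,
# slowest workflow first.
rows = con.execute("""
    SELECT workflow, SUM(seconds) FROM task_exec
    GROUP BY workflow ORDER BY SUM(seconds) DESC
""").fetchall()
print(rows)   # [('SwiftPhylo', 420.0), ('SwiftGECKO', 45.0)]
```

A web front end like the one described would wrap queries of this kind behind named, parameterized reports.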
Collapse
Affiliation(s)
- Maria Luiza Mondelli
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Thiago Magalhães
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Guilherme Loss
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Michael Wilde
- Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
| | - Ian Foster
- Computation Institute, Argonne National Laboratory/University of Chicago, Chicago, IL, USA
| | - Marta Mattoso
- Computer and Systems Engineering Program, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
| | - Daniel Katz
- National Center for Supercomputing Applications, University of Illinois, Urbana, IL, USA
| | - Helio Barbosa
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil.,Federal University of Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
| | | | - Kary Ocaña
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| | - Luiz M R Gadelha
- National Laboratory for Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
| |
Collapse
|
33
|
Gruenstaeudl M, Gerschler N, Borsch T. Bioinformatic Workflows for Generating Complete Plastid Genome Sequences-An Example from Cabomba (Cabombaceae) in the Context of the Phylogenomic Analysis of the Water-Lily Clade. Life (Basel) 2018; 8:E25. [PMID: 29933597 PMCID: PMC6160935 DOI: 10.3390/life8030025] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2018] [Revised: 06/11/2018] [Accepted: 06/19/2018] [Indexed: 12/13/2022] Open
Abstract
The sequencing and comparison of plastid genomes are becoming a standard method in plant genomics, and many researchers are using this approach to infer plant phylogenetic relationships. Due to the widespread availability of next-generation sequencing, plastid genome sequences are being generated at breakneck pace. This trend towards massive sequencing of plastid genomes highlights the need for standardized bioinformatic workflows. In particular, documentation and dissemination of the details of genome assembly, annotation, alignment and phylogenetic tree inference are needed, as these processes are highly sensitive to the choice of software and the precise settings used. Here, we present the procedure and results of sequencing, assembling, annotating and quality-checking of three complete plastid genomes of the aquatic plant genus Cabomba as well as subsequent gene alignment and phylogenetic tree inference. We accompany our findings by a detailed description of the bioinformatic workflow employed. Importantly, we share a total of eleven software scripts for each of these bioinformatic processes, enabling other researchers to evaluate and replicate our analyses step by step. The results of our analyses illustrate that the plastid genomes of Cabomba are highly conserved in both structure and gene content.
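Because assembly, annotation, alignment, and tree inference are, as noted above, highly sensitive to the choice of software and precise settings, a replicable workflow must record each step together with its exact parameters. A generic Python sketch of such a step log follows; the stage names and settings are illustrative stand-ins, not the authors' eleven actual scripts:

```python
import json

def run_pipeline(stages, data):
    """Run ordered pipeline stages, recording each stage's name and
    exact settings so the analysis can be replicated step by step."""
    log = []
    for name, settings, fn in stages:
        data = fn(data, **settings)
        log.append({"stage": name, "settings": settings})
    return data, json.dumps(log)

# Illustrative stand-ins for quality-trimming and read-counting steps.
stages = [
    ("trim", {"min_qual": 20},
     lambda reads, min_qual: [r for r in reads if r["qual"] >= min_qual]),
    ("count", {}, lambda reads: len(reads)),
]
reads = [{"qual": 30}, {"qual": 10}, {"qual": 25}]
result, provenance = run_pipeline(stages, reads)
print(result)        # 2 reads survive trimming
```

Sharing the `provenance` log alongside the scripts is what lets other researchers evaluate and replicate an analysis step by step.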
Collapse
Affiliation(s)
- Michael Gruenstaeudl
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
| | - Nico Gerschler
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
| | - Thomas Borsch
- Institut für Biologie, Systematische Botanik und Pflanzengeographie, Freie Universität Berlin, 14195 Berlin, Germany.
- Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, 14195 Berlin, Germany.
- Berlin Center for Genomics in Biodiversity Research (BeGenDiv), 14195 Berlin, Germany.
| |
Collapse
|
34
|
Gryk MR, Ludäscher B. Semantic Mediation to Improve Reproducibility for Biomolecular NMR Analysis. TRANSFORMING DIGITAL WORLDS : 13TH INTERNATIONAL CONFERENCE, ICONFERENCE 2018, SHEFFIELD, UK, MARCH 25-28, 2018, PROCEEDINGS. INTERNATIONAL CONFERENCE ON TRANSFORMING DIGITAL WORLDS (13TH : 2018 : SHEFFIELD, ENGLAND) 2018; 10766:620-625. [PMID: 30334020 PMCID: PMC6186436 DOI: 10.1007/978-3-319-78105-1_70] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Two barriers to computational reproducibility are recording the critical metadata required for rerunning a computation and translating the semantics of that metadata so that alternate approaches can easily be configured for verifying computational reproducibility. We are addressing this problem in the context of biomolecular NMR computational analysis by developing a series of linked ontologies that define the semantics of the various software tools used by researchers for data transformation and analysis. Building from a core ontology representing the primary observational data of NMR, the linked-data approach allows metadata to be translated in order to configure alternate software approaches for given computational tasks. In this paper, we illustrate the utility of this approach with a small sample of the core ontology as well as tool-specific semantics for two third-party software tools. This approach to semantic mediation will help support an automated approach to validating the reliability of computation in which the same processing workflow is implemented with different software tools. In addition, the detailed semantics of both the data and the processing functionalities will provide a method for classifying software tools.
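The mediation pattern (tool-specific terms mapped through shared core-ontology terms, so that one tool's metadata can configure another) can be sketched with plain dictionaries; every term and tool name below is invented for illustration and does not come from the actual NMR ontologies:

```python
# Map each tool's parameter names onto shared core-ontology terms
# (all names here are hypothetical placeholders).
TOOL_TO_CORE = {
    "toolA": {"sw_ppm": "core:spectral_width", "npts": "core:num_points"},
    "toolB": {"spec_width": "core:spectral_width", "size": "core:num_points"},
}

def translate(metadata, src, dst):
    """Re-express one tool's metadata in another tool's vocabulary
    by pivoting through the shared core terms."""
    core = {TOOL_TO_CORE[src][k]: v for k, v in metadata.items()}
    dst_inv = {v: k for k, v in TOOL_TO_CORE[dst].items()}
    return {dst_inv[term]: v for term, v in core.items()}

print(translate({"sw_ppm": 12.0, "npts": 2048}, "toolA", "toolB"))
```

In the linked-data setting the same pivot is expressed with RDF/OWL class and property mappings rather than dictionaries, but the translation step is conceptually identical.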
Collapse
Affiliation(s)
- Michael R Gryk
- University of Illinois, Urbana-Champaign, Champaign IL 61820, USA
- UCONN Health, Farmington, CT 06030, USA
| | | |
Collapse
|
35
|
Al Kawam A, Sen A, Datta A, Dickey N. Understanding the Bioinformatics Challenges of Integrating Genomics into Healthcare. IEEE J Biomed Health Inform 2017; 22:1672-1683. [PMID: 29990071 DOI: 10.1109/jbhi.2017.2778263] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Genomic data is paving the way towards personalized healthcare. By unveiling genetic disease-contributing factors, genomic data can aid in the detection, diagnosis, and treatment of a wide range of complex diseases. Integrating genomic data into healthcare is riddled with a wide range of challenges spanning social, ethical, legal, educational, economic, and technical aspects. Bioinformatics is a core integration aspect presenting an overwhelming number of unaddressed challenges. In this paper we tackle the fundamental bioinformatics integration concerns including: genomic data generation, storage, representation, and utilization in conjunction with clinical data. We divide the bioinformatics challenges into a series of seven intertwined integration aspects spanning the areas of informatics, knowledge management, and communication. For each aspect, we provide a detailed discussion of the current research directions, outstanding challenges, and possible resolutions. This paper seeks to help narrow the gap between the genomic applications, which are being predominantly utilized in research settings, and the clinical adoption of these applications.
Collapse
|