1
Coelho LP. For long-term sustainable software in bioinformatics. PLoS Comput Biol 2024; 20:e1011920. [PMID: 38489255 PMCID: PMC10942072 DOI: 10.1371/journal.pcbi.1011920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2024] Open
Affiliation(s)
- Luis Pedro Coelho
- Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia
- Centre for Data Science, Queensland University of Technology, Brisbane, Australia
2
Cassidy MJ, Wallace DA, Purcell S, Sofer T. Reproducibility in computational sleep research: a call for action. Sleep 2024; 47:zsad143. [PMID: 37235755 PMCID: PMC10782485 DOI: 10.1093/sleep/zsad143] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/28/2023] Open
Affiliation(s)
- Michael J Cassidy
- Division of Sleep and Circadian Disorders, Departments of Medicine and Neurology, Brigham and Women’s Hospital, Boston, MA, USA
- Department of Medicine, Cardiovascular Institute, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Danielle A Wallace
- Division of Sleep and Circadian Disorders, Departments of Medicine and Neurology, Brigham and Women’s Hospital, Boston, MA, USA
- Division of Sleep and Circadian Disorders, Harvard Medical School, Boston, MA, USA
- Shaun Purcell
- Department of Psychiatry, Brigham and Women’s Hospital, Boston, MA, USA
- Tamar Sofer
- Division of Sleep and Circadian Disorders, Departments of Medicine and Neurology, Brigham and Women’s Hospital, Boston, MA, USA
- Department of Medicine, Cardiovascular Institute, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Division of Sleep and Circadian Disorders, Harvard Medical School, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
3
Mendes P. Reproducibility and FAIR principles: the case of a segment polarity network model. Front Cell Dev Biol 2023; 11:1201673. [PMID: 37346177 PMCID: PMC10279958 DOI: 10.3389/fcell.2023.1201673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Accepted: 05/30/2023] [Indexed: 06/23/2023] Open
Abstract
The issue of reproducibility of computational models and the related FAIR principles (findable, accessible, interoperable, and reusable) are examined in a specific test case. I analyze a computational model of the segment polarity network in Drosophila embryos published in 2000. Despite the high number of citations to this publication, 23 years later the model is barely accessible, and consequently not interoperable. Following the text of the original publication allowed successfully encoding the model for the open source software COPASI. Subsequently saving the model in the SBML format allowed it to be reused in other open source software packages. Submission of this SBML encoding of the model to the BioModels database enables its findability and accessibility. This demonstrates how the FAIR principles can be successfully enabled by using open source software, widely adopted standards, and public repositories, facilitating reproducibility and reuse of computational cell biology models that will outlive the specific software used.
Affiliation(s)
- Pedro Mendes
- Center for Cell Analysis and Modeling, University of Connecticut School of Medicine, Farmington, CT, United States
- Department of Cell Biology, University of Connecticut School of Medicine, Farmington, CT, United States
4
Xu Y, Mansmann U. Validating the knowledge bank approach for personalized prediction of survival in acute myeloid leukemia: a reproducibility study. Hum Genet 2022; 141:1467-1480. [PMID: 35429300 PMCID: PMC9360099 DOI: 10.1007/s00439-022-02455-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2021] [Accepted: 04/05/2022] [Indexed: 11/29/2022]
Abstract
Reproducibility is not only essential for the integrity of scientific research but is also a prerequisite for model validation and refinement for the future application of predictive algorithms. However, reproducible research is becoming increasingly challenging, particularly in high-dimensional genomic data analyses with complex statistical or algorithmic techniques. Given that there are no mandatory requirements in most biomedical and statistical journals to provide the original data, analytical source code, or other relevant materials for publication, accessibility to these supplements naturally suggests a greater credibility of the published work. In this study, we performed a reproducibility assessment of the notable paper by Gerstung et al. (Nat Genet 49:332–340, 2017) by rerunning the analysis using their original code and data, which are publicly accessible. Despite an open science setting, it was challenging to reproduce the entire research project; reasons included: incomplete data and documentation, suboptimal code readability, coding errors, limited portability of intensive computing performed on a specific platform, and an R computing environment that could no longer be re-established. We learn that the availability of code and data does not guarantee transparency and reproducibility of a study; paradoxically, the source code is still liable to error and obsolescence, essentially due to methodological and computational complexity, a lack of reproducibility checking at submission, and updates for software and operating environment. The complex code may also hide problematic methodological aspects of the proposed research. Building on the experience gained, we discuss the best programming and software engineering practices that could have been employed to improve reproducibility, and propose practical criteria for the conduct and reporting of reproducibility studies for future researchers.
Affiliation(s)
- Yujun Xu
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig-Maximilians-Universität München, Marchioninistr. 15, 81377 Munich, Germany
- Ulrich Mansmann
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig-Maximilians-Universität München, Marchioninistr. 15, 81377 Munich, Germany
5
Palevich N, Maclean PH. Sequencing and Reconstructing Helminth Mitochondrial Genomes Directly from Genomic Next-Generation Sequencing Data. Methods Mol Biol 2022; 2369:27-40. [PMID: 34313982 DOI: 10.1007/978-1-0716-1681-9_3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/09/2023]
Abstract
We present a detailed method for extraction of high-molecular weight genomic DNA suitable for numerous DNA sequencing applications, and a straightforward in silico approach for reconstructing novel mitochondrial (mt) genomes directly from total genomic DNA extracts derived from next-generation sequencing (NGS) data sets. The in silico post-sequencing pipeline described is fast, accurate, and highly efficient, with modest memory requirements that can be performed using a standard desktop computer. The approach is particularly effective for obtaining mitochondrial genomes for species with little or no mitochondrial sequence information currently available and overcomes many of the limitations of traditional strategies. The described methodologies are also applicable for metagenomics sequencing from mixed or pooled samples containing multiple species and subsequent specific assembly of specific mitochondrial genomes.
Affiliation(s)
- Nikola Palevich
- AgResearch Limited, Grasslands Research Centre, Palmerston North, New Zealand
- Paul Haydon Maclean
- AgResearch Limited, Grasslands Research Centre, Palmerston North, New Zealand
6
Implementing Data Management Workflows in Research Groups Through Integrated Library Consultancy. Data Science Journal 2021. [DOI: 10.5334/dsj-2021-009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
7
8
Kim YM, Poline JB, Dumas G. Experimenting with reproducibility: a case study of robustness in bioinformatics. Gigascience 2018; 7:5046609. [PMID: 29961842 PMCID: PMC6054242 DOI: 10.1093/gigascience/giy077] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 11/22/2017] [Accepted: 06/13/2018] [Indexed: 02/06/2023] Open
Abstract
Reproducibility has been shown to be limited in many scientific fields. This question is a fundamental tenet of scientific activity, but the related issues of reusability of scientific data are poorly documented. Here, we present a case study of our difficulties in reproducing a published bioinformatics method even though code and data were available. First, we tried to re-run the analysis with the code and data provided by the authors. Second, we reimplemented the whole method in a Python package to avoid dependency on a MATLAB license and ease the execution of the code on a high-performance computing cluster. Third, we assessed reusability of our reimplementation and the quality of our documentation, testing how easy it would be to start from our implementation to reproduce the results. In a second section, we propose solutions from this case study and other observations to improve reproducibility and research efficiency at the individual and collective levels. While finalizing our code, we created case-specific documentation and tutorials for the associated Python package StratiPy. Readers are invited to experiment with our reproducibility case study by generating the two confusion matrices (see more in section “Robustness: from MATLAB to Python, language and organization”). Here, we propose two options: a step-by-step process to follow in a Jupyter/IPython notebook or a Docker container ready to be built and run.
Affiliation(s)
- Yang-Min Kim
- Human Genetics and Cognitive Functions Unit, Institut Pasteur, 25 rue du Docteur Roux 75015 Paris, France
- CNRS UMR 3571 Genes, Synapses and Cognition, Institut Pasteur, 25 rue du Docteur Roux 75015 Paris, France
- Paris Diderot University, Sorbonne Paris Cité, 5 rue Thomas Mann 75013 Paris, France
- Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI), USR 3756, Institut Pasteur and CNRS, 25-28 rue du Docteur Roux 75015 Paris, France
- Jean-Baptiste Poline
- Montreal Neurological Institute and Hospital, Brain Imaging Center, Ludmer Center, McGill University, 3801 University Street, Montreal, QC H3A 2B4, Canada
- Henry H. Wheeler, Jr. Brain Imaging Center, Helen Wills Neuroscience Institute, 132 Barker Hall, office 210S, MC 3190, University of California, Berkeley, CA 94720, USA
- Guillaume Dumas
- Human Genetics and Cognitive Functions Unit, Institut Pasteur, 25 rue du Docteur Roux 75015 Paris, France
- CNRS UMR 3571 Genes, Synapses and Cognition, Institut Pasteur, 25 rue du Docteur Roux 75015 Paris, France
- Paris Diderot University, Sorbonne Paris Cité, 5 rue Thomas Mann 75013 Paris, France
- Center of Bioinformatics, Biostatistics and Integrative Biology (C3BI), USR 3756, Institut Pasteur and CNRS, 25-28 rue du Docteur Roux 75015 Paris, France
9
Russell PH, Johnson RL, Ananthan S, Harnke B, Carlson NE. A large-scale analysis of bioinformatics code on GitHub. PLoS One 2018; 13:e0205898. [PMID: 30379882 PMCID: PMC6209220 DOI: 10.1371/journal.pone.0205898] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 06/01/2018] [Accepted: 10/03/2018] [Indexed: 11/19/2022] Open
Abstract
In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. However, the actual state of the body of bioinformatics software remains largely unknown. The purpose of this paper is to investigate the state of source code in the bioinformatics community, specifically looking at relationships between code properties, development activity, developer communities, and software impact. To investigate these issues, we curated a list of 1,720 bioinformatics repositories on GitHub through their mention in peer-reviewed bioinformatics articles. Additionally, we included 23 high-profile repositories identified by their popularity in an online bioinformatics forum. We analyzed repository metadata, source code, development activity, and team dynamics using data made available publicly through the GitHub API, as well as article metadata. We found key relationships within our dataset, including: certain scientific topics are associated with more active code development and higher community interest in the repository; most of the code in the main dataset is written in dynamically typed languages, while most of the code in the high-profile set is statically typed; developer team size is associated with community engagement and high-profile repositories have larger teams; the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists; and, multiple measures of project impact are associated with the simple variable of whether the code was modified at all after paper publication. In addition to providing the first large-scale analysis of bioinformatics code to our knowledge, our work will enable future analysis through publicly available data, code, and methods. Code to generate the dataset and reproduce the analysis is provided under the MIT license at https://github.com/pamelarussell/github-bioinformatics. 
Data are available at https://doi.org/10.17605/OSF.IO/UWHX8.
Affiliation(s)
- Pamela H. Russell
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
- Rachel L. Johnson
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
- Shreyas Ananthan
- High-Performance Algorithms and Complex Fluids, National Renewable Energy Laboratory, Golden, CO, United States of America
- Benjamin Harnke
- Health Sciences Library, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
- Nichole E. Carlson
- Department of Biostatistics and Informatics, Colorado School of Public Health, Aurora, CO, United States of America
10
Kulkarni N, Alessandrì L, Panero R, Arigoni M, Olivero M, Ferrero G, Cordero F, Beccuti M, Calogero RA. Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines. BMC Bioinformatics 2018; 19:349. [PMID: 30367595 PMCID: PMC6191970 DOI: 10.1186/s12859-018-2296-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Indexed: 12/11/2022] Open
Abstract
Background: Reproducibility of a research is a key element in the modern science and it is mandatory for any industrial application. It represents the ability of replicating an experiment independently by the location and the operator. Therefore, a study can be considered reproducible only if all used data are available and the exploited computational analysis workflow is clearly described. However, today for reproducing a complex bioinformatics analysis, the raw data and the list of tools used in the workflow could be not enough to guarantee the reproducibility of the results obtained. Indeed, different releases of the same tools and/or of the system libraries (exploited by such tools) might lead to sneaky reproducibility issues. Results: To address this challenge, we established the Reproducible Bioinformatics Project (RBP), which is a non-profit and open-source project, whose aim is to provide a schema and an infrastructure, based on Docker images and R package, to provide reproducible results in Bioinformatics. One or more Docker images are then defined for a workflow (typically one for each task), while the workflow implementation is handled via R-functions embedded in a package available at GitHub repository. Thus, a bioinformatician participating to the project has firstly to integrate her/his workflow modules into Docker image(s) exploiting an Ubuntu Docker image developed ad hoc by RBP to make easier this task. Secondly, the workflow implementation must be realized in R according to an R-skeleton function made available by RBP to guarantee homogeneity and reusability among different RBP functions. Moreover she/he has to provide the R vignette explaining the package functionality together with an example dataset which can be used to improve the user confidence in the workflow utilization. Conclusions: Reproducible Bioinformatics Project provides a general schema and an infrastructure to distribute robust and reproducible workflows. Thus, it guarantees to final users the ability to repeat consistently any analysis independently by the used UNIX-like architecture.
Affiliation(s)
- Neha Kulkarni
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Luca Alessandrì
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Riccardo Panero
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Maddalena Arigoni
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
- Martina Olivero
- Department of Oncology, University of Torino, Candiolo, Italy
- Giulio Ferrero
- Department of Computer Sciences, University of Torino, Torino, Italy
- Francesca Cordero
- Department of Computer Sciences, University of Torino, Torino, Italy
- Marco Beccuti
- Department of Computer Sciences, University of Torino, Torino, Italy
- Raffaele A Calogero
- Department of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
11
Abstract
Like other types of computational research, modeling and simulation of biological processes (biomodels) is still largely communicated without sufficient detail to allow independent reproduction of results. But reproducibility in this area of research could easily be achieved by making use of existing resources, such as supplying models in standard formats and depositing code, models, and results in public repositories.
12
Nüst D, Granell C, Hofer B, Konkol M, Ostermann FO, Sileryte R, Cerutti V. Reproducible research and GIScience: an evaluation using AGILE conference papers. PeerJ 2018; 6:e5072. [PMID: 30013826 PMCID: PMC6047504 DOI: 10.7717/peerj.5072] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 03/13/2018] [Accepted: 06/04/2018] [Indexed: 11/20/2022] Open
Abstract
The demand for reproducible research is on the rise in disciplines concerned with data analysis and computational methods. Therefore, we reviewed current recommendations for reproducible research and translated them into criteria for assessing the reproducibility of articles in the field of geographic information science (GIScience). Using these criteria, we assessed a sample of GIScience studies from the Association of Geographic Information Laboratories in Europe (AGILE) conference series, and we collected feedback about the assessment from the study authors. Results from the author feedback indicate that although authors support the concept of performing reproducible research, the incentives for doing this in practice are too small. Therefore, we propose concrete actions for individual researchers and the GIScience conference series to improve transparency and reproducibility. For example, to support researchers in producing reproducible work, the GIScience conference series could offer awards and paper badges, provide author guidelines for computational research, and publish articles in Open Access formats.
Affiliation(s)
- Daniel Nüst
- Institute for Geoinformatics, University of Münster, Münster, Germany
- Carlos Granell
- Institute of New Imaging Technologies, Universitat Jaume I de Castellón, Castellón, Spain
- Barbara Hofer
- Interfaculty Department of Geoinformatics - Z_GIS, University of Salzburg, Salzburg, Austria
- Markus Konkol
- Institute for Geoinformatics, University of Münster, Münster, Germany
- Frank O. Ostermann
- Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, Enschede, The Netherlands
- Rusne Sileryte
- Faculty of Architecture and the Built Environment, Delft University of Technology, Delft, The Netherlands
- Valentina Cerutti
- Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente, Enschede, The Netherlands
13
Visconti A, Martin TC, Falchi M. YAMP: a containerized workflow enabling reproducibility in metagenomics research. Gigascience 2018; 7:5039705. [PMID: 29917068 PMCID: PMC6047416 DOI: 10.1093/gigascience/giy072] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 01/04/2018] [Revised: 05/01/2018] [Accepted: 06/11/2018] [Indexed: 01/12/2023] Open
Abstract
YAMP ("Yet Another Metagenomics Pipeline") is a user-friendly workflow that enables the analysis of whole shotgun metagenomic data while using containerization to ensure computational reproducibility and facilitate collaborative research. YAMP can be executed on any UNIX-like system and offers seamless support for multiple job schedulers as well as for the Amazon AWS cloud. Although YAMP was developed to be ready to use by nonexperts, bioinformaticians will appreciate its flexibility, modularization, and simple customization.
Affiliation(s)
- Alessia Visconti
- Department of Twin Research and Genetic Epidemiology, King’s College London, Westminster Bridge Road, SE1 7EH, London, UK
- Tiphaine C Martin
- Department of Twin Research and Genetic Epidemiology, King’s College London, Westminster Bridge Road, SE1 7EH, London, UK
- Mario Falchi
- Department of Twin Research and Genetic Epidemiology, King’s College London, Westminster Bridge Road, SE1 7EH, London, UK
14
Kanwal S, Khan FZ, Lonie A, Sinnott RO. Investigating reproducibility and tracking provenance - A genomic workflow case study. BMC Bioinformatics 2017; 18:337. [PMID: 28701218 PMCID: PMC5508699 DOI: 10.1186/s12859-017-1747-0] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 03/23/2017] [Accepted: 07/04/2017] [Indexed: 11/10/2022] Open
Abstract
Background: Computational bioinformatics workflows are extensively used to analyse genomics data, with different approaches available to support implementation and execution of these workflows. Reproducibility is one of the core principles for any scientific workflow and remains a challenge, which is not fully addressed. This is due to incomplete understanding of reproducibility requirements and assumptions of workflow definition approaches. Provenance information should be tracked and used to capture all these requirements supporting reusability of existing workflows. Results: We have implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through implementation, we identified assumptions implicit in these approaches that ultimately produce insufficient documentation of workflow requirements resulting in failed execution of the workflow. This study proposes a set of recommendations that aims to mitigate these assumptions and guides the scientific community to accomplish reproducible science, hence addressing reproducibility crisis. Conclusions: Reproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolving analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that along with an explicit declaration of workflow specification would result in enhanced reproducibility of computational genomic analyses.
Affiliation(s)
- Sehrish Kanwal
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia
- Farah Zaib Khan
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia
- Andrew Lonie
- Melbourne Bioinformatics, The University of Melbourne, Melbourne, VIC, 3010, Australia
- Richard O Sinnott
- Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3010, Australia
15
Beaulieu-Jones BK, Greene CS. Reproducibility of computational workflows is automated using continuous analysis. Nat Biotechnol 2017. [PMID: 28288103 DOI: 10.1038/nbt.3780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.
Affiliation(s)
- Brett K Beaulieu-Jones
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
16
Reproducibility of computational workflows is automated using continuous analysis. Nat Biotechnol 2017; 35:342-346. [PMID: 28288103 DOI: 10.1038/nbt.3780] [Citation(s) in RCA: 86] [Impact Index Per Article: 12.3] [Received: 06/10/2016] [Accepted: 12/22/2016] [Indexed: 11/08/2022]
17
Abstract
When reporting research findings, scientists document the steps they followed so that others can verify and build upon the research. When those steps have been described in sufficient detail that others can retrace the steps and obtain similar results, the research is said to be reproducible. Computers play a vital role in many research disciplines and present both opportunities and challenges for reproducibility. Computers can be programmed to execute analysis tasks, and those programs can be repeated and shared with others. The deterministic nature of most computer programs means that the same analysis tasks, applied to the same data, will often produce the same outputs. However, in practice, computational findings often cannot be reproduced because of complexities in how software is packaged, installed, and executed, and because of limitations associated with how scientists document analysis steps. Many tools and techniques are available to help overcome these challenges; here we describe seven such strategies. With a broad scientific audience in mind, we describe the strengths and limitations of each approach, as well as the circumstances under which each might be applied. No single strategy is sufficient for every scenario; thus we emphasize that it is often useful to combine approaches.
Affiliation(s)
- Stephen R Piccolo
- Department of Biology, Brigham Young University, Provo, UT, 84602, USA
- Michael B Frampton
- Department of Computer Science, Brigham Young University, Provo, UT, USA
18
Hofner B, Schmid M, Edler L. Reproducible research in statistics: A review and guidelines for the Biometrical Journal. Biom J 2015; 58:416-27. [DOI: 10.1002/bimj.201500156] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 07/31/2015] [Revised: 10/06/2015] [Accepted: 11/17/2015] [Indexed: 11/10/2022]
Affiliation(s)
- Benjamin Hofner
- Department of Medical Informatics, Biometry and Epidemiology; Friedrich-Alexander University; Erlangen-Nuremberg, Waldstraße 6 91054 Erlangen Germany
- Matthias Schmid
- Department of Medical Biometry, Informatics and Epidemiology; University of Bonn; Sigmund-Freud-Straße 25 53127 Bonn Germany
- Lutz Edler
- Division of Biostatistics-C060; German Cancer Research Center; Im Neuenheimer Feld 581 69120 Heidelberg Germany
19
|
Schuster BS, Ensign LM, Allan DB, Suk JS, Hanes J. Particle tracking in drug and gene delivery research: State-of-the-art applications and methods. Adv Drug Deliv Rev 2015; 91:70-91. [PMID: 25858664 DOI: 10.1016/j.addr.2015.03.017] [Citation(s) in RCA: 92] [Impact Index Per Article: 10.2] [Received: 01/14/2015] [Revised: 03/25/2015] [Accepted: 03/27/2015] [Indexed: 01/17/2023]
Abstract
Particle tracking is a powerful microscopy technique to quantify the motion of individual particles at high spatial and temporal resolution in complex fluids and biological specimens. Particle tracking's applications and impact in drug and gene delivery research have greatly increased during the last decade. Thanks to advances in hardware and software, this technique is now more accessible than ever, and can be reliably automated to enable rapid processing of large data sets, thereby further enhancing the role that particle tracking will play in drug and gene delivery studies in the future. We begin this review by discussing particle tracking-based advances in characterizing extracellular and cellular barriers to therapeutic nanoparticles and in characterizing nanoparticle size and stability. To facilitate wider adoption of the technique, we then present a user-friendly review of state-of-the-art automated particle tracking algorithms and methods of analysis. We conclude by reviewing technological developments for next-generation particle tracking methods, and we survey future research directions in drug and gene delivery where particle tracking may be useful.
Affiliation(s)
- Benjamin S Schuster
  - Center for Nanomedicine, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA
  - Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Laura M Ensign
  - Center for Nanomedicine, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA
- Daniel B Allan
  - Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, USA
- Jung Soo Suk
  - Center for Nanomedicine, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA
- Justin Hanes
  - Center for Nanomedicine, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA
  - Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
  - Department of Ophthalmology, Johns Hopkins University School of Medicine, Baltimore, MD 21231, USA
|
20
|
Abstract
The ability to self-correct is considered a hallmark of science. However, self-correction does not always happen to scientific evidence by default. The trajectory of scientific credibility can fluctuate over time, both for defined scientific fields and for science at-large. History suggests that major catastrophes in scientific credibility are unfortunately possible and the argument that "it is obvious that progress is made" is weak. Careful evaluation of the current status of credibility of various scientific fields is important in order to understand any credibility deficits and how one could obtain and establish more trustworthy results. Efficient and unbiased replication mechanisms are essential for maintaining high levels of scientific credibility. Depending on the types of results obtained in the discovery and replication phases, there are different paradigms of research: optimal, self-correcting, false nonreplication, and perpetuated fallacy. In the absence of replication efforts, one is left with unconfirmed (genuine) discoveries and unchallenged fallacies. In several fields of investigation, including many areas of psychological science, perpetuated and unchallenged fallacies may comprise the majority of the circulating evidence. I catalogue a number of impediments to self-correction that have been empirically studied in psychological science. Finally, I discuss some proposed solutions to promote sound replication practices enhancing the credibility of scientific results as well as some potential disadvantages of each of them. Any deviation from the principle that seeking the truth has priority over any other goals may be seriously damaging to the self-correcting functions of science.
Affiliation(s)
- John P A Ioannidis
  - Stanford Prevention Research Center, Department of Medicine and Department of Health Research and Policy, Stanford University School of Medicine, and Department of Statistics, Stanford University School of Humanities and Sciences
|
21
|
Barbieri RB, Bufalo NE, Secolin R, Assumpção LVM, Maciel RMB, Cerutti JM, Ward LS. Polymorphisms of cell cycle control genes influence the development of sporadic medullary thyroid carcinoma. Eur J Endocrinol 2014; 171:761-7. [PMID: 25565272 DOI: 10.1530/eje-14-0461] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 11/08/2022]
Abstract
BACKGROUND The role of key cell cycle regulation genes such as CDKN1B, CDKN2A, CDKN2B, and CDKN2C in sporadic medullary thyroid carcinoma (s-MTC) is still largely unknown. METHODS In order to evaluate the influence of inherited polymorphisms of these genes on the pathogenesis of s-MTC, we used TaqMan SNP genotyping to examine 45 s-MTC patients carefully matched with 98 controls. RESULTS A multivariate logistic regression analysis demonstrated that CDKN1B and CDKN2A genes were related to s-MTC susceptibility. The rs2066827*GT+GG CDKN1B genotype was more frequent in s-MTC patients (62.22%) than in controls (40.21%), increasing the susceptibility to s-MTC (OR=2.47; 95% CI=1.048-5.833; P=0.038). By contrast, the rs11515*CG+GG of CDKN2A gene was more frequent in the controls (32.65%) than in patients (15.56%), reducing the risk for s-MTC (OR=0.174; 95% CI=0.048-0.627; P=0.0075). A stepwise regression analysis indicated that two genotypes together could explain 11% of the total s-MTC risk. In addition, a relationship was found between disease progression and the presence of alterations in the CDKN1A (rs1801270), CDKN2C (rs12885), and CDKN2B (rs1063192) genes. WT rs1801270 CDKN1A patients presented extrathyroidal tumor extension more frequently (92%) than polymorphic CDKN1A rs1801270 patients (50%; P=0.0376). Patients with the WT CDKN2C gene (rs12885) presented larger tumors (2.9±1.8 cm) than polymorphic patients (1.5±0.7 cm; P=0.0324). On the other hand, patients with the polymorphic CDKN2B gene (rs1063192) presented distant metastases (36.3%; P=0.0261). CONCLUSION In summary, we demonstrated that CDKN1B and CDKN2A genes are associated with susceptibility, whereas the inherited genetic profile of CDKN1A, CDKN2B, and CDKN2C is associated with aggressive features of tumors. This study suggests that profiling cell cycle genes may help define the risk and characterize s-MTC aggressiveness.
Affiliation(s)
- R B Barbieri
  - University of Campinas (FCM - Unicamp), 126 Tessalia Vieira de Camargo Street, Cidade Universitaria Zeferino Vaz, Campinas - São Paulo, 13083-887, Brazil
  - Federal University of Sao Paulo (Unifesp), 669 Pedro Toledo Street, São Paulo-SP 04039-032, Brazil
- N E Bufalo
  - University of Campinas (FCM - Unicamp), 126 Tessalia Vieira de Camargo Street, Cidade Universitaria Zeferino Vaz, Campinas - São Paulo, 13083-887, Brazil
  - Federal University of Sao Paulo (Unifesp), 669 Pedro Toledo Street, São Paulo-SP 04039-032, Brazil
- R Secolin
  - University of Campinas (FCM - Unicamp), 126 Tessalia Vieira de Camargo Street, Cidade Universitaria Zeferino Vaz, Campinas - São Paulo, 13083-887, Brazil
  - Federal University of Sao Paulo (Unifesp), 669 Pedro Toledo Street, São Paulo-SP 04039-032, Brazil
- L V M Assumpção
  - University of Campinas (FCM - Unicamp), 126 Tessalia Vieira de Camargo Street, Cidade Universitaria Zeferino Vaz, Campinas - São Paulo, 13083-887, Brazil
  - Federal University of Sao Paulo (Unifesp), 669 Pedro Toledo Street, São Paulo-SP 04039-032, Brazil
- R M B Maciel
  - University of Campinas (FCM - Unicamp), 126 Tessalia Vieira de Camargo Street, Cidade Universitaria Zeferino Vaz, Campinas - São Paulo, 13083-887, Brazil
  - Federal University of Sao Paulo (Unifesp), 669 Pedro Toledo Street, São Paulo-SP 04039-032, Brazil
- J M Cerutti
  - University of Campinas (FCM - Unicamp), 126 Tessalia Vieira de Camargo Street, Cidade Universitaria Zeferino Vaz, Campinas - São Paulo, 13083-887, Brazil
  - Federal University of Sao Paulo (Unifesp), 669 Pedro Toledo Street, São Paulo-SP 04039-032, Brazil
- L S Ward
  - University of Campinas (FCM - Unicamp), 126 Tessalia Vieira de Camargo Street, Cidade Universitaria Zeferino Vaz, Campinas - São Paulo, 13083-887, Brazil
  - Federal University of Sao Paulo (Unifesp), 669 Pedro Toledo Street, São Paulo-SP 04039-032, Brazil
|
22
|
Rodrigo-Domingo M, Waagepetersen R, Bødker JS, Falgreen S, Kjeldsen MK, Johnsen HE, Dybkær K, Bøgsted M. Reproducible probe-level analysis of the Affymetrix Exon 1.0 ST array with R/Bioconductor. Brief Bioinform 2014; 15:519-33. [PMID: 23603090 PMCID: PMC4103539 DOI: 10.1093/bib/bbt011] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 10/09/2012] [Accepted: 02/15/2013] [Indexed: 12/22/2022] Open
Abstract
The presence of different transcripts of a gene across samples can be analysed by whole-transcriptome microarrays. Reproducing results from published microarray data represents a challenge owing to the vast amounts of data and the large variety of preprocessing and filtering steps used before the actual analysis is carried out. To guarantee a firm basis for methodological development where results with new methods are compared with previous results, it is crucial to ensure that all analyses are completely reproducible for other researchers. We here give a detailed workflow on how to perform reproducible analysis of the GeneChip® Human Exon 1.0 ST Array at probe and probeset level solely in R/Bioconductor, choosing packages based on their simplicity of use. To exemplify the use of the proposed workflow, we analyse differential splicing and differential gene expression in a publicly available dataset using various statistical methods. We believe this study will provide other researchers with an easy way of accessing gene expression data at different annotation levels and with the sufficient details needed for developing their own tools for reproducible analysis of the GeneChip® Human Exon 1.0 ST Array.
|
23
|
Chan AW, Song F, Vickers A, Jefferson T, Dickersin K, Gøtzsche PC, Krumholz HM, Ghersi D, van der Worp HB. Increasing value and reducing waste: addressing inaccessible research. Lancet 2014; 383:257-66. [PMID: 24411650 PMCID: PMC4533904 DOI: 10.1016/s0140-6736(13)62296-5] [Citation(s) in RCA: 531] [Impact Index Per Article: 53.1] [Indexed: 12/12/2022]
Abstract
The methods and results of health research are documented in study protocols, full study reports (detailing all analyses), journal reports, and participant-level datasets. However, protocols, full study reports, and participant-level datasets are rarely available, and journal reports are available for only half of all studies and are plagued by selective reporting of methods and results. Furthermore, information provided in study protocols and reports varies in quality and is often incomplete. When full information about studies is inaccessible, billions of dollars in investment are wasted, bias is introduced, and research and care of patients are detrimentally affected. To help to improve this situation at a systemic level, three main actions are warranted. First, academic institutions and funders should reward investigators who fully disseminate their research protocols, reports, and participant-level datasets. Second, standards for the content of protocols and full study reports and for data sharing practices should be rigorously developed and adopted for all types of health research. Finally, journals, funders, sponsors, research ethics committees, regulators, and legislators should endorse and enforce policies supporting study registration and wide availability of journal reports, full study reports, and participant-level datasets.
Affiliation(s)
- An-Wen Chan
  - Women's College Research Institute, Department of Medicine, Women's College Hospital, University of Toronto, Toronto, ON, Canada.
- Fujian Song
  - Norwich Medical School, Faculty of Medicine and Health Science, University of East Anglia, Norwich, UK
- Andrew Vickers
  - Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Kay Dickersin
  - Center for Clinical Trials, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Harlan M Krumholz
  - Section of Cardiovascular Medicine and the Robert Wood Johnson Foundation Clinical Scholars Program, Department of Medicine, Yale School of Medicine, Yale University, New Haven, CT, USA; Department of Health Policy and Management, Yale School of Public Health, Yale University, New Haven, CT, USA; Center for Outcomes Research and Evaluation, Yale-New Haven Hospital, New Haven, CT, USA
- Davina Ghersi
  - Research Translation Branch, National Health and Medical Research Council, Canberra, ACT, Australia
- H Bart van der Worp
  - Department of Neurology and Neurosurgery, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, Netherlands
|
24
|
Ioannidis JPA, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, Schulz KF, Tibshirani R. Increasing value and reducing waste in research design, conduct, and analysis. Lancet 2014; 383:166-75. [PMID: 24411645 PMCID: PMC4697939 DOI: 10.1016/s0140-6736(13)62227-8] [Citation(s) in RCA: 930] [Impact Index Per Article: 93.0] [Indexed: 12/11/2022]
Abstract
Correctable weaknesses in the design, conduct, and analysis of biomedical and public health research studies can produce misleading results and waste valuable resources. Small effects can be difficult to distinguish from bias introduced by study design and analyses. An absence of detailed written protocols and poor documentation of research is common. Information obtained might not be useful or important, and statistical precision or power is often too low or used in a misleading way. Insufficient consideration might be given to both previous and continuing studies. Arbitrary choice of analyses and an overemphasis on random extremes might affect the reported findings. Several problems relate to the research workforce, including failure to involve experienced statisticians and methodologists, failure to train clinical researchers and laboratory scientists in research methods and design, and the involvement of stakeholders with conflicts of interest. Inadequate emphasis is placed on recording of research decisions and on reproducibility of research. Finally, reward systems incentivise quantity more than quality, and novelty more than reliability. We propose potential solutions for these problems, including improvements in protocols and documentation, consideration of evidence from studies in progress, standardisation of research efforts, optimisation and training of an experienced and non-conflicted scientific workforce, and reconsideration of scientific reward systems.
Affiliation(s)
- John P A Ioannidis
  - Stanford Prevention Research Center, Department of Medicine, School of Medicine, Stanford University, Stanford, CA, USA; Division of Epidemiology, School of Medicine, Stanford University, Stanford, CA, USA; Department of Statistics, School of Humanities and Sciences, Stanford University, Stanford, CA, USA; Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, CA, USA.
- Sander Greenland
  - Department of Epidemiology and Department of Statistics, UCLA School of Public Health, Los Angeles, CA, USA
- Mark A Hlatky
  - Division of Cardiovascular Medicine, Department of Medicine, School of Medicine, Stanford University, Stanford, CA, USA; Division of Health Services Research, Stanford University, Stanford, CA, USA
- Muin J Khoury
  - Office of Public Health Genomics, Centers for Disease Control and Prevention, Atlanta, GA, USA; Epidemiology and Genomics Research Program, National Cancer Institute, Rockville, MD, USA
- Malcolm R Macleod
  - Department of Clinical Neurosciences, University of Edinburgh School of Medicine, Edinburgh, UK
- David Moher
  - Clinical Epidemiology Program, Ottawa Hospital Research Institute, University of Ottawa, Ottawa, ON, Canada; Department of Epidemiology and Community Medicine, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
- Kenneth F Schulz
  - FHI 360, Durham, NC, USA; Department of Obstetrics and Gynecology, University of North Carolina School of Medicine, Chapel Hill, NC, USA
- Robert Tibshirani
  - Department of Health Research and Policy, Stanford University, Stanford, CA, USA; Department of Statistics, School of Humanities and Sciences, Stanford University, Stanford, CA, USA
|
25
|
Garijo D, Kinnings S, Xie L, Xie L, Zhang Y, Bourne PE, Gil Y. Quantifying reproducibility in computational biology: the case of the tuberculosis drugome. PLoS One 2013; 8:e80278. [PMID: 24312207 PMCID: PMC3842296 DOI: 10.1371/journal.pone.0080278] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Received: 09/18/2012] [Accepted: 10/10/2013] [Indexed: 11/29/2022] Open
Abstract
How easy is it to reproduce the results found in a typical computational biology paper? Either through experience or intuition the reader will already know that the answer is with difficulty or not at all. In this paper we attempt to quantify this difficulty by reproducing a previously published paper for different classes of users (ranging from users with little expertise to domain experts) and suggest ways in which the situation might be improved. Quantification is achieved by estimating the time required to reproduce each of the steps in the method described in the original paper and make them part of an explicit workflow that reproduces the original results. Reproducing the method took several months of effort, and required using new versions and new software that posed challenges to reconstructing and validating the results. The quantification leads to "reproducibility maps" that reveal that novice researchers would only be able to reproduce a few of the steps in the method, and that only expert researchers with advance knowledge of the domain would be able to reproduce the method in its entirety. The workflow itself is published as an online resource together with supporting software and data. The paper concludes with a brief discussion of the complexities of requiring reproducibility in terms of cost versus benefit, and a desiderata with our observations and guidelines for improving reproducibility. This has implications not only in reproducing the work of others from published papers, but reproducing work from one's own laboratory.
Affiliation(s)
- Daniel Garijo
  - Ontology Engineering Group, Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain
- Sarah Kinnings
  - Department of Chemistry and Biochemistry, University of California San Diego, La Jolla, California, United States of America
- Li Xie
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
- Lei Xie
  - Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Yinliang Zhang
  - School of Life Sciences, University of Science and Technology of China, Hefei, Anhui, China
- Philip E. Bourne
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
- Yolanda Gil
  - Information Sciences Institute and Department of Computer Science, University of Southern California, Los Angeles, California, United States of America
|
26
|
Boulesteix AL. On representative and illustrative comparisons with real data in bioinformatics: response to the letter to the editor by Smith et al. Bioinformatics 2013; 29:2664-6. [PMID: 23929033 DOI: 10.1093/bioinformatics/btt458] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022] Open
Affiliation(s)
- Anne-Laure Boulesteix
  - Department of Medical Informatics, Biometry and Epidemiology, University of Munich, 81377 Munich, Germany
|
27
|
Barbieri RB, Bufalo NE, Cunha LL, Assumpção LVM, Maciel RMB, Cerutti JM, Ward LS. Genes of detoxification are important modulators of hereditary medullary thyroid carcinoma risk. Clin Endocrinol (Oxf) 2013; 79:288-93. [PMID: 23278115 DOI: 10.1111/cen.12136] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 10/02/2012] [Revised: 10/29/2012] [Accepted: 12/17/2012] [Indexed: 01/12/2023]
Abstract
CONTEXT Different inherited profiles of genes involved in cellular mechanisms of activation and detoxification of carcinogenic products can provide specific protection or determine the risk for cancer. Low-penetrance polymorphic genes related to the biotransformation of environmental toxins have been associated with susceptibility to, and the phenotype of, human tumours. OBJECTIVE To investigate the role of germline inheritance of polymorphisms in CYP1A2*F, CYP1A1 m1, GSTP1, NAT2 and TP53 genes in hereditary medullary thyroid carcinoma (HMTC) patients. DESIGN This study was developed at the University of Campinas (Unicamp). PATIENTS We studied 132 patients with HMTC, 88 first-degree relatives of HMTC patients and 575 control individuals. MEASUREMENTS All patients with MTC and their relatives were sequenced for the RET gene and five genes were genotyped using the TaqMan® system. RESULTS We observed that the inheritance of CYP1A2*F (OR = 2·10; 95% CI = 1·11-3·97; P = 0·022), GSTP1 (OR = 4·41; 95% CI = 2·47-7·88; P < 0·001) and NAT2 (OR = 2·54; 95% CI = 1·16-5·58; P = 0·020) variants increased the risk for HMTC. In addition, multiple regression analysis showed that the inheritance of GSTP1 polymorphisms was associated with the diagnosis in older patients (B = 8·0229; 95% CI = ± 5·5735; P = 0·0054). Concerning the group of HMTC relatives, CYP1A2*F (OR = 2·40; 95% CI = 1·19-4·86; P = 0·015), CYP1A1 m1 (OR = 2·79; 95% CI = 1·04-7·51; P = 0·042), GSTP1 (OR = 2·86; 95% CI = 1·53-5·32; P < 0·001) and NAT2 (OR = 2·25; 95% CI = 1·20-4·22; P = 0·012) were associated with HMTC risk. CONCLUSIONS We have demonstrated that the inheritance of specific genes determining the individual response to environmental toxins may contribute to the risk and phenotypic variability that exists in patients with HMTC. Moreover, we identified a group at risk in relatives of HMTC patients.
Affiliation(s)
- R B Barbieri
  - Faculty of Medical Sciences, Laboratory of Cancer Molecular Genetics, University of Campinas (FCM-Unicamp), Campinas, Brazil
|
28
|
Boulesteix AL, Lauer S, Eugster MJA. A plea for neutral comparison studies in computational sciences. PLoS One 2013; 8:e61562. [PMID: 23637855 PMCID: PMC3634809 DOI: 10.1371/journal.pone.0061562] [Citation(s) in RCA: 70] [Impact Index Per Article: 6.4] [Received: 12/27/2012] [Accepted: 03/11/2013] [Indexed: 12/04/2022] Open
Abstract
In computational science literature including, e.g., bioinformatics, computational statistics or machine learning, most published articles are devoted to the development of "new methods", while comparison studies are generally appreciated by readers but surprisingly given poor consideration by many journals. This paper stresses the importance of neutral comparison studies for the objective evaluation of existing methods and the establishment of standards by drawing parallels with clinical research. The goal of the paper is twofold. Firstly, we present a survey of recent computational papers on supervised classification published in seven high-ranking computational science journals. The aim is to provide an up-to-date picture of current scientific practice with respect to the comparison of methods in both articles presenting new methods and articles focusing on the comparison study itself. Secondly, based on the results of our survey we critically discuss the necessity, impact and limitations of neutral comparison studies in computational sciences. We define three reasonable criteria a comparison study has to fulfill in order to be considered as neutral, and explicate general considerations on the individual components of a "tidy neutral comparison study". R codes for completely replicating our statistical analyses and figures are available from the companion website http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/plea2013.
Affiliation(s)
- Anne-Laure Boulesteix
  - Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University of Munich, Munich, Germany.
|
29
|
Vaughan LK, Srinivasasainagendra V. Where in the genome are we? A cautionary tale of database use in genomics research. Front Genet 2013; 4:38. [PMID: 23519237 PMCID: PMC3604632 DOI: 10.3389/fgene.2013.00038] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/14/2012] [Accepted: 03/04/2013] [Indexed: 11/20/2022] Open
Abstract
With the advent of high throughput data genomic technologies the volume of available data is now staggering. In addition databases that provide resources to annotate, translate, and connect biological data have grown exponentially in content and use. The availability of such data emphasizes the importance of bioinformatics and computational biology in genomics research and has led to the development of thousands of tools to integrate and utilize these resources. When utilizing such resources, the principles of reproducible research are often overlooked. In this manuscript we provide selected case studies illustrating issues that may arise while working with genes and genetic polymorphisms. These case studies illustrate potential sources of error which can be introduced if the practices of reproducible research are not employed and non-concurrent databases are used. We also show examples of a lack of transparency when these databases are concerned when using popular bioinformatics tools. These examples highlight that resources are constantly evolving, and in order to provide reproducible results, research should be aware of and connected to the correct release of the data, particularly when implementing computational tools.
Affiliation(s)
- Laura K Vaughan
  - Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
|
30
|
Abstract
MOTIVATION A common practice in biomarker discovery is to decide whether a large laboratory experiment should be carried out based on the results of a preliminary study on a small set of specimens. Consideration of the efficacy of this approach motivates the introduction of a probabilistic measure, for whether a classifier showing promising results in a small-sample preliminary study will perform similarly on a large independent sample. Given the error estimate from the preliminary study, if the probability of reproducible error is low, then there is really no purpose in substantially allocating more resources to a large follow-on study. Indeed, if the probability of the preliminary study providing likely reproducible results is small, then why even perform the preliminary study? RESULTS This article introduces a reproducibility index for classification, measuring the probability that a sufficiently small error estimate on a small sample will motivate a large follow-on study. We provide a simulation study based on synthetic distribution models that possess known intrinsic classification difficulties and emulate real-world scenarios. We also set up similar simulations on four real datasets to show the consistency of results. The reproducibility indices for different distributional models, real datasets and classification schemes are empirically calculated. The effects of reporting and multiple-rule biases on the reproducibility index are also analyzed. AVAILABILITY We have implemented in C code the synthetic data distribution model, classification rules, feature selection routine and error estimation methods. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi12a/.
Affiliation(s)
- Mohammadmahdi R Yousefi
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
31
|
|
32
|
Barbieri RB, Bufalo NE, Secolin R, Silva ACN, Assumpção LVM, Maciel RMB, Cerutti JM, Ward LS. Evidence that polymorphisms in detoxification genes modulate the susceptibility for sporadic medullary thyroid carcinoma. Eur J Endocrinol 2012; 166:241-5. [PMID: 22048975 DOI: 10.1530/eje-11-0843] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
Abstract
AIM Polymorphic low-penetrance genes have been consistently associated with the susceptibility to a series of human tumors, including differentiated thyroid cancer. METHODS To determine their role in medullary thyroid cancer (MTC), we used TaqMan SNP method to genotype 47 sporadic MTC (s-MTC) and a control group of 578 healthy individuals for CYP1A2*F, CYP1A1m1, GSTP1, NAT2 and 72TP53. A logistic regression analysis showed that NAT2C/C (OR=3.87; 95% CI=2.11-7.10; P=2.2×10⁻⁵) and TP53C/C genotypes (OR=3.87; 95% CI=1.78-6.10; P=2.8×10⁻⁴) inheritance increased the risk of s-MTC. A stepwise regression analysis indicated that TP53C/C genotype contributes with 8.07% of the s-MTC risk. RESULTS We were unable to identify any relationship between NAT2 and TP53 polymorphisms suggesting they are independent factors of risk to s-MTC. In addition, there was no association between the investigated genes and clinical or pathological features of aggressiveness of the tumors or the outcome of MTC patients. CONCLUSION In conclusion, we demonstrated that detoxification genes and apoptotic and cell cycle control genes are involved in the susceptibility of s-MTC and may modulate the susceptibility to the disease.
Affiliation(s)
- R B Barbieri
  - Laboratory of Molecular Genetics Cancer, Faculty of Medical Sciences, University of Campinas, PO Box 6111, Campinas, São Paulo, Brazil
|
33
|
Leisch F, Eugster M, Hothorn T. Executable Papers for the R Community: The R2 Platform for Reproducible Research. Procedia Comput Sci 2011. [DOI: 10.1016/j.procs.2011.04.065] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
|