1
Ghanegolmohammadi F, Eslami M, Ohya Y. Systematic data analysis pipeline for quantitative morphological cell phenotyping. Comput Struct Biotechnol J 2024; 23:2949-2962. [PMID: 39104709 PMCID: PMC11298594 DOI: 10.1016/j.csbj.2024.07.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 07/09/2024] [Accepted: 07/10/2024] [Indexed: 08/07/2024] Open
Abstract
Quantitative morphological phenotyping (QMP) is an image-based method used to capture morphological features at both the cellular and population level. Its interdisciplinary nature, spanning from data collection to result analysis and interpretation, can lead to uncertainties, particularly among those new to this actively growing field. High analytical specificity for a typical QMP is achieved through sophisticated approaches that can leverage subtle cellular morphological changes. Here, we outline a systematic workflow to refine the QMP methodology. For a practical review, we describe the main steps of a typical QMP; in each step, we discuss the available methods, their applications, advantages, and disadvantages, along with the R functions and packages for easy implementation. This review does not cover theoretical backgrounds, but provides several references for interested researchers. It aims to broaden the horizons for future phenome studies and demonstrate how to exploit years of endeavors to achieve more with less.
Affiliation(s)
- Farzan Ghanegolmohammadi
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
- Mohammad Eslami
  - Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear Infirmary, Harvard Medical School, Boston, USA
- Yoshikazu Ohya
  - Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
2
Wu NC, Alton L, Bovo RP, Carey N, Currie SE, Lighton JRB, McKechnie AE, Pottier P, Rossi G, White CR, Levesque DL. Reporting guidelines for terrestrial respirometry: Building openness, transparency of metabolic rate and evaporative water loss data. Comp Biochem Physiol A Mol Integr Physiol 2024; 296:111688. [PMID: 38944270 DOI: 10.1016/j.cbpa.2024.111688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 06/24/2024] [Accepted: 06/25/2024] [Indexed: 07/01/2024]
Abstract
Respirometry is an important tool for understanding whole-animal energy and water balance in relation to the environment. Consequently, the growing number of studies using respirometry over the last decade warrants reliable reporting and data sharing for effective dissemination and research synthesis. We provide a checklist guideline on five key sections to facilitate the transparency, reproducibility, and replicability of respirometry studies: 1) materials, set up, plumbing, 2) subject conditions/maintenance, 3) measurement conditions, 4) data processing, and 5) data reporting and statistics, each with explanations and example studies. Transparency in reporting and data availability has benefits on multiple fronts. Authors can use this checklist to design and report on their study, and reviewers and editors can use the checklist to assess the reporting quality of the manuscripts they review. Improved standards for reporting will enhance the value of primary studies and will greatly facilitate the ability to carry out higher quality research syntheses to address ecological and evolutionary theories.
Affiliation(s)
- Nicholas C Wu
  - Hawkesbury Institute for the Environment, Western Sydney University, New South Wales 2753, Australia.
- Lesley Alton
  - Centre for Geometric Biology, School of Biological Sciences, Monash University, Melbourne, VIC 3800, Australia. https://twitter.com/lesley_alton
- Rafael P Bovo
  - Department of Evolution, Ecology, and Organismal Biology, University of California Riverside, Riverside, CA, United States. https://twitter.com/bovo_rp
- Nicholas Carey
  - Marine Directorate for the Scottish Government, Aberdeen, United Kingdom
- Shannon E Currie
  - Institute for Cell and Systems Biology, University of Hamburg, Martin-Luther-King Plz 3, 20146 Hamburg, Germany; School of Biosciences, University of Melbourne, Victoria, Australia. https://twitter.com/batsinthbelfry
- John R B Lighton
  - Sable Systems International, North Las Vegas, NV, United States. https://twitter.com/SableSys
- Andrew E McKechnie
  - South African Research Chair in Conservation Physiology, South African National Biodiversity Institute, South Africa; DSI-NRF Centre of Excellence at the FitzPatrick Institute, Department of Zoology and Entomology, University of Pretoria, South Africa
- Patrice Pottier
  - Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, The University of New South Wales, Sydney, New South Wales, Australia; Division of Ecology and Evolution, Research School of Biology, The Australian National University, Canberra, Australian Capital Territory, Australia. https://twitter.com/PatriceEcoEvo
- Giulia Rossi
  - Department of Biology, McMaster University, Hamilton, Ontario, Canada. https://twitter.com/giuliasrossi
- Craig R White
  - Centre for Geometric Biology, School of Biological Sciences, Monash University, Melbourne, VIC 3800, Australia
- Danielle L Levesque
  - School of Biology and Ecology, University of Maine, Orono, ME, United States. https://twitter.com/dl_levesque
3
Aksenova A, Johny A, Adams T, Gribbon P, Jacobs M, Hofmann-Apitius M. Current state of data stewardship tools in life science. Front Big Data 2024; 7:1428568. [PMID: 39351001 PMCID: PMC11439729 DOI: 10.3389/fdata.2024.1428568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Accepted: 08/23/2024] [Indexed: 10/04/2024] Open
Abstract
In today's data-centric landscape, effective data stewardship is critical for facilitating scientific research and innovation. This article provides an overview of essential tools and frameworks for modern data stewardship practices. Over 300 tools were analyzed in this study, assessing their utility, relevance to data stewardship, and applicability within the life sciences domain.
Affiliation(s)
- Anna Aksenova
  - Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany
- Anoop Johny
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin, Germany
- Tim Adams
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin, Germany
- Phil Gribbon
  - Fraunhofer Institute for Translational Medicine and Pharmacology, Discovery Research Screening Port, Hamburg, Germany
- Marc Jacobs
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin, Germany
- Martin Hofmann-Apitius
  - Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin, Germany
4
Scorza LC, Zieliński T, Kalita I, Lepore A, El Karoui M, Millar AJ. Daily life in the Open Biologist's second job, as a Data Curator. Wellcome Open Res 2024; 9:523. [PMID: 39360219 PMCID: PMC11445645 DOI: 10.12688/wellcomeopenres.22899.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/29/2024] [Indexed: 10/04/2024] Open
Abstract
Background: Data reusability is the driving force of the research data life cycle. However, implementing strategies to generate reusable data from the data creation to the sharing stages is still a significant challenge. Even when datasets supporting a study are publicly shared, the outputs are often incomplete and/or not reusable. The FAIR (Findable, Accessible, Interoperable, Reusable) principles were published as a general guidance to promote data reusability in research, but the practical implementation of FAIR principles in research groups is still falling behind. In biology, the lack of standard practices for a large diversity of data types, data storage and preservation issues, and the lack of familiarity among researchers are some of the main impeding factors to achieve FAIR data. Past literature describes biological curation from the perspective of data resources that aggregate data, often from publications. Methods: Our team works alongside data-generating, experimental researchers so our perspective aligns with publication authors rather than aggregators. We detail the processes for organizing datasets for publication, showcasing practical examples from data curation to data sharing. We also recommend strategies, tools and web resources to maximize data reusability, while maintaining research productivity. Conclusion: We propose a simple approach to address research data management challenges for experimentalists, designed to promote FAIR data sharing. This strategy not only simplifies data management, but also enhances data visibility, recognition and impact, ultimately benefiting the entire scientific community.
Affiliation(s)
- Livia C.T. Scorza
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
- Tomasz Zieliński
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
- Irina Kalita
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
  - Institute of Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3JD, UK
  - Center for Synthetic Microbiology (SYNMIKRO), Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
- Alessia Lepore
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
  - Institute of Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3JD, UK
  - Laboratory for Optics and Biosciences, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, Île-de-France, France
- Meriem El Karoui
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
  - Institute of Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3JD, UK
  - Laboratoire de Biologie et Pharmacologie Appliquée (LBPA), ENS Paris-Saclay CNRS UMR 8113, Paris, Gif-sur-Yvette, France
- Andrew J. Millar
  - Centre for Engineering Biology and School of Biological Sciences, University of Edinburgh, Edinburgh, Scotland, EH9 3BF, UK
5
Abdill RJ, Talarico E, Grieneisen L. A how-to guide for code sharing in biology. PLoS Biol 2024; 22:e3002815. [PMID: 39255324 DOI: 10.1371/journal.pbio.3002815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 09/20/2024] [Indexed: 09/12/2024] Open
Abstract
In 2024, all biology is computational biology. Computer-aided analysis continues to spread into new fields, becoming more accessible to researchers trained in the wet lab who are eager to take advantage of growing datasets, falling costs, and novel assays that present new opportunities for discovery. It is currently much easier to find guidance for implementing these techniques than for reporting their use, leaving biologists to guess which details and files are relevant. In this essay, we review existing literature on the topic, summarize common tips, and link to additional resources for training. Following this overview, we then provide a set of recommendations for sharing code, with an eye toward guiding those who are comparatively new to applying open science principles to their computational work. Taken together, we provide a guide for biologists who seek to follow code sharing best practices but are unsure where to start.
Affiliation(s)
- Richard J Abdill
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, United States of America
- Emma Talarico
  - Department of Biology, University of British Columbia-Okanagan Campus, Kelowna, British Columbia, Canada
- Laura Grieneisen
  - Department of Biology, University of British Columbia-Okanagan Campus, Kelowna, British Columbia, Canada
  - Okanagan Institute for Biodiversity, Resilience, and Ecosystem Services, University of British Columbia-Okanagan Campus, Kelowna, British Columbia, Canada
6
Tiemann JKS, Szczuka M, Bouarroudj L, Oussaren M, Garcia S, Howard RJ, Delemotte L, Lindahl E, Baaden M, Lindorff-Larsen K, Chavent M, Poulain P. MDverse, shedding light on the dark matter of molecular dynamics simulations. eLife 2024; 12:RP90061. [PMID: 39212001 PMCID: PMC11364437 DOI: 10.7554/elife.90061] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 09/04/2024] Open
Abstract
The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations have led to the accumulation of MD files in generalist data repositories, constituting the dark matter of MD: data that is technically accessible, but neither indexed, curated, nor easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2,000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. We identified systems with specific molecular composition and were able to characterize essential parameters of MD simulation such as temperature and simulation length, and could identify model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and to report and standardize metadata to reuse this valuable matter.
Affiliation(s)
- Johanna KS Tiemann
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Magdalena Szczuka
  - Institut de Pharmacologie et Biologie Structurale, CNRS, Université de Toulouse, Toulouse, France
- Lisa Bouarroudj
  - Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
- Rebecca J Howard
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Lucie Delemotte
  - Department of applied physics, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
- Erik Lindahl
  - Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
  - Department of applied physics, Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden
- Marc Baaden
  - Laboratoire de Biochimie Théorique, CNRS, Université Paris Cité, Paris, France
- Kresten Lindorff-Larsen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Matthieu Chavent
  - Institut de Pharmacologie et Biologie Structurale, CNRS, Université de Toulouse, Toulouse, France
- Pierre Poulain
  - Université Paris Cité, CNRS, Institut Jacques Monod, Paris, France
7
Fouad K, Vavrek R, Surles-Zeigler MC, Huie JR, Radabaugh HL, Gurkoff GG, Visser U, Grethe JS, Martone ME, Ferguson AR, Gensel JC, Torres-Espin A. A practical guide to data management and sharing for biomedical laboratory researchers. Exp Neurol 2024; 378:114815. [PMID: 38762093 DOI: 10.1016/j.expneurol.2024.114815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Revised: 05/13/2024] [Accepted: 05/14/2024] [Indexed: 05/20/2024]
Abstract
Effective data management and sharing have become increasingly crucial in biomedical research; however, many laboratory researchers lack the necessary tools and knowledge to address this challenge. This article, produced by practicing scientists, provides laboratory researchers with an introductory guide to research data management (RDM) and to the importance of the FAIR (Findable, Accessible, Interoperable, and Reusable) data-sharing principles. We explore the advantages of implementing organized data management strategies and introduce key concepts such as data standards, data documentation, and the distinction between machine-readable and human-readable data formats. Furthermore, we offer practical guidance for creating a data management plan and establishing efficient data workflows within the laboratory setting, suitable for labs of all sizes. This includes an examination of requirements analysis, the development of a data dictionary for routine data elements, the implementation of unique subject identifiers, and the formulation of standard operating procedures (SOPs) for seamless data flow. To aid researchers in implementing these practices, we present a simple organizational system as an illustrative example, which can be tailored to suit individual needs and research requirements. By presenting a user-friendly approach, this guide serves as an introduction to the field of RDM and offers practical tips to help researchers effortlessly meet the common data management and sharing mandates rapidly becoming prevalent in biomedical research.
Affiliation(s)
- K Fouad
  - Department of Physical Therapy, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, AB, Canada.
- R Vavrek
  - Department of Physical Therapy, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, AB, Canada
- M C Surles-Zeigler
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States
- J R Huie
  - Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States; San Francisco Veterans Affairs Healthcare System, San Francisco, CA, United States
- H L Radabaugh
  - Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States
- G G Gurkoff
  - Center for Neuroscience, University of California Davis, Davis, CA, United States; Department of Neurological Surgery, University of California Davis, Davis, CA, United States; Northern California Veterans Affairs Healthcare System, Martinez, CA, United States
- U Visser
  - Department of Computer Science, University of Miami, Coral Gables, FL, United States
- J S Grethe
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States
- M E Martone
  - Department of Neuroscience, University of California, San Diego, La Jolla, CA, United States; San Francisco Veterans Affairs Healthcare System, San Francisco, CA, United States
- A R Ferguson
  - Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States; San Francisco Veterans Affairs Healthcare System, San Francisco, CA, United States
- J C Gensel
  - Spinal Cord and Brain Injury Research Center and Department of Physiology, University of Kentucky College of Medicine, Lexington, KY, United States.
- A Torres-Espin
  - Department of Physical Therapy, Faculty of Rehabilitation Medicine, University of Alberta, Edmonton, AB, Canada; Department of Neurosurgery, Brain and Spinal Injury Center, Weill Institutes for Neurosciences, University of California, San Francisco, San Francisco, CA, United States; School of Public Health Sciences, University of Waterloo, Waterloo, ON, Canada.
8
Biriukov D, Vácha R. Pathways to a Shiny Future: Building the Foundation for Computational Physical Chemistry and Biophysics in 2050. ACS Physical Chemistry Au 2024; 4:302-313. [PMID: 39069976 PMCID: PMC11274290 DOI: 10.1021/acsphyschemau.4c00003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2024] [Revised: 03/15/2024] [Accepted: 03/18/2024] [Indexed: 07/30/2024]
Abstract
In the last quarter-century, the field of molecular dynamics (MD) has undergone a remarkable transformation, propelled by substantial enhancements in software, hardware, and underlying methodologies. In this Perspective, we contemplate the future trajectory of MD simulations and what they might look like in the year 2050. We spotlight the pivotal role of artificial intelligence (AI) in shaping the future of MD and the broader field of computational physical chemistry. We outline critical strategies and initiatives that are essential for the seamless integration of such technologies. Our discussion delves into topics like multiscale modeling, adept management of the ever-increasing data deluge, the establishment of centralized simulation databases, and the autonomous refinement, cross-validation, and self-expansion of these repositories. The successful implementation of these advancements requires scientific transparency, a cautiously optimistic approach to interpreting AI-driven simulations and their analysis, and a mindset that prioritizes knowledge-motivated research alongside AI-enhanced big data exploration. While history reminds us that the trajectory of technological progress can be unpredictable, this Perspective offers guidance on preparedness and proactive measures, aiming to steer future advancements in the most beneficial and successful direction.
Affiliation(s)
- Denys Biriukov
  - CEITEC - Central European Institute of Technology, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic
- Robert Vácha
  - CEITEC - Central European Institute of Technology, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic
  - Department of Condensed Matter Physics, Faculty of Science, Masaryk University, Kotlářská 267/2, 611 37 Brno, Czech Republic
9
Martorelli I, Pooryousefi A, van Thiel H, Sicking FJ, Ramackers GJ, Merckx V, Verbeek FJ. Multiple graphical views for automatically generating SQL for the MycoDiversity DB; making fungal biodiversity studies accessible. Biodivers Data J 2024; 12:e119660. [PMID: 38933486 PMCID: PMC11199959 DOI: 10.3897/bdj.12.e119660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 06/06/2024] [Indexed: 06/28/2024] Open
Abstract
Fungi are a highly diverse group of eukaryotic organisms that live under an extremely wide range of environmental conditions. Nowadays, there is a fundamental focus on observing how biodiversity varies on different spatial scales, in addition to understanding the environmental factors which drive fungal biodiversity. Metabarcoding is a high-throughput DNA sequencing technology that has positively contributed to observing fungal communities in environments. While the DNA sequencing data generated from metabarcoding studies are available in public archives, this valuable data resource is not directly usable for fungal biodiversity investigation. Additionally, due to its fragmented storage and distributed nature, it is not immediately accessible through a single user interface. We developed the MycoDiversity DataBase User Interface (https://mycodiversity.liacs.nl) to provide direct access and retrieval of fungal data that was previously inaccessible in the public domain. The user interface provides multiple graphical views of the data components used to reveal fungal biodiversity. These components include reliable geo-location terms, the reference taxonomic scientific names associated with fungal species and the standard features describing the environment where they occur. Direct observation of the public DNA sequencing data in association with fungi is accessible through SQL search queries created by interactively manipulating topological maps and dynamic hierarchical tree views. The search results are presented in configurable data table views that can be downloaded for further use. With the MycoDiversity DataBase User Interface, we make fungal biodiversity data accessible, assisting researchers and other stakeholders in using metabarcoding studies for assessing fungal biodiversity.
Affiliation(s)
- Irene Martorelli
  - Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, Netherlands
  - Naturalis Biodiversity Center, Leiden, Netherlands
- Aram Pooryousefi
  - Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, Netherlands
- Haike van Thiel
  - Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, Netherlands
- Floris J Sicking
  - Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, Netherlands
- Guus J Ramackers
  - Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, Netherlands
- Vincent Merckx
  - Naturalis Biodiversity Center, Leiden, Netherlands
  - Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, Netherlands
- Fons J Verbeek
  - Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, Netherlands
10
Tiemann JKS, Szczuka M, Bouarroudj L, Oussaren M, Garcia S, Howard RJ, Delemotte L, Lindahl E, Baaden M, Lindorff-Larsen K, Chavent M, Poulain P. MDverse: Shedding Light on the Dark Matter of Molecular Dynamics Simulations. bioRxiv: the preprint server for biology 2024:2023.05.02.538537. [PMID: 37205542 PMCID: PMC10187166 DOI: 10.1101/2023.05.02.538537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
The rise of open science and the absence of a global dedicated data repository for molecular dynamics (MD) simulations have led to the accumulation of MD files in generalist data repositories, constituting the dark matter of MD: data that is technically accessible, but neither indexed, curated, nor easily searchable. Leveraging an original search strategy, we found and indexed about 250,000 files and 2,000 datasets from Zenodo, Figshare and Open Science Framework. With a focus on files produced by the Gromacs MD software, we illustrate the potential offered by the mining of publicly available MD data. We identified systems with specific molecular composition and were able to characterize essential parameters of MD simulation such as temperature and simulation length, and could identify model resolution, such as all-atom and coarse-grain. Based on this analysis, we inferred metadata to propose a search engine prototype to explore the MD data. To continue in this direction, we call on the community to pursue the effort of sharing MD data, and to report and standardize metadata to reuse this valuable matter.
11
Bibik P, Alibai S, Pandini A, Dantu SC. PyCoM: a python library for large-scale analysis of residue-residue coevolution data. Bioinformatics 2024; 40:btae166. [PMID: 38532297 PMCID: PMC11009027 DOI: 10.1093/bioinformatics/btae166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/02/2024] [Accepted: 03/25/2024] [Indexed: 03/28/2024] Open
Abstract
MOTIVATION: Computational methods to detect correlated amino acid positions in proteins have become a valuable tool to predict intra- and inter-residue protein contacts, protein structures, and effects of mutation on protein stability and function. While there are many tools and webservers to compute coevolution scoring matrices, there is no central repository of alignments and coevolution matrices for large-scale studies and pattern detection leveraging on biological and structural annotations already available in UniProt. RESULTS: We present a Python library, PyCoM, which enables users to query and analyze coevolution matrices and sequence alignments of 457 622 proteins, selected from UniProtKB/Swiss-Prot database (length ≤ 500 residues), from a precompiled coevolution matrix database (PyCoMdb). PyCoM facilitates the development of statistical analyses of residue coevolution patterns using filters on biological and structural annotations from UniProtKB/Swiss-Prot, with simple access to PyCoMdb for both novice and advanced users, supporting Jupyter Notebooks, Python scripts, and a web API access. The resource is open source and will help in generating data-driven computational models and methods to study and understand protein structures, stability, function, and design. AVAILABILITY AND IMPLEMENTATION: PyCoM code is freely available from https://github.com/scdantu/pycom and PyCoMdb and the Jupyter Notebook tutorials are freely available from https://pycom.brunel.ac.uk.
Collapse
Affiliation(s)
- Philipp Bibik
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
| | - Sabriyeh Alibai
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
| | - Alessandro Pandini
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
| | - Sarath Chandra Dantu
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
|
12
|
Emissah H, Ljungquist B, Ascoli GA. Bibliometric analysis of neuroscience publications quantifies the impact of data sharing. Bioinformatics 2023; 39:btad746. [PMID: 38070153 PMCID: PMC10733721 DOI: 10.1093/bioinformatics/btad746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 11/01/2023] [Accepted: 12/07/2023] [Indexed: 12/19/2023] Open
Abstract
SUMMARY Neural morphology, the branching geometry of brain cells, is an essential cellular substrate of nervous system function and pathology. Despite the accelerating production of digital reconstructions of neural morphology, the public accessibility of data remains a core issue in neuroscience. Deficiencies in the availability of existing data create redundancy of research efforts and limit synergy. We carried out a comprehensive bibliometric analysis of neural morphology publications to quantify the impact of data sharing in the neuroscience community. Our findings demonstrate that sharing digital reconstructions of neural morphology via NeuroMorpho.Org leads to a significant increase of citations to the original article, thus directly benefiting authors. The rate of data reusage remains constant for at least 16 years after sharing (the whole period analyzed), altogether nearly doubling the peer-reviewed discoveries in the field. Furthermore, the recent availability of larger and more numerous datasets fostered integrative applications, which accrue on average twice the citations of re-analyses of individual datasets. We also released an open-source citation tracking web-service allowing researchers to monitor reusage of their datasets in independent peer-reviewed reports. These results and tools can facilitate the recognition of shared data reuse for merit evaluations and funding decisions. AVAILABILITY AND IMPLEMENTATION The application is available at: http://cng-nmo-dev3.orc.gmu.edu:8181/. The source code at https://github.com/HerveEmissah/nmo-authors-app and https://github.com/HerveEmissah/nmo-bibliometric-analysis.
Affiliation(s)
- Herve Emissah
- Bioinformatics Program, College of Science, George Mason University, Fairfax, VA 22030, United States
- Center for Neural Informatics, Structures, & Plasticity (CN3) and Bioengineering Department, College of Engineering & Computing, George Mason University, Fairfax, VA 22030, United States
| | - Bengt Ljungquist
- Center for Neural Informatics, Structures, & Plasticity (CN3) and Bioengineering Department, College of Engineering & Computing, George Mason University, Fairfax, VA 22030, United States
| | - Giorgio A Ascoli
- Bioinformatics Program, College of Science, George Mason University, Fairfax, VA 22030, United States
- Center for Neural Informatics, Structures, & Plasticity (CN3) and Bioengineering Department, College of Engineering & Computing, George Mason University, Fairfax, VA 22030, United States
|
13
|
Way GP, Sailem H, Shave S, Kasprowicz R, Carragher NO. Evolution and impact of high content imaging. SLAS DISCOVERY : ADVANCING LIFE SCIENCES R & D 2023; 28:292-305. [PMID: 37666456 DOI: 10.1016/j.slasd.2023.08.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 08/09/2023] [Accepted: 08/29/2023] [Indexed: 09/06/2023]
Abstract
The field of high content imaging has steadily evolved and expanded substantially across many industry and academic research institutions since it was first described in the early 1990's. High content imaging refers to the automated acquisition and analysis of microscopic images from a variety of biological sample types. Integration of high content imaging microscopes with multiwell plate handling robotics enables high content imaging to be performed at scale and support medium- to high-throughput screening of pharmacological, genetic and diverse environmental perturbations upon complex biological systems ranging from 2D cell cultures to 3D tissue organoids to small model organisms. In this perspective article the authors provide a collective view on the following key discussion points relevant to the evolution of high content imaging:
• Evolution and impact of high content imaging: An academic perspective
• Evolution and impact of high content imaging: An industry perspective
• Evolution of high content image analysis
• Evolution of high content data analysis pipelines towards multiparametric and phenotypic profiling applications
• The role of data integration and multiomics
• The role and evolution of image data repositories and sharing standards
• Future perspective of high content imaging hardware and software.
Affiliation(s)
- Gregory P Way
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Heba Sailem
- School of Cancer and Pharmaceutical Sciences, King's College London, UK
| | - Steven Shave
- GlaxoSmithKline Medicines Research Centre, Gunnels Wood Rd, Stevenage SG1 2NY, UK; Edinburgh Cancer Research, Cancer Research UK Scotland Centre, Institute of Genetics and Cancer, University of Edinburgh, UK
| | - Richard Kasprowicz
- GlaxoSmithKline Medicines Research Centre, Gunnels Wood Rd, Stevenage SG1 2NY, UK
| | - Neil O Carragher
- Edinburgh Cancer Research, Cancer Research UK Scotland Centre, Institute of Genetics and Cancer, University of Edinburgh, UK.
|
14
|
Emissah H, Ljungquist B, Ascoli GA. Bibliometric analysis of neuroscience publications quantifies the impact of data sharing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.12.557386. [PMID: 37745378 PMCID: PMC10515804 DOI: 10.1101/2023.09.12.557386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Motivation Neural morphology, the branching geometry of neurons and glia in the nervous system, is an essential cellular substrate of brain function and pathology. Despite the accelerating production of digital reconstructions of neural morphology in laboratories worldwide, the public accessibility of data remains a core issue in neuroscience. Deficiencies in the availability of existing data create redundancy of research efforts and prevent researchers from building on others' work. Data sharing complements the development of computational resources and literature mining tools to accelerate scientific discovery. Results We carried out a comprehensive bibliometric analysis of neural morphology publications to quantify the impact of data sharing in the neuroscience community. Our findings demonstrate that sharing digital reconstructions of neural morphology via the NeuroMorpho.Org online repository leads to a significant increase of citations to the original article, thus directly benefiting the authors. Moreover, the rate of data reusage remains constant for at least 16 years after sharing (the whole period analyzed), altogether nearly doubling the peer-reviewed discoveries in the field. Furthermore, the recent availability of larger and more numerous datasets fostered integrative meta-analysis applications, which accrue on average twice the citations of re-analyses of individual datasets. We also designed and deployed an open-source citation tracking web-service that allows researchers to monitor reusage of their datasets in independent peer-reviewed reports. These results and the released tool can facilitate the recognition of shared data reuse for promotion and tenure considerations, merit evaluations, and funding decisions.
Affiliation(s)
- Herve Emissah
- Bioinformatics Program, College of Science, George Mason University
| | - Bengt Ljungquist
- Center for Neural Informatics, Structures, and Plasticity, College of Engineering & Computing, George Mason University
| | - Giorgio A. Ascoli
- Bioinformatics Program, College of Science, George Mason University
- Center for Neural Informatics, Structures, and Plasticity, College of Engineering & Computing, George Mason University
|
15
|
Kemmer I, Keppler A, Serrano-Solano B, Rybina A, Özdemir B, Bischof J, El Ghadraoui A, Eriksson JE, Mathur A. Building a FAIR image data ecosystem for microscopy communities. Histochem Cell Biol 2023; 160:199-209. [PMID: 37341795 PMCID: PMC10492678 DOI: 10.1007/s00418-023-02203-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Accepted: 04/27/2023] [Indexed: 06/22/2023]
Abstract
Bioimaging has now entered the era of big data with faster-than-ever development of complex microscopy technologies leading to increasingly complex datasets. This enormous increase in data size and informational complexity within those datasets has brought with it several difficulties in terms of common and harmonized data handling, analysis, and management practices, which are currently hampering the full potential of image data being realized. Here, we outline a wide range of efforts and solutions currently being developed by the microscopy community to address these challenges on the path towards FAIR bioimaging data. We also highlight how different actors in the microscopy ecosystem are working together, creating synergies that develop new approaches, and how research infrastructures, such as Euro-BioImaging, are fostering these interactions to shape the field.
Affiliation(s)
- Isabel Kemmer
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
| | - Antje Keppler
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
| | - Beatriz Serrano-Solano
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
| | - Arina Rybina
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
| | - Buğra Özdemir
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
| | - Johanna Bischof
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
| | - Ayoub El Ghadraoui
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany
| | - John E Eriksson
- Euro-BioImaging ERIC Statutory Seat, Tykistökatu 6, P.O. Box 123, 20521, Turku, Finland
| | - Aastha Mathur
- Euro-BioImaging ERIC Bio-Hub, European Molecular Biology Laboratory (EMBL) Heidelberg, Meyerhofstraße 1, 69117, Heidelberg, Germany.
|
16
|
O'Connor LM, O'Connor BA, Lim SB, Zeng J, Lo CH. Integrative multi-omics and systems bioinformatics in translational neuroscience: A data mining perspective. J Pharm Anal 2023; 13:836-850. [PMID: 37719197 PMCID: PMC10499660 DOI: 10.1016/j.jpha.2023.06.011] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 11/29/2022] [Revised: 06/20/2023] [Accepted: 06/25/2023] [Indexed: 09/19/2023] Open
Abstract
Bioinformatic analysis of large and complex omics datasets has become increasingly useful in modern day biology by providing a great depth of information, with its application to neuroscience termed neuroinformatics. Data mining of omics datasets has enabled the generation of new hypotheses based on differentially regulated biological molecules associated with disease mechanisms, which can be tested experimentally for improved diagnostic and therapeutic targeting of neurodegenerative diseases. Importantly, integrating multi-omics data using a systems bioinformatics approach will advance the understanding of the layered and interactive network of biological regulation that exchanges systemic knowledge to facilitate the development of a comprehensive human brain profile. In this review, we first summarize data mining studies utilizing datasets from the individual type of omics analysis, including epigenetics/epigenomics, transcriptomics, proteomics, metabolomics, lipidomics, and spatial omics, pertaining to Alzheimer's disease, Parkinson's disease, and multiple sclerosis. We then discuss multi-omics integration approaches, including independent biological integration and unsupervised integration methods, for more intuitive and informative interpretation of the biological data obtained across different omics layers. We further assess studies that integrate multi-omics in data mining which provide convoluted biological insights and offer proof-of-concept proposition towards systems bioinformatics in the reconstruction of brain networks. Finally, we recommend a combination of high dimensional bioinformatics analysis with experimental validation to achieve translational neuroscience applications including biomarker discovery, therapeutic development, and elucidation of disease mechanisms. 
We conclude by providing future perspectives and opportunities in applying integrative multi-omics and systems bioinformatics to achieve precision phenotyping of neurodegenerative diseases and towards personalized medicine.
Affiliation(s)
- Lance M. O'Connor
- College of Biological Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
| | - Blake A. O'Connor
- School of Pharmacy, University of Wisconsin, Madison, WI, 53705, USA
| | - Su Bin Lim
- Department of Biochemistry and Molecular Biology, Ajou University School of Medicine, Suwon, 16499, South Korea
| | - Jialiu Zeng
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
| | - Chih Hung Lo
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, 308232, Singapore
|
17
|
Danis D, Jacobsen JOB, Wagner AH, Groza T, Beckwith MA, Rekerle L, Carmody LC, Reese J, Hegde H, Ladewig MS, Seitz B, Munoz-Torres M, Harris NL, Rambla J, Baudis M, Mungall CJ, Haendel MA, Robinson PN. Phenopacket-tools: Building and validating GA4GH Phenopackets. PLoS One 2023; 18:e0285433. [PMID: 37196000 PMCID: PMC10191354 DOI: 10.1371/journal.pone.0285433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 04/21/2023] [Indexed: 05/19/2023] Open
Abstract
The Global Alliance for Genomics and Health (GA4GH) is a standards-setting organization that is developing a suite of coordinated standards for genomics. The GA4GH Phenopacket Schema is a standard for sharing disease and phenotype information that characterizes an individual person or biosample. The Phenopacket Schema is flexible and can represent clinical data for any kind of human disease including rare disease, complex disease, and cancer. It also allows consortia or databases to apply additional constraints to ensure uniform data collection for specific goals. We present phenopacket-tools, an open-source Java library and command-line application for construction, conversion, and validation of phenopackets. Phenopacket-tools simplifies construction of phenopackets by providing concise builders, programmatic shortcuts, and predefined building blocks (ontology classes) for concepts such as anatomical organs, age of onset, biospecimen type, and clinical modifiers. Phenopacket-tools can be used to validate the syntax and semantics of phenopackets as well as to assess adherence to additional user-defined requirements. The documentation includes examples showing how to use the Java library and the command-line tool to create and validate phenopackets. We demonstrate how to create, convert, and validate phenopackets using the library or the command-line application. Source code, API documentation, comprehensive user guide and a tutorial can be found at https://github.com/phenopackets/phenopacket-tools. The library can be installed from the public Maven Central artifact repository and the application is available as a standalone archive. The phenopacket-tools library helps developers implement and standardize the collection and exchange of phenotypic and other clinical data for use in phenotype-driven genomic diagnostics, translational research, and precision medicine applications.
Affiliation(s)
- Daniel Danis
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
| | - Julius O. B. Jacobsen
- William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
| | - Alex H. Wagner
- Departments of Pediatrics and Biomedical Informatics, The Ohio State University College of Medicine, Columbus, OH, United States of America
- The Steve and Cindy Rasmussen Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, United States of America
| | | | - Martha A. Beckwith
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
| | - Lauren Rekerle
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
| | - Leigh C. Carmody
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
| | - Justin Reese
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
| | - Harshad Hegde
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
| | - Markus S. Ladewig
- Department of Ophthalmology, Klinikum Saarbrücken, Saarbrücken, Germany
| | - Berthold Seitz
- Department of Ophthalmology, Saarland University Medical Center, Homburg/Saar, Germany
| | - Monica Munoz-Torres
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
| | - Nomi L. Harris
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
| | - Jordi Rambla
- European Genome-Phenome Archive (EGA) in the Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Michael Baudis
- University of Zurich and Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Christopher J. Mungall
- Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, United States of America
| | - Melissa A. Haendel
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
| | - Peter N. Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, United States of America
- Institute for Systems Genomics, University of Connecticut, Farmington, CT, United States of America
|
18
|
Tsueng G, Cano MAA, Bento J, Czech C, Kang M, Pache L, Rasmussen LV, Savidge TC, Starren J, Wu Q, Xin J, Yeaman MR, Zhou X, Su AI, Wu C, Brown L, Shabman RS, Hughes LD. Developing a standardized but extendable framework to increase the findability of infectious disease datasets. Sci Data 2023; 10:99. [PMID: 36823157 PMCID: PMC9950378 DOI: 10.1038/s41597-023-01968-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/12/2022] [Accepted: 01/13/2023] [Indexed: 02/25/2023] Open
Abstract
Biomedical datasets are increasing in size, stored in many repositories, and face challenges in FAIRness (findability, accessibility, interoperability, reusability). As a Consortium of infectious disease researchers from 15 Centers, we aim to adopt open science practices to promote transparency, encourage reproducibility, and accelerate research advances through data reuse. To improve FAIRness of our datasets and computational tools, we evaluated metadata standards across established biomedical data repositories. The vast majority do not adhere to a single standard, such as Schema.org, which is widely-adopted by generalist repositories. Consequently, datasets in these repositories are not findable in aggregation projects like Google Dataset Search. We alleviated this gap by creating a reusable metadata schema based on Schema.org and catalogued nearly 400 datasets and computational tools we collected. The approach is easily reusable to create schemas interoperable with community standards, but customized to a particular context. Our approach enabled data discovery, increased the reusability of datasets from a large research consortium, and accelerated research. Lastly, we discuss ongoing challenges with FAIRness beyond discoverability.
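As a rough illustration of what a Schema.org-based dataset record can look like, the following Python sketch builds a minimal JSON-LD description. The helper name `make_dataset_record`, the selection of properties, and all field values are assumptions made for illustration; they are not the consortium's actual schema.

```python
import json


def make_dataset_record(name, description, identifier, url, keywords):
    """Build a minimal Schema.org Dataset description as a JSON-LD dict.

    Hypothetical helper: the property selection is an illustrative
    assumption, not the schema described in the abstract above.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        "url": url,
        "keywords": list(keywords),
    }


# All values below are placeholders.
record = make_dataset_record(
    name="Example infectious disease dataset",
    description="Placeholder description for illustration only.",
    identifier="example:0001",
    url="https://example.org/datasets/0001",
    keywords=["infectious disease", "example"],
)

# Serialize the record; such JSON-LD is typically embedded in a dataset
# landing page so that aggregators can index it.
json_ld = json.dumps(record, indent=2)
print(json_ld)
```

Keeping the record as a plain dictionary makes it easy to extend with context-specific properties while staying compatible with the base vocabulary.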
Affiliation(s)
- Ginger Tsueng
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.
| | - Marco A Alvarado Cano
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - José Bento
- Department of Computer Science, Boston College, 245 Beacon St, Chestnut Hill, MA, 02467, USA
| | - Candice Czech
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Mengjia Kang
- Division of Pulmonary and Critical Care, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
| | - Lars Pache
- Infectious and Inflammatory Disease Center, Immunity and Pathogenesis Program, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, 92037, USA
| | - Luke V Rasmussen
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Tor C Savidge
- Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Justin Starren
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
| | - Qinglong Wu
- Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Jiwen Xin
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Michael R Yeaman
- Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Divisions of Molecular Medicine and Infectious Diseases, Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
- Lundquist Institute for Infection & Immunity at Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
| | - Xinghua Zhou
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Andrew I Su
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Chunlei Wu
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Scripps Research Translational Institute, La Jolla, CA, 92037, USA
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
| | - Liliana Brown
- Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
| | - Reed S Shabman
- Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
| | - Laura D Hughes
- Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA.
|
19
|
Gomes DGE, Pottier P, Crystal-Ornelas R, Hudgins EJ, Foroughirad V, Sánchez-Reyes LL, Turba R, Martinez PA, Moreau D, Bertram MG, Smout CA, Gaynor KM. Why don't we share data and code? Perceived barriers and benefits to public archiving practices. Proc Biol Sci 2022; 289:20221113. [PMID: 36416041 PMCID: PMC9682438 DOI: 10.1098/rspb.2022.1113] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/08/2022] [Accepted: 11/02/2022] [Indexed: 08/10/2023] Open
Abstract
The biological sciences community is increasingly recognizing the value of open, reproducible and transparent research practices for science and society at large. Despite this recognition, many researchers fail to share their data and code publicly. This pattern may arise from knowledge barriers about how to archive data and code, concerns about its reuse, and misaligned career incentives. Here, we define, categorize and discuss barriers to data and code sharing that are relevant to many research fields. We explore how real and perceived barriers might be overcome or reframed in the light of the benefits relative to costs. By elucidating these barriers and the contexts in which they arise, we can take steps to mitigate them and align our actions with the goals of open science, both as individual scientists and as a scientific community.
Affiliation(s)
- Dylan G. E. Gomes
- NRC Research Associate, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, WA 98112, USA
- Cooperative Institute for Marine Resources Studies, Hatfield Marine Science Center, Oregon State University, Newport, OR 97365, USA
| | - Patrice Pottier
- Evolution & Ecology Research Centre, School of Biological, Earth and Environmental Sciences, The University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Robert Crystal-Ornelas
- Earth and Environmental Sciences Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Emma J. Hudgins
- Department of Biology, Carleton University, Ottawa, Canada, K1S 5B6
| | | | | | - Rachel Turba
- Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA 90095-7239, USA
| | - Paula Andrea Martinez
- Australian Research Data Commons, The University of Queensland, Brisbane 4072, Australia
| | - David Moreau
- School of Psychology and Centre for Brain Research, University of Auckland, Auckland 1010, New Zealand
| | - Michael G. Bertram
- Department of Wildlife, Fish, and Environmental Studies, Swedish University of Agricultural Sciences, Umeå, SE-907 36, Sweden
| | - Cooper A. Smout
- Institute for Globally Distributed Open Research and Education (IGDORE), Brisbane 4001, Australia
| | - Kaitlyn M. Gaynor
- Departments of Zoology and Botany, University of British Columbia, Vancouver, Canada, BC V6T 1Z4
- National Center for Ecological Analysis and Synthesis, Santa Barbara, CA 93101, USA
|
20
|
Hoyt CT, Balk M, Callahan TJ, Domingo-Fernández D, Haendel MA, Hegde HB, Himmelstein DS, Karis K, Kunze J, Lubiana T, Matentzoglu N, McMurry J, Moxon S, Mungall CJ, Rutz A, Unni DR, Willighagen E, Winston D, Gyori BM. Unifying the identification of biomedical entities with the Bioregistry. Sci Data 2022; 9:714. [PMID: 36402838 PMCID: PMC9675740 DOI: 10.1038/s41597-022-01807-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2022] [Accepted: 10/26/2022] [Indexed: 11/21/2022] Open
Abstract
The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry.
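To make the idea of standardized identifiers concrete, the sketch below parses a compact identifier of the form `prefix:identifier`, normalizes the prefix, and expands it to a URL. The prefix table, the `example.org` URL patterns, and the function names are hypothetical stand-ins defined inline; the code does not use the Bioregistry's own API or data.

```python
# Hypothetical, hand-written lookup table for illustration only.
# Keys are lowercase prefixes; values are placeholder URL patterns.
URL_PATTERNS = {
    "chebi": "https://example.org/chebi/{id}",
    "go": "https://example.org/go/{id}",
}


def parse_curie(curie):
    """Split 'prefix:identifier' on the first colon and validate both parts."""
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a compact identifier: {curie!r}")
    return prefix, identifier


def normalize_prefix(prefix):
    """Map a prefix variant (e.g. 'CHEBI') to its lowercase table key."""
    key = prefix.strip().lower()
    if key not in URL_PATTERNS:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return key


def expand_curie(curie):
    """Expand a compact identifier to a URL using the placeholder patterns."""
    prefix, identifier = parse_curie(curie)
    return URL_PATTERNS[normalize_prefix(prefix)].format(id=identifier)


print(expand_curie("CHEBI:15377"))
```

Splitting on the first colon only means that identifiers which themselves contain a colon are passed through intact rather than truncated.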
Affiliation(s)
| | | | | | - Daniel Domingo-Fernández
- Department of Bioinformatics, Fraunhofer SCAI, Sankt Augustin, Germany
- Enveda Biosciences, Boulder, USA
| | | | | | | | - Klas Karis
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA
| | - John Kunze
- California Digital Library, University of California, Berkeley, USA
| | - Tiago Lubiana
- School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
| | | | - Julie McMurry
- University of Colorado Anschutz Medical Campus, Aurora, USA
| | - Sierra Moxon
- Lawrence Berkeley National Laboratory, Berkeley, USA
| | | | - Adriano Rutz
- School of Pharmaceutical Sciences, University of Geneva, Geneva, Switzerland
- Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Geneva, Switzerland
| | - Deepak R Unni
- Lawrence Berkeley National Laboratory, Berkeley, USA
- European Molecular Biology Laboratory, Heidelberg, Germany
| | - Egon Willighagen
- Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Maastricht, Netherlands
| | | | - Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA.
|
21
|
Bittremieux W, Wang M, Dorrestein PC. The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics 2022; 18:94. [PMID: 36409434 PMCID: PMC10284100 DOI: 10.1007/s11306-022-01947-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 08/01/2022] [Accepted: 10/19/2022] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Spectral library searching is currently the most common approach for compound annotation in untargeted metabolomics. Spectral libraries applicable to liquid chromatography mass spectrometry have grown in size over the past decade to include hundreds of thousands to millions of mass spectra and tens of thousands of compounds, forming an essential knowledge base for the interpretation of metabolomics experiments. AIM OF REVIEW: We describe existing spectral library resources, highlight different strategies for compiling spectral libraries, and discuss quality considerations that should be taken into account when interpreting spectral library searching results. Finally, we describe how spectral libraries are empowering the next generation of machine learning tools in computational metabolomics, and discuss several opportunities for using increasingly accessible large spectral libraries. KEY SCIENTIFIC CONCEPTS OF REVIEW: This review focuses on the current state of spectral libraries for untargeted LC-MS/MS based metabolomics. We show how the number of entries in publicly accessible spectral libraries has increased more than 60-fold in the past eight years to aid molecular interpretation and we discuss how the role of spectral libraries in untargeted metabolomics will evolve in the near future.
Affiliation(s)
- Wout Bittremieux
- Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
| | - Mingxun Wang
- Department of Computer Science, University of California Riverside, Riverside, CA, 92507, USA
| | - Pieter C Dorrestein
- Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA.
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA.
| |
|
22
|
Wang LQ, Fernandez-Boyano I, Robinson WP. Genetic variation in placental insufficiency: What have we learned over time? Front Cell Dev Biol 2022; 10:1038358. [PMID: 36313546 PMCID: PMC9613937 DOI: 10.3389/fcell.2022.1038358] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/06/2022] [Accepted: 10/03/2022] [Indexed: 11/28/2022] Open
Abstract
Genetic variation shapes placental development and function, which has long been known to impact fetal growth and pregnancy outcomes such as miscarriage or maternal pre-eclampsia. Early epidemiology studies provided evidence of a strong heritable component to these conditions with both maternal and fetal-placental genetic factors contributing. Subsequently, cytogenetic studies of the placenta and the advent of prenatal diagnosis to detect chromosomal abnormalities provided direct evidence of the importance of spontaneously arising genetic variation in the placenta, such as trisomy and uniparental disomy, drawing inferences that remain relevant to this day. Candidate gene approaches highlighted the role of genetic variation in genes influencing immune interactions at the maternal-fetal interface and angiogenic factors. More recently, the emergence of molecular techniques and in particular high-throughput technologies such as Single-Nucleotide Polymorphism (SNP) arrays, has facilitated the discovery of copy number variation and study of SNP associations with conditions related to placental insufficiency. This review integrates past and more recent knowledge to provide important insights into the role of placental function on fetal and perinatal health, as well as into the mechanisms leading to genetic variation during development.
Affiliation(s)
- Li Qing Wang
- BC Children’s Hospital Research Institute, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
| | - Icíar Fernandez-Boyano
- BC Children’s Hospital Research Institute, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
| | - Wendy P. Robinson
- BC Children’s Hospital Research Institute, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
| |
|
23
|
Garcia BJ, Urrutia J, Zheng G, Becker D, Corbet C, Maschhoff P, Cristofaro A, Gaffney N, Vaughn M, Saxena U, Chen YP, Gordon DB, Eslami M. A toolkit for enhanced reproducibility of RNASeq analysis for synthetic biologists. Synth Biol (Oxf) 2022; 7:ysac012. [PMID: 36035514 PMCID: PMC9408027 DOI: 10.1093/synbio/ysac012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2021] [Revised: 06/17/2022] [Accepted: 08/22/2022] [Indexed: 11/13/2022]
Abstract
Sequencing technologies, in particular RNASeq, have become critical tools in the design, build, test and learn cycle of synthetic biology. They provide a better understanding of synthetic designs, and they help identify ways to improve and select designs. While these data are beneficial to design, their collection and analysis is a complex, multistep process that has implications on both discovery and reproducibility of experiments. Additionally, tool parameters, experimental metadata, normalization of data and standardization of file formats present challenges that are computationally intensive. This calls for high-throughput pipelines expressly designed to handle the combinatorial and longitudinal nature of synthetic biology. In this paper, we present a pipeline to maximize the analytical reproducibility of RNASeq for synthetic biologists. We also explore the impact of reproducibility on the validation of machine learning models. We present the design of a pipeline that combines traditional RNASeq data processing tools with structured metadata tracking to allow for the exploration of the combinatorial design in a high-throughput and reproducible manner. We then demonstrate utility via two different experiments: a control comparison experiment and a machine learning model experiment. The first experiment compares datasets collected from identical biological controls across multiple days for two different organisms. It shows that a reproducible experimental protocol for one organism does not guarantee reproducibility in another. The second experiment quantifies the differences in experimental runs from multiple perspectives. It shows that the lack of reproducibility from these different perspectives can place an upper bound on the validation of machine learning models trained on RNASeq data.
Affiliation(s)
- Benjamin J Garcia
- Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Joshua Urrutia
- Texas Advanced Computing Center, University of Texas at Austin, Austin, TX, USA
| | | | | | | | | | - Alexander Cristofaro
- Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Niall Gaffney
- Texas Advanced Computing Center, University of Texas at Austin, Austin, TX, USA
| | - Matthew Vaughn
- Texas Advanced Computing Center, University of Texas at Austin, Austin, TX, USA
| | - Uma Saxena
- Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
| | | | - D Benjamin Gordon
- Department of Biological Engineering, Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA, USA
| | | |
|
24
|
Forero DA, Curioso WH, Patrinos GP. The importance of adherence to international standards for depositing open data in public repositories. BMC Res Notes 2021; 14:405. [PMID: 34727971 PMCID: PMC8561348 DOI: 10.1186/s13104-021-05817-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/26/2021] [Accepted: 10/22/2021] [Indexed: 12/14/2022] Open
Abstract
There has been an important global interest in Open Science, which includes open data and methods, in addition to open access publications. It has been proposed that public availability of raw data increases the value and the possibility of confirmation of scientific findings, in addition to the potential of reducing research waste. Availability of raw data in open repositories facilitates the adequate development of meta-analysis and the cumulative evaluation of evidence for specific topics. In this commentary, we discuss key elements about data sharing in open repositories and we invite researchers around the world to deposit their data in them.
Affiliation(s)
- Diego A Forero
- Health and Sport Sciences Research Group, School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia
- Professional Program in Respiratory Therapy, School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia
| | - Walter H Curioso
- Vicerrectorado de Investigación, Universidad Continental, Lima, Peru
| | - George P Patrinos
- Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece
- Department of Pathology, College of Medicine and Health Sciences, United Arab Emirates University, Al-Ain, UAE
- Zayed Center for Health Sciences, United Arab Emirates University, Al-Ain, UAE
| |
|
25
|
Heil BJ, Hoffman MM, Markowetz F, Lee SI, Greene CS, Hicks SC. Reproducibility standards for machine learning in the life sciences. Nat Methods 2021; 18:1132-1135. [PMID: 34462593 PMCID: PMC9131851 DOI: 10.1038/s41592-021-01256-7] [Citation(s) in RCA: 60] [Impact Index Per Article: 20.0] [Indexed: 02/08/2023]
Abstract
To make machine learning analyses in the life sciences more computationally reproducible, we propose standards based on data, model, and code publication, programming best practices, and workflow automation. By meeting these standards, the community of researchers applying machine learning methods in the life sciences can ensure that their analyses are worthy of trust. This article has been peer reviewed.
Affiliation(s)
- Benjamin J Heil
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Michael M Hoffman
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
| | - Florian Markowetz
- Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
| | - Su-In Lee
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Casey S Greene
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, USA.
- Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA.
| | - Stephanie C Hicks
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA.
| |
|
26
|
Way GP, Greene CS, Carninci P, Carvalho BS, de Hoon M, Finley SD, Gosline SJC, Lê Cao KA, Lee JSH, Marchionni L, Robine N, Sindi SS, Theis FJ, Yang JYH, Carpenter AE, Fertig EJ. A field guide to cultivating computational biology. PLoS Biol 2021; 19:e3001419. [PMID: 34618807 PMCID: PMC8525744 DOI: 10.1371/journal.pbio.3001419] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Revised: 10/19/2021] [Indexed: 11/18/2022] Open
Abstract
Evolving in sync with the computation revolution over the past 30 years, computational biology has emerged as a mature scientific field. While the field has made major contributions toward improving scientific knowledge and human health, individual computational biology practitioners at various institutions often languish in career development. As optimistic biologists passionate about the future of our field, we propose solutions for both eager and reluctant individual scientists, institutions, publishers, funding agencies, and educators to fully embrace computational biology. We believe that in order to pave the way for the next generation of discoveries, we need to improve recognition for computational biologists and better align pathways of career success with pathways of scientific progress. With 10 outlined steps, we call on all adjacent fields to move away from the traditional individual, single-discipline investigator research model and embrace multidisciplinary, data-driven, team science.
Affiliation(s)
- Gregory P. Way
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Casey S. Greene
- Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
| | - Piero Carninci
- RIKEN Center for Integrative Medical Sciences Yokohama, Kanagawa, Japan
- Human Technopole, Milan, Italy
| | - Benilton S. Carvalho
- Department of Statistics, Institute of Mathematics, Statistics and Scientific Computing, University of Campinas, Campinas, Brazil
| | - Michiel de Hoon
- RIKEN Center for Integrative Medical Sciences Yokohama, Kanagawa, Japan
| | - Stacey D. Finley
- Department of Biomedical Engineering, Quantitative and Computational Biology, and Chemical Engineering & Materials Science, University of Southern California, Los Angeles, California, United States of America
| | - Sara J. C. Gosline
- Pacific Northwest National Laboratory, Seattle, Washington, United States of America
| | - Kim-Anh Lê Cao
- Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, Melbourne, Australia
| | - Jerry S. H. Lee
- Ellison Institute and Departments of Medicine/Oncology, Chemical Engineering, and Material Sciences, University of Southern California, Los Angeles, California, United States of America
| | - Luigi Marchionni
- Department of Pathology and Laboratory Medicine, Weill-Cornell Medicine, New York, New York, United States of America
| | - Nicolas Robine
- Computational Biology Lab, New York Genome Center, New York, New York, United States of America
| | - Suzanne S. Sindi
- Department of Applied Mathematics, University of California Merced, Merced, California, United States of America
| | - Fabian J. Theis
- Institute of Computational Biology, Helmholtz Center Munich and Department of Mathematics, Technical University of Munich, Munich, Germany
| | - Jean Y. H. Yang
- Charles Perkins Centre and School of Mathematics and Statistics, The University of Sydney, Australia
| | - Anne E. Carpenter
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Elana J. Fertig
- Convergence Institute, Departments of Oncology, Biomedical Engineering, and Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, United States of America
| |
|
27
|
Abstract
The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can be used as either a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published data set with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows, and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at https://github.com/wfondrie/ppx.
Affiliation(s)
- William E Fondrie
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Wout Bittremieux
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, USA
- Department of Computer Science, University of Antwerp, Antwerp, Belgium
| | - William S Noble
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| |
|