1
Shome M, MacKenzie TMG, Subbareddy SR, Snyder MP. The Importance, Challenges, and Possible Solutions for Sharing Proteomics Data While Safeguarding Individuals' Privacy. Mol Cell Proteomics 2024; 23:100731. [PMID: 38331191 PMCID: PMC10915627 DOI: 10.1016/j.mcpro.2024.100731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 01/28/2024] [Accepted: 02/05/2024] [Indexed: 02/10/2024] [Open Access]
Abstract
Proteomics data sharing has profound benefits at the individual level as well as at the community level. While data sharing has increased over the years, mostly due to journal and funding agency requirements, the reluctance of researchers with regard to data sharing is evident, as many share only the bare minimum dataset required to publish an article. In many cases, proper metadata is missing, essentially making the dataset useless. This behavior can be explained by a lack of incentives, insufficient awareness, or a lack of clarity surrounding ethical issues. Through adequate training at research institutes, researchers can realize the benefits associated with data sharing and can accelerate the norm of data sharing for the field of proteomics, as has been the standard in genomics for decades. In this article, we have put together various repository options available for proteomics data. We have also added pros and cons of those repositories to help researchers select the repository most suitable for their data submission. It is also important to note that a few types of proteomics data have the potential to re-identify an individual in certain scenarios. In such cases, extra caution should be taken to remove any personal identifiers before sharing on public repositories. Data sets that will be useless without personal identifiers need to be shared in a controlled access repository so that only authorized researchers can access the data and personal identifiers are kept safe.
Affiliation(s)
- Mahasish Shome
- Department of Genetics, Stanford University, Palo Alto, California, USA
- Tim M G MacKenzie
- Department of Genetics, Stanford University, Palo Alto, California, USA
- Michael P Snyder
- Department of Genetics, Stanford University, Palo Alto, California, USA
2
Bouyssié D, Altıner P, Capella-Gutierrez S, Fernández JM, Hagemeijer YP, Horvatovich P, Hubálek M, Levander F, Mauri P, Palmblad M, Raffelsberger W, Rodríguez-Navas L, Di Silvestre D, Kunkli BT, Uszkoreit J, Vandenbrouck Y, Vizcaíno JA, Winkelhardt D, Schwämmle V. WOMBAT-P: Benchmarking Label-Free Proteomics Data Analysis Workflows. J Proteome Res 2024; 23:418-429. [PMID: 38038272 DOI: 10.1021/acs.jproteome.3c00636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2023]
Abstract
The inherent diversity of approaches in proteomics research has led to a wide range of software solutions for data analysis. These software solutions encompass multiple tools, each employing different algorithms for various tasks such as peptide-spectrum matching, protein inference, quantification, statistical analysis, and visualization. To enable an unbiased comparison of commonly used bottom-up label-free proteomics workflows, we introduce WOMBAT-P, a versatile platform designed for automated benchmarking and comparison. WOMBAT-P simplifies the processing of public data by utilizing the sample and data relationship format for proteomics (SDRF-Proteomics) as input. This feature streamlines the analysis of annotated local or public ProteomeXchange data sets, promoting efficient comparisons among diverse outputs. Through an evaluation using experimental ground truth data and a realistic biological data set, we uncover significant disparities and a limited overlap in the quantified proteins. WOMBAT-P not only enables rapid execution and seamless comparison of workflows but also provides valuable insights into the capabilities of different software solutions. These benchmarking metrics are a valuable resource for researchers in selecting the most suitable workflow for their specific data sets. The modular architecture of WOMBAT-P promotes extensibility and customization. The software is available at https://github.com/wombat-p/WOMBAT-Pipelines.
Affiliation(s)
- David Bouyssié
- Institut de Pharmacologie et de Biologie Structurale (IPBS), Université de Toulouse, CNRS, Université Toulouse III - Paul Sabatier (UT3), 31062 Toulouse, France
- Proteomics French Infrastructure, ProFI, FR 2048 Toulouse, France
- Pınar Altıner
- Institut de Pharmacologie et de Biologie Structurale (IPBS), Université de Toulouse, CNRS, Université Toulouse III - Paul Sabatier (UT3), 31062 Toulouse, France
- José M Fernández
- Life Sciences Department, Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain
- Yanick Paco Hagemeijer
- Department of Analytical Biochemistry, University of Groningen, Groningen Research Institute of Pharmacy, 9712 CP Groningen, The Netherlands
- European Research Institute for the Biology of Ageing, University Medical Center Groningen, 9713 GZ Groningen, The Netherlands
- Peter Horvatovich
- Department of Analytical Biochemistry, University of Groningen, Groningen Research Institute of Pharmacy, 9712 CP Groningen, The Netherlands
- Martin Hubálek
- Institute of Organic Chemistry and Biochemistry, CAS, 160 00 Prague, Czech Republic
- Fredrik Levander
- National Bioinformatics Infrastructure Sweden (NBIS), Science for Life Laboratory, Department of Immunotechnology, Lund University, 22100 Lund, Sweden
- Pierluigi Mauri
- Institute for Biomedical Technologies (ITB), Department of Biomedical Sciences, National Research Council (CNR), Segrate, 20054 Milan, Italy
- Magnus Palmblad
- Leiden University Medical Center, Postbus 9600, 2300 RC Leiden, The Netherlands
- Wolfgang Raffelsberger
- Institut de Génétique et de Biologie Moléculaire et Cellulaire, Université de Strasbourg, CNRS UMR7104, INSERM U1258, Illkirch, 1 Rue Laurent Fries, 67404 Illkirch, France
- Laura Rodríguez-Navas
- Life Sciences Department, Barcelona Supercomputing Center (BSC), 08034 Barcelona, Spain
- Dario Di Silvestre
- Institute for Biomedical Technologies (ITB), Department of Biomedical Sciences, National Research Council (CNR), Segrate, 20054 Milan, Italy
- Balázs Tibor Kunkli
- Department of Biochemistry and Molecular Biology, University of Debrecen, 4032 Debrecen, Hungary
- Julian Uszkoreit
- Medical Faculty, Medical Bioinformatics, Ruhr University Bochum, 44801 Bochum, Germany
- Center for Protein Diagnostics (ProDi), Medical Proteome Analysis, Ruhr University Bochum, 44801 Bochum, Germany
- Medical Faculty, Medizinisches Proteom-Center, Ruhr University Bochum, 44801 Bochum, Germany
- Yves Vandenbrouck
- Proteomics French Infrastructure, ProFI, FR 2048 Toulouse, France
- CEA, Fundamental Research Division, Proteomics French Infrastructure, 91191 Gif-sur-Yvette, France
- Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
- Dirk Winkelhardt
- Medical Faculty, Medizinisches Proteom-Center, Ruhr University Bochum, 44801 Bochum, Germany
- Veit Schwämmle
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
3
Belliard F, Maineri AM, Plomp E, Ramos Padilla AF, Sun J, Zare Jeddi M. Ten simple rules for starting FAIR discussions in your community. PLoS Comput Biol 2023; 19:e1011668. [PMID: 38096152 PMCID: PMC10721007 DOI: 10.1371/journal.pcbi.1011668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2023] [Open Access]
Abstract
This work presents 10 rules that provide guidance and recommendations on how to start up discussions around the implementation of the FAIR (Findable, Accessible, Interoperable, Reusable) principles and creation of standardised ways of working. These recommendations will be particularly relevant if you are unsure where to start, who to involve, what the benefits and barriers of standardisation are, and if little work has been done in your discipline to standardise research workflows. When applied, these rules will support a more effective way of engaging the community with discussions on standardisation and practical implementation of the FAIR principles.
Affiliation(s)
- Angelica Maria Maineri
- Erasmus University Rotterdam, Erasmus School of Social and Behavioral Sciences/ODISSEI, Rotterdam, the Netherlands
- Esther Plomp
- Delft University of Technology, Faculty of Applied Sciences, Delft, the Netherlands
- Junzi Sun
- Faculty of Aerospace Engineering, Delft University of Technology, Delft, the Netherlands
- Maryam Zare Jeddi
- National Institute for Public Health and the Environment (RIVM), Bilthoven, the Netherlands
4
Bremer PL, Fiehn O. SMetaS: A Sample Metadata Standardizer for Metabolomics. Metabolites 2023; 13:941. [PMID: 37623884 PMCID: PMC10456726 DOI: 10.3390/metabo13080941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 07/26/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023] [Open Access]
Abstract
Metabolomics has advanced to an extent where it is desired to standardize and compare data across individual studies. While past work in standardization has focused on data acquisition, data processing, and data storage aspects, metabolomics databases are useless without ontology-based descriptions of biological samples and study designs. We introduce here a user-centric tool to automatically standardize sample metadata. Using such a tool in frontends for metabolomic databases will dramatically increase the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of data, specifically for data reuse and for finding datasets that share comparable sets of metadata, e.g., study meta-analyses, cross-species analyses or large scale metabolomic atlases. SMetaS (Sample Metadata Standardizer) combines a classic database with an API and frontend and is provided in a containerized environment. The tool has two user-centric components. In the first component, the user designs a sample metadata matrix and fills the cells using natural language terminology. In the second component, the tool transforms the completed matrix by replacing freetext terms with terms from fixed vocabularies. This transformation process is designed to maximize simplicity and is guided by, among other strategies, synonym matching and typographical fixing in an n-grams/nearest neighbors model approach. The tool enables downstream analysis of submitted studies and samples via string equality for FAIR retrospective use.
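The second component described in this abstract, replacing free-text terms with entries from a fixed vocabulary, can be illustrated with a minimal sketch. This is not the SMetaS implementation: the character-trigram representation, the cosine similarity, the 0.5 threshold, the `standardize` function, and the four-term `VOCABULARY` are all assumptions chosen only to show how an n-gram nearest-neighbor lookup can absorb typos and formatting differences.

```python
from collections import Counter
from math import sqrt
from typing import List, Optional


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def standardize(term: str, vocabulary: List[str], threshold: float = 0.5) -> Optional[str]:
    """Return the vocabulary entry nearest to `term`, or None if nothing is close enough."""
    query = char_ngrams(term)
    best = max(vocabulary, key=lambda entry: cosine(query, char_ngrams(entry)))
    return best if cosine(query, char_ngrams(best)) >= threshold else None


# Hypothetical controlled vocabulary (assumption); a real tool would draw on ontologies.
VOCABULARY = ["Homo sapiens", "Mus musculus", "blood plasma", "liver"]

print(standardize("homo sapeins", VOCABULARY))  # misspelled free-text entry
print(standardize("Liver ", VOCABULARY))        # case and whitespace variant
print(standardize("xyzzy", VOCABULARY))         # no sufficiently close term
```

Once every cell holds a vocabulary term, downstream comparison of samples reduces to the string equality the abstract mentions; terms that fall below the threshold would need manual review.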
Affiliation(s)
- Parker Ladd Bremer
- Department of Chemistry, University of California, Davis, CA 95616, USA
- Oliver Fiehn
- West Coast Metabolomics Center for Compound Identification, UC Davis Genome Center, University of California, Davis, CA 95616, USA
5
van Zalm PW, Ahmed S, Fatou B, Schreiber R, Barnaby O, Boxer A, Zetterberg H, Steen JA, Steen H. Meta-analysis of published cerebrospinal fluid proteomics data identifies and validates metabolic enzyme panel as Alzheimer's disease biomarkers. Cell Rep Med 2023; 4:101005. [PMID: 37075703 PMCID: PMC10140596 DOI: 10.1016/j.xcrm.2023.101005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 10/10/2022] [Accepted: 03/17/2023] [Indexed: 04/21/2023]
Abstract
To develop therapies for Alzheimer's disease, we need accurate in vivo diagnostics. Multiple proteomic studies mapping biomarker candidates in cerebrospinal fluid (CSF) resulted in little overlap. To overcome this shortcoming, we apply the rarely used concept of proteomics meta-analysis to identify an effective biomarker panel. We combine ten independent datasets for biomarker identification: seven datasets from 150 patients/controls for discovery, one dataset with 20 patients/controls for down-selection, and two datasets with 494 patients/controls for validation. The discovery results in 21 biomarker candidates and down-selection in three, to be validated in the two additional large-scale proteomics datasets with 228 diseased and 266 control samples. This resulting 3-protein biomarker panel differentiates Alzheimer's disease (AD) from controls in the two validation cohorts with areas under the receiver operating characteristic curve (AUROCs) of 0.83 and 0.87, respectively. This study highlights the value of systematically re-analyzing previously published proteomics data and the need for more stringent data deposition.
Affiliation(s)
- Patrick W van Zalm
- Department of Pathology, Boston Children's Hospital, and Department of Pathology, Harvard Medical School, Boston, MA, USA; Department of Neuropsychology and Psychopharmacology, EURON, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, the Netherlands
- Saima Ahmed
- Department of Pathology, Boston Children's Hospital, and Department of Pathology, Harvard Medical School, Boston, MA, USA
- Benoit Fatou
- Department of Pathology, Boston Children's Hospital, and Department of Pathology, Harvard Medical School, Boston, MA, USA
- Rudy Schreiber
- Department of Neuropsychology and Psychopharmacology, EURON, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, the Netherlands
- Omar Barnaby
- Department of Pathology, Boston Children's Hospital, and Department of Pathology, Harvard Medical School, Boston, MA, USA
- Adam Boxer
- Memory and Aging Center, Department of Neurology, Weill Institute for Neuroscience, University of California, San Francisco, CA, USA
- Henrik Zetterberg
- Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, the Sahlgrenska Academy at the University of Gothenburg, Mölndal, Sweden; Clinical Neurochemistry Laboratory, Sahlgrenska University Hospital, Mölndal, Sweden; UK Dementia Research Institute at UCL, London, UK; Department of Neurodegenerative Disease, UCL Institute of Neurology, London, UK
- Judith A Steen
- F.M. Kirby Neurobiology Center, Boston Children's Hospital, and Department of Neurology, Harvard Medical School, Boston, MA, USA; Neurobiology Program, Boston Children's Hospital, Boston, MA, USA
- Hanno Steen
- Department of Pathology, Boston Children's Hospital, and Department of Pathology, Harvard Medical School, Boston, MA, USA; Neurobiology Program, Boston Children's Hospital, Boston, MA, USA
6
A Schema for Digitized Surface Swab Site Metadata in Open-Source DNA Sequence Databases. mSystems 2023; 8:e0128422. [PMID: 36847566 PMCID: PMC10134794 DOI: 10.1128/msystems.01284-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2023] [Open Access]
Abstract
Large, open-source DNA sequence databases have been generated, in part, through the collection of microbial pathogens by swabbing surfaces in built environments. Analyzing these data in aggregate through public health surveillance requires digitization of the complex, domain-specific metadata that are associated with the swab site locations. However, the swab site location information is currently collected in a single, free-text "isolation source" field, promoting the generation of poorly detailed descriptions with varying word order, granularity, and linguistic errors, making automation difficult and reducing machine-actionability. We assessed 1,498 free-text swab site descriptions that were generated during routine foodborne pathogen surveillance. The lexicon of free-text metadata was evaluated to determine the informational facets and the quantity of unique terms used by data collectors. Open Biological Ontologies (OBO) Foundry libraries were used to develop hierarchical vocabularies that are connected with logical relationships to describe swab site locations. Five informational facets that were described by 338 unique terms were identified via content analysis. Term hierarchy facets were developed, as were statements (called axioms) about how the entities within these five domains are related. The schema developed through this study has been integrated into a publicly available pathogen metadata standard, facilitating ongoing surveillance and investigations. The One Health Enteric Package became available at NCBI BioSample in 2022. The collective use of metadata standards increases the interoperability of DNA sequence databases and enables large-scale approaches to data sharing and artificial intelligence as well as big-data solutions to food safety.
IMPORTANCE: The regular analysis of whole-genome sequence data in collections such as NCBI's Pathogen Detection Database is used by many public health organizations to detect outbreaks of infectious disease. However, isolate metadata in these databases are often incomplete and of poor quality. These complex, raw metadata must often be reorganized and manually formatted for use in aggregate analyses. These processes are inefficient and time-consuming, increasing the interpretative labor needed by public health groups to extract actionable information. The future use of open genomic epidemiology networks will be supported through the development of an internationally applicable vocabulary system with which swab site locations can be described.
7
Yang SS, Wang C, Jiang YF, Zhang H. Three-Dimensional MAX-Ti₃AlC₂ Nanomaterials for Dual-Selective and Highly Efficient Enrichment of Phosphorylated and Glycosylated Peptides. Chempluschem 2023; 88:e202200375. [PMID: 36581565 DOI: 10.1002/cplu.202200375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 12/13/2022] [Indexed: 12/15/2022]
Abstract
Dual-selective enrichment of phosphopeptides and glycopeptides carrying post-translational modifications (PTMs) in complex biological samples is challenging. In this work, considering their versatile properties, including abundant surface metal sites and electrostatic attraction between the Ti₃C₂ layers and Al layers, layered ternary carbide Ti₃AlC₂ nanomaterials were successfully applied for the first time as an affinity adsorbent for the dual-selective capture of phosphopeptides and glycopeptides. In particular, the Ti₃AlC₂ nanomaterials had an excellent detection sensitivity for phosphopeptides (1×10⁻¹¹ M) and a good selectivity for glycopeptides at a molar ratio of HRP (horseradish peroxidase) to BSA (bovine serum albumin) as low as 1:500. Furthermore, Ti₃AlC₂ nanomaterials were also applied for dual-selective enrichment of phosphopeptides and glycopeptides from mouse brain neocortex lysate and human serum lysate, respectively, before mass spectrometry (MS) analysis, yielding twenty-two unique phosphopeptides from thirteen phosphoproteins and fifty-three unique glycopeptides from thirty-seven glycoproteins. This work will open a new avenue and will greatly promote sample preparation for mass spectrometric analysis in phosphoproteomics and glycoproteomics research.
Affiliation(s)
- Shi-Shu Yang
- Henan Key Laboratory of Green Chemical Media and Reactions, Ministry of Education, Henan Key Laboratory of Organic Functional Molecule and Drug Innovation, School of Chemistry and Chemical Engineering, Henan Normal University, Xinxiang, 453007, P. R. China
- Chen Wang
- State Key Laboratory of Analytical Chemistry for Life Science, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing, 210023, P. R. China
- Yu-Fei Jiang
- State Key Laboratory of Analytical Chemistry for Life Science, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing, 210023, P. R. China
- Hua Zhang
- Henan Key Laboratory of Green Chemical Media and Reactions, Ministry of Education, Henan Key Laboratory of Organic Functional Molecule and Drug Innovation, School of Chemistry and Chemical Engineering, Henan Normal University, Xinxiang, 453007, P. R. China
8
Walzer M, García-Seisdedos D, Prakash A, Brack P, Crowther P, Graham RL, George N, Mohammed S, Moreno P, Papatheodorou I, Hubbard SJ, Vizcaíno JA. Implementing the reuse of public DIA proteomics datasets: from the PRIDE database to Expression Atlas. Sci Data 2022; 9:335. [PMID: 35701420 PMCID: PMC9197839 DOI: 10.1038/s41597-022-01380-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/15/2021] [Accepted: 05/12/2022] [Indexed: 11/14/2022] [Open Access]
Abstract
The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.
Affiliation(s)
- Mathias Walzer
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- David García-Seisdedos
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Ananth Prakash
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Paul Brack
- Division of Evolution, Infection and Genomics, School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Oxford Road, Manchester, M13 9PT, United Kingdom
- Peter Crowther
- Melandra Limited, 16 Brook Road, Urmston, Manchester, M41 5RY, United Kingdom
- Robert L Graham
- School of Biological Sciences, Chlorine Gardens, Queen's University Belfast, Belfast, BT9 5DL, United Kingdom
- Nancy George
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Suhaib Mohammed
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Pablo Moreno
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Irene Papatheodorou
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
- Simon J Hubbard
- Division of Evolution, Infection and Genomics, School of Biological Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Oxford Road, Manchester, M13 9PT, United Kingdom
- Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, CB10 1SD, United Kingdom
9
A knowledge graph to interpret clinical proteomics data. Nat Biotechnol 2022; 40:692-702. [PMID: 35102292 PMCID: PMC9110295 DOI: 10.1038/s41587-021-01145-6] [Citation(s) in RCA: 78] [Impact Index Per Article: 39.0] [Received: 11/18/2020] [Accepted: 11/01/2021] [Indexed: 12/14/2022]
Abstract
Implementing precision medicine hinges on the integration of omics data, such as proteomics, into the clinical decision-making process, but the quantity and diversity of biomedical data, and the spread of clinically relevant knowledge across multiple biomedical databases and publications, pose a challenge to data integration. Here we present the Clinical Knowledge Graph (CKG), an open-source platform currently comprising close to 20 million nodes and 220 million relationships that represent relevant experimental data, public databases and literature. The graph structure provides a flexible data model that is easily extendable to new nodes and relationships as new databases become available. The CKG incorporates statistical and machine learning algorithms that accelerate the analysis and interpretation of typical proteomics workflows. Using a set of proof-of-concept biomarker studies, we show how the CKG might augment and enrich proteomics data and help inform clinical decision-making. A knowledge graph platform integrates proteomics with other omics data and biomedical databases.
10
Perez-Riverol Y, Bai J, Bandla C, García-Seisdedos D, Hewapathirana S, Kamatchinathan S, Kundu DJ, Prakash A, Frericks-Zipper A, Eisenacher M, Walzer M, Wang S, Brazma A, Vizcaíno JA. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res 2022; 50:D543-D552. [PMID: 34723319 PMCID: PMC8728295 DOI: 10.1093/nar/gkab1038] [Citation(s) in RCA: 2702] [Impact Index Per Article: 1351.0] [Received: 09/11/2021] [Revised: 10/12/2021] [Accepted: 10/14/2021] [Indexed: 12/12/2022] [Open Access]
Abstract
The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.
Affiliation(s)
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Jingwen Bai
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Chakradhar Bandla
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- David García-Seisdedos
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Suresh Hewapathirana
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Selvakumar Kamatchinathan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Deepti J Kundu
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ananth Prakash
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Anika Frericks-Zipper
- Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, D-44801 Bochum, Germany
- Ruhr University Bochum, Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, 44801 Bochum, Germany
- Martin Eisenacher
- Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, D-44801 Bochum, Germany
- Ruhr University Bochum, Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, 44801 Bochum, Germany
- Mathias Walzer
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Shengbo Wang
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Alvis Brazma
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
11
A proteomics sample metadata representation for multiomics integration and big data analysis. Nat Commun 2021; 12:5854. [PMID: 34615866 PMCID: PMC8494749 DOI: 10.1038/s41467-021-26111-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 05/19/2021] [Accepted: 09/16/2021] [Indexed: 11/08/2022] [Open Access]
Abstract
The amount of public proteomics data is rapidly increasing but there is no standardized format to describe the sample metadata and their relationship with the dataset files in a way that fully supports their understanding or reanalysis. Here we propose to develop the transcriptomics data format MAGE-TAB into a standard representation for proteomics sample metadata. We implement MAGE-TAB-Proteomics in a crowdsourcing project to manually curate over 200 public datasets. We also describe tools and libraries to validate and submit sample metadata-related information to the PRIDE repository. We expect that these developments will improve the reproducibility and facilitate the reanalysis and integration of public proteomics datasets.
|
12
|
Wilson SL, Way GP, Bittremieux W, Armache JP, Haendel MA, Hoffman MM. Sharing biological data: why, when, and how. FEBS Lett 2021; 595:847-863. [PMID: 33843054 PMCID: PMC10390076 DOI: 10.1002/1873-3468.14067] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 12/13/2022]
Affiliation(s)
- Samantha L Wilson
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Gregory P Way
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Wout Bittremieux
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, USA; Department of Computer Science, University of Antwerp, Antwerpen, Belgium
- Jean-Paul Armache
- Department of Biochemistry & Molecular Biology, The Huck Institutes of Life Sciences, Pennsylvania State University, University Park, PA, USA
- Michael M Hoffman
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada; Department of Medical Biophysics, Department of Computer Science, University of Toronto, Toronto, ON, Canada; Vector Institute, Toronto, ON, Canada
|
13
|
Bittremieux W, Bouyssié D, Dorfer V, Locard-Paulet M, Perez-Riverol Y, Schwämmle V, Uszkoreit J, Van Den Bossche T. The European Bioinformatics Community for Mass Spectrometry (EuBIC-MS): an open community for bioinformatics training and research. Rapid Commun Mass Spectrom 2021:e9087. [PMID: 33861485 DOI: 10.1002/rcm.9087] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/18/2020] [Revised: 02/13/2021] [Accepted: 03/18/2021] [Indexed: 06/12/2023]
Abstract
The European Bioinformatics Community for Mass Spectrometry (EuBIC-MS; eubic-ms.org) was founded in 2014 to unite European computational mass spectrometry researchers and proteomics bioinformaticians working in academia and industry. EuBIC-MS maintains educational resources (proteomics-academy.org) and organises workshops at national and international conferences on proteomics and mass spectrometry. Furthermore, EuBIC-MS is actively involved in several community initiatives such as the Human Proteome Organization's Proteomics Standards Initiative (HUPO-PSI). Apart from these collaborations, EuBIC-MS has organised two Winter Schools and two Developers' Meetings that have contributed to the strengthening of the European mass spectrometry network and fostered international collaboration in this field, even beyond Europe. Moreover, EuBIC-MS is currently actively developing a community-driven standard dedicated to mass spectrometry data annotation (SDRF-Proteomics) that will facilitate data reuse and collaboration. This manuscript highlights what EuBIC-MS is, what it does, and what it already has achieved. A warm invitation is extended to new researchers at all career stages to join the EuBIC-MS community on its Slack channel (eubic.slack.com).
Affiliation(s)
- Wout Bittremieux
- European Bioinformatics Community for Mass Spectrometry, Belgium
- University of California San Diego, La Jolla, CA, USA
- University of Antwerp, Antwerp, Belgium
- David Bouyssié
- European Bioinformatics Community for Mass Spectrometry, Belgium
- IPBS, University of Toulouse, CNRS, UPS, Toulouse, France
- Viktoria Dorfer
- European Bioinformatics Community for Mass Spectrometry, Belgium
- Bioinformatics Research Group, University of Applied Sciences Upper Austria, Hagenberg, Austria
- Marie Locard-Paulet
- European Bioinformatics Community for Mass Spectrometry, Belgium
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
- Yasset Perez-Riverol
- European Bioinformatics Community for Mass Spectrometry, Belgium
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Veit Schwämmle
- European Bioinformatics Community for Mass Spectrometry, Belgium
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark
- Julian Uszkoreit
- European Bioinformatics Community for Mass Spectrometry, Belgium
- Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, Ruhr University Bochum, Bochum, Germany
- Medical Faculty, Medizinisches Proteom-Center, Ruhr University Bochum, Bochum, Germany
- Tim Van Den Bossche
- European Bioinformatics Community for Mass Spectrometry, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
- Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
|
14
|
Cantelli G, Cochrane G, Brooksbank C, McDonagh E, Flicek P, McEntyre J, Birney E, Apweiler R. The European Bioinformatics Institute: empowering cooperation in response to a global health crisis. Nucleic Acids Res 2021; 49:D29-D37. [PMID: 33245775 PMCID: PMC7778996 DOI: 10.1093/nar/gkaa1077] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 09/24/2020] [Revised: 10/20/2020] [Accepted: 10/22/2020] [Indexed: 02/06/2023] Open
Abstract
The European Bioinformatics Institute (EMBL-EBI; https://www.ebi.ac.uk/) provides freely available data and bioinformatics services to the scientific community, alongside its research activity and training provision. The 2020 COVID-19 pandemic has brought to the forefront a need for the scientific community to work even more cooperatively to effectively tackle a global health crisis. EMBL-EBI has been able to build on its position to contribute to the fight against COVID-19 in a number of ways. Firstly, EMBL-EBI has used its infrastructure, expertise and network of international collaborations to help build the European COVID-19 Data Platform (https://www.covid19dataportal.org/), which brings together COVID-19 biomolecular data and connects it to researchers, clinicians and public health professionals. By September 2020, the COVID-19 Data Platform has integrated in excess of 170 000 COVID-19 biomolecular data and literature records, collected through a number of EMBL-EBI resources. Secondly, EMBL-EBI has strived to continue its support of the life science communities through the crisis, with updated Training provision and improved service provision throughout its resources. The COVID-19 pandemic has highlighted the importance of EMBL-EBI's core principles, including international cooperation, resource sharing and central data brokering, and has further empowered scientific cooperation.
Affiliation(s)
- Gaia Cantelli
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Guy Cochrane
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Cath Brooksbank
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ellen McDonagh
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Open Targets, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Johanna McEntyre
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ewan Birney
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Rolf Apweiler
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
|
15
|
Boekweg H, McCown MA, Payne SH. Simple and Efficient Data Analysis Dissemination for Individual Laboratories. J Proteome Res 2020; 19:4191-4195. [PMID: 32790999 DOI: 10.1021/acs.jproteome.0c00454] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Scientific progress comes as we build upon the work of others. Implicit in this advance is that we have access to and can thoroughly examine the work of others. It is important to recognize that our scholarly work as scientists encompasses not only experimental design and data collection but also our analytical methods. Thus when communicating biology experiments, especially those that utilize molecular omics data, the analysis methods that connect raw data to scientific conclusions must be presented with sufficient clarity that others can reproduce our exact work. Although there are many resources for sharing raw data files, there is currently not a widely utilized method for sharing analysis methods. We present a semistructured pattern for sharing analysis methods that is simple and efficient and can be implemented by individual laboratories using existing software. This pattern requires three types of files in a publicly accessible repository, such as GitHub: (1) data files, (2) a universal I/O script that parses all data files, and (3) analysis scripts creating figures and metrics reported in the manuscript. We suggest additional conventions to improve the readability and provide a template repository for the pattern. Sharing our exact analysis methods as software, in addition to their narrative description in a manuscript, will ensure reproducibility and transparency. Importantly, the pattern we present does not require new infrastructure and can be achieved without advanced computing skills.
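The three-part pattern described in this abstract can be pictured with a minimal sketch, shown here as one Python file for brevity. It is not the authors' template repository: the file name `intensities.tsv`, the column names, and the functions `write_example_data`, `load_intensities`, and `median_intensity` are all hypothetical, and the data values are invented.

```python
import csv
import statistics
import tempfile
from pathlib import Path


# (1) Data file: written to a temporary folder so the sketch runs anywhere.
# In a shared repository this would be a file committed next to the code.
def write_example_data(folder):
    path = Path(folder) / "intensities.tsv"
    path.write_text("protein\tintensity\nP1\t10.0\nP2\t14.0\nP3\t12.0\n")
    return path


# (2) Universal I/O script: the single place where data files are parsed.
def load_intensities(path):
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["protein"]: float(row["intensity"]) for row in reader}


# (3) Analysis script: computes a metric that a manuscript might report.
def median_intensity(intensities):
    return statistics.median(intensities.values())


with tempfile.TemporaryDirectory() as folder:
    data = load_intensities(write_example_data(folder))
    result = median_intensity(data)
    print(result)
```

Keeping all parsing in one loader means each analysis script can start from the same in-memory representation, which is one plausible reading of why the abstract calls for a single universal I/O script.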
Affiliation(s)
- Hannah Boekweg
- Biology Department, Brigham Young University, Provo, Utah 84602, United States
- Michaela A McCown
- Biology Department, Brigham Young University, Provo, Utah 84602, United States
- Samuel H Payne
- Biology Department, Brigham Young University, Provo, Utah 84602, United States
|