1
Klein J, Lam H, Mak TD, Bittremieux W, Perez-Riverol Y, Gabriels R, Shofstahl J, Hecht H, Binz PA, Kawano S, Van Den Bossche T, Carver J, Neely BA, Mendoza L, Suomi T, Claeys T, Payne T, Schulte D, Sun Z, Hoffmann N, Zhu Y, Neumann S, Jones AR, Bandeira N, Vizcaíno JA, Deutsch EW. The Proteomics Standards Initiative Standardized Formats for Spectral Libraries and Fragment Ion Peak Annotations: mzSpecLib and mzPAF. Anal Chem 2024. [PMID: 39514576 DOI: 10.1021/acs.analchem.4c04091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
Mass spectral libraries are collections of reference spectra, usually associated with specific analytes from which the spectra were generated, that are used for further downstream analysis of new spectra. There are many different formats used for encoding spectral libraries, but none have undergone a standardization process to ensure broad applicability to many applications. As part of the Human Proteome Organization Proteomics Standards Initiative (PSI), we have developed a standardized format for encoding spectral libraries, called mzSpecLib (https://psidev.info/mzSpecLib). It is primarily a data model that flexibly encodes metadata about the library entries using the extensible PSI-MS controlled vocabulary and can be encoded in and converted between different serialization formats. We have also developed a standardized data model and serialization for fragment ion peak annotations, called mzPAF (https://psidev.info/mzPAF). It is defined as a separate standard, since it may be used for other applications besides spectral libraries. The mzSpecLib and mzPAF standards are compatible with existing PSI standards such as ProForma 2.0 and the Universal Spectrum Identifier. The mzSpecLib and mzPAF standards have been primarily defined for peptides in proteomics applications with basic small molecule support. They could be extended in the future to other fields that need to encode spectral libraries for nonpeptidic analytes.
Affiliation(s)
- Joshua Klein
  - Program for Bioinformatics, Boston University, Boston, Massachusetts 02215, United States
- Henry Lam
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, 999077 Hong Kong, P. R. China
- Tytus D Mak
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
- Wout Bittremieux
  - Department of Computer Science, University of Antwerp, 2020 Antwerpen, Belgium
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Ralf Gabriels
  - VIB-UGent Center for Medical Biotechnology, VIB, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, 9052 Ghent, Belgium
- Jim Shofstahl
  - Thermo Fisher Scientific, 355 River Oaks Parkway, San Jose, California 95134, United States
- Helge Hecht
  - RECETOX, Faculty of Science, Masaryk University, Kotlářská 2, 60200 Brno, Czech Republic
- Shin Kawano
  - Database Center for Life Science, Joint Support Center for Data Science Research, Research Organization of Information and Systems, Chiba 277-0871, Japan
  - School of Frontier Engineering, Kitasato University, Sagamihara 252-0373, Japan
- Tim Van Den Bossche
  - VIB-UGent Center for Medical Biotechnology, VIB, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, 9052 Ghent, Belgium
- Jeremy Carver
  - Center for Computational Mass Spectrometry, Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, United States
- Benjamin A Neely
  - National Institute of Standards and Technology (NIST) Charleston, Charleston, South Carolina 29412, United States
- Luis Mendoza
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Tomi Suomi
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, FI-20520 Turku, Finland
- Tine Claeys
  - VIB-UGent Center for Medical Biotechnology, VIB, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Faculty of Medicine and Health Sciences, Ghent University, 9052 Ghent, Belgium
- Thomas Payne
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Douwe Schulte
  - Biomolecular Mass Spectrometry and Proteomics, Bijvoet Center for Biomolecular Research and Utrecht Institute of Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH, Utrecht, The Netherlands
- Zhi Sun
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Nils Hoffmann
  - Institute for Bio- and Geosciences (IBG-5), Forschungszentrum Jülich GmbH, 52428 Jülich, Germany
- Yunping Zhu
  - National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, #38, Life Science Park, Changping District, Beijing 102206, China
- Steffen Neumann
  - Computational Plant Biochemistry, Leibniz Institute of Plant Biochemistry, 06120 Halle, Germany
  - German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, 04103 Leipzig, Germany
- Andrew R Jones
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 3BX, United Kingdom
- Nuno Bandeira
  - Center for Computational Mass Spectrometry, Department of Computer Science and Engineering, University of California, San Diego, California 92093-0404, United States
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
- Juan Antonio Vizcaíno
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Eric W Deutsch
  - Institute for Systems Biology, Seattle, Washington 98109, United States
2
Vasylieva V, Arefiev I, Bourassa F, Trifiro FA, Brunet MA. Proteomics Can Rise to the Challenge of Pseudogenes' Coding Nature. J Proteome Res 2024. [PMID: 39486438 DOI: 10.1021/acs.jproteome.4c00116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2024]
Abstract
Throughout the past decade, technological advances in genomics and transcriptomics have revealed pervasive translation throughout mammalian genomes. These putative proteins are usually excluded from proteomics analyses, as they are absent from common protein repositories. A sizable portion of these noncanonical proteins is translated from pseudogenes. Pseudogenes are commonly described as defective copies of coding genes that are unable to produce proteins. Here, we suggest that proteomics can help in their annotation. First, we define important terms and review specific examples underlining the caveats in pseudogene annotation and their coding potential. Then, we discuss the challenges inherent to pseudogenes that have thus far made their confident detection in omics data difficult. Finally, we identify recent developments in experimental procedures, instrumentation, and computational methods in proteomics that put the field in a unique position to solve the pseudogene annotation conundrum.
Affiliation(s)
- Valeriia Vasylieva
  - Pediatrics Department, Université de Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada
  - Centre de Recherche du Centre hospitalier de l'université de Sherbrooke (CRCHUS), Sherbrooke, Québec J1E 4K8, Canada
- Ihor Arefiev
  - Pediatrics Department, Université de Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada
  - Centre de Recherche du Centre hospitalier de l'université de Sherbrooke (CRCHUS), Sherbrooke, Québec J1E 4K8, Canada
- Francis Bourassa
  - Pediatrics Department, Université de Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada
  - Centre de Recherche du Centre hospitalier de l'université de Sherbrooke (CRCHUS), Sherbrooke, Québec J1E 4K8, Canada
- Félix-Antoine Trifiro
  - Pediatrics Department, Université de Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada
  - Centre de Recherche du Centre hospitalier de l'université de Sherbrooke (CRCHUS), Sherbrooke, Québec J1E 4K8, Canada
- Marie A Brunet
  - Pediatrics Department, Université de Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada
  - Centre de Recherche du Centre hospitalier de l'université de Sherbrooke (CRCHUS), Sherbrooke, Québec J1E 4K8, Canada
3
Tariq U, Saeed F. Predicting peptide properties from mass spectrometry data using deep attention-based multitask network and uncertainty quantification. bioRxiv 2024:2024.08.21.609035. [PMID: 39229185 PMCID: PMC11370541 DOI: 10.1101/2024.08.21.609035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
Database search algorithms reduce the number of potential candidate peptides against which scoring needs to be performed using a single (i.e. mass) property for filtering. While useful, filtering based on one property may lead to exclusion of non-abundant spectra and uncharacterized peptides - potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention and multitask deep-network, which can predict multiple peptide properties (length, missed cleavages, and modification status) directly from spectra. We demonstrate that ProteoRift can predict these properties with up to 97% accuracy resulting in search-space reduction by more than 90%. As a result, our end-to-end pipeline is shown to exhibit 8x to 12x speedups with peptide deduction accuracy comparable to algorithmic techniques. We also formulate two uncertainty estimation metrics, which can distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict high-scoring mass spectra against correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end ML pipeline available at https://github.com/pcdslab/ProteoRift.
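The search-space reduction described in this abstract can be illustrated with a minimal sketch. It assumes that each candidate peptide carries a precomputed mass, missed-cleavage count, and modification flag, and that a model has already predicted length, missed cleavages, and modification status for the query spectrum. The `Candidate` type, the `filter_candidates` function, and the toy masses are hypothetical and are not taken from ProteoRift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate peptide record (not ProteoRift's data model)."""
    sequence: str
    mass: float            # toy value, not a real monoisotopic mass
    missed_cleavages: int
    modified: bool


def filter_candidates(candidates, precursor_mass, tol_da,
                      pred_length, pred_missed, pred_modified,
                      length_slack=0):
    """Keep candidates that pass the mass filter and the predicted-property filters."""
    kept = []
    for c in candidates:
        # Classic single-property filter: precursor mass tolerance.
        if abs(c.mass - precursor_mass) > tol_da:
            continue
        # Additional filters on properties predicted from the spectrum.
        if abs(len(c.sequence) - pred_length) > length_slack:
            continue
        if c.missed_cleavages != pred_missed:
            continue
        if c.modified != pred_modified:
            continue
        kept.append(c)
    return kept


candidates = [
    Candidate("PEPTIDEK", 1000.0, 0, False),
    Candidate("PEPTIDERK", 1000.2, 1, False),
    Candidate("AAAAAAAK", 1000.1, 0, True),
    Candidate("PEPTIDEK", 1200.0, 0, False),
]
kept = filter_candidates(candidates, precursor_mass=1000.0, tol_da=0.5,
                         pred_length=8, pred_missed=0, pred_modified=False)
reduction = 1 - len(kept) / len(candidates)
```

In this toy setting the mass filter alone would keep three of the four candidates, while the property filters narrow them to one; only the surviving candidates would then need to be scored against the spectrum.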
Affiliation(s)
- Usman Tariq
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
- Fahad Saeed
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
  - Biomolecular Sciences Institute (BSI), Florida International University, Miami, FL, USA
  - Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA
4
Lebedev VV, Yarykin DI, Buryak AK. Automated Identification of Ions Observed in Mass Spectra of Inorganic Compounds Using Isotopic Distribution Brute Force: Methodology and Performance Measurements. J Am Soc Mass Spectrom 2024; 35:1806-1817. [PMID: 39041793 DOI: 10.1021/jasms.4c00153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/24/2024]
Abstract
This Article describes the method of isotopic distribution brute force, which can be used to identify ions registered in mass spectra of inorganic compounds in an automated manner when a library search cannot be conducted. A detailed description of the isotopic distribution brute force methodology is presented, including a discussion of computation-related difficulties. The ability of the proposed algorithm to identify various inorganic ions is tested on a small set of real-life low-resolution mass spectra of lead halides and copper halides. An evaluation of the isotopic distribution brute force performance is conducted using the low-resolution experimental mass spectra of natural rhenium sulfide and lead(II) chloride. Based on identification results and obtained performance measurements, we formulate the empirical restrictions on the input data, ensuring that the application of isotopic distribution brute force is feasible from the standpoints of search space reduction and identification time.
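The abstract does not spell out the algorithm, so the following is only one plausible reading of a brute-force search over isotopic distributions, not the authors' implementation. It enumerates elemental compositions up to a maximum atom count, computes each theoretical isotopic pattern at nominal (integer) mass, which suits low-resolution spectra, and ranks the compositions by cosine similarity to the observed pattern. The function names are hypothetical, and the isotope abundances are approximate natural values.

```python
from itertools import product
from math import sqrt

# Approximate natural isotope abundances, keyed by nominal mass.
ISOTOPES = {
    "Cu": {63: 0.6915, 65: 0.3085},
    "Cl": {35: 0.7576, 37: 0.2424},
}


def convolve(a, b):
    """Combine two isotopic patterns (nominal mass -> probability)."""
    out = {}
    for mass_a, p_a in a.items():
        for mass_b, p_b in b.items():
            out[mass_a + mass_b] = out.get(mass_a + mass_b, 0.0) + p_a * p_b
    return out


def isotopic_distribution(formula):
    """Theoretical pattern for a composition such as {"Cu": 1, "Cl": 2}."""
    dist = {0: 1.0}
    for element, count in formula.items():
        for _ in range(count):
            dist = convolve(dist, ISOTOPES[element])
    return dist


def cosine(a, b):
    """Cosine similarity between two sparse patterns."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force(observed, elements, max_count):
    """Score every composition with up to max_count atoms per element."""
    results = []
    for counts in product(range(max_count + 1), repeat=len(elements)):
        if not any(counts):
            continue  # skip the empty composition
        formula = {e: n for e, n in zip(elements, counts) if n}
        score = cosine(observed, isotopic_distribution(formula))
        results.append((score, formula))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Simulated "observed" pattern for CuCl2, used here in place of real data.
observed = isotopic_distribution({"Cu": 1, "Cl": 2})
ranking = brute_force(observed, ["Cu", "Cl"], max_count=3)
best_score, best_formula = ranking[0]
```

The number of compositions grows as (max_count + 1) raised to the number of elements, which is in line with the abstract's point that restrictions on the input data are needed to keep the search space and identification time manageable.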
Affiliation(s)
- Viacheslav V Lebedev
  - A. N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Leninsky Prospect, 31 Building 4, Moscow 119071, Russian Federation
- Daniil I Yarykin
  - A. N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Leninsky Prospect, 31 Building 4, Moscow 119071, Russian Federation
- Aleksey K Buryak
  - A. N. Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Leninsky Prospect, 31 Building 4, Moscow 119071, Russian Federation
5
Bittremieux W, Avalon NE, Thomas SP, Kakhkhorov SA, Aksenov AA, Gomes PWP, Aceves CM, Caraballo-Rodríguez AM, Gauglitz JM, Gerwick WH, Huan T, Jarmusch AK, Kaddurah-Daouk RF, Kang KB, Kim HW, Kondić T, Mannochio-Russo H, Meehan MJ, Melnik AV, Nothias LF, O'Donovan C, Panitchpakdi M, Petras D, Schmid R, Schymanski EL, van der Hooft JJJ, Weldon KC, Yang H, Xing S, Zemlin J, Wang M, Dorrestein PC. Open access repository-scale propagated nearest neighbor suspect spectral library for untargeted metabolomics. Nat Commun 2023; 14:8488. [PMID: 38123557 PMCID: PMC10733301 DOI: 10.1038/s41467-023-44035-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 09/05/2023] [Accepted: 11/28/2023] [Indexed: 12/23/2023] Open
Abstract
Despite the increasing availability of tandem mass spectrometry (MS/MS) community spectral libraries for untargeted metabolomics over the past decade, the majority of acquired MS/MS spectra remain uninterpreted. To further aid in interpreting unannotated spectra, we created a nearest neighbor suspect spectral library, consisting of 87,916 annotated MS/MS spectra derived from hundreds of millions of MS/MS spectra originating from published untargeted metabolomics experiments. Entries in this library, or "suspects," were derived from unannotated spectra that could be linked in a molecular network to an annotated spectrum. Annotations were propagated to unknowns based on structural relationships to reference molecules using MS/MS-based spectrum alignment. We demonstrate the broad relevance of the nearest neighbor suspect spectral library through representative examples of propagation-based annotation of acylcarnitines, bacterial and plant natural products, and drug metabolism. Our results also highlight how the library can help to better understand an Alzheimer's brain phenotype. The nearest neighbor suspect spectral library is openly available for download or for data analysis through the GNPS platform to help investigators hypothesize candidate structures for unknown MS/MS spectra in untargeted metabolomics data.
Affiliation(s)
- Wout Bittremieux
  - Department of Computer Science, University of Antwerp, 2020, Antwerpen, Belgium
- Nicole E Avalon
  - Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, 92093, USA
- Sydney P Thomas
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Sarvar A Kakhkhorov
  - Laboratory of Physical and Chemical Methods of Research, Center for Advanced Technologies, Tashkent, 100174, Uzbekistan
  - Department of Food Science, Faculty of Science, University of Copenhagen, Rolighedsvej 26, 1958, Frederiksberg C, Denmark
- Alexander A Aksenov
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Department of Chemistry, University of Connecticut, Storrs, CT, 06269, USA
  - Arome Science inc., Farmington, CT, 06032, USA
- Paulo Wender P Gomes
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Christine M Aceves
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Andrés Mauricio Caraballo-Rodríguez
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Julia M Gauglitz
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- William H Gerwick
  - Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, 92093, USA
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
- Tao Huan
  - Department of Chemistry, University of British Columbia, Vancouver, BC, V6T 1Z1, Canada
- Alan K Jarmusch
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Immunity, Inflammation, and Disease Laboratory, Division of Intramural Research, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, Durham, NC, 27709, USA
- Rima F Kaddurah-Daouk
  - Department of Psychiatry and Behavioral Sciences, Duke University School of Medicine, Durham, NC, 27701, USA
  - Department of Medicine, Duke University, Durham, NC, 27710, USA
  - Duke Institute of Brain Sciences, Duke University, Durham, NC, 27710, USA
- Kyo Bin Kang
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Sookmyung Women's University, Seoul, 04310, Korea
- Hyun Woo Kim
  - College of Pharmacy and Integrated Research Institute for Drug Development, Dongguk University, Goyang, 10326, Korea
- Todor Kondić
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4367, Belvaux, Luxembourg
- Helena Mannochio-Russo
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Department of Biochemistry and Organic Chemistry, Institute of Chemistry, São Paulo State University, Araraquara, 14800-901, Brazil
- Michael J Meehan
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Alexey V Melnik
  - Department of Chemistry, University of Connecticut, Storrs, CT, 06269, USA
  - Arome Science inc., Farmington, CT, 06032, USA
- Louis-Felix Nothias
  - Université Côte d'Azur, CNRS, ICN, Nice, France
  - Interdisciplinary Institute for Artificial Intelligence (3iA) Côte d'Azur, Nice, France
- Claire O'Donovan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Morgan Panitchpakdi
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Daniel Petras
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Interfaculty Institute of Microbiology and Infection Medicine, University of Tuebingen, 72076, Tuebingen, Germany
  - Department of Biochemistry, University of California Riverside, Riverside, CA, 92507, USA
- Robin Schmid
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Emma L Schymanski
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4367, Belvaux, Luxembourg
- Justin J J van der Hooft
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Bioinformatics Group, Wageningen University & Research, 6708 PB, Wageningen, The Netherlands
- Kelly C Weldon
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Heejung Yang
  - Laboratory of Natural Products Chemistry, College of Pharmacy, Kangwon National University, Chuncheon, 24341, Korea
- Shipei Xing
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Department of Chemistry, University of British Columbia, Vancouver, BC, V6T 1Z1, Canada
- Jasmine Zemlin
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
- Mingxun Wang
  - Department of Computer Science and Engineering, University of California Riverside, Riverside, CA, 92507, USA
- Pieter C Dorrestein
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
6
Prunier G, Cherkaoui M, Lysiak A, Langella O, Blein-Nicolas M, Lollier V, Benoist E, Jean G, Fertin G, Rogniaux H, Tessier D. Fast alignment of mass spectra in large proteomics datasets, capturing dissimilarities arising from multiple complex modifications of peptides. BMC Bioinformatics 2023; 24:421. [PMID: 37940845 PMCID: PMC10631047 DOI: 10.1186/s12859-023-05555-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/13/2023] [Accepted: 10/30/2023] [Indexed: 11/10/2023] Open
Abstract
BACKGROUND In proteomics, the interpretation of mass spectra representing peptides carrying multiple complex modifications remains challenging, as it is difficult to strike a balance between reasonable execution time, a limited number of false positives, and a huge search space allowing any number of modifications without a priori assumptions. The scientific community needs new developments in this area to aid in the discovery of novel post-translational modifications that may play important roles in disease.
RESULTS To make progress on this issue, we implemented SpecGlobX (SpecGlob eXTended to eXperimental spectra), a standalone Java application that quickly determines the best spectral alignments of a (possibly very large) list of Peptide-to-Spectrum Matches (PSMs) provided by any open modification search method, or generated by the user. As input, SpecGlobX reads a file containing spectra in MGF or mzML format and a semicolon-delimited spreadsheet describing the PSMs. SpecGlobX returns the best alignment for each PSM as output, splitting the mass difference between the spectrum and the peptide into one or more shifts while considering the possibility of non-aligned masses (a phenomenon resulting from many situations including neutral losses). SpecGlobX is fast, able to align one million PSMs in about 1.5 min on a standard desktop. First, we recall the foundations of the algorithm and detail how we adapted SpecGlob (the method we previously developed with the same aim, but limited to the interpretation of perfect simulated spectra) to the interpretation of imperfect experimental spectra. Then, we highlight the value of SpecGlobX as a complementary tool downstream of three open modification search methods on a large simulated spectra dataset. Finally, we ran SpecGlobX on a proteome-wide dataset downloaded from PRIDE to demonstrate that SpecGlobX functions just as well on simulated and experimental spectra. We then carefully analyzed a limited set of interpretations.
CONCLUSIONS SpecGlobX is helpful as a decision support tool, providing keys to interpret peptides carrying complex modifications still poorly considered by current open modification search software. Better alignment of PSMs enhances confidence in the identification of spectra provided by open modification search methods and should improve the interpretation rate of spectra.
Affiliation(s)
- Grégoire Prunier
  - INRAE, PROBE Research Infrastructure, BIBS Facility, 44300, Nantes, France
  - INRAE, UR1268 Biopolymères Interactions Assemblages, 44316, Nantes, France
- Mehdi Cherkaoui
  - INRAE, PROBE Research Infrastructure, BIBS Facility, 44300, Nantes, France
  - INRAE, UR1268 Biopolymères Interactions Assemblages, 44316, Nantes, France
- Albane Lysiak
  - INRAE, PROBE Research Infrastructure, BIBS Facility, 44300, Nantes, France
  - Nantes Université, CNRS, LS2N, UMR 6004, 44000, Nantes, France
- Olivier Langella
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, PAPPSO, 91190, Gif-Sur-Yvette, France
- Mélisande Blein-Nicolas
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, PAPPSO, 91190, Gif-Sur-Yvette, France
- Virginie Lollier
  - INRAE, PROBE Research Infrastructure, BIBS Facility, 44300, Nantes, France
  - INRAE, UR1268 Biopolymères Interactions Assemblages, 44316, Nantes, France
- Emile Benoist
  - Nantes Université, CNRS, LS2N, UMR 6004, 44000, Nantes, France
- Géraldine Jean
  - Nantes Université, CNRS, LS2N, UMR 6004, 44000, Nantes, France
- Hélène Rogniaux
  - INRAE, PROBE Research Infrastructure, BIBS Facility, 44300, Nantes, France
  - INRAE, UR1268 Biopolymères Interactions Assemblages, 44316, Nantes, France
- Dominique Tessier
  - INRAE, PROBE Research Infrastructure, BIBS Facility, 44300, Nantes, France
  - INRAE, UR1268 Biopolymères Interactions Assemblages, 44316, Nantes, France
7
Wu L, Hoque A, Lam H. Spectroscape enables real-time query and visualization of a spectral archive in proteomics. Nat Commun 2023; 14:6267. [PMID: 37805652 PMCID: PMC10560257 DOI: 10.1038/s41467-023-42006-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 09/26/2023] [Indexed: 10/09/2023] Open
Abstract
In proteomics, spectral archives organize the enormous amounts of publicly available peptide tandem mass spectra by similarity, offering opportunities for error correction and novel discoveries. Here we adapt an indexing algorithm developed by Facebook for organizing online multimedia resources to tandem mass spectra and achieve practically instantaneous retrieval and clustering of approximate nearest neighbors in a large spectral archive. An interactive web-based graphical user interface enables the user to view a query spectrum in its clustered neighborhood, which facilitates contextual validation of peptide identifications and exploration of the dark proteome.
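As a point of reference for the retrieval task described here, the sketch below shows the exact (brute-force) nearest-neighbor search that an approximate index is meant to speed up. It is not Spectroscape's implementation: the unit-width m/z binning, the function names, and the toy spectra are assumptions made for illustration, and no approximate indexing library is used.

```python
from math import sqrt


def vectorize(peaks, bin_width=1.0):
    """Bin (m/z, intensity) peaks into a sparse, unit-length vector."""
    vec = {}
    for mz, intensity in peaks:
        b = int(mz / bin_width)
        vec[b] = vec.get(b, 0.0) + intensity
    norm = sqrt(sum(v * v for v in vec.values()))
    return {b: v / norm for b, v in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(b, 0.0) for b, w in u.items())


def nearest_neighbors(query, archive, k=2):
    """Linear scan over the archive; an approximate index would replace this loop."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(peaks)), name)
              for name, peaks in archive.items()]
    scored.sort(reverse=True)
    return scored[:k]


# Toy archive of three spectra, each a list of (m/z, intensity) pairs.
archive = {
    "a": [(100.2, 10.0), (200.4, 5.0)],
    "b": [(100.3, 9.0), (200.1, 6.0)],
    "c": [(300.0, 10.0)],
}
neighbors = nearest_neighbors([(100.1, 10.0), (200.2, 5.0)], archive, k=2)
```

The linear scan costs time proportional to the archive size for every query, which is why an approximate nearest-neighbor index becomes attractive for archives with very large numbers of spectra.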
Affiliation(s)
- Long Wu
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
  - Department of Electrical and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
- Ayman Hoque
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
- Henry Lam
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
8
Arab I, Fondrie WE, Laukens K, Bittremieux W. Semisupervised Machine Learning for Sensitive Open Modification Spectral Library Searching. J Proteome Res 2023; 22:585-593. [PMID: 36688569 DOI: 10.1021/acs.jproteome.2c00616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2023]
Abstract
A key analysis task in mass spectrometry proteomics is matching the acquired tandem mass spectra to their originating peptides by sequence database searching or spectral library searching. Machine learning is an increasingly popular postprocessing approach to maximize the number of confident spectrum identifications that can be obtained at a given false discovery rate threshold. Here, we have integrated semisupervised machine learning in the ANN-SoLo tool, an efficient spectral library search engine that is optimized for open modification searching to identify peptides with any type of post-translational modification. We show that machine learning rescoring boosts the number of spectra that can be identified for both standard searching and open searching, and we provide insights into relevant spectrum characteristics harnessed by the machine learning model. The semisupervised machine learning functionality has now been fully integrated into ANN-SoLo, which is available as open source under the permissive Apache 2.0 license on GitHub at https://github.com/bittremieux/ANN-SoLo.
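The phrase "number of confident spectrum identifications at a given false discovery rate threshold" can be made concrete with a generic target-decoy sketch. This is a common estimation scheme and not necessarily the exact procedure used in ANN-SoLo; the function name and the toy scores are hypothetical.

```python
def accepted_at_fdr(psms, fdr_threshold):
    """Count target PSMs accepted at the given FDR threshold.

    psms is a list of (score, is_decoy) pairs, higher scores being better.
    The FDR at each score cutoff is estimated as decoys / targets, and the
    largest accepted set whose estimate stays within the threshold is kept.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr_threshold:
            best = targets
    return best


# Toy PSM list: (score, is_decoy).
psms = [
    (10.0, False), (9.0, False), (8.0, False), (7.0, True),
    (6.0, False), (5.0, False), (4.0, True), (3.0, True),
]
n_strict = accepted_at_fdr(psms, 0.0)
n_relaxed = accepted_at_fdr(psms, 0.25)
```

Rescoring changes the score column only; if the new scores push decoys further down the ranking, more targets fit under the same threshold, which is the effect the abstract reports.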
Affiliation(s)
- Issar Arab
  - Department of Computer Science, University of Antwerp, 2020 Antwerp, Belgium
  - Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
- Kris Laukens
  - Department of Computer Science, University of Antwerp, 2020 Antwerp, Belgium
  - Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
- Wout Bittremieux
  - Department of Computer Science, University of Antwerp, 2020 Antwerp, Belgium
  - Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
9
Dorl S, Winkler S, Mechtler K, Dorfer V. MS Ana: Improving Sensitivity in Peptide Identification with Spectral Library Search. J Proteome Res 2023; 22:462-470. [PMID: 36688604 PMCID: PMC9903325 DOI: 10.1021/acs.jproteome.2c00658] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 01/24/2023]
Abstract
Spectral library search can enable more sensitive peptide identification in tandem mass spectrometry experiments. However, its drawbacks are the limited availability of high-quality libraries and the added difficulty of creating decoy spectra for result validation. We describe MS Ana, a new spectral library search engine that enables high sensitivity peptide identification using either curated or predicted spectral libraries as well as robust false discovery control through its own decoy library generation algorithm. MS Ana identifies on average 36% more spectrum matches and 4% more proteins than database search in a benchmark test on single-shot human cell-line data. Further, we demonstrate the quality of the result validation with tests on synthetic peptide pools and show the importance of library selection through a comparison of library search performance with different configurations of publicly available human spectral libraries.
Affiliation(s)
- Sebastian Dorl
  - University of Applied Sciences Upper Austria, Bioinformatics Research Group, Softwarepark 11, 4232 Hagenberg, Austria
  - Department of Computer Science, Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Austria
  - E-mail: . Phone: +43 (0) 50804 27145
- Stephan Winkler
  - University of Applied Sciences Upper Austria, Bioinformatics Research Group, Softwarepark 11, 4232 Hagenberg, Austria
  - Department of Computer Science, Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Austria
- Karl Mechtler
  - Research Institute of Molecular Pathology (IMP), Protein Chemistry, Campus-Vienna-Biocenter 1, 1030 Vienna, Austria
  - Institute of Molecular Biotechnology (IMBA), Protein Chemistry, Vienna Biocenter (VBC), Dr. Bohr-Gasse 3, 1030 Vienna, Austria
  - Gregor Mendel Institute of Molecular Plant Biology of the Austrian Academy of Sciences (GMI), Dr. Bohr-Gasse 3, 1030 Vienna, Austria
- Viktoria Dorfer
  - University of Applied Sciences Upper Austria, Bioinformatics Research Group, Softwarepark 11, 4232 Hagenberg, Austria
  - E-mail: . Phone: +43 (0) 50804 22740
|
10
|
Bittremieux W, Wang M, Dorrestein PC. The critical role that spectral libraries play in capturing the metabolomics community knowledge. Metabolomics 2022; 18:94. [PMID: 36409434 PMCID: PMC10284100 DOI: 10.1007/s11306-022-01947-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 08/01/2022] [Accepted: 10/19/2022] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Spectral library searching is currently the most common approach for compound annotation in untargeted metabolomics. Spectral libraries applicable to liquid chromatography mass spectrometry have grown in size over the past decade to include hundreds of thousands to millions of mass spectra and tens of thousands of compounds, forming an essential knowledge base for the interpretation of metabolomics experiments. AIM OF REVIEW: We describe existing spectral library resources, highlight different strategies for compiling spectral libraries, and discuss quality considerations that should be taken into account when interpreting spectral library searching results. Finally, we describe how spectral libraries are empowering the next generation of machine learning tools in computational metabolomics, and discuss several opportunities for using increasingly accessible large spectral libraries. KEY SCIENTIFIC CONCEPTS OF REVIEW: This review focuses on the current state of spectral libraries for untargeted LC-MS/MS based metabolomics. We show how the number of entries in publicly accessible spectral libraries has increased more than 60-fold in the past eight years to aid molecular interpretation and we discuss how the role of spectral libraries in untargeted metabolomics will evolve in the near future.
Affiliation(s)
- Wout Bittremieux
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
- Mingxun Wang
  - Department of Computer Science, University of California Riverside, Riverside, CA, 92507, USA
- Pieter C Dorrestein
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, CA, 92093, USA
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, 92093, USA
|
11
|
Adams C, Boonen K, Laukens K, Bittremieux W. Open Modification Searching of SARS-CoV-2-Human Protein Interaction Data Reveals Novel Viral Modification Sites. Mol Cell Proteomics 2022; 21:100425. [PMID: 36241021 PMCID: PMC9554009 DOI: 10.1016/j.mcpro.2022.100425] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/16/2022] [Revised: 09/18/2022] [Accepted: 10/09/2022] [Indexed: 01/18/2023]
Abstract
The outbreak of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of the coronavirus 2019 disease, has led to an ongoing global pandemic since 2019. Mass spectrometry can be used to understand the molecular mechanisms of viral infection by SARS-CoV-2, for example, by determining virus-host protein-protein interactions through which SARS-CoV-2 hijacks its human hosts during infection, and to study the role of post-translational modifications. We have reanalyzed public affinity purification-mass spectrometry data using open modification searching to investigate the presence of post-translational modifications in the context of the SARS-CoV-2 virus-host protein-protein interaction network. Based on an over twofold increase in identified spectra, our detected protein interactions show a high overlap with independent mass spectrometry-based SARS-CoV-2 studies and virus-host interactions for alternative viruses, as well as previously unknown protein interactions. In addition, we identified several novel modification sites on SARS-CoV-2 proteins that we investigated in relation to their interactions with host proteins. A detailed analysis of relevant modifications, including phosphorylation, ubiquitination, and S-nitrosylation, provides important hypotheses about the functional role of these modifications during viral infection by SARS-CoV-2.
Affiliation(s)
- Charlotte Adams
  - Department of Computer Science, University of Antwerp, Antwerp, Belgium
  - Centre for Proteomics (CFP), University of Antwerp, Antwerp, Belgium
- Kurt Boonen
  - Centre for Proteomics (CFP), University of Antwerp, Antwerp, Belgium
  - Sustainable Health Department, Flemish Institute for Technological Research (VITO), Antwerp, Belgium
- Kris Laukens
  - Department of Computer Science, University of Antwerp, Antwerp, Belgium
- Wout Bittremieux
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, USA
  - For correspondence: Wout Bittremieux
|
12
|
Bittremieux W, Schmid R, Huber F, van der Hooft JJJ, Wang M, Dorrestein PC. Comparison of Cosine, Modified Cosine, and Neutral Loss Based Spectrum Alignment For Discovery of Structurally Related Molecules. J Am Soc Mass Spectrom 2022; 33:1733-1744. [PMID: 35960544 DOI: 10.1021/jasms.2c00153] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Indexed: 06/15/2023]
Abstract
Spectrum alignment of tandem mass spectrometry (MS/MS) data using the modified cosine similarity and subsequent visualization as molecular networks have been demonstrated to be a useful strategy to discover analogs of molecules from untargeted MS/MS-based metabolomics experiments. Recently, a neutral loss matching approach has been introduced as an alternative to MS/MS-based molecular networking with an implied performance advantage in finding analogs that cannot be discovered using existing MS/MS spectrum alignment strategies. To comprehensively evaluate the scoring properties of neutral loss matching, the cosine similarity, and the modified cosine similarity, similarity measures of 955 228 peptide MS/MS spectrum pairs and 10 million small molecule MS/MS spectrum pairs were compared. This comparative analysis revealed that the modified cosine similarity outperformed neutral loss matching and the cosine similarity in all cases. The data further indicated that the performance of MS/MS spectrum alignment depends on the location and type of the modification, as well as the chemical compound class of fragmented molecules.
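As a rough illustration of two of the scores compared in this abstract, the following Python sketch computes a cosine similarity over matched peaks, and a modified cosine when a precursor mass difference is supplied. The function name, the 0.02 m/z tolerance, the toy spectra, and the greedy peak assignment are assumptions made for illustration; they are not taken from the paper, whose implementation may differ (for example by using an optimal peak assignment).

```python
import math


def cosine_similarity(spec_a, spec_b, tol=0.02, shift=0.0):
    """Greedy cosine score between two peak lists [(mz, intensity), ...].

    With shift == 0.0 only peaks at (nearly) the same m/z can match
    (standard cosine). With a non-zero shift, set to the precursor m/z
    difference (a minus b), peaks offset by that amount may also match,
    which is the idea behind the modified cosine.
    """
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Collect every admissible peak pair with its intensity product.
    candidates = []
    for ia, (mz_a, int_a) in enumerate(spec_a):
        for ib, (mz_b, int_b) in enumerate(spec_b):
            delta = mz_a - mz_b
            if abs(delta) <= tol or (shift and abs(delta - shift) <= tol):
                candidates.append((int_a * int_b, ia, ib))
    # Greedily take the largest products, using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    score = 0.0
    for product, ia, ib in candidates:
        if ia not in used_a and ib not in used_b:
            used_a.add(ia)
            used_b.add(ib)
            score += product
    return score / (norm_a * norm_b)


# Toy spectra (hypothetical): spectrum a carries a +14 shift on its second peak.
a = [(100.0, 1.0), (214.0, 2.0)]
b = [(100.0, 1.0), (200.0, 2.0)]
plain = cosine_similarity(a, b)                 # only the 100.0 peaks match
modified = cosine_similarity(a, b, shift=14.0)  # the shifted peak matches too
```

In this toy case the modified score is higher than the plain cosine because the shifted peak pair contributes to the match, which is the behaviour that makes the modified cosine useful for finding analogs.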
Affiliation(s)
- Wout Bittremieux
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, California 92093, United States
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
- Robin Schmid
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, California 92093, United States
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
- Florian Huber
  - Centre for Digitalization and Digitality, University of Applied Sciences, 40476 Düsseldorf, Germany
- Justin J J van der Hooft
  - Bioinformatics Group, Wageningen University, 6708PB Wageningen, The Netherlands
  - Department of Biochemistry, University of Johannesburg, Auckland Park, Johannesburg 2006, South Africa
- Mingxun Wang
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, California 92093, United States
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
- Pieter C Dorrestein
  - Collaborative Mass Spectrometry Innovation Center, University of California San Diego, La Jolla, California 92093, United States
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States
|
13
|
Bittremieux W, May DH, Bilmes J, Noble WS. A learned embedding for efficient joint analysis of millions of mass spectra. Nat Methods 2022; 19:675-678. [DOI: 10.1038/s41592-022-01496-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/29/2018] [Accepted: 04/14/2022] [Indexed: 11/09/2022]
|
14
|
Shiferaw GA, Gabriels R, Bouwmeester R, Van Den Bossche T, Vandermarliere E, Martens L, Volders PJ. Sensitive and Specific Spectral Library Searching with CompOmics Spectral Library Searching Tool and Percolator. J Proteome Res 2022; 21:1365-1370. [PMID: 35446579 DOI: 10.1021/acs.jproteome.2c00075] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022]
Abstract
Maintaining high sensitivity while limiting false positives is a key challenge in peptide identification from mass spectrometry data. Here, we investigate the effects of integrating the machine learning-based postprocessor Percolator into our spectral library searching tool COSS (CompOmics Spectral library Searching tool). To evaluate the effects of this postprocessing, we have used 40 data sets from 2 different projects and have searched these against the NIST and MassIVE spectral libraries. The searching is carried out using 2 spectral library search tools, COSS and MSPepSearch with and without Percolator postprocessing, and using sequence database search engine MS-GF+ as a baseline comparator. The addition of the Percolator rescoring step to COSS is effective and results in a substantial improvement in sensitivity and specificity of the identifications. COSS is freely available as open source under the permissive Apache2 license, and binaries and source code are found at https://github.com/compomics/COSS.
Affiliation(s)
- Genet Abay Shiferaw
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
- Ralf Gabriels
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
- Tim Van Den Bossche
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
- Elien Vandermarliere
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
- Pieter-Jan Volders
  - VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
  - Cancer Research Institute Ghent, Ghent University, 9000 Ghent, Belgium
|
15
|
Berger MT, Hemmler D, Diederich P, Rychlik M, Marshall JW, Schmitt-Kopplin P. Open Search of Peptide Glycation Products from Tandem Mass Spectra. Anal Chem 2022; 94:5953-5961. [PMID: 35389626 DOI: 10.1021/acs.analchem.2c00388] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Identification of chemically modified peptides in mass spectrometry (MS)-based glycation studies is a crucial yet challenging task. There is a need to establish a mode for matching tandem mass spectrometry (MS/MS) data, allowing for both known and unknown peptide glycation modifications. We present an open search approach that uses classic and modified peptide fragment ions. The latter are shifted by the mass delta of the modification. Both provide key structural information that can be used to assess the peptide core structure of the glycation product. We also leverage redundant neutral losses from the modification side chain, introducing a third ion class for matching referred to as characteristic fragment ions. We demonstrate that peptide glycation product MS/MS spectra contain multidimensional information and that most often, more than half of the spectral information is ignored if no attempt is made to use a multi-step matching algorithm. Compared to regular and/or modified peptide ion matching, our triple-ion strategy significantly increased the median interpretable fraction of the glycation product MS/MS spectra. For reference, we apply our approach for Amadori product characterization and identify all established diagnostic ions automatically. We further show how this method effectively applies the open search concept and allows for optimized elucidation of unknown structures by presenting two hitherto undescribed peptide glycation modifications with a delta mass of 102.0311 and 268.1768 Da. We characterize their fragmentation signature by integration with isotopically labeled glycation products, which provides high validity for non-targeted structure identification.
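The idea of matching both classic and mass-shifted fragment ions can be sketched in Python as follows. This is only an illustration of the shift principle stated in the abstract, restricted to singly charged b-ions; the function name, the 0-based `mod_index` parameter, the example peptide, and the small residue mass table are assumptions and do not reproduce the authors' matching algorithm.

```python
PROTON = 1.007276  # mass of a proton in Da

# Monoisotopic residue masses (Da) for a small, illustrative subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "V": 99.06841,
    "L": 113.08406,
    "K": 128.09496,
}


def b_ions(peptide, mod_index=None, mass_delta=0.0):
    """Singly charged b-ion m/z values (b1 .. b(n-1)) for a peptide.

    If mod_index (0-based residue position) is given, every b-ion that
    contains that residue is shifted by mass_delta, giving the "modified"
    fragment ions; the remaining ions are the "classic" ones.
    """
    ions = []
    running = PROTON
    for i, residue in enumerate(peptide[:-1]):
        running += RESIDUE_MASS[residue]
        contains_mod = mod_index is not None and i >= mod_index
        ions.append(running + (mass_delta if contains_mod else 0.0))
    return ions


# Hypothetical peptide "GAK" carrying the 102.0311 Da delta mass on its second residue.
classic = b_ions("GAK")
shifted = b_ions("GAK", mod_index=1, mass_delta=102.0311)
```

Comparing an observed spectrum against both lists shows which fragments localize the modification: b1 is identical in both, whereas b2 only matches in its shifted form.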
Affiliation(s)
- Michelle T Berger
  - Chair of Analytical Food Chemistry, Technical University Munich, Maximus-von-Imhof-Forum 2, 85354 Freising, Germany
  - Research Unit Analytical BioGeoChemistry (BGC), Helmholtz Zentrum München, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
- Daniel Hemmler
  - Chair of Analytical Food Chemistry, Technical University Munich, Maximus-von-Imhof-Forum 2, 85354 Freising, Germany
  - Research Unit Analytical BioGeoChemistry (BGC), Helmholtz Zentrum München, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
- Philippe Diederich
  - Research Unit Analytical BioGeoChemistry (BGC), Helmholtz Zentrum München, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
- Michael Rychlik
  - Chair of Analytical Food Chemistry, Technical University Munich, Maximus-von-Imhof-Forum 2, 85354 Freising, Germany
- James W Marshall
  - The Waltham Petcare Science Institute, Mars Petcare UK, Waltham-on-the-Wolds, Leicestershire LE14 4RT, United Kingdom
- Philippe Schmitt-Kopplin
  - Chair of Analytical Food Chemistry, Technical University Munich, Maximus-von-Imhof-Forum 2, 85354 Freising, Germany
  - Research Unit Analytical BioGeoChemistry (BGC), Helmholtz Zentrum München, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
|
16
|
Altenburg T, Giese SH, Wang S, Muth T, Renard BY. Ad hoc learning of peptide fragmentation from mass spectra enables an interpretable detection of phosphorylated and cross-linked peptides. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00467-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/09/2022]
Abstract
Mass spectrometry-based proteomics provides a holistic snapshot of the entire protein set of living cells on a molecular level. Currently, only a few deep learning approaches exist that involve peptide fragmentation spectra, which represent partial sequence information of proteins. Commonly, these approaches lack the ability to characterize less studied or even unknown patterns in spectra because of their use of explicit domain knowledge. Here, to elevate unrestricted learning from spectra, we introduce ‘ad hoc learning of fragmentation’ (AHLF), a deep learning model that is end-to-end trained on 19.2 million spectra from several phosphoproteomic datasets. AHLF is interpretable, and we show that peak-level feature importance values and pairwise interactions between peaks are in line with corresponding peptide fragments. We demonstrate our approach by detecting post-translational modifications, specifically protein phosphorylation based on only the fragmentation spectrum without a database search. AHLF increases the area under the receiver operating characteristic curve (AUC) by an average of 9.4% on recent phosphoproteomic data compared with the current state of the art on this task. Furthermore, use of AHLF in rescoring search results increases the number of phosphopeptide identifications by a margin of up to 15.1% at a constant false discovery rate. To show the broad applicability of AHLF, we use transfer learning to also detect cross-linked peptides, as used in protein structure analysis, with an AUC of up to 94%.
|
17
|
Bouwmeester R, Gabriels R, Hulstaert N, Martens L, Degroeve S. DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nat Methods 2021; 18:1363-1369. [PMID: 34711972 DOI: 10.1038/s41592-021-01301-5] [Citation(s) in RCA: 85] [Impact Index Per Article: 28.3] [Received: 04/15/2020] [Accepted: 09/13/2021] [Indexed: 11/09/2022]
Abstract
The inclusion of peptide retention time prediction promises to remove peptide identification ambiguity in complex liquid chromatography-mass spectrometry identification workflows. However, due to the way peptides are encoded in current prediction models, accurate retention times cannot be predicted for modified peptides. This is especially problematic for fledgling open searches, which will benefit from accurate retention time prediction for modified peptides to reduce identification ambiguity. We present DeepLC, a deep learning peptide retention time predictor using peptide encoding based on atomic composition that allows the retention time of (previously unseen) modified peptides to be predicted accurately. We show that DeepLC performs similarly to current state-of-the-art approaches for unmodified peptides and, more importantly, accurately predicts retention times for modifications not seen during training. Moreover, we show that DeepLC's ability to predict retention times for any modification enables potentially incorrect identifications to be flagged in an open search of a wide variety of proteome data.
Affiliation(s)
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
- Ralf Gabriels
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
- Niels Hulstaert
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
- Sven Degroeve
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
|
18
|
Tariq MU, Saeed F. SpeCollate: Deep cross-modal similarity network for mass spectrometry data based peptide deductions. PLoS One 2021; 16:e0259349. [PMID: 34714871 PMCID: PMC8555789 DOI: 10.1371/journal.pone.0259349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/10/2021] [Accepted: 10/18/2021] [Indexed: 11/19/2022]
Abstract
Historically, the database search algorithms have been the de facto standard for inferring peptides from mass spectrometry (MS) data. Database search algorithms deduce peptides by transforming theoretical peptides into theoretical spectra and matching them to the experimental spectra. Heuristic similarity-scoring functions are used to match an experimental spectrum to a theoretical spectrum. However, the heuristic nature of the scoring functions and the simple transformation of the peptides into theoretical spectra, along with noisy mass spectra for the less abundant peptides, can introduce a cascade of inaccuracies. In this paper, we design and implement a Deep Cross-Modal Similarity Network called SpeCollate, which overcomes these inaccuracies by learning the similarity function between experimental spectra and peptides directly from the labeled MS data. SpeCollate transforms spectra and peptides into a shared Euclidean subspace by learning fixed size embeddings for both. Our proposed deep-learning network trains on sextuplets of positive and negative examples coupled with our custom-designed SNAP-loss function. Online hardest negative mining is used to select the appropriate negative examples for optimal training performance. We use 4.8 million sextuplets obtained from the NIST and MassIVE peptide libraries to train the network and demonstrate that for closed search, SpeCollate is able to perform better than Crux and MSFragger in terms of the number of peptide-spectrum matches (PSMs) and unique peptides identified under 1% FDR for real-world data. SpeCollate also identifies a large number of peptides not reported by either Crux or MSFragger. To the best of our knowledge, our proposed SpeCollate is the first deep-learning network that can determine the cross-modal similarity between peptides and mass-spectra for MS-based proteomics. We believe SpeCollate is significant progress towards developing machine-learning solutions for MS-based omics data analysis. 
SpeCollate is available at https://deepspecs.github.io/.
Affiliation(s)
- Muhammad Usman Tariq
  - School of Computing & Information Sciences, Florida International University, Miami, FL, United States of America
- Fahad Saeed
  - School of Computing & Information Sciences, Florida International University, Miami, FL, United States of America
|
19
|
Abstract
The volume of proteomics and mass spectrometry data available in public repositories continues to grow at a rapid pace as more researchers embrace open science practices. Open access to the data behind scientific discoveries has become critical to validate published findings and develop new computational tools. Here, we present ppx, a Python package that provides easy, programmatic access to the data stored in ProteomeXchange repositories, such as PRIDE and MassIVE. The ppx package can be used as either a command line tool or a Python package to retrieve the files and metadata associated with a project when provided its identifier. To demonstrate how ppx enhances reproducible research, we used ppx within a Snakemake workflow to reanalyze a published data set with the open modification search tool ANN-SoLo and compared our reanalysis to the original results. We show that ppx readily integrates into workflows, and our reanalysis produced results consistent with the original analysis. We envision that ppx will be a valuable tool for creating reproducible analyses, providing tool developers easy access to data for development, testing, and benchmarking, and enabling the use of mass spectrometry data in data-intensive analyses. The ppx package is freely available and open source under the MIT license at https://github.com/wfondrie/ppx.
Affiliation(s)
- William E Fondrie
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Wout Bittremieux
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA, USA
  - Department of Computer Science, University of Antwerp, Antwerp, Belgium
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
|
20
|
Decoding post translational modification crosstalk with proteomics. Mol Cell Proteomics 2021; 20:100129. [PMID: 34339852 PMCID: PMC8430371 DOI: 10.1016/j.mcpro.2021.100129] [Citation(s) in RCA: 105] [Impact Index Per Article: 35.0] [Received: 03/23/2021] [Revised: 07/06/2021] [Accepted: 07/27/2021] [Indexed: 12/12/2022]
Abstract
Post-translational modification (PTM) of proteins allows cells to regulate protein functions, transduce signals and respond to perturbations. PTMs expand protein functionality and diversity, which leads to increased proteome complexity. PTM crosstalk describes the combinatorial action of multiple PTMs on the same or on different proteins for higher order regulation. Here we review how recent advances in proteomic technologies, mass spectrometry instrumentation, and bioinformatics spurred the proteome-wide identification of PTM crosstalk through measurements of PTM sites. We provide an overview of the basic modes of PTM crosstalk, the proteomic methods to elucidate PTM crosstalk, and approaches that can inform about the functional consequences of PTM crosstalk. Description of basic modules and different modes of PTM crosstalk. Overview of current proteomic methods to identify and infer PTM crosstalk. Discussion of large-scale approaches to characterize functional PTM crosstalk. Future directions and potential proteomic methods for elucidating PTM crosstalk.
|
21
|
Bittremieux W, Laukens K, Noble WS, Dorrestein PC. Large-scale tandem mass spectrum clustering using fast nearest neighbor searching. Rapid Commun Mass Spectrom 2021:e9153. [PMID: 34169593 PMCID: PMC8709870 DOI: 10.1002/rcm.9153] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 02/05/2021] [Revised: 06/21/2021] [Accepted: 06/21/2021] [Indexed: 05/27/2023]
Abstract
RATIONALE: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. In this study, we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra. METHODS: falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters. RESULTS: Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome data set consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing. CONCLUSIONS: falcon is a highly efficient spectrum clustering tool, which is publicly available as an open source under the permissive BSD license at https://github.com/bittremieux/falcon.
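The first step described in the methods, binning a spectrum and converting it to a low-dimensional vector by feature hashing, can be sketched in Python as below. This is a minimal illustration and not falcon's code: the function name, the 0.05 m/z bin width, the vector dimension of 800, and the use of `zlib.crc32` as the hash function are all assumptions standing in for whatever the tool actually uses.

```python
import math
import zlib


def hash_spectrum(peaks, bin_width=0.05, dim=800):
    """Convert a peak list [(mz, intensity), ...] to a fixed-length unit vector.

    Each peak is assigned to an m/z bin; the bin index is hashed to one of
    `dim` slots and the intensity is added there. Colliding bins simply
    share a slot, which is the trade-off feature hashing accepts in
    exchange for a small, fixed vector size.
    """
    vector = [0.0] * dim
    for mz, intensity in peaks:
        bin_index = int(mz / bin_width)
        slot = zlib.crc32(str(bin_index).encode("ascii")) % dim
        vector[slot] += intensity
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        return vector
    return [v / norm for v in vector]


# Hypothetical peak list; the resulting vector has length `dim` and unit norm.
example_vector = hash_spectrum([(100.01, 5.0), (250.30, 3.0), (512.77, 1.5)])
```

Because the vectors are normalized, the dot product of two hashed spectra approximates their cosine similarity, which is what makes them suitable input for a nearest neighbor index.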
Affiliation(s)
- Wout Bittremieux
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States
  - Department of Computer Science, University of Antwerp, Antwerp, Belgium
- Kris Laukens
  - Department of Computer Science, University of Antwerp, Antwerp, Belgium
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States
- Pieter C Dorrestein
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States
|
22
|
Dorfer V, Strobl M, Winkler S, Mechtler K. MS Amanda 2.0: Advancements in the standalone implementation. Rapid Commun Mass Spectrom 2021; 35:e9088. [PMID: 33759252 PMCID: PMC8244010 DOI: 10.1002/rcm.9088] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/30/2020] [Revised: 02/27/2021] [Accepted: 03/18/2021] [Indexed: 06/12/2023]
Abstract
RATIONALE: Database search engines are the preferred method to identify peptides in mass spectrometry data. However, valuable software is in this context not only defined by a powerful algorithm to separate correct from false identifications, but also by constant maintenance and continuous improvements. METHODS: In 2014, we presented our peptide identification algorithm MS Amanda, showing its suitability for identifying peptides in high-resolution tandem mass spectrometry data and its ability to outperform widely used tools to identify peptides. Since then, we have continuously worked on improvements to enhance its usability and to support new trends and developments in this fast-growing field, while keeping the original scoring algorithm to assess the quality of a peptide spectrum match unchanged. RESULTS: We present the outcome of these efforts, MS Amanda 2.0, a faster and more flexible standalone version with the original scoring algorithm. The new implementation has led to a 3-5× speedup, is able to handle new ion types and supports standard data formats. We also show that MS Amanda 2.0 works best when using only the most common ion types in a particular search instead of all possible ion types. CONCLUSIONS: MS Amanda is available free of charge from https://ms.imp.ac.at/index.php?action=msamanda.
Affiliation(s)
- Viktoria Dorfer
  - Bioinformatics Research Group, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria
- Marina Strobl
  - Bioinformatics Research Group, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria
- Stephan Winkler
  - Bioinformatics Research Group, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria
- Karl Mechtler
  - Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Campus-Vienna-Biocenter 1, 1030 Vienna, Austria
  - Institute of Molecular Biotechnology (IMBA), Austrian Academy of Sciences, Vienna BioCenter (VBC), Dr. Bohr-Gasse 3, 1030 Vienna, Austria
  - Gregor Mendel Institute (GMI), Austrian Academy of Sciences, Vienna BioCenter (VBC), Dr. Bohr-Gasse 3, 1030 Vienna, Austria
|
23
|
Salz R, Bouwmeester R, Gabriels R, Degroeve S, Martens L, Volders PJ, 't Hoen PAC. Personalized Proteome: Comparing Proteogenomics and Open Variant Search Approaches for Single Amino Acid Variant Detection. J Proteome Res 2021; 20:3353-3364. [PMID: 33998808 PMCID: PMC8280751 DOI: 10.1021/acs.jproteome.1c00264] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/02/2021] [Indexed: 12/30/2022]
Abstract
Discovery of variant peptides such as a single amino acid variant (SAAV) in shotgun proteomics data is essential for personalized proteomics. Both the resolution of shotgun proteomics methods and the search engines have improved dramatically, allowing for confident identification of SAAV peptides. However, it is not yet known if these methods are truly successful in accurately identifying SAAV peptides without prior genomic information in the search database. We studied this in unprecedented detail by exploiting publicly available long-read RNA sequences and shotgun proteomics data from the gold standard reference cell line NA12878. Searching spectra from this cell line with the state-of-the-art open modification search engine ionbot against carefully curated search databases resulted in 96.7% false-positive SAAVs and an 85% lower true positive rate than searching with peptide search databases that incorporate prior genetic information. While adding genetic variants to the search database remains indispensable for correct peptide identification, inclusion of long-read RNA sequences in the search database contributes only 0.3% new peptide identifications. These findings reveal the differences in SAAV detection that result from various approaches, providing guidance to researchers studying SAAV peptides and developers of peptide spectrum identification tools.
Affiliation(s)
- Renee Salz
  - Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen 6525 GA, The Netherlands
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Ralf Gabriels
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Sven Degroeve
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Pieter-Jan Volders
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Peter A C 't Hoen
  - Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen 6525 GA, The Netherlands
|
24
|
Smythers AL, Hicks LM. Mapping the plant proteome: tools for surveying coordinating pathways. Emerg Top Life Sci 2021; 5:203-220. [PMID: 33620075 PMCID: PMC8166341 DOI: 10.1042/etls20200270] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/04/2021] [Revised: 02/07/2021] [Accepted: 02/09/2021] [Indexed: 12/14/2022]
Abstract
Plants rapidly respond to environmental fluctuations through coordinated, multi-scalar regulation, enabling complex reactions despite their inherently sessile nature. In particular, protein post-translational signaling and protein-protein interactions combine to manipulate cellular responses and regulate plant homeostasis with precise temporal and spatial control. Understanding these proteomic networks is essential to addressing ongoing global crises, including those of food security, rising global temperatures, and the need for renewable materials and fuels. Technological advances in mass spectrometry-based proteomics are enabling investigations of unprecedented depth, and are increasingly being optimized for and applied to plant systems. This review highlights recent advances in plant proteomics, with an emphasis on spatially and temporally resolved analysis of post-translational modifications and protein interactions. It also details the need to generate a comprehensive plant cell atlas while highlighting recent accomplishments within the field.
Affiliation(s)
- Amanda L Smythers
  - Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, U.S.A.
- Leslie M Hicks
  - Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, U.S.A.
|
25
|
Lysiak A, Fertin G, Jean G, Tessier D. Evaluation of open search methods based on theoretical mass spectra comparison. BMC Bioinformatics 2021; 22:65. [PMID: 33902435 PMCID: PMC8073971 DOI: 10.1186/s12859-021-03963-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/16/2020] [Accepted: 01/08/2021] [Indexed: 11/17/2022] Open
Abstract
Background: Mass spectrometry remains the method of choice to characterize proteins. Nevertheless, most of the spectra generated by an experiment remain unidentified after their analysis, mostly because of the modifications they carry. Open Modification Search (OMS) methods offer a promising answer to this problem. However, assessing the quality of OMS identifications remains a difficult task. Methods: Aiming at a better understanding of the relationship between (1) the similarity of pairs of spectra provided by OMS methods and (2) the relevance of their corresponding peptide sequences, we used a dataset composed of theoretical spectra only, to which we applied two OMS strategies. We also introduced two appropriately defined measures for evaluating the above-mentioned spectra/sequence relevance in this context: one is a color classification representing the level of difficulty of retrieving the proper sequence of the peptide that generated the identified spectrum; the other, called LIPR, is the proportion of common masses, in a given Peptide Spectrum Match (PSM), that represent dissimilar sequences. These two measures were also considered in conjunction with the False Discovery Rate (FDR). Results: According to our measures, the strategy that selects the best candidate by taking the mass difference between two spectra into account yields better quality results. Moreover, although the FDR remains an interesting indicator in OMS methods (as shown by LIPR), it is questionable: our color classification shows that a non-negligible proportion of relevant spectra/sequence interpretations corresponds to PSMs coming from the decoy database. Conclusions: The three above-mentioned measures allowed us to clearly determine which of the two studied OMS strategies outperformed the other, both in terms of number of identifications and of accuracy of these identifications. Even though quality evaluation of PSMs in OMS methods remains challenging, the study of theoretical spectra is a favorable framework for going further in this direction.
Affiliation(s)
- Albane Lysiak
  - CNRS, LS2N, Université de Nantes, 44000, Nantes, France
  - UR BIA, INRAE, 44316, Nantes, France
- Dominique Tessier
  - BIBS Facility, INRAE, 44316, Nantes, France
  - UR BIA, INRAE, 44316, Nantes, France
|
26
|
Ivanov MV, Bubis JA, Gorshkov V, Abdrakhimov DA, Kjeldsen F, Gorshkov MV. Boosting MS1-only Proteomics with Machine Learning Allows 2000 Protein Identifications in Single-Shot Human Proteome Analysis Using 5 min HPLC Gradient. J Proteome Res 2021; 20:1864-1873. [PMID: 33720732 DOI: 10.1021/acs.jproteome.0c00863] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 01/04/2023]
Abstract
Proteome-wide analyses rely on tandem mass spectrometry and the extensive separation of proteolytic mixtures. This imposes considerable instrumental time consumption, which is one of the main obstacles in the broader acceptance of proteomics in biomedical and clinical research. Recently, we presented a fast proteomic method termed DirectMS1 based on ultrashort LC gradients as well as MS1-only mass spectra acquisition and data processing. The method allows significant reduction of the proteome-wide analysis time to a few minutes at the depth of quantitative proteome coverage of 1000 proteins at 1% false discovery rate (FDR). In this work, to further increase the capabilities of the DirectMS1 method, we explored the opportunities presented by the recent progress in the machine-learning area and applied the LightGBM decision tree boosting algorithm to the scoring of peptide feature matches when processing MS1 spectra. Furthermore, we integrated the peptide feature identification algorithm of DirectMS1 with the recently introduced peptide retention time prediction utility, DeepLC. Additional approaches to improve the performance of the DirectMS1 method are discussed and demonstrated, such as using FAIMS for gas-phase ion separation. As a result of all improvements to DirectMS1, we succeeded in identifying more than 2000 proteins at 1% FDR from the HeLa cell line in a 5 min gradient LC-FAIMS/MS1 analysis. The data sets generated and analyzed during the current study have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the data set identifier PXD023977.
Affiliation(s)
- Mark V Ivanov
  - V. L. Talrose Institute for Energy Problems of Chemical Physics, N. N. Semenov Federal Research Center for Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, Moscow 119334, Russia
- Julia A Bubis
  - V. L. Talrose Institute for Energy Problems of Chemical Physics, N. N. Semenov Federal Research Center for Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, Moscow 119334, Russia
- Vladimir Gorshkov
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, DK-5230 Odense M, Denmark
- Daniil A Abdrakhimov
  - V. L. Talrose Institute for Energy Problems of Chemical Physics, N. N. Semenov Federal Research Center for Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, Moscow 119334, Russia
  - Moscow Institute of Physics and Technology, Institutsky lane 9, Dolgoprudny, Moscow Region 141700, Russia
- Frank Kjeldsen
  - Department of Biochemistry and Molecular Biology, University of Southern Denmark, DK-5230 Odense M, Denmark
- Mikhail V Gorshkov
  - V. L. Talrose Institute for Energy Problems of Chemical Physics, N. N. Semenov Federal Research Center for Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, Moscow 119334, Russia
|
27
|
Du Z, Shao W, Qin W. [Research progress and application of retention time prediction method based on deep learning]. Se Pu 2021; 39:211-218. [PMID: 34227303 PMCID: PMC9403805 DOI: 10.3724/sp.j.1123.2020.08015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/20/2020] [Indexed: 11/25/2022] Open
Abstract
In the "shotgun" proteomics strategy, the proteome is characterized by analyzing tryptically digested peptides using liquid chromatography-mass spectrometry. In this strategy, the retention time of peptides in liquid chromatography separation can be predicted from the peptide sequence, which is a useful feature for peptide identification. The prediction of retention time has therefore attracted much research attention. Traditional methods calculate the physical and chemical properties of the peptides from their amino acid sequence to obtain the retention time under certain chromatography conditions; however, these methods cannot be directly adopted for other chromatography conditions, nor can they be used across laboratories or instrument platforms. To solve this problem, deep learning has in recent years been introduced to proteomics research for retention time prediction. Deep learning is an advanced machine-learning method with an extraordinary capability to learn complex relationships from large-scale data. By stacking multiple hidden layers, deep learning can ingest raw data without manually designed features. Transfer learning is an important method in deep learning. It improves the learning of a new task through the transfer of knowledge from an already-learned related task. Transfer learning allows models trained on large datasets to be used across conditions by fine-tuning on smaller datasets, instead of retraining the whole model. Many retention time prediction methods have been developed. During model training, the peptide sequences are encoded to represent peptide information. Deep learning captures the relationship between the characteristics of the peptides and their corresponding retention times without the need for manual input of the physical and chemical properties of the peptides.
Compared with traditional methods, deep learning methods have higher accuracy and can easily be used under different chromatography conditions through transfer learning. If there are not enough data to train a new model, a model trained on other datasets can be used instead after calibration with a small dataset obtained under the new chromatography conditions. While the retention times of modified peptides can also be predicted, the predictions are inadequate for complex modifications such as glycosylation, and this is one of the main problems to be solved. Predicted retention times are used to control the quality of peptide identification. When prediction accuracy is high, predicted retention times can stand in for actual retention times. The difference between predicted and observed retention times can therefore serve as an effective and unbiased quantitative metric for evaluating the quality of peptide-spectrum matches (PSMs) reported by different peptide identification methods. Combined with fragment ion intensity prediction, retention time prediction is used to generate spectral libraries for data-independent acquisition (DIA)-based mass spectrometry analysis. Generally, DIA methods identify peptides using specific spectral libraries obtained from data-dependent acquisition (DDA) experiments. As a result, only peptides detected in the DDA experiments can be present in the libraries and detected in DIA. Furthermore, it takes a lot of time and effort to build libraries from DDA experiments, and typically they cannot be adopted across different laboratories or instrument platforms. In contrast, pseudo spectral libraries generated from retention time and fragment ion intensity prediction can overcome these shortcomings. The pseudo spectral libraries contain theoretical spectra of all possible peptides without the need for DDA experiments.
This paper reviews the research progress of deep learning methods in the prediction of retention time and in related applications, in order to provide references for retention time prediction and protein identification. The development direction and application trends of retention time prediction methods based on deep learning are also discussed.
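The idea of scoring a PSM by the gap between its predicted and observed retention time can be sketched as below. This is a minimal illustration and not code from any tool covered by the review: the function names (`fit_line`, `rt_errors`), the simple linear calibration between predicted and observed time scales, and all numeric values are assumptions made for the example.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rt_errors(predicted, observed, calib_pred, calib_obs):
    """Absolute retention time error per PSM after linear calibration.

    calib_pred / calib_obs are predicted and observed times for a small set
    of trusted peptides (hypothetical calibrants) run under the same
    chromatography conditions as the PSMs being evaluated.
    """
    slope, intercept = fit_line(calib_pred, calib_obs)
    return [abs(slope * p + intercept - o) for p, o in zip(predicted, observed)]


if __name__ == "__main__":
    # Hypothetical calibrants: model output vs. observed minutes.
    calib_pred = [0.0, 10.0, 20.0]
    calib_obs = [5.0, 25.0, 45.0]
    # Two hypothetical PSMs; a large error suggests a doubtful match.
    print(rt_errors([10.0, 15.0], [26.0, 50.0], calib_pred, calib_obs))
```

A threshold on this error, or its use as one feature in a rescoring model, is a design choice left open here; the sketch only shows how the metric itself is obtained.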
|
28
|
Bittremieux W, Adams C, Laukens K, Dorrestein PC, Bandeira N. Open Science Resources for the Mass Spectrometry-Based Analysis of SARS-CoV-2. J Proteome Res 2021; 20:1464-1475. [PMID: 33605735 DOI: 10.1021/acs.jproteome.0c00929] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 01/09/2023]
Abstract
The SARS-CoV-2 virus is the causative agent of the 2020 pandemic leading to the COVID-19 respiratory disease. With many scientific and humanitarian efforts ongoing to develop diagnostic tests, vaccines, and treatments for COVID-19, and to prevent the spread of SARS-CoV-2, mass spectrometry research, including proteomics, is playing a role in determining the biology of this viral infection. Proteomics studies are starting to lead to an understanding of the roles of viral and host proteins during SARS-CoV-2 infection, their protein-protein interactions, and post-translational modifications. This is beginning to provide insights into potential therapeutic targets or diagnostic strategies that can be used to reduce the long-term burden of the pandemic. However, the extraordinary situation caused by the global pandemic is also highlighting the need to improve mass spectrometry data and workflow sharing. We therefore describe freely available data and computational resources that can facilitate and assist the mass spectrometry-based analysis of SARS-CoV-2. We exemplify this by reanalyzing a virus-host interactome data set to detect protein-protein interactions and identify host proteins that could potentially be used as targets for drug repurposing.
Affiliation(s)
- Wout Bittremieux
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla 92093, California, United States
  - Department of Computer Science, University of Antwerp, Antwerp 2020, Belgium
- Charlotte Adams
  - Department of Computer Science, University of Antwerp, Antwerp 2020, Belgium
- Kris Laukens
  - Department of Computer Science, University of Antwerp, Antwerp 2020, Belgium
- Pieter C Dorrestein
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla 92093, California, United States
- Nuno Bandeira
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla 92093, California, United States
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla 92093, California, United States
|
29
|
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE Access 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues that are inherent in proteogenomic data analysis, where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
Affiliation(s)
- Muhammad Usman Tariq
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Muhammad Haseeb
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Mohammed Aledhari
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Rehma Razzak
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Reza M Parizi
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Fahad Saeed
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
|
30
|
The challenge of detecting modifications on proteins. Essays Biochem 2020; 64:135-153. [PMID: 31957791 DOI: 10.1042/ebc20190055] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/15/2019] [Revised: 12/17/2019] [Accepted: 12/19/2019] [Indexed: 12/16/2022]
Abstract
Post-translational modifications (PTMs) are integral to the regulation of protein function; characterising their role in this process is vital to understanding how cells work in both healthy and diseased states. Mass spectrometry (MS) facilitates the mass determination and sequencing of peptides, and thereby also the detection of site-specific PTMs. However, numerous challenges in this field persist. The diverse chemical properties, low abundance, labile nature and instability of many PTMs, in combination with the more practical issues of compatibility with MS and bioinformatics challenges, contribute to the arduous nature of their analysis. In this review, we present an overview of the established MS-based approaches for analysing PTMs and the common complications associated with their investigation, including examples of specific challenges focusing on phosphorylation, lysine acetylation and redox modifications.
|
31
|
Wang L, Liu K, Li S, Tang H. A Fast and Memory-Efficient Spectral Library Search Algorithm Using Locality-Sensitive Hashing. Proteomics 2020; 20:e2000002. [PMID: 32415809 PMCID: PMC7669687 DOI: 10.1002/pmic.202000002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/01/2020] [Revised: 04/17/2020] [Indexed: 01/07/2023]
Abstract
With the accumulation of MS/MS spectra in spectral libraries, spectral library searching has emerged as an important approach for peptide identification in proteomics, complementary to the commonly used protein database searching approach, in particular for the proteomic analyses of well-studied model organisms, such as human. Existing spectral library searching algorithms compare a query MS/MS spectrum with each library spectrum of matching precursor mass and charge state, which may become computationally intensive as libraries grow rapidly. Here, the software msSLASH, which implements a fast spectral library searching algorithm based on the Locality-Sensitive Hashing (LSH) technique, is presented. The algorithm first converts the library and query spectra into bit-strings using LSH functions, and then computes the similarity only between spectra with highly similar bit-strings. Using the spectral library searching of large real-world MS/MS spectra datasets, it is demonstrated that the algorithm significantly reduces the number of spectral comparisons and, as a result, achieves a 2-9× speedup in comparison with the existing spectral library searching algorithm SpectraST. The spectral searching algorithm is implemented in C/C++ and is ready to be used in proteomic data analyses.
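The two steps named in the abstract (hash spectra to bit-strings, then compare only spectra whose bit-strings are close) can be illustrated with a small sketch. It is not the msSLASH implementation, which is written in C/C++ and whose hash functions the abstract does not specify; random-hyperplane (sign of dot product) hashing is swapped in here as one common LSH family for cosine similarity. The function names, the 1 m/z binning, the 16-bit signature length, and the Hamming-distance cutoff are all assumptions for illustration.

```python
import random
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Turn (m/z, intensity) pairs into a fixed-length intensity vector."""
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec


def make_hyperplanes(n_bits, n_dims, seed=0):
    """One random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """Bit-string: each bit is the side of one hyperplane the vector falls on."""
    return "".join(
        "1" if sum(p * v for p, v in zip(plane, vec)) >= 0 else "0"
        for plane in hyperplanes
    )


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def build_index(library, hyperplanes):
    """Map each signature to the names of the library spectra that produce it."""
    index = defaultdict(list)
    for name, peaks in library.items():
        index[signature(bin_spectrum(peaks), hyperplanes)].append(name)
    return index


def candidates(query_peaks, index, hyperplanes, max_hamming=2):
    """Library spectra whose signature is within max_hamming bits of the query."""
    q = signature(bin_spectrum(query_peaks), hyperplanes)
    return [name for sig, names in index.items()
            if hamming(sig, q) <= max_hamming for name in names]


if __name__ == "__main__":
    planes = make_hyperplanes(n_bits=16, n_dims=2000, seed=1)
    library = {  # hypothetical two-spectrum library
        "spectrum_A": [(175.1, 100.0), (300.2, 50.0)],
        "spectrum_B": [(500.3, 80.0), (900.4, 20.0)],
    }
    index = build_index(library, planes)
    print(candidates([(175.1, 90.0), (300.2, 55.0)], index, planes))
```

Only the returned candidates would then go through a full similarity calculation (for example a normalized dot product), which is where the reduction in spectral comparisons comes from. A production index would look up nearby signatures directly rather than scanning all keys as this sketch does.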
Affiliation(s)
- Lei Wang
  - School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
- Kaiyuan Liu
  - School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
- Sujun Li
  - School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
- Haixu Tang
  - School of Informatics and Computing, Indiana University, Bloomington, IN, 47405, USA
|
32
|
Van Houtven J, Boonen K, Baggerman G, Askenazi M, Laukens K, Hooyberghs J, Valkenborg D. PRiSM: A prototype for exhaustive, restriction-free database searching for mass spectrometry-based identification. Rapid Commun Mass Spectrom 2020:e8962. [PMID: 33009686 DOI: 10.1002/rcm.8962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2020] [Revised: 09/28/2020] [Accepted: 09/30/2020] [Indexed: 06/11/2023]
Abstract
RATIONALE: The current methods for identifying peptides in mass spectral product ion data still struggle to do so for the majority of spectra. Based on the experimental setup and other assumptions, such methods restrict the search space to speed up computations, but at the cost of creating blind spots. The proteomics community would greatly benefit from a method that is capable of covering the entire search space without using any restrictions, thus establishing a baseline for identification. METHODS: We conceived the "mass pattern paradigm" (MPP) that enables the creation of such an identification method, and we implemented it into a prototype database search engine "PRiSM" (PRotein-Spectrum Matching). We then assessed its operational characteristics by applying it to publicly available high-precision mass spectra of low and high identification difficulty. We used those characteristics to gain theoretical insights into trade-offs between sensitivity and speed when trying to establish a baseline for identification. RESULTS: Of 100 low difficulty spectra, PRiSM and SEQUEST agree on 84 identifications (of which 75 are statistically significant). Of 15 of 100 spectra not identified in a previous study (using SEQUEST), 13 are considered reliable after visual inspection and represent 3 proteins (out of 9 in total) not detected previously. CONCLUSIONS: Despite leaving noise intact, the simple PRiSM prototype can make statistically reliable identifications, while controlling the false discovery rate by fitting a null distribution. It also identifies some spectra previously unidentifiable in an "extremely open" SEQUEST search, paving the way to establishing a baseline for identification in proteomics.
Affiliation(s)
- Joris Van Houtven
  - Flemish Institute for Technological Research (VITO), Boeretang 200, Mol, Belgium
- Kurt Boonen
  - Universiteit Hasselt, Data Science Institute (DSI), Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Diepenbeek, Belgium
- Geert Baggerman
  - Universiteit Antwerpen, Centre for Proteomics, Antwerp, Belgium
- Kris Laukens
  - Universiteit Antwerpen, Biomedical Informatics Network Antwerp (Biomina), Antwerp, Belgium
- Jef Hooyberghs
  - ADReM Data Lab, Department of Computer Sciences, Universiteit Antwerpen, Antwerp, Belgium
- Dirk Valkenborg
  - Universiteit Hasselt, Data Science Institute (DSI), Theoretical Physics, Diepenbeek, Belgium
|
33
|
Yu F, Teo GC, Kong AT, Haynes SE, Avtonomov DM, Geiszler DJ, Nesvizhskii AI. Identification of modified peptides using localization-aware open search. Nat Commun 2020; 11:4065. [PMID: 32792501 PMCID: PMC7426425 DOI: 10.1038/s41467-020-17921-y] [Citation(s) in RCA: 136] [Impact Index Per Article: 34.0] [Received: 03/25/2020] [Accepted: 07/27/2020] [Indexed: 11/25/2022] Open
Abstract
Identification of post-translationally or chemically modified peptides in mass spectrometry-based proteomics experiments is a crucial yet challenging task. We have recently introduced a fragment ion indexing method and the MSFragger search engine to empower an open search strategy for comprehensive analysis of modified peptides. However, this strategy does not consider fragment ions shifted by unknown modifications, preventing modification localization and limiting the sensitivity of the search. Here we present a localization-aware open search method, in which both modification-containing (shifted) and regular fragment ions are indexed and used in scoring. We also implement a fast mass calibration and optimization method, allowing optimization of the mass tolerances and other key search parameters. We demonstrate that MSFragger with mass calibration and localization-aware open search identifies modified peptides with significantly higher sensitivity and accuracy. Comparing MSFragger to other modification-focused tools (pFind3, MetaMorpheus, and TagGraph) shows that MSFragger remains an excellent option for fast, comprehensive, and sensitive searches for modified peptides in shotgun proteomics data.
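The distinction between regular and shifted fragment ions can be made concrete with a short sketch. It is not MSFragger code and does not reproduce its fragment index or scoring; it only computes, for a peptide and an open-search mass difference, the singly charged b-ion m/z values with and without the shift. The function names, the example peptide, the assumed modification site, and the example mass shift are illustrative assumptions; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def delta_mass(observed_neutral_mass, peptide):
    """Open-search mass difference between precursor and candidate peptide."""
    return observed_neutral_mass - peptide_mass(peptide)


def b_ions(peptide):
    """Singly charged b-ion m/z values b1 .. b(n-1), no modification."""
    total, ions = 0.0, []
    for aa in peptide[:-1]:
        total += MONO[aa]
        ions.append(total + PROTON)
    return ions


def shifted_b_ions(peptide, shift, site):
    """b-ions when `shift` sits on the residue at 0-based index `site`.

    Only fragments that contain the modified residue carry the shift.
    """
    return [mz + shift if i >= site else mz
            for i, mz in enumerate(b_ions(peptide))]


if __name__ == "__main__":
    pep = "GASK"       # hypothetical peptide
    shift = 79.96633   # hypothetical delta mass (phosphorylation-sized)
    print(b_ions(pep))
    print(shifted_b_ions(pep, shift, site=2))
```

Matching both lists against an experimental spectrum, for each candidate site, is one way to see why shifted ions help localize the modification; how the two ion sets are weighted in scoring is not something this sketch addresses.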
Affiliation(s)
- Fengchao Yu
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Guo Ci Teo
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Andy T Kong
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Sarah E Haynes
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Dmitry M Avtonomov
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Daniel J Geiszler
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Alexey I Nesvizhskii
  - Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
|
34
|
Bouwmeester R, Gabriels R, Van Den Bossche T, Martens L, Degroeve S. The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows. Proteomics 2020; 20:e1900351. [PMID: 32267083 DOI: 10.1002/pmic.201900351] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 02/26/2020] [Revised: 03/21/2020] [Indexed: 12/30/2022]
Abstract
A lot of energy in the field of proteomics is dedicated to the application of challenging experimental workflows, which include metaproteomics, proteogenomics, data independent acquisition (DIA), non-specific proteolysis, immunopeptidomics, and open modification searches. These workflows are all challenging because of ambiguity in the identification stage; they either expand the search space and thus increase the ambiguity of identifications, or, in the case of DIA, they generate data that is inherently more ambiguous. In this context, machine learning-based predictive models are now generating considerable excitement in the field of proteomics because these predictive models hold great potential to drastically reduce the ambiguity in the identification process of the above-mentioned workflows. Indeed, the field has already produced classical machine learning and deep learning models to predict almost every aspect of a liquid chromatography-mass spectrometry (LC-MS) experiment. Yet despite all the excitement, thorough integration of predictive models in these challenging LC-MS workflows is still limited, and further improvements to the modeling and validation procedures can still be made. Therefore, highly promising recent machine learning developments in proteomics are pointed out in this viewpoint, alongside some of the remaining challenges.
Affiliation(s)
- Robbin Bouwmeester
- VIB-UGent Center for Medical Biotechnology, VIB, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium
| | - Ralf Gabriels
- VIB-UGent Center for Medical Biotechnology, VIB, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium
| | - Tim Van Den Bossche
- VIB-UGent Center for Medical Biotechnology, VIB, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium
| | - Lennart Martens
- VIB-UGent Center for Medical Biotechnology, VIB, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium
| | - Sven Degroeve
- VIB-UGent Center for Medical Biotechnology, VIB, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, Albert Baertsoenkaai 3, B-9000, Ghent, Belgium
|
35
|
Shiferaw GA, Vandermarliere E, Hulstaert N, Gabriels R, Martens L, Volders PJ. COSS: A Fast and User-Friendly Tool for Spectral Library Searching. J Proteome Res 2020; 19:2786-2793. [PMID: 32384242 DOI: 10.1021/acs.jproteome.9b00743] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 01/10/2023]
Abstract
Spectral similarity searching to identify peptide-derived MS/MS spectra is a promising technique, and different spectrum similarity search tools have therefore been developed. Each of these tools, however, comes with some limitations, mainly because of low processing speed and issues with handling large databases. Furthermore, the number of spectral data formats supported is typically limited, which also creates a threshold to adoption. We have therefore developed COSS (CompOmics Spectral Searching), a new and user-friendly spectral library search tool supporting two scoring functions. COSS also includes decoy spectra generation for result validation. We have benchmarked COSS on three different spectral libraries and compared the results with established spectral searching tools and a sequence database search tool. Our comparison showed that COSS more reliably identifies spectra, is capable of handling large data sets and libraries, and is an easy to use tool that can run on low computer specifications. COSS binaries and source code can be freely downloaded from https://github.com/compomics/COSS.
Affiliation(s)
- Genet Abay Shiferaw
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
| | - Elien Vandermarliere
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
| | - Niels Hulstaert
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
| | - Ralf Gabriels
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
| | - Lennart Martens
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
| | - Pieter-Jan Volders
- VIB-UGent Center for Medical Biotechnology, VIB, 9000 Ghent, Belgium; Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium; Cancer Research Institute Ghent, Ghent University, 9000 Ghent, Belgium
|
36
|
Vizcaíno JA, Kubiniok P, Kovalchik KA, Ma Q, Duquette JD, Mongrain I, Deutsch EW, Peters B, Sette A, Sirois I, Caron E. The Human Immunopeptidome Project: A Roadmap to Predict and Treat Immune Diseases. Mol Cell Proteomics 2020; 19:31-49. [PMID: 31744855 PMCID: PMC6944237 DOI: 10.1074/mcp.r119.001743] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 08/23/2019] [Revised: 11/18/2019] [Indexed: 12/11/2022] Open
Abstract
The science that investigates the ensembles of all peptides associated to human leukocyte antigen (HLA) molecules is termed "immunopeptidomics" and is typically driven by mass spectrometry (MS) technologies. Recent advances in MS technologies, neoantigen discovery and cancer immunotherapy have catalyzed the launch of the Human Immunopeptidome Project (HIPP) with the goal of providing a complete map of the human immunopeptidome and making the technology so robust that it will be available in every clinic. Here, we provide a long-term perspective of the field and we use this framework to explore how we think the completion of the HIPP will truly impact the society in the future. In this context, we introduce the concept of immunopeptidome-wide association studies (IWAS). We highlight the importance of large cohort studies for the future and how applying quantitative immunopeptidomics at population scale may provide a new look at individual predisposition to common immune diseases as well as responsiveness to vaccines and immunotherapies. Through this vision, we aim to provide a fresh view of the field to stimulate new discussions within the community, and present what we see as the key challenges for the future for unlocking the full potential of immunopeptidomics in this era of precision medicine.
Affiliation(s)
- Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Peter Kubiniok
- CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
| | | | - Qing Ma
- CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada; School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada
| | | | - Ian Mongrain
- Université de Montréal Beaulieu-Saucier Pharmacogenomics Centre, Montreal, QC, Canada; Montreal Heart Institute, Montreal, QC, Canada
| | - Eric W Deutsch
- Institute for Systems Biology, Seattle, Washington, 98109
| | - Bjoern Peters
- La Jolla Institute for Allergy and Immunology, La Jolla, California, 92037
| | - Alessandro Sette
- La Jolla Institute for Allergy and Immunology, La Jolla, California, 92037
| | - Isabelle Sirois
- CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada
| | - Etienne Caron
- CHU Sainte-Justine Research Center, Montreal, QC H3T 1C5, Canada; Department of Pathology and Cellular Biology, Faculty of Medicine, Université de Montréal, QC H3T 1J4, Canada.
|
37
|
den Ridder M, Daran-Lapujade P, Pabst M. Shot-gun proteomics: why thousands of unidentified signals matter. FEMS Yeast Res 2019; 20:5682490. [DOI: 10.1093/femsyr/foz088] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/01/2019] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open
Abstract
Mass spectrometry-based proteomics has become a constitutional part of the multi-omics toolbox in yeast research, advancing fundamental knowledge of molecular processes and guiding decisions in strain and product developmental pipelines. Nevertheless, post-translational protein modifications (PTMs) continue to challenge the field of proteomics. PTMs are not directly encoded in the genome; therefore, they require a sensitive analysis of the proteome itself. In yeast, the relevance of post-translational regulators has already been established, such as for phosphorylation, which can directly affect the reaction rates of metabolic enzymes. Whereas the selective analysis of single modifications has become a broadly employed technique, the sensitive analysis of a comprehensive set of modifications remains a challenge. At the same time, a large number of fragmentation spectra in a typical shot-gun proteomics experiment remain unidentified. It has been estimated that a good proportion of those unidentified spectra originates from unexpected modifications or natural peptide variants. In this review, recent advancements in microbial proteomics for unrestricted protein modification discovery are reviewed, and recent research integrating this additional layer of information to elucidate protein interaction and regulation in yeast is briefly discussed.
Affiliation(s)
- Maxime den Ridder
- Delft University of Technology, Department of Biotechnology, van der Maasweg 9, 2629 HZ Delft, The Netherlands
| | - Pascale Daran-Lapujade
- Delft University of Technology, Department of Biotechnology, van der Maasweg 9, 2629 HZ Delft, The Netherlands
| | - Martin Pabst
- Delft University of Technology, Department of Biotechnology, van der Maasweg 9, 2629 HZ Delft, The Netherlands
|
38
|
Bittremieux W. spectrum_utils: A Python Package for Mass Spectrometry Data Processing and Visualization. Anal Chem 2019; 92:659-661. [PMID: 31809021 DOI: 10.1021/acs.analchem.9b04884] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Indexed: 12/16/2022]
Abstract
Given the wide diversity in applications of biological mass spectrometry, custom data analyses are often needed to fully interpret the results of an experiment. Such bioinformatics scripts necessarily include similar basic functionality to read mass spectral data from standard file formats, process it, and visualize it. To avoid having to reimplement this functionality each time, spectrum_utils is a Python package for mass spectrometry data processing and visualization. Its high-level functionality enables developers to quickly prototype ideas for computational mass spectrometry projects in only a few lines of code. Notably, the data processing functionality is highly optimized for computational efficiency to be able to deal with the large volumes of data that are generated during mass spectrometry experiments. The visualization functionality makes it possible to easily produce publication-quality figures as well as interactive spectrum plots for inclusion on web pages. spectrum_utils is available for Python 3.6+, includes extensive online documentation and examples, and can be easily installed using conda. It is freely available as open source under the Apache 2.0 license at https://github.com/bittremieux/spectrum_utils.
Affiliation(s)
- Wout Bittremieux
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California 92093, United States; Department of Mathematics and Computer Science, University of Antwerp, 2020 Antwerp, Belgium; Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
|
39
|
Pino L, Lin A, Bittremieux W. 2018 YPIC Challenge: A Case Study in Characterizing an Unknown Protein Sample. J Proteome Res 2019; 18:3936-3943. [PMID: 31556620 PMCID: PMC6824964 DOI: 10.1021/acs.jproteome.9b00384] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
Abstract
For the 2018 YPIC Challenge, contestants were invited to try to decipher two unknown English questions encoded by a synthetic protein expressed in Escherichia coli. In addition to deciphering the sentence, contestants were asked to determine the three-dimensional structure and detect any post-translational modifications left by the host organism. We present our experimental and computational strategy to characterize this sample by identifying the unknown protein sequence and detecting the presence of post-translational modifications. The sample was acquired with dynamic exclusion disabled to increase the signal-to-noise ratio of the measured molecules, after which spectral clustering was used to generate high-quality consensus spectra. De novo spectrum identification was used to determine the synthetic protein sequence, and any post-translational modifications introduced by E. coli on the synthetic protein were analyzed via spectral networking. This workflow resulted in a de novo sequence coverage of 70%, on par with sequence database searching performance. Additionally, the spectral networking analysis indicated that no systematic modifications were introduced on the synthetic protein by E. coli. The strategy presented here can be directly used to analyze samples for which no protein sequence information is available or when the identity of the sample is unknown. All software and code to perform the bioinformatics analysis is available as open source, and self-contained Jupyter notebooks are provided to fully recreate the analysis.
Affiliation(s)
- Lindsay Pino
- Department of Genome Sciences, University of Washington, Seattle WA 98195, USA
| | - Andy Lin
- Department of Genome Sciences, University of Washington, Seattle WA 98195, USA
| | - Wout Bittremieux
- Department of Genome Sciences, University of Washington, Seattle WA 98195, USA
- Department of Mathematics and Computer Science, University of Antwerp, 2020 Antwerp, Belgium
- Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
|
40
|
Bittremieux W, Laukens K, Noble WS. Extremely Fast and Accurate Open Modification Spectral Library Searching of High-Resolution Mass Spectra Using Feature Hashing and Graphics Processing Units. J Proteome Res 2019; 18:3792-3799. [PMID: 31448616 PMCID: PMC6886738 DOI: 10.1021/acs.jproteome.9b00291] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 12/20/2022]
Abstract
Open modification searching (OMS) is a powerful search strategy to identify peptides with any type of modification. OMS works by using a very wide precursor mass window to allow modified spectra to match against their unmodified variants, after which the modification types can be inferred from the corresponding precursor mass differences. A disadvantage of this strategy, however, is the large computational cost, because each query spectrum has to be compared against a multitude of candidate peptides. We have previously introduced the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. Here we demonstrate how this candidate selection procedure can be further optimized using graphics processing units. Additionally, we introduce a feature hashing scheme to convert high-resolution spectra to low-dimensional vectors. On the basis of these algorithmic advances, along with low-level code optimizations, the new version of ANN-SoLo is up to an order of magnitude faster than its initial version. This makes it possible to efficiently perform open searches on a large scale to gain a deeper understanding about the protein modification landscape. We demonstrate the computational efficiency and identification performance of ANN-SoLo based on a large data set of the draft human proteome. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo.
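The two ideas named in this abstract, hashing a high-resolution spectrum into a short vector and reading a modification mass off the precursor mass difference, can be illustrated with a minimal sketch. This is not the ANN-SoLo implementation: the function names, the bin width of 0.05 m/z, the vector size of 400, the use of CRC32 as the hash function, and the example peak lists are all assumptions chosen for illustration, and the mass difference assumes the query and library spectra share the same charge.

```python
import math
import zlib


def hash_spectrum(mz, intensity, bin_width=0.05, dim=400):
    """Map a peak list to a unit-length vector of fixed size `dim`.

    Hypothetical helper: bin_width and dim are illustrative defaults.
    """
    vector = [0.0] * dim
    for peak_mz, peak_intensity in zip(mz, intensity):
        # Discretize the m/z axis into narrow bins (high resolution).
        bin_index = int(peak_mz / bin_width)
        # Hash the bin index into one of `dim` slots; colliding bins add up.
        slot = zlib.crc32(str(bin_index).encode("ascii")) % dim
        vector[slot] += peak_intensity
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector


def dot(a, b):
    """Dot product of two hashed vectors (cosine similarity if unit length)."""
    return sum(x * y for x, y in zip(a, b))


def precursor_mass_delta(query_mz, library_mz, charge):
    """Mass difference implied by two precursor m/z values at the same charge."""
    return (query_mz - library_mz) * charge


# Made-up peak lists: two shared peaks and one shifted peak.
library = hash_spectrum([175.119, 262.151, 375.235], [0.4, 1.0, 0.7])
query = hash_spectrum([175.119, 262.151, 391.230], [0.5, 1.0, 0.6])
print(round(dot(library, query), 3))
print(precursor_mass_delta(510.0, 502.0, 2))
```

Because the hashed vectors have a fixed, small dimension, they can be handed to a nearest neighbor index regardless of how finely the m/z axis was binned; the cost is that unrelated bins occasionally collide in the same slot.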
Affiliation(s)
- Wout Bittremieux
- Department of Mathematics and Computer Science, University of Antwerp, 2020 Antwerp, Belgium
- Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
| | - Kris Laukens
- Department of Mathematics and Computer Science, University of Antwerp, 2020 Antwerp, Belgium
- Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
|
41
|
Boonen K, Hens K, Menschaert G, Baggerman G, Valkenborg D, Ertaylan G. Beyond Genes: Re-Identifiability of Proteomic Data and Its Implications for Personalized Medicine. Genes (Basel) 2019; 10:E682. [PMID: 31492022 PMCID: PMC6770961 DOI: 10.3390/genes10090682] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 07/19/2019] [Revised: 08/30/2019] [Accepted: 09/01/2019] [Indexed: 02/07/2023] Open
Abstract
The increasing availability of high throughput proteomics data provides us with opportunities as well as posing new ethical challenges regarding data privacy and re-identifiability of participants. Moreover, the fact that proteomics represents a level between the genotype and the phenotype further exacerbates the situation, introducing dilemmas related to publicly available data, anonymization, ownership of information and incidental findings. In this paper, we try to differentiate proteomics from genomics data and cover the ethical challenges related to proteomics data sharing. Finally, we give an overview of the proposed solutions and the outlook for future studies.
Affiliation(s)
- Kurt Boonen
- VITO Health, Boeretang 200, Mol 2400, Belgium.
- Centre for Proteomics, University of Antwerpen, Antwerp 2020, Belgium.
| | - Kristien Hens
- Department of Philosophy, University of Antwerp, Antwerp 2000 & Institute of Philosophy, KU Leuven, Leuven 3000, Belgium.
| | - Gerben Menschaert
- Biobix, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent 9000, Belgium.
| | - Geert Baggerman
- VITO Health, Boeretang 200, Mol 2400, Belgium.
- Centre for Proteomics, University of Antwerpen, Antwerp 2020, Belgium.
|
42
|
Binz PA, Shofstahl J, Vizcaíno JA, Barsnes H, Chalkley RJ, Menschaert G, Alpi E, Clauser K, Eng JK, Lane L, Seymour SL, Sánchez LFH, Mayer G, Eisenacher M, Perez-Riverol Y, Kapp EA, Mendoza L, Baker PR, Collins A, Van Den Bossche T, Deutsch EW. Proteomics Standards Initiative Extended FASTA Format. J Proteome Res 2019; 18:2686-2692. [PMID: 31081335 DOI: 10.1021/acs.jproteome.9b00064] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 11/29/2022]
Abstract
Mass-spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs) in biological samples. However, most workflows require that such variations be included in the search space used to analyze the data, and doing so remains challenging with most analysis tools. In order to facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI extended FASTA format (PEFF). PEFF is based on the very popular FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backward compatible, and as such, existing FASTA parsers will require few or no changes to be able to read PEFF files as FASTA files, although without supporting any of the extra capabilities of PEFF. PEFF is defined by a full specification document, controlled vocabulary terms, a set of example files, software libraries, and a file validator. Popular software and resources are starting to support PEFF, including the sequence search engine Comet and the knowledge bases neXtProt and UniProtKB. Widespread implementation of PEFF is expected to further enable proteogenomics and top-down proteomics applications by providing a standardized mechanism for encoding protein sequences and their known variations. All the related documentation, including the detailed file format specification and example files, is available at http://www.psidev.info/peff.
Affiliation(s)
- Pierre-Alain Binz
- CHUV Centre Hospitalier Universitaire Vaudois, CH-1011 Lausanne 14, Switzerland
| | - Jim Shofstahl
- Thermo Fisher Scientific, 355 River Oaks Parkway, San Jose, California 95134, United States
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Harald Barsnes
- Proteomics Unit, Department of Biomedicine, University of Bergen, N-5009 Bergen, Norway; Computational Biology Unit, Department of Informatics, University of Bergen, N-5008 Bergen, Norway
| | - Robert J Chalkley
- University California at San Francisco, San Francisco, California 94143, United States
| | - Gerben Menschaert
- Biobix, Department of Data Analysis and Mathematical Modelling, Ghent University, 9000 Ghent, Belgium
| | - Emanuele Alpi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Karl Clauser
- Broad Institute, Cambridge, Massachusetts 02142, United States
| | - Jimmy K Eng
- University of Washington, Seattle, Washington 98195, United States
| | - Lydie Lane
- SIB Swiss Institute of Bioinformatics, CH-1211 Geneva 4, Switzerland; Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, CH-1211 Geneva 4, Switzerland
| | - Sean L Seymour
- Seymour Data Science, LLC, San Francisco, California 95000, United States
| | - Luis Francisco Hernández Sánchez
- K.G. Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, 5021 Bergen, Norway; Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, 5021 Bergen, Norway
| | - Gerhard Mayer
- Medical Faculty, Medizinisches Proteom-Center, Ruhr University Bochum, D-44801 Bochum, Germany
| | - Martin Eisenacher
- Medical Faculty, Medizinisches Proteom-Center, Ruhr University Bochum, D-44801 Bochum, Germany
| | - Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Eugene A Kapp
- Walter & Eliza Hall Institute of Medical Research and the University of Melbourne, Melbourne, VIC 3052, Australia
| | - Luis Mendoza
- Institute for Systems Biology, Seattle, Washington 98109, United States
| | - Peter R Baker
- University California at San Francisco, San Francisco, California 94143, United States
| | - Andrew Collins
- Department of Functional and Comparative Genomics, Institute of Integrated Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
| | - Tim Van Den Bossche
- VIB-UGent Center for Medical Biotechnology, Ghent University, 9000 Ghent, Belgium
| | - Eric W Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States
|
43
|
Application of the Operational Research Method to Determine the Optimum Transport Collection Cycle of Municipal Waste in a Predesignated Urban Area. Sustainability 2019. [DOI: 10.3390/su11082275] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Indexed: 12/22/2022]
Abstract
This paper deals with waste management. The aim is to find out whether the number of municipal mixed waste bins can be reduced due to the impact of waste separation and to determine the optimum municipal waste collection cycle within a predesignated area with an existing urban road network. The number of mixed waste bins depends on two factors—household waste volume and household waste composition. Both of these factors have a significant impact on the number of mixed waste bins required, with household waste composition being of particular importance when it comes to calculating the potential reduction in the number of mixed waste bins required due to waste separation. The calculations for the weight and volume of mixed household waste per person and the composition (types) thereof are presented accordingly. The results reveal which types of waste are the most useful in minimising the number of mixed waste bins by up to 30–50%. To determine the optimum waste collection cycle within the predesignated area with a real urban road network, the Nearest Neighbour Search method was applied. In the discussion, the focus is on whether other methods, including the Two-Phase Heuristic approach and the Bellman-Ford Algorithm, could be applied to solve the problem, whereby parameters such as application time and the capacity of the waste collection vehicle are compared.
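As a rough illustration of the nearest-neighbour idea mentioned in this abstract, the sketch below builds a closed collection cycle by always driving to the closest unvisited stop. It is not the paper's procedure: the stop coordinates are made up, straight-line distances stand in for the real urban road network, and vehicle capacity is ignored. The function names are hypothetical.

```python
import math


def nearest_neighbour_cycle(points, start=0):
    """Greedy closed tour: from each stop, go to the closest unvisited stop."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        here = points[tour[-1]]
        # Ties are broken by the lower stop index so the result is deterministic.
        nxt = min(unvisited, key=lambda j: (math.dist(here, points[j]), j))
        tour.append(nxt)
        unvisited.remove(nxt)
    tour.append(start)  # return to the depot to close the cycle
    return tour


def cycle_length(points, tour):
    """Total straight-line length of a tour given as a list of stop indices."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(tour, tour[1:]))


# Hypothetical bin locations on a unit square; stop 0 is the depot.
stops = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
tour = nearest_neighbour_cycle(stops)
print(tour, cycle_length(stops, tour))
```

The greedy rule is fast but gives no optimality guarantee, which is consistent with the abstract's interest in comparing it against other methods.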
|
44
|
Levitsky LI, Klein JA, Ivanov MV, Gorshkov MV. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. J Proteome Res 2019; 18:709-714. [PMID: 30576148 DOI: 10.1021/acs.jproteome.8b00717] [Citation(s) in RCA: 89] [Impact Index Per Article: 17.8] [Indexed: 11/29/2022]
Abstract
Many of the novel ideas that drive today's proteomic technologies are focused essentially on experimental or data-processing workflows. The latter are implemented and published in a number of ways, from custom scripts and programs, to projects built using general-purpose or specialized workflow engines; a large part of routine data processing is performed manually or with custom scripts that remain unpublished. Facilitating the development of reproducible data-processing workflows becomes essential for increasing the efficiency of proteomic research. To assist in overcoming the bioinformatics challenges in the daily practice of proteomic laboratories, 5 years ago we developed and announced Pyteomics, a freely available open-source library providing Python interfaces to proteomic data. We summarize the new functionality of Pyteomics developed during the time since its introduction.
Collapse
Affiliation(s)
- Lev I Levitsky
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region 141701, Russia; V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119334, Russia
| | - Joshua A Klein
- Bioinformatics Program, Boston University, Boston, Massachusetts 02215, United States
| | - Mark V Ivanov
- V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119334, Russia
| | - Mikhail V Gorshkov
- V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119334, Russia
|
45
|
Avtonomov DM, Kong A, Nesvizhskii AI. DeltaMass: Automated Detection and Visualization of Mass Shifts in Proteomic Open-Search Results. J Proteome Res 2018; 18:715-720. [PMID: 30523686 DOI: 10.1021/acs.jproteome.8b00728] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 12/20/2022]
Abstract
Routine identification of thousands of proteins in a single LC-MS experiment has long become the norm. With these vast amounts of data, more rigorous treatment of modified forms of peptides becomes possible. "Open search", a protein database search with a large precursor ion mass tolerance window, is becoming a popular method to evaluate possible sets of post-translational and chemical modifications in samples. The extraction of statistical information about the modification from peptide search results requires additional effort and data processing, such as recalibration of masses and accurate detection of precursors in MS1 signals. Here we present a software tool, DeltaMass, which performs kernel-density-based estimation of observed mass shifts and allows for the detection of poorly resolved mass deltas. The software also maps observed mass shifts to known modifications from public databases such as UniMod and augments them with additionally generated possible chemical changes to the molecule. Its interactive graphical interface provides an effective option for the visual interrogation of the data and the identification of potentially interesting mass shifts or unusual artifacts for subsequent analysis. However, the program can also be used in fully automated command-line mode to generate mass-shift peak lists as well.
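Kernel-density-based detection of mass shifts, as described in this abstract, can be sketched in a few lines. This is an illustrative toy and not the DeltaMass implementation: the Gaussian kernel, the bandwidth of 0.005 Da, the grid step of 0.001 Da, the density threshold, and the example mass differences are all assumptions, and no mass recalibration is attempted.

```python
import math


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]


def find_peaks(grid, density, min_density=0.0):
    """Return grid positions that are local maxima above `min_density`."""
    peaks = []
    for k in range(1, len(grid) - 1):
        is_local_max = density[k] > density[k - 1] and density[k] >= density[k + 1]
        if is_local_max and density[k] > min_density:
            peaks.append(grid[k])
    return peaks


# Made-up precursor mass differences (Da): a cluster near zero (unmodified)
# and a cluster near 15.995, roughly the mass of one added oxygen atom.
mass_deltas = [0.001, -0.002, 0.0005, 15.994, 15.996, 15.9949]
grid = [-1.0 + 0.001 * k for k in range(18001)]  # -1 Da to 17 Da
density = gaussian_kde(mass_deltas, grid, bandwidth=0.005)
peaks = find_peaks(grid, density, min_density=1.0)
print([round(p, 3) for p in peaks])
```

Each detected peak position could then be compared against a table of known modification masses within some tolerance, which corresponds to the mapping step the abstract describes.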
Affiliation(s)
- Dmitry M Avtonomov
- Department of Pathology, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Andy Kong
- Department of Pathology, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Alexey I Nesvizhskii
- Department of Pathology, University of Michigan, Ann Arbor, Michigan 48109, United States; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
|