1
Petrovskiy DV, Nikolsky KS, Kulikova LI, Rudnev VR, Butkova TV, Malsagova KA, Kopylov AT, Kaysheva AL. PowerNovo: de novo peptide sequencing via tandem mass spectrometry using an ensemble of transformer and BERT models. Sci Rep 2024; 14:15000. [PMID: 38951578 PMCID: PMC11217302 DOI: 10.1038/s41598-024-65861-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Accepted: 06/25/2024] [Indexed: 07/03/2024] Open
Abstract
The primary objective of analyzing the data obtained in a mass spectrometry-based proteomic experiment is peptide and protein identification, that is, the correct assignment of each tandem mass spectrum to one amino acid sequence. Comparing empirical fragment spectra with theoretically predicted ones, or matching them against a library of collected spectra, are commonly accepted strategies for identifying proteins and determining their amino acid sequences. Although these approaches are widely used and are appreciably efficient for well-characterized model organisms or measured proteins, they cannot detect novel peptide sequences that have not been previously annotated or are rare. This study presents PowerNovo, a tool for de novo sequencing of proteins using tandem mass spectra acquired with a variety of mass analyzer types and fragmentation techniques. PowerNovo uses an ensemble of models for peptide sequencing: a model for detecting regularities in tandem mass spectra, precursors, and fragment ions, and a natural language processing model that assesses peptide sequence quality and helps reconstruct noisy sequences. Testing showed that the performance of PowerNovo is comparable to, or even better than, that of the widely used PointNovo, DeepNovo, Casanovo, and Novor packages. PowerNovo also provides a complete processing pipeline for mass spectrometry data: along with predicting the peptide sequence, it includes peptide assembly and protein inference blocks.
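The comparison of an empirical fragment spectrum with a theoretically predicted one, mentioned in the abstract above, can be illustrated with a minimal sketch. It is not taken from PowerNovo; the function names `fragment_ions` and `count_matches`, the reduced residue table, and the 0.02 m/z tolerance are assumptions chosen for illustration. It computes singly charged b- and y-ion m/z values from monoisotopic residue masses and counts how many are found in an observed peak list.

```python
# Monoisotopic residue masses in Da (subset only, for illustration).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1): N-terminal prefixes
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal suffixes
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


def count_matches(observed_mz, theoretical_mz, tol=0.02):
    """Count theoretical peaks that have an observed peak within tol (m/z)."""
    return sum(
        any(abs(o - t) <= tol for o in observed_mz) for t in theoretical_mz
    )


b, y = fragment_ions("GAS")
matched = count_matches([58.03, 106.05, 300.0], b + y)
```

Database search engines score candidate peptides with far richer models than a peak count; the sketch only shows where the theoretical m/z values come from.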
2
Choi S, Paek E. pXg: Comprehensive Identification of Noncanonical MHC-I-Associated Peptides From De Novo Peptide Sequencing Using RNA-Seq Reads. Mol Cell Proteomics 2024; 23:100743. [PMID: 38403075 PMCID: PMC10979277 DOI: 10.1016/j.mcpro.2024.100743] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 02/19/2024] [Accepted: 02/21/2024] [Indexed: 02/27/2024] Open
Abstract
Discovering noncanonical peptides has been a common application of proteogenomics. Recent studies suggest that certain noncanonical peptides, known as noncanonical major histocompatibility complex-I (MHC-I)-associated peptides (ncMAPs), that bind to MHC-I may make good immunotherapeutic targets. De novo peptide sequencing is a great way to find ncMAPs since it can detect peptide sequences from their tandem mass spectra without using any sequence databases. However, this strategy has not been widely applied for ncMAP identification because there is not a good way to estimate its false-positive rates. In order to completely and accurately identify immunopeptides using de novo peptide sequencing, we describe a unique pipeline called proteomics X genomics. In contrast to current pipelines, it makes use of genomic data, RNA-Seq abundance and sequencing quality, in addition to proteomic features to increase the sensitivity and specificity of peptide identification. We show that the peptide-spectrum match quality and genetic traits have a clear relationship, showing that they can be utilized to evaluate peptide-spectrum matches. From 10 samples, we found 24,449 canonical MHC-I-associated peptides and 956 ncMAPs by using a target-decoy competition. Three hundred eighty-seven ncMAPs and 1611 canonical MHC-I-associated peptides were new identifications that had not yet been published. We discovered 11 ncMAPs produced from a squirrel monkey retrovirus in human cell lines in addition to the two ncMAPs originating from a complementarity determining region 3 in an antibody thanks to the unrestricted search space assumed by de novo sequencing. These entirely new identifications show that proteomics X genomics can make the most of de novo peptide sequencing's advantages and its potential use in the search for new immunotherapeutic targets.
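The target-decoy competition mentioned in the abstract above can be illustrated with a generic sketch of the standard false discovery rate estimate. It is not the pXg implementation: the function names, the `(score, is_decoy)` tuple layout, and the example scores are assumptions, and the competition itself (keeping only the best target or decoy match per spectrum) is assumed to have happened upstream.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR among PSMs scoring >= threshold.

    psms: iterable of (score, is_decoy) pairs, one per spectrum after
    target-decoy competition. The estimate is decoys / targets, capped at 1.
    """
    targets = sum(1 for s, d in psms if s >= threshold and not d)
    decoys = sum(1 for s, d in psms if s >= threshold and d)
    if targets == 0:
        return 1.0 if decoys else 0.0
    return min(1.0, decoys / targets)


def lowest_threshold_for_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score
    return best


# Hypothetical scores: higher is better, True marks a decoy hit.
psms = [(10, False), (9, False), (8, True), (7, False), (6, False)]
cutoff = lowest_threshold_for_fdr(psms, max_fdr=0.1)
```

In this toy list the single decoy at score 8 pushes the estimated FDR above 10% for every cutoff below 9, so only the two top-scoring target matches are accepted.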
Affiliation(s)
- Seunghyuk Choi
- Department of Computer Science, Hanyang University, Seoul, Republic of Korea
- Eunok Paek
- Department of Computer Science, Hanyang University, Seoul, Republic of Korea; Institute for Artificial Intelligence Research, Hanyang University, Seoul, Republic of Korea.
3
Chen Z, Lim YW, Neo JY, Ting Chan RS, Koh LQ, Yuen TY, Lim YH, Johannes CW, Gates ZP. De Novo Sequencing of Synthetic Bis-cysteine Peptide Macrocycles Enabled by "Chemical Linearization" of Compound Mixtures. Anal Chem 2023; 95:14870-14878. [PMID: 37724843 PMCID: PMC10569172 DOI: 10.1021/acs.analchem.3c01742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2023] [Accepted: 09/04/2023] [Indexed: 09/21/2023]
Abstract
A "chemical linearization" approach was applied to synthetic peptide macrocycles to enable their de novo sequencing from mixtures using nanoliquid chromatography-tandem mass spectrometry (nLC-MS/MS). This approach, previously applied to individual macrocycles but not to mixtures, involves cleavage of the peptide backbone at a defined position to give a product capable of generating sequence-determining fragment ions. Here, we first established the compatibility of "chemical linearization" by Edman degradation with a prominent macrocycle scaffold based on bis-Cys peptides cross-linked with the m-xylene linker, which are of major significance in therapeutics discovery. Then, using macrocycle libraries of known sequence composition, the ability to recover accurate de novo assignments to linearized products was critically tested using performance metrics unique to mixtures. Significantly, we show that linearized macrocycles can be sequenced with lower recall compared to linear peptides but with similar accuracy, which establishes the potential of using "chemical linearization" with synthetic libraries and selection procedures that yield compound mixtures. Sodiated precursor ions were identified as a significant source of high-scoring but inaccurate assignments, with potential implications for improving automated de novo sequencing more generally.
Affiliation(s)
- Zhi’ang Chen
- Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, Proteos, Singapore 138673, Republic of Singapore
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
- Yi Wee Lim
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
- Jin Yong Neo
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
- Rachel Shu Ting Chan
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
- Li Quan Koh
- Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, Proteos, Singapore 138673, Republic of Singapore
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
- Tsz Ying Yuen
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
- Yee Hwee Lim
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
- Charles W. Johannes
- Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, Proteos, Singapore 138673, Republic of Singapore
- Zachary P. Gates
- Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), 61 Biopolis Drive, Proteos, Singapore 138673, Republic of Singapore
- Institute of Sustainability for Chemicals, Energy and Environment (ISCE), Agency for Science, Technology and Research (A*STAR), 8 Biomedical Grove, #07-01 Neuros, Singapore 138665, Republic of Singapore
4
Beslic D, Tscheuschner G, Renard BY, Weller MG, Muth T. Comprehensive evaluation of peptide de novo sequencing tools for monoclonal antibody assembly. Brief Bioinform 2023; 24:bbac542. [PMID: 36545804 PMCID: PMC9851299 DOI: 10.1093/bib/bbac542] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 09/09/2022] [Revised: 10/25/2022] [Accepted: 11/10/2022] [Indexed: 12/24/2022] Open
Abstract
Monoclonal antibodies are biotechnologically produced proteins with various applications in research, therapeutics and diagnostics. Their ability to recognize and bind to specific molecule structures makes them essential research tools and therapeutic agents. Sequence information of antibodies is helpful for understanding antibody-antigen interactions and ensuring their affinity and specificity. De novo protein sequencing based on mass spectrometry is a valuable method to obtain the amino acid sequence of peptides and proteins without a priori knowledge. In this study, we evaluated six recently developed de novo peptide sequencing algorithms (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo), which were not specifically designed for antibody data. We validated their ability to identify and assemble antibody sequences on three multi-enzymatic data sets. The deep learning-based tools Casanovo and PointNovo showed an increased peptide recall across different enzymes and data sets compared with spectrum-graph-based approaches. We evaluated different error types of de novo peptide sequencing tools and their performance for different numbers of missing cleavage sites, noisy spectra and peptides of various lengths. We achieved a sequence coverage of 97.69-99.53% on the light chains of three different antibody data sets using the de Bruijn assembler ALPS and the predictions from Casanovo. However, low sequence coverage and accuracy on the heavy chains demonstrate that complete de novo protein sequencing remains a challenging issue in proteomics that requires improved de novo error correction, alternative digestion strategies and hybrid approaches such as homology search to achieve high accuracy on long protein sequences.
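The sequence coverage figures quoted in the abstract above can be read as the fraction of reference residues covered by at least one assembled or predicted sequence. The following is a minimal sketch of that calculation under simplifying assumptions: the function name `sequence_coverage` is hypothetical, matching is exact substring matching, and the isobaric residues I and L are not treated as equivalent. The study itself may define and compute coverage differently.

```python
def sequence_coverage(reference, peptides):
    """Fraction of reference positions covered by exact peptide matches."""
    if not reference:
        return 0.0
    covered = [False] * len(reference)
    for pep in peptides:
        if not pep:
            continue
        start = reference.find(pep)
        while start != -1:                      # mark every occurrence
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = reference.find(pep, start + 1)
    return sum(covered) / len(reference)


# Toy example with a made-up 10-residue reference.
coverage = sequence_coverage("ABCDEFGHIJ", ["ABC", "DEF", "XYZ"])
```

Overlapping peptides are counted once per position, so redundant predictions do not inflate the result.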
Affiliation(s)
- Denis Beslic
- Robert Koch Institute, MF1, Nordufer 20, 13353 Berlin
- Georg Tscheuschner
- Federal Institute for Materials Research and Testing (BAM), Richard-Willstätter-Straße 11, 12489 Berlin
- Bernhard Y Renard
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Prof.-Dr.-Helmert-Straße 2-3, 14482 Potsdam
- Michael G Weller
- Federal Institute for Materials Research and Testing (BAM), Richard-Willstätter-Straße 11, 12489 Berlin
- Thilo Muth
- Federal Institute for Materials Research and Testing (BAM), Richard-Willstätter-Straße 11, 12489 Berlin
5
Miura N, Okuda S. Current progress and critical challenges to overcome in the bioinformatics of mass spectrometry-based metaproteomics. Comput Struct Biotechnol J 2023; 21:1140-1150. [PMID: 36817962 PMCID: PMC9925844 DOI: 10.1016/j.csbj.2023.01.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2022] [Revised: 01/14/2023] [Accepted: 01/14/2023] [Indexed: 01/18/2023] Open
Abstract
Metaproteomics is a relatively young field that has only been studied for approximately 15 years. Nevertheless, it has the potential to play a key role in disease research by elucidating the mechanisms of communication between the human host and the microbiome. Although it has been useful in developing an understanding of various diseases, its analytical strategies remain limited to the extended application of proteomics. The sequence databases in metaproteomics must be large because of the presence of thousands of species in a typical sample, which causes problems unique to large databases. In this review, we demonstrate the usefulness of metaproteomics in disease research through examples from several studies. Additionally, we discuss the challenges of applying metaproteomics to conventional proteomics analysis methods and introduce studies that may provide clues to the solutions. We also discuss the need for a standard false discovery rate control method for metaproteomics to replace common target-decoy search approaches in proteomics and a method to ensure the reliability of peptide spectrum match.
Affiliation(s)
- Nobuaki Miura
- Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan
- Shujiro Okuda
- Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan
- Medical AI Center, Niigata University School of Medicine, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan
6
McDonnell K, Howley E, Abram F. Critical evaluation of the use of artificial data for machine learning based de novo peptide identification. Comput Struct Biotechnol J 2023; 21:2732-2743. [PMID: 37168871 PMCID: PMC10165132 DOI: 10.1016/j.csbj.2023.04.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 04/16/2023] [Accepted: 04/16/2023] [Indexed: 05/13/2023] Open
Abstract
Proteins are essential components of all living cells and so the study of their in situ expression, proteomics, has wide reaching applications. Peptide identification in proteomics typically relies on matching high resolution tandem mass spectra to a protein database but can also be performed de novo. While artificial spectra have been successfully incorporated into database search pipelines to increase peptide identification rates, little work has been done to investigate the utility of artificial spectra in the context of de novo peptide identification. Here, we perform a critical analysis of the use of artificial data for the training and evaluation of de novo peptide identification algorithms. First, we classify the different fragment ion types present in real spectra and then estimate the number of spurious matches using random peptides. We then categorise the different types of noise present in real spectra. Finally, we transfer this knowledge to artificial data and test the performance of a state-of-the-art de novo peptide identification algorithm trained using artificial spectra with and without relevant noise addition. Noise supplementation increased artificial training data performance from 30% to 77% of real training data peptide recall. While real data performance was not fully replicated, this work provides the first steps towards an artificial spectrum framework for the training and evaluation of de novo peptide identification algorithms. Further enhanced artificial spectra may allow for more in depth analysis of de novo algorithms as well as alleviating the reliance on database searches for training data.
Affiliation(s)
- Kevin McDonnell
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland
- School of Computer Science, University of Galway, Ireland
- Corresponding author at: Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland.
- Enda Howley
- School of Computer Science, University of Galway, Ireland
- Florence Abram
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Ireland
- Corresponding author.
7
McDonnell K, Abram F, Howley E. Application of a Novel Hybrid CNN-GNN for Peptide Ion Encoding. J Proteome Res 2022; 22:323-333. [PMID: 36534699 PMCID: PMC9903319 DOI: 10.1021/acs.jproteome.2c00234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Abstract
Almost all state-of-the-art de novo peptide sequencing algorithms now use machine learning models to encode fragment peaks and hence identify amino acids in mass spectrometry (MS) spectra. Previous work has highlighted how the inherent MS challenges of noise and missing peptide peaks detrimentally affect the performance of these models. In the present research we extracted and evaluated the encoding modules from 3 state-of-the-art de novo peptide sequencing algorithms. We also propose a convolutional neural network-graph neural network machine learning model for encoding peptide ions in tandem MS spectra. We compared the proposed encoding module to those used in the state-of-the-art de novo peptide sequencing algorithms by assessing their ability to identify b-ions and y-ions in MS spectra. This included a comprehensive evaluation in both real and artificial data across various levels of noise and missing peptide peaks. The proposed model performed best across all data sets using two different metrics (area under the receiver operating characteristic curve (AUC) and average precision). The work also highlighted the effect of including additional features such as intensity rank in these encoding modules as well as issues with using the AUC as a metric. This work is of significance to those designing future de novo peptide identification algorithms as it is the first step toward a new approach.
Affiliation(s)
- Kevin McDonnell
- Department of Information Technology, School of Computer Science, University of Galway, Galway H91 TK33, Ireland
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Galway H91 TK33, Ireland
- Florence Abram
- Functional Environmental Microbiology, School of Natural Sciences, Ryan Institute, University of Galway, Galway H91 TK33, Ireland
- Enda Howley
- Department of Information Technology, School of Computer Science, University of Galway, Galway H91 TK33, Ireland