1
Picciani M, Gabriel W, Giurcoiu VG, Shouman O, Hamood F, Lautenbacher L, Jensen CB, Müller J, Kalhor M, Soleymaniniya A, Kuster B, The M, Wilhelm M. Oktoberfest: Open-source spectral library generation and rescoring pipeline based on Prosit. Proteomics 2024; 24:e2300112. [PMID: 37672792 DOI: 10.1002/pmic.202300112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Revised: 08/17/2023] [Accepted: 08/18/2023] [Indexed: 09/08/2023]
Abstract
Machine learning (ML) and deep learning (DL) models for peptide property prediction such as Prosit have enabled the creation of high quality in silico reference libraries. These libraries are used in various applications, ranging from data-independent acquisition (DIA) data analysis to data-driven rescoring of search engine results. Here, we present Oktoberfest, an open source Python package of our spectral library generation and rescoring pipeline originally only available online via ProteomicsDB. Oktoberfest is largely search engine agnostic and provides access to online peptide property predictions, promoting the adoption of state-of-the-art ML/DL models in proteomics analysis pipelines. We demonstrate its ability to reproduce and even improve our results from previously published rescoring analyses on two distinct use cases. Oktoberfest is freely available on GitHub (https://github.com/wilhelm-lab/oktoberfest) and can easily be installed locally through the cross-platform PyPI Python package.
Affiliation(s)
- Mario Picciani
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Wassim Gabriel
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Victor-George Giurcoiu
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Omar Shouman
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Firas Hamood
  - Chair of Proteomics and Bioanalytics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Ludwig Lautenbacher
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Cecilia Bang Jensen
  - Chair of Proteomics and Bioanalytics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Julian Müller
  - Chair of Proteomics and Bioanalytics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Mostafa Kalhor
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Armin Soleymaniniya
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Bernhard Kuster
  - Chair of Proteomics and Bioanalytics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Matthew The
  - Chair of Proteomics and Bioanalytics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Mathias Wilhelm
  - Computational Mass Spectrometry, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
2
Tarn C, Wu YZ, Wang KF. PepPre: Promote Peptide Identification Using Accurate and Comprehensive Precursors. J Proteome Res 2024; 23:574-584. [PMID: 38157563 DOI: 10.1021/acs.jproteome.3c00293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2024]
Abstract
Accurate and comprehensive peptide precursor ions are crucial to tandem mass-spectrometry-based peptide identification. An identification engine can derive great advantages from the search space reduction enabled by credible and detailed precursors. Furthermore, by considering multiple precursors per spectrum, both the number of identifications and the spectrum explainability can be substantially improved. Here, we introduce PepPre, which detects precursors by decomposing peaks into multiple isotope clusters using linear programming methods. The detected precursors are scored and ranked, and the high-scoring ones are used for subsequent peptide identification. PepPre is evaluated both on regular and cross-linked peptide data sets and compared with 11 methods. The experimental results show that PepPre achieves a remarkable increase of 203% in PSM and 68% in peptide identifications compared to instrument software for regular peptides and 99% in PSM and 27% in peptide pair identifications for cross-linked peptides, surpassing the performance of all other evaluated methods. In addition to the increased identification numbers, further credibility evaluations evidence the reliability of the identified results. Moreover, by widening the isolation window of data acquisition from 2 to 8 Th, with PepPre, an engine is able to identify at least 64% more PSMs, thereby demonstrating the potential advantages of wide-window data acquisition. PepPre is open-source and available at http://peppre.ctarn.io.
Affiliation(s)
- Ching Tarn
  - Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
  - University of Chinese Academy of Sciences, 100049 Beijing, China
- Yu-Zhuo Wu
  - Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
  - University of Chinese Academy of Sciences, 100049 Beijing, China
- Kai-Fei Wang
  - Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
  - University of Chinese Academy of Sciences, 100049 Beijing, China
3
Li Y, He Q, Guo H, Shuai SC, Cheng J, Liu L, Shuai J. AttnPep: A Self-Attention-Based Deep Learning Method for Peptide Identification in Shotgun Proteomics. J Proteome Res 2024; 23:834-843. [PMID: 38252705 DOI: 10.1021/acs.jproteome.3c00729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
In shotgun proteomics, the proteome search engine analyzes mass spectra obtained by experiments, and then a peptide-spectrum match (PSM) is reported for each spectrum. However, most of the PSMs identified are incorrect, and therefore various postprocessing tools have been developed for reranking the peptide identifications. Yet these methods suffer from issues such as dependency on distribution, reliance on shallow models, and limited effectiveness. In this work, we propose AttnPep, a deep learning model for rescoring PSMs that utilizes the Self-Attention module. This module helps the neural network focus on features relevant to the classification of PSMs and ignore irrelevant features. This allows AttnPep to analyze the output of different search engines and improve PSM discrimination accuracy. We considered a PSM to be correct if it achieves a q-value <0.01 and compared AttnPep with the existing mainstream software PeptideProphet, Percolator, and proteoTorch. The results indicated that AttnPep identified, on average, 9.29% more correct PSMs than the other methods. Additionally, AttnPep was able to better distinguish between correct and incorrect PSMs and found more synthetic peptides in the complex SWATH data set.
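The q-value criterion mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common target-decoy convention, in which the false discovery rate at a score threshold is estimated as decoys divided by targets and the q-value is the smallest such estimate at that threshold or any looser one. The function name `q_values`, the inputs, and the higher-is-better scoring are illustrative assumptions, not AttnPep's implementation.

```python
def q_values(scores, is_decoy):
    """Estimate q-values from target-decoy labels (assumes higher score = better).

    Illustrative only: ties between scores are not handled specially.
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Running FDR estimate at each rank: decoys so far / targets so far.
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: minimum FDR over this rank and all lower-ranked PSMs.
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvals[order[rank]] = running_min
    return qvals


def accepted_targets(scores, is_decoy, threshold=0.01):
    """Indices of target PSMs whose q-value is below the threshold."""
    qvals = q_values(scores, is_decoy)
    return [i for i, q in enumerate(qvals) if q < threshold and not is_decoy[i]]
```

With such a helper, counting "correct" PSMs at a 1% threshold would reduce to `len(accepted_targets(scores, is_decoy))`, which is the kind of count the comparison above relies on.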
Affiliation(s)
- Yulin Li
  - Department of Physics, Xiamen University, Xiamen 361005, China
- Qingzu He
  - Department of Physics, Xiamen University, Xiamen 361005, China
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang 325001, China
- Huan Guo
  - Department of Physics, Xiamen University, Xiamen 361005, China
- Stella C Shuai
  - Biological Science, Northwestern University, Evanston, Illinois 60208, United States
- Jinyan Cheng
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang 325001, China
- Liyu Liu
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang 325001, China
- Jianwei Shuai
  - Department of Physics, Xiamen University, Xiamen 361005, China
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang 325001, China
4
Dodd-O J, Acevedo-Jake AM, Azizogli AR, Mulligan VK, Kumar VA. How to Design Peptides. Methods Mol Biol 2023; 2597:187-216. [PMID: 36374423 DOI: 10.1007/978-1-0716-2835-5_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Novel design of proteins to target receptors for treatment or tissue augmentation has come to the fore owing to advancements in computing power, modeling frameworks, and translational successes. Shorter proteins, or peptides, can offer combinatorial synergies with dendrimer, polymer, or other peptide carriers for enhanced local signaling, which larger proteins may sterically hinder. Here, we present a generalized method for designing a novel peptide. We first show how to create a script protocol that can be used to iteratively optimize and screen novel peptide sequences for binding a target protein. We present a step-by-step introduction to utilizing file repositories, databases, and the Rosetta software suite. RosettaScripts, an .xml interface that allows for sequential functions to be performed, is used to order the functions for repeatable performance. These strategies may lead to more groups venturing into computational design, which may result in synergies from artificial intelligence/machine learning (AI/ML) to phage display and screening. Importantly, the beginner is expected to be able to design their first peptide ligand and begin their journey in peptide drug discovery. Generally, these peptides potentially could be used to interact with any enzyme or receptor, for example, in the study of chemokines and their interactions with glycosaminoglycans and their receptors.
Affiliation(s)
- Joseph Dodd-O
  - Department of Biomedical Engineering, New Jersey Institute of Technology, Newark, NJ, USA
- Amanda M Acevedo-Jake
  - Department of Biomedical Engineering, New Jersey Institute of Technology, Newark, NJ, USA
- Vivek A Kumar
  - York Center for Environmental Engineering and Science, New Jersey Institute of Technology, Newark, NJ, USA
5
Zhu H, Jiang S, Zhou W, Chi H, Sun J, Shi J, Zhang Z, Chang L, Yu L, Zhang L, Lyu Z, Xu P, Zhang Y. Ac-LysargiNase efficiently helps genome reannotation of Mycolicibacterium smegmatis MC2 155. J Proteomics 2022; 264:104622. [DOI: 10.1016/j.jprot.2022.104622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2022] [Revised: 05/10/2022] [Accepted: 05/16/2022] [Indexed: 10/18/2022]
6
Zhou WJ, Wei ZH, He SM, Chi H. pValid 2: A deep learning based validation method for peptide identification in shotgun proteomics with increased discriminating power. J Proteomics 2022; 251:104414. [PMID: 34737111 DOI: 10.1016/j.jprot.2021.104414] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/19/2021] [Revised: 10/13/2021] [Accepted: 10/25/2021] [Indexed: 11/26/2022]
Abstract
Tandem mass spectrometry has been the principal method in shotgun proteomics for peptide and protein identification. However, incorrect identifications reported by proteome search engines are still unknown, and further validation methods are needed. We have proposed a validation method pValid before, but its scope of application is limited because two features used in pValid are related to open database search and sub-optimal peptide candidates for tandem mass spectra, and the performance on complex datasets still has room for improvement. In this study, we developed a more comprehensive validation method, pValid 2, to break these limitations by removing the two features and bringing in a new feature related to the retention time predicted by a deep learning-based method pPredRT. pValid 2 yielded an average false positive rate of 0.03% and an average false negative rate of 1.37% on three testing datasets, better than those of pValid, and flagged 8.47% to 11.31% more incorrect identifications than pValid on two complex datasets. Moreover, pValid 2 flagged almost all decoy identifications in validating the open-search datasets. In addition, the function of validating identifications given by MaxQuant and MS-GF+ was implemented in pValid 2, and the validation results showed that pValid 2 performed dramatically better than three metabolic labeling validation methods. Further considering its cost-effectiveness as a pure computational approach, pValid 2 has the potential to be a widely used validation tool for peptide identifications of any proteome search engines in shotgun proteomics. SIGNIFICANCE: Identification results given by shotgun proteomics are vital to life science research. The correctness of identifications deeply affects the precision of the subsequent studies about protein structures and functions, protein-protein interactions, pathogenic mechanism, and targeted drugs. Thus, validating the correctness of identifications is crucial and urgent. 
In 2019, we developed an identification credibility validation method named pValid, whose false positive rate (FPR) is 0.03% and false negative rate (FNR) is 1.79%, comparable to those of the gold standard, i.e., the Synthetic-peptide validation method. However, pValid can only be used for validating the results from pFind, and its validation performance on a few complex datasets still has room for improvement. So, in this submission, we proposed pValid 2, a more comprehensive computational validation method that can validate identifications from any proteome search engines with increased discriminating power.
Affiliation(s)
- Wen-Jing Zhou
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Zhuo-Hong Wei
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Si-Min He
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Hao Chi
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
7
Tarn C, Zeng WF. pDeep3: Toward More Accurate Spectrum Prediction with Fast Few-Shot Learning. Anal Chem 2021; 93:5815-5822. [PMID: 33797898 DOI: 10.1021/acs.analchem.0c05427] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 12/31/2022]
Abstract
Spectrum prediction using deep learning has attracted a lot of attention in recent years. Although existing deep learning methods have dramatically increased the prediction accuracy, there is still considerable room for improvement, which is presently limited by differences in fragmentation types or instrument settings. In this work, we use the few-shot learning method to fit the data online to make up for this shortcoming. The method is evaluated using ten data sets, where the instruments include Velos, QE, Lumos, and Sciex, with collision energies being differently set. Experimental results show that few-shot learning can achieve higher prediction accuracy with almost negligible computing resources. For example, on the data set from an untrained instrument, Sciex-6600, within about 10 s, the prediction accuracy is increased from 69.7% to 86.4%; on the CID (collision-induced dissociation) data set, the prediction accuracy of the model trained on HCD (higher energy collision dissociation) spectra is increased from 48.0% to 83.9%. It is also shown that the method is not critical to data quality and is sufficiently efficient to fill the accuracy gap. The source code of pDeep3 is available at http://pfind.ict.ac.cn/software/pdeep3.
Affiliation(s)
- Ching Tarn
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, 100190, Beijing, China
  - University of Chinese Academy of Sciences, 100049, Beijing, China
- Wen-Feng Zeng
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, 100190, Beijing, China
  - University of Chinese Academy of Sciences, 100049, Beijing, China
8
Yang J, Gao Z, Ren X, Sheng J, Xu P, Chang C, Fu Y. DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning. Anal Chem 2021; 93:6094-6103. [DOI: 10.1021/acs.analchem.0c04704] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 01/03/2023]
Affiliation(s)
- Jinghan Yang
  - CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, P. R. China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, P. R. China
- Zhiqiang Gao
  - CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, P. R. China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, P. R. China
- Xiuhan Ren
  - School of Sciences, China University of Mining & Technology, Beijing 100083, P. R. China
- Jie Sheng
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, P. R. China
- Ping Xu
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, P. R. China
- Cheng Chang
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, P. R. China
- Yan Fu
  - CEMS, NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, P. R. China
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, P. R. China
9
Chen ZL, Mao PZ, Zeng WF, Chi H, He SM. pDeepXL: MS/MS Spectrum Prediction for Cross-Linked Peptide Pairs by Deep Learning. J Proteome Res 2021; 20:2570-2582. [PMID: 33821641 DOI: 10.1021/acs.jproteome.0c01004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Abstract
In cross-linking mass spectrometry, the identification of cross-linked peptide pairs heavily relies on the ability of a database search engine to measure the similarities between experimental and theoretical MS/MS spectra. However, the lack of accurate ion intensities in theoretical spectra impairs the performance of search engines, in particular, on proteome scales. Here we introduce pDeepXL, a deep neural network to predict MS/MS spectra of cross-linked peptide pairs. To train pDeepXL, we used the transfer-learning technique because it facilitated the training with limited benchmark data of cross-linked peptide pairs. Test results on more than ten data sets showed that pDeepXL accurately predicted the spectra of both noncleavable DSS/BS3/Leiker cross-linked peptide pairs (>80% of predicted spectra have Pearson's r values higher than 0.9) and cleavable DSSO/DSBU cross-linked peptide pairs (>75% of predicted spectra have Pearson's r values higher than 0.9). pDeepXL also achieved accurate prediction on unseen data sets using an online fine-tuning technique. Lastly, integrating pDeepXL into a database search engine increased the number of identified cross-link spectra by 18% on average.
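The evaluation metric quoted in this abstract, the share of predicted spectra whose Pearson's r against the experimental spectrum exceeds 0.9, can be sketched as follows. This is a minimal illustration that assumes each spectrum is already reduced to two equal-length lists of matched fragment-ion intensities; the function names and input layout are hypothetical and are not taken from pDeepXL.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation between two equal-length intensity vectors."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_p * var_o)


def fraction_above(spectrum_pairs, threshold=0.9):
    """Fraction of (predicted, observed) pairs with Pearson's r above threshold."""
    hits = sum(1 for pred, obs in spectrum_pairs if pearson_r(pred, obs) > threshold)
    return hits / len(spectrum_pairs)
```

A result of `fraction_above(pairs) > 0.8` would correspond to a statement of the form ">80% of predicted spectra have Pearson's r values higher than 0.9", provided the intensity vectors are aligned ion by ion.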
Affiliation(s)
- Zhen-Lin Chen
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Peng-Zhi Mao
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Wen-Feng Zeng
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Hao Chi
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Si-Min He
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
10
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE Access: Practical Innovations, Open Solutions 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues that are inherent in proteogenomic data analysis, where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
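The six-frame translation mentioned in this abstract can be sketched in a few lines. The example below assumes the standard genetic code and an unambiguous DNA alphabet (A, C, G, T); the function names are illustrative and do not come from any tool reviewed in the survey. It also shows why such databases grow quickly: every genomic sequence yields six candidate protein sequences.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Codons containing unexpected characters are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return three forward-strand frames followed by three reverse-strand frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]
```

In a proteogenomic search, each of the six translations would typically be split at stop codons and digested in silico before matching against spectra, which is omitted here for brevity.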
Affiliation(s)
- Muhammad Usman Tariq
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Muhammad Haseeb
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Mohammed Aledhari
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Rehma Razzak
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Reza M Parizi
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Fahad Saeed
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
11
Wen B, Zeng W, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B. Deep Learning in Proteomics. Proteomics 2020; 20:e1900335. [PMID: 32939979 PMCID: PMC7757195 DOI: 10.1002/pmic.201900335] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Received: 05/27/2020] [Revised: 09/14/2020] [Indexed: 12/17/2022]
Abstract
Proteomics, the study of all the proteins in biological systems, is becoming a data-rich science. Protein sequences and structures are comprehensively catalogued in online databases. With recent advancements in tandem mass spectrometry (MS) technology, protein expression and post-translational modifications (PTMs) can be studied in a variety of biological systems at the global scale. Sophisticated computational algorithms are needed to translate the vast amount of data into novel biological insights. Deep learning automatically extracts data representations at high levels of abstraction from data, and it thrives in data-rich scientific research domains. Here, a comprehensive overview of deep learning applications in proteomics, including retention time prediction, MS/MS spectrum prediction, de novo peptide sequencing, PTM prediction, major histocompatibility complex-peptide binding prediction, and protein structure prediction, is provided. Limitations and the future directions of deep learning in proteomics are also discussed. This review will provide readers an overview of deep learning and how it can be used to analyze proteomics data.
Affiliation(s)
- Bo Wen
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen-Feng Zeng
  - Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, China
- Yuxing Liao
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Zhiao Shi
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Sara R. Savage
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen Jiang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
12
Wiles TA, Saba LM, Delong T. Peptide-Spectrum Match Validation with Internal Standards (P-VIS): Internally-Controlled Validation of Mass Spectrometry-Based Peptide Identifications. J Proteome Res 2020; 20:236-249. [PMID: 32924495 DOI: 10.1021/acs.jproteome.0c00355] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/30/2022]
Abstract
Liquid chromatography-tandem mass spectrometry is an increasingly powerful tool for studying proteins in the context of disease. As technological advances in instrumentation and data analysis have enabled deeper profiling of proteomes and peptidomes, the need for a rigorous, standardized approach to validate individual peptide-spectrum matches (PSMs) has emerged. To address this need, we developed a novel and broadly applicable workflow: PSM validation with internal standards (P-VIS). In this approach, the fragmentation spectrum and chromatographic retention time of a peptide within a biological sample are compared with those of a synthetic version of the putative peptide sequence match. Similarity measurements obtained for a panel of internal standard peptides are then used to calculate a prediction interval for valid matches. If the observed degree of similarity between the biological and the synthetic peptide falls within this prediction interval, then the match is considered valid. P-VIS enables systematic and objective assessment of the validity of individual PSMs, providing a measurable degree of confidence when identifying peptides by mass spectrometry.
Affiliation(s)
- Timothy Aaron Wiles
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045-0508, United States
- Laura M Saba
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045-0508, United States
- Thomas Delong
  - Department of Pharmaceutical Sciences, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of Colorado Anschutz Medical Campus, Aurora, Colorado 80045-0508, United States
13
Xu R, Sheng J, Bai M, Shu K, Zhu Y, Chang C. A Comprehensive Evaluation of MS/MS Spectrum Prediction Tools for Shotgun Proteomics. Proteomics 2020; 20:e1900345. [DOI: 10.1002/pmic.201900345] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/31/2019] [Revised: 04/29/2020] [Indexed: 01/27/2023]
Affiliation(s)
- Rui Xu
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
  - Chongqing Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
- Jie Sheng
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
  - Chongqing Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
- Mingze Bai
  - Chongqing Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
- Kunxian Shu
  - Chongqing Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
- Yunping Zhu
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
- Cheng Chang
  - State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China