1
Raj A, Aggarwal S, Singh P, Yadav AK, Dash D. PgxSAVy: A tool for comprehensive evaluation of variant peptide quality in proteogenomics - catching the (un)usual suspects. Comput Struct Biotechnol J 2024; 23:711-722. [PMID: 38292474 PMCID: PMC10825656 DOI: 10.1016/j.csbj.2023.12.033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 12/19/2023] [Accepted: 12/23/2023] [Indexed: 02/01/2024] Open
Abstract
Variant peptides resulting from single nucleotide polymorphisms (SNPs) can lead to aberrant protein functions and have translational potential for disease diagnosis and personalized therapy. Variant peptides detected by proteogenomics are fraught with a high number of false positives, but there is no uniform and comprehensive approach to assess variant quality across analysis pipelines. Despite class-specific FDR along with ad-hoc filters, the problem is far from solved. These protocols are typically manual and tedious, and thus not uniform across labs. We demonstrate that variant peptide rescoring, integrated with intensity, variant event information and search result features, allows better discrimination of correct variant peptides. Implemented into PgxSAVy - a tool for quality control of variant peptides, this method can tackle the high rate of false positives. PgxSAVy provides a rigorous framework for quality control and annotation of variant peptides on the basis of (i) variant quality, (ii) isobaric masses, and (iii) disease annotation. PgxSAVy demonstrated high accuracy by identifying true variants with 98.43% accuracy on simulated data. Large-scale proteogenomic reanalysis of ∼2.8 million spectra (PXD004010 and PXD001468) resulted in 12,705 variant peptide spectrum matches (PSMs), of which PgxSAVy evaluated 3028 (23.8%), 1409 (11.1%) and 8268 (65.1%) as confident, semi-confident and doubtful, respectively. PgxSAVy also annotates the variants based on their pathogenicity and provides support for assisted manual validation. The analysis of proteins carrying variants can provide fine granularity in discovering important pathways. PgxSAVy will advance personalized medicine by providing a comprehensive framework for quality control and prioritization of proteogenomics variants. PgxSAVy is freely available at https://pgxsavy.igib.res.in/ as a webserver and https://github.com/anuragraj/PgxSAVy as a stand-alone tool.
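The class breakdown quoted in the abstract can be checked with a few lines of arithmetic. The sketch below only re-derives the percentages from the three reported PSM counts; the function name `class_shares` is made up for illustration and is not part of PgxSAVy.

```python
# PSM counts per confidence class, as reported in the abstract above.
counts = {"confident": 3028, "semi-confident": 1409, "doubtful": 8268}

def class_shares(counts):
    """Return each class's share of all PSMs as a percentage, rounded to one decimal.

    Hypothetical helper for illustration only; not a PgxSAVy function.
    """
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

total = sum(counts.values())   # should equal the 12,705 variant PSMs reported
shares = class_shares(counts)  # per-class percentages
```

The three counts add up to the reported total, and the rounded shares match the percentages given in the abstract.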
Affiliation(s)
- Anurag Raj
  - G. N. Ramachandran Knowledge Centre for Genomics Informatics, CSIR – Institute of Genomics and Integrative Biology, New Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Suruchi Aggarwal
  - Computational and Mathematical Biology Centre (CMBC), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Drug Discovery (CDD), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Microbial Research (CMR), Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
- Prateek Singh
  - G. N. Ramachandran Knowledge Centre for Genomics Informatics, CSIR – Institute of Genomics and Integrative Biology, New Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Amit Kumar Yadav
  - Computational and Mathematical Biology Centre (CMBC), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Drug Discovery (CDD), 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
  - Centre for Microbial Research (CMR), Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd Milestone, Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
- Debasis Dash
  - G. N. Ramachandran Knowledge Centre for Genomics Informatics, CSIR – Institute of Genomics and Integrative Biology, New Delhi, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India

2
Zhang M, Gong C, Ge F, Yu DJ. FCMSTrans: Accurate Prediction of Disease-Associated nsSNPs by Utilizing Multiscale Convolution and Deep Feature Combination within a Transformer Framework. J Chem Inf Model 2024; 64:1394-1406. [PMID: 38349747 DOI: 10.1021/acs.jcim.3c02025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Nonsynonymous single-nucleotide polymorphisms (nsSNPs), implicated in over 6000 diseases, necessitate accurate prediction for expedited drug discovery and improved disease diagnosis. In this study, we propose FCMSTrans, a novel nsSNP predictor that innovatively combines the transformer framework and multiscale modules for comprehensive feature extraction. The distinctive attribute of FCMSTrans resides in a deep feature combination strategy. This strategy amalgamates evolutionary-scale modeling (ESM) and ProtTrans (PT) features, providing an understanding of protein biochemical properties, and position-specific scoring matrix, secondary structure, predicted relative solvent accessibility, and predicted disorder (PSPP) features, which are derived from four protein sequences and structure-oriented characteristics. This feature combination offers a comprehensive view of the molecular dynamics involving nsSNPs. Our model employs the transformer's self-attention mechanisms across multiple layers, extracting higher-level and abstract representations. Simultaneously, varied-level features are captured by multiscale convolutions, enriching feature abstraction at multiple echelons. Our comparative analyses with existing methodologies highlight significant improvements made possible by the integrated feature fusion approach adopted in FCMSTrans. This is further substantiated by performance assessments based on diverse data sets, such as PredictSNP, MMP, and PMD, with areas under the curve (AUCs) of 0.869, 0.819, and 0.693, respectively. Furthermore, FCMSTrans shows robustness and superiority by outperforming the current best predictor, PROVEAN, in a blind test conducted on a third-party data set, achieving an impressive AUC score of 0.7838. The Python code of FCMSTrans is available at https://github.com/gc212/FCMSTrans for academic usage.
Affiliation(s)
- Ming Zhang
  - School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
- Chao Gong
  - School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
- Fang Ge
  - State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China

3
Lin TT, Zhang T, Kitata RB, Liu T, Smith RD, Qian WJ, Shi T. Mass spectrometry-based targeted proteomics for analysis of protein mutations. Mass Spectrom Rev 2023; 42:796-821. [PMID: 34719806 PMCID: PMC9054944 DOI: 10.1002/mas.21741] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 06/19/2021] [Revised: 09/28/2021] [Accepted: 10/07/2021] [Indexed: 05/03/2023]
Abstract
Cancers are caused by accumulated DNA mutations. Recognition of the central role of mutations in cancer, together with recent advances in next-generation sequencing, has initiated the massive screening of clinical samples and the identification of thousands of cancer-associated gene mutations. However, proteomic analysis of the expressed mutation products lags far behind genomic (transcriptomic) analysis. With comprehensive global proteomics analysis, only a small percentage of single nucleotide variants detected by DNA and RNA sequencing have been observed as single amino acid variants due to current technical limitations. Proteomic analysis of mutations is important, with the potential to advance cancer biomarker development and the discovery of new therapeutic targets for more effective disease treatment. Targeted proteomics using selected reaction monitoring (also known as multiple reaction monitoring) and parallel reaction monitoring has emerged as a powerful tool with significant advantages over global proteomics for analysis of protein mutations in terms of detection sensitivity, quantitation accuracy and overall practicality (e.g., reliable identification and the scale of quantification). Herein we review recent advances in targeted proteomics technology for enhancing detection sensitivity and multiplexing capability and highlight its broad biomedical applications for analysis of protein mutations in human bodily fluids, tissues, and cell lines. Furthermore, we review recent applications of top-down proteomics for analysis of protein mutations. Unlike the commonly used bottom-up proteomics, which requires digestion of proteins into peptides, top-down proteomics directly analyzes intact proteins for more precise characterization of mutation isoforms. Finally, general perspectives on the potential of achieving both high sensitivity and high sample throughput for large-scale targeted detection and quantification of important protein mutations are discussed.
Affiliation(s)
- Tai-Tu Lin
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, USA
- Tong Zhang
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, USA
- Reta B. Kitata
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, USA
- Tao Liu
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, USA
- Richard D. Smith
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, USA
- Wei-Jun Qian
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, USA
- Tujin Shi
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354, USA

4
Yu Y, Zhang Z, Dong X, Yang R, Duan Z, Xiang Z, Li J, Li G, Yan F, Xue H, Jiao D, Lu J, Lu H, Zhang W, Wei Y, Fan S, Li J, Jia J, Zhang J, Ji J, Liu P, Lu H, Zhao H, Chen S, Wei C, Chen H, Zhu Z. Pangenomic analysis of Chinese gastric cancer. Nat Commun 2022; 13:5412. [PMID: 36109518 PMCID: PMC9477819 DOI: 10.1038/s41467-022-33073-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/06/2022] [Accepted: 08/31/2022] [Indexed: 11/25/2022] Open
Abstract
Pangenomic studies might improve the completeness of the human reference genome (GRCh38) and promote precision medicine. Here, we use an automated pipeline for human pangenomic analysis to build a gastric cancer pan-genome from 185 paired deep-sequencing datasets (370 samples), and characterize gene presence-absence variations (PAVs) at the whole-genome level. The genes ACOT1, GSTM1, SIGLEC14 and UGT2B17 are identified as highly absent in the gastric cancer population. A set of genes is predicted from sequences that do not align to GRCh38. We successfully locate one of the predicted genes, GC0643, on chromosome 9q34.2. Overexpression of GC0643 significantly inhibits cell growth, cell migration and invasion, and cell cycle progression, and induces apoptosis in cancer cells. These tumor suppressor functions can be reversed by shGC0643 knockdown. GC0643 is recorded in the NCBI database (GenBank: MW194843.1). Collectively, the robust pan-genome strategy provides a deeper understanding of gene PAVs in the human cancer genome.
Human pan-genomics is increasing our knowledge of genomic diversity and genetic factors in disease. Here, the authors built a gastric cancer pan-genome that included the sequences of Chinese Han patients, and predicted putative and previously unaligned genes associated with gastric cancer.

5
Fancello L, Burger T. An analysis of proteogenomics and how and when transcriptome-informed reduction of protein databases can enhance eukaryotic proteomics. Genome Biol 2022; 23:132. [PMID: 35725496 PMCID: PMC9208142 DOI: 10.1186/s13059-022-02701-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/29/2021] [Accepted: 06/09/2022] [Indexed: 12/03/2022] Open
Abstract
Background: Proteogenomics aims to identify variant or unknown proteins in bottom-up proteomics, by searching transcriptome- or genome-derived custom protein databases. However, empirical observations reveal that these large proteogenomic databases produce lower-sensitivity peptide identifications. Various strategies have been proposed to avoid this, including the generation of reduced transcriptome-informed protein databases, which only contain proteins whose transcripts are detected in the sample-matched transcriptome. These were found to increase peptide identification sensitivity. Here, we present a detailed evaluation of this approach.
Results: We establish that the increased sensitivity in peptide identification is in fact a statistical artifact, directly resulting from the limited capability of target-decoy competition to accurately model incorrect target matches when using excessively small databases. As anti-conservative false discovery rates (FDRs) are likely to hamper the robustness of the resulting biological conclusions, we advocate for alternative FDR control methods that are less sensitive to database size. Nevertheless, reduced transcriptome-informed databases are useful, as they reduce the ambiguity of protein identifications, yielding fewer shared peptides. Furthermore, searching the reference database and subsequently filtering proteins whose transcripts are not expressed reduces protein identification ambiguity to a similar extent, but is more transparent and reproducible.
Conclusions: In summary, using transcriptome information is an interesting strategy that has not been promoted for the right reasons. While the increase in peptide identifications from searching reduced transcriptome-informed databases is an artifact caused by the use of an FDR control method unsuitable to excessively small databases, transcriptome information can reduce the ambiguity of protein identifications.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02701-2.
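For readers unfamiliar with target-decoy competition (TDC), the following is a minimal sketch of how an FDR cutoff is usually derived from a ranked list of target and decoy matches. It is not the authors' code: the function name `tdc_fdr_threshold`, the input layout, and the conservative `(decoys + 1) / targets` estimate are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

def tdc_fdr_threshold(
    psms: List[Tuple[float, bool]], alpha: float = 0.01
) -> Tuple[Optional[float], int]:
    """Find the most permissive score cutoff whose estimated FDR is <= alpha.

    psms: (score, is_decoy) pairs that survived target-decoy competition;
          a higher score means a better match. Tied scores are not treated
          specially in this sketch.
    Returns (score_cutoff, accepted_target_count), or (None, 0) if no
    cutoff reaches the requested FDR level.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best: Tuple[Optional[float], int] = (None, 0)
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: decoys above the cutoff stand in for incorrect
        # target matches; the +1 keeps the estimate conservative.
        if targets and (decoys + 1) / targets <= alpha:
            best = (score, targets)
    return best
```

With only a handful of matches, the `+1` term alone keeps the estimate above a strict level such as 1%, so nothing is accepted. Estimators without that correction rest on very few decoys when the database is small, which is the regime the abstract warns about.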
Affiliation(s)
- Laura Fancello
  - CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France
- Thomas Burger
  - CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France

6
Ge F, Zhang Y, Xu J, Muhammad A, Song J, Yu DJ. Prediction of disease-associated nsSNPs by integrating multi-scale ResNet models with deep feature fusion. Brief Bioinform 2021; 23:6483068. [PMID: 34953462 DOI: 10.1093/bib/bbab530] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/02/2021] [Revised: 11/13/2021] [Accepted: 11/16/2021] [Indexed: 11/13/2022] Open
Abstract
More than 6000 human diseases have been recorded to be caused by non-synonymous single nucleotide polymorphisms (nsSNPs). Rapid and accurate prediction of pathogenic nsSNPs can improve our understanding of the principle and design of new drugs, which remains an unresolved challenge. In the present work, a new computational approach, termed MSRes-MutP, is proposed based on ResNet blocks with multi-scale kernel size to predict disease-associated nsSNPs. Feeding MSRes-MutP the serial concatenation of the four extracted feature types does not obviously improve its performance. To address this, a second model, FFMSRes-MutP, is developed, which utilizes a deep feature fusion strategy and multi-scale 2D-ResNet and 1D-ResNet blocks to extract relevant two-dimensional features and physicochemical properties. FFMSRes-MutP with the concatenated features achieves a better performance than that with individual features. The performance of FFMSRes-MutP is benchmarked on five different datasets. It achieves Matthews correlation coefficients (MCC) of 0.593 and 0.618 on the PredictSNP and MMP datasets, which are 0.101 and 0.210 higher than those of the existing best method, PredictSNP1. When tested on the HumDiv and HumVar datasets, it achieves MCC of 0.9605 and 0.9507, and area under curve (AUC) of 0.9796 and 0.9748, which are 0.1747 and 0.2669, 0.0853 and 0.1335, respectively, higher than the existing best methods PolyPhen-2 and FATHMM (weighted). In addition, in a blind test using a third-party dataset, FFMSRes-MutP performs as the second-best predictor (with MCC and AUC of 0.5215 and 0.7633, respectively), when compared with the other four predictors. Extensive benchmarking experiments demonstrate that FFMSRes-MutP achieves effective feature fusion and can be explored as a useful approach for predicting disease-associated nsSNPs. The webserver is freely available at http://csbio.njust.edu.cn/bioinf/ffmsresmutp/ for academic use.
Affiliation(s)
- Fang Ge
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Ying Zhang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jian Xu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Arif Muhammad
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China

7
The structure-based cancer-related single amino acid variation prediction. Sci Rep 2021; 11:13599. [PMID: 34193921 PMCID: PMC8245468 DOI: 10.1038/s41598-021-92793-w] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/18/2021] [Accepted: 06/16/2021] [Indexed: 11/09/2022] Open
Abstract
Single amino acid variation (SAV) is an amino acid substitution of the protein sequence that can potentially influence the entire protein structure or function, as well as its binding affinity. Protein destabilization is related to diseases, including several cancers, although using traditional experiments to clarify the relationship between SAVs and cancer uses much time and resources. Some SAV prediction methods use computational approaches, with most predicting SAV-induced changes in protein stability. In this investigation, all SAV characteristics generated from protein sequences, structures and the microenvironment were converted into feature vectors and fed into an integrated predicting system using a support vector machine and genetic algorithm. Critical features were used to estimate the relationship between their properties and cancers caused by SAVs. We describe how we developed a prediction system based on protein sequences and structure that is capable of distinguishing if the SAV is related to cancer or not. The five-fold cross-validation performance of our system is 89.73% for the accuracy, 0.74 for the Matthews correlation coefficient, and 0.81 for the F1 score. We have built an online prediction server, CanSavPre ( http://bioinfo.cmu.edu.tw/CanSavPre/ ), which is expected to become a useful, practical tool for cancer research and precision medicine.
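The three figures of merit quoted above (accuracy, Matthews correlation coefficient, F1 score) all derive from the same binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts in the usage note are illustrative assumptions and do not come from the CanSavPre study.

```python
import math

def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, F1 and Matthews correlation coefficient from confusion-matrix counts.

    Illustrative helper; undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}
```

For example, `binary_metrics(tp=40, fp=10, tn=40, fn=10)` gives an accuracy and F1 of 0.8 but an MCC of only 0.6. MCC uses all four cells of the matrix, which is why it is often reported alongside accuracy.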

8
Salz R, Bouwmeester R, Gabriels R, Degroeve S, Martens L, Volders PJ, 't Hoen PAC. Personalized Proteome: Comparing Proteogenomics and Open Variant Search Approaches for Single Amino Acid Variant Detection. J Proteome Res 2021; 20:3353-3364. [PMID: 33998808 PMCID: PMC8280751 DOI: 10.1021/acs.jproteome.1c00264] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/02/2021] [Indexed: 12/30/2022]
Abstract
Discovery of variant peptides such as a single amino acid variant (SAAV) in shotgun proteomics data is essential for personalized proteomics. Both the resolution of shotgun proteomics methods and the search engines have improved dramatically, allowing for confident identification of SAAV peptides. However, it is not yet known if these methods are truly successful in accurately identifying SAAV peptides without prior genomic information in the search database. We studied this in unprecedented detail by exploiting publicly available long-read RNA sequences and shotgun proteomics data from the gold standard reference cell line NA12878. Searching spectra from this cell line with the state-of-the-art open modification search engine ionbot against carefully curated search databases resulted in 96.7% false-positive SAAVs and an 85% lower true positive rate than searching with peptide search databases that incorporate prior genetic information. While adding genetic variants to the search database remains indispensable for correct peptide identification, inclusion of long-read RNA sequences in the search database contributes only 0.3% new peptide identifications. These findings reveal the differences in SAAV detection that result from various approaches, providing guidance to researchers studying SAAV peptides and developers of peptide spectrum identification tools.
Affiliation(s)
- Renee Salz
  - Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen 6525 GA, The Netherlands
- Robbin Bouwmeester
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Ralf Gabriels
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Sven Degroeve
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Pieter-Jan Volders
  - VIB-UGent Center for Medical Biotechnology VIB, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
  - Department of Biomolecular Medicine, Ghent University, Technologiepark-Zwijnaarde 75, 9052 Ghent, Belgium
- Peter A C 't Hoen
  - Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen 6525 GA, The Netherlands

9
Forensic proteomics. Forensic Sci Int Genet 2021; 54:102529. [PMID: 34139528 DOI: 10.1016/j.fsigen.2021.102529] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 09/24/2020] [Revised: 05/06/2021] [Accepted: 05/07/2021] [Indexed: 12/19/2022]
Abstract
Protein is a major component of all biological evidence, often the matrix that embeds other biomolecules such as polynucleotides, lipids, carbohydrates, and small molecules. The proteins in a sample reflect the transcriptional and translational program of the originating cell types. Because of this, proteins can be used to identify body fluids and tissues, as well as convey genetic information in the form of single amino acid polymorphisms, the result of non-synonymous SNPs. This review explores the application and potential of forensic proteomics. The historical role that protein analysis played in the development of forensic science is examined. This review details how innovations in proteomic mass spectrometry have addressed many of the historical limitations of forensic protein science, and how the application of forensic proteomics differs from proteomics in the life sciences. Two more developed applications of forensic proteomics are examined in detail: body fluid and tissue identification, and proteomic genotyping. The review then highlights developing areas of proteomics that have the potential to impact forensic science in the near future: fingermark analysis, species identification, peptide toxicology, proteomic sex estimation, and estimation of post-mortem intervals. Finally, the review highlights some of the newer innovations in proteomics that may drive further development of the field. In addition to potential impact, this review also attempts to evaluate the stage of each application in the development, validation and implementation process. This review is targeted at investigators who are interested in learning about proteomics in a forensic context and expanding the amount of information they can extract from biological evidence.

10
Ge F, Hu J, Zhu YH, Arif M, Yu DJ. TargetMM: Accurate Missense Mutation Prediction by Utilizing Local and Global Sequence Information with Classifier Ensemble. Comb Chem High Throughput Screen 2021; 25:38-52. [DOI: 10.2174/1386207323666201204140438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2020] [Revised: 10/22/2020] [Accepted: 10/26/2020] [Indexed: 11/22/2022]
Abstract
Aim and Objective: Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM.
Materials and Methods: In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones.
Results: Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source codes and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.
Affiliation(s)
- Fang Ge
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Muhammad Arif
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

11
Jorge GL, Balbuena TS. Identification of novel protein-coding sequences in Eucalyptus grandis plants by high-resolution mass spectrometry. Biochim Biophys Acta Proteins Proteom 2020; 1869:140594. [PMID: 33385527 DOI: 10.1016/j.bbapap.2020.140594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2020] [Revised: 12/11/2020] [Accepted: 12/23/2020] [Indexed: 10/22/2022]
Abstract
Eucalyptus species are widely used in the forestry industry, and a significant increase in the number of sequences available in database repositories has been observed for these species. In proteomics, a protein is identified by correlating the theoretical fragmentation spectrum derived from genomic/transcriptomic data against the experimental fragmentation mass spectrum acquired from large-scale analysis of protein mixtures. Proteogenomics is an alternative approach that can identify novel proteins encoded by regions previously considered as non-coding. This study aimed to confidently identify and confirm the existence of previously unknown protein-coding sequences in the Eucalyptus grandis genome. To this end, we used a modified spectral correlation strategy and a dedicated de novo peptide sequencing pipeline. Using this strategy, we confidently identified 41 novel peptide forms and six peptides containing at least one single amino acid substitution. The most representative genomic class of novel peptides was identified as originating from alternative reading frames. In contrast, no clear single amino acid substitution pattern was identified. Validation of the identifications was carried out using a parallel reaction monitoring approach that provided further mass spectrometry support for the existence of the novel peptide sequences. Data are available via ProteomeXchange with identifier PXD022110.
Affiliation(s)
- Gabriel Lemes Jorge
  - Sao Paulo State University, Department of Technology, Jaboticabal, Sao Paulo, Brazil

12
Shukla N, Siva N, Malik B, Suravajhala P. Current Challenges and Implications of Proteogenomic Approaches in Prostate Cancer. Curr Top Med Chem 2020; 20:1968-1980. [PMID: 32703135 DOI: 10.2174/1568026620666200722112450] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/17/2020] [Revised: 05/30/2020] [Accepted: 06/29/2020] [Indexed: 12/16/2022]
Abstract
In the recent past, next-generation sequencing (NGS) approaches have heralded the omics era. With NGS data burgeoning, there arose a need to disseminate the omic data better. Proteogenomics has been widely used for characterising the functions of candidate genes and is applied in ascertaining various diseased phenotypes, including cancers. However, not much is known about the role and application of proteogenomics, especially in prostate cancer (PCa). In this review, we outline the need for proteogenomic approaches, their applications and their role in PCa.
Affiliation(s)
- Nidhi Shukla
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India; Department of Chemistry, School of Basic Sciences, Manipal University Jaipur, Jaipur, India
- Narmadhaa Siva
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India
- Babita Malik
- Department of Chemistry, School of Basic Sciences, Manipal University Jaipur, Jaipur, India
- Prashanth Suravajhala
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India
|
13
|
Kiseleva O, Zgoda V, Naryzhny S, Poverennaya E. Empowering Shotgun Mass Spectrometry with 2DE: A HepG2 Study. Int J Mol Sci 2020; 21:E3813. [PMID: 32471280 PMCID: PMC7312985 DOI: 10.3390/ijms21113813] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/28/2020] [Revised: 05/21/2020] [Accepted: 05/26/2020] [Indexed: 01/07/2023] Open
Abstract
One of the major goals of the Chromosome-Centric Human Proteome Project (C-HPP) is to catalog and annotate a myriad of heterogeneous proteoforms, produced by ca. 20 thousand genes. To achieve a detailed and personalized understanding of proteomes, we suggest using a customized RNA-seq library of potential proteoforms, which includes aberrant variants specific to certain biological samples. Two-dimensional electrophoresis coupled with high-performance liquid chromatography allowed us to reduce the complexity of the biological mixture ahead of shotgun mass spectrometry. To benchmark the proposed pipeline, we examined the heterogeneity of the HepG2 hepatoblastoma cell line proteome. Data are available via ProteomeXchange with identifier PXD018450.
Affiliation(s)
- Olga Kiseleva
- Institute of Biomedical Chemistry, Moscow 119121, Russia
- Victor Zgoda
- Institute of Biomedical Chemistry, Moscow 119121, Russia
- Stanislav Naryzhny
- Institute of Biomedical Chemistry, Moscow 119121, Russia
- Petersburg Nuclear Physics Institute named by B.P. Konstantinov of NRC “Kurchatov Institute”, Gatchina 188300, Russia
|
14
|
Yi X, Gong F, Fu Y. Transfer posterior error probability estimation for peptide identification. BMC Bioinformatics 2020; 21:173. [PMID: 32366221 PMCID: PMC7199311 DOI: 10.1186/s12859-020-3485-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/19/2019] [Accepted: 04/08/2020] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND: In shotgun proteomics, database searching of tandem mass spectra results in a great number of peptide-spectrum matches (PSMs), many of which are false positives. Quality control of PSMs is a multiple hypothesis testing problem, and the false discovery rate (FDR) or the posterior error probability (PEP) is the commonly used statistical confidence measure. PEP, also called local FDR, can evaluate the confidence of individual PSMs and thus is more desirable than FDR, which evaluates the global confidence of a collection of PSMs. Estimation of PEP can be achieved by decomposing the null and alternative distributions of PSM scores as long as the given data is sufficient. However, in many proteomic studies, only a group (subset) of PSMs, e.g. those with specific post-translational modifications, are of interest. The group can be very small, making the direct PEP estimation by the group data inaccurate, especially for the high-score area where the score threshold is taken. Using the whole set of PSMs to estimate the group PEP is also inappropriate, because the null and/or alternative distributions of the group can be very different from those of combined scores. RESULTS: The transfer PEP algorithm is proposed to more accurately estimate the PEPs of peptide identifications in small groups. Transfer PEP derives the group null distribution through its empirical relationship with the combined null distribution, and estimates the group alternative distribution, as well as the null proportion, using an iterative semi-parametric method. Validated on both simulated data and real proteomic data, transfer PEP showed remarkably higher accuracy than the direct combined and separate PEP estimation methods. CONCLUSIONS: We presented a novel approach to group PEP estimation for small groups and implemented it for the peptide identification problem in proteomics. The methodology of the approach is in principle applicable to the small-group PEP estimation problems in other fields.
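The distinction between global FDR and PEP (local FDR) drawn in this abstract can be illustrated with a minimal Python sketch. It is not the transfer PEP algorithm: it assumes a plain target-decoy setup and approximates PEP per score bin as the decoy-to-target ratio in that bin. The function names, scores and bin edges are hypothetical.

```python
from bisect import bisect_right


def global_fdr(targets, decoys, threshold):
    """Target-decoy FDR estimate for all PSMs scoring at or above threshold."""
    n_targets = sum(s >= threshold for s in targets)
    n_decoys = sum(s >= threshold for s in decoys)
    return n_decoys / n_targets if n_targets else 0.0


def binned_pep(targets, decoys, edges):
    """Crude local FDR: decoys/targets within each half-open bin [edge_i, edge_i+1)."""
    n_bins = len(edges) - 1
    t_counts = [0] * n_bins
    d_counts = [0] * n_bins
    for scores, counts in ((targets, t_counts), (decoys, d_counts)):
        for s in scores:
            i = bisect_right(edges, s) - 1
            if 0 <= i < n_bins:
                counts[i] += 1
    # None marks bins with no target PSMs, where no estimate is possible.
    return [min(1.0, d / t) if t else None for t, d in zip(t_counts, d_counts)]


# Hypothetical scores for illustration only.
target_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
decoy_scores = [1, 2, 3, 9]
fdr_at_5 = global_fdr(target_scores, decoy_scores, threshold=5)
pep_bins = binned_pep(target_scores, decoy_scores, edges=[0, 5, 11])
```

With only a handful of PSMs per bin, as in a small group, each per-bin ratio rests on very few counts, which is the instability the abstract describes for direct group-level PEP estimation.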
Affiliation(s)
- Xinpei Yi
- National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Fuzhou Gong
- National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
- Yan Fu
- National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
|
15
|
Wen B, Li K, Zhang Y, Zhang B. Cancer neoantigen prioritization through sensitive and reliable proteogenomics analysis. Nat Commun 2020; 11:1759. [PMID: 32273506 PMCID: PMC7145864 DOI: 10.1038/s41467-020-15456-w] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 07/29/2019] [Accepted: 03/10/2020] [Indexed: 01/01/2023] Open
Abstract
Genomics-based neoantigen discovery can be enhanced by proteomic evidence, but there remains a lack of consensus on the performance of different quality control methods for variant peptide identification in proteogenomics. We propose to use the difference between accurately predicted and observed retention times for each peptide as a metric to evaluate different quality control methods. To this end, we develop AutoRT, a deep learning algorithm with high accuracy in retention time prediction. Analysis of three cancer data sets with a total of 287 tumor samples using different quality control strategies results in substantially different numbers of identified variant peptides and putative neoantigens. Our systematic evaluation, using the proposed retention time metric, provides insights and practical guidance on the selection of quality control strategies. We implement the recommended strategy in a computational workflow named NeoFlow to support proteogenomics-based neoantigen prioritization, enabling more sensitive discovery of putative neoantigens. Identifying mutation-derived neoantigens by proteogenomics requires robust strategies for quality control. Here, the authors propose peptide retention time as an evaluation metric for proteogenomics quality control methods, and develop a deep learning algorithm for accurate retention time prediction.
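The retention time metric described above can be sketched in a few lines of Python. This is an illustrative assumption of how such a comparison might be set up, not the NeoFlow or AutoRT implementation: the predicted values stand in for the output of any retention time predictor, and the field names, method labels and numbers are hypothetical.

```python
from statistics import median


def delta_rt(psms):
    """Absolute difference between observed and predicted retention time per PSM."""
    return [abs(p["observed_rt"] - p["predicted_rt"]) for p in psms]


def summarize(psms_by_method):
    """Median |delta RT| of the variant PSMs retained by each quality control method."""
    return {method: median(delta_rt(psms)) for method, psms in psms_by_method.items()}


# Hypothetical variant PSMs passing two quality control strategies (minutes).
retained = {
    "strict_qc": [
        {"observed_rt": 10.0, "predicted_rt": 10.5},
        {"observed_rt": 20.0, "predicted_rt": 19.0},
        {"observed_rt": 30.0, "predicted_rt": 30.25},
    ],
    "lenient_qc": [
        {"observed_rt": 10.0, "predicted_rt": 10.5},
        {"observed_rt": 20.0, "predicted_rt": 26.0},
        {"observed_rt": 30.0, "predicted_rt": 42.0},
    ],
}
summary = summarize(retained)
```

Under this reading, a method whose retained variant peptides show a smaller typical deviation between predicted and observed retention time is letting through fewer questionable identifications.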
Affiliation(s)
- Bo Wen
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Kai Li
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Yun Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
|
16
|
Kwon OK, Ha YS, Lee JN, Kim S, Lee H, Chun SY, Kwon TG, Lee S. Comparative Proteome Profiling and Mutant Protein Identification in Metastatic Prostate Cancer Cells by Quantitative Mass Spectrometry-based Proteogenomics. Cancer Genomics Proteomics 2019; 16:273-286. [PMID: 31243108 DOI: 10.21873/cgp.20132] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/21/2019] [Revised: 04/16/2019] [Accepted: 04/18/2019] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND/AIM: Prostate cancer (PCa) is the most frequent cancer found in males worldwide. The aim of this study was to identify new biomarkers using mutated peptides for the prognosis and prediction of advanced PCa, based on proteogenomics. MATERIALS AND METHODS: The tryptic peptides were analyzed by tandem mass tag-based quantitative proteomics. Proteogenomics was used to identify mutant peptides as novel biomarkers in advanced PCa. RESULTS: Using a human database, increased levels of INTS7 and decreased levels of SH3BGRL were found to be associated with the aggressiveness of PCa. Using proteogenomics and a cancer mutation database, 70 mutant peptides were identified in PCa cell lines. Using parallel reaction monitoring, the expression of seven mutant peptides was found to be altered in tumors, amongst which CAPN2 D22E was the most significantly up-regulated mutant peptide in PCa tissues. CONCLUSION: Altered mutant peptides present in PCa tissue could be used as new biomarkers in advanced PCa.
Affiliation(s)
- Oh Kwang Kwon
- BK21 Plus KNU Multi-Omics-based Creative Drug Research Team, College of Pharmacy, Research Institute of Pharmaceutical Sciences, Kyungpook National University, Daegu, Republic of Korea
- Yun-Sok Ha
- Department of Urology, School of Medicine, Kyungpook National University, Daegu, Republic of Korea; Department of Urology, Kyungpook National University Hospital, Daegu, Republic of Korea
- Jun Nyung Lee
- Department of Urology, School of Medicine, Kyungpook National University, Daegu, Republic of Korea; Department of Urology, Kyungpook National University Hospital, Daegu, Republic of Korea
- Sunjoo Kim
- BK21 Plus Team for Creative Leader Program for Pharmacomics-based Future, Pharmacy and Integrated Research Institute of Pharmaceutical Sciences, College of Pharmacy, The Catholic University of Korea, Bucheon, Republic of Korea
- Hyesuk Lee
- BK21 Plus Team for Creative Leader Program for Pharmacomics-based Future, Pharmacy and Integrated Research Institute of Pharmaceutical Sciences, College of Pharmacy, The Catholic University of Korea, Bucheon, Republic of Korea
- So Young Chun
- Joint Institute for Regenerative Medicine, Kyungpook National University Hospital, Daegu, Republic of Korea
- Tae Gyun Kwon
- Department of Urology, School of Medicine, Kyungpook National University, Daegu, Republic of Korea; Department of Urology, Kyungpook National University Hospital, Daegu, Republic of Korea
- Sangkyu Lee
- BK21 Plus KNU Multi-Omics-based Creative Drug Research Team, College of Pharmacy, Research Institute of Pharmaceutical Sciences, Kyungpook National University, Daegu, Republic of Korea
|
17
|
Na S, Kim J, Paek E. MODplus: Robust and Unrestrictive Identification of Post-Translational Modifications Using Mass Spectrometry. Anal Chem 2019; 91:11324-11333. [PMID: 31365238 DOI: 10.1021/acs.analchem.9b02445] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 01/02/2023]
Abstract
Post-translational modifications regulate various cellular processes and are of great biological interest. Unrestrictive searches of mass spectrometry data enable the detection of any type of modification. Here we propose MODplus, which makes practical unrestrictive searches possible by allowing (1) hundreds of modifications, (2) multiple modifications per peptide, (3) the whole proteome database, and (4) any tolerance values in search parameters. The utility of MODplus was demonstrated in large human data sets of HEK293 cells and TMT-labeled phosphorylation enrichment. Notably, MODplus supports identifying different modification types at multiple sites and reports real chemical and biological modifications, as it has been very labor intensive to link unrestrictive search results to real modifications. We also confirmed the presence of Missing Precursor (MP) spectra that were not identifiable using targeted precursor masses. The MP spectra mostly resulted in identifications of wrong modifications and negatively affected the overall performance, often by as much as 10%. MODplus can rapidly recognize MP spectra and correct their identifications, resulting in an identification rate of up to 70% in the HEK293 data set as well as improved reliability.
Affiliation(s)
- Seungjin Na
- Department of Computer Science, Hanyang University, Seoul 04763, South Korea
- Jihyung Kim
- Department of Computer Science, Hanyang University, Seoul 04763, South Korea
- Eunok Paek
- Department of Computer Science, Hanyang University, Seoul 04763, South Korea
|
18
|
Weldatsadik R, Datta N, Kolmeder C, Vuopio J, Kere J, Wilkman S, Flatt J, Vuento R, Haapasalo K, Keskitalo S, Varjosalo M, Jokiranta T. Pool-seq driven proteogenomic database for Group G Streptococcus. J Proteomics 2019; 201:84-92. [DOI: 10.1016/j.jprot.2019.04.015] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/24/2019] [Revised: 03/29/2019] [Accepted: 04/17/2019] [Indexed: 02/07/2023]
|
19
|
Schiza C, Korbakis D, Jarvi K, Diamandis EP, Drabovich AP. Identification of TEX101-associated Proteins Through Proteomic Measurement of Human Spermatozoa Homozygous for the Missense Variant rs35033974. Mol Cell Proteomics 2019; 18:338-351. [PMID: 30429210 PMCID: PMC6356071 DOI: 10.1074/mcp.ra118.001170] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 10/25/2018] [Indexed: 01/19/2023] Open
Abstract
TEX101 is a germ-cell-specific protein and a validated biomarker of male infertility. Mouse TEX101 was found essential for male fertility and was suggested to function as a cell surface chaperone involved in maturation of proteins required for sperm migration and sperm-oocyte interaction. However, the precise functional role of human TEX101 is not known and cannot be studied in vitro due to the lack of human germ cell lines. Here, we genotyped 386 men for a common missense variant rs35033974 of TEX101 and identified 52 heterozygous and 4 homozygous men. We then discovered by targeted proteomics that the variant allele rs35033974 was associated with the near-complete degradation (>97%) of the corresponding G99V TEX101 form and suggested that spermatozoa of homozygous men could serve as a knockdown model to study TEX101 function in humans. Differential proteomic profiling with label-free quantification measured 8,046 proteins in spermatozoa of eight men and identified eight cell-surface and nine secreted testis-specific proteins significantly down-regulated in four patients homozygous for rs35033974. Substantially reduced levels of testis-specific cell-surface proteins potentially involved in sperm migration and sperm-oocyte interaction (including LY6K and ADAM29) were confirmed by targeted proteomics and Western blotting assays. Because recent population-scale genomic data revealed homozygous fathers with biological children, rs35033974 is not a monogenic factor of male infertility in humans. However, median TEX101 levels in seminal plasma were found fivefold lower (p = 0.0005) in heterozygous than in wild-type men of European ancestry. We conclude that spermatozoa of rs35033974 homozygous men have substantially reduced levels of TEX101 and could be used as a model to elucidate the precise TEX101 function, which will advance biology of human reproduction.
Affiliation(s)
- Christina Schiza
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada; Department of Pathology and Laboratory Medicine
- Dimitrios Korbakis
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada; Lunenfeld-Tanenbaum Research Institute
- Keith Jarvi
- Lunenfeld-Tanenbaum Research Institute; Department of Surgery, Division of Urology, Mount Sinai Hospital, Toronto, Canada
- Eleftherios P Diamandis
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada; Department of Pathology and Laboratory Medicine; Lunenfeld-Tanenbaum Research Institute; Department of Clinical Biochemistry, University Health Network, Toronto, Canada.
- Andrei P Drabovich
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada; Department of Pathology and Laboratory Medicine; Department of Clinical Biochemistry, University Health Network, Toronto, Canada.
|
20
|
Wen B, Wang X, Zhang B. PepQuery enables fast, accurate, and convenient proteomic validation of novel genomic alterations. Genome Res 2019; 29:485-493. [PMID: 30610011 PMCID: PMC6396417 DOI: 10.1101/gr.235028.118] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Received: 01/24/2018] [Accepted: 12/28/2018] [Indexed: 12/20/2022]
Abstract
Massively parallel or second-generation sequencing-based genomic studies continuously identify new genomic alterations that may lead to novel protein sequences, which are attractive candidates for disease biomarkers and therapeutic targets after proteomic validation. Integrative proteogenomic methods have been developed to use mass spectrometry (MS)-based proteomics data for such validation. These methods replace the reference sequence database in proteomic database searching with a customized protein database that incorporates sample- or disease-specific sequences derived from DNA or RNA sequencing, thus enabling the identification of novel protein sequences. Although useful, this spectrum-centric approach requires a full evaluation of all possible spectrum-peptide pairs, which is time-consuming, error-prone, and difficult to apply. Here, we present PepQuery, a peptide-centric approach that focuses on only novel DNA or protein sequences of interest. PepQuery allows quick and easy proteomic validation of genomic alterations without customized database construction. We demonstrated the sensitivity and specificity of the approach in validating completely novel proteins, novel splice junctions, and single amino acid variants using simulations and experimental data. Notably, enabling unrestricted modification searching in PepQuery reduced false positives by up to 95%. We implemented PepQuery as both web-based and stand-alone applications. The web version provides direct access to more than half a billion MS/MS spectra from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and other cancer proteomic studies. The stand-alone version supports batch analysis and user-provided MS/MS data. PepQuery will increase the usage of proteogenomics beyond the proteomics community and will broaden the application of proteogenomics in personalized medicine.
Affiliation(s)
- Bo Wen
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
- Xiaojing Wang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
|
21
|
Tan Z, Yi X, Carruthers NJ, Stemmer PM, Lubman DM. Single Amino Acid Variant Discovery in Small Numbers of Cells. J Proteome Res 2018; 18:417-425. [PMID: 30404448 DOI: 10.1021/acs.jproteome.8b00694] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/17/2022]
Abstract
We have performed deep proteomic profiling down to as few as 9 Panc-1 cells using sample fractionation, TMT multiplexing, and a carrier/reference strategy. Off-line fractionation of the TMT-labeled sample pooled with TMT-labeled carrier Panc-1 whole cell proteome was achieved using alkaline reversed phase spin columns. The fractionation in conjunction with the carrier/reference (C/R) proteome allowed us to detect 47 414 unique peptides derived from 6261 proteins, which provided sufficient coverage to search for single amino acid variants (SAAVs) related to cancer. This high sample coverage is essential in order to detect a significant number of SAAVs. To distinguish genuine SAAVs from false SAAVs, we used the SAVControl pipeline and found a total of 79 SAAVs from the 9-cell Panc-1 sample and 174 SAAVs from the 5000-cell Panc-1 C/R proteome. The SAAVs, sorted into high-confidence and low-confidence groups, were checked manually. All the high-confidence SAAVs were found to be genuine, while half of the low-confidence SAAVs were found to be false SAAVs mainly related to PTMs. We identified several cancer-related SAAVs including KRAS, which is an important oncoprotein in pancreatic cancer. In addition, the enhanced coverage available in these experiments allowed us to detect sites involved in both loss and gain of glycosylation.
Affiliation(s)
- Zhijing Tan
- Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
- Xinpei Yi
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Nicholas J Carruthers
- Institute of Environmental Health Sciences, Wayne State University, Detroit, Michigan 48202, United States
- Paul M Stemmer
- Institute of Environmental Health Sciences, Wayne State University, Detroit, Michigan 48202, United States
- David M Lubman
- Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
|
22
|
Bubis JA, Levitsky LI, Ivanov MV, Gorshkov MV. Validation of Peptide Identification Results in Proteomics Using Amino Acid Counting. Proteomics 2018; 18:e1800117. [PMID: 30307114 DOI: 10.1002/pmic.201800117] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 03/20/2018] [Revised: 09/12/2018] [Indexed: 01/11/2023]
Abstract
The efficiency of proteome analysis depends strongly on the configuration parameters of the search engine. One of the murkiest and least trivial of these is the list of amino acid modifications included in the search. Here, an approach called AA_stat is presented for uncovering unexpected modifications of amino acid residues in protein sequences, as well as possible artifacts of data acquisition or processing, in the results of proteome analyses. The approach is based on comparing the amino acid frequencies of different mass shifts observed using the recently introduced open search method. In this work, the proposed approach is applied to publicly available proteomic data, and its feasibility for discovering unaccounted modifications or possible pitfalls of the identification workflow is demonstrated. The algorithm is implemented in Python as an open-source command-line tool available at https://bitbucket.org/J_Bale/aa_stat/.
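The amino acid counting idea can be illustrated with a short Python sketch. It is a simplification written for this listing, not the AA_stat code, and its normalization is an assumption: residue frequencies among peptides observed at a given open-search mass shift are divided by the frequencies among unshifted (reference) peptides, so that a residue carrying the modification stands out with a ratio well above 1. The peptide lists are hypothetical.

```python
from collections import Counter


def aa_frequencies(peptides):
    """Relative frequency of each amino acid across a list of peptide sequences."""
    counts = Counter("".join(peptides))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def enrichment(shifted_peptides, reference_peptides):
    """Frequency ratio (shifted / reference) for residues present in both sets."""
    ref = aa_frequencies(reference_peptides)
    obs = aa_frequencies(shifted_peptides)
    return {aa: obs[aa] / ref[aa] for aa in obs if aa in ref}


# Hypothetical peptides: unshifted identifications versus those in a +15.99 Da bin.
reference = ["ACDK", "AMCK"]
plus_16 = ["MMAK", "AMMK"]
ratios = enrichment(plus_16, reference)
most_enriched = max(ratios, key=ratios.get)
```

In this toy input methionine is four times more frequent in the shifted bin than in the reference, which is the kind of signal that would point to oxidation as the source of the mass shift.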
Affiliation(s)
- Julia A Bubis
- Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 119334 Moscow, Russia; Moscow Institute of Physics and Technology State University, 141700 Dolgoprudny, Russia
- Lev I Levitsky
- Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 119334 Moscow, Russia
- Mark V Ivanov
- Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 119334 Moscow, Russia
- Mikhail V Gorshkov
- Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 119334 Moscow, Russia
|
23
|
Robin T, Bairoch A, Müller M, Lisacek F, Lane L. Large-Scale Reanalysis of Publicly Available HeLa Cell Proteomics Data in the Context of the Human Proteome Project. J Proteome Res 2018; 17:4160-4170. [DOI: 10.1021/acs.jproteome.8b00392] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Indexed: 02/06/2023]
Affiliation(s)
- Thibault Robin
- CALIPHO Group, SIB Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland
- Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland
- Computer Science Department, University of Geneva, CH-1211 Geneva, Switzerland
- Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, CH-1211 Geneva, Switzerland
- Amos Bairoch
- CALIPHO Group, SIB Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland
- Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, CH-1211 Geneva, Switzerland
- Markus Müller
- Vital-IT Group, SIB Swiss Institute of Bioinformatics, Genopode Building, Quartier Sorge, CH-1015 Lausanne, Switzerland
- Frédérique Lisacek
- Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland
- Computer Science Department, University of Geneva, CH-1211 Geneva, Switzerland
- Section of Biology, University of Geneva, CH-1211 Geneva, Switzerland
- Lydie Lane
- CALIPHO Group, SIB Swiss Institute of Bioinformatics, CMU, Rue Michel-Servet 1, CH-1211 Geneva, Switzerland
- Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, CH-1211 Geneva, Switzerland
|
24
|
Yi X, Wang B, An Z, Gong F, Li J, Fu Y. Quality control of single amino acid variations detected by tandem mass spectrometry. J Proteomics 2018; 187:144-151. [PMID: 30012419 DOI: 10.1016/j.jprot.2018.07.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/07/2018] [Revised: 06/26/2018] [Accepted: 07/02/2018] [Indexed: 02/04/2023]
Abstract
Study of single amino acid variations (SAVs) of proteins, resulting from single nucleotide polymorphisms, is of great importance for understanding the relationships between genotype and phenotype. In mass spectrometry based shotgun proteomics, identification of peptides with SAVs often suffers from high error rates on the variant sites detected. These site errors are due to multiple reasons and can be confirmed by manual inspection or genomic sequencing. Here, we present a software tool, named SAVControl, for site-level quality control of variant peptide identifications. It mainly includes strict false discovery rate control of variant peptide identifications and variant site verification by unrestrictive mass shift relocalization. SAVControl was validated on three colorectal adenocarcinoma cell line datasets with genomic sequencing evidence and tested on a colorectal cancer dataset from The Cancer Genome Atlas. The results show that SAVControl can effectively remove false detections of SAVs. SIGNIFICANCE: Protein sequence variations caused by single nucleotide polymorphisms (SNPs) are single amino acid variations (SAVs). The investigation of SAVs may provide a chance for understanding the relationships between genotype and phenotype. Mass spectrometry (MS) based proteomics provides a large-scale way to detect SAVs. However, using the current analysis strategy to detect SAVs may lead to a high rate of false positives. The SAVControl we present here is a computational workflow and software tool for site-level quality control of SAVs detected by MS. It assesses the confidence of detected variant sites by relocating the mass shift responsible for an SAV to search for alternative interpretations. In addition, it uses a strict false discovery rate control method for variant peptide identifications. The advantages of SAVControl were demonstrated on three colorectal adenocarcinoma cell line datasets and a colorectal cancer dataset. We believe that SAVControl will be a powerful tool for computational proteomics and proteogenomics.
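The reason a mass shift can be relocated is that some substitutions are isobaric with common modifications. The Python sketch below illustrates only that first step, under stated assumptions: it computes the mass shift implied by a substitution and lists positions in the reference peptide where a known modification of the same mass could sit instead. It does not re-score spectra and is not the SAVControl implementation; the function name, the abbreviated mass tables and the 0.01 Da tolerance are choices made for illustration.

```python
# Monoisotopic residue masses (Da) for the few residues used in this example.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "N": 114.04293, "D": 115.02694, "E": 129.04259,
}

# Deliberately small modification table: name -> (mass shift in Da, target residues).
MODIFICATIONS = {
    "Oxidation": (15.99491, "M"),
    "Methyl": (14.01565, "KR"),
    "Deamidated": (0.98402, "NQ"),
}


def alternative_explanations(reference_peptide, site, new_residue, tol=0.01):
    """List (modification, position) pairs that could mimic the substitution's mass shift."""
    old_residue = reference_peptide[site]
    shift = RESIDUE_MASS[new_residue] - RESIDUE_MASS[old_residue]
    hits = []
    for name, (mod_mass, targets) in MODIFICATIONS.items():
        if abs(shift - mod_mass) <= tol:
            for pos, aa in enumerate(reference_peptide):
                if aa in targets:
                    hits.append((name, pos))
    return hits


# A->S at index 2 adds about 15.995 Da, the same as oxidation of the methionine at index 0.
ambiguous = alternative_explanations("MGAK", 2, "S")
# Without a methionine in the peptide, this table offers no competing explanation.
unambiguous = alternative_explanations("GGAK", 2, "S")
```

A variant site with a non-empty list would be a candidate for closer inspection, since the spectrum may be explained equally well by the unmodified reference sequence carrying a modification.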
Affiliation(s)
- Xinpei Yi
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Bo Wang
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Zhiwu An
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Fuzhou Gong
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.
- Jing Li
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China.
- Yan Fu
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.
|
25
|
Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat Commun 2018; 9:903. [PMID: 29500430 PMCID: PMC5834625 DOI: 10.1038/s41467-018-03311-y] [Citation(s) in RCA: 86] [Impact Index Per Article: 14.3] [Received: 10/08/2017] [Accepted: 02/02/2018] [Indexed: 01/23/2023] Open
Abstract
Proteogenomics enables the discovery of novel peptides (from unannotated genomic protein-coding loci) and single amino acid variant peptides (derived from single-nucleotide polymorphisms and mutations). Increasing the reliability of these identifications is crucial to ensure their usefulness for genome annotation and potential application as neoantigens in cancer immunotherapy. Here we present the integrated proteogenomics analysis workflow (IPAW), which combines peptide discovery, curation, and validation. IPAW includes the SpectrumAI tool for automated inspection of MS/MS spectra, eliminating false identifications of single-residue substitution peptides. We employ IPAW to analyze two proteomics data sets acquired from A431 cells and five normal human tissues using extended (pH range, 3–10) high-resolution isoelectric focusing (HiRIEF) pre-fractionation and TMT-based peptide quantitation. The IPAW results provide evidence for the translation of pseudogenes, lncRNAs, short ORFs, alternative ORFs, N-terminal extensions, and intronic sequences. Moreover, our quantitative analysis indicates that protein production from certain pseudogenes and lncRNAs is tissue specific. Proteogenomics enables the discovery of protein coding regions and disease-relevant mutations but their verification remains challenging. Here, the authors combine peptide discovery, curation and validation in an integrated proteogenomics workflow, robustly identifying unknown coding regions and mutations.
|
26
|
Xiao J, Tanca A, Jia B, Yang R, Wang B, Zhang Y, Li J. Metagenomic Taxonomy-Guided Database-Searching Strategy for Improving Metaproteomic Analysis. J Proteome Res 2018; 17:1596-1605. [DOI: 10.1021/acs.jproteome.7b00894] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 01/02/2023]
Affiliation(s)
- Jinqiu Xiao
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
| | - Alessandro Tanca
- Porto Conte Ricerche, Science and Technology Park of Sardinia, Tramariglio, Alghero, Italy
| | - Ben Jia
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
| | - Runqing Yang
- College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, People’s Republic of China
| | - Bo Wang
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
| | - Yu Zhang
- Institute of Oceanography, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
| | - Jing Li
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
| |
|
27
|
Dimitrakopoulos L, Prassas I, Diamandis EP, Charames GS. Onco-proteogenomics: Multi-omics level data integration for accurate phenotype prediction. Crit Rev Clin Lab Sci 2017; 54:414-432. [DOI: 10.1080/10408363.2017.1384446] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 01/10/2023]
Affiliation(s)
- Lampros Dimitrakopoulos
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada
- Department of Pathology and Laboratory Medicine, Mount Sinai Hospital, Joseph and Wolf Lebovic Health Complex, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
| | - Ioannis Prassas
- Department of Pathology and Laboratory Medicine, Mount Sinai Hospital, Joseph and Wolf Lebovic Health Complex, Toronto, ON, Canada
| | - Eleftherios P. Diamandis
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada
- Department of Pathology and Laboratory Medicine, Mount Sinai Hospital, Joseph and Wolf Lebovic Health Complex, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
- Department of Clinical Biochemistry, University Health Network, Toronto, ON, Canada
| | - George S. Charames
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, ON, Canada
- Department of Pathology and Laboratory Medicine, Mount Sinai Hospital, Joseph and Wolf Lebovic Health Complex, Toronto, ON, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada
| |
|
28
|
Choong WK, Lih TSM, Chen YJ, Sung TY. Decoding the Effect of Isobaric Substitutions on Identifying Missing Proteins and Variant Peptides in Human Proteome. J Proteome Res 2017; 16:4415-4424. [DOI: 10.1021/acs.jproteome.7b00342] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
Affiliation(s)
- Wai-Kok Choong
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
| | - Tung-Shing Mamie Lih
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
| | - Yu-Ju Chen
- Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
| | - Ting-Yi Sung
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
| |
|
29
|
Hernandez-Valladares M, Vaudel M, Selheim F, Berven F, Bruserud Ø. Proteogenomics approaches for studying cancer biology and their potential in the identification of acute myeloid leukemia biomarkers. Expert Rev Proteomics 2017; 14:649-663. [DOI: 10.1080/14789450.2017.1352474] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/02/2023]
Affiliation(s)
- Maria Hernandez-Valladares
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Marc Vaudel
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
| | - Frode Selheim
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Frode Berven
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Øystein Bruserud
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| |
|
30
|
Detecting protein variants by mass spectrometry: a comprehensive study in cancer cell-lines. Genome Med 2017; 9:62. [PMID: 28716134 PMCID: PMC5514513 DOI: 10.1186/s13073-017-0454-9] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 03/03/2017] [Accepted: 06/22/2017] [Indexed: 02/07/2023] Open
Abstract
Background: Onco-proteogenomics aims to understand how changes in a cancer’s genome influence its proteome. One challenge in integrating these molecular data is the identification of aberrant protein products from mass-spectrometry (MS) datasets, as traditional proteomic analyses only identify proteins from a reference sequence database. Methods: We established proteomic workflows to detect peptide variants within MS datasets. We used a combination of publicly available population variants (dbSNP and UniProt) and somatic variations in cancer (COSMIC) along with sample-specific genomic and transcriptomic data to examine proteome variation within and across 59 cancer cell-lines. Results: We developed a set of recommendations for the detection of variants using three search algorithms, a split target-decoy approach for FDR estimation, and multiple post-search filters. We examined 7.3 million unique variant tryptic peptides not found within any reference proteome and identified 4771 mutations corresponding to somatic and germline deviations from reference proteomes in 2200 genes among the NCI60 cell-line proteomes. Conclusions: We discuss in detail the technical and computational challenges in identifying variant peptides by MS and show that uncovering these variants allows the identification of druggable mutations within important cancer genes.
|
31
|
Li H, Park J, Kim H, Hwang KB, Paek E. Systematic Comparison of False-Discovery-Rate-Controlling Strategies for Proteogenomic Search Using Spike-in Experiments. J Proteome Res 2017; 16:2231-2239. [PMID: 28452485 DOI: 10.1021/acs.jproteome.7b00033] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 01/28/2023]
Abstract
Proteogenomic searches are useful for novel peptide identification from tandem mass spectra. Usually, separate and multistage approaches are adopted to accurately control the false discovery rate (FDR) for proteogenomic search. Their performance on novel peptide identification has not been thoroughly evaluated, however, mainly due to the difficulty in confirming the existence of identified novel peptides. We simulated a proteogenomic search using a controlled, spike-in proteomic data set. After confirming that the results of the simulated proteogenomic search were similar to those of a real proteogenomic search using a human cell line data set, we evaluated the performance of six FDR control methods on novel peptide identification: global, separate, and multistage FDR estimation, each coupled to a target-decoy search and to a mixture model-based method. The multistage approach showed the highest accuracy for FDR estimation. However, global and separate FDR estimation with the mixture model-based method showed higher sensitivities than the others at the same true FDR. Furthermore, the mixture model-based method performed equally well when applied without or with a reduced set of decoy sequences. Considering different prior probabilities for novel and known protein identification, we recommend using mixture model-based methods with separate FDR estimation for sensitive and reliable identification of novel peptides from proteogenomic searches.
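To make the distinction between global and separate FDR estimation in this abstract concrete, here is a minimal sketch of a target-decoy estimate. It assumes the simple decoys/targets estimator and a score where higher is better; the `PSM` record, the function names, and the toy data are hypothetical and are not taken from the paper. Multistage and mixture model-based estimation are not covered.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    score: float     # search-engine score; higher is assumed to be better
    is_decoy: bool   # True if the match is to a decoy sequence
    is_novel: bool   # True if the match belongs to the novel/variant class


def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoys / targets among PSMs scoring >= threshold."""
    accepted = [p for p in psms if p.score >= threshold]
    decoys = sum(p.is_decoy for p in accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


def global_fdr(psms, threshold):
    """One estimate over all PSMs, regardless of class."""
    return fdr_at_threshold(psms, threshold)


def separate_fdr(psms, threshold):
    """One estimate per class (known vs. novel)."""
    known = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return {
        "known": fdr_at_threshold(known, threshold),
        "novel": fdr_at_threshold(novel, threshold),
    }


# Toy data: 16 known targets, 4 novel targets, 2 decoys in the novel class.
psms = (
    [PSM(10.0, False, False)] * 16
    + [PSM(10.0, False, True)] * 4
    + [PSM(10.0, True, True)] * 2
)

print(global_fdr(psms, 0.0))
print(separate_fdr(psms, 0.0))
```

In this toy set the pooled estimate looks low while the novel class on its own is far worse, which illustrates why class-specific estimation is a common choice for novel peptides.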
Affiliation(s)
- Honglan Li
- School of Computer Science and Engineering, Soongsil University, Seoul 06978, Republic of Korea
| | - Jonghun Park
- Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
| | - Hyunwoo Kim
- Scientific Data Research Center, Korea Institute of Science and Technology Information, Daejeon 34141, Republic of Korea
| | - Kyu-Baek Hwang
- School of Computer Science and Engineering, Soongsil University, Seoul 06978, Republic of Korea
| | - Eunok Paek
- Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
| |
|
32
|
Zhang M, Wang B, Xu J, Wang X, Xie L, Zhang B, Li Y, Li J. CanProVar 2.0: An Updated Database of Human Cancer Proteome Variation. J Proteome Res 2017; 16:421-432. [PMID: 27977206 PMCID: PMC5515284 DOI: 10.1021/acs.jproteome.6b00505] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Indexed: 01/18/2023]
Abstract
Identification and annotation of the mutations involved in oncogenesis and tumor progression are crucial for both cancer biology and clinical applications. Previously, we developed a public resource, CanProVar, a human cancer proteome variation database for storing and querying single amino acid alterations in human cancers. Since the publication of CanProVar, extensive cancer genomics efforts have revealed the enormous genomic complexity of various types of human cancers. Thus, there is an overwhelming need for comprehensive annotation of the genomic alterations at the protein level and for making such knowledge easily accessible. Here, we describe CanProVar 2.0, a significantly expanded version of CanProVar, in which the number of cancer-related variations and noncancer-specific variations was increased by about 10-fold compared to the previous version. To facilitate the interpretation of the variations, we added to the database functional data on the potential impact of the cancer-related variations on 3D protein interaction and on the differential expression of the variant-bearing proteins between cancer and normal samples. The web interface allows for flexible queries based on gene or protein IDs, cancer types, chromosome locations, or pathways. An integrated protein sequence database containing variations, which can be used directly for proteomics database searching, can be downloaded.
Affiliation(s)
- Menghuan Zhang
- Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Center for Bioinformation Technology, Shanghai Academy of Science and Technology, Shanghai, 201203, China
| | - Bo Wang
- Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Jia Xu
- Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Xiaojing Wang
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Suite NB100, Houston, TX 77030
- Lester & Sue Smith Breast Center, Baylor College of Medicine, One Baylor Plaza, Suite NB100, Houston, TX 77030
| | - Lu Xie
- Shanghai Center for Bioinformation Technology, Shanghai Academy of Science and Technology, Shanghai, 201203, China
| | - Bing Zhang
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Suite NB100, Houston, TX 77030
- Lester & Sue Smith Breast Center, Baylor College of Medicine, One Baylor Plaza, Suite NB100, Houston, TX 77030
| | - Yixue Li
- Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Center for Bioinformation Technology, Shanghai Academy of Science and Technology, Shanghai, 201203, China
- Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Jing Li
- Department of Bioinformatics & Biostatistics, School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Center for Bioinformation Technology, Shanghai Academy of Science and Technology, Shanghai, 201203, China
| |
|
33
|
Tan Z, Nie S, McDermott SP, Wicha MS, Lubman DM. Single Amino Acid Variant Profiles of Subpopulations in the MCF-7 Breast Cancer Cell Line. J Proteome Res 2017; 16:842-851. [PMID: 28076950 DOI: 10.1021/acs.jproteome.6b00824] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 01/07/2023]
Abstract
Cancers are initiated and developed from a small population of stem-like cells termed cancer stem cells (CSCs). There is heterogeneity among this CSC population that leads to multiple subpopulations with their own distinct biological features and protein expression. The protein expression and function may be impacted by amino acid variants that can occur largely due to single nucleotide changes. We have thus performed proteomic analysis of breast CSC subpopulations by mass spectrometry to study the presence of single amino acid variants (SAAVs) and their relation to breast cancer. We have used CSC markers to isolate pure breast CSC subpopulation fractions (ALDH+ and CD44+/CD24- cell populations) and the mature luminal cells (CD49f-EpCAM+) from the MCF-7 breast cancer cell line. By searching the Swiss-CanSAAVs database, 374 unique SAAVs were identified in total, where 27 are cancer-related SAAVs. 135 unique SAAVs were found in the CSC population compared with the mature luminal cells. The distribution of SAAVs detected in MCF-7 cells was compared with those predicted from the Swiss-CanSAAVs database, where we found distinct differences in the numbers of SAAVs detected relative to that expected from the Swiss-CanSAAVs database for several of the amino acids.
Affiliation(s)
- Zhijing Tan
- Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Song Nie
- Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
- Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
| | - Sean P McDermott
- Department of Internal Medicine, Division of Hematology/Oncology, University of Michigan, Ann Arbor, Michigan 48109, United States
- Comprehensive Cancer Center, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Max S Wicha
- Department of Internal Medicine, Division of Hematology/Oncology, University of Michigan, Ann Arbor, Michigan 48109, United States
- Comprehensive Cancer Center, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - David M Lubman
- Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
34
|
Cao R, Shi Y, Chen S, Ma Y, Chen J, Yang J, Chen G, Shi T. dbSAP: single amino-acid polymorphism database for protein variation detection. Nucleic Acids Res 2017; 45:D827-D832. [PMID: 27903894 PMCID: PMC5210569 DOI: 10.1093/nar/gkw1096] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 08/15/2016] [Revised: 10/25/2016] [Accepted: 11/01/2016] [Indexed: 12/13/2022] Open
Abstract
Millions of human single nucleotide polymorphisms (SNPs) or mutations have been identified so far, and these variants could be strongly correlated with phenotypic variations of traits/diseases. Among these variants, non-synonymous ones can result in amino-acid changes that are called single amino-acid polymorphisms (SAPs). Although some studies have tried to investigate SAPs, only a small fraction of SAPs have been identified due to inadequately inferred protein variation databases and the low coverage of mass spectrometry (MS) experiments. Here, we present the dbSAP database for conveniently accessing the comprehensive information and relationships of spectra, peptides and proteins of SAPs, as well as related genes, pathways, diseases and drug targets. In order to fully explore human SAPs, we built a customized protein database that contained comprehensive variant proteins by integrating and annotating the human SNPs and mutations from eight distinct databases (UniProt, Protein Mutation Database, HPMD, MSIPI, MS-CanProVar, dbSNP, Ensembl and COSMIC). After a series of quality controls, a total of 16 854 SAP peptides involving 439 537 spectra were identified from large-scale MS datasets from various human tissues and cell lines. dbSAP is freely available at http://www.megabionet.org/dbSAP/index.html.
Affiliation(s)
- Ruifang Cao
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| | - Yan Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| | - Shuangguan Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| | - Yimin Ma
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| | - Jiajun Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| | - Juan Yang
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| | - Geng Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| | - Tieliu Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
| |
|
35
|
Rahman SMJ, Ji X, Zimmerman LJ, Li M, Harris BK, Hoeksema MD, Trenary IA, Zou Y, Qian J, Slebos RJ, Beane J, Spira A, Shyr Y, Eisenberg R, Liebler DC, Young JD, Massion PP. The airway epithelium undergoes metabolic reprogramming in individuals at high risk for lung cancer. JCI Insight 2016; 1:e88814. [PMID: 27882349 DOI: 10.1172/jci.insight.88814] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 01/08/2023] Open
Abstract
The molecular determinants of lung cancer risk remain largely unknown. Airway epithelial cells are prone to assault by risk factors and are considered to be the primary cell type involved in the field of cancerization. To investigate risk-associated changes in the bronchial epithelium proteome that may offer new insights into the molecular pathogenesis of lung cancer, proteins were identified in the airway epithelial cells of bronchial brushing specimens from risk-stratified individuals by shotgun proteomics. Differential expression of selected proteins was validated by parallel reaction monitoring mass spectrometry in an independent set of individual bronchial brushings. We identified 2,869 proteins, of which 312 proteins demonstrated a trend in expression. Pathway analysis revealed enrichment of carbohydrate metabolic enzymes in high-risk individuals. Glucose consumption and lactate production were increased in human bronchial epithelial BEAS2B cells treated with cigarette smoke condensate for 7 months. Increased lipid biosynthetic capacity and net reductive carboxylation were revealed by metabolic flux analyses of [U-13C5] glutamine in this in vitro model, suggesting profound metabolic reprogramming in the airway epithelium of high-risk individuals. These results provide a rationale for the development of potentially new chemopreventive strategies and selection of patients for surveillance programs.
Affiliation(s)
- S M Jamshedur Rahman
- Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Cancer Early Detection and Prevention Initiative, Vanderbilt Ingram Cancer Center
| | - Xiangming Ji
- Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Cancer Early Detection and Prevention Initiative, Vanderbilt Ingram Cancer Center
| | - Ming Li
- Department of Biostatistics, and
| | - Bradford K Harris
- Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Cancer Early Detection and Prevention Initiative, Vanderbilt Ingram Cancer Center
| | - Megan D Hoeksema
- Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Cancer Early Detection and Prevention Initiative, Vanderbilt Ingram Cancer Center
| | - Irina A Trenary
- Department of Chemical and Biomolecular Engineering, Vanderbilt University Medical Center, Nashville, Tennessee, USA
| | - Yong Zou
- Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Cancer Early Detection and Prevention Initiative, Vanderbilt Ingram Cancer Center
| | - Jun Qian
- Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Cancer Early Detection and Prevention Initiative, Vanderbilt Ingram Cancer Center
| | - Jennifer Beane
- Pulmonary Center and Section of Computational Biomedicine, Department of Medicine, Boston University Medical Center, Boston, Massachusetts, USA
| | - Avrum Spira
- Pulmonary Center and Section of Computational Biomedicine, Department of Medicine, Boston University Medical Center, Boston, Massachusetts, USA
| | - Yu Shyr
- Department of Biostatistics, and
| | - Jamey D Young
- Department of Chemical and Biomolecular Engineering, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Molecular Physiology and Biophysics, and
| | - Pierre P Massion
- Division of Allergy, Pulmonary and Critical Care Medicine, Department of Medicine, Cancer Early Detection and Prevention Initiative, Vanderbilt Ingram Cancer Center
- Department of Cancer Biology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Veterans Affairs, Tennessee Valley Healthcare System, Nashville, Tennessee, USA
| |
|
36
|
On the privacy risks of sharing clinical proteomics data. AMIA Jt Summits Transl Sci Proc 2016; 2016:122-31. [PMID: 27595046 PMCID: PMC5009298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
Abstract
Although the privacy issues in human genomic studies are well known, the privacy risks in clinical proteomic data have not been thoroughly studied. As a proof of concept, we report a comprehensive analysis of the privacy risks in clinical proteomic data. It shows that a small number of peptides carrying the minor alleles (referred to as the minor allelic peptides) at non-synonymous single nucleotide polymorphism (nsSNP) sites can be identified in typical clinical proteomic datasets acquired from the blood/serum samples of individual patients, from which the patient can be identified with high confidence. Our results suggest the presence of significant privacy risks in raw clinical proteomic data. However, these risks can be mitigated by a straightforward pre-processing step that removes from the raw data the very small fraction (0.1%, 7.14 out of 7,504 spectra on average) of MS/MS spectra identified as minor allelic peptides, which has little or no impact on the subsequent analysis (and re-use) of these datasets.
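The mitigation step described in this abstract amounts to dropping the flagged spectra before sharing. The following is only an illustrative sketch: the function name, the dictionary-based spectrum records, and the choice of 7 flagged spectra (a rounding of the reported average of 7.14) are assumptions, not the authors' implementation.

```python
def strip_minor_allelic(spectra, minor_allelic_ids):
    """Return only the spectra whose IDs are not flagged as minor allelic peptides."""
    flagged = set(minor_allelic_ids)
    return [s for s in spectra if s["id"] not in flagged]


# Toy dataset sized like the average reported in the abstract (7,504 spectra).
spectra = [{"id": i} for i in range(7504)]
flagged_ids = range(7)  # hypothetical: 7 spectra flagged as minor allelic peptides

kept = strip_minor_allelic(spectra, flagged_ids)
removed_fraction = (len(spectra) - len(kept)) / len(spectra)

# Percentage implied by the abstract's own figures (7.14 of 7,504 spectra).
reported_percent = round(7.14 / 7504 * 100, 1)

print(len(kept), f"{removed_fraction:.4%}", reported_percent)
```

The percentage computed from the abstract's figures rounds to the 0.1% it states, so the two numbers are consistent with each other.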
|
37
|
Wen B, Xu S, Zhou R, Zhang B, Wang X, Liu X, Xu X, Liu S. PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq. BMC Bioinformatics 2016; 17:244. [PMID: 27316337 PMCID: PMC4912784 DOI: 10.1186/s12859-016-1133-3] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 06/27/2015] [Accepted: 06/09/2016] [Indexed: 11/27/2022] Open
Abstract
Background: Peptide identification based upon mass spectrometry (MS) is generally achieved by comparison of the experimental mass spectra with the theoretically digested peptides derived from a reference protein database. This strategy cannot identify peptide and protein sequences that are absent from the reference database. A customized protein database based on RNA-Seq data has thus been proposed to assist with and improve the identification of novel peptides. Correspondingly, a comprehensive pipeline that provides an end-to-end solution for novel peptide detection with the customized protein database is necessary. Results: A pipeline with an R package, PGA, was developed that enables automated processing of tandem mass spectrometry (MS/MS) data acquired from different MS platforms and construction of customized protein databases based on RNA-Seq data with or without a reference genome guide. PGA can identify novel peptides and generate an HTML-based report with a visualized interface. On the basis of a published dataset, PGA identified 636 novel peptides, including 510 single amino acid polymorphism (SAP) peptides, 2 INDEL peptides, 49 splice junction peptides, and 75 novel transcript-derived peptides. The software is freely available from http://bioconductor.org/packages/PGA/, and the example reports are available at http://wenbostar.github.io/PGA/. Conclusions: The PGA pipeline, designed to be platform-independent and easy to use, was successfully developed and shown to be capable of identifying novel peptides by searching the customized protein database derived from RNA-Seq data.
Affiliation(s)
- Bo Wen
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Ruo Zhou
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Bing Zhang
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
| | - Xiaojing Wang
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
| | - Xin Liu
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Xun Xu
- BGI-Shenzhen, Shenzhen, 518083, China
| | - Siqi Liu
- BGI-Shenzhen, Shenzhen, 518083, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
| |
|
38
|
Li Y, Wang X, Cho JH, Shaw TI, Wu Z, Bai B, Wang H, Zhou S, Beach TG, Wu G, Zhang J, Peng J. JUMPg: An Integrative Proteogenomics Pipeline Identifying Unannotated Proteins in Human Brain and Cancer Cells. J Proteome Res 2016; 15:2309-20. [PMID: 27225868 DOI: 10.1021/acs.jproteome.6b00344] [Citation(s) in RCA: 62] [Impact Index Per Article: 7.8] [Indexed: 01/18/2023]
Abstract
Proteogenomics is an emerging approach to improve gene annotation and interpretation of proteomics data. Here we present JUMPg, an integrative proteogenomics pipeline including customized database construction, tag-based database search, peptide-spectrum match filtering, and data visualization. JUMPg creates multiple databases of DNA polymorphisms, mutations, splice junctions, and partial trypticity, as well as protein fragments translated from the whole transcriptome in all six frames upon RNA-seq de novo assembly. We use a multistage strategy to search these databases sequentially, in which the performance is optimized by re-searching only unmatched high-quality spectra and reusing amino acid tags generated by the JUMP search engine. The identified peptides/proteins are displayed with gene loci using the UCSC genome browser. The JUMPg program is then applied to process a label-free mass spectrometry data set of Alzheimer's disease postmortem brain, uncovering 496 new peptides of amino acid substitutions, alternative splicing, frame shift, and "non-coding gene" translation. The novel protein PNMA6BL, specifically expressed in the brain, is highlighted. We also tested JUMPg on a stable-isotope labeled data set of multiple myeloma cells, revealing 991 sample-specific peptides that include protein sequences in the immunoglobulin light chain variable region. Thus, the JUMPg program is an effective proteogenomics tool for multiomics data integration.
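The multistage strategy mentioned in this abstract (searching databases in sequence and passing only unmatched spectra to the next stage) can be sketched generically as follows. This is not JUMPg code: `multistage_search`, `toy_search`, and the dictionary-based databases are hypothetical stand-ins, and the spectrum-quality filter and tag reuse described in the abstract are omitted.

```python
def multistage_search(spectra, databases, search):
    """Search databases in order; only spectra unmatched so far reach the next stage.

    `databases` is an ordered list of (name, database) pairs and `search` is any
    callable returning a peptide for a match or None for no match.
    """
    matches = {}
    remaining = list(spectra)
    for name, database in databases:
        still_unmatched = []
        for spectrum in remaining:
            hit = search(spectrum, database)
            if hit is None:
                still_unmatched.append(spectrum)
            else:
                matches[spectrum] = (name, hit)
        remaining = still_unmatched
    return matches, remaining


def toy_search(spectrum, database):
    """Stand-in for a real search engine: a plain dictionary lookup."""
    return database.get(spectrum)


databases = [
    ("reference", {"s1": "PEPTIDEK"}),
    ("variants", {"s1": "PEPTLDEK", "s2": "VARIANTK"}),
]
matches, unmatched = multistage_search(["s1", "s2", "s3"], databases, toy_search)
print(matches)
print(unmatched)
```

Because `s1` is matched in the first stage, it never reaches the variant database, which is the property that keeps later, larger databases from competing for spectra that are already explained.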
Affiliation(s)
| | - Hong Wang
- Integrated Biomedical Sciences Program, University of Tennessee Health Science Center, 920 Madison Avenue, Memphis, Tennessee 38163, United States
| | - Thomas G Beach
- Banner Sun Health Research Institute, Sun City, Arizona 85351, United States
| |
|
39
|
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annu Rev Anal Chem (Palo Alto Calif) 2016; 9:521-45. [PMID: 27049631 PMCID: PMC4991544 DOI: 10.1146/annurev-anchem-071015-041722] [Citation(s) in RCA: 73] [Impact Index Per Article: 9.1] [Indexed: 05/09/2023]
Abstract
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
Affiliation(s)
- Gloria M Sheynkman
  - Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Michael R Shortreed
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Anthony J Cesnik
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Lloyd M Smith
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
  - Genome Center of Wisconsin, University of Wisconsin, Madison, Wisconsin 53706
|
40
|
Xiong Y, Guo Y, Xiao W, Cao Q, Li S, Qi X, Zhang Z, Wang Q, Shui W. An NGS-Independent Strategy for Proteome-Wide Identification of Single Amino Acid Polymorphisms by Mass Spectrometry. Anal Chem 2016; 88:2784-91. [PMID: 26810586 DOI: 10.1021/acs.analchem.5b04417] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
Detection of proteins containing single amino acid polymorphisms (SAPs) encoded by nonsynonymous SNPs (nsSNPs) can aid researchers in studying the functional significance of protein variants. Most proteogenomic approaches for large-scale SAPs mapping require construction of a sample-specific database containing protein variants predicted from the next-generation sequencing (NGS) data. Searching shotgun proteomic data sets against these NGS-derived databases allowed for identification of SAP peptides, thus validating the proteome-level sequence variation. Contrary to the conventional approaches, our study presents a novel strategy for proteome-wide SAP detection without relying on sample-specific NGS data. By searching a deep-coverage proteomic data set from an industrial thermotolerant yeast strain using our strategy, we identified 337 putative SAPs compared to the reference genome. Among the SAP peptides identified with stringent criteria, 85.2% of SAP sites were validated using whole-genome sequencing data obtained for this organism, which indicates high accuracy of SAP identification with our strategy. More interestingly, for certain SAP peptides that cannot be predicted by genomic sequencing, we used synthetic peptide standards to verify expression of peptide variants in the proteome. Our study has provided a unique tool for proteogenomics to enable proteome-wide direct SAP identification and capture nongenetic protein variants not linked to nsSNPs.
Affiliation(s)
- Yun Xiong
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Yufeng Guo
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Weidi Xiao
  - College of Life Sciences, Nankai University, Tianjin 300071, China
- Qichen Cao
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Shanshan Li
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Xianni Qi
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Zhidan Zhang
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Qinhong Wang
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Wenqing Shui
  - Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
|
41
|
Cesnik AJ, Shortreed MR, Sheynkman GM, Frey BL, Smith LM. Human Proteomic Variation Revealed by Combining RNA-Seq Proteogenomics and Global Post-Translational Modification (G-PTM) Search Strategy. J Proteome Res 2016; 15:800-8. [PMID: 26704769 PMCID: PMC4779408 DOI: 10.1021/acs.jproteome.5b00817] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Indexed: 12/15/2022]
Abstract
Mass-spectrometry-based proteomic analysis underestimates proteomic variation due to the absence of variant peptides and posttranslational modifications (PTMs) from standard protein databases. Each individual carries thousands of missense mutations that lead to single amino acid variants, but these are missed because they are absent from generic proteomic search databases. Myriad types of protein PTMs play essential roles in biological processes but remain undetected because of increased false discovery rates in variable modification searches. We address these two fundamental shortcomings of bottom-up proteomics with two recently developed software tools. The first consists of workflows in Galaxy that mine RNA sequencing data to generate sample-specific databases containing variant peptides and products of alternative splicing events. The second tool applies a new strategy that alters the variable modification approach to consider only curated PTMs at specific positions, thereby avoiding the combinatorial explosion that traditionally leads to high false discovery rates. Using RNA-sequencing-derived databases with this Global Post-Translational Modification (G-PTM) search strategy revealed hundreds of single amino acid variant peptides, tens of novel splice junction peptides, and several hundred posttranslationally modified peptides in each of ten human cell lines.
Affiliation(s)
- Anthony J Cesnik
  - Department of Chemistry, University of Wisconsin-Madison, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Michael R Shortreed
  - Department of Chemistry, University of Wisconsin-Madison, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Gloria M Sheynkman
  - Department of Chemistry, University of Wisconsin-Madison, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Brian L Frey
  - Department of Chemistry, University of Wisconsin-Madison, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Lloyd M Smith
  - Department of Chemistry, University of Wisconsin-Madison, 1101 University Avenue, Madison, Wisconsin 53706, United States
  - Genome Center of Wisconsin, University of Wisconsin-Madison, 425G Henry Mall, Madison, Wisconsin 53706, United States
|
42
|
Giese SH, Zickmann F, Renard BY. Detection of Unknown Amino Acid Substitutions Using Error-Tolerant Database Search. Methods Mol Biol 2016; 1362:247-264. [PMID: 26519182 DOI: 10.1007/978-1-4939-3106-4_16] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/05/2023]
Abstract
Recent studies have demonstrated that mass spectrometry-based variant detection is feasible. Typically, either genomic variant databases or transcript data are used to construct customized target databases for the identification of single amino acid variants (SAAVs) in mass spectrometry data. However, both approaches require additional data to perform the identification of SAAVs. Here, we discuss the application of an error-tolerant peptide search engine such as BICEPS for identifying variants exclusively based on standard UniProt databases. Thereby, unnecessary and redundant extensions of the search space are avoided. The workflow provides an unbiased view on the data; the search space is not limited to known variants and simultaneously does not require additional data. In a subsequent step, a second identification search is performed to verify the initially identified variant peptides and aggregate information on the protein level.
Affiliation(s)
- Sven H Giese
  - Research Group Bioinformatics (NG4), Robert Koch-Institute, Nordufer 20, 13353, Berlin, Germany
  - Department of Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, 13355, Berlin, Germany
  - Wellcome Trust Centre for Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JR, UK
- Franziska Zickmann
  - Research Group Bioinformatics (NG4), Robert Koch-Institute, Nordufer 20, 13353, Berlin, Germany
- Bernhard Y Renard
  - Research Group Bioinformatics (NG4), Robert Koch-Institute, Nordufer 20, 13353, Berlin, Germany
|
43
|
Abstract
Identification of mutant proteins in biological samples is one of the emerging areas of proteogenomics. Despite the fact that only a limited number of studies have been published up to now, it has the potential to recognize novel disease biomarkers that have unique structure and desirably high specificity. Such properties would identify mutant proteoforms related to diseases as optimal drug targets useful for future therapeutic strategies. While mass spectrometry has demonstrated its outstanding analytical power in proteomics, the most frequently applied bottom-up strategy is not suitable for the detection of mutant proteins if only databases with consensus sequences are searched. It is likely that many unassigned tandem mass spectra of tryptic peptides originate from single amino acid variants (SAAVs). To address this problem, a couple of protein databases have been constructed that include canonical and SAAV sequences, allowing for the observation of mutant proteoforms in mass spectral data for the first time. Since the resulting large search space may compromise the probability of identifications, a novel concept was proposed that included identification as well as verification strategies. Together with transcriptome based approaches, targeted proteomics appears to be a suitable method for the verification of initial identifications in databases and can also provide quantitative insights to expression profiles, which often reflect disease progression. Important applications in the field of mutant proteoform identification have already highlighted novel biomarkers in large-scale investigations.
|
44
|
Abstract
Every molecular player in the cast of biology’s central dogma is being sequenced and quantified with increasing ease and coverage. To bring the resulting genomic, transcriptomic, and proteomic data sets into coherence, tools must be developed that do not constrain data acquisition and analytics in any way but rather provide simple links across previously acquired data sets with minimal preprocessing and hassle. Here we present such a tool: PGx, which supports proteogenomic integration of mass spectrometry proteomics data with next-generation sequencing by mapping identified peptides onto their putative genomic coordinates.
Affiliation(s)
- Manor Askenazi
  - Biomedical Hosting LLC, 33 Lewis Avenue, Arlington, Massachusetts 02474, United States
- Kelly V Ruggles
  - NYU Langone Medical Center, 227 East 30th Street, New York, New York 10016, United States
- David Fenyő
  - NYU Langone Medical Center, 227 East 30th Street, New York, New York 10016, United States
|
45
|
Ruggles KV, Tang Z, Wang X, Grover H, Askenazi M, Teubl J, Cao S, McLellan MD, Clauser KR, Tabb DL, Mertins P, Slebos R, Erdmann-Gilmore P, Li S, Gunawardena HP, Xie L, Liu T, Zhou JY, Sun S, Hoadley KA, Perou CM, Chen X, Davies SR, Maher CA, Kinsinger CR, Rodland KD, Zhang H, Zhang Z, Ding L, Townsend RR, Rodriguez H, Chan D, Smith RD, Liebler DC, Carr SA, Payne S, Ellis MJ, Fenyő D. An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer. Mol Cell Proteomics 2015; 15:1060-71. [PMID: 26631509 DOI: 10.1074/mcp.m115.056226] [Citation(s) in RCA: 90] [Impact Index Per Article: 10.0] [Received: 10/16/2015] [Indexed: 11/06/2022] Open
Abstract
Improvements in mass spectrometry (MS)-based peptide sequencing provide a new opportunity to determine whether polymorphisms, mutations, and splice variants identified in cancer cells are translated. Herein, we apply a proteogenomic data integration tool (QUILTS) to illustrate protein variant discovery using whole genome, whole transcriptome, and global proteome datasets generated from a pair of luminal and basal-like breast-cancer-patient-derived xenografts (PDX). The sensitivity of proteogenomic analysis for single nucleotide variant (SNV) expression and novel splice junction (NSJ) detection was probed using multiple MS/MS sample process replicates, defined here as independent tandem MS experiments using identical sample material. Despite analysis of over 30 sample process replicates, only about 10% of SNVs (somatic and germline) detected by both DNA and RNA sequencing were observed as peptides. An even smaller proportion of peptides corresponding to NSJ observed by RNA sequencing were detected (<0.1%). Peptides mapping to DNA-detected SNVs without a detectable mRNA transcript were also observed, suggesting that transcriptome coverage was incomplete (∼80%). In contrast to germline variants, somatic variants were less likely to be detected at the peptide level in the basal-like tumor than in the luminal tumor, raising the possibility of differential translation or protein degradation effects. In conclusion, this large-scale proteogenomic integration allowed us to determine the degree to which mutations are translated and identify gaps in sequence coverage, thereby benchmarking current technology and progress toward whole cancer proteome and transcriptome analysis.
Affiliation(s)
- Kelly V Ruggles
  - New York University School of Medicine, New York, NY
- Zuojian Tang
  - New York University School of Medicine, New York, NY
- Xuya Wang
  - New York University School of Medicine, New York, NY
- Himanshu Grover
  - New York University School of Medicine, New York, NY
- Jennifer Teubl
  - New York University School of Medicine, New York, NY
- Song Cao
  - Washington University in St. Louis, St. Louis, MO
- David L Tabb
  - Vanderbilt University School of Medicine, Nashville, TN
- Robbert Slebos
  - Vanderbilt University School of Medicine, Nashville, TN
- Shunqiang Li
  - Washington University in St. Louis, St. Louis, MO
- Ling Xie
  - University of North Carolina School of Medicine, Chapel Hill, NC
- Tao Liu
  - Pacific Northwest National Laboratory, Richland, WA
- Charles M Perou
  - University of North Carolina School of Medicine, Chapel Hill, NC
- Xian Chen
  - University of North Carolina School of Medicine, Chapel Hill, NC
- Hui Zhang
  - Johns Hopkins University, Baltimore, MD
- Zhen Zhang
  - Johns Hopkins University, Baltimore, MD
- Li Ding
  - Washington University in St. Louis, St. Louis, MO
- Henry Rodriguez
  - Office of Cancer Clinical Proteomics Research, National Cancer Institute, Bethesda, MD
- Samuel Payne
  - Pacific Northwest National Laboratory, Richland, WA
- David Fenyő
  - New York University School of Medicine, New York, NY
|
46
|
Shukla HD, Mahmood J, Vujaskovic Z. Integrated proteo-genomic approach for early diagnosis and prognosis of cancer. Cancer Lett 2015; 369:28-36. [DOI: 10.1016/j.canlet.2015.08.003] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 07/09/2015] [Revised: 08/05/2015] [Accepted: 08/05/2015] [Indexed: 12/28/2022]
|
47
|
Choong WK, Chang HY, Chen CT, Tsai CF, Hsu WL, Chen YJ, Sung TY. Informatics View on the Challenges of Identifying Missing Proteins from Shotgun Proteomics. J Proteome Res 2015; 14:5396-407. [DOI: 10.1021/acs.jproteome.5b00482] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/08/2023]
Affiliation(s)
- Wai-Kok Choong
  - Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Hui-Yin Chang
  - Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
  - Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei 11529, Taiwan
  - Institute of Biomedical Informatics, National Yang-Ming University, Taipei 11221, Taiwan
- Ching-Tai Chen
  - Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Chia-Feng Tsai
  - Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
- Wen-Lian Hsu
  - Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Yu-Ju Chen
  - Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
- Ting-Yi Sung
  - Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
|
48
|
Stewart PA, Parapatics K, Welsh EA, Müller AC, Cao H, Fang B, Koomen JM, Eschrich SA, Bennett KL, Haura EB. A Pilot Proteogenomic Study with Data Integration Identifies MCT1 and GLUT1 as Prognostic Markers in Lung Adenocarcinoma. PLoS One 2015; 10:e0142162. [PMID: 26539827 PMCID: PMC4634858 DOI: 10.1371/journal.pone.0142162] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 07/29/2015] [Accepted: 10/19/2015] [Indexed: 11/19/2022] Open
Abstract
We performed a pilot proteogenomic study to compare lung adenocarcinoma to lung squamous cell carcinoma using quantitative proteomics (6-plex TMT) combined with a customized Affymetrix GeneChip. Using MaxQuant software, we identified 51,001 unique peptides that mapped to 7,241 unique proteins and from these identified 6,373 genes with matching protein expression for further analysis. We found a minor correlation between gene expression and protein expression; both datasets were able to independently recapitulate known differences between the adenocarcinoma and squamous cell carcinoma subtypes. We found 565 proteins and 629 genes to be differentially expressed between adenocarcinoma and squamous cell carcinoma, with 113 of these consistently differentially expressed at both the gene and protein levels. We then compared our results to published adenocarcinoma versus squamous cell carcinoma proteomic data that we also processed with MaxQuant. We selected two proteins consistently overexpressed in squamous cell carcinoma in all studies, MCT1 (SLC16A1) and GLUT1 (SLC2A1), for further investigation. We found differential expression of these same proteins at the gene level in our study as well as in other public gene expression datasets. These findings combined with survival analysis of public datasets suggest that MCT1 and GLUT1 may be potential prognostic markers in adenocarcinoma and druggable targets in squamous cell carcinoma. Data are available via ProteomeXchange with identifier PXD002622.
Affiliation(s)
- Paul A. Stewart
  - Department of Thoracic Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- Katja Parapatics
  - CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, 1090 Vienna, Austria
- Eric A. Welsh
  - Cancer Informatics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- André C. Müller
  - CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, 1090 Vienna, Austria
- Haoyun Cao
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- Bin Fang
  - Proteomics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- John M. Koomen
  - Proteomics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
  - Department of Molecular Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- Steven A. Eschrich
  - Cancer Informatics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
  - Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- Keiryn L. Bennett
  - CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, 1090 Vienna, Austria
- Eric B. Haura
  - Department of Thoracic Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
|
49
|
Song Y, Laskay ÜA, Vilcins IME, Barbour AG, Wysocki VH. Top-down-assisted bottom-up method for homologous protein sequencing: hemoglobin from 33 bird species. Journal of the American Society for Mass Spectrometry 2015; 26:1875-84. [PMID: 26111519 PMCID: PMC6467653 DOI: 10.1007/s13361-015-1185-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/23/2015] [Revised: 05/08/2015] [Accepted: 05/08/2015] [Indexed: 05/12/2023]
Abstract
Ticks are vectors for disease transmission because they are indiscriminate in their feeding on multiple vertebrate hosts, transmitting pathogens between their hosts. Identifying the hosts on which ticks have fed is important for disease prevention and intervention. We have previously shown that hemoglobin (Hb) remnants from a host on which a tick fed can be used to reveal the host's identity. For the present research, blood was collected from 33 bird species that are common in the U.S. as hosts for ticks but that have unknown Hb sequences. A top-down-assisted bottom-up mass spectrometry approach with a customized searching database, based on variability in known bird hemoglobin sequences, has been devised to facilitate fast and complete sequencing of hemoglobin from birds with unknown sequences. These hemoglobin sequences will be added to a hemoglobin database and used for tick host identification. The general approach has the potential to sequence any set of homologous proteins completely in a rapid manner.
Affiliation(s)
- Yang Song
  - Department of Chemistry and Biochemistry, The Ohio State University, Columbus, OH, 43210, USA
  - Department of Chemistry and Biochemistry, The University of Arizona, Tucson, AZ, 85721, USA
- Ünige A Laskay
  - Department of Chemistry and Biochemistry, The University of Arizona, Tucson, AZ, 85721, USA
- Inger-Marie E Vilcins
  - Emerging and Acute Infectious Diseases Branch, Department of State Health Services, Austin, TX, 78756, USA
- Alan G Barbour
  - Microbiology and Molecular Genetics, Medicine, and Ecology and Evolutionary Biology, University of California, Irvine, CA, 92687, USA
- Vicki H Wysocki
  - Department of Chemistry and Biochemistry, The Ohio State University, Columbus, OH, 43210, USA
  - Department of Chemistry and Biochemistry, The University of Arizona, Tucson, AZ, 85721, USA
|
50
|
Woo S, Cha SW, Bonissone S, Na S, Tabb DL, Pevzner PA, Bafna V. Advanced Proteogenomic Analysis Reveals Multiple Peptide Mutations and Complex Immunoglobulin Peptides in Colon Cancer. J Proteome Res 2015; 14:3555-67. [PMID: 26139413 DOI: 10.1021/acs.jproteome.5b00264] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Indexed: 12/27/2022]
Abstract
Aiming toward an improved understanding of the regulation of proteins in cancer, recent studies from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) have focused on analyzing cancer tissue using proteomic technologies and workflows. Although many proteogenomics approaches for the study of cancer samples have been proposed, serious methodological challenges remain, especially in the identification of multiple mutational variants or structural variations such as fusion gene events. In addition, although immune system genes play an important role in cancer, identification of IgG peptides remains challenging in proteomic data sets. Here, we describe an integrative proteogenomic method that extends the limit of proteogenomic searches to identify multiple variant peptides as well as immunoglobulin gene variations/rearrangements using customized mining of RNA-seq data. Our results also provide the first extensive characterization of tumor immune response and demonstrate the potential of this method to improve the molecular characterization of tumor subtypes.
Affiliation(s)
- David L Tabb
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee 37203, United States
|