1
Desai H, Ofori S, Boatner L, Yu F, Villanueva M, Ung N, Nesvizhskii AI, Backus K. Multi-omic stratification of the missense variant cysteinome. bioRxiv 2023:2023.08.12.553095. [PMID: 37645963 PMCID: PMC10461992 DOI: 10.1101/2023.08.12.553095] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 08/31/2023]
Abstract
Cancer genomes are rife with genetic variants; one key outcome of this variation is gain of cysteine, the amino acid most frequently acquired through missense variants in COSMIC. Acquired cysteines are both driver mutations and sites targeted by precision therapies. However, despite their ubiquity, nearly all acquired cysteines remain uncharacterized. Here, we pair cysteine chemoproteomics, a technique that enables proteome-wide pinpointing of functional, redox-sensitive, and potentially druggable residues, with genomics to reveal the hidden landscape of cysteine acquisition. For both cancer and healthy genomes, we find that cysteine acquisition is a ubiquitous consequence of genetic variation that is further elevated in the context of decreased DNA repair. Our chemoproteogenomics platform integrates chemoproteomic, whole exome, and RNA-seq data, with a customized two-stage false discovery rate (FDR) error-controlled proteomic search, further enhanced with a user-friendly FragPipe interface. Integration of CADD predictions of deleteriousness revealed marked enrichment for likely damaging variants that result in acquisition of cysteine. By deploying chemoproteogenomics across eleven cell lines, we identify 116 gain-of-cysteines, of which 10 were liganded by electrophilic druglike molecules. Reference cysteines proximal to missense variants were also found to be pervasive, 791 in total, supporting heretofore untapped opportunities for proteoform-specific chemical probe development campaigns. As chemoproteogenomics is further distinguished by sample-matched combinatorial variant databases and is compatible with redox proteomics and small molecule screening, we expect widespread utility in guiding proteoform-specific biology and therapeutic discovery.
Affiliation(s)
- Heta Desai
- Biological Chemistry Department, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
- Molecular Biology Institute, UCLA, Los Angeles, CA, 90095, USA
- Samuel Ofori
- Biological Chemistry Department, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
- Lisa Boatner
- Biological Chemistry Department, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
- Department of Chemistry and Biochemistry, UCLA, Los Angeles, CA, 90095, USA
- Fengchao Yu
- Department of Pathology, University of Michigan, Ann Arbor, MI, 48109, USA
- Miranda Villanueva
- Biological Chemistry Department, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
- Molecular Biology Institute, UCLA, Los Angeles, CA, 90095, USA
- Nicholas Ung
- Biological Chemistry Department, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
- Department of Chemistry and Biochemistry, UCLA, Los Angeles, CA, 90095, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Department of Pathology, University of Michigan, Ann Arbor, MI, 48109, USA
- Molecular Biology Institute, UCLA, Los Angeles, CA, 90095, USA
- DOE Institute for Genomics and Proteomics, UCLA, Los Angeles, CA, 90095, USA
- Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, CA, 90095, USA
- Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, UCLA, Los Angeles, CA, 90095, USA
- Alexey I Nesvizhskii
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Department of Pathology, University of Michigan, Ann Arbor, MI, 48109, USA
- Keriann Backus
- Biological Chemistry Department, David Geffen School of Medicine, UCLA, Los Angeles, CA, 90095, USA
- Department of Chemistry and Biochemistry, UCLA, Los Angeles, CA, 90095, USA
- Molecular Biology Institute, UCLA, Los Angeles, CA, 90095, USA
- DOE Institute for Genomics and Proteomics, UCLA, Los Angeles, CA, 90095, USA
- Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, CA, 90095, USA
- Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, UCLA, Los Angeles, CA, 90095, USA
2
Aggarwal S, Raj A, Kumar D, Dash D, Yadav AK. False discovery rate: the Achilles' heel of proteogenomics. Brief Bioinform 2022; 23:6582880. [PMID: 35534181 DOI: 10.1093/bib/bbac163] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/30/2021] [Revised: 03/14/2022] [Accepted: 04/12/2022] [Indexed: 12/25/2022]
Abstract
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de-facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame nucleotide database translation not only increase the search space and compute-time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, first, we briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
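The peptide-class-specific FDR mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes a simple decoys/targets estimator and assumes that every decoy carries the class label of the sequence it was derived from; the names `PSM`, `passing_targets`, and `class_specific_fdr` are hypothetical and are not taken from the review or from any search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    peptide: str
    score: float      # higher means a better match
    is_decoy: bool
    is_variant: bool  # assumed label: True if the peptide is absent from the reference proteome


def passing_targets(psms, fdr_threshold):
    """Return target PSMs above the most permissive score cutoff whose
    estimated FDR (decoys / targets) stays at or below the threshold."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    best_cut = 0
    for i, psm in enumerate(ranked, start=1):
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr_threshold:
            best_cut = i
    return [p for p in ranked[:best_cut] if not p.is_decoy]


def class_specific_fdr(psms, fdr_threshold=0.01):
    """Estimate FDR separately for reference and variant PSMs."""
    reference = [p for p in psms if not p.is_variant]
    variant = [p for p in psms if p.is_variant]
    return (passing_targets(reference, fdr_threshold),
            passing_targets(variant, fdr_threshold))
```

In this sketch, a low-scoring variant match that would pass a global threshold on the strength of the many confident reference matches can fail once the variant class is evaluated on its own, which is the behavior the abstract attributes to class-specific error control.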
Affiliation(s)
- Suruchi Aggarwal
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
- Anurag Raj
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Dhirendra Kumar
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India
- Debasis Dash
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Amit Kumar Yadav
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
3
Cao X, Xing J. PrecisionProDB: improving the proteomics performance for precision medicine. Bioinformatics 2021; 37:3361-3363. [PMID: 33787868 DOI: 10.1093/bioinformatics/btab218] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/04/2021] [Revised: 03/06/2021] [Accepted: 03/30/2021] [Indexed: 01/03/2023]
Abstract
SUMMARY: As next-generation sequencing technology becomes broadly applied, genomics and transcriptomics are becoming more commonly used in both research and clinical settings. However, proteomics is still an obstacle to be conquered. For most peptide search programs in proteomics, a standard reference protein database is used. Because of the thousands of coding DNA variants in each individual, a standard reference database does not provide a perfect match for many proteins/peptides of an individual. A personalized reference database can improve the detection power and accuracy for individual proteomics data. To connect genomics and proteomics, we designed a Python package, PrecisionProDB, that is specialized for generating a personalized protein database for proteomics applications. PrecisionProDB supports multiple popular file formats and reference databases, and can generate a personalized database in minutes. To demonstrate the application of PrecisionProDB, we generated human population-specific reference protein databases with PrecisionProDB, which improve the number of identified peptides by 0.34% on average. In addition, by incorporating cell line-specific variants into the protein database, we demonstrated a 0.71% improvement for peptide identification in the Jurkat cell line. With PrecisionProDB and these datasets, researchers and clinicians can improve their peptide search performance by adopting the more representative protein database or adding population- and individual-specific proteins to the search database with minimal additional effort. AVAILABILITY: PrecisionProDB and pre-calculated protein databases are freely available at https://github.com/ATPs/PrecisionProDB and https://github.com/ATPs/PrecisionProDB_references. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
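As a rough illustration of what a personalized database entry involves, the following Python sketch applies missense variants to a reference protein sequence and emits FASTA text. It is not PrecisionProDB's interface: the function names, the `(reference residue, position, alternate residue)` variant tuples, and the header format are all assumptions made for this example.

```python
def apply_missense(reference_seq, variants):
    """Apply missense variants given as (ref_residue, 1-based position, alt_residue)."""
    residues = list(reference_seq)
    for ref, pos, alt in variants:
        # Guard against variants called on a different transcript or isoform.
        if residues[pos - 1] != ref:
            raise ValueError(
                f"expected {ref} at position {pos}, found {residues[pos - 1]}"
            )
        residues[pos - 1] = alt
    return "".join(residues)


def personalized_fasta(proteins, variants_by_protein, line_width=60):
    """Return FASTA text holding each reference entry plus one variant entry
    for every protein that has variants (hypothetical header format)."""
    lines = []
    for protein_id, seq in proteins.items():
        entries = [(protein_id, seq)]
        variants = variants_by_protein.get(protein_id, [])
        if variants:
            label = "_".join(f"{r}{p}{a}" for r, p, a in variants)
            entries.append((f"{protein_id}_{label}", apply_missense(seq, variants)))
        for header, sequence in entries:
            lines.append(f">{header}")
            lines.extend(
                sequence[i:i + line_width]
                for i in range(0, len(sequence), line_width)
            )
    return "\n".join(lines) + "\n"
```

Keeping the reference entry next to the variant entry is a design choice of this sketch; it lets a search engine match either allele in a heterozygous sample.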
Affiliation(s)
- Xiaolong Cao
- Department of Genetics, Human Genetic Institute of New Jersey, Rutgers, The State University of New Jersey, Piscataway, NJ, 08854, USA
- Jinchuan Xing
- Department of Genetics, Human Genetic Institute of New Jersey, Rutgers, The State University of New Jersey, Piscataway, NJ, 08854, USA
4
Choong WK, Wang JH, Sung TY. MinProtMaxVP: Generating a minimized number of protein variant sequences containing all possible variant peptides for proteogenomic analysis. J Proteomics 2020; 223:103819. [PMID: 32407886 DOI: 10.1016/j.jprot.2020.103819] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/01/2019] [Revised: 05/04/2020] [Accepted: 05/09/2020] [Indexed: 12/12/2022]
Abstract
Identifying single-amino-acid variants (SAVs) from mass spectrometry-based experiments is critical for validating single-nucleotide variants (SNVs) at the protein level to facilitate biomedical research. Currently, two approaches are usually applied to convert SNV annotations into SAV-harboring protein sequences. One approach generates one sequence containing exactly one SAV, and the other all SAVs. However, they may neglect the possibility of SAV combinations, e.g., haplotypes, existing in bio-samples. Therefore, it is necessary to consider all SAV combinations of a protein when generating SAV-harboring protein sequences. In this paper, we propose MinProtMaxVP, a novel approach which selects a minimized number of SAV-harboring protein sequences generated from the exhaustive approach, while still accommodating all possible variant peptides, by solving a classic set covering problem. Our study on known haplotype variations of TAS2R38 justifies the necessity for MinProtMaxVP to consider all combinations of SAVs. The performance of MinProtMaxVP is demonstrated by an in silico study on OR2T27 with five SAVs and real experimental data of the HEK293 cell line. Furthermore, assuming simulated somatic and germline variants of OR2T27 in tumor and normal tissues demonstrates that when adopting the appropriate somatic and germline SAV integration strategy, MinProtMaxVP is adaptable to labeling and label-free mass spectrometry-based experiments. SIGNIFICANCE: We present MinProtMaxVP, a novel approach to generate SAV-harboring protein sequences for constructing a customized protein sequence database, which is used in database searching for variant peptide identification. This approach outperforms the existing approaches in generating all possible variant peptides to be included in protein sequences and possibly leading to identification of more variant peptides in proteogenomic analysis.
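The set-covering idea in this abstract can be sketched in Python as follows. This is not the MinProtMaxVP implementation: it substitutes the standard greedy approximation for whatever solver the authors use, digests with a simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages), and all function names are hypothetical.

```python
import re
from itertools import product


def tryptic_peptides(seq):
    """Simplified in silico trypsin digest with no missed cleavages."""
    return {p for p in re.split(r"(?<=[KR])(?!P)", seq) if p}


def apply_savs(seq, savs):
    """Apply SAVs given as (1-based position, alternate residue)."""
    residues = list(seq)
    for pos, alt in savs:
        residues[pos - 1] = alt
    return "".join(residues)


def minimal_variant_sequences(reference, savs):
    """Greedily choose SAV-harboring sequences that together contain every
    variant peptide produced by any combination of the given SAVs."""
    ref_peptides = tryptic_peptides(reference)
    # Exhaustive approach: one candidate sequence per non-empty SAV subset.
    candidates = {}
    for mask in product([False, True], repeat=len(savs)):
        subset = [s for s, keep in zip(savs, mask) if keep]
        if subset:
            seq = apply_savs(reference, subset)
            candidates[seq] = tryptic_peptides(seq) - ref_peptides
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Take the candidate covering the most still-uncovered peptides;
        # sorting the keys makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda s: len(candidates[s] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen
```

For a protein with two SAVs in one tryptic peptide and a third SAV elsewhere, the exhaustive approach yields seven sequences, while three are enough to contain all four distinct variant peptides. The greedy heuristic does not guarantee a minimum in general.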
Affiliation(s)
- Wai-Kok Choong
- Institute of Information Science, Academia Sinica, Nankang, Taipei 11529, Taiwan
- Jen-Hung Wang
- Institute of Information Science, Academia Sinica, Nankang, Taipei 11529, Taiwan
- Ting-Yi Sung
- Institute of Information Science, Academia Sinica, Nankang, Taipei 11529, Taiwan.
5
Low TY, Mohtar MA, Ang MY, Jamal R. Connecting Proteomics to Next‐Generation Sequencing: Proteogenomics and Its Current Applications in Biology. Proteomics 2018; 19:e1800235. [DOI: 10.1002/pmic.201800235] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 06/11/2018] [Revised: 10/09/2018] [Indexed: 12/17/2022]
Affiliation(s)
- Teck Yew Low
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- M. Aiman Mohtar
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- Mia Yang Ang
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- Rahman Jamal
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
6
Cifani P, Dhabaria A, Chen Z, Yoshimi A, Kawaler E, Abdel-Wahab O, Poirier JT, Kentsis A. ProteomeGenerator: A Framework for Comprehensive Proteomics Based on de Novo Transcriptome Assembly and High-Accuracy Peptide Mass Spectral Matching. J Proteome Res 2018; 17:3681-3692. [PMID: 30295032 DOI: 10.1021/acs.jproteome.8b00295] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 11/28/2022]
Abstract
Modern mass spectrometry now permits genome-scale and quantitative measurements of biological proteomes. However, analysis of specific specimens is currently hindered by the incomplete representation of biological variability of protein sequences in canonical reference proteomes and the technical demands for their construction. Here, we report ProteomeGenerator, a framework for de novo and reference-assisted proteogenomic database construction and analysis based on sample-specific transcriptome sequencing and high-accuracy mass spectrometry proteomics. This enables the assembly of proteomes encoded by actively transcribed genes, including sample-specific protein isoforms resulting from non-canonical mRNA transcription, splicing, or editing. To improve the accuracy of protein isoform identification in non-canonical proteomes, ProteomeGenerator relies on statistical target-decoy database matching calibrated using sample-specific controls. Its current implementation includes automatic integration with MaxQuant mass spectrometry proteomics algorithms. We applied this method for the proteogenomic analysis of splicing factor SRSF2 mutant leukemia cells, demonstrating high-confidence identification of non-canonical protein isoforms arising from alternative transcriptional start sites, intron retention, and cryptic exon splicing as well as improved accuracy of genome-scale proteome discovery. Additionally, we report proteogenomic performance metrics for current state-of-the-art implementations of SEQUEST HT, MaxQuant, Byonic, and PEAKS mass spectral analysis algorithms. Finally, ProteomeGenerator is implemented as a Snakemake workflow within a Singularity container for one-step installation in diverse computing environments, thereby enabling open, scalable, and facile discovery of sample-specific, non-canonical, and neomorphic biological proteomes.
Affiliation(s)
- Paolo Cifani
- Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York City, New York 10065, United States
- Avantika Dhabaria
- Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York City, New York 10065, United States
- Zining Chen
- Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York City, New York 10065, United States
- Omar Abdel-Wahab
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, New York University Langone Health, New York City, New York 10016, United States
- John T Poirier
- Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York City, New York 10065, United States; Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, New York University Langone Health, New York City, New York 10016, United States
- Alex Kentsis
- Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York City, New York 10065, United States; Departments of Pediatrics, Pharmacology, and Physiology & Biophysics, Weill Cornell Medical College, Cornell University, New York, New York 10065, United States
7
Yi X, Wang B, An Z, Gong F, Li J, Fu Y. Quality control of single amino acid variations detected by tandem mass spectrometry. J Proteomics 2018; 187:144-151. [PMID: 30012419 DOI: 10.1016/j.jprot.2018.07.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 03/07/2018] [Revised: 06/26/2018] [Accepted: 07/02/2018] [Indexed: 02/04/2023]
Abstract
Study of single amino acid variations (SAVs) of proteins, resulting from single nucleotide polymorphisms, is of great importance for understanding the relationships between genotype and phenotype. In mass spectrometry based shotgun proteomics, identification of peptides with SAVs often suffers from high error rates on the variant sites detected. These site errors are due to multiple reasons and can be confirmed by manual inspection or genomic sequencing. Here, we present a software tool, named SAVControl, for site-level quality control of variant peptide identifications. It mainly includes strict false discovery rate control of variant peptide identifications and variant site verification by unrestrictive mass shift relocalization. SAVControl was validated on three colorectal adenocarcinoma cell line datasets with genomic sequencing evidence and tested on a colorectal cancer dataset from The Cancer Genome Atlas. The results show that SAVControl can effectively remove false detections of SAVs. SIGNIFICANCE: Protein sequence variations caused by single nucleotide polymorphisms (SNPs) are single amino acid variations (SAVs). The investigation of SAVs may provide a chance for understanding the relationships between genotype and phenotype. Mass spectrometry (MS) based proteomics provides a large-scale way to detect SAVs. However, using the current analysis strategy to detect SAVs may lead to a high rate of false positives. The SAVControl we present here is a computational workflow and software tool for site-level quality control of SAVs detected by MS. It assesses the confidence of detected variant sites by relocating the mass shift responsible for an SAV to search for alternative interpretations. In addition, it uses a strict false discovery rate control method for variant peptide identifications. The advantages of SAVControl were demonstrated on three colorectal adenocarcinoma cell line datasets and a colorectal cancer dataset. We believe that SAVControl will be a powerful tool for computational proteomics and proteogenomics.
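One reason a detected SAV may have an alternative interpretation is that its mass shift can coincide with that of a common modification. The Python sketch below illustrates only this coincidence check; it is not SAVControl's relocalization procedure, which moves the mass shift to other sites in the peptide. The function names and the 0.001 Da tolerance are assumptions, and the tables hold standard monoisotopic masses for a handful of residues and modifications.

```python
# Monoisotopic residue masses in Da (subset chosen for illustration).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "D": 115.02694, "E": 129.04259, "N": 114.04293,
    "Q": 128.05858, "K": 128.09496,
}

# Monoisotopic mass shifts of a few common modifications in Da.
MODIFICATION_MASS = {
    "oxidation": 15.994915,
    "methylation": 14.015650,
    "deamidation": 0.984016,
}


def sav_mass_shift(ref, alt):
    """Mass difference introduced by substituting residue ref with alt."""
    return RESIDUE_MASS[alt] - RESIDUE_MASS[ref]


def alternative_explanations(ref, alt, tolerance=0.001):
    """Names of modifications whose mass shift matches the SAV within tolerance."""
    shift = sav_mass_shift(ref, alt)
    return sorted(
        name for name, mass in MODIFICATION_MASS.items()
        if abs(shift - mass) <= tolerance
    )
```

For example, an Ala-to-Ser substitution is indistinguishable by mass alone from an oxidation, so a candidate site of that kind would need fragment-level evidence before being accepted.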
Affiliation(s)
- Xinpei Yi
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Bo Wang
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Zhiwu An
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Fuzhou Gong
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.
- Jing Li
- Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China.
- Yan Fu
- NCMIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.
8
Heunis T, Dippenaar A, Warren RM, van Helden PD, van der Merwe RG, Gey van Pittius NC, Pain A, Sampson SL, Tabb DL. Proteogenomic Investigation of Strain Variation in Clinical Mycobacterium tuberculosis Isolates. J Proteome Res 2017; 16:3841-3851. [PMID: 28820946 DOI: 10.1021/acs.jproteome.7b00483] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 01/12/2023]
Abstract
Mycobacterium tuberculosis consists of a large number of different strains that display unique virulence characteristics. Whole-genome sequencing has revealed substantial genetic diversity among clinical M. tuberculosis isolates, and elucidating the phenotypic variation encoded by this genetic diversity will be of the utmost importance to fully understand M. tuberculosis biology and pathogenicity. In this study, we integrated whole-genome sequencing and mass spectrometry (GeLC-MS/MS) to reveal strain-specific characteristics in the proteomes of two clinical M. tuberculosis Latin American-Mediterranean isolates. Using this approach, we identified 59 peptides containing single amino acid variants, which covered ∼9% of all coding nonsynonymous single nucleotide variants detected by whole-genome sequencing. Furthermore, we identified 29 distinct peptides that mapped to a hypothetical protein not present in the M. tuberculosis H37Rv reference proteome. Here, we provide evidence for the expression of this protein in the clinical M. tuberculosis SAWC3651 isolate. The strain-specific databases enabled confirmation of genomic differences (i.e., large genomic regions of difference and nonsynonymous single nucleotide variants) in these two clinical M. tuberculosis isolates and allowed strain differentiation at the proteome level. Our results contribute to the growing field of clinical microbial proteogenomics and can improve our understanding of phenotypic variation in clinical M. tuberculosis isolates.
Affiliation(s)
- Tiaan Heunis
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
- Anzaan Dippenaar
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
- Robin M Warren
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
- Paul D van Helden
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
- Ruben G van der Merwe
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
- Nicolaas C Gey van Pittius
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
- Arnab Pain
- Pathogen Genomics Laboratory, BESE Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- Samantha L Sampson
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
- David L Tabb
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town 7505, South Africa
9
Wingo TS, Duong DM, Zhou M, Dammer EB, Wu H, Cutler DJ, Lah JJ, Levey AI, Seyfried NT. Integrating Next-Generation Genomic Sequencing and Mass Spectrometry To Estimate Allele-Specific Protein Abundance in Human Brain. J Proteome Res 2017; 16:3336-3347. [PMID: 28691493 DOI: 10.1021/acs.jproteome.7b00324] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Indexed: 12/28/2022]
Abstract
Gene expression contributes to phenotypic traits and human disease. To date, comparatively less is known about regulators of protein abundance, which is also under genetic control and likely influences clinical phenotypes. However, identifying and quantifying allele-specific protein abundance by bottom-up proteomics is challenging since single nucleotide variants (SNVs) that alter protein sequence are not considered in standard human protein databases. To address this, we developed the GenPro software and used it to create personalized protein databases (PPDs) to identify single amino acid variants (SAAVs) at the protein level from whole exome sequencing. In silico assessment of PPDs generated by GenPro revealed only a 1% increase in tryptic search space compared to a direct translation of all human transcripts and an equivalent search space compared to the UniProtKB reference database. To identify a large unbiased number of SAAV peptides, we performed high-resolution mass spectrometry-based proteomics for two human post-mortem brain samples and searched the collected MS/MS spectra against their respective PPD. We found an average of ∼117 000 unique peptides mapping to ∼9300 protein groups for each sample, and of these, 977 were unique variant peptides. We found that over 400 reference and SAAV peptide pairs were, on average, equally abundant in human brain by label-free ion intensity measurements and confirmed the absolute levels of three reference and SAAV peptide pairs using heavy-labeled peptide standards coupled with parallel reaction monitoring (PRM). Our results highlight the utility of integrating genomic and proteomic sequencing data to identify sample-specific SAAV peptides and support the hypothesis that most alleles are equally expressed in human brain.
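The in silico search-space comparison described in this abstract can be approximated with a short Python sketch. It is not GenPro: it assumes a simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages), an arbitrary minimum peptide length of six residues, and hypothetical function names.

```python
import re


def tryptic_peptides(seq, min_len=6):
    """Simplified tryptic digest keeping peptides of at least min_len residues."""
    return {p for p in re.split(r"(?<=[KR])(?!P)", seq) if len(p) >= min_len}


def search_space_increase(reference_seqs, variant_seqs, min_len=6):
    """Return the variant-only tryptic peptides and their count relative to
    the number of distinct reference tryptic peptides."""
    ref = set().union(*(tryptic_peptides(s, min_len) for s in reference_seqs))
    if not ref:
        raise ValueError("reference digest produced no peptides")
    var = set().union(*(tryptic_peptides(s, min_len) for s in variant_seqs)) - ref
    return var, len(var) / len(ref)
```

Counting only peptides absent from the reference digest is what keeps the reported increase small: a single amino acid variant alters one tryptic peptide, or a few if it creates or removes a cleavage site, and leaves the rest of the protein unchanged.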
Affiliation(s)
- Thomas S Wingo
- Division of Neurology, Department of Veterans Affairs Medical Center, Decatur, Georgia 30033, United States
10
Ruggles KV, Krug K, Wang X, Clauser KR, Wang J, Payne SH, Fenyö D, Zhang B, Mani DR. Methods, Tools and Current Perspectives in Proteogenomics. Mol Cell Proteomics 2017; 16:959-981. [PMID: 28456751 DOI: 10.1074/mcp.mr117.000024] [Citation(s) in RCA: 95] [Impact Index Per Article: 13.6] [Received: 04/13/2017] [Indexed: 12/20/2022]
Abstract
With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies has yielded novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches. In this article, we systematically classify published methods and tools into four major categories: (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications.
Affiliation(s)
- Kelly V Ruggles
- Department of Medicine, New York University School of Medicine, New York, New York 10016
- Karsten Krug
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
- Xiaojing Wang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
- Karl R Clauser
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
- Jing Wang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
- Samuel H Payne
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354
- David Fenyö
- Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York 10016; Institute for Systems Genetics, New York University School of Medicine, New York, New York 10016
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
- D R Mani
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
11
Tan Z, Nie S, McDermott SP, Wicha MS, Lubman DM. Single Amino Acid Variant Profiles of Subpopulations in the MCF-7 Breast Cancer Cell Line. J Proteome Res 2017; 16:842-851. [PMID: 28076950 DOI: 10.1021/acs.jproteome.6b00824] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 01/07/2023]
Abstract
Cancers are initiated and developed from a small population of stem-like cells termed cancer stem cells (CSCs). There is heterogeneity among this CSC population that leads to multiple subpopulations with their own distinct biological features and protein expression. The protein expression and function may be impacted by amino acid variants that can occur largely due to single nucleotide changes. We have thus performed proteomic analysis of breast CSC subpopulations by mass spectrometry to study the presence of single amino acid variants (SAAVs) and their relation to breast cancer. We have used CSC markers to isolate pure breast CSC subpopulation fractions (ALDH+ and CD44+/CD24- cell populations) and the mature luminal cells (CD49f-EpCAM+) from the MCF-7 breast cancer cell line. By searching the Swiss-CanSAAVs database, 374 unique SAAVs were identified in total, where 27 are cancer-related SAAVs. 135 unique SAAVs were found in the CSC population compared with the mature luminal cells. The distribution of SAAVs detected in MCF-7 cells was compared with those predicted from the Swiss-CanSAAVs database, where we found distinct differences in the numbers of SAAVs detected relative to that expected from the Swiss-CanSAAVs database for several of the amino acids.
Affiliation(s)
- Zhijing Tan
  - Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
- Song Nie
  - Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Biological Sciences Division and Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Sean P McDermott
  - Department of Internal Medicine, Division of Hematology/Oncology, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Comprehensive Cancer Center, University of Michigan, Ann Arbor, Michigan 48109, United States
- Max S Wicha
  - Department of Internal Medicine, Division of Hematology/Oncology, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Comprehensive Cancer Center, University of Michigan, Ann Arbor, Michigan 48109, United States
- David M Lubman
  - Department of Surgery, University of Michigan, Ann Arbor, Michigan 48109, United States
|
12
|
Luge T, Fischer C, Sauer S. Efficient Application of De Novo RNA Assemblers for Proteomics Informed by Transcriptomics. J Proteome Res 2016; 15:3938-3943. [DOI: 10.1021/acs.jproteome.6b00301] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/30/2022]
Affiliation(s)
- Toni Luge
  - Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
- Cornelius Fischer
  - Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
  - BIMSB and BIH Genomics Platforms, Laboratory of Functional Genomics, Nutrigenomics and Systems Biology, Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Straße 10, 13125 Berlin, Germany
- Sascha Sauer
  - Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
  - BIMSB and BIH Genomics Platforms, Laboratory of Functional Genomics, Nutrigenomics and Systems Biology, Max-Delbrück-Center for Molecular Medicine, Robert-Rössle-Straße 10, 13125 Berlin, Germany
  - CU Systems Medicine, University of Würzburg, Josef-Schneider-Straße 2, 97080 Würzburg, Germany
|
13
|
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annual Review of Analytical Chemistry (Palo Alto, Calif.) 2016; 9:521-45. [PMID: 27049631 PMCID: PMC4991544 DOI: 10.1146/annurev-anchem-071015-041722] [Citation(s) in RCA: 73] [Impact Index Per Article: 9.1] [Indexed: 05/09/2023]
Abstract
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
Affiliation(s)
- Gloria M Sheynkman
  - Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Michael R Shortreed
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Anthony J Cesnik
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Lloyd M Smith
  - Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
  - Genome Center of Wisconsin, University of Wisconsin, Madison, Wisconsin 53706
|
14
|
Zickmann F, Renard BY. MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms. Bioinformatics 2015; 31:i106-15. [PMID: 26072472 PMCID: PMC4765881 DOI: 10.1093/bioinformatics/btv236] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 01/08/2023]
Abstract
Summary: Ongoing advances in high-throughput technologies have facilitated accurate proteomic measurements and provide a wealth of information on genomic and transcript level. In proteogenomics, this multi-omics data is combined to analyze unannotated organisms and to allow more accurate sample-specific predictions. Existing analysis methods still mainly depend on six-frame translations or reference protein databases that are extended by transcriptomic information or known single nucleotide polymorphisms (SNPs). However, six-frames introduce an artificial sixfold increase of the target database and SNP integration requires a suitable database summarizing results from previous experiments. We overcome these limitations by introducing MSProGene, a new method for integrative proteogenomic analysis based on customized RNA-Seq driven transcript databases. MSProGene is independent from existing reference databases or annotated SNPs and avoids large six-frame translated databases by constructing sample-specific transcripts. In addition, it creates a network combining RNA-Seq and peptide information that is optimized by a maximum-flow algorithm. It thereby also allows resolving the ambiguity of shared peptides for protein inference. We applied MSProGene on three datasets and show that it facilitates a database-independent reliable yet accurate prediction on gene and protein level and additionally identifies novel genes. Availability and implementation: MSProGene is written in Java and Python. It is open source and available at http://sourceforge.net/projects/msprogene/. Contact:renardb@rki.de
Affiliation(s)
- Franziska Zickmann
  - Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany
- Bernhard Y Renard
  - Research Group Bioinformatics (NG4), Robert Koch Institute, 13353 Berlin, Germany
|
15
|
Sunagar K, Morgenstern D, Reitzel AM, Moran Y. Ecological venomics: How genomics, transcriptomics and proteomics can shed new light on the ecology and evolution of venom. J Proteomics 2015; 135:62-72. [PMID: 26385003 DOI: 10.1016/j.jprot.2015.09.015] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.0] [Received: 07/15/2015] [Revised: 09/02/2015] [Accepted: 09/09/2015] [Indexed: 01/18/2023]
Abstract
Animal venom is a complex cocktail of bioactive chemicals that traditionally drew interest mostly from biochemists and pharmacologists. However, in recent years the evolutionary and ecological importance of venom has been realized, as this trait has a direct and strong influence on interactions between species. Moreover, venom content can be modulated by environmental factors. Like many other fields of biology, venom research has been revolutionized in recent years by the introduction of systems biology approaches, i.e., genomics, transcriptomics and proteomics. The employment of these methods in venom research is known as 'venomics'. In this review we describe the history and recent advancements of venomics and discuss how they are employed in studying venom in general and in particular in the context of evolutionary ecology. We also discuss the pitfalls and challenges of venomics and what the future may hold for this emerging scientific field.
Affiliation(s)
- Kartik Sunagar
  - Department of Ecology, Evolution and Behavior, Alexander Silberman Institute of Life Sciences, Hebrew University of Jerusalem, Jerusalem 91904, Israel
- David Morgenstern
  - Proteomics Resource Center, Langone Medical Center, New York University, New York, USA
- Adam M Reitzel
  - Department of Biological Sciences, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Yehu Moran
  - Department of Ecology, Evolution and Behavior, Alexander Silberman Institute of Life Sciences, Hebrew University of Jerusalem, Jerusalem 91904, Israel
|