1
Aggarwal S, Raj A, Kumar D, Dash D, Yadav AK. False discovery rate: the Achilles' heel of proteogenomics. Brief Bioinform 2022; 23:6582880. [PMID: 35534181 DOI: 10.1093/bib/bbac163] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/30/2021] [Revised: 03/14/2022] [Accepted: 04/12/2022] [Indexed: 12/25/2022]
Abstract
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de-facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame nucleotide database translation not only increase the search space and compute-time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, first, we briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
Affiliation(s)
- Suruchi Aggarwal
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
- Anurag Raj
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Dhirendra Kumar
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India
- Debasis Dash
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Amit Kumar Yadav
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
2
Lee S, Park H, Kim H. Comparison of false-discovery rates of various decoy databases. Proteome Sci 2021; 19:11. [PMID: 34537052 PMCID: PMC8449453 DOI: 10.1186/s12953-021-00179-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/18/2021] [Accepted: 09/01/2021] [Indexed: 02/15/2024]
Abstract
BACKGROUND: The target-decoy strategy effectively estimates the false-discovery rate (FDR) by creating a decoy database with a size identical to that of the target database. Decoy databases are created by various methods, such as the reverse, pseudo-reverse, shuffle, pseudo-shuffle, and de Bruijn methods. FDR is sometimes over- or under-estimated depending on which decoy database is used because the ratios of redundant peptides in the target databases are different, that is, the numbers of unique (non-redundant) peptides in the target and decoy databases differ. RESULTS: We used two protein databases (the UniProt Saccharomyces cerevisiae protein database and the UniProt human protein database) to compare the FDRs of various decoy databases. When the ratio of redundant peptides in the target database is low, the FDR is not overestimated by any decoy construction method. However, if the ratio of redundant peptides in the target database is high, the FDR is overestimated when the (pseudo) shuffle decoy database is used. Additionally, the human and S. cerevisiae six-frame translation databases, which are large databases, showed outcomes similar to those from the UniProt human protein database. CONCLUSION: The FDR must be estimated using the correction factor proposed by Elias and Gygi or that by Kim et al. when (pseudo) shuffle decoy databases are used.
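The redundancy effect described in this abstract can be illustrated with a minimal sketch. It assumes a simplified trypsin rule (cleave after K or R, ignoring the proline exception), and the function names are hypothetical, not taken from the paper. Reversal is deterministic, so identical target peptides map to identical decoy peptides; independent shuffling usually does not preserve that redundancy, which can leave the decoy database with more unique peptides than the target database.

```python
import random


def tryptic_peptides(protein):
    """Split a protein after every K or R (simplified rule, an assumption)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def reverse_decoy(protein):
    """Reverse the whole protein sequence."""
    return protein[::-1]


def _split_tail(peptide):
    """Separate a C-terminal K/R (kept in place) from the rest of the peptide."""
    if peptide[-1] in "KR":
        return peptide[:-1], peptide[-1]
    return peptide, ""


def pseudo_reverse_decoy(protein):
    """Reverse each tryptic peptide but keep its C-terminal K/R in place."""
    parts = []
    for peptide in tryptic_peptides(protein):
        body, tail = _split_tail(peptide)
        parts.append(body[::-1] + tail)
    return "".join(parts)


def pseudo_shuffle_decoy(protein, rng):
    """Shuffle each tryptic peptide but keep its C-terminal K/R in place."""
    parts = []
    for peptide in tryptic_peptides(protein):
        body, tail = _split_tail(peptide)
        residues = list(body)
        rng.shuffle(residues)
        parts.append("".join(residues) + tail)
    return "".join(parts)


def unique_peptides(proteins):
    """Set of distinct tryptic peptides across a list of proteins."""
    return {pep for protein in proteins for pep in tryptic_peptides(protein)}


# Toy target database with full redundancy: the same sequence listed twice.
targets = ["ACDEFGHIKLMNPQR", "ACDEFGHIKLMNPQR"]
rng = random.Random(0)
print(len(unique_peptides(targets)))
print(len(unique_peptides([pseudo_reverse_decoy(p) for p in targets])))
print(len(unique_peptides([pseudo_shuffle_decoy(p, rng) for p in targets])))
```

Comparing the three counts on a real database is one way to check whether the target and decoy search spaces are of equal size before trusting an uncorrected FDR estimate.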
Affiliation(s)
- Sangjeong Lee
- Department of Computer Science, Hanyang University, Seoul, 06978, Republic of Korea
- Heejin Park
- Department of Computer Science, Hanyang University, Seoul, 06978, Republic of Korea
- Hyunwoo Kim
- Center for Supercomputing Applications, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
3
Tang Y, Harrington PB. Noninteger Root Transformations for Preprocessing Nanoelectrospray Ionization High-Resolution Mass Spectra for the Classification of Cannabis. Anal Chem 2019; 91:1328-1334. [PMID: 30565911 DOI: 10.1021/acs.analchem.8b03145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Typically, for measurements with a high dynamic range, the range is reduced by using the square root transform. By using noninteger roots coupled with systematic experimental design, improvements to the measurements may be obtained. The effect of using noninteger root transformation was evaluated using high-resolution mass spectrometry (HRMS) combined with nanoelectrospray ionization (Nano-ESI) to differentiate 23 samples of Cannabis. The mass spectra were evaluated and classified using different mass resolving powers and noninteger root transformations. Classification was achieved by super partial least-squares discriminant analysis (sPLS-DA), support vector machine (SVM), and SVM classification tree type entropy (SVMTreeH). The 2.5 root transformation gave the best overall performance at different resolving powers for chemical profiling from a multilevel factorial experimental design using 2 factors and more than 4 levels. Response surface modeling using a cubic polynomial model of the bootstrapped sPLS-DA average prediction accuracies yielded optima at 0.005 for resolving power and 2.3 for the root transformation. Root transformation is an important spectral preprocessing tool for decreasing the dynamic range so that the relative variance of smaller but more important features may be inflated. For the classification of Cannabis using Nano-ESI, the optimal ranges of root and resolution were broad. The chasing-the-optimum method has been introduced for refining the polynomial response surface model.
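As a minimal sketch of the root transformation mentioned in this abstract, the snippet below raises each intensity to the power 1/root and reports the resulting dynamic range. The function names and the toy spectrum are illustrative assumptions, and intensities are assumed to be non-negative.

```python
def root_transform(intensities, root=2.0):
    """Apply an n-th root (n may be noninteger) to non-negative intensities."""
    if root <= 0:
        raise ValueError("root must be positive")
    return [x ** (1.0 / root) for x in intensities]


def dynamic_range(intensities):
    """Ratio of the largest to the smallest positive intensity."""
    positive = [x for x in intensities if x > 0]
    return max(positive) / min(positive)


# Hypothetical peak intensities spanning six orders of magnitude.
spectrum = [1.0, 100.0, 1.0e6]
for root in (1.0, 2.0, 2.5):
    transformed = root_transform(spectrum, root)
    print(root, dynamic_range(transformed))
```

A larger root compresses the dynamic range more strongly, which is why small peaks gain relative weight after the transformation.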
Affiliation(s)
- Yue Tang
- Ohio University Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Clippinger Laboratories, Athens, Ohio 45701-2979, United States
- Peter B Harrington
- Ohio University Center for Intelligent Chemical Instrumentation, Department of Chemistry and Biochemistry, Clippinger Laboratories, Athens, Ohio 45701-2979, United States
4
Wang X, Harrington PDB, Baugh SF. Effect of preprocessing high-resolution mass spectra on the pattern recognition of Cannabis, hemp, and liquor. Talanta 2017; 180:229-238. [PMID: 29332804 DOI: 10.1016/j.talanta.2017.12.032] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 09/15/2017] [Revised: 12/11/2017] [Accepted: 12/11/2017] [Indexed: 01/17/2023]
Abstract
High-resolution mass spectrometry (HRMS) combined with pattern recognition was used to discriminate among twenty-five Cannabis samples, twenty hemp samples, and eight liquor samples. The effects of preprocessing on multivariate data analysis were evaluated for Orbitrap high-resolution mass spectra. Different root transformations were evaluated with respect to the bin width and the average classification rates. In addition, linear binning and proportional binning with various resolving powers were studied with respect to the average classification rates. The proportional binning with the square root transformation gave the best overall performance for chemical profiling or spectral fingerprinting. Six classification methods, fuzzy rule-building expert system (FuRES), linear discriminant analysis (LDA), super partial least squares discriminant analysis (sPLS-DA), support vector machine (SVM), SVM classification tree type gap (SVMTreeG), and SVM classification tree type entropy (SVMTreeH), had similar trends in prediction rate with respect to the resolving power. The optimal proportional mass bin width may depend on the data set: the optimum was a resolving power of 10⁻⁴ for the Cannabis samples and 10⁻³ for the hemp and liquor samples. Hence, data preprocessing methods such as different transformations, binning strategies, and resolving powers are important factors to be optimized for HRMS direct infusion measurements combined with pattern recognition to be an authentication and characterization tool for various products.
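The difference between linear and proportional binning can be sketched as follows. This is one plausible reading, not the authors' implementation: it assumes that the "resolving power" values quoted in the abstract act as a relative bin width, so that each proportional bin spans a fixed fraction of its own m/z. All names and the toy peak list are hypothetical.

```python
import math
from collections import defaultdict


def linear_bin(mz, width):
    """Bin index when every bin has the same absolute width (in m/z units)."""
    return math.floor(mz / width)


def proportional_bin(mz, relative_width):
    """Bin index when each bin is relative_width times its lower edge wide.

    Bin k covers [(1 + r)**k, (1 + r)**(k + 1)), so bins widen with m/z.
    """
    return math.floor(math.log(mz) / math.log1p(relative_width))


def bin_spectrum(peaks, bin_index):
    """Sum intensities of (mz, intensity) peaks that fall in the same bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[bin_index(mz)] += intensity
    return dict(binned)


# Hypothetical peak list: two close peaks near m/z 100 and one at m/z 500.
peaks = [(100.0, 5.0), (100.05, 3.0), (500.0, 2.0)]
print(bin_spectrum(peaks, lambda mz: linear_bin(mz, 0.01)))
print(bin_spectrum(peaks, lambda mz: proportional_bin(mz, 1e-3)))
```

With a relative width of 1e-3 the bin near m/z 100 is about 0.1 wide and the bin near m/z 500 about 0.5 wide, whereas linear bins keep the same width across the whole mass range.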
Affiliation(s)
- Xinyi Wang
- Center for Intelligent Chemical Instrumentation, Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, OH 45701-2979, USA
- Peter de B Harrington
- Center for Intelligent Chemical Instrumentation, Clippinger Laboratories, Department of Chemistry and Biochemistry, Ohio University, Athens, OH 45701-2979, USA
- Steven F Baugh
- Chemistry Mapping, Inc., 5902 McIntyre St., Golden, CO 80403, USA
5
Li H, Joh YS, Kim H, Paek E, Lee SW, Hwang KB. Evaluating the effect of database inflation in proteogenomic search on sensitive and reliable peptide identification. BMC Genomics 2016; 17:1031. [PMID: 28155652 PMCID: PMC5259817 DOI: 10.1186/s12864-016-3327-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Indexed: 11/16/2022]
Abstract
BACKGROUND: Proteogenomics is a promising approach for various tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such inflation of databases has the potential to identify novel peptides. However, it also raises concerns about sensitive and reliable peptide identification. Spurious peptides included in target databases may result in an underestimated false discovery rate (FDR). On the other hand, inflation of decoy databases could decrease the sensitivity of peptide identification due to the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search have hardly been available. RESULTS: To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data, respectively. Against these databases, we tested two popular database search tools with various approaches to search result validation: the target-decoy search strategy (with and without a refined scoring metric) and a mixture model-based method. The effect of separate filtering of known and novel peptides was also examined. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability of proteogenomic search. However, no one method consistently identified the largest (or the smallest) number of novel peptides from real proteogenomic searches. CONCLUSIONS: We propose to use a set of search result validation methods with separate filtering for sensitive and reliable identification of peptides in proteogenomic search.
Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3327-5) contains supplementary material, which is available to authorized users.
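The separate filtering of known and novel peptides examined in this study can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the FDR at a score cutoff is estimated as decoys divided by targets (one common convention among several), score ties are not handled specially, and the function names and the (score, is_decoy, peptide_class) tuple layout are hypothetical.

```python
from collections import defaultdict


def accept_targets(psms, alpha):
    """Return target scores accepted at an estimated FDR of at most alpha.

    psms is a list of (score, is_decoy) pairs; higher scores are better.
    The FDR at each rank is estimated as decoys / targets (an assumption).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = keep = 0
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimate is still acceptable.
        if targets and decoys / targets <= alpha:
            keep = rank
    return [score for score, is_decoy in ranked[:keep] if not is_decoy]


def separate_filtering(psms, alpha):
    """Apply the FDR threshold independently within each peptide class.

    psms is a list of (score, is_decoy, peptide_class) triples.
    """
    by_class = defaultdict(list)
    for score, is_decoy, peptide_class in psms:
        by_class[peptide_class].append((score, is_decoy))
    return {cls: accept_targets(group, alpha) for cls, group in by_class.items()}
```

Filtering each class against its own decoys keeps the many confident known-peptide matches from masking a high error rate among the few novel-peptide matches.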
Affiliation(s)
- Honglan Li
- School of Computer Science and Engineering, Soongsil University, Seoul, 06978, Republic of Korea
- Yoon Sung Joh
- Department of Computer Science, Hanyang University, Seoul, 04763, Republic of Korea
- Hyunwoo Kim
- Scientific Data Research Center, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
- Eunok Paek
- Department of Computer Science, Hanyang University, Seoul, 04763, Republic of Korea
- Sang-Won Lee
- Department of Chemistry, Research Institute for Natural Sciences, Korea University, Seoul, 02841, Republic of Korea
- Kyu-Baek Hwang
- School of Computer Science and Engineering, Soongsil University, Seoul, 06978, Republic of Korea
6
Yadav AK, Kumar D, Dash D. Learning from decoys to improve the sensitivity and specificity of proteomics database search results. PLoS One 2012. [PMID: 23189209 PMCID: PMC3506577 DOI: 10.1371/journal.pone.0050651] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023]
Abstract
The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between the true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database search, high mass accuracy, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that an appropriate threshold learnt from decoys can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
Affiliation(s)
- Amit Kumar Yadav
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Dhirendra Kumar
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Debasis Dash
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
7
van den Toorn HWP, Muñoz J, Mohammed S, Raijmakers R, Heck AJR, van Breukelen B. RockerBox: Analysis and Filtering of Massive Proteomics Search Results. J Proteome Res 2011; 10:1420-4. [DOI: 10.1021/pr1010185] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 12/24/2022]
Affiliation(s)
- Henk W. P. van den Toorn
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Netherlands Bioinformatics Centre, The Netherlands
- Javier Muñoz
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Shabaz Mohammed
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Reinout Raijmakers
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Albert J. R. Heck
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Bas van Breukelen
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Netherlands Bioinformatics Centre, The Netherlands
8
Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010; 73:2092-123. [PMID: 20816881 DOI: 10.1016/j.jprot.2010.08.009] [Citation(s) in RCA: 358] [Impact Index Per Article: 25.6] [Received: 07/02/2010] [Revised: 08/25/2010] [Accepted: 08/25/2010] [Indexed: 12/18/2022]
Abstract
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide-to-spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from peptide to protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.