1
Seneviratne AJ, Peters S, Clarke D, Dausmann M, Hecker M, Tully B, Hains PG, Zhong Q. Improved identification and quantification of peptides in mass spectrometry data via chemical and random additive noise elimination (CRANE). Bioinformatics 2021; 37:4719-4726. [PMID: 34323970 PMCID: PMC8711017 DOI: 10.1093/bioinformatics/btab563] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/11/2020] [Revised: 06/15/2021] [Accepted: 07/28/2021] [Indexed: 11/19/2022]
Abstract
Motivation: The output of electrospray ionization–liquid chromatography mass spectrometry (ESI-LC-MS) is influenced by multiple sources of noise, and the major contributors can be broadly categorized as baseline, random and chemical noise. Noise has a negative impact on the identification and quantification of peptides, which influences the reliability and reproducibility of MS-based proteomics data. Most attempts at denoising have been made on either spectra or chromatograms independently; thus, important 2D information is lost because the mass-to-charge ratio and retention time dimensions are not considered jointly.
Results: This article presents a novel technique for denoising raw ESI-LC-MS data via 2D undecimated wavelet transform, which is applied to proteomics data acquired by data-independent acquisition MS (DIA-MS). We demonstrate that denoising DIA-MS data results in the improvement of peptide identification and quantification in complex biological samples.
Availability and implementation: The software is available on GitHub (https://github.com/CMRI-ProCan/CRANE). The datasets were obtained from ProteomeXchange (identifiers PXD002952 and PXD008651). Preliminary data and intermediate files are available via ProteomeXchange (identifiers PXD020529 and PXD025103).
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Akila J Seneviratne
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Sean Peters
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- David Clarke
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Michael Dausmann
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Michael Hecker
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Brett Tully
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Peter G Hains
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Qing Zhong
- ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
2
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE Access: Practical Innovations, Open Solutions 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on the scalability issues that are inherent in proteogenomic data analysis, where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
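The six-frame translation mentioned in this abstract can be illustrated with a short sketch. This is not code from any of the surveyed tools; the function names `translate` and `six_frame_translation` are hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G or T is rendered as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation tables are needed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons of seq; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"], frames["-1"])
```

Because every genomic position contributes to six candidate protein sequences, a database built this way is far larger than a curated protein database, which is the scalability concern the abstract raises.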
Affiliation(s)
- Muhammad Usman Tariq
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Muhammad Haseeb
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Mohammed Aledhari
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Rehma Razzak
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Reza M Parizi
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Fahad Saeed
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
3
Deng Y, Ren Z, Pan Q, Qi D, Wen B, Ren Y, Yang H, Wu L, Chen F, Liu S. pClean: An Algorithm To Preprocess High-Resolution Tandem Mass Spectra for Database Searching. J Proteome Res 2019; 18:3235-3244. [PMID: 31364357 DOI: 10.1021/acs.jproteome.9b00141] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022]
Abstract
Database searches of MS/MS spectra are the main approach to peptide/protein identification in proteomics. Since most database search engines only utilize a small portion of the original MS/MS signals for peptide detection, how to improve the quality of MS/MS signals is a primary concern for enhancement of the peptide/protein identification rate. A fundamental issue is that some noise MS signals, informative or uninformative, have to be filtered out prior to database searching. Herein, an integrative preprocessing algorithm was designed, termed pClean, which incorporates three modules to preprocess MS/MS spectra, such as the removal of isobaric-labeling related ions, the reduction in isotopic peaks, the deconvolution of ions with higher charges, and the clearance of uninformative MS/MS signals. In contrast to the currently available approaches to MS/MS data preprocessing, pClean enables treatment of MS/MS spectra with high mass accuracy and favors filtering for the labeling or nonlabeling of peptides. Data sets at various scales gained from mass spectrometers with high resolution were used to assess the quality of peptides identified after pClean treatment and to compare the pClean improvement with those of other software programs. On the basis of the analysis of peptides identified and the Mascot ion score, pClean was proven to be effective in the removal of mass spectral noise and the reduction of random matching. Compared with other software programs, pClean appeared to be beneficial in terms of preprocessing performances for the enhancement of confidence scores and the increase in peptides identified. pClean is available at https://github.com/AimeeD90/pClean_release.
Affiliation(s)
- Yamei Deng
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; University of the Chinese Academy of Sciences, Beijing 100049, China; BGI-Shenzhen, Shenzhen 518083, China
- Zhe Ren
- BGI-Shenzhen, Shenzhen 518083, China; China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Qingfei Pan
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; University of the Chinese Academy of Sciences, Beijing 100049, China; BGI-Shenzhen, Shenzhen 518083, China
- Da Qi
- BGI-Shenzhen, Shenzhen 518083, China; China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Yan Ren
- BGI-Shenzhen, Shenzhen 518083, China; China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Huanming Yang
- BGI-Shenzhen, Shenzhen 518083, China; China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China; James D. Watson Institute of Genome Sciences, Hangzhou 310058, China
- Lin Wu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
- Fei Chen
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
- Siqi Liu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; BGI-Shenzhen, Shenzhen 518083, China; China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
4
Awan MG, Eslami T, Saeed F. GPU-DAEMON: GPU algorithm design, data management & optimization template for array based big omics data. Comput Biol Med 2018; 101:163-173. [PMID: 30145436 DOI: 10.1016/j.compbiomed.2018.08.015] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 05/07/2018] [Revised: 08/10/2018] [Accepted: 08/12/2018] [Indexed: 11/29/2022]
Abstract
In the age of ever-increasing data, faster and more efficient data processing algorithms are needed. Graphics Processing Units (GPUs) are emerging as a cost-effective alternative architecture for high-end computing. The optimal design of GPU algorithms is a challenging task which requires a thorough understanding of the high performance computing architecture as well as the algorithmic design. The steep learning curve needed for effective GPU-centric algorithm design and implementation requires considerable expertise, time, and resources. In this paper, we present GPU-DAEMON, a GPU Data Management, Algorithm Design and Optimization technique suitable for processing array based big omics data. Our proposed GPU algorithm design template outlines and provides generic methods to tackle critical bottlenecks, which can be followed to implement high performance, scalable GPU algorithms for a given big data problem. We study the capability of GPU-DAEMON by reviewing the implementation of GPU-DAEMON based algorithms for three different big data problems. Speed-ups as large as 386x (over the sequential version) and 50x (over naive GPU design methods) are observed using the proposed GPU-DAEMON. The GPU-DAEMON template is available at https://github.com/pcdslab/GPU-DAEMON and the source code for GPU-ArraySort, G-MSR and GPU-PCC is available at https://github.com/pcdslab.
Affiliation(s)
- Muaaz Gul Awan
- Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Taban Eslami
- Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Fahad Saeed
- School of Computing and Information Sciences, Florida International University, Miami, FL, USA
5
Awan MG, Saeed F. An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2017; 2017:550-555. [PMID: 28868521 DOI: 10.1145/3107411.3107466] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 10/19/2022]
Abstract
Modern high resolution Mass Spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks, but only a small number of peaks actively contribute to the deduction of peptides. Therefore, pre-processing of MS data to detect noisy and non-useful peaks is an active area of research. Most of the sequential noise-reducing algorithms are impractical to use as a pre-processing step due to their high time complexity. In this paper, we present a GPU based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra. Our proposed algorithm uses novel data structures which optimize the memory and computational operations inside the GPU. These novel data structures include Binary Spectra and Quantized Indexed Spectra (QIS). The former helps in communicating essential information between the CPU and GPU using a minimum amount of data, while the latter enables us to store and process a complex 3-D data structure as a 1-D array structure while maintaining the integrity of the MS data. Our proposed algorithm also takes into account the limited memory of GPUs and switches between in-core and out-of-core modes based upon the size of the input data. G-MSR achieves a peak speed-up of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds. The code for this algorithm is available as GPL open-source software on GitHub at the following link: https://github.com/pcdslab/G-MSR.
Affiliation(s)
- Muaaz Gul Awan
- Department of Computer Science, Western Michigan University, 4601 Campus Drive, Kalamazoo, Michigan 49009
- Fahad Saeed
- Department of Computer Science, Western Michigan University, 4601 Campus Drive, Kalamazoo, Michigan 49009
6
Abstract
Proteogenomic searching is a useful method for identifying novel proteins, annotating genes and detecting peptides unique to an individual genome. The approach, however, can be laborious, as it often requires search segmentation and the use of several unintegrated tools. Furthermore, many proteogenomic efforts have been limited to small genomes, as large genomes can prove impractical due to the required amount of computer memory and computation time. We present Peppy, a software tool designed to perform every necessary task of proteogenomic searches quickly, accurately and automatically. The software generates a peptide database from a genome, tracks peptide loci, matches peptides to MS/MS spectra and assigns confidence values to those matches. Peppy automatically performs decoy database generation, search and analysis to return identifications at the desired false discovery rate threshold. Written in Java for cross-platform execution, the software is fully multithreaded for enhanced speed. The program can run on regular desktop computers, opening the doors of proteogenomic searching to a wider audience of proteomics and genomics researchers. Peppy is available at http://geneffects.com/peppy.
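The idea of returning identifications at a desired false discovery rate can be sketched with a generic target-decoy estimate. This is not Peppy's implementation: the function name `score_cutoff`, the `(score, is_decoy)` tuple layout, and the simple estimate FDR ≈ decoys / targets are all assumptions made for illustration.

```python
def score_cutoff(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms is a list of (score, is_decoy) pairs, where higher scores are
    better. The FDR at a cutoff is estimated as
    (decoy matches above cutoff) / (target matches above cutoff).
    Returns None if no cutoff satisfies the threshold.
    """
    targets = decoys = 0
    cutoff = None
    # Walk from the best score downward, tracking the running estimate.
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


matches = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
print(score_cutoff(matches, max_fdr=0.25))
```

Target matches scoring at or above the returned cutoff would be reported. Tied scores are not handled specially in this sketch.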
Affiliation(s)
- Brian A Risk
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, North Carolina 27599, United States
7
Hatch JJ, McJunkin TR, Hanson C, Scott JR. Automated interpretation of LIBS spectra using a fuzzy logic inference engine. Applied Optics 2012; 51:B155-B164. [PMID: 22410914 DOI: 10.1364/ao.51.00b155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2011] [Accepted: 01/05/2012] [Indexed: 05/31/2023]
Abstract
Automated interpretation of laser-induced breakdown spectroscopy (LIBS) data is necessary due to the plethora of spectra that can be acquired in a relatively short time. However, traditional chemometric and artificial neural network methods that have been employed are not always transparent to a skilled user. A fuzzy logic approach to data interpretation has now been adapted to LIBS spectral interpretation. Fuzzy logic inference rules were developed using methodology that includes data mining methods and operator expertise to differentiate between various copper-containing and stainless steel alloys as well as unknowns. Results using the fuzzy logic inference engine indicate a high degree of confidence in spectral assignment.
Affiliation(s)
- Jeremy J Hatch
- Interfacial Chemistry, Idaho National Laboratory (INL), Idaho Falls, Idaho 83415, USA
8
An Z, Chen Y, Koomen JM, Merkler DJ. A mass spectrometry-based method to screen for α-amidated peptides. Proteomics 2011; 12:173-82. [PMID: 22106059 DOI: 10.1002/pmic.201100327] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 06/20/2011] [Revised: 10/26/2011] [Accepted: 11/03/2011] [Indexed: 01/23/2023]
Abstract
Amidation is a post-translational modification found at the C-terminus of ~50% of all neuropeptide hormones. Cleavage of the C(α)-N bond of a C-terminal glycine yields the α-amidated peptide in a reaction catalyzed by peptidylglycine α-amidating monooxygenase (PAM). The mass of an α-amidated peptide decreases by 58 Da relative to its precursor. The amino acid sequences of an α-amidated peptide and its precursor differ only by the C-terminal glycine meaning that the peptides exhibit similar RP-HPLC properties and tandem mass spectral (MS/MS) fragmentation patterns. Growth of cultured cells in the presence of a PAM inhibitor ensured the coexistence of α-amidated peptides and their precursors. A strategy was developed for precursor and α-amidated peptide pairing (PAPP): LC-MS/MS data of peptide extracts were scanned for peptide pairs that differed by 58 Da in mass, but had similar RP-HPLC retention times. The resulting peptide pairs were validated by checking for similar fragmentation patterns in their MS/MS data prior to identification by database searching or manual interpretation. This approach significantly reduced the number of spectra requiring interpretation, decreasing the computing time required for database searching and enabling manual interpretation of unidentified spectra. Reported here are the α-amidated peptides identified from AtT-20 cells using the PAPP method.
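The pairing step described in this abstract (features that differ by 58 Da but elute at similar retention times) can be sketched as follows. This is an illustration rather than the authors' PAPP code: the `Feature` type, the function `find_pairs`, the tolerances, and the default mass difference of 58.0055 Da (the monoisotopic mass of C2H2O2, an assumed refinement of the nominal 58 Da in the abstract) are all hypothetical choices.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    mass: float  # neutral monoisotopic mass in Da
    rt: float    # retention time in minutes


def find_pairs(
    features: List[Feature],
    delta: float = 58.0055,  # assumed precursor-minus-amide mass difference
    mass_tol: float = 0.02,  # assumed mass tolerance in Da
    rt_tol: float = 1.0,     # assumed retention-time tolerance in minutes
) -> List[Tuple[Feature, Feature]]:
    """Return (precursor, amidated) candidates: mass differs by delta, RT is close."""
    ordered = sorted(features, key=lambda f: f.mass)
    pairs = []
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            diff = heavy.mass - light.mass
            if diff > delta + mass_tol:
                break  # masses are sorted, so later features are too heavy
            if abs(diff - delta) <= mass_tol and abs(heavy.rt - light.rt) <= rt_tol:
                pairs.append((heavy, light))
    return pairs


features = [
    Feature(1000.0, 20.0),
    Feature(1058.0055, 20.3),  # candidate glycine-extended precursor
    Feature(1058.0055, 35.0),  # right mass, wrong retention time
    Feature(1200.0, 20.1),
]
print(find_pairs(features))
```

In the workflow the abstract describes, candidate pairs found this way would still be validated by comparing their MS/MS fragmentation patterns before database searching or manual interpretation.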
Affiliation(s)
- Zhenming An
- Department of Chemistry, University of South Florida, Tampa, FL 33620-5250, USA
9
van den Toorn HWP, Muñoz J, Mohammed S, Raijmakers R, Heck AJR, van Breukelen B. RockerBox: Analysis and Filtering of Massive Proteomics Search Results. J Proteome Res 2011; 10:1420-4. [DOI: 10.1021/pr1010185] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 12/24/2022]
Affiliation(s)
- Henk W. P. van den Toorn
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Netherlands Bioinformatics Centre, The Netherlands
- Javier Muñoz
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Shabaz Mohammed
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Reinout Raijmakers
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Albert J. R. Heck
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Bas van Breukelen
- Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Netherlands Proteomics Centre, The Netherlands
- Netherlands Bioinformatics Centre, The Netherlands
10
Meier JL, Patel AD, Niessen S, Meehan M, Kersten R, Yang JY, Rothmann M, Cravatt BF, Dorrestein PC, Burkart MD, Bafna V. Practical 4'-phosphopantetheine active site discovery from proteomic samples. J Proteome Res 2010; 10:320-9. [PMID: 21067235 DOI: 10.1021/pr100953b] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 12/22/2022]
Abstract
Polyketide and nonribosomal peptides constitute important classes of small molecule natural products. Due to the proven biological activities of these compounds, novel methods for discovery and study of the polyketide synthase (PKS) and nonribosomal peptide synthetase (NRPS) enzymes responsible for their production remain an area of intense interest, and proteomic approaches represent a relatively unexplored avenue. While these enzymes may be distinguished from the proteomic milieu by their use of the 4'-phosphopantetheine (PPant) post-translational modification, proteomic detection of PPant peptides is hindered by their low abundance and labile nature, which leave them unassigned using traditional database searching. Here we address key experimental and computational challenges to facilitate practical discovery of this important post-translational modification during shotgun proteomics analysis using low-resolution ion-trap mass spectrometers. Activity-based enrichment maximizes MS input of PKS/NRPS peptides, while targeted fragmentation detects putative PPant active sites. An improved data analysis pipeline allows experimental identification and validation of these PPant peptides directly from MS² data. Finally, a machine learning approach is developed to directly detect PPant peptides from only MS² fragmentation data. By providing new methods for analysis of an often cryptic post-translational modification, these methods represent a first step toward the study of natural product biosynthesis in proteomic settings.
Affiliation(s)
- Jordan L Meier
- Department of Chemistry and Biochemistry, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California at San Diego, La Jolla, California 92093, USA