1
Haseeb M, Saeed F. GPU-acceleration of the distributed-memory database peptide search of mass spectrometry data. Sci Rep 2023; 13:18713. [PMID: 37907498 PMCID: PMC10618243 DOI: 10.1038/s41598-023-43033-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Accepted: 09/18/2023] [Indexed: 11/02/2023] [Open Access]
Abstract
Database peptide search is the primary computational technique for identifying peptides from mass spectrometry (MS) data. Graphics processing unit (GPU) computing is now ubiquitous in the current generation of high-performance computing (HPC) systems, yet its application in the database peptide search domain remains limited. Part of the reason is the use of sub-optimal algorithms in existing GPU-accelerated methods, which results in significantly inefficient hardware utilization. In this paper, we design and implement a new-age CPU-GPU HPC framework, called GiCOPS, for efficient and complete GPU-acceleration of modern database peptide search algorithms on supercomputers. Our experiments show that GiCOPS achieves between 1.2× and 5× speed improvement over its CPU-only predecessor, HiCOPS, and over 10× improvement over several existing GPU-based database search algorithms for sufficiently large experiment sizes. We further assess and optimize the performance of our framework using the Roofline Model and report near-optimal results for several metrics including computations per second, occupancy rate, memory workload, branch efficiency and shared memory performance. Finally, the CPU-GPU methods and optimizations proposed in our work for complex integer- and memory-bound algorithmic pipelines can also be extended to accelerate existing and future peptide identification algorithms. GiCOPS is now integrated with our umbrella HPC framework HiCOPS and is available at: https://github.com/pcdslab/gicops.
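As a rough illustration of the Roofline analysis mentioned in the abstract above, the sketch below computes the attainable throughput of a kernel as the minimum of the compute roof and the memory roof. The device numbers and kernel intensities are hypothetical placeholders chosen for the example, not measurements from GiCOPS.

```python
def attainable_gops(peak_gops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, memory bandwidth * arithmetic intensity).

    intensity is in operations per byte moved to or from device memory.
    """
    return min(peak_gops, bandwidth_gbs * intensity)


def ridge_point(peak_gops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gops / bandwidth_gbs


# Hypothetical device figures, for illustration only.
PEAK_GOPS = 10_000.0      # peak integer throughput in Gop/s (assumed)
BANDWIDTH_GBS = 500.0     # device memory bandwidth in GB/s (assumed)

for ai in (0.5, 4.0, 20.0, 64.0):
    bound = attainable_gops(PEAK_GOPS, BANDWIDTH_GBS, ai)
    regime = "memory-bound" if ai < ridge_point(PEAK_GOPS, BANDWIDTH_GBS) else "compute-bound"
    print(f"intensity={ai:6.1f} op/B  bound={bound:8.1f} Gop/s  ({regime})")
```

A kernel whose measured throughput sits close to this bound at its own arithmetic intensity is what a "near-optimal" Roofline result usually means.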
Affiliation(s)
- Muhammad Haseeb
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
- Fahad Saeed
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
  - Biomolecular Sciences Institute (BSI), Miami, FL, USA
  - Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA
2
Kumar S, Saeed F. Communication-avoiding micro-architecture to compute Xcorr scores for peptide identification. International Conference on Field-Programmable Logic and Applications: [Proceedings] 2021; 2021:99-103. [PMID: 35440952 PMCID: PMC9015013 DOI: 10.1109/fpl53798.2021.00024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Database algorithms play a crucial part in systems biology studies by identifying proteins from mass spectrometry data. Many of these database search algorithms incur huge computational costs by computing similarity scores for each pair of sparse experimental spectrum and candidate theoretical spectrum vectors. Modern MS instrumentation techniques, which are capable of generating high-resolution spectrometry data, require comparison against an enormous search space, further emphasizing the need for efficient accelerators. Recent research has shown that the overall cost of scoring and deducing peptides is dominated by the communication costs between different hierarchies of memory and processing units. However, these communication costs are seldom considered in accelerator-based architectures, leading to inefficient DRAM accesses and poor data utilization due to irregular memory access patterns. In this paper, we propose a novel communication-avoiding micro-architecture to compute a cross-correlation based similarity score by utilizing an efficient local cache and peptide pre-fetching to minimize DRAM accesses, and a custom-designed peptide broadcast bus to allow input reuse. An efficient bus arbitration scheme was designed and implemented to minimize synchronization cost and exploit parallelism of processing elements. Our simulation results show that the proposed micro-architecture performs on average 24x better than a CPU implementation running on a 3.6 GHz Intel i7-4970 processor with 16 GB memory.
Affiliation(s)
- Sumesh Kumar
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL 33199, USA
- Fahad Saeed
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL 33199, USA
3
Abstract
Mass spectrometry (MS)-based proteomics is currently the most successful approach to measure and compare peptides and proteins in a large variety of biological samples. Modern mass spectrometers, equipped with high-resolution analyzers, provide large amounts of data output. This is the case for shotgun/bottom-up proteomics, which consists of the enzymatic digestion of proteins into peptides that are then measured by MS instruments in data-dependent acquisition (DDA) mode. Dedicated bioinformatic tools and platforms have been developed to face the increasing size and complexity of raw MS data that need to be processed and interpreted for large-scale protein identification and quantification. This chapter illustrates the most popular bioinformatics solutions for the analysis of shotgun MS-proteomics data. A general description is provided of the data preprocessing options and the different search engines available, including practical suggestions on how to optimize the parameters for peptide search, based on hands-on experience.
Affiliation(s)
- Avinash Yadav
  - Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy
- Federica Marini
  - Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy
- Alessandro Cuomo
  - Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy
- Tiziana Bonaldi
  - Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy
4
Verheggen K, Raeder H, Berven FS, Martens L, Barsnes H, Vaudel M. Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows. Mass Spectrometry Reviews 2020; 39:292-306. [PMID: 28902424 DOI: 10.1002/mas.21543] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 04/06/2016] [Accepted: 07/05/2017] [Indexed: 06/07/2023]
Abstract
Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines.
Affiliation(s)
- Kenneth Verheggen
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biochemistry, Ghent University, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Helge Raeder
  - KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
  - Department of Pediatrics, Haukeland University Hospital, Bergen, Norway
- Frode S Berven
  - Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biochemistry, Ghent University, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Harald Barsnes
  - KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
  - Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
  - Computational Biology Unit, Department of Informatics, University of Bergen, Norway
- Marc Vaudel
  - KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
  - Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
  - Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
5
Bittremieux W, Laukens K, Noble WS. Extremely Fast and Accurate Open Modification Spectral Library Searching of High-Resolution Mass Spectra Using Feature Hashing and Graphics Processing Units. J Proteome Res 2019; 18:3792-3799. [PMID: 31448616 PMCID: PMC6886738 DOI: 10.1021/acs.jproteome.9b00291] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 12/20/2022]
Abstract
Open modification searching (OMS) is a powerful search strategy to identify peptides with any type of modification. OMS works by using a very wide precursor mass window to allow modified spectra to match against their unmodified variants, after which the modification types can be inferred from the corresponding precursor mass differences. A disadvantage of this strategy, however, is the large computational cost, because each query spectrum has to be compared against a multitude of candidate peptides. We have previously introduced the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. Here we demonstrate how this candidate selection procedure can be further optimized using graphics processing units. Additionally, we introduce a feature hashing scheme to convert high-resolution spectra to low-dimensional vectors. On the basis of these algorithmic advances, along with low-level code optimizations, the new version of ANN-SoLo is up to an order of magnitude faster than its initial version. This makes it possible to efficiently perform open searches on a large scale to gain a deeper understanding about the protein modification landscape. We demonstrate the computational efficiency and identification performance of ANN-SoLo based on a large data set of the draft human proteome. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo .
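The feature hashing idea described in the abstract above can be sketched in a few lines. This is a generic hashing-trick illustration, not ANN-SoLo's implementation: the bin width, the vector dimension and the choice of SHA-256 as a stable hash are all assumptions made for the example.

```python
import hashlib
import math


def hash_spectrum(peaks, bin_width=0.05, dim=400):
    """Map (m/z, intensity) peaks to a unit-length vector of fixed size `dim`.

    Each peak is assigned to a narrow m/z bin, and the bin index is hashed
    into one of `dim` slots. Colliding bins simply add their intensities.
    """
    vec = [0.0] * dim
    for mz, intensity in peaks:
        bin_index = int(mz / bin_width)
        digest = hashlib.sha256(str(bin_index).encode("ascii")).digest()
        slot = int.from_bytes(digest[:4], "little") % dim
        vec[slot] += intensity
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine(a, b):
    """Dot product of two unit-length vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical peaks: (m/z, intensity).
query = hash_spectrum([(175.119, 3.0), (276.166, 4.0), (389.251, 1.5)])
library = hash_spectrum([(175.119, 2.5), (276.166, 4.2), (502.335, 0.8)])
print(f"similarity = {cosine(query, library):.3f}")
```

The fixed, small dimension is what makes the vectors suitable for an approximate nearest neighbor index: a high-resolution spectrum would otherwise need tens of thousands of bins.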
Affiliation(s)
- Wout Bittremieux
  - Department of Mathematics and Computer Science, University of Antwerp, 2020 Antwerp, Belgium
  - Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
- Kris Laukens
  - Department of Mathematics and Computer Science, University of Antwerp, 2020 Antwerp, Belgium
  - Biomedical Informatics Network Antwerpen (biomina), 2020 Antwerp, Belgium
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Department of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
6
Kim H, Han S, Um JH, Park K. Accelerating a cross-correlation score function to search modifications using a single GPU. BMC Bioinformatics 2018; 19:480. [PMID: 30541430 PMCID: PMC6291950 DOI: 10.1186/s12859-018-2559-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2018] [Accepted: 12/04/2018] [Indexed: 11/13/2022] [Open Access]
Abstract
Background: A cross-correlation (XCorr) score function is one of the most popular score functions utilized to search peptide identifications in databases, and many computer programs, such as SEQUEST, Comet, and Tide, currently use this score function. Recently, the HiXCorr algorithm was developed to speed up this score function for high-resolution spectra by improving the preprocessing step of the tandem mass spectra. However, despite the development of the HiXCorr algorithm, the score function is still slow because candidate peptides increase when post-translational modifications (PTMs) are considered in the search.
Results: We used a graphics processing unit (GPU) to develop the accelerating score function derived by combining Tide’s XCorr score function and the HiXCorr algorithm. Our method is 2.7 and 5.8 times faster than the original Tide and Tide-Hi, respectively, for 50 Da precursor tolerance. Our GPU-based method produced identical scores as did the CPU-based Tide and Tide-Hi.
Conclusion: We propose the accelerating score function to search modifications using a single GPU. The software is available at https://github.com/Tide-for-PTM-search/Tide-for-PTM-search.
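For readers unfamiliar with the score being accelerated, a minimal, unoptimized sketch of the classic XCorr formulation follows: the dot product of the binned observed and theoretical spectra at zero offset, minus the mean dot product over a window of nonzero offsets. The function name, the ±75-bin window and the choice to exclude the zero offset from the background mean are assumptions for illustration; real engines differ in preprocessing and conventions, and this is not the code from Tide, HiXCorr or the cited GPU method.

```python
def xcorr(observed, theoretical, max_shift=75):
    """Cross-correlation score between two equally binned intensity vectors.

    Returns the zero-offset dot product minus the mean dot product over all
    nonzero offsets in [-max_shift, max_shift].
    """
    if len(observed) != len(theoretical):
        raise ValueError("spectra must be binned to the same length")
    n = len(observed)

    def dot_at(shift):
        # Dot product with the observed spectrum displaced by `shift` bins.
        total = 0.0
        for i, t in enumerate(theoretical):
            j = i + shift
            if t != 0.0 and 0 <= j < n:
                total += observed[j] * t
        return total

    background = sum(
        dot_at(s) for s in range(-max_shift, max_shift + 1) if s != 0
    ) / (2 * max_shift)
    return dot_at(0) - background


# Toy spectra: a perfectly matching peak versus one displaced by 10 bins.
obs = [0.0] * 200
obs[100] = 1.0
match = [0.0] * 200
match[100] = 1.0
shifted = [0.0] * 200
shifted[110] = 1.0
print(xcorr(obs, match), xcorr(obs, shifted))
```

Because the background term is the same for every candidate scored against a given observed spectrum, practical implementations fold it into a one-time preprocessing of that spectrum, which leaves a single dot product per candidate.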
Affiliation(s)
- Hyunwoo Kim
  - Research Data Hub Center, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
- Sunggeun Han
  - KISTI Scientific Data School, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
- Jung-Ho Um
  - Research Data Hub Center, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
- Kyongseok Park
  - Super Computing Cloud Center, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
7
Adamo ME, Gerber SA. Tempest: Accelerated MS/MS Database Search Software for Heterogeneous Computing Platforms. Current Protocols in Bioinformatics 2016; 55:13.29.1-13.29.23. [PMID: 27603022 PMCID: PMC5736398 DOI: 10.1002/cpbi.15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
MS/MS database search algorithms derive a set of candidate peptide sequences from in silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU (central processing unit) generates peptide candidates that are asynchronously sent to a discrete GPU (graphics processing unit) to be scored against experimental spectra in parallel. The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. © 2016 by John Wiley & Sons, Inc.
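The two steps named at the start of the abstract above, in silico digestion and theoretical fragmentation, can be sketched as follows. This is a simplified illustration and not Tempest's code: it assumes trypsin specificity (cleave after K or R unless followed by P), no missed cleavages, no modifications, and singly charged b and y ions. The function names and length limits are hypothetical.

```python
import re

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_digest(protein, min_len=6, max_len=40):
    """Split a protein after K or R (not before P); keep peptides in range."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def fragment_mz(peptide):
    """Return singly charged (b_ions, y_ions) m/z lists for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:                 # b1 .. b(n-1), from the N-terminus
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):        # y1 .. y(n-1), from the C-terminus
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


# Made-up protein sequence for demonstration.
for pep in tryptic_digest("MASLTEAKPGGRVDLNEHFKAAWGK", min_len=4):
    b, y = fragment_mz(pep)
    print(pep, [round(x, 3) for x in b[:2]], [round(x, 3) for x in y[:2]])
```

In a CPU-GPU split like the one the abstract describes, candidate generation of this kind would run on the host while the scoring of each candidate's fragment list against observed spectra is offloaded.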
Affiliation(s)
- Mark E Adamo
  - Norris Cotton Cancer Center, Geisel School at Dartmouth, Lebanon, New Hampshire
- Scott A Gerber
  - Norris Cotton Cancer Center, Geisel School at Dartmouth, Lebanon, New Hampshire
  - Department of Genetics, Geisel School at Dartmouth, Lebanon, New Hampshire
  - Department of Biochemistry, Geisel School at Dartmouth, Lebanon, New Hampshire
8
Tabb DL. The SEQUEST family tree. Journal of the American Society for Mass Spectrometry 2015; 26:1814-9. [PMID: 26122518 PMCID: PMC4607603 DOI: 10.1007/s13361-015-1201-3] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 01/30/2015] [Revised: 05/14/2015] [Accepted: 05/19/2015] [Indexed: 06/04/2023]
Abstract
Since its introduction in 1994, SEQUEST has gained many important new capabilities, and a host of successor algorithms have built upon its successes. This Account and Perspective maps the evolution of this important tool and charts the relationships among contributions to the SEQUEST legacy. Many of the changes represented improvements in computing speed by clusters and graphics cards. Mass spectrometry innovations in mass accuracy and activation methods led to shifts in fragment modeling and scoring strategies. These changes, as well as the movement of laboratories and lab members, have led to great diversity among the members of the SEQUEST family.
Affiliation(s)
- David L Tabb
  - School of Medicine, Vanderbilt University, Nashville, TN, 37232-8575, USA
9
Eng JK, Hoopmann MR, Jahan TA, Egertson JD, Noble WS, MacCoss MJ. A deeper look into Comet--implementation and features. Journal of the American Society for Mass Spectrometry 2015; 26:1865-74. [PMID: 26115965 PMCID: PMC4607604 DOI: 10.1007/s13361-015-1179-x] [Citation(s) in RCA: 152] [Impact Index Per Article: 16.9] [Received: 02/23/2015] [Revised: 04/22/2015] [Accepted: 04/27/2015] [Indexed: 05/04/2023]
Abstract
The Comet database search software was initially released as an open source project in late 2012. Prior to that, Comet existed as the University of Washington's academic version of the SEQUEST database search tool. Despite its availability and widespread use over the years, some details about its implementation have not been previously disseminated or are not well understood. We address a few of these details in depth and highlight new features available in the latest release. Comet is freely available for download at http://comet-ms.sourceforge.net or it can be accessed as a component of a number of larger software projects into which it has been incorporated.
Affiliation(s)
- Jimmy K Eng
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Tahmina A Jahan
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Jarrett D Egertson
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Michael J MacCoss
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
10
Shanley MR, Hawley D, Leung S, Zaidi NF, Dave R, Schlosser KA, Bandopadhyay R, Gerber SA, Liu M. LRRK2 Facilitates tau Phosphorylation through Strong Interaction with tau and cdk5. Biochemistry 2015; 54:5198-208. [PMID: 26268594 DOI: 10.1021/acs.biochem.5b00326] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Indexed: 11/29/2022]
Abstract
Leucine-rich repeat kinase 2 (LRRK2) and tau have been identified as risk factors of Parkinson's disease (PD). As LRRK2 is a kinase and tau is hyperphosphorylated in some LRRK2 mutation carriers of PD patients, the obvious hypothesis is that tau could be a substrate of LRRK2. Previous reports that LRRK2 phosphorylates free tau or tubulin-associated tau provide direct support for this proposition. By comparing LRRK2 with cdk5, we show that wild-type LRRK2 and the G2019S mutant phosphorylate free recombinant full-length tau protein with specific activity 480- and 250-fold lower than cdk5, respectively. More strikingly tau binds to wt LRRK2 or the G2019S mutant 140- or 200-fold more strongly than cdk5. The extremely low activity of LRRK2 but strong binding affinity with tau suggests that LRRK2 may facilitate tau phosphorylation as a scaffold protein rather than as a major tau kinase. This hypothesis is further supported by the observation that (i) cdk5 or tau coimmunoprecipitates with endogenous LRRK2 in SH-SY5Y cells, in mouse brain tissue, and in human PBMCs; (ii) knocking down endogenous LRRK2 by its siRNA in SH-SY5Y cells reduces tau phosphorylation at Ser396 and Ser404; (iii) inhibiting LRRK2 kinase activity by its inhibitors has no effect on tau phosphorylation at these two sites; and (iv) overexpressing wt LRRK2, the G2019S mutant, or the D1994A kinase-dead mutant in SH-SY5Y cells has no effect on tau phosphorylation. Our results suggest that LRRK2 facilitates tau phosphorylation indirectly by recruiting tau or cdk5 rather than by directly phosphorylating tau.
Affiliation(s)
- Mary R Shanley
  - Neurology Department, Brigham and Women's Hospital, Harvard Medical School, 65 Landsdowne Street, Fourth Floor, Cambridge, Massachusetts 02139, United States
- Dillon Hawley
  - Neurology Department, Brigham and Women's Hospital, Harvard Medical School, 65 Landsdowne Street, Fourth Floor, Cambridge, Massachusetts 02139, United States
- Shirley Leung
  - Neurology Department, Brigham and Women's Hospital, Harvard Medical School, 65 Landsdowne Street, Fourth Floor, Cambridge, Massachusetts 02139, United States
- Nikhat F Zaidi
  - Neurology Department, Brigham and Women's Hospital, Harvard Medical School, 65 Landsdowne Street, Fourth Floor, Cambridge, Massachusetts 02139, United States
- Roshni Dave
  - Neurology Department, Brigham and Women's Hospital, Harvard Medical School, 65 Landsdowne Street, Fourth Floor, Cambridge, Massachusetts 02139, United States
- Kate A Schlosser
  - Department of Genetics and of Biochemistry, Geisel School of Medicine at Dartmouth, One Medical Center Drive HB-7937, Lebanon, New Hampshire 03756, United States
- Rina Bandopadhyay
  - Reta Lila Weston Institute of Neurological Studies, Department of Molecular Neuroscience, UCL Institute of Neurology, 1 Wakefield Street, London WC1N 1PJ, U.K.
- Scott A Gerber
  - Department of Genetics and of Biochemistry, Geisel School of Medicine at Dartmouth, One Medical Center Drive HB-7937, Lebanon, New Hampshire 03756, United States
- Min Liu
  - Neurology Department, Brigham and Women's Hospital, Harvard Medical School, 65 Landsdowne Street, Fourth Floor, Cambridge, Massachusetts 02139, United States
11
Agarwal P, Owzar K. Next generation distributed computing for cancer research. Cancer Inform 2015; 13:97-109. [PMID: 25983539 PMCID: PMC4412427 DOI: 10.4137/cin.s16344] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 08/19/2014] [Revised: 01/05/2015] [Accepted: 01/06/2015] [Indexed: 11/28/2022] [Open Access]
Abstract
Advances in next generation sequencing (NGS) and mass spectrometry (MS) technologies have provided many new opportunities and angles for extending the scope of translational cancer research while creating tremendous challenges in data management and analysis. The resulting informatics challenge is invariably not amenable to the use of traditional computing models. Recent advances in scalable computing and associated infrastructure, particularly distributed computing for Big Data, can provide solutions for addressing these challenges. In this review, the next generation of distributed computing technologies that can address these informatics problems is described from the perspective of three key components of a computational platform, namely computing, data storage and management, and networking. A broad overview of scalable computing is provided to set the context for a detailed description of Hadoop, a technology that is being rapidly adopted for large-scale distributed computing. A proof-of-concept Hadoop cluster, set up for performance benchmarking of NGS read alignment, is described as an example of how to work with Hadoop. Finally, Hadoop is compared with a number of other current technologies for distributed computing.
Affiliation(s)
- Pankaj Agarwal
  - Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA
- Kouros Owzar
  - Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, USA
12
Li Y, Chi H, Xia L, Chu X. Accelerating the scoring module of mass spectrometry-based peptide identification using GPUs. BMC Bioinformatics 2014; 15:121. [PMID: 24773593 PMCID: PMC4049470 DOI: 10.1186/1471-2105-15-121] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 10/31/2012] [Accepted: 04/23/2014] [Indexed: 11/10/2022] [Open Access]
Abstract
Background: Tandem mass spectrometry-based database searching is currently the main method for protein identification in shotgun proteomics. The explosive growth of protein and peptide databases, which is a result of genome translations, enzymatic digestions, and post-translational modifications (PTMs), is making computational efficiency in database searching a serious challenge. Profile analysis shows that most search engines spend 50%-90% of their total time on the scoring module, and that the spectrum dot product (SDP) based scoring module is the most widely used. As a general purpose and high performance parallel hardware, graphics processing units (GPUs) are promising platforms for speeding up database searches in the protein identification process.
Results: We designed and implemented a parallel SDP-based scoring module on GPUs that exploits the efficient use of GPU registers, constant memory and shared memory. Compared with the CPU-based version, we achieved a 30 to 60 times speedup using a single GPU. We also implemented our algorithm on a GPU cluster and achieved an approximately favorable speedup.
Conclusions: Our GPU-based SDP algorithm can significantly improve the speed of the scoring module in mass spectrometry-based protein identification. The algorithm can be easily implemented in many database search engines such as X!Tandem, SEQUEST, and pFind. A software tool implementing this algorithm is available at http://www.comp.hkbu.edu.hk/~youli/ProteinByGPU.html
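A serial reference version of the spectrum dot product (SDP) that the abstract above parallelizes might look like the sketch below. The bin width, the sparse dictionary representation and the function names are assumptions for illustration; the cited GPU implementation and the scoring modules of X!Tandem, SEQUEST and pFind each add their own preprocessing and normalization.

```python
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0005):
    """Accumulate (m/z, intensity) peaks into sparse integer m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def spectrum_dot_product(experimental, theoretical, bin_width=1.0005):
    """Sum of intensity products over bins present in both spectra."""
    exp_bins = bin_peaks(experimental, bin_width)
    theo_bins = bin_peaks(theoretical, bin_width)
    # Iterate over the smaller spectrum and look up matches in the larger one.
    small, large = sorted((exp_bins, theo_bins), key=len)
    return sum(v * large.get(k, 0.0) for k, v in small.items())


# Hypothetical peaks: only the first peak of each spectrum shares a bin.
experimental = [(100.2, 2.0), (200.4, 3.0)]
theoretical = [(100.3, 1.0), (300.0, 1.0)]
print(spectrum_dot_product(experimental, theoretical))
```

Each experimental-theoretical pair is scored independently of every other pair, which is why this module maps naturally onto thousands of GPU threads.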
Affiliation(s)
- Xiaowen Chu
  - Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
13
Verheggen K, Barsnes H, Martens L. Distributed computing and data storage in proteomics: many hands make light work, and a stronger memory. Proteomics 2013; 14:367-77. [PMID: 24285552 DOI: 10.1002/pmic.201300288] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 07/12/2013] [Revised: 09/09/2013] [Accepted: 09/23/2013] [Indexed: 12/25/2022]
Abstract
Modern-day proteomics generates ever more complex data, causing the requirements on the storage and processing of such data to outgrow the capacity of most desktop computers. To cope with the increased computational demands, distributed architectures have gained substantial popularity in recent years. In this review, we provide an overview of the current techniques for distributed computing, along with examples of how the techniques are currently being employed in the field of proteomics. We thus underline the benefits of distributed computing in proteomics, while also pointing out the potential issues and pitfalls involved.
Affiliation(s)
- Kenneth Verheggen
  - Department of Medical Protein Research, VIB, Ghent, Belgium; Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
14
Gilmore JM, Milloy JA, Gerber SA. SILAC surrogates: rescue of quantitative information for orphan analytes in spike-in SILAC experiments. Anal Chem 2013; 85:10812-9. [PMID: 24152235 DOI: 10.1021/ac4021352] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
Abstract
Super-stable isotope labeling by amino acids in cell culture (Super-SILAC) enables the sensitive and accurate analysis of complex biological tissue and tumor samples by comparison of light peptides observed in biological samples to heavy peptides from SILAC cell culture spike-ins. However, despite the use of multiple cell lines for Super-SILAC spike-in standards, the full protein and peptide profiles of biological samples are not completely represented in these internal standards, leading to orphan analytes for which sample to standard ratios cannot be calculated. This problem is exacerbated in some biological systems, such as muscle tissue, which lack adequate cell culture lines to reflect their complex and idiosyncratic protein profiles, resulting in up to 40% of peptide analytes without heavy cognates. Furthermore, these unquantified orphan analytes may be among the most biologically interesting and significant species, since their presence is not common to cell lines cultured in vitro. Here, we report on the development of a surrogate analysis strategy to interpolate quantitative relationships between peptide species, observed across multiple biological samples, which lack representation within the spike-in standards. The precision and accuracy of this method were assessed by replicate experiments in which surrogate-derived ratios from defined mixtures of spike-in SILAC standard and tissue lysate were compared against traditional SILAC ratios for species where both light and heavy peptide cognates were observed. We demonstrate the robustness of our SILAC surrogates strategy across a variety of murine tissues, including liver, spleen, brain, and muscle. Our approach increases the quantitative coverage and precision within a biological sample by rescuing previously intractable peptide species and applying additional evidence to improve the precision of existing quantifications.
Affiliation(s)
- Jason M Gilmore
  - Department of Genetics, Geisel School of Medicine at Dartmouth, Lebanon, New Hampshire 03756, United States
15
Hoopmann MR, Moritz RL. Current algorithmic solutions for peptide-based proteomics data generation and identification. Curr Opin Biotechnol 2012; 24:31-8. [PMID: 23142544 DOI: 10.1016/j.copbio.2012.10.013] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 07/19/2012] [Revised: 10/08/2012] [Accepted: 10/18/2012] [Indexed: 12/28/2022]
Abstract
Peptide-based proteomic data sets are ever increasing in size and complexity. These data sets provide computational challenges when attempting to quickly analyze spectra and obtain correct protein identifications. Database search and de novo algorithms must consider high-resolution MS/MS spectra and alternative fragmentation methods. Protein inference is a tricky problem when analyzing large data sets of degenerate peptide identifications. Combining multiple algorithms for improved peptide identification puts significant strain on computational systems when investigating large data sets. This review highlights some of the recent developments in peptide and protein identification algorithms for analyzing shotgun mass spectrometry data when encountering the aforementioned hurdles. Also explored are the roles that analytical pipelines, public spectral libraries, and cloud computing play in the evolution of peptide-based proteomics.