1
Potemkin AA, Proskurnin MA, Volkov DS. Noise Filtering Algorithm Using Gaussian Mixture Models for High-Resolution Mass Spectra of Natural Organic Matter. Anal Chem 2024; 96:5455-5461. [PMID: 38530650 DOI: 10.1021/acs.analchem.3c05453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2024]
Abstract
High-resolution mass spectra of natural organic matter (NOM) contain a large number of noise signals. These signals interfere with the correct molecular composition estimation during nontargeted analysis because formula-assignment programs find empirical formulas for such peaks as well. Previously proposed noise filtering methods that utilize the profile of the intensity distribution of mass spectrum peaks rely on a histogram to calculate the intensity threshold value. However, the histogram profile can vary depending on the user settings. In addition, these algorithms are not automated, so they are handled manually. To overcome the mentioned drawbacks, we propose a new algorithm for noise filtering in mass spectra. This filter is based on Gaussian Mixture Models (GMMs), a machine learning method to find the intensity threshold value. The algorithm is completely data-driven and eliminates the need to work with a histogram. It has no customizable parameters and automatically determines the noise level for each individual mass spectrum. The algorithm performance was tested on mass spectra of natural organic matter obtained by averaging a different number of microscans (transients), and the results were compared with other noise filters proposed in the literature. Finally, the effect of this noise filtering approach on the fraction of peaks with assigned formulas was investigated. It was shown that there is always an increase in the identification rate, but the magnitude of the effect changes with the number of microscans averaged. The increase can be as high as 15%.
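The thresholding idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a two-component mixture fitted to log10 intensities with plain EM, and it takes the threshold as the point where both components are equally likely. The function names, the synthetic spectrum, and the iteration count are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, iterations=100):
    """Fit a two-component 1D Gaussian mixture with plain EM (assumed model)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    weights = [0.5, 0.5]
    means = [lo + 0.25 * span, lo + 0.75 * span]
    stds = [span / 4.0, span / 4.0]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean, and standard deviation.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, stds


def noise_threshold(intensities):
    """Intensity at which the noise and signal components are equally likely."""
    logs = [math.log10(i) for i in intensities if i > 0]
    weights, means, stds = fit_two_gaussians(logs)
    noise, signal = sorted(range(2), key=lambda k: means[k])

    def noise_posterior(x):
        pn = weights[noise] * normal_pdf(x, means[noise], stds[noise])
        ps = weights[signal] * normal_pdf(x, means[signal], stds[signal])
        return pn / (pn + ps) if pn + ps > 0 else 0.5

    # Bisection between the two component means.
    a, b = means[noise], means[signal]
    for _ in range(60):
        mid = 0.5 * (a + b)
        if noise_posterior(mid) > 0.5:
            a = mid
        else:
            b = mid
    return 10 ** (0.5 * (a + b))


# Synthetic spectrum: many low-intensity noise peaks plus fewer real peaks.
rng = random.Random(0)
spectrum = [10 ** rng.gauss(3.0, 0.2) for _ in range(900)]
spectrum += [10 ** rng.gauss(5.0, 0.4) for _ in range(100)]
threshold = noise_threshold(spectrum)
kept = [i for i in spectrum if i >= threshold]
print(f"threshold = {threshold:.0f}, kept {len(kept)} of {len(spectrum)} peaks")
```

Because the mixture is fitted directly to the peak intensities, no histogram or bin width enters the calculation, which is the property the abstract emphasizes. The log scale and the fixed number of components are choices of this sketch only.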
Affiliation(s)
- Alexander A Potemkin
  - Chemistry Department of M.V. Lomonosov Moscow State University, Leninskie Gory, 1-3, GSP-1, Moscow 119991, Russia
- Mikhail A Proskurnin
  - Chemistry Department of M.V. Lomonosov Moscow State University, Leninskie Gory, 1-3, GSP-1, Moscow 119991, Russia
- Dmitry S Volkov
  - Chemistry Department of M.V. Lomonosov Moscow State University, Leninskie Gory, 1-3, GSP-1, Moscow 119991, Russia
2
Wilding-McBride D, Dagley LF, Spall SK, Infusini G, Webb AI. Simplifying MS1 and MS2 spectra to achieve lower mass error, more dynamic range, and higher peptide identification confidence on the Bruker timsTOF Pro. PLoS One 2022; 17:e0271025. [PMID: 35797390 PMCID: PMC9262215 DOI: 10.1371/journal.pone.0271025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Accepted: 06/19/2022] [Indexed: 11/24/2022]
Abstract
For bottom-up proteomic analysis, the goal of analytical pipelines that process the raw output of mass spectrometers is to detect, characterise, identify, and quantify peptides. The initial steps of detecting and characterising features in raw data must overcome some considerable challenges. The data present as a sparse array, sometimes containing billions of intensity readings over time. These points represent both signal and chemical or electrical noise. Depending on the biological sample's complexity, tens to hundreds of thousands of peptides may be present in this vast data landscape. For ion mobility-based LC-MS analysis, each peptide comprises a grouping of hundreds of single intensity readings in three dimensions: mass-over-charge (m/z), mobility, and retention time. There is no inherent information about any associations between individual points; whether they represent a peptide or noise must be inferred from their structure. Peptides each have multiple isotopes, different charge states, and a dynamic range of intensity of over six orders of magnitude. Due to the high complexity of most biological samples, peptides often overlap in time and mobility, making it very difficult to tease apart isotopic peaks and to apportion the intensity of each and the contribution of each isotope to the determination of the peptide's monoisotopic mass, which is critical for the peptide's identification. Here we describe four algorithms for the Bruker timsTOF Pro that each play an important role in finding peptide features and determining their characteristics. These algorithms focus on separate characteristics that determine how candidate features are detected in the raw data. The first two algorithms deal with the complexity of the raw data, rapidly clustering raw data into spectra that allow isotopic peaks to be resolved. The third algorithm compensates for saturation of the instrument's detector, thereby recovering lost dynamic range, and lastly, the fourth algorithm increases confidence of peptide identifications by simplification of the fragment spectra. These algorithms are effective in processing raw data to detect features and extract the attributes required for peptide identification, and make an important contribution to an analytical pipeline by detecting features that are higher quality and better segmented from other peptides in close proximity. The software has been developed in Python using Numpy and Pandas and made freely available with an open-source MIT license to facilitate experimentation and further improvement (DOI 10.5281/zenodo.6513126). Data are available via ProteomeXchange with identifier PXD030706.
Affiliation(s)
- Daryl Wilding-McBride
  - The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, Victoria, Australia
- Laura F. Dagley
  - The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, Victoria, Australia
- Sukhdeep K. Spall
  - The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, Victoria, Australia
- Giuseppe Infusini
  - The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, Victoria, Australia
  - Mass Dynamics, Melbourne, Victoria, Australia
- Andrew I. Webb
  - The Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, Victoria, Australia
3
Koehler CJ, Bollineni RC, Thiede B. Application of the half decimal place rule to increase the peptide identification rate. Rapid Communications in Mass Spectrometry 2017; 31:227-233. [PMID: 27806443 DOI: 10.1002/rcm.7780] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/08/2016] [Revised: 10/21/2016] [Accepted: 10/28/2016] [Indexed: 06/06/2023]
Abstract
RATIONALE: Many MS2 spectra in bottom-up proteomics experiments remain unassigned. To improve proteome coverage, we applied the half decimal place rule (HDPR) to remove non-peptidic molecules. The HDPR considers the ratio of the digits after the decimal point to the full molecular mass and results in a relatively small permitted mass window for most peptides. METHODS: First, the HDPR mass filter was calculated for the human and other proteomes. Subsequently, the HDPR was applied to three technical replicates of an in-solution tryptic digest of HeLa cells, which were analysed by liquid chromatography/mass spectrometry (LC/MS) using a quadrupole-orbitrap mass spectrometer (Q Exactive). In addition, the same sample was analysed three times with a fixed exclusion list. The exclusion list was based on only choosing doubly charged ions for fragmentation. RESULTS: The peptide spectrum match (PSM) rate increased by 2-4% applying HDPR filters from 0.1-0.25 Da and 75-150 ppm, respectively. Excluding all MS2 events by applying an HDPR filter of doubly charged ions, we were able to improve PSMs by 0.9% and the PSM rate by 2.5%. CONCLUSIONS: An algorithm to filter precursors based on the HDPR was established to improve the targeting of the acquisition of MS2 spectra in data-dependent acquisition (DDA) experiments. According to our data, a total gain of PSMs of 1-5% might be achievable if the HDPR filter were applied during MS data acquisition. Copyright © 2016 John Wiley & Sons, Ltd.
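As a rough illustration of the rule described above, the following sketch compares the decimal places of a neutral monoisotopic mass with the value expected for a peptide of that mass. The slope of 0.0005 per Da (half a decimal place per 1000 Da) and the 0.2 Da window are assumptions chosen for illustration; the abstract reports windows of 0.1-0.25 Da but does not give the fitted slope. The function names and the precursor list are hypothetical.

```python
def hdpr_deviation(mass, slope=5.0e-4):
    """Signed distance (Da) between observed and expected decimal places."""
    observed = mass % 1.0
    expected = (mass * slope) % 1.0
    # Wrap into [-0.5, 0.5) so that decimals such as 0.98 and 0.02 count as close.
    return (observed - expected + 0.5) % 1.0 - 0.5


def passes_hdpr(mass, window_da=0.2, slope=5.0e-4):
    """True if the decimal places fall inside the permitted peptide window."""
    return abs(hdpr_deviation(mass, slope)) <= window_da


# Hypothetical neutral precursor masses: two peptide-like, two not.
precursors = [1045.5345, 1570.6774, 1000.0500, 842.9500]
peptide_like = [m for m in precursors if passes_hdpr(m)]
print(peptide_like)
```

A real acquisition filter would first convert each observed m/z and charge state to a neutral mass; that step is left out here.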
Affiliation(s)
- Bernd Thiede
  - Department of Biosciences, University of Oslo, Oslo, Norway
4
May JC, McLean JA. Advanced Multidimensional Separations in Mass Spectrometry: Navigating the Big Data Deluge. Annual Review of Analytical Chemistry (Palo Alto, Calif.) 2016; 9:387-409. [PMID: 27306312 PMCID: PMC5763907 DOI: 10.1146/annurev-anchem-071015-041734] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Indexed: 05/10/2023]
Abstract
Hybrid analytical instrumentation constructed around mass spectrometry (MS) is becoming the preferred technique for addressing many grand challenges in science and medicine. From the omics sciences to drug discovery and synthetic biology, multidimensional separations based on MS provide the high peak capacity and high measurement throughput necessary to obtain large-scale measurements used to infer systems-level information. In this article, we describe multidimensional MS configurations as technologies that are big data drivers and review some new and emerging strategies for mining information from large-scale datasets. We discuss the information content that can be obtained from individual dimensions, as well as the unique information that can be derived by comparing different levels of data. Finally, we summarize some emerging data visualization strategies that seek to make highly dimensional datasets both accessible and comprehensible.
Affiliation(s)
- Jody C May
  - Department of Chemistry, Center for Innovative Technology, Vanderbilt Institute for Chemical Biology, Vanderbilt Institute for Integrative Biosystems Research and Education, Vanderbilt University, Nashville, Tennessee 37235
- John A McLean
  - Department of Chemistry, Center for Innovative Technology, Vanderbilt Institute for Chemical Biology, Vanderbilt Institute for Integrative Biosystems Research and Education, Vanderbilt University, Nashville, Tennessee 37235
5
Sadygov RG. Using SEQUEST with theoretically complete sequence databases. Journal of the American Society for Mass Spectrometry 2015; 26:1858-1864. [PMID: 26238326 PMCID: PMC4607654 DOI: 10.1007/s13361-015-1228-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 02/02/2015] [Revised: 05/08/2015] [Accepted: 06/17/2015] [Indexed: 06/04/2023]
Abstract
SEQUEST has long been used to identify peptides/proteins from their tandem mass spectra and protein sequence databases. The algorithm has proven to be hugely successful for its sensitivity and specificity in identifying peptides/proteins, the sequences of which are present in the protein sequence databases. In this work, we attempt a new use for the algorithm by applying it to search a complete list of theoretically possible peptides, a de novo-like sequencing. We used freely available mass spectral data and determined a number of unique peptides as identified by SEQUEST. Using masses of these peptides and the mass accuracy of 0.001 Da, we have created a database of all theoretically possible peptide sequences corresponding to the precursor masses. We used our recently developed algorithm for determining all amino acid compositions corresponding to a mass interval, and used a lexicographic ordering to generate theoretical sequences from the compositions. The newly generated theoretical database was many-fold more complex than the original protein sequence database. We used SEQUEST to search and identify the best matches to the spectra from all theoretically possible peptide sequences. We found that SEQUEST cross-correlation score ranked the correct peptide match among the top sequence matches. The results testify to the high specificity of SEQUEST when combined with the high mass accuracy for intact peptides.
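The two steps named in this abstract, finding every amino acid composition within a mass interval and then listing sequences in lexicographic order, can be sketched with a brute-force enumeration. This is an illustration only, not the author's algorithm: it uses a five-residue alphabet to keep the search small, and the function names are hypothetical. The residue and water masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for a deliberately small alphabet.
RESIDUE_MASS = {
    "A": 71.03711,
    "G": 57.02146,
    "P": 97.05276,
    "S": 87.03203,
    "V": 99.06841,
}
WATER = 18.01056  # added once per peptide


def compositions(target_mass, tol, residues=RESIDUE_MASS):
    """All residue-count dicts whose peptide mass is within tol of target_mass."""
    items = sorted(residues.items())
    low = target_mass - WATER - tol
    high = target_mass - WATER + tol
    found = []

    def extend(index, counts, total):
        if index == len(items):
            if low <= total <= high:
                found.append(dict(counts))
            return
        name, mass = items[index]
        n = 0
        # Try 0, 1, 2, ... copies of this residue while the mass still fits.
        while total + n * mass <= high:
            if n:
                counts[name] = n
            extend(index + 1, counts, total + n * mass)
            n += 1
        counts.pop(name, None)

    extend(0, {}, 0.0)
    return found


def sequences(composition):
    """Yield every distinct sequence of a composition in lexicographic order."""
    counts = dict(composition)
    length = sum(counts.values())

    def build(prefix):
        if len(prefix) == length:
            yield "".join(prefix)
            return
        for name in sorted(counts):
            if counts[name] > 0:
                counts[name] -= 1
                prefix.append(name)
                yield from build(prefix)
                prefix.pop()
                counts[name] += 1

    yield from build([])


# Precursor mass of the tripeptide G-A-S, searched at the 0.001 Da accuracy
# quoted in the abstract.
precursor = RESIDUE_MASS["G"] + RESIDUE_MASS["A"] + RESIDUE_MASS["S"] + WATER
hits = compositions(precursor, tol=0.001)
candidates = [seq for comp in hits for seq in sequences(comp)]
print(hits, candidates)
```

With all twenty residues and realistic precursor masses the number of candidate sequences grows very quickly, which is consistent with the abstract's remark that the theoretical database was many-fold more complex than the protein sequence database.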
Affiliation(s)
- Rovshan G Sadygov
  - Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, TX, 77555, USA
  - Sealy Center for Molecular Medicine, The University of Texas Medical Branch, Galveston, TX, 77555, USA
6
Engaging challenges in glycoproteomics: recent advances in MS-based glycopeptide analysis. Bioanalysis 2015; 7:113-31. [PMID: 25558940 DOI: 10.4155/bio.14.272] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.7] [Indexed: 12/11/2022]
Abstract
The proteomic analysis of glycosylation is uniquely challenging. The numerous and varied biological roles of protein-linked glycans have fueled a tremendous demand for technologies that enable rapid, in-depth structural examination of glycosylated proteins in complex biological systems. In turn, this demand has driven many innovations in wide ranging fields of bioanalytical science. This review will summarize key developments in glycoprotein separation and enrichment, glycoprotein proteolysis strategies, glycopeptide separation and enrichment, the role of mass measurement accuracy in glycopeptide detection, glycopeptide ion dissociation methods for MS/MS, and informatic tools for glycoproteomic analysis. In aggregate, this selection of topics serves to encapsulate the present status of MS-based analytical technologies for engaging the challenges of glycoproteomic analysis.
7
Dittwald P, Nghia VT, Harris GA, Caprioli RM, Van de Plas R, Laukens K, Gambin A, Valkenborg D. Towards automated discrimination of lipids versus peptides from full scan mass spectra. EuPA Open Proteomics 2014; 4:87-100. [PMID: 25414814 PMCID: PMC4234154 DOI: 10.1016/j.euprot.2014.05.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/05/2023]
Abstract
Although physicochemical fractionation techniques play a crucial role in the analysis of complex mixtures, they are not necessarily the best solution to separate specific molecular classes, such as lipids and peptides. Any physical fractionation step such as, for example, those based on liquid chromatography, will introduce its own variation and noise. In this paper we investigate to what extent the high sensitivity and resolution of contemporary mass spectrometers offer viable opportunities for computational separation of signals in full scan spectra. We introduce an automatic method that can discriminate peptide from lipid peaks in full scan mass spectra, based on their isotopic properties. We systematically evaluate which features maximally contribute to a peptide versus lipid classification. The selected features are subsequently used to build a random forest classifier that enables almost perfect separation between lipid and peptide signals without requiring ion fragmentation and classical tandem MS-based identification approaches. The classifier is trained on in silico data, but is also capable of discriminating signals in real world experiments. We evaluate the influence of typical data inaccuracies of common classes of mass spectrometry instruments on the optimal set of discriminant features. Finally, the method is successfully extended towards the classification of individual lipid classes from full scan mass spectral features, based on input data defined by the Lipid Maps Consortium.
Affiliation(s)
- Piotr Dittwald
  - College of Inter-faculty Individual Studies in Mathematics and Natural Sciences, University of Warsaw, Warsaw, Poland; Institute of Informatics, University of Warsaw, Warsaw, Poland
- Vu Trung Nghia
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium; Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- Glenn A Harris
  - Mass Spectrometry Research Center and Departments of Biochemistry, Chemistry, Pharmacology, and Medicine, Vanderbilt University, Nashville, USA
- Richard M Caprioli
  - Mass Spectrometry Research Center and Departments of Biochemistry, Chemistry, Pharmacology, and Medicine, Vanderbilt University, Nashville, USA
- Raf Van de Plas
  - Mass Spectrometry Research Center and Departments of Biochemistry, Chemistry, Pharmacology, and Medicine, Vanderbilt University, Nashville, USA
- Kris Laukens
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium; Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- Anna Gambin
  - Institute of Informatics, University of Warsaw, Warsaw, Poland; Mossakowski Medical Research Centre, Polish Academy of Sciences, Warsaw, Poland
- Dirk Valkenborg
  - Applied Bio & molecular Systems, VITO, Mol, Belgium; Center for Proteomics, Antwerp, Belgium; Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium
8
Sadygov RG. Use of singular value decomposition analysis to differentiate phosphorylated precursors in strong cation exchange fractions. Electrophoresis 2014; 35:3498-503. [PMID: 24913822 DOI: 10.1002/elps.201400053] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/31/2014] [Revised: 04/13/2014] [Accepted: 05/20/2014] [Indexed: 01/26/2023]
Abstract
We studied the use of peak deviations (PDs) for application in phosphoproteomics. Due to the differences in the mass defects, the PDs of samples containing mixtures of phosphorylated and nonphosphorylated peptides show bimodal distributions. The ratios of peak heights accurately predict the phosphoproteome content of a sample. In this work, we apply a signal-processing tool, singular value decomposition, to reveal characteristic features of the phosphorylated, nonphosphorylated, and mixed samples. We show that a simple application of singular value decomposition to the PD matrix (i) detects transitions from mostly phosphorylated samples to mostly nonphosphorylated samples, (ii) reveals modes of low-abundance species in the presence of the high-abundance species (e.g., phosphorylated peptides), and (iii) simplifies the interpretation of the clustering of a covariance matrix obtained from PDs. As the eigenfunctions of the inner-product of the data matrix (made from the PDs) are Hermite functions, we observe a change of sign in the transition from samples enriched in phosphorylated peptides to samples containing fewer phosphorylated peptides. The ordering of the singular values of the data matrix points in the direction of changes to the phosphorylation content. No peptide identifications from a database were used for this study.
Affiliation(s)
- Rovshan G Sadygov
  - Department of Biochemistry and Molecular Biology, Sealy Center for Molecular Medicine, University of Texas Medical Branch, Galveston, TX, USA
9
Benjamin AM, Thompson JW, Soderblom EJ, Geromanos SJ, Henao R, Kraus VB, Moseley MA, Lucas JE. A flexible statistical model for alignment of label-free proteomics data--incorporating ion mobility and product ion information. BMC Bioinformatics 2013; 14:364. [PMID: 24341404 PMCID: PMC3878627 DOI: 10.1186/1471-2105-14-364] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 05/30/2013] [Accepted: 12/06/2013] [Indexed: 11/30/2022]
Abstract
Background: The goal of many proteomics experiments is to determine the abundance of proteins in biological samples, and the variation thereof in various physiological conditions. High-throughput quantitative proteomics, specifically label-free LC-MS/MS, allows rapid measurement of thousands of proteins, enabling large-scale studies of various biological systems. Prior to analyzing these information-rich datasets, raw data must undergo several computational processing steps. We present a method to address one of the essential steps in proteomics data processing: the matching of peptide measurements across samples. Results: We describe a novel method for label-free proteomics data alignment with the ability to incorporate previously unused aspects of the data, particularly ion mobility drift times and product ion information. We compare the results of our alignment method to PEPPeR and OpenMS, and compare alignment accuracy achieved by different versions of our method utilizing various data characteristics. Our method results in increased match recall rates and similar or improved mismatch rates compared to PEPPeR and OpenMS feature-based alignment. We also show that the inclusion of drift time and product ion information results in higher recall rates and more confident matches, without increases in error rates. Conclusions: Based on the results presented here, we argue that the incorporation of ion mobility drift time and product ion information is a worthy pursuit. Alignment methods should be flexible enough to utilize all available data, particularly with recent advancements in experimental separation methods.
Affiliation(s)
- Ashlee M Benjamin
  - Institute for Genome Sciences and Policy, Duke University Medical Center, Durham, North Carolina, USA
10
Kalita M, Kasumov T, Brasier AR, Sadygov RG. Use of theoretical peptide distributions in phosphoproteome analysis. J Proteome Res 2013; 12:3207-14. [PMID: 23731183 PMCID: PMC3758224 DOI: 10.1021/pr4003382] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/21/2023]
Abstract
The high mass accuracy and resolution of modern mass spectrometers provide new opportunities to employ theoretical peptide distributions in large-scale proteomic studies. We used theoretical distributions to study noise filtering and mass measurement errors and to examine mass-based differentiation of phosphorylated and nonphosphorylated peptides. Only the monoisotopic mass of the experimental precursor ion was necessary for this analysis. We found that peak deviations can be used to characterize the modification states of peptides in a sample. When applied to large-scale proteomic data sets, the peak deviation distribution can be used to filter chemical/electronic noise for singly charged species. Using peak deviation distributions, it is possible to separate the phosphorylated peptides from the nonphosphorylated peptides, enabling evaluation of the phosphoproteome content of a sample. Because this approach is simple, with light computational requirements, the analysis of theoretical peptide distributions has a significant potential for application to phosphoproteome analyses. For our studies we used publicly available data sets from three large-scale proteomic studies.
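The abstract does not define the peak deviation, so the sketch below works from an assumed definition: the distance from a monoisotopic mass to the nearest center of the theoretical peptide mass clusters, taken here to be spaced about 1.000495 Da apart. That spacing, the function name, and the example mass are assumptions for illustration. The HPO3 mass is the standard monoisotopic value for phosphorylation.

```python
PEAK_SPACING = 1.000495  # assumed spacing (Da) between peptide mass clusters
PHOSPHO = 79.96633       # monoisotopic mass (Da) added by HPO3


def peak_deviation(mass, spacing=PEAK_SPACING):
    """Distance (Da) from mass to the nearest assumed cluster center."""
    nearest = round(mass / spacing)
    return mass - nearest * spacing


# Hypothetical peptide-like monoisotopic mass, with and without one phosphate.
unmodified = 1045.5345
phosphorylated = unmodified + PHOSPHO

pd_unmodified = peak_deviation(unmodified)
pd_phospho = peak_deviation(phosphorylated)
shift = pd_phospho - pd_unmodified
print(f"PD unmodified = {pd_unmodified:+.4f} Da")
print(f"PD phosphorylated = {pd_phospho:+.4f} Da")
print(f"shift = {shift:+.4f} Da")
```

Under these assumptions each phosphate moves the deviation toward lower values, because HPO3 carries less mass excess than an average peptide fragment of the same nominal mass. A shift of that kind is one way the separation of phosphorylated and nonphosphorylated peptides described in the abstract could arise.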
Affiliation(s)
- Mridul Kalita
  - Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, TX 77573
- Takhar Kasumov
  - Department of Gastroenterology and Hepatology, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195
- Allan R. Brasier
  - Sealy Center for Molecular Medicine, University of Texas Medical Branch, Galveston, TX 77573
- Rovshan G. Sadygov
  - Institute for Translational Sciences, University of Texas Medical Branch, Galveston, TX 77573