1
North N, Enders AA, Cable ML, Allen HC. Array-Based Machine Learning for Functional Group Detection in Electron Ionization Mass Spectrometry. ACS OMEGA 2023; 8:24341-24350. [PMID: 37457446 PMCID: PMC10339417 DOI: 10.1021/acsomega.3c01684] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/13/2023] [Accepted: 05/22/2023] [Indexed: 07/18/2023]
Abstract
Mass spectrometry is a ubiquitous technique capable of complex chemical analysis. The fragmentation patterns that appear in mass spectrometry are an excellent target for artificial intelligence methods to automate and expedite the analysis of data to identify targets such as functional groups. To develop this approach, we trained models on electron ionization (a reproducible hard fragmentation technique) mass spectra so that not only the final model accuracies but also the reasoning behind model assignments could be evaluated. The convolutional neural network (CNN) models were trained on 2D images of the spectra using transfer learning of Inception V3, and the logistic regression models were trained using array-based data and Scikit Learn implementation in Python. Our training dataset consisted of 21,166 mass spectra from the United States' National Institute of Standards and Technology (NIST) Webbook. The data was used to train models to identify functional groups, both specific (e.g., amines, esters) and generalized classifications (aromatics, oxygen-containing functional groups, and nitrogen-containing functional groups). We found that the highest final accuracies on identifying new data were observed using logistic regression rather than transfer learning on CNN models. It was also determined that the mass range most beneficial for functional group analysis is 0-100 m/z. We also found success in correctly identifying functional groups of example molecules selected from both the NIST database and experimental data. Beyond functional group analysis, we also have developed a methodology to identify impactful fragments for the accurate detection of the models' targets. The results demonstrate a potential pathway for analyzing and screening substantial amounts of mass spectral data.
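One way to picture the "array-based data" mentioned in the abstract is to bin a centroided spectrum into a fixed-length intensity vector restricted to the 0-100 m/z window. The sketch below is an assumption about what such a representation could look like, not the authors' preprocessing: the function name `spectrum_to_array`, the unit-mass binning, and the base-peak normalization are all hypothetical choices.

```python
def spectrum_to_array(peaks, max_mz=100):
    """Bin (m/z, intensity) peaks into a vector of length max_mz + 1.

    Hypothetical preprocessing: unit-mass bins, strongest peak kept per bin,
    intensities scaled so the base peak equals 1.0. Peaks outside
    0..max_mz are dropped.
    """
    bins = [0.0] * (max_mz + 1)
    for mz, intensity in peaks:
        idx = int(round(mz))  # nearest nominal mass
        if 0 <= idx <= max_mz:
            bins[idx] = max(bins[idx], float(intensity))
    base_peak = max(bins)
    if base_peak > 0:
        bins = [b / base_peak for b in bins]
    return bins


# Made-up peak list for illustration; the peak at m/z 120 falls outside the window.
example_peaks = [(15, 20), (43, 100), (58, 50), (120, 80)]
features = spectrum_to_array(example_peaks)
```

A vector like `features` could then be passed as one row of the design matrix to a classifier such as logistic regression.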
Affiliation(s)
- Nicole M. North
  - Department of Chemistry & Biochemistry, The Ohio State University, Columbus, Ohio 43210, United States
- Abigail A. Enders
  - Department of Chemistry & Biochemistry, The Ohio State University, Columbus, Ohio 43210, United States
- Morgan L. Cable
  - NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California 91109, United States
- Heather C. Allen
  - Department of Chemistry & Biochemistry, The Ohio State University, Columbus, Ohio 43210, United States
2
Huang X, Chen X, Chen X, Wang W. Screening of Serum miRNAs as Diagnostic Biomarkers for Lung Cancer Using the Minimal-Redundancy-Maximal-Relevance Algorithm and Random Forest Classifier Based on a Public Database. Public Health Genomics 2022; 25:1-9. [PMID: 35917800 DOI: 10.1159/000525316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Accepted: 05/12/2022] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Lung cancer is one of the deadliest cancers, and early diagnosis can substantially improve patient survival. We aimed to screen serum miRNAs as diagnostic biomarkers for patients with lung cancer. METHODS A total of 416 remarkably differentially expressed miRNAs were acquired using the limma package, and a feature ranking was then derived by the minimal-redundancy-maximal-relevance method. An incremental feature selection algorithm with a random forest (RF) classifier was used to choose the top 5 miRNA combination with the optimum predictive performance. The performance of the RF classifier built on the top 5 miRNAs was analyzed using the receiver operating characteristic (ROC) curve. Afterward, the classification effect of the 5-miRNA combination was validated through principal component analysis and hierarchical clustering analysis. Expression of the top 5 miRNAs in lung cancer patients and normal people was analyzed based on the GSE137140 dataset, and their expression was validated by qPCR. Hierarchical clustering analysis was used to analyze the similarity of the 5 miRNA expression profiles. ROC analysis was undertaken on each miRNA. RESULTS We finally acquired the top 5 miRNAs, with a Matthews correlation coefficient of 0.988 and an area under the curve (AUC) of 0.996. The 5 feature miRNAs were capable of distinguishing most cancer patients from normal people. Furthermore, except for miR-6875-5p, which was lowly expressed in lung cancer tissue, the other 4 miRNAs were all highly expressed in cancer patients. Performance analysis revealed that their AUC values were 0.92, 0.96, 0.94, 0.95, and 0.93, respectively. CONCLUSION By and large, the 5 feature miRNAs screened here are anticipated to be effective biomarkers for lung cancer.
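For readers unfamiliar with the Matthews correlation coefficient reported above, it can be computed from the four cells of a binary confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The following is a minimal sketch of that standard formula; the function name and the example counts are illustrative and are not taken from the study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention for
    the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative counts only: 90 true positives, 95 true negatives,
# 5 false positives, 10 false negatives.
mcc_example = matthews_corrcoef(tp=90, tn=95, fp=5, fn=10)
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it more informative than accuracy on imbalanced classes.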
Affiliation(s)
- Xiaoyan Huang
  - Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
- Xiong Chen
  - Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
- Xi Chen
  - Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
- Wenling Wang
  - Medical Oncology, 900 Hospital of the Joint Logistics Team, Fuzhou, China
3
Mitigating Cold Start Problem in Serverless Computing with Function Fusion. SENSORS 2021; 21:s21248416. [PMID: 34960506 PMCID: PMC8704235 DOI: 10.3390/s21248416] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2021] [Revised: 12/04/2021] [Accepted: 12/15/2021] [Indexed: 11/26/2022]
Abstract
As Artificial Intelligence (AI) is becoming ubiquitous in many applications, serverless computing is also emerging as a building block for developing cloud-based AI services. Serverless computing has received much interest because of its simplicity, scalability, and resource efficiency. However, due to the trade-off with resource efficiency, serverless computing suffers from the cold start problem, that is, a latency between a request arrival and function execution. The cold start problem significantly influences the overall response time of workflow that consists of functions because the cold start may occur in every function within the workflow. Function fusion can be one of the solutions to mitigate the cold start latency of a workflow. If two functions are fused into a single function, the cold start of the second function is removed; however, if parallel functions are fused, the workflow response time can be increased because the parallel functions run sequentially even if the cold start latency is reduced. This study presents an approach to mitigate the cold start latency of a workflow using function fusion while considering a parallel run. First, we identify three latencies that affect response time, present a workflow response time model considering the latency, and efficiently find a fusion solution that can optimize the response time on the cold start. Our method shows a response time of 28–86% of the response time of the original workflow in five workflows.
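The fusion trade-off described in this abstract can be illustrated with a toy response-time model. This is a simplified sketch, not the paper's model: the `Fn` class, the two helper functions, the latency values, and the assumption that a fused function pays the largest cold start among its members are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fn:
    name: str
    exec_ms: float  # execution time once the container is warm
    cold_ms: float  # cold start latency paid before the first execution


def sequential_chain(fns, fused=False):
    """Response time of functions that run one after another."""
    if fused:
        # Assumption: the fused unit pays one cold start (the largest one).
        return max(f.cold_ms for f in fns) + sum(f.exec_ms for f in fns)
    return sum(f.cold_ms + f.exec_ms for f in fns)


def parallel_fanout(fns, fused=False):
    """Response time of branches that would normally run in parallel."""
    if fused:
        # Fused branches share one container, so they execute sequentially.
        return max(f.cold_ms for f in fns) + sum(f.exec_ms for f in fns)
    # Unfused branches run concurrently; the slowest branch dominates.
    return max(f.cold_ms + f.exec_ms for f in fns)


a = Fn("a", exec_ms=100, cold_ms=400)
b = Fn("b", exec_ms=150, cold_ms=400)

chain_plain = sequential_chain([a, b])             # both cold starts paid
chain_fused = sequential_chain([a, b], fused=True) # second cold start removed
fan_plain = parallel_fanout([a, b])                # branches overlap
fan_fused = parallel_fanout([a, b], fused=True)    # branches serialized
```

With these made-up numbers, fusing the sequential chain shortens the response time, while fusing the parallel branches lengthens it, which is the tension the abstract describes.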
4
Feng S, Sterzenbach R, Guo X. Deep learning for peptide identification from metaproteomics datasets. J Proteomics 2021; 247:104316. [PMID: 34246788 DOI: 10.1016/j.jprot.2021.104316] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/14/2021] [Revised: 06/02/2021] [Accepted: 06/18/2021] [Indexed: 10/20/2022]
Abstract
Metaproteomics is becoming widely used in microbiome research for gaining insights into the functional state of the microbial community. Current metaproteomics studies are generally based on high-throughput tandem mass spectrometry (MS/MS) coupled with liquid chromatography. In this paper, we proposed a deep-learning-based algorithm, named DeepFilter, for improving peptide identifications from a collection of tandem mass spectra. The key advantage of DeepFilter is that it does not need ad hoc training or fine-tuning as in existing filtering tools. DeepFilter is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DeepFilter. SIGNIFICANCE: The identification of peptides and proteins from MS data involves the computational procedure of searching MS/MS spectra against a predefined protein sequence database and assigning top-scored peptides to spectra. Existing computational tools are still far from being able to extract all the information out of MS/MS data sets acquired from metaproteome samples. Systematic experimental results demonstrate that DeepFilter identified up to 12% and 9% more peptide-spectrum-matches and proteins, respectively, compared with existing filtering algorithms, including Percolator, Q-ranker, PeptideProphet, and iProphet, on marine and soil microbial metaproteome samples with a false discovery rate of 1%. The taxonomic analysis shows that DeepFilter found up to 7%, 10%, and 14% more species from marine, soil, and human gut samples compared with existing filtering algorithms. Therefore, DeepFilter is believed to generalize properly to new, previously unseen peptide-spectrum-matches and can be readily applied in peptide identification from metaproteomics data.
Affiliation(s)
- Shichao Feng
  - Department of Computer Science and Engineering, University of North Texas, TX, USA
- Ryan Sterzenbach
  - Department of Biomedical Engineering, University of North Texas, TX, USA
- Xuan Guo
  - Department of Computer Science and Engineering, University of North Texas, TX, USA
5
Inferring Potential CircRNA–Disease Associations via Deep Autoencoder-Based Classification. Mol Diagn Ther 2020; 25:87-97. [DOI: 10.1007/s40291-020-00499-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Accepted: 10/06/2020] [Indexed: 01/09/2023]
6
Dong N, Spencer DM, Quan Q, Le Blanc JCY, Feng J, Li M, Siu KWM, Chu IK. rPTMDetermine: A Fully Automated Methodology for Endogenous Tyrosine Nitration Validation, Site-Localization, and Beyond. Anal Chem 2020; 92:10768-10776. [DOI: 10.1021/acs.analchem.0c02148] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/30/2022]
Affiliation(s)
- Naiping Dong
  - Department of Chemistry, The University of Hong Kong, Pokfulam, Hong Kong, China
- Daniel M. Spencer
  - Department of Chemistry, The University of Hong Kong, Pokfulam, Hong Kong, China
- Quan Quan
  - Department of Chemistry, The University of Hong Kong, Pokfulam, Hong Kong, China
- Jinwen Feng
  - Department of Chemistry, The University of Hong Kong, Pokfulam, Hong Kong, China
- Mengzhu Li
  - Department of Chemistry, The University of Hong Kong, Pokfulam, Hong Kong, China
- K. W. Michael Siu
  - Department of Chemistry, The University of Hong Kong, Pokfulam, Hong Kong, China
  - Department of Chemistry and Centre for Research in Mass Spectrometry, York University, Toronto, Ontario M3J 1P3, Canada
  - Department of Chemistry and Biochemistry, University of Windsor, Windsor, Ontario N9B 3P4, Canada
- Ivan K. Chu
  - Department of Chemistry, The University of Hong Kong, Pokfulam, Hong Kong, China
7
Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics 2019; 35:4730-4738. [DOI: 10.1093/bioinformatics/btz297] [Citation(s) in RCA: 87] [Impact Index Per Article: 17.4] [Received: 09/03/2018] [Revised: 03/19/2019] [Accepted: 04/18/2019] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Recent studies have shown that microRNAs (miRNAs) play a critical part in several biological processes and that dysregulation of miRNAs is related to numerous complex human diseases. Thus, in-depth research of miRNAs and their association with human diseases can help us to solve many problems. RESULTS Due to the high cost of traditional experimental methods, revealing disease-related miRNAs through computational models is a more economical and efficient way. Considering the disadvantages of previous models, in this paper, we developed adaptive boosting for miRNA-disease association prediction (ABMDA) to predict potential associations between diseases and miRNAs. We balanced the positive and negative samples by performing random sampling based on k-means clustering on negative samples, a process that was quick and easy, and our model had higher efficiency and scalability for large datasets than previous methods. As a boosting technology, ABMDA was able to improve the accuracy of a given learning algorithm by integrating weak classifiers that could score samples to form a strong classifier based on corresponding weights. Here, we used a decision tree as our weak classifier. As a result, the area under the curve (AUC) of global and local leave-one-out cross validation reached 0.9170 and 0.8220, respectively. What is more, the mean and the standard deviation of AUCs achieved 0.9023 and 0.0016, respectively, in 5-fold cross validation. Besides, in the case studies of three important human cancers, 49, 50 and 50 out of the top 50 predicted miRNAs for colon neoplasms, hepatocellular carcinoma and breast neoplasms were confirmed by the databases and experimental literature. AVAILABILITY AND IMPLEMENTATION The code and dataset of ABMDA are freely available at https://github.com/githubcode007/ABMDA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yan Zhao
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Jun Yin
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
8
Wang CC, Chen X, Qu J, Sun YZ, Li JQ. RFSMMA: A New Computational Model to Identify and Prioritize Potential Small Molecule-MiRNA Associations. J Chem Inf Model 2019; 59:1668-1679. [PMID: 30840454 DOI: 10.1021/acs.jcim.9b00129] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Indexed: 12/23/2022]
Abstract
More and more studies found that many complex human diseases occur accompanied by aberrant expression of microRNAs (miRNAs). Small molecule (SM) drugs have been utilized to treat complex human diseases by affecting the expression of miRNAs. Several computational methods were proposed to infer underlying associations between SMs and miRNAs. In our study, we proposed a new calculation model of random forest based small molecule-miRNA association prediction (RFSMMA) which was based on the known SM-miRNA associations in the SM2miR database. RFSMMA utilized the similarity of SMs and miRNAs as features to represent SM-miRNA pairs and further implemented the machine learning algorithm of random forest to train training samples and obtain a prediction model. In RFSMMA, integrating multiple kinds of similarity can avoid the bias of single similarity and choosing more reliable features from original features can represent SM-miRNA pairs more accurately. We carried out cross validations to assess predictive accuracy of RFSMMA. As a result, RFSMMA acquired AUCs of 0.9854, 0.9839, 0.7052, and 0.9917 ± 0.0008 under global leave-one-out cross validation (LOOCV), miRNA-fixed local LOOCV, SM-fixed local LOOCV, and 5-fold cross validation, respectively, under data set 1. Based on data set 2, RFSMMA obtained AUCs of 0.8456, 0.8463, 0.6653, and 0.8389 ± 0.0033 under four cross validations according to the order mentioned above. In addition, we implemented a case study on three common SMs, namely, 5-fluorouracil, 17β-estradiol, and 5-aza-2'-deoxycytidine. Among the top 50 associated miRNAs of these three SMs predicted by RFSMMA, 31, 32, and 28 miRNAs were verified, respectively. Therefore, RFSMMA is shown to be an effective and reliable tool for identifying underlying SM-miRNA associations.
Affiliation(s)
- Chun-Chun Wang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Jia Qu
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Ya-Zhou Sun
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Jian-Qiang Li
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
9
Chen X, Wang CC, Yin J, You ZH. Novel Human miRNA-Disease Association Inference Based on Random Forest. MOLECULAR THERAPY. NUCLEIC ACIDS 2018; 13:568-579. [PMID: 30439645 PMCID: PMC6234518 DOI: 10.1016/j.omtn.2018.10.005] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Received: 05/28/2018] [Revised: 07/30/2018] [Accepted: 10/05/2018] [Indexed: 01/23/2023]
Abstract
Since the first microRNA (miRNA) was discovered, a lot of studies have confirmed the associations between miRNAs and human complex diseases. Besides, obtaining and taking advantage of association information between miRNAs and diseases play an increasingly important role in improving the treatment level for complex diseases. However, due to the high cost of traditional experimental methods, many researchers have proposed different computational methods to predict potential associations between miRNAs and diseases. In this work, we developed a computational model of Random Forest for miRNA-disease association (RFMDA) prediction based on machine learning. The training sample set for RFMDA was constructed according to the human microRNA disease database (HMDD) version (v.)2.0, and the feature vectors to represent miRNA-disease samples were defined by integrating miRNA functional similarity, disease semantic similarity, and Gaussian interaction profile kernel similarity. The Random Forest algorithm was first employed to infer miRNA-disease associations. In addition, a filter-based method was implemented to select robust features from the miRNA-disease feature set, which could efficiently distinguish related miRNA-disease pairs from unrelated miRNA-disease pairs. RFMDA achieved areas under the curve (AUCs) of 0.8891, 0.8323, and 0.8818 ± 0.0014 under global leave-one-out cross-validation, local leave-one-out cross-validation, and 5-fold cross-validation, respectively, which were higher than many previous computational models. To further evaluate the accuracy of RFMDA, we carried out three types of case studies for four human complex diseases. As a result, 43 (esophageal neoplasms), 46 (lymphoma), 47 (lung neoplasms), and 48 (breast neoplasms) of the top 50 predicted disease-related miRNAs were verified by experiments in different kinds of case studies. 
The results of cross-validation and case studies indicated that RFMDA is a reliable model for predicting miRNA-disease associations.
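Several abstracts in this list report the area under the ROC curve (AUC). As a reminder of what that number measures, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc_example = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In a leave-one-out or k-fold setting, the scores passed in would be the held-out predictions collected across folds.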
Affiliation(s)
- Xing Chen
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Chun-Chun Wang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Jun Yin
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Zhu-Hong You
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Ürümqi 830011, China
10
Tu C, Li J, Shen S, Sheng Q, Shyr Y, Qu J. Performance Investigation of Proteomic Identification by HCD/CID Fragmentations in Combination with High/Low-Resolution Detectors on a Tribrid, High-Field Orbitrap Instrument. PLoS One 2016; 11:e0160160. [PMID: 27472422 PMCID: PMC4966894 DOI: 10.1371/journal.pone.0160160] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 04/14/2016] [Accepted: 07/14/2016] [Indexed: 11/24/2022] Open
Abstract
The recently introduced Orbitrap Fusion mass spectrometer permits various types of MS2 acquisition methods. To date, these different MS2 strategies and the optimal data interpretation approach for each have not been adequately evaluated. This study comprehensively investigated four MS2 strategies on the Orbitrap Fusion: HCD-OT (higher-energy-collisional-dissociation with Orbitrap detection), HCD-IT (HCD with ion trap, IT), CID-IT (collision-induced-dissociation with IT) and CID-OT. To achieve an extensive comparison and identify the optimal data interpretation method for each technique, several search engines (SEQUEST and Mascot) and post-processing methods (score-based, PeptideProphet, and Percolator) were assessed for all techniques for the analysis of a human cell proteome. It was found that divergent conclusions could be drawn from the same dataset when different data interpretation approaches were used, which therefore requires a relatively fair comparison among techniques. Percolator was chosen for comparison of techniques because it performs the best among all search engines and MS2 strategies. For the analysis of the human cell proteome using individual MS2 strategies, the highest number of identifications was achieved by HCD-OT, followed by HCD-IT and CID-IT. Based on these results, we concluded that a relatively fair platform for data interpretation is necessary to avoid divergent conclusions from the same dataset, and HCD-OT and HCD-IT may be preferable for protein/peptide identification using Orbitrap Fusion.
Affiliation(s)
- Chengjian Tu
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, Buffalo, United States of America
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, United States of America
  - * E-mail: (JQ); (CT)
- Jun Li
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, Buffalo, United States of America
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, United States of America
- Shichen Shen
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, Buffalo, United States of America
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, United States of America
- Quanhu Sheng
  - Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, United States of America
- Yu Shyr
  - Center for Quantitative Sciences, Vanderbilt University School of Medicine, Nashville, United States of America
- Jun Qu
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, Buffalo, United States of America
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, United States of America
  - * E-mail: (JQ); (CT)
11
Tu C, Sheng Q, Li J, Ma D, Shen X, Wang X, Shyr Y, Yi Z, Qu J. Optimization of Search Engines and Postprocessing Approaches to Maximize Peptide and Protein Identification for High-Resolution Mass Data. J Proteome Res 2015; 14:4662-73. [PMID: 26390080 DOI: 10.1021/acs.jproteome.5b00536] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 12/31/2022]
Abstract
The two key steps for analyzing proteomic data generated by high-resolution MS are database searching and postprocessing. While the two steps are interrelated, studies on their combinatory effects and the optimization of these procedures have not been adequately conducted. Here, we investigated the performance of three popular search engines (SEQUEST, Mascot, and MS Amanda) in conjunction with five filtering approaches, including respective score-based filtering, a group-based approach, local false discovery rate (LFDR), PeptideProphet, and Percolator. A total of eight data sets from various proteomes (e.g., E. coli, yeast, and human) produced by various instruments with high-accuracy survey scan (MS1) and high- or low-accuracy fragment ion scan (MS2) (LTQ-Orbitrap, Orbitrap-Velos, Orbitrap-Elite, Q-Exactive, Orbitrap-Fusion, and Q-TOF) were analyzed. It was found combinations involving Percolator achieved markedly more peptide and protein identifications at the same FDR level than the other 12 combinations for all data sets. Among these, combinations of SEQUEST-Percolator and MS Amanda-Percolator provided slightly better performances for data sets with low-accuracy MS2 (ion trap or IT) and high accuracy MS2 (Orbitrap or TOF), respectively, than did other methods. For approaches without Percolator, SEQUEST-group performs the best for data sets with MS2 produced by collision-induced dissociation (CID) and IT analysis; Mascot-LFDR gives more identifications for data sets generated by higher-energy collisional dissociation (HCD) and analyzed in Orbitrap (HCD-OT) and in Orbitrap Fusion (HCD-IT); MS Amanda-Group excels for the Q-TOF data set and the Orbitrap Velos HCD-OT data set. Therefore, if Percolator was not used, a specific combination should be applied for each type of data set. 
Moreover, a higher percentage of multiple-peptide proteins and lower variation of protein spectral counts were observed when analyzing technical replicates using Percolator-associated combinations; therefore, Percolator enhanced the reliability for both identification and quantification. The analyses were performed using the specific programs embedded in Proteome Discoverer, Scaffold, and an in-house algorithm (BuildSummary). These results provide valuable guidelines for the optimal interpretation of proteomic results and the development of fit-for-purpose protocols under different situations.
Affiliation(s)
- Chengjian Tu
  - Department of Pharmaceutical Sciences, State University of New York, 285 Kapoor Hall, Buffalo, New York 14260, United States
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, New York 14203, United States
- Quanhu Sheng
  - Center for Quantitative Sciences, Vanderbilt University School of Medicine, 2220 Pierce Avenue, Nashville, Tennessee 37232, United States
- Jun Li
  - Department of Pharmaceutical Sciences, State University of New York, 285 Kapoor Hall, Buffalo, New York 14260, United States
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, New York 14203, United States
- Danjun Ma
  - Department of Pharmaceutical Sciences, Eugene Applebaum College of Pharmacy/Health Sciences, Wayne State University, 259 Mack Avenue, Detroit, Michigan 48202, United States
- Xiaomeng Shen
  - Department of Pharmaceutical Sciences, State University of New York, 285 Kapoor Hall, Buffalo, New York 14260, United States
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, New York 14203, United States
- Xue Wang
  - Department of Pharmaceutical Sciences, State University of New York, 285 Kapoor Hall, Buffalo, New York 14260, United States
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, New York 14203, United States
  - Department of Cell Stress Biology, Roswell Park Cancer Institute, Elm and Carlton Streets, Buffalo, New York 14263, United States
- Yu Shyr
  - Center for Quantitative Sciences, Vanderbilt University School of Medicine, 2220 Pierce Avenue, Nashville, Tennessee 37232, United States
- Zhengping Yi
  - Department of Pharmaceutical Sciences, Eugene Applebaum College of Pharmacy/Health Sciences, Wayne State University, 259 Mack Avenue, Detroit, Michigan 48202, United States
- Jun Qu
  - Department of Pharmaceutical Sciences, State University of New York, 285 Kapoor Hall, Buffalo, New York 14260, United States
  - New York State Center of Excellence in Bioinformatics and Life Sciences, 701 Ellicott Street, Buffalo, New York 14203, United States
12
Kelchtermans P, Bittremieux W, De Grave K, Degroeve S, Ramon J, Laukens K, Valkenborg D, Barsnes H, Martens L. Machine learning applications in proteomics research: how the past can boost the future. Proteomics 2014; 14:353-66. [PMID: 24323524 DOI: 10.1002/pmic.201300289] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Received: 07/12/2013] [Revised: 09/24/2013] [Accepted: 10/14/2013] [Indexed: 01/22/2023]
Abstract
Machine learning is a subdiscipline within artificial intelligence that focuses on algorithms that allow computers to learn to solve a (complex) problem from existing data. This ability can be used to generate a solution to a particularly intractable problem, given that enough data are available to train and subsequently evaluate an algorithm on. Since MS-based proteomics has no shortage of complex problems, and since publicly available data are becoming available in ever growing amounts, machine learning is fast becoming a very popular tool in the field. We therefore present here an overview of the different applications of machine learning in proteomics that together cover nearly the entire wet- and dry-lab workflow, and that address key bottlenecks in experiment planning and design, as well as in data processing and analysis.
Affiliation(s)
- Pieter Kelchtermans
  - Department of Medical Protein Research, VIB, Ghent, Belgium; Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium; Flemish Institute for Technological Research (VITO), Boeretang, Mol, Belgium
13
Hanselmann M, Röder J, Köthe U, Renard BY, Heeren RMA, Hamprecht FA. Active learning for convenient annotation and classification of secondary ion mass spectrometry images. Anal Chem 2012; 85:147-55. [PMID: 23157438 DOI: 10.1021/ac3023313] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
Digital staining for the automated annotation of mass spectrometry imaging (MSI) data has previously been achieved using state-of-the-art classifiers such as random forests or support vector machines (SVMs). However, the training of such classifiers requires an expert to label exemplary data in advance. This process is time-consuming and hence costly, especially if the tissue is heterogeneous. In theory, it may be sufficient to only label a few highly representative pixels of an MS image, but it is not known a priori which pixels to select. This motivates active learning strategies in which the algorithm itself queries the expert by automatically suggesting promising candidate pixels of an MS image for labeling. Given a suitable querying strategy, the number of required training labels can be significantly reduced while maintaining classification accuracy. In this work, we propose active learning for convenient annotation of MSI data. We generalize a recently proposed active learning method to the multiclass case and combine it with the random forest classifier. Its superior performance over random sampling is demonstrated on secondary ion mass spectrometry data, making it an interesting approach for the classification of MS images.
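The querying idea in this abstract, where the algorithm suggests which pixel the expert should label next, can be sketched with a generic margin-based uncertainty rule. This is not the method generalized in the paper, whose details the abstract does not give; it is a common stand-in. The function name, the probability table, and the labeled set are hypothetical, and the class probabilities are assumed to come from some already-trained multiclass classifier such as a random forest.

```python
def query_most_uncertain(probabilities, labeled):
    """Pick the unlabeled sample with the smallest top-two probability margin.

    probabilities: list of per-class probability lists, one per pixel.
    labeled: set of indices the expert has already annotated.
    Returns the index to query next, or None if everything is labeled.
    """
    best_idx, best_margin = None, float("inf")
    for i, class_probs in enumerate(probabilities):
        if i in labeled:
            continue
        top_two = sorted(class_probs, reverse=True)[:2]
        margin = top_two[0] - top_two[1]  # small margin = ambiguous pixel
        if margin < best_margin:
            best_idx, best_margin = i, margin
    return best_idx


# Hypothetical three-class predictions for four pixels.
pixel_probs = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.35, 0.25],  # ambiguous
    [0.50, 0.30, 0.20],  # moderately confident
    [0.34, 0.33, 0.33],  # most ambiguous, but already labeled below
]
next_pixel = query_most_uncertain(pixel_probs, labeled={3})
```

After the expert labels the returned pixel, the classifier would be retrained and the query repeated until the labeling budget is spent.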
Affiliation(s)
- Michael Hanselmann
- Heidelberg Collaboratory for Image Processing, Interdisciplinary Center for Scientific Computing, University of Heidelberg, Germany
14
Yadav AK, Kumar D, Dash D. Learning from decoys to improve the sensitivity and specificity of proteomics database search results. PLoS One 2012. [PMID: 23189209 PMCID: PMC3506577 DOI: 10.1371/journal.pone.0050651] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. By modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database searches, high mass accuracy searches, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that an appropriate threshold learnt from decoys can be very effective in improving database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
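For orientation, the basic target-decoy estimate that decoy-based thresholds build on can be sketched as follows. This is not FlexiFDR: it does not fit a regression on decoy hits or treat charge states separately. It only shows the standard idea of estimating FDR as the ratio of decoy to target hits above a score cut-off. The function name `decoy_fdr_threshold`, the input format, and the scores are hypothetical, and ties between scores are ignored for simplicity.

```python
def decoy_fdr_threshold(psms, max_fdr=0.01):
    """Find the most permissive score cut-off that keeps estimated FDR low.

    psms: list of (score, is_decoy) tuples, where a higher score means
        a better match. Hypothetical input format.
    max_fdr: largest acceptable estimated FDR (decoys / targets).
    Returns the lowest score at which the estimated FDR among all hits
    scoring at least that high is still <= max_fdr, or None if no
    cut-off qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything accepted down to this score.
        if targets > 0 and decoys / targets <= max_fdr:
            best = score
    return best


# Illustrative scores only, not data from the paper.
hits = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
cutoff = decoy_fdr_threshold(hits, max_fdr=0.25)
```

A fixed cut-off like this is applied uniformly to all spectra. The abstract above describes a threshold that instead adapts to the data, for example per charge state.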
Affiliation(s)
- Amit Kumar Yadav
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Dhirendra Kumar
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Debasis Dash
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
15
Mattison HA, Stewart T, Zhang J. Applying bioinformatics to proteomics: is machine learning the answer to biomarker discovery for PD and MSA? Mov Disord 2012; 27:1595-7. [PMID: 23115026 DOI: 10.1002/mds.25189] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2012] [Accepted: 08/05/2012] [Indexed: 11/10/2022] Open
Abstract
Bioinformatics tools are increasingly being applied to proteomic data to facilitate the identification of biomarkers and classification of patients. In the June 2012 issue, Ishigami et al. used principal component analysis (PCA) to extract features and support vector machine (SVM) to differentiate and classify cerebrospinal fluid (CSF) samples from two small cohorts of patients diagnosed with either Parkinson's disease (PD) or multiple system atrophy (MSA) based on differences in the patterns of peaks generated with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). PCA accurately segregated patients with PD and MSA from controls when the cohorts were combined, but did not perform well when segregating PD from MSA. On the other hand, SVM, a machine learning classification model, correctly classified the samples from patients with early PD or MSA, and the peak at m/z 6250 was identified as a strong contributor to the ability of SVM to distinguish the proteomic profiles of either cohort when trained on one cohort. This study, while preliminary, provides promising results for the application of bioinformatics tools to proteomic data, an approach that may eventually facilitate the ability of clinicians to differentiate and diagnose closely related parkinsonian disorders.
Affiliation(s)
- Hayley A Mattison
- Department of Pathology, University of Washington, Seattle, Washington 98104, USA
16
Li N, Wu S, Zhang C, Chang C, Zhang J, Ma J, Li L, Qian X, Xu P, Zhu Y, He F. PepDistiller: A quality control tool to improve the sensitivity and accuracy of peptide identifications in shotgun proteomics. Proteomics 2012; 12:1720-5. [PMID: 22623377 DOI: 10.1002/pmic.201100167] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
Affiliation(s)
- Ning Li
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Songfeng Wu
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Chengpu Zhang
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Cheng Chang
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Jiyang Zhang
- College of Mechanical and Electronic Engineering and Automatization; National University of Defense Technology; Changsha P. R. China
- Jie Ma
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Liwei Li
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Xiaohong Qian
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Ping Xu
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Yunping Zhu
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Fuchu He
- State Key Laboratory of Proteomics; Beijing Proteome Research Center; Beijing Institute of Radiation Medicine; Beijing P. R. China
- Institutes of Biomedical Sciences; Fudan University; Shanghai P. R. China
17
Källberg M, Lu H. An improved machine learning protocol for the identification of correct Sequest search results. BMC Bioinformatics 2010; 11:591. [PMID: 21138573 PMCID: PMC3013103 DOI: 10.1186/1471-2105-11-591] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2010] [Accepted: 12/07/2010] [Indexed: 11/18/2022] Open
Abstract
Background: Mass spectrometry has become a standard method by which the proteomic profile of cell or tissue samples is characterized. To fully take advantage of tandem mass spectrometry (MS/MS) techniques in large scale protein characterization studies, robust and consistent data analysis procedures are crucial. In this work we present a machine learning based protocol for the identification of correct peptide-spectrum matches from Sequest database search results, improving on previously published protocols. Results: The developed model improves on published machine learning classification procedures by 6% as measured by the area under the ROC curve. Further, we show how the developed model can be presented as an interpretable tree of additive rules, thereby effectively removing the 'black-box' notion often associated with machine learning classifiers, allowing for comparison with expert rule-of-thumb. Finally, a method for extending the developed peptide identification protocol to give probabilistic estimates of the presence of a given protein is proposed and tested. Conclusions: We demonstrate the construction of a high accuracy classification model for Sequest search results from MS/MS spectra obtained using MALDI ionization. The developed model performs well in identifying correct peptide-spectrum matches and is easily extendable to the protein identification problem. The relative ease with which additional experimental parameters can be incorporated into the classification framework, to give additional discriminatory power, allows for future tailoring of the model to take advantage of information from specific instrument set-ups.
Affiliation(s)
- Morten Källberg
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
18
Delporte C, Van Antwerpen P, Zouaoui Boudjeltia K, Noyon C, Abts F, Métral F, Vanhamme L, Reyé F, Rousseau A, Vanhaeverbeek M, Ducobu J, Nève J. Optimization of apolipoprotein-B-100 sequence coverage by liquid chromatography-tandem mass spectrometry for the future study of its posttranslational modifications. Anal Biochem 2010; 411:129-38. [PMID: 21129357 DOI: 10.1016/j.ab.2010.11.039] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2010] [Revised: 11/24/2010] [Accepted: 11/24/2010] [Indexed: 11/18/2022]
Abstract
Proteomic applications have been increasingly used to study posttranslational modifications of proteins (PTMs). For the purpose of identifying and localizing specific but unknown PTMs on huge proteins, improving their sequence coverage is fundamental. Using liquid chromatography coupled to mass spectrometry (LC-MS/MS), peptide mapping of the native apolipoprotein-B-100 was performed to further document the effects of oxidation. Apolipoprotein-B-100 is the main protein of low-density lipoprotein particles and its oxidation could play a role in atherogenesis. Because it is one of the largest human proteins, the sequence recovery rate of apolipoprotein-B-100 only reached 1% when conventional analysis parameters were used. The different steps of the peptide mapping process, from protein treatment to data analysis, were therefore reappraised and optimized. These optimizations allowed a protein sequence recovery rate of 79%, a rate which has never been achieved previously for such a large human protein. The key points for improving peptide mapping were optimization of the data analysis software; peptide separation by LC; sample preparation; and MS acquisition. The new protocol has allowed us to increase the detection of modified peptides in apolipoprotein-B-100 by a factor of 4. This approach could easily be transferred to any study of PTMs using LC-MS/MS.
19
Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010; 73:2092-123. [PMID: 20816881 DOI: 10.1016/j.jprot.2010.08.009] [Citation(s) in RCA: 370] [Impact Index Per Article: 26.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2010] [Revised: 08/25/2010] [Accepted: 08/25/2010] [Indexed: 12/18/2022]
Abstract
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide to spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from peptide to protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
20
Reichenbach SE, Tian X, Tao Q, Stoll DR, Carr PW. Comprehensive feature analysis for sample classification with comprehensive two‐dimensional LC. J Sep Sci 2010; 33:1365-74. [DOI: 10.1002/jssc.200900859] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Affiliation(s)
- Stephen E. Reichenbach
- Computer Science and Engineering Department, University of Nebraska – Lincoln, Lincoln, NE, USA
- Xue Tian
- Computer Science and Engineering Department, University of Nebraska – Lincoln, Lincoln, NE, USA
- Dwight R. Stoll
- Department of Chemistry, Gustavus Adolphus College, Saint Peter, MN, USA
- Peter W. Carr
- Department of Chemistry, University of Minnesota, Minneapolis, MN, USA
21
van Breukelen B, Georgiou A, Drugan MM, Taouatas N, Mohammed S, Heck AJR. LysNDeNovo: An algorithm enabling de novo sequencing of Lys-N generated peptides fragmented by electron transfer dissociation. Proteomics 2010; 10:1196-201. [DOI: 10.1002/pmic.200900405] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
22
Hanselmann M, Köthe U, Kirchner M, Renard BY, Amstalden ER, Glunde K, Heeren RMA, Hamprecht FA. Toward digital staining using imaging mass spectrometry and random forests. J Proteome Res 2009; 8:3558-67. [PMID: 19469555 DOI: 10.1021/pr900253y] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
We show on imaging mass spectrometry (IMS) data that the Random Forest classifier can be used for automated tissue classification and that it results in predictions with high sensitivities and positive predictive values, even when intersample variability is present in the data. We further demonstrate how Markov Random Fields and vector-valued median filtering can be applied in a post hoc smoothing step to reduce noise effects and further improve the classification results. Our study gives clear evidence that digital staining by means of IMS constitutes a promising complement to chemical staining techniques.
Affiliation(s)
- Michael Hanselmann
- Heidelberg Collaboratory for Image Processing (HCI), Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg, Speyerer Strasse 6, 69115 Heidelberg, Germany
23
Salmi J, Nyman TA, Nevalainen OS, Aittokallio T. Filtering strategies for improving protein identification in high-throughput MS/MS studies. Proteomics 2009; 9:848-60. [PMID: 19160393 DOI: 10.1002/pmic.200800517] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Despite the recent advances in streamlining high-throughput proteomic pipelines using tandem mass spectrometry (MS/MS), reliable identification of peptides and proteins on a larger scale has remained a challenging task, still involving a considerable degree of user interaction. Recently, a number of papers have proposed computational strategies both for distinguishing poor MS/MS spectra prior to database search (pre-filtering) as well as for verifying the peptide identifications made by the search programs (post-filtering). Both of these filtering approaches can be very beneficial to the overall protein identification pipeline, since they can remove a substantial part of the time consuming manual validation work and convert large sets of MS/MS spectra into more reliable and interpretable proteome information. The choice of the filtering method depends both on the properties of the data and on the goals of the experiment. This review discusses the different pre- and post-filtering strategies available to the researchers, together with their relative merits and potential pitfalls. We also highlight some additional research topics, such as spectral denoising and statistical assessment of the identification results, which aim at further improving the coverage and accuracy of high-throughput protein identification studies.
Affiliation(s)
- Jussi Salmi
- Department of Information Technology, University of Turku, Turku, Finland.
24
Brosch M, Yu L, Hubbard T, Choudhary J. Accurate and sensitive peptide identification with Mascot Percolator. J Proteome Res 2009; 8:3176-81. [PMID: 19338334 PMCID: PMC2734080 DOI: 10.1021/pr800982s] [Citation(s) in RCA: 329] [Impact Index Per Article: 21.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Sound scoring methods for sequence database search algorithms such as Mascot and Sequest are essential for sensitive and accurate peptide and protein identifications from proteomic tandem mass spectrometry data. In this paper, we present a software package that interfaces Mascot with Percolator, a well-performing machine learning method for rescoring database search results, and demonstrate it to be amenable to both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures. Mascot Percolator can be readily used as a stand-alone tool or integrated into existing data analysis pipelines.
Affiliation(s)
- Markus Brosch
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
- Lu Yu
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
- Tim Hubbard
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
- Jyoti Choudhary
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
25
Edwards N, Wu X, Tseng CW. An Unsupervised, Model-Free, Machine-Learning Combiner for Peptide Identifications from Tandem Mass Spectra. Clin Proteomics 2009. [DOI: 10.1007/s12014-009-9024-5] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/14/2023] Open
Abstract
As the speed of mass spectrometers, sophistication of sample fractionation, and complexity of experimental designs increase, the volume of tandem mass spectra requiring reliable automated analysis continues to grow. Software tools that quickly, effectively, and robustly determine the peptide associated with each spectrum with high confidence are sorely needed. Currently available tools that postprocess the output of sequence-database search engines use three techniques to distinguish the correct peptide identifications from the incorrect: statistical significance re-estimation, supervised machine learning scoring and prediction, and combining or merging of search engine results. We present a unifying framework that encompasses each of these techniques in a single model-free machine-learning framework that can be trained in an unsupervised manner. The predictor is trained on the fly for each new set of search results without user intervention, making it robust for different instruments, search engines, and search engine parameters. We demonstrate the performance of the technique using mixtures of known proteins and by using shuffled databases to estimate false discovery rates, from data acquired on three different instruments with two different ionization technologies. We show that this approach outperforms machine-learning techniques applied to a single search engine’s output, and demonstrate that combining search engine results provides additional benefit. We show that the performance of the commercial Mascot tool can be bested by the machine-learning combination of two open-source tools X!Tandem and OMSSA, but that the use of all three search engines boosts performance further still. The Peptide identification Arbiter by Machine Learning (PepArML) unsupervised, model-free, combining framework can be easily extended to support an arbitrary number of additional searches, search engines, or specialized peptide–spectrum match metrics for each spectrum data set. 
PepArML is open-source and is available from http://peparml.sourceforge.net.
26
Yun D, Lu H, Yang P, He F. Spectral quality assessment and application for gel-based matrix-assisted laser desorption ionization-time of flight tandem mass spectrometer. Anal Chim Acta 2009; 634:158-65. [DOI: 10.1016/j.aca.2008.12.020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2008] [Revised: 12/03/2008] [Accepted: 12/10/2008] [Indexed: 10/21/2022]
27
Yun D, Lu H, Wang H, Zhang Y, Cheng G, Jin H, Yu Y, Xu Y, Yang P, He F. Iterative Non-m/z-sharing Rule for Confident and Sensitive Protein Identification of Non-shotgun Proteomics. Chin J Chem 2009. [DOI: 10.1002/cjoc.200990053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
28
Teh SK, Zheng W, Lau DP, Huang Z. Spectroscopic diagnosis of laryngeal carcinoma using near-infrared Raman spectroscopy and random recursive partitioning ensemble techniques. Analyst 2009; 134:1232-9. [DOI: 10.1039/b811008e] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
29
Shao C, Sun W, Li F, Yang R, Zhang L, Gao Y. Oscore: a combined score to reduce false negative rates for peptide identification in tandem mass spectrometry analysis. J Mass Spectrom 2009; 44:25-31. [PMID: 18698557 DOI: 10.1002/jms.1466] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/26/2023]
Abstract
Tandem mass spectrometry (MS/MS) has been widely used in proteomics studies. Multiple algorithms have been developed for assessing matches between MS/MS spectra and peptide sequences in databases. However, it is still a challenge to reduce false negative rates without compromising the high confidence of peptide identification. In this study, we developed the score, Oscore, by logistic regression using SEQUEST and AMASS variables to identify fully tryptic peptides. Since these variables showed complicated association with each other, combining them together rather than applying them to a threshold model improved the classification of correct and incorrect peptide identifications. Oscore achieved both a lower false negative rate and a lower false positive rate than PeptideProphet on datasets from 18 known protein mixtures and several proteome-scale samples of different complexity, database size and separation methods. By a three-way comparison among Oscore, PeptideProphet and another logistic regression model which made use of PeptideProphet's variables, the main contributor for the improvement made by Oscore is discussed.
Affiliation(s)
- Chen Shao
- Department of Physiology and Pathophysiology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing, China
30
Zhang J, Ma J, Dou L, Wu S, Qian X, Xie H, Zhu Y, He F. Bayesian nonparametric model for the validation of peptide identification in shotgun proteomics. Mol Cell Proteomics 2008; 8:547-57. [PMID: 19005226 DOI: 10.1074/mcp.m700558-mcp200] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
Abstract
Tandem mass spectrometry combined with database searching allows high-throughput identification of peptides in shotgun proteomics. However, the validation of database search results, for which many solutions have been proposed, is still being improved in aspects such as the sensitivity, specificity, and generalizability of the validation algorithms. Here a Bayesian nonparametric (BNP) model for the validation of database search results was developed that incorporates several popular techniques in statistical learning, including the compression of feature space with a linear discriminant function, flexible nonparametric probability density function estimation for the variable probability structure in complex problems, and the Bayesian method to calculate the posterior probability. Importantly, the BNP model is naturally compatible with the popular target-decoy database search strategy. We tested the BNP model on standard proteins and real, complex sample data sets from multiple MS platforms and compared it with PeptideProphet, the cutoff-based method, and a simple nonparametric method (proposed by us previously). The BNP model outperformed these methods in sensitivity and generalizability on all data sets searched. Some high-quality matches that had been filtered out by other methods were detected and assigned high probability by the BNP model. Thus, the BNP model can validate database search results effectively and extract more information from MS/MS data.
Affiliation(s)
- Jiyang Zhang
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 102206, China
31
Jiang X, Dong X, Ye M, Zou H. Instance Based Algorithm for Posterior Probability Calculation by Target−Decoy Strategy to Improve Protein Identifications. Anal Chem 2008; 80:9326-35. [DOI: 10.1021/ac8017229] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Xinning Jiang
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China, Graduate School of Chinese Academy of Sciences, Beijing 100049, China, and Department of Chemistry, Xixi Campus, Zhejiang University, Hangzhou 310028, China
- Xiaoli Dong
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China, Graduate School of Chinese Academy of Sciences, Beijing 100049, China, and Department of Chemistry, Xixi Campus, Zhejiang University, Hangzhou 310028, China
- Mingliang Ye
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China, Graduate School of Chinese Academy of Sciences, Beijing 100049, China, and Department of Chemistry, Xixi Campus, Zhejiang University, Hangzhou 310028, China
- Hanfa Zou
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China, Graduate School of Chinese Academy of Sciences, Beijing 100049, China, and Department of Chemistry, Xixi Campus, Zhejiang University, Hangzhou 310028, China
32
Reichenbach SE, Carr PW, Stoll DR, Tao Q. Smart templates for peak pattern matching with comprehensive two-dimensional liquid chromatography. J Chromatogr A 2008; 1216:3458-66. [PMID: 18848329 DOI: 10.1016/j.chroma.2008.09.058] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2008] [Revised: 08/29/2008] [Accepted: 09/05/2008] [Indexed: 11/26/2022]
Abstract
Comprehensive two-dimensional liquid chromatography (LCxLC) generates information-rich but complex peak patterns that require automated processing for rapid chemical identification and classification. This paper describes a powerful approach and specific methods for peak pattern matching to identify and classify constituent peaks in data from LCxLC and other multidimensional chemical separations. The approach records a prototypical pattern of peaks with retention times and associated metadata, such as chemical identities and classes, in a template. Then, the template pattern is matched to the detected peaks in subsequent data and the metadata are copied from the template to identify and classify the matched peaks. Smart Templates employ rule-based constraints (e.g., multispectral matching) to increase matching accuracy. Experimental results demonstrate Smart Templates, with the combination of retention-time pattern matching and multispectral constraints, are accurate and robust with respect to changes in peak patterns associated with variable chromatographic conditions.
Affiliation(s)
- Stephen E Reichenbach
- Computer Science and Engineering Department, University of Nebraska-Lincoln, Lincoln, NE 68588-0115, USA.
33
Ding Y, Choi H, Nesvizhskii AI. Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. J Proteome Res 2008; 7:4878-89. [PMID: 18788775 DOI: 10.1021/pr800484x] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Robust statistical validation of peptide identifications obtained by tandem mass spectrometry and sequence database searching is an important task in shotgun proteomics. PeptideProphet is a commonly used computational tool that computes confidence measures for peptide identifications. In this paper, we investigate several limitations of the PeptideProphet modeling approach, including the use of fixed coefficients in computing the discriminant search score and selection of the top scoring peptide assignment per spectrum only. To address these limitations, we describe an adaptive method in which a new discriminant function is learned from the data in an iterative fashion. We extend the modeling framework to go beyond the top scoring peptide assignment per spectrum. We also investigate the effect of clustering the spectra according to their spectrum quality score followed by cluster-specific mixture modeling. The analysis is carried out using data acquired from a mixture of purified proteins on four different types of mass spectrometers, as well as using a complex human serum data set. A special emphasis is placed on the analysis of data generated on high mass accuracy instruments.
Affiliation(s)
- Ying Ding
- Department of Pathology, Department of Biostatistics, and Center for Computational Biology and Medicine, University of Michigan, Ann Arbor, Michigan 48109, USA
|
34
|
Fang J, Dong Y, Williams TD, Lushington GH. Feature selection in validating mass spectrometry database search results. J Bioinform Comput Biol 2008; 6:223-40. [PMID: 18324754 DOI: 10.1142/s0219720008003345] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 03/19/2007] [Revised: 10/11/2007] [Accepted: 10/26/2007] [Indexed: 11/18/2022]
Abstract
Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.
Affiliation(s)
- Jianwen Fang
- Bioinformatics Core Facility & Information and Telecommunication Technology Center, University of Kansas, 2099 Constant Dr., Lawrence, Kansas 66047, USA.
|
35
|
Brosch M, Swamy S, Hubbard T, Choudhary J. Comparison of Mascot and X!Tandem performance for low and high accuracy mass spectrometry and the development of an adjusted Mascot threshold. Mol Cell Proteomics 2008; 7:962-70. [PMID: 18216375 DOI: 10.1074/mcp.m700293-mcp200] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.4] [Indexed: 11/06/2022]
Abstract
It is a major challenge to develop effective sequence database search algorithms to translate molecular weight and fragment mass information obtained from tandem mass spectrometry into high quality peptide and protein assignments. We investigated the peptide identification performance of Mascot and X!Tandem for mass tolerance settings common for low and high accuracy mass spectrometry. We demonstrated that sensitivity and specificity of peptide identification can vary substantially for different mass tolerance settings, but this effect was more significant for Mascot. We present an adjusted Mascot threshold, which allows the user to freely select the best trade-off between sensitivity and specificity. The adjusted Mascot threshold was compared with the default Mascot and X!Tandem scoring thresholds and shown to be more sensitive at the same false discovery rates for both low and high accuracy mass spectrometry data.
Affiliation(s)
- Markus Brosch
- The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
|
36
|
Zhang J, Li J, Liu X, Xie H, Zhu Y, He F. A nonparametric model for quality control of database search results in shotgun proteomics. BMC Bioinformatics 2008; 9:29. [PMID: 18205957 PMCID: PMC2267700 DOI: 10.1186/1471-2105-9-29] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 06/05/2007] [Accepted: 01/21/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. However, it remains unclear how to combine different database search scores to improve the sensitivity of randomized database methods. RESULTS: In this paper, a multivariate nonlinear discriminant function (DF) based on multivariate nonparametric density estimation was used to filter out false-positive database search results with a predictable false positive rate (FPR). Application of this method to control datasets from different instruments (LCQ, LTQ, and LTQ/FT) yielded an estimated FPR close to the actual FPR. As expected, the method was more sensitive when more features were used. Furthermore, the new method was shown to be more sensitive than two commonly used methods on 3 complex sample datasets and 3 control datasets. CONCLUSION: Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.
Affiliation(s)
- Jiyang Zhang
- College of Mechanical & Electronic Engineering and Automatization, National University of Defense Technology, Changsha, 410073, China.
|
37
|
Zhang J, Li J, Xie H, Zhu Y, He F. A new strategy to filter out false positive identifications of peptides in SEQUEST database search results. Proteomics 2008; 7:4036-44. [PMID: 17952874 DOI: 10.1002/pmic.200600929] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/08/2022]
Abstract
Based on the randomized database method and a linear discriminant function (LDF) model, a new strategy to filter out false positive matches in SEQUEST database search results is proposed. Given an experimental MS/MS dataset and a protein sequence database, a randomized database is constructed and merged with the original database. Then, all MS/MS spectra are searched against the combined database. For each expected false positive rate (FPR), LDFs are constructed for different charge states and used to filter out the false positive matches from the normal database. To investigate the error of FPR estimation, the new strategy was applied to a reference dataset, and the estimated FPR was very close to the actual FPR. When applied to a human K562 cell line dataset, a complicated dataset from a real sample, the strategy confirmed more matches than traditional cutoff-based methods at the same estimated FPR. Also, though most of the results confirmed by the LDF model were consistent with those of PeptideProphet, the LDF model could still provide complementary information. These results indicate that the new method can reliably control the FPR of peptide identifications and is more sensitive than traditional cutoff-based methods.
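The filtering step in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each match already carries a single discriminant score (higher is better) and a flag marking hits from the randomized database, and it estimates the FPR with one common estimator, the ratio of randomized-database hits to normal-database hits above the cutoff. The function names `threshold_at_fpr` and `accepted_targets` are hypothetical, and tied scores are not treated specially.

```python
from typing import List, Tuple

Match = Tuple[float, bool]  # (discriminant score, is_decoy); is_decoy marks randomized-database hits


def threshold_at_fpr(matches: List[Match], target_fpr: float) -> float:
    """Return the lowest score cutoff whose estimated FPR is within target_fpr.

    Assumption: FPR is estimated as decoy hits / target hits at or above the cutoff.
    Returns +inf when no cutoff satisfies the target.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    n_target = 0
    n_decoy = 0
    best = float("inf")
    for score, is_decoy in ranked:
        if is_decoy:
            n_decoy += 1
        else:
            n_target += 1
        # Keep lowering the cutoff while the estimate stays within the target.
        if n_target > 0 and n_decoy / n_target <= target_fpr:
            best = score
    return best


def accepted_targets(matches: List[Match], cutoff: float) -> List[float]:
    """Scores of normal-database matches that pass the cutoff."""
    return [s for s, is_decoy in matches if not is_decoy and s >= cutoff]


# Toy data, invented for illustration only.
demo = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, False), (4.0, True)]
cutoff = threshold_at_fpr(demo, target_fpr=0.25)
print(cutoff, accepted_targets(demo, cutoff))
```

In the strategy described above, a separate discriminant function and cutoff would be derived per charge state; the sketch handles a single group of matches.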
Affiliation(s)
- Jiyang Zhang
- College of Mechanical and Electronic Engineering and Automatization, National University of Defense Technology, Changsha, China
|
38
|
Higgs RE, Knierman MD, Gelfanova V, Butler JP, Hale JE. Label-free LC-MS method for the identification of biomarkers. Methods Mol Biol 2008; 428:209-230. [PMID: 18287776 DOI: 10.1007/978-1-59745-117-8_12] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Indexed: 05/25/2023]
Abstract
Pharmaceutical companies and regulatory agencies are pursuing biomarkers as a means to increase the productivity of drug development. Quantifying differential levels of proteins from complex biological samples like plasma or cerebrospinal fluid is one specific approach being used to identify markers of drug action, efficacy, toxicity, etc. Academic investigators are also interested in markers that are diagnostic or prognostic of disease states. We report a comprehensive, fully automated, and label-free approach to relative protein quantification that includes sample preparation, proteolytic protein digestion, LC-MS/MS data acquisition, de-noising, mass and charge state estimation, chromatographic alignment, and peptide quantification via integration of extracted ion chromatograms. Additionally, we describe methods for transformation and normalization of the quantitative peptide levels in multiplexed measurements to improve precision for statistical analysis. Lastly, we outline how the described methods can be used to design and power biomarker discovery studies.
|
39
|
Choi H, Nesvizhskii AI. Semisupervised Model-Based Validation of Peptide Identifications in Mass Spectrometry-Based Proteomics. J Proteome Res 2008; 7:254-65. [DOI: 10.1021/pr070542g] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.4] [Indexed: 11/28/2022]
|
40
|
Nesvizhskii AI, Vitek O, Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods 2007; 4:787-97. [PMID: 17901868 DOI: 10.1038/nmeth1088] [Citation(s) in RCA: 443] [Impact Index Per Article: 26.1] [Indexed: 02/01/2023]
Abstract
The analysis of the large amount of data generated in mass spectrometry-based proteomics experiments represents a significant challenge and is currently a bottleneck in many proteomics projects. In this review we discuss critical issues related to data processing and analysis in proteomics and describe available methods and tools. We place special emphasis on the elaboration of results that are supported by sound statistical arguments.
Affiliation(s)
- Alexey I Nesvizhskii
- University of Michigan, Department of Pathology and Center for Computational Medicine and Biology, Ann Arbor, Michigan 48105, USA
|
41
|
Jiang X, Jiang X, Han G, Ye M, Zou H. Optimization of filtering criterion for SEQUEST database searching to improve proteome coverage in shotgun proteomics. BMC Bioinformatics 2007; 8:323. [PMID: 17761002 PMCID: PMC2040164 DOI: 10.1186/1471-2105-8-323] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 11/14/2006] [Accepted: 08/31/2007] [Indexed: 11/24/2022]
Abstract
Background: In proteomic analysis, MS/MS spectra acquired by a mass spectrometer are assigned to peptides by database searching algorithms such as SEQUEST. The assignments of peptides to MS/MS spectra by the SEQUEST searching algorithm are characterized by several scores, including Xcorr, ΔCn, Sp, Rsp, and matched ion count. Filtering criteria based on several of these scores are used to separate correct identifications from random assignments. However, these filtering criteria have not yet been well optimized. Results: In this study, we implemented a machine learning approach known as a predictive genetic algorithm (GA) to optimize the filtering criteria so as to maximize the number of identified peptides at a fixed false-discovery rate (FDR) for SEQUEST database searching. Because the FDR was determined directly by a decoy database search scheme, the GA-based optimization approach did not require any prior knowledge of the characteristics of the data set, a significant advantage over statistical approaches such as PeptideProphet. Compared with PeptideProphet, the GA-based approach achieved similar performance in distinguishing true from false assignments with only 1/10 of the processing time. Moreover, the GA-based approach can be easily extended to process other database search results because it does not rely on any assumptions about the data. Conclusion: Our results indicate that filtering criteria should be optimized individually for different samples. The newly developed GA-based software provides a convenient and fast way to create tailored optimal criteria for different proteome samples to improve proteome coverage.
Affiliation(s)
- Xinning Jiang
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China
- Xiaogang Jiang
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China
- Guanghui Han
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China
- Mingliang Ye
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China
- Hanfa Zou
- National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China
|
42
|
Lubec G, Afjehi-Sadat L. Limitations and pitfalls in protein identification by mass spectrometry. Chem Rev 2007; 107:3568-84. [PMID: 17645314 DOI: 10.1021/cr068213f] [Citation(s) in RCA: 84] [Impact Index Per Article: 4.9] [Indexed: 12/29/2022]
Affiliation(s)
- Gert Lubec
- Medical University of Vienna, Department of Pediatrics, Waehringer Guertel 18, A-1090 Vienna, Austria.
|
43
|
Leitner A, Foettinger A, Lindner W. Improving fragmentation of poorly fragmenting peptides and phosphopeptides during collision-induced dissociation by malondialdehyde modification of arginine residues. JOURNAL OF MASS SPECTROMETRY : JMS 2007; 42:950-9. [PMID: 17539043 DOI: 10.1002/jms.1233] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 05/15/2023]
Abstract
Despite significant technological and methodological advancements in peptide sequencing by mass spectrometry, analyzing peptides that exhibit only poor fragmentation upon collision-induced dissociation (CID) remains a challenge. A major cause for unfavorable fragmentation is insufficient proton 'mobility' due to charge localization at strongly basic sites, in particular, the guanidine group of arginine. We have recently demonstrated that the conversion of the guanidine group of the arginine side chain by malondialdehyde (MDA) is a convenient tool to reduce the basicity of arginine residues and can have beneficial effects for peptide fragmentation. In the present work, we have focused on peptides that typically yield incomplete sequence information in CID-MS/MS experiments. Energy-resolved tandem MS experiments were carried out on angiotensins and arginine-containing phosphopeptides to study in detail the influence of the modification step on the fragmentation process. MDA modification dramatically improved the fragmentation behavior of peptides that exhibited only one or two dominant cleavages in their unmodified form. Neutral loss of phosphoric acid from phosphopeptides carrying phosphoserine and threonine residues was significantly reduced in favor of a higher abundance of fragment ions. Complementary experiments were carried out on three different instrumental platforms (triple-quadrupole, 3D ion trap, quadrupole-linear ion trap hybrid) to ascertain that the observation is a general effect.
Affiliation(s)
- Alexander Leitner
- Department of Analytical Chemistry and Food Chemistry, University of Vienna, Waehringer Strasse 38, 1090 Vienna, Austria.
|
44
|
Higgs RE, Knierman MD, Freeman AB, Gelbert LM, Patil ST, Hale JE. Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. J Proteome Res 2007; 6:1758-67. [PMID: 17397207 DOI: 10.1021/pr0605320] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.3] [Indexed: 11/28/2022]
Abstract
We present a wrapper-based approach to estimate and control the false discovery rate for peptide identifications using the outputs from multiple commercially available MS/MS search engines. Features of the approach include the flexibility to combine output from multiple search engines with sequence and spectral derived features in a flexible classification model to produce a score associated with correct peptide identifications. This classification model score from a reversed database search is taken as the null distribution for estimating p-values and false discovery rates using a simple and established statistical procedure. Results from 10 analyses of rat sera on an LTQ-FT mass spectrometer indicate that the method is well calibrated for controlling the proportion of false positives in a set of reported peptide identifications while correctly identifying more peptides than rule-based methods using one search engine alone.
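The idea of treating reversed-database scores as a null distribution can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the abstract does not name the "simple and established statistical procedure", so the Benjamini-Hochberg adjustment is used here as a stand-in, and the add-one correction in the empirical p-value is likewise an assumption. The function names are hypothetical.

```python
from bisect import bisect_left
from typing import List


def empirical_p_values(target_scores: List[float], decoy_scores: List[float]) -> List[float]:
    """p-value of each target score against the decoy (reversed-database) null.

    Assumption: higher scores are better; p = (decoys scoring >= s, plus 1) / (decoys + 1).
    """
    null = sorted(decoy_scores)
    n = len(null)
    p_values = []
    for s in target_scores:
        n_ge = n - bisect_left(null, s)  # number of decoy scores >= s
        p_values.append((n_ge + 1) / (n + 1))
    return p_values


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Toy scores, invented for illustration only.
p = empirical_p_values([5.0, 2.5, 0.5], decoy_scores=[1.0, 2.0, 3.0, 4.0])
q = benjamini_hochberg(p)
print(p, q)
```

Identifications whose q-value falls below a chosen level would then be reported; with so few decoy scores the p-values are coarse, which is why real analyses rely on large reversed-database searches.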
Affiliation(s)
- Richard E Higgs
- Lilly Research Laboratories, MS 1533, Lilly Corporate Center, Indianapolis, Indiana 46285, USA.
|
45
|
Current literature in mass spectrometry. JOURNAL OF MASS SPECTROMETRY : JMS 2006; 41:1654-1665. [PMID: 17136768 DOI: 10.1002/jms.959] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
|
46
|
Abstract
To date, proteomics approaches have aimed to identify either novel proteins or changes in protein expression/modification in various organisms under normal or disease conditions. One major aim of functional proteomics is to identify the biological properties of proteins in a given context; however, forward proteomics approaches alone cannot achieve this goal. Indeed, with the increasing success of such proteomics-based research strategies and the resulting growth in the number of identified proteins with unknown molecular functions, approaches allowing for systematic analyses of protein functions are desired. In this review, we describe how forward and reverse proteomics approaches complement each other in establishing protein functions. This dual strategy requires a data integration loop which allows for systematic characterization of protein function(s). The details of the integrative process combining both in silico and experimental resources and tools are presented. Altogether, we believe that the integration of forward and reverse proteomics approaches supported by bioinformatics will provide an efficient path towards systems biology.
Affiliation(s)
- Sandrine Palcy
- Organelle Signaling laboratory, Department of Surgery, McGill University, Montreal, Quebec, Canada.
|
47
|
Jaffe JD, Mani DR, Leptos KC, Church GM, Gillette MA, Carr SA. PEPPeR, a platform for experimental proteomic pattern recognition. Mol Cell Proteomics 2006; 5:1927-41. [PMID: 16857664 PMCID: PMC2649820 DOI: 10.1074/mcp.m600222-mcp200] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.4] [Indexed: 11/06/2022]
Abstract
Quantitative proteomics holds considerable promise for elucidation of basic biology and for clinical biomarker discovery. However, it has been difficult to fulfill this promise due to over-reliance on identification-based quantitative methods and problems associated with chromatographic separation reproducibility. Here we describe new algorithms termed "Landmark Matching" and "Peak Matching" that greatly reduce these problems. Landmark Matching performs time base-independent propagation of peptide identities onto accurate mass LC-MS features in a way that leverages historical data derived from disparate data acquisition strategies. Peak Matching builds upon Landmark Matching by recognizing identical molecular species across multiple LC-MS experiments in an identity-independent fashion by clustering. We have bundled these algorithms together with other algorithms, data acquisition strategies, and experimental designs to create a Platform for Experimental Proteomic Pattern Recognition (PEPPeR). These developments enable use of established statistical tools previously limited to microarray analysis for treatment of proteomics data. We demonstrate that the proposed platform can be calibrated across 2.5 orders of magnitude and can perform robust quantification of ratios in both simple and complex mixtures with good precision and error characteristics across multiple sample preparations. We also demonstrate de novo marker discovery based on statistical significance of unidentified accurate mass components that changed between two mixtures. These markers were subsequently identified by accurate mass-driven MS/MS acquisition and demonstrated to be contaminant proteins associated with known proteins whose concentrations were designed to change between the two mixtures. These results have provided a real world validation of the platform for marker discovery.
Affiliation(s)
- Jacob D Jaffe
- The Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, 02142, USA
|