1
|
Alves G, Ogurtsov AY, Porterfield H, Maity T, Jenkins LM, Sacks DB, Yu YK. Multiplexing the Identification of Microorganisms via Tandem Mass Tag Labeling Augmented by Interference Removal through a Novel Modification of the Expectation Maximization Algorithm. Journal of the American Society for Mass Spectrometry 2024; 35:1138-1155. [PMID: 38740383 PMCID: PMC11157548 DOI: 10.1021/jasms.3c00445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Revised: 04/12/2024] [Accepted: 04/17/2024] [Indexed: 05/16/2024]
Abstract
Having fast, accurate, and broad spectrum methods for the identification of microorganisms is of paramount importance to public health, research, and safety. Bottom-up mass spectrometer-based proteomics has emerged as an effective tool for the accurate identification of microorganisms from microbial isolates. However, one major hurdle that limits the deployment of this tool for routine clinical diagnosis, and other areas of research such as culturomics, is the instrument time required for the mass spectrometer to analyze a single sample, which can take ∼1 h per sample, when using mass spectrometers that are presently used in most institutes. To address this issue, in this study, we employed, for the first time, tandem mass tags (TMTs) in multiplex identifications of microorganisms from multiple TMT-labeled samples in one MS/MS experiment. A difficulty encountered when using TMT labeling is the presence of interference in the measured intensities of TMT reporter ions. To correct for interference, we employed in the proposed method a modified version of the expectation maximization (EM) algorithm that redistributes the signal from ion interference back to the correct TMT-labeled samples. We have evaluated the sensitivity and specificity of the proposed method using 94 MS/MS experiments (covering a broad range of protein concentration ratios across TMT-labeled channels and experimental parameters), containing a total of 1931 true positive TMT-labeled channels and 317 true negative TMT-labeled channels. The results of the evaluation show that the proposed method has an identification sensitivity of 93-97% and a specificity of 100% at the species level. 
Furthermore, as a proof of concept, using an in-house-generated data set composed of some of the most common urinary tract pathogens, we demonstrated that by using the proposed method the mass spectrometer time required per sample, using a 1 h LC-MS/MS run, can be reduced to 10 and 6 min when samples are labeled with TMT-6 and TMT-10, respectively. The proposed method can also be used along with Orbitrap mass spectrometers that have faster MS/MS acquisition rates, like the recently released Orbitrap Astral mass spectrometer, to further reduce the mass spectrometer time required per sample.
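The per-sample figures quoted in this abstract follow from dividing one LC-MS/MS run across the multiplexed channels. A minimal sketch of that arithmetic, assuming every TMT channel carries exactly one sample and ignoring labeling and sample-preparation overhead (the function name is illustrative and does not come from the paper):

```python
def instrument_minutes_per_sample(run_minutes: float, channels: int) -> float:
    """Instrument time attributable to each sample when `channels`
    TMT-labeled samples share a single LC-MS/MS run."""
    if channels < 1:
        raise ValueError("channels must be at least 1")
    return run_minutes / channels


# A 1 h run shared by TMT-6 and TMT-10 labeled samples.
tmt6 = instrument_minutes_per_sample(60, 6)    # 10.0 minutes per sample
tmt10 = instrument_minutes_per_sample(60, 10)  # 6.0 minutes per sample
```

The same division applies to shorter runs on faster instruments, which is the basis for the abstract's closing remark about further reductions.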
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| | - Aleksey Y. Ogurtsov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| | - Harry Porterfield
- Department of Laboratory Medicine, Clinical Center, National Institutes of Health, Bethesda, Maryland 20892, United States
| | - Tapan Maity
- Laboratory of Cell Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, United States
| | - Lisa M. Jenkins
- Laboratory of Cell Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, United States
| | - David B. Sacks
- Department of Laboratory Medicine, Clinical Center, National Institutes of Health, Bethesda, Maryland 20892, United States
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| |
|
2
|
Alves G, Ogurtsov A, Karlsson R, Jaén-Luchoro D, Piñeiro-Iglesias B, Salvà-Serra F, Andersson B, Moore ERB, Yu YK. Identification of Antibiotic Resistance Proteins via MiCId's Augmented Workflow. A Mass Spectrometry-Based Proteomics Approach. Journal of the American Society for Mass Spectrometry 2022; 33:917-931. [PMID: 35500907 PMCID: PMC9164240 DOI: 10.1021/jasms.1c00347] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 11/17/2021] [Revised: 02/17/2022] [Accepted: 02/18/2022] [Indexed: 06/01/2023]
Abstract
Fast and accurate identification of pathogenic bacteria, along with their associated antibiotic resistance proteins, is of paramount importance for patient treatment and public health. To meet this goal from the mass spectrometry side, we have augmented the previously published Microorganism Classification and Identification (MiCId) workflow with this capability. To evaluate the performance of the augmented workflow, we used MS/MS datafiles from samples of 10 antibiotic-resistant bacterial strains belonging to three different species: Escherichia coli, Klebsiella pneumoniae, and Pseudomonas aeruginosa. The evaluation shows that MiCId's workflow has a sensitivity of around 85% (with a lower bound at about 72%) and a precision greater than 95% in identifying antibiotic resistance proteins. In addition to having high sensitivity and precision, MiCId's workflow is fast and portable, making it a valuable tool for rapid identification of bacteria as well as detection of their antibiotic resistance proteins. It performs microorganismal identifications, protein identifications, sample biomass estimates, and antibiotic resistance protein identifications in 6-17 min per MS/MS sample using computing resources that are available in most desktop and laptop computers. We have also demonstrated another use of MiCId's workflow. Using MS/MS data sets from samples of two bacterial clonal isolates, one antibiotic-sensitive and the other multidrug-resistant, we applied MiCId's workflow to investigate possible mechanisms of antibiotic resistance in these pathogenic bacteria; the results showed that MiCId's conclusions agree with the published study. The new version of MiCId (v.07.01.2021) is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| | - Aleksey Ogurtsov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| | - Roger Karlsson
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, 40530 Gothenburg, Sweden
- Department of Clinical Microbiology, Sahlgrenska University Hospital, 40234 Gothenburg, Sweden
- Center for Antibiotic Resistance Research (CARe), University of Gothenburg, 40016 Gothenburg, Sweden
- Nanoxis Consulting AB, 40234 Gothenburg, Sweden
| | - Daniel Jaén-Luchoro
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, 40530 Gothenburg, Sweden
- Center for Antibiotic Resistance Research (CARe), University of Gothenburg, 40016 Gothenburg, Sweden
- Culture Collection University of Gothenburg (CCUG), Sahlgrenska Academy of the University of Gothenburg, 40234 Gothenburg, Sweden
| | - Beatriz Piñeiro-Iglesias
- Department of Clinical Microbiology, Sahlgrenska University Hospital, 40234 Gothenburg, Sweden
- Center for Antibiotic Resistance Research (CARe), University of Gothenburg, 40016 Gothenburg, Sweden
| | - Francisco Salvà-Serra
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, 40530 Gothenburg, Sweden
- Department of Clinical Microbiology, Sahlgrenska University Hospital, 40234 Gothenburg, Sweden
- Center for Antibiotic Resistance Research (CARe), University of Gothenburg, 40016 Gothenburg, Sweden
- Culture Collection University of Gothenburg (CCUG), Sahlgrenska Academy of the University of Gothenburg, 40234 Gothenburg, Sweden
- Microbiology, Department of Biology, University of the Balearic Islands, 07122 Palma de Mallorca, Spain
| | - Björn Andersson
- Bioinformatics Core Facility at Sahlgrenska Academy, University of Gothenburg, Box 413, 40530 Gothenburg, Sweden
| | - Edward R. B. Moore
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, 40530 Gothenburg, Sweden
- Department of Clinical Microbiology, Sahlgrenska University Hospital, 40234 Gothenburg, Sweden
- Center for Antibiotic Resistance Research (CARe), University of Gothenburg, 40016 Gothenburg, Sweden
- Culture Collection University of Gothenburg (CCUG), Sahlgrenska Academy of the University of Gothenburg, 40234 Gothenburg, Sweden
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
| |
|
3
|
Kondori N, Kurtovic A, Piñeiro-Iglesias B, Salvà-Serra F, Jaén-Luchoro D, Andersson B, Alves G, Ogurtsov A, Thorsell A, Fuchs J, Tunovic T, Kamenska N, Karlsson A, Yu YK, Moore ERB, Karlsson R. Mass Spectrometry Proteotyping-Based Detection and Identification of Staphylococcus aureus, Escherichia coli, and Candida albicans in Blood. Front Cell Infect Microbiol 2021; 11:634215. [PMID: 34381737 PMCID: PMC8350517 DOI: 10.3389/fcimb.2021.634215] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/27/2020] [Accepted: 07/09/2021] [Indexed: 12/12/2022]
Abstract
Bloodstream infections (BSIs), the presence of microorganisms in blood, are potentially serious conditions that can quickly develop into sepsis and life-threatening situations. When assessing proper treatment, rapid diagnosis is the key; besides clinical judgement performed by attending physicians, supporting microbiological tests typically are performed, often requiring microbial isolation and culturing steps, which increases the time required for confirming positive cases of BSI. The additional waiting time forces physicians to prescribe broad-spectrum antibiotics and empirically based treatments, before determining the precise cause of the disease. Thus, alternative and more rapid cultivation-independent methods are needed to improve clinical diagnostics, supporting prompt and accurate treatment and reducing the development of antibiotic resistance. In this study, a culture-independent workflow for pathogen detection and identification in blood samples was developed, using peptide biomarkers and applying bottom-up proteomics analyses, i.e., so-called "proteotyping". To demonstrate the feasibility of detection of blood infectious pathogens, using proteotyping, Escherichia coli and Staphylococcus aureus were included in the study, as the most prominent bacterial causes of bacteremia and sepsis, as well as Candida albicans, one of the most prominent causes of fungemia. Model systems including spiked negative blood samples, as well as positive blood cultures, without further culturing steps, were investigated. Furthermore, an experiment designed to determine the incubation time needed for correct identification of the infectious pathogens in blood cultures was performed. The results for the spiked negative blood samples showed that proteotyping was 100- to 1,000-fold more sensitive, in comparison with the MALDI-TOF MS-based approach. Furthermore, in the analyses of ten positive blood cultures each of E. coli and S. aureus, both the MALDI-TOF MS-based and proteotyping approaches were successful in the identification of E. coli, although only proteotyping could identify S. aureus correctly in all samples. Compared with the MALDI-TOF MS-based approaches, shotgun proteotyping demonstrated higher sensitivity and accuracy, and required significantly shorter incubation time before detection and identification of the correct pathogen could be accomplished.
Affiliation(s)
- Nahid Kondori
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Department of Clinical Microbiology, Sahlgrenska University Hospital, Gothenburg, Sweden
| | - Amra Kurtovic
- Department of Clinical Microbiology, Sahlgrenska University Hospital, Gothenburg, Sweden
| | | | - Francisco Salvà-Serra
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Department of Clinical Microbiology, Sahlgrenska University Hospital, Gothenburg, Sweden
- Culture Collection University of Gothenburg (CCUG), Sahlgrenska Academy of the University of Gothenburg, Gothenburg, Sweden
- Microbiology, Department of Biology, University of the Balearic Islands, Palma de Mallorca, Spain
| | - Daniel Jaén-Luchoro
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Culture Collection University of Gothenburg (CCUG), Sahlgrenska Academy of the University of Gothenburg, Gothenburg, Sweden
| | - Björn Andersson
- Bioinformatics Core Facility at Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
| | - Gelio Alves
- National Center for Biotechnology Information (NCBI), Bethesda, MD, United States
| | - Aleksey Ogurtsov
- National Center for Biotechnology Information (NCBI), Bethesda, MD, United States
| | - Annika Thorsell
- Proteomics Core Facility at Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
| | - Johannes Fuchs
- Proteomics Core Facility at Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
| | - Timur Tunovic
- Department of Clinical Microbiology, Sahlgrenska University Hospital, Gothenburg, Sweden
| | - Nina Kamenska
- Norra-Älvsborgs-Länssjukhus (NÄL), Trollhättan, Sweden
| | | | - Yi-Kuo Yu
- National Center for Biotechnology Information (NCBI), Bethesda, MD, United States
| | - Edward R. B. Moore
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Department of Clinical Microbiology, Sahlgrenska University Hospital, Gothenburg, Sweden
- Culture Collection University of Gothenburg (CCUG), Sahlgrenska Academy of the University of Gothenburg, Gothenburg, Sweden
| | - Roger Karlsson
- Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
- Department of Clinical Microbiology, Sahlgrenska University Hospital, Gothenburg, Sweden
- Nanoxis Consulting AB, Gothenburg, Sweden
| |
|
4
|
Verheggen K, Raeder H, Berven FS, Martens L, Barsnes H, Vaudel M. Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows. Mass Spectrometry Reviews 2020; 39:292-306. [PMID: 28902424 DOI: 10.1002/mas.21543] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 04/06/2016] [Accepted: 07/05/2017] [Indexed: 06/07/2023]
Abstract
Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines.
Affiliation(s)
- Kenneth Verheggen
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
| | - Helge Raeder
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
- Department of Pediatrics, Haukeland University Hospital, Bergen, Norway
| | - Frode S Berven
- Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
| | - Lennart Martens
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
- Department of Biochemistry, Ghent University, Ghent, Belgium
- Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
| | - Harald Barsnes
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
- Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
- Computational Biology Unit, Department of Informatics, University of Bergen, Norway
| | - Marc Vaudel
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
- Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
- Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
| |
|
5
|
Ogurtsov AY, Alves G, Yu YK. RAId: Knowledge-Integrated Proteomics Web Service with Accurate Statistical Significance Assignment. Proteomics 2019; 19:e1800367. [PMID: 30908818 DOI: 10.1002/pmic.201800367] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/12/2018] [Revised: 03/13/2019] [Indexed: 11/07/2022]
Abstract
Mass spectrometry-based proteomics starts with identifications of peptides and proteins, which provide the bases for forming the next-level hypotheses whose "validations" are often employed for forming even higher level hypotheses and so forth. Scientifically meaningful conclusions are thus attainable only if the number of falsely identified peptides/proteins is accurately controlled. For this reason, RAId continued to be developed in the past decade. RAId employs rigorous statistics for peptides/proteins identification, hence assigning accurate P-values/E-values that can be used confidently to control the number of falsely identified peptides and proteins. The RAId web service is a versatile tool built to identify peptides and proteins from tandem mass spectrometry data. Not only recognizing various spectra file formats, the web service also allows four peptide scoring functions and choice of three statistical methods for assigning P-values/E-values to identified peptides. Users may upload their own protein database or use one of the available knowledge integrated organismal databases that contain annotated information such as single amino acid polymorphisms, post-translational modifications, and their disease associations. The web service also provides a friendly interface to display, sort using different criteria, and download the identified peptides and proteins. RAId web service is freely available at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid.
Affiliation(s)
- Aleksey Y Ogurtsov
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, 20894, USA
| | - Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, 20894, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD, 20894, USA
| |
|
6
|
Joyce B, Lee D, Rubio A, Ogurtsov A, Alves G, Yu YK. A graphical user interface for RAId, a knowledge integrated proteomics analysis suite with accurate statistics. BMC Res Notes 2018; 11:182. [PMID: 29544540 PMCID: PMC5856202 DOI: 10.1186/s13104-018-3289-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2017] [Accepted: 03/09/2018] [Indexed: 11/10/2022]
Abstract
OBJECTIVE: RAId is a software package that has been actively developed for the past 10 years for computationally and visually analyzing MS/MS data. Founded on rigorous statistical methods, RAId's core program computes accurate E-values for peptides and proteins identified during database searches. Making this robust tool readily accessible to the proteomics community by developing a graphical user interface (GUI) is our main goal here. RESULTS: We have constructed a graphical user interface to facilitate the use of RAId on users' local machines. Written in Java, RAId_GUI not only makes RAId easy to run but also provides tools for data/spectra visualization, MS-product analysis, molecular isotopic distribution analysis, and graphing the retrieval versus the proportion of false discoveries. The results viewer displays the analysis results and allows users to download them. Both the knowledge-integrated organismal databases and the code package (containing source code, the graphical user interface, and a user manual) are available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/raid.html .
Affiliation(s)
- Brendan Joyce
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Danny Lee
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Alex Rubio
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Aleksey Ogurtsov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD, 20894, USA.
| |
|
7
|
Hernandez-Valladares M, Vaudel M, Selheim F, Berven F, Bruserud Ø. Proteogenomics approaches for studying cancer biology and their potential in the identification of acute myeloid leukemia biomarkers. Expert Rev Proteomics 2017; 14:649-663. [DOI: 10.1080/14789450.2017.1352474] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/02/2023]
Affiliation(s)
- Maria Hernandez-Valladares
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Marc Vaudel
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
| | - Frode Selheim
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Frode Berven
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Øystein Bruserud
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| |
|
8
|
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annual Review of Analytical Chemistry (Palo Alto, Calif.) 2016; 9:521-45. [PMID: 27049631 PMCID: PMC4991544 DOI: 10.1146/annurev-anchem-071015-041722] [Citation(s) in RCA: 73] [Impact Index Per Article: 9.1] [Indexed: 05/09/2023]
Abstract
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
Affiliation(s)
- Gloria M Sheynkman
- Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215
- Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Michael R Shortreed
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Anthony J Cesnik
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Lloyd M Smith
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Genome Center of Wisconsin, University of Wisconsin, Madison, Wisconsin 53706
| |
|
9
|
Alves G, Yu YK. Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution. Bioinformatics 2016; 32:2642-9. [PMID: 27153659 DOI: 10.1093/bioinformatics/btw225] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/05/2015] [Accepted: 04/16/2016] [Indexed: 11/14/2022]
Abstract
MOTIVATION: There is a growing trend for biomedical researchers to extract evidence and draw conclusions from mass spectrometry based proteomics experiments, the cornerstone of which is peptide identification. Inaccurate assignments of peptide identification confidence thus may have far-reaching and adverse consequences. Although some peptide identification methods report accurate statistics, they have been limited to certain types of scoring function. The extreme value statistics based method, while more general in the scoring functions it allows, demands accurate parameter estimates and requires, at least in its original design, excessive computational resources. Improving the parameter estimate accuracy and reducing the computational cost for this method has two advantages: it provides another feasible route to accurate significance assessment, and it could provide reliable statistics for scoring functions yet to be developed. RESULTS: We have formulated and implemented an efficient algorithm for calculating the extreme value statistics for peptide identification applicable to various scoring functions, bypassing the need for searching large random databases. AVAILABILITY AND IMPLEMENTATION: The source code, implemented in C++ on a Linux system, is available for download at ftp://ftp.ncbi.nlm.nih.gov/pub/qmbp/qmbp_ms/RAId/RAId_Linux_64Bit . CONTACT: yyu@ncbi.nlm.nih.gov. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
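To illustrate how an extreme value distribution turns a best-match score into a significance value, here is a minimal sketch. It assumes the best score per spectrum follows a Gumbel distribution with location `mu` and scale `beta`, and it uses the common Poisson relation P = 1 - exp(-E) between P-value and E-value. The parameter values and function names are hypothetical, and the paper's own parameter-estimation algorithm is not reproduced here.

```python
import math


def gumbel_e_value(score: float, mu: float, beta: float) -> float:
    """Expected number of random matches scoring at least `score`,
    assuming best scores follow Gumbel(mu, beta)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    return math.exp(-(score - mu) / beta)


def gumbel_p_value(score: float, mu: float, beta: float) -> float:
    """Probability that the best random match scores at least `score`.
    Uses P = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-gumbel_e_value(score, mu, beta))


# Hypothetical fitted parameters for one spectrum.
mu, beta = 20.0, 4.0
e_at_mode = gumbel_e_value(20.0, mu, beta)   # 1.0 when score equals mu
p_high = gumbel_p_value(60.0, mu, beta)      # close to the E-value for high scores
```

For high scores the P-value and E-value nearly coincide, which is why inaccurate estimates of `mu` and `beta` translate directly into miscalibrated significance for the best hits.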
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Yi-Kuo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
|
10
|
Choong WK, Chang HY, Chen CT, Tsai CF, Hsu WL, Chen YJ, Sung TY. Informatics View on the Challenges of Identifying Missing Proteins from Shotgun Proteomics. J Proteome Res 2015; 14:5396-407. [DOI: 10.1021/acs.jproteome.5b00482] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/08/2023]
Affiliation(s)
- Wai-Kok Choong
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
| | - Hui-Yin Chang
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei 11529, Taiwan
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei 11221, Taiwan
| | - Ching-Tai Chen
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
| | - Chia-Feng Tsai
- Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
| | - Yu-Ju Chen
- Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
| | - Ting-Yi Sung
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
| |
|
11
|
Sheynkman GM, Shortreed MR, Frey BL, Scalf M, Smith LM. Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences. J Proteome Res 2013; 13:228-40. [PMID: 24175627 DOI: 10.1021/pr4009207] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Indexed: 02/03/2023]
Abstract
Each individual carries thousands of nonsynonymous single nucleotide variants (nsSNVs) in their genome, each corresponding to a single amino acid polymorphism (SAP) in the encoded proteins. It is important to be able to directly detect and quantify these variations at the protein level to study post-transcriptional regulation, differential allelic expression, and other important biological processes. However, such variant peptides are not generally detected in standard proteomic analyses due to their absence from the generic databases that are employed for mass spectrometry searching. Here we extend previous work that demonstrated the use of customized SAP databases constructed from sample-matched RNA-Seq data. We collected deep-coverage RNA-Seq data from the Jurkat cell line, compiled the set of nsSNVs that are expressed, used this information to construct a customized SAP database, and searched it against deep-coverage shotgun MS data obtained from the same sample. This approach enabled the detection of 421 SAP peptides mapping to 395 nsSNVs. We compared these peptides to peptides identified from a large generic search database containing all known nsSNVs (dbSNP) and found that more than 70% of the SAP peptides from this dbSNP-derived search were not supported by the RNA-Seq data and thus are likely false positives. Next, we increased the SAP coverage from the RNA-Seq derived database by utilizing multiple protease digestions, thereby increasing variant detection to 695 SAP peptides mapping to 504 nsSNV sites. These detected SAP peptides corresponded to moderate to high abundance transcripts (30+ transcripts per million, TPM). The SAP peptides included 192 allelic pairs; the relative expression levels of the two alleles were evaluated for 51 of those pairs and were found to be comparable in all cases.
Affiliation(s)
- Gloria M Sheynkman
- Department of Chemistry, University of Wisconsin-Madison, 1101 University Avenue, Madison, Wisconsin 53706, United States
|
12
|
Wang X, Slebos RJC, Wang D, Halvey PJ, Tabb DL, Liebler DC, Zhang B. Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res 2011; 11:1009-17. [PMID: 22103967 DOI: 10.1021/pr200766z] [Citation(s) in RCA: 132] [Impact Index Per Article: 10.2] [Indexed: 11/28/2022]
Abstract
The standard shotgun proteomics data analysis strategy relies on searching MS/MS spectra against a context-independent protein sequence database derived from the complete genome sequence of an organism. Because transcriptome sequence analysis (RNA-Seq) promises an unbiased and comprehensive picture of the transcriptome, we reason that a sample-specific protein database derived from RNA-Seq data can better approximate the real protein pool in the sample and thus improve protein identification. In this study, we have developed a two-step strategy for building sample-specific protein databases from RNA-Seq data. First, the database size is reduced by eliminating unexpressed or lowly expressed genes according to transcript quantification. Second, high-quality nonsynonymous coding single nucleotide variations (SNVs) are identified based on RNA-Seq data, and corresponding protein variants are added to the database. Using RNA-Seq and shotgun proteomics data from two colorectal cancer cell lines SW480 and RKO, we demonstrated that customized protein sequence databases could significantly increase the sensitivity of peptide identification, reduce ambiguity in protein assembly, and enable the detection of known and novel peptide variants. Thus, sample-specific databases from RNA-Seq data can enable more sensitive and comprehensive protein discovery in shotgun proteomics studies.
Affiliation(s)
- Xiaojing Wang
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee 37232, United States
|
13
|
Stojmirović A, Yu YK. ppiTrim: constructing non-redundant and up-to-date interactomes. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2011; 2011:bar036. [PMID: 21873645 PMCID: PMC3162744 DOI: 10.1093/database/bar036] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022]
Abstract
Robust advances in interactome analysis demand comprehensive, non-redundant and consistently annotated data sets. By non-redundant, we mean that the accounting of evidence for every interaction should be faithful: each independent experimental support is counted exactly once, no more, no less. While many interactions are shared among public repositories, none of them contains the complete known interactome for any model organism. In addition, the annotations of the same experimental result by different repositories often disagree. This brings up the issue of which annotation to keep while consolidating evidences that are the same. The iRefIndex database, including interactions from most popular repositories with a standardized protein nomenclature, represents a significant advance in all aspects, especially in comprehensiveness. However, iRefIndex aims to maintain all information/annotation from original sources and requires users to perform additional processing to fully achieve the aforementioned goals. Another issue has to do with protein complexes. Some databases represent experimentally observed complexes as interactions with more than two participants, while others expand them into binary interactions using spoke or matrix model. To avoid untested interaction information buildup, it is preferable to replace the expanded protein complexes, either from spoke or matrix models, with a flat list of complex members. To address these issues and to achieve our goals, we have developed ppiTrim, a script that processes iRefIndex to produce non-redundant, consistently annotated data sets of physical interactions. Our script proceeds in three stages: mapping all interactants to gene identifiers and removing all undesired raw interactions, deflating potentially expanded complexes, and reconciling for each interaction the annotation labels among different source databases. 
As an illustration, we have processed the three largest organismal data sets: yeast, human and fruitfly. While ppiTrim can resolve most apparent conflicts between different labelings, we also discovered some unresolvable disagreements mostly resulting from different annotation policies among repositories. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/ppiTrim.html
Affiliation(s)
- Aleksandar Stojmirović
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
14
|
Chernobrovkin AL, Mitkevich VA, Popov IA, Indeikina MI, Ilgisonis EV, Lisitsa AV, Archakov AI. Identification of single amino acid polymorphisms in MS/MS spectra of peptides. DOKL BIOCHEM BIOPHYS 2011; 437:90-3. [PMID: 21590384 DOI: 10.1134/s1607672911020098] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/12/2010] [Indexed: 11/23/2022]
Affiliation(s)
- A L Chernobrovkin
- Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, 119832, Russia
|
15
|
Alexandridou A, Dovrolis N, Tsangaris GT, Nikita K, Spyrou G. PepServe: a web server for peptide analysis, clustering and visualization. Nucleic Acids Res 2011; 39:W381-4. [PMID: 21572105 PMCID: PMC3125752 DOI: 10.1093/nar/gkr318] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/29/2022]
Abstract
Peptides, either as protein fragments or as naturally occurring entities, are characterized by their sequence and function features. Researchers often need to manage large peptide lists for protein identification, biomarker discovery, bioactivity, immune response or other functionalities. We present a web server that manages peptide lists in terms of feature analysis as well as interactive clustering and visualization of the given peptides. PepServe is a useful tool for understanding the distribution of peptide features among a group of peptides. The PepServe web application is freely available at http://bioserver-1.bioacademy.gr/Bioserver/PepServe/.
Affiliation(s)
- Anastasia Alexandridou
- Biomedical Research Foundation, Academy of Athens, 4 Soranou Ephessiou, 115 27 Athens, Greece
|
16
|
RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics. PLoS One 2010; 5:e15438. [PMID: 21103371 PMCID: PMC2982831 DOI: 10.1371/journal.pone.0015438] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 07/29/2010] [Accepted: 09/20/2010] [Indexed: 11/26/2022]
Abstract
Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given a MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given a MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
|
17
|
Alves G, Ogurtsov AY, Yu YK. Assigning statistical significance to proteotypic peptides via database searches. J Proteomics 2010; 74:199-211. [PMID: 21055489 DOI: 10.1016/j.jprot.2010.10.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 05/26/2010] [Revised: 10/18/2010] [Accepted: 10/21/2010] [Indexed: 11/19/2022]
Abstract
Querying MS/MS spectra against a database containing only proteotypic peptides reduces data analysis time due to reduction of database size. Despite the speed advantage, this search strategy is challenged by issues of statistical significance and coverage. The former requires separating systematically significant identifications from less confident identifications, while the latter arises when the underlying peptide is not present, due to single amino acid polymorphisms (SAPs) or post-translational modifications (PTMs), in the proteotypic peptide libraries searched. To address both issues simultaneously, we have extended RAId's knowledge database to include proteotypic information, utilized RAId's statistical strategy to assign statistical significance to proteotypic peptides, and modified RAId's programs to allow for consideration of proteotypic information during database searches. The extended database alleviates the coverage problem since all annotated modifications, even those that occurred within proteotypic peptides, may be considered. Taking into account the likelihoods of observation, the statistical strategy of RAId provides accurate E-value assignments regardless of whether a candidate peptide is proteotypic or not. The advantage of including proteotypic information is evidenced by its superior retrieval performance when compared to regular database searches.
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
18
|
Kwon KH. Analytical methods for proteome data obtained from SDS-PAGE multi-dimensional separation and mass spectrometry. J Anal Sci Technol 2010. [DOI: 10.5355/jast.2010.1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
|
19
|
Alves G, Ogurtsov AY, Yu YK. RAId_DbS: mass-spectrometry based peptide identification web server with knowledge integration. BMC Genomics 2008; 9:505. [PMID: 18954448 PMCID: PMC2605478 DOI: 10.1186/1471-2164-9-505] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Received: 05/07/2008] [Accepted: 10/27/2008] [Indexed: 11/17/2022]
Abstract
Background: Existing scientific literature is a rich source of biological information such as disease markers. Integration of this information with data analysis may help researchers to identify possible controversies and to form useful hypotheses for further validations. In the context of proteomics studies, individualized proteomics era may be approached through consideration of amino acid substitutions/modifications as well as information from disease studies. Integration of such information with peptide searches facilitates speedy, dynamic information retrieval that may significantly benefit clinical laboratory studies. Description: We have integrated from various sources annotated single amino acid polymorphisms, post-translational modifications, and their documented disease associations (if they exist) into one enhanced database per organism. We have also augmented our peptide identification software RAId_DbS to take into account this information while analyzing a tandem mass spectrum. In principle, one may choose to respect or ignore the correlation of amino acid polymorphisms/modifications within each protein. The former leads to targeted searches and avoids scoring of unnecessary polymorphism/modification combinations; the latter explores possible polymorphisms in a controlled fashion. To facilitate new discoveries, RAId_DbS also allows users to conduct searches permitting novel polymorphisms as well as to search a knowledge database created by the users. Conclusion: We have finished constructing enhanced databases for 17 organisms. The web link to RAId_DbS and the enhanced databases is . The relevant databases and binaries of RAId_DbS for Linux, Windows, and Mac OS X are available for download from the same web page.
Affiliation(s)
- Gelio Alves
- National Center for Biotechnology Information, National Library of Medicine, NIH, Bethesda, MD 20894, USA.
|