1
Klaproth-Andrade D, Hingerl J, Bruns Y, Smith NH, Träuble J, Wilhelm M, Gagneur J. Deep learning-driven fragment ion series classification enables highly precise and sensitive de novo peptide sequencing. Nat Commun 2024; 15:151. [PMID: 38167372 PMCID: PMC10762064 DOI: 10.1038/s41467-023-44323-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 12/08/2023] [Indexed: 01/05/2024] Open
Abstract
Unlike for DNA and RNA, accurate and high-throughput sequencing methods for proteins are lacking, hindering the utility of proteomics in applications where the sequences are unknown including variant calling, neoepitope identification, and metaproteomics. We introduce Spectralis, a de novo peptide sequencing method for tandem mass spectrometry. Spectralis leverages several innovations including a convolutional neural network layer connecting peaks in spectra spaced by amino acid masses, proposing fragment ion series classification as a pivotal task for de novo peptide sequencing, and a peptide-spectrum confidence score. On spectra for which database search provided a ground truth, Spectralis surpassed 40% sensitivity at 90% precision, nearly doubling state-of-the-art sensitivity. Application to unidentified spectra confirmed its superiority and showcased its applicability to variant calling. Altogether, these algorithmic innovations and the substantial sensitivity increase in the high-precision range constitute an important step toward broadly applicable peptide sequencing.
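The idea of connecting peaks spaced by amino acid masses can be illustrated with a minimal sketch. This is not the Spectralis convolutional layer; it only shows the underlying mass-spacing relationship, assuming singly charged fragments, a truncated residue-mass table, and a hypothetical helper name `amino_acid_links` with an assumed tolerance of 0.02 Da.

```python
# Monoisotopic residue masses (Da) for a few amino acids.
# Assumption: a truncated table for illustration; a full table has 20 entries.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
}


def amino_acid_links(mz_values, tolerance=0.02):
    """Return (i, j, residue) for peak pairs whose m/z gap matches a residue mass.

    Indices refer to the peaks after sorting by m/z. Assumes singly charged ions.
    """
    peaks = sorted(mz_values)
    links = []
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            gap = peaks[j] - low
            for residue, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tolerance:
                    links.append((i, j, residue))
    return links


# Three peaks forming a short ion ladder, plus one unrelated peak.
print(amino_acid_links([175.119, 246.156, 303.178, 400.0]))
# [(0, 1, 'A'), (1, 2, 'G')]
```

Chains of such links trace out candidate fragment ion series, which is the kind of structure a model can then classify.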
Affiliation(s)
- Daniela Klaproth-Andrade
- Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Munich Data Science Institute, Technical University of Munich, Garching, Germany
- Johannes Hingerl
- Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Yanik Bruns
- Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Nicholas H Smith
- Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Jakob Träuble
- Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Mathias Wilhelm
- Munich Data Science Institute, Technical University of Munich, Garching, Germany.
- Computational Mass Spectrometry, School of Life Sciences, Technical University of Munich, Freising, Germany.
- Julien Gagneur
- Computational Molecular Medicine, School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
- Munich Data Science Institute, Technical University of Munich, Garching, Germany.
- Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany.
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany.
2
Tariq MU, Ebert S, Saeed F. Making MS Omics Data ML-Ready: SpeCollate Protocols. Methods Mol Biol 2024; 2836:135-155. [PMID: 38995540 DOI: 10.1007/978-1-0716-4007-4_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
The increasing complexity and volume of mass spectrometry (MS) data have presented new challenges and opportunities for proteomics data analysis and interpretation. In this chapter, we provide a comprehensive guide to transforming MS data for machine learning (ML) training, inference, and applications. The chapter is organized into three parts. The first part describes the data analysis needed for MS-based experiments and gives a general introduction to our deep learning model SpeCollate, which we will use throughout the chapter for illustration. The second part of the chapter explores the transformation of MS data for inference, providing a step-by-step guide for users to deduce peptides from their MS data. This section aims to bridge the gap between data acquisition and practical applications by detailing the necessary steps for data preparation and interpretation. In the final part, we present a demonstrative example of SpeCollate, a deep learning-based peptide database search engine that overcomes the problems of simplistic simulation of theoretical spectra and heuristic scoring functions for peptide-spectrum matches by generating joint embeddings for spectra and peptides. SpeCollate is a user-friendly tool with an intuitive command-line interface to perform the search, showcasing the effectiveness of the techniques and methodologies discussed in the earlier sections and highlighting the potential of machine learning in the context of mass spectrometry data analysis. By offering a comprehensive overview of data transformation, inference, and ML model applications for mass spectrometry, this chapter aims to empower researchers and practitioners in leveraging the power of machine learning to unlock novel insights and drive innovation in the field of mass spectrometry-based omics.
Affiliation(s)
- Muhammad Usman Tariq
- Knight Foundation School of Computing and Information Sciences (KFSCIS), Florida International University (FIU), Miami, FL, USA
- Samuel Ebert
- Knight Foundation School of Computing and Information Sciences (KFSCIS), Florida International University (FIU), Miami, FL, USA
- Fahad Saeed
- Knight Foundation School of Computing and Information Sciences (KFSCIS), Florida International University (FIU), Miami, FL, USA.
3
Ng CCA, Zhou Y, Yao ZP. Algorithms for de-novo sequencing of peptides by tandem mass spectrometry: A review. Anal Chim Acta 2023; 1268:341330. [PMID: 37268337 DOI: 10.1016/j.aca.2023.341330] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 12/22/2022] [Revised: 05/04/2023] [Accepted: 05/06/2023] [Indexed: 06/04/2023]
Abstract
Peptide sequencing is of great significance to fundamental and applied research in fields such as the chemical, biological, medicinal, and pharmaceutical sciences. With the rapid development of mass spectrometry and sequencing algorithms, de-novo peptide sequencing using tandem mass spectrometry (MS/MS) has become the main method for determining amino acid sequences of novel and unknown peptides. Advanced algorithms allow the amino acid sequence information to be accurately obtained from MS/MS spectra in a short time. In this review, algorithms ranging from exhaustive search to state-of-the-art machine learning and neural networks for high-throughput and automated de-novo sequencing are introduced and compared. The impact of datasets on algorithm performance is highlighted. The current limitations and promising directions of de-novo peptide sequencing are also discussed.
Affiliation(s)
- Cheuk Chi A Ng
- State Key Laboratory of Chemical Biology and Drug Discovery, and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; Research Institute for Future Food, and Research Center for Chinese Medicine Innovation, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), and Shenzhen Key Laboratory of Food Biological Safety Control, The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China
- Yin Zhou
- State Key Laboratory of Chemical Biology and Drug Discovery, and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; Research Institute for Future Food, and Research Center for Chinese Medicine Innovation, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), and Shenzhen Key Laboratory of Food Biological Safety Control, The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China
- Zhong-Ping Yao
- State Key Laboratory of Chemical Biology and Drug Discovery, and Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; Research Institute for Future Food, and Research Center for Chinese Medicine Innovation, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Special Administrative Region of China; State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), and Shenzhen Key Laboratory of Food Biological Safety Control, The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, 518057, China.
4
Li L, Wu J, Lyon CJ, Jiang L, Hu TY. Clinical Peptidomics: Advances in Instrumentation, Analyses, and Applications. BME Frontiers 2023; 4:0019. [PMID: 37849662 PMCID: PMC10521655 DOI: 10.34133/bmef.0019] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/11/2023] [Accepted: 04/19/2023] [Indexed: 10/19/2023] Open
Abstract
Extensive effort has been devoted to the discovery, development, and validation of biomarkers for early disease diagnosis and prognosis as well as rapid evaluation of the response to therapeutic interventions. Genomic and transcriptomic profiling are well-established means to identify disease-associated biomarkers. However, analysis of disease-associated peptidomes can also identify novel peptide biomarkers or signatures that provide sensitive and specific diagnostic and prognostic information for specific malignant, chronic, and infectious diseases. Growing evidence also suggests that peptidomic changes in liquid biopsies may more effectively detect changes in disease pathophysiology than other molecular methods. Knowledge gained from peptide-based diagnostic, therapeutic, and imaging approaches has led to promising new theranostic applications that can increase their bioavailability in target tissues at reduced doses to decrease side effects and improve treatment responses. However, despite major advances, multiple factors can still affect the utility of peptidomic data. This review summarizes several remaining challenges that affect peptide biomarker discovery and their use as diagnostics, with a focus on technological advances that can improve the detection, identification, and monitoring of peptide biomarkers for personalized medicine.
Affiliation(s)
- Lin Li
- Center for Cellular and Molecular Diagnostics, Department of Biochemistry and Molecular Biology, School of Medicine, Tulane University, New Orleans, LA, USA
- Department of Laboratory Medicine and Sichuan Provincial Key Laboratory for Human Disease Gene Study, Sichuan Academy of Medical Sciences and Sichuan Provincial People’s Hospital, Chengdu, China
- Jing Wu
- Department of Clinical Laboratory, Third Central Hospital of Tianjin, Tianjin Institute of Hepatobiliary Disease, Tianjin Key Laboratory of Artificial Cell, Artificial Cell Engineering Technology Research Center of Public Health Ministry, Tianjin, China
- Christopher J. Lyon
- Center for Cellular and Molecular Diagnostics, Department of Biochemistry and Molecular Biology, School of Medicine, Tulane University, New Orleans, LA, USA
- Li Jiang
- Department of Laboratory Medicine and Sichuan Provincial Key Laboratory for Human Disease Gene Study, Sichuan Academy of Medical Sciences and Sichuan Provincial People’s Hospital, Chengdu, China
- Tony Y. Hu
- Center for Cellular and Molecular Diagnostics, Department of Biochemistry and Molecular Biology, School of Medicine, Tulane University, New Orleans, LA, USA
- Department of Biomedical Engineering, School of Science and Engineering, Tulane University, New Orleans, LA, USA
5
Multienzyme deep learning models improve peptide de novo sequencing by mass spectrometry proteomics. PLoS Comput Biol 2023; 19:e1010457. [PMID: 36668672 PMCID: PMC9891523 DOI: 10.1371/journal.pcbi.1010457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2022] [Revised: 02/01/2023] [Accepted: 01/04/2023] [Indexed: 01/21/2023] Open
Abstract
Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein sequencing from bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows requires machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the generated models' performances using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of peptides with highly diverse N- and C-termini generalized 76% better than the termini-restricted ones. Moreover, expanding the training set's size by adding peptides from the multienzymatic digestion, with five proteases, of samples from several species led to a 2-3 fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning (MEM) models by fully de novo sequencing the heavy and light monomeric chains of five commercial antibodies (mAbs). MEMs extracted over 10,000 matching and overlapping peptides across mAb samples from six different proteases, achieving 100% sequence coverage for eight of the ten polypeptide chains. We anticipate that the MEMs' proven improvements to de novo analysis will positively impact several applications, such as the analysis of samples of high complexity or unknown nature and the peptidomics field.
6
Zhang Z, Li Y, Yuan W, Wang Z, Wan C. Proteomic-driven identification of short open reading frame-encoded peptides. Proteomics 2022; 22:e2100312. [PMID: 35384297 DOI: 10.1002/pmic.202100312] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/25/2022] [Revised: 03/29/2022] [Accepted: 03/30/2022] [Indexed: 11/10/2022]
Abstract
Accumulating evidence has shown that a large number of short open reading frames (sORFs) also have the ability to encode proteins. The discovery of sORFs opens up a new research area, leading to the identification and functional study of sORF encoded peptides (SEPs) at the omics level. Besides bioinformatics prediction and ribosomal profiling, mass spectrometry (MS) has become a significant tool as it directly detects the sequence of SEPs. Though MS-based proteomics methods have proved to be effective for qualitative and quantitative analysis of SEPs, the detection of SEPs is still a great challenge due to their low abundance and short sequence. To illustrate the progress in method development, we described and discussed the main steps of large-scale proteomics identification of SEPs, including SEP extraction and enrichment, MS detection, data processing and quality control, quantification, and function prediction and validation methods.
Affiliation(s)
- Zheng Zhang
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei, 430079, People's Republic of China
- Yujie Li
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei, 430079, People's Republic of China
- Wenqian Yuan
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei, 430079, People's Republic of China
- Zhiwei Wang
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei, 430079, People's Republic of China
- Cuihong Wan
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei, 430079, People's Republic of China
7
Simopoulos CMA, Figeys D, Lavallée-Adam M. Novel Bioinformatics Strategies Driving Dynamic Metaproteomic Studies. Methods Mol Biol 2022; 2456:319-338. [PMID: 35612752 DOI: 10.1007/978-1-0716-2124-0_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Constant improvements in mass spectrometry technologies and laboratory workflows have enabled the proteomics investigation of biological samples of growing complexity. Microbiomes represent such complex samples for which metaproteomics analyses are becoming increasingly popular. Metaproteomics experimental procedures create large amounts of data from which biologically relevant signal must be efficiently extracted to draw meaningful conclusions. Such data processing requires appropriate bioinformatics tools specifically developed for, or capable of handling, metaproteomics data. In this chapter, we outline current and novel tools that can perform the most commonly used steps in the analysis of cutting-edge metaproteomics data, such as peptide and protein identification and quantification, as well as data normalization, imputation, mining, and visualization. We also provide details about the experimental setups in which these tools should be used.
Affiliation(s)
- Caitlin M A Simopoulos
- Department of Biochemistry, Microbiology and Immunology and Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, ON, Canada
- Daniel Figeys
- Department of Biochemistry, Microbiology and Immunology and Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, ON, Canada
- School of Pharmaceutical Sciences, University of Ottawa, Ottawa, ON, Canada
- Mathieu Lavallée-Adam
- Department of Biochemistry, Microbiology and Immunology and Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, ON, Canada.
8
Tariq MU, Saeed F. SpeCollate: Deep cross-modal similarity network for mass spectrometry data based peptide deductions. PLoS One 2021; 16:e0259349. [PMID: 34714871 PMCID: PMC8555789 DOI: 10.1371/journal.pone.0259349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/10/2021] [Accepted: 10/18/2021] [Indexed: 11/19/2022] Open
Abstract
Historically, the database search algorithms have been the de facto standard for inferring peptides from mass spectrometry (MS) data. Database search algorithms deduce peptides by transforming theoretical peptides into theoretical spectra and matching them to the experimental spectra. Heuristic similarity-scoring functions are used to match an experimental spectrum to a theoretical spectrum. However, the heuristic nature of the scoring functions and the simple transformation of the peptides into theoretical spectra, along with noisy mass spectra for the less abundant peptides, can introduce a cascade of inaccuracies. In this paper, we design and implement a Deep Cross-Modal Similarity Network called SpeCollate, which overcomes these inaccuracies by learning the similarity function between experimental spectra and peptides directly from the labeled MS data. SpeCollate transforms spectra and peptides into a shared Euclidean subspace by learning fixed size embeddings for both. Our proposed deep-learning network trains on sextuplets of positive and negative examples coupled with our custom-designed SNAP-loss function. Online hardest negative mining is used to select the appropriate negative examples for optimal training performance. We use 4.8 million sextuplets obtained from the NIST and MassIVE peptide libraries to train the network and demonstrate that for closed search, SpeCollate is able to perform better than Crux and MSFragger in terms of the number of peptide-spectrum matches (PSMs) and unique peptides identified under 1% FDR for real-world data. SpeCollate also identifies a large number of peptides not reported by either Crux or MSFragger. To the best of our knowledge, our proposed SpeCollate is the first deep-learning network that can determine the cross-modal similarity between peptides and mass-spectra for MS-based proteomics. We believe SpeCollate is significant progress towards developing machine-learning solutions for MS-based omics data analysis. 
SpeCollate is available at https://deepspecs.github.io/.
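The conventional database-search step that the abstract contrasts SpeCollate with (turning candidate peptides into theoretical spectra and matching them to an experimental spectrum) can be sketched as follows. This is a minimal illustration, not the scoring used by Crux, MSFragger, or SpeCollate: it assumes unmodified peptides, singly charged b and y ions only, a truncated residue-mass table, and a deliberately simple shared-peak count as the heuristic score. All function names are hypothetical.

```python
# Monoisotopic residue masses (Da); truncated table for illustration only.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "K": 128.09496, "R": 156.10111}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def theoretical_spectrum(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for cut in range(1, len(masses)):
        ions.append(sum(masses[:cut]) + PROTON)           # b ion (N-terminal fragment)
        ions.append(sum(masses[cut:]) + WATER + PROTON)   # y ion (C-terminal fragment)
    return sorted(ions)


def shared_peak_count(experimental, theoretical, tolerance=0.02):
    """Heuristic score: number of theoretical peaks found in the experimental spectrum."""
    return sum(any(abs(e - t) <= tolerance for e in experimental) for t in theoretical)


def best_match(experimental, candidates):
    """Return the candidate peptide whose theoretical spectrum scores highest."""
    return max(candidates, key=lambda p: shared_peak_count(experimental, theoretical_spectrum(p)))


# Use a noise-free theoretical spectrum as a stand-in for an experimental one.
observed = theoretical_spectrum("GASK")
print(best_match(observed, ["AGSK", "GASK", "SAGR"]))  # GASK
```

A learned approach such as the one described in the abstract replaces both the hand-written fragment model and the heuristic score with embeddings and a distance in a shared space.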
Affiliation(s)
- Muhammad Usman Tariq
- School of Computing & Information Sciences, Florida International University, Miami, FL, United States of America
- Fahad Saeed
- School of Computing & Information Sciences, Florida International University, Miami, FL, United States of America
9
Progress and challenges in mass spectrometry-based analysis of antibody repertoires. Trends Biotechnol 2021; 40:463-481. [PMID: 34535228 DOI: 10.1016/j.tibtech.2021.08.006] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 06/23/2021] [Revised: 08/16/2021] [Accepted: 08/17/2021] [Indexed: 12/22/2022]
Abstract
Humoral immunity is divided into the cellular B cell and protein-level antibody responses. High-throughput sequencing has advanced our understanding of both these fundamental aspects of B cell immunology as well as aspects pertaining to vaccine and therapeutics biotechnology. Although the protein-level serum and mucosal antibody repertoire make major contributions to humoral protection, the sequence composition and dynamics of antibody repertoires remain underexplored. This limits insight into important immunological and biotechnological parameters such as the number of antigen-specific antibodies, which are for example, relevant for pathogen neutralization, microbiota regulation, severity of autoimmunity, and therapeutic efficacy. High-resolution mass spectrometry (MS) has allowed initial insights into the antibody repertoire. We outline current challenges in MS-based sequence analysis of antibody repertoires and propose strategies for their resolution.
10
Zhang W, Liang Z, Chen X, Xin L, Shan B, Luo Z, Li M. ChimST: An Efficient Spectral Library Search Tool for Peptide Identification from Chimeric Spectra in Data-Dependent Acquisition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1416-1425. [PMID: 31603795 DOI: 10.1109/tcbb.2019.2945954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Accurate and sensitive identification of peptides from MS/MS spectra is a very challenging problem in computational shotgun proteomics. To tackle this problem, spectral library search has been one of the competitive solutions. However, most existing library search tools were developed on the basis of one peptide per spectrum, which prevents them from working properly on chimeric spectra where two or more peptides are co-fragmented. In this work, we present a new library search tool called ChimST, which is particularly capable of reliably identifying multiple peptides from a chimeric spectrum. It starts with associating each query MS/MS spectrum with MS precursor features. For each precursor feature, there is a list of peptide candidates extracted from an input spectral library. Then, it takes one peptide candidate from each associated feature and scores how well they could collectively interpret the query spectrum. The highest-scoring set of peptide candidates are finally reported as the identification of the query spectrum. Our experimental tests show that ChimST could significantly outperform the three state-of-the-art library search tools, SpectraST, reSpect, and MSPLIT, in terms of the numbers of both peptide-spectrum matches and unique peptides, especially when the acquisition isolation window is broad.
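The selection step described above (take one candidate per precursor feature, score the combination jointly, report the best set) can be sketched as below. The abstract does not specify ChimST's scoring function, so this sketch substitutes an assumed stand-in, the number of query peaks explained by at least one library spectrum in the set. The function names, the tolerance, and the toy library spectra are all hypothetical.

```python
from itertools import product


def explained_peaks(query, candidate_set, tolerance=0.02):
    """Stand-in score: query peaks matched by at least one library spectrum in the set."""
    return sum(
        any(abs(q - p) <= tolerance for _, peaks in candidate_set for p in peaks)
        for q in query
    )


def best_candidate_set(query, candidates_per_feature):
    """Try every combination with one candidate per feature and keep the best-scoring one."""
    return max(product(*candidates_per_feature), key=lambda s: explained_peaks(query, s))


# Toy chimeric spectrum and two precursor features with two library candidates each.
query = [100.0, 250.0, 300.0, 400.0]
candidates_per_feature = [
    [("PEPA", [100.0, 200.0]), ("PEPB", [100.0, 250.0])],
    [("PEPC", [300.0, 400.0]), ("PEPD", [300.0, 450.0])],
]
best = best_candidate_set(query, candidates_per_feature)
print([name for name, _ in best])  # ['PEPB', 'PEPC']
```

Exhaustive enumeration grows exponentially with the number of co-isolated features, so it is shown here only to make the joint-scoring idea concrete.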
11
Abstract
Proteomics, the large-scale study of all proteins of an organism or system, is a powerful tool for studying biological systems. It can provide a holistic view of the physiological and biochemical states of given samples through identification and quantification of large numbers of peptides and proteins. In forensic science, proteomics can be used as a confirmatory and orthogonal technique for well-built genomic analyses. Proteomics is highly valuable in cases where nucleic acids are absent or degraded, such as hair and bone samples. It can be used to identify body fluids, ethnic group, gender, individual, and estimate post-mortem interval using bone, muscle, and decomposition fluid samples. Compared to genomic analysis, proteomics can provide a better global picture of a sample. It has been used in forensic science for a wide range of sample types and applications. In this review, we briefly introduce proteomic methods, including sample preparation techniques, data acquisition using liquid chromatography-tandem mass spectrometry, and data analysis using database search, spectral library search, and de novo sequencing. We also summarize recent applications in the past decade of proteomics in forensic science with a special focus on human samples, including hair, bone, body fluids, fingernail, muscle, brain, and fingermark, and address the challenges, considerations, and future developments of forensic proteomics.
12
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE Access: Practical Innovations, Open Solutions 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on scalability issues that are inherent in proteogenomic data analysis, where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
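To make the six-frame translation mentioned above concrete, here is a minimal sketch that translates a DNA string in three forward and three reverse-complement reading frames using the standard genetic code. It assumes the input contains only A, C, G, and T (an ambiguous base such as N would raise a `KeyError`); the function names are hypothetical and no particular proteogenomic tool is being reproduced.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of the string; trailing bases are dropped."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translation(dna):
    """Return translations for forward frames 0-2, then reverse-complement frames 0-2."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


print(six_frame_translation("ATGGCCTAA"))
# ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

Because every genomic position contributes to six candidate protein sequences, the resulting search database is far larger than a curated protein database, which is the scalability issue the survey highlights.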
Affiliation(s)
- Muhammad Usman Tariq
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Muhammad Haseeb
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Mohammed Aledhari
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Rehma Razzak
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Reza M Parizi
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Fahad Saeed
- School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
13
Vitorino R, Guedes S, Trindade F, Correia I, Moura G, Carvalho P, Santos MAS, Amado F. De novo sequencing of proteins by mass spectrometry. Expert Rev Proteomics 2020; 17:595-607. [PMID: 33016158 DOI: 10.1080/14789450.2020.1831387] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/24/2022]
Abstract
INTRODUCTION: Proteins are crucial for every cellular activity, and unraveling their sequence and structure is a crucial step to fully understand their biology. Early methods of protein sequencing were mainly based on the use of enzymatic or chemical degradation of peptide chains. With the completion of the human genome project and with the expansion of the information available for each protein, various databases containing this sequence information were formed. AREAS COVERED: De novo protein sequencing, shotgun proteomics, and other mass-spectrometric techniques, along with various software, are currently available for proteogenomic analysis. Emphasis is placed on the methods for de novo sequencing, together with the potential and shortcomings of using databases for interpretation of protein sequence data. EXPERT OPINION: As mass-spectrometry sequencing performance is improving with better software and hardware optimizations, combined with user-friendly interfaces, de novo protein sequencing becomes imperative in shotgun proteomic studies. Issues regarding unknown or mutated peptide sequences, as well as unexpected post-translational modifications (PTMs) and their identification through false discovery rate searches using the target/decoy strategy, need to be addressed. Ideally, it should become integrated in standard proteomic workflows as an add-on to conventional database search engines, which would then be able to provide improved identification.
Affiliation(s)
- Rui Vitorino
- QOPNA & LAQV-REQUIMTE, Departamento De Química, Institute of Biomedicine - iBiMED, Aveiro, Portugal
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro, Portugal
- Unidade De Investigação Cardiovascular, Departamento De Cirurgia E Fisiologia, Faculdade De Medicina, Universidade Do Porto, Porto, Portugal
- Sofia Guedes
- QOPNA & LAQV-REQUIMTE, Departamento De Química, Institute of Biomedicine - iBiMED, Aveiro, Portugal
- Fabio Trindade
- Unidade De Investigação Cardiovascular, Departamento De Cirurgia E Fisiologia, Faculdade De Medicina, Universidade Do Porto, Porto, Portugal
- Inês Correia
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro, Portugal
- Gabriela Moura
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro, Portugal
- Paulo Carvalho
- Laboratory for Structural and Computational Proteomics, Carlos Chagas Institute, FIOCRUZ, Laboratory for Proteomics and Protein Engineering, Brazil
- Manuel A S Santos
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro, Portugal
- Francisco Amado
- QOPNA & LAQV-REQUIMTE, Departamento De Química, Institute of Biomedicine - iBiMED, Aveiro, Portugal
|
14
|
DeLaney K, Cao W, Ma Y, Ma M, Zhang Y, Li L. PRESnovo: Prescreening Prior to de novo Sequencing to Improve Accuracy and Sensitivity of Neuropeptide Identification. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2020; 31:1358-1371. [PMID: 32266812 PMCID: PMC7332408 DOI: 10.1021/jasms.0c00013] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 05/15/2023]
Abstract
Identification of peptides in species lacking fully sequenced genomes is challenging due to the lack of prior knowledge. De novo sequencing is the method of choice, but its performance is less than satisfactory due to algorithmic bias and interference in complex MS/MS spectra. The task becomes even more challenging for endogenous peptides that do not involve an enzymatic digestion step, such as neuropeptides. However, many neuropeptides possess common sequence motifs that are conserved across members of the same family. Taking advantage of this feature to improve de novo sequencing of neuropeptides, we have developed a method named PRESnovo (prescreening precursors prior to de novo sequencing) to predict the motif from a MS/MS spectrum. A neuropeptide sequence is broken into a motif with conserved amino acid residues and the remaining partial sequence. By searching against a predefined motif database constructed from known homologous sequences, PRESnovo assigns the most probable motif to each precursor via a sophisticated scoring function. Performance analysis was conducted with 15 neuropeptide standards, and 11 neuropeptides were correctly identified with PRESnovo compared to 1 identification by PEAKS only. We applied PRESnovo to assign motifs to peptide sequences in conjunction with PEAKS for assigning the rest of the peptide sequence in order to discover neuropeptides in tissue samples of green crab, C. maenas, and Jonah crab, C. borealis. Collectively, a large number of neuropeptides were identified, including 13 putative neuropeptides identified in green crab brain, 77 in Jonah crab brain, and 47 in Jonah crab sinus glands for the first time. This PRESnovo strategy greatly simplifies de novo sequencing and enhances the accuracy and sensitivity of neuropeptide identification when common motifs are present.
|
15
|
Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis. Int J Mol Sci 2020; 21:ijms21082873. [PMID: 32326049 PMCID: PMC7216093 DOI: 10.3390/ijms21082873] [Citation(s) in RCA: 119] [Impact Index Per Article: 29.8] [Received: 03/18/2020] [Revised: 04/16/2020] [Accepted: 04/18/2020] [Indexed: 01/15/2023] Open
Abstract
Recent advances in mass spectrometry (MS)-based proteomics have enabled tremendous progress in the understanding of cellular mechanisms, disease progression, and the relationship between genotype and phenotype. Though many popular bioinformatics methods in proteomics are derived from other omics studies, novel analysis strategies are required to deal with the unique characteristics of proteomics data. In this review, we discuss the current developments in the bioinformatics methods used in proteomics and how they facilitate the mechanistic understanding of biological processes. We first introduce bioinformatics software and tools designed for mass spectrometry-based protein identification and quantification, and then we review the different statistical and machine learning methods that have been developed to perform comprehensive analysis in proteomics studies. We conclude with a discussion of how quantitative protein data can be used to reconstruct protein interactions and signaling networks.
|
16
|
Mao Y, Daly TJ, Li N. Lys-Sequencer: An algorithm for de novo sequencing of peptides by paired single residue transposed Lys-C and Lys-N digestion coupled with high-resolution mass spectrometry. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2020; 34:e8574. [PMID: 31499586 DOI: 10.1002/rcm.8574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2019] [Revised: 08/27/2019] [Accepted: 09/02/2019] [Indexed: 06/10/2023]
Abstract
RATIONALE Database-dependent identification of proteins by mass spectrometry is well established, but has limitations when there are novel proteins, mutations, splice variants, and post-translational modifications (PTMs) not available in the established reference database. De novo sequencing as a database-independent approach could address these limitations by deducing peptide sequences directly from experimental tandem mass spectrometry spectra, while concomitantly yielding residue-by-residue confidence metrics. METHODS Equal amounts of bovine serum albumin (BSA) sample aliquots were digested separately with Lys-C and Lys-N complementary peptidases, separated by reversed-phase ultra-high-performance liquid chromatography (UPLC), and analyzed by collision-induced dissociation (CID)-based mass spectrometry on an Orbitrap mass spectrometer. In the Lys-Sequencer algorithm, matched tandem mass spectra with equal precursor ion mass from complementary digestions were paired, and fragment ion types were identified based on the unique mass relationship between fragment ions extracted from a spectrum pair followed by de novo sequencing of peptides with identification confidence assigned at the residue level. RESULTS In all the matched spectrum pairs, 34 top-ranked BSA peptides were identified, from which 391 amino acid residues were identified correctly, covering ~67% of the full sequence of BSA (583 residues) with only ~6% (35 residues) exhibiting ambiguity in the sequence order (although amino acid compositions were still correctly assigned). Of note, this approach identified peptide sequences up to 17 amino acids in length without ambiguity, with the exception of the N-terminal or C-terminal peptides containing lysine (18-mer). CONCLUSIONS The algorithm ("Lys-Sequencer") developed in this work achieves high precision for de novo sequencing of peptides. 
This method facilitates the identification of point mutation and new PTMs in the protein characterization and discovery of new peptides and proteins with varying levels of confidence.
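The mass relationship between paired Lys-C and Lys-N spectra mentioned in the abstract above can be illustrated with a small sketch. It is not the Lys-Sequencer algorithm itself; the segment `GASVLE` and the helper names `b_ions` and `y_ions` are hypothetical, and only singly charged b and y ions are considered. For a segment flanked by lysines, Lys-C yields the peptide with K at the C-terminus and Lys-N yields the same residues with K at the N-terminus, so corresponding fragment ions differ by exactly one lysine residue mass.

```python
# Monoisotopic residue masses in Da (standard values, small subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "E": 129.04259, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b-ion m/z values b1..b(n-1)."""
    values, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        values.append(total + PROTON)
    return values


def y_ions(peptide):
    """Singly charged y-ion m/z values y1..y(n-1)."""
    values, total = [], 0.0
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        values.append(total + WATER + PROTON)
    return values


core = "GASVLE"        # hypothetical segment between two lysines
lys_c = core + "K"     # Lys-C cleaves C-terminal to lysine
lys_n = "K" + core     # Lys-N cleaves N-terminal to lysine
k_mass = RESIDUE_MASS["K"]

# b_i of the Lys-N peptide equals b_(i-1) of the Lys-C peptide plus K.
b_shift = [bn - bc for bn, bc in zip(b_ions(lys_n)[1:], b_ions(lys_c))]
# y_j of the Lys-C peptide equals y_(j-1) of the Lys-N peptide plus K.
y_shift = [yc - yn for yc, yn in zip(y_ions(lys_c)[1:], y_ions(lys_n))]

print(all(abs(d - k_mass) < 1e-6 for d in b_shift + y_shift))
```

Because b ions shift in the Lys-N spectrum while y ions shift in the Lys-C spectrum, the direction of the lysine offset hints at the ion type of a matched peak pair, which is the kind of relationship the abstract refers to.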
Affiliation(s)
- Yuan Mao
- Department of Analytical Chemistry, Regeneron Pharmaceuticals, Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
| | - Thomas J Daly
- Department of Analytical Chemistry, Regeneron Pharmaceuticals, Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
| | - Ning Li
- Department of Analytical Chemistry, Regeneron Pharmaceuticals, Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
|
17
|
Valli M, Russo HM, Pilon AC, Pinto MEF, Dias NB, Freire RT, Castro-Gamboa I, Bolzani VDS. Computational methods for NMR and MS for structure elucidation I: software for basic NMR. PHYSICAL SCIENCES REVIEWS 2019. [DOI: 10.1515/psr-2018-0108] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/30/2023]
Abstract
Structure elucidation is an important and sometimes time-consuming step for natural products research. This step has evolved in the past few years to a faster and more automated process due to the development of several computational programs and analytical techniques. In this paper, the topics of NMR prediction and CASE programs are addressed. Furthermore, the elucidation of natural peptides is discussed.
|
18
|
Issa Isaac N, Philippe D, Nicholas A, Raoult D, Eric C. Metaproteomics of the human gut microbiota: Challenges and contributions to other OMICS. CLINICAL MASS SPECTROMETRY 2019; 14 Pt A:18-30. [DOI: 10.1016/j.clinms.2019.06.001] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 03/05/2019] [Revised: 06/02/2019] [Accepted: 06/03/2019] [Indexed: 12/22/2022]
|
19
|
Schweiger R, Erlich Y, Carmi S. FactorialHMM: fast and exact inference in factorial hidden Markov models. Bioinformatics 2019; 35:2162-2164. [PMID: 30445428 DOI: 10.1093/bioinformatics/bty944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2018] [Revised: 11/07/2018] [Accepted: 11/13/2018] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Hidden Markov models (HMMs) are powerful tools for modeling processes along the genome. In a standard genomic HMM, observations are drawn, at each genomic position, from a distribution whose parameters depend on a hidden state, and the hidden states evolve along the genome as a Markov chain. Often, the hidden state is the Cartesian product of multiple processes, each evolving independently along the genome. Inference in these so-called Factorial HMMs has a naïve running time that scales as the square of the number of possible states, which by itself increases exponentially with the number of sub-chains; such a running time scaling is impractical for many applications. While faster algorithms exist, there is no available implementation suitable for developing bioinformatics applications. RESULTS We developed FactorialHMM, a Python package for fast exact inference in Factorial HMMs. Our package allows simulating either directly from the model or from the posterior distribution of states given the observations. Additionally, we allow the inference of all key quantities related to HMMs: (i) the (Viterbi) sequence of states with the highest posterior probability; (ii) the likelihood of the data and (iii) the posterior probability (given all observations) of the marginal and pairwise state probabilities. The running time and space requirement of all procedures is linearithmic in the number of possible states. Our package is highly modular, providing the user with maximal flexibility for developing downstream applications. AVAILABILITY AND IMPLEMENTATION https://github.com/regevs/factorial_hmm. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
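The scaling argument in the abstract above can be made concrete with a few lines of arithmetic. This is only an order-of-magnitude sketch, not the FactorialHMM API: the function `state_space_costs` and its parameter names are hypothetical, constants are ignored, and the counts are per genomic position.

```python
import math


def state_space_costs(states_per_chain, n_chains):
    """Return (joint states, naive cost, linearithmic cost) per position."""
    # The joint hidden state is the Cartesian product of the sub-chains.
    n_states = states_per_chain ** n_chains
    # A dense transition step touches every pair of joint states.
    naive = n_states ** 2
    # Linearithmic scaling in the number of joint states: N * log2(N).
    linearithmic = n_states * math.log2(n_states)
    return n_states, naive, linearithmic


n_states, naive, fast = state_space_costs(states_per_chain=2, n_chains=10)
print(n_states, naive, int(fast))
```

With ten binary sub-chains there are 1,024 joint states, so the naive step costs about a million operations per position while the linearithmic one costs about ten thousand.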
Affiliation(s)
- Regev Schweiger
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.,MyHeritage, Or Yehuda, Israel
| | - Yaniv Erlich
- MyHeritage, Or Yehuda, Israel.,Department of Computer Science, Fu Foundation School of Engineering, Columbia University, New York, NY, USA.,Department of Systems Biology, Center for Computational Biology and Bioinformatics (C2B2), Columbia University, New York, NY, USA.,New York Genome Center, New York, NY, USA
| | - Shai Carmi
- Braun School of Public Health and Community Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel
|
20
|
Muth T, Renard BY. Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification? Brief Bioinform 2019; 19:954-970. [PMID: 28369237 DOI: 10.1093/bib/bbx033] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Received: 12/08/2016] [Indexed: 01/24/2023] Open
Abstract
While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments nowadays offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has been rarely evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation on the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired from different instrument types from collision-induced dissociation and higher energy collisional dissociation (HCD) fragmentation mode using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in terms of running time speedup by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. 
Finally, we discuss the potential of de novo sequencing for now becoming more widely used in the field.
Affiliation(s)
- Thilo Muth
- Research Group Bioinformatics, Robert Koch Institute, Berlin, Germany
| | - Bernhard Y Renard
- Research Group Bioinformatics, Robert Koch Institute, Berlin, Germany
|
21
|
Yang H, Li YC, Zhao MZ, Wu FL, Wang X, Xiao WD, Wang YH, Zhang JL, Wang FQ, Xu F, Zeng WF, Overall CM, He SM, Chi H, Xu P. Precision De Novo Peptide Sequencing Using Mirror Proteases of Ac-LysargiNase and Trypsin for Large-scale Proteomics. Mol Cell Proteomics 2019; 18:773-785. [PMID: 30622160 PMCID: PMC6442358 DOI: 10.1074/mcp.tir118.000918] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 06/16/2018] [Revised: 11/20/2018] [Indexed: 11/06/2022] Open
Abstract
De novo peptide sequencing for large-scale proteomics remains challenging because of the lack of full coverage of ion series in tandem mass spectra. We developed a mirror protease of trypsin, acetylated LysargiNase (Ac-LysargiNase), with superior activity and stability. The mirror spectrum pairs derived from the Ac-LysargiNase and trypsin treated samples can generate full b and y ion series, which provide mutual complementarity of each other, and allow us to develop a novel algorithm, pNovoM, for de novo sequencing. Using pNovoM to sequence peptides of purified proteins, the accuracy of the sequence was close to 100%. More importantly, from a large-scale yeast proteome sample digested with trypsin and Ac-LysargiNase individually, 48% of all tandem mass spectra formed mirror spectrum pairs, 97% of which contained full coverage of ion series, resulting in precision de novo sequencing of full-length peptides by pNovoM. This enabled pNovoM to successfully sequence 21,249 peptides from 3,753 proteins and interpreted 44-152% more spectra than pNovo+ and PEAKS at a 5% FDR at the spectrum level. Moreover, the mirror protease strategy had an obvious advantage in sequencing long peptides. We believe that the combination of mirror protease strategy and pNovoM will be an effective approach for precision de novo sequencing on both single proteins and proteome samples.
Affiliation(s)
- Hao Yang
- From the ‡Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Institute of Computing Technology, CAS, Beijing 100190, China
| | - Yan-Chang Li
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Ming-Zhi Zhao
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Fei-Lin Wu
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Xi Wang
- From the ‡Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Institute of Computing Technology, CAS, Beijing 100190, China
| | - Wei-Di Xiao
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Yi-Hao Wang
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Jun-Ling Zhang
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Fu-Qiang Wang
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Feng Xu
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China
| | - Wen-Feng Zeng
- From the ‡Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Institute of Computing Technology, CAS, Beijing 100190, China
| | - Christopher M Overall
- ‖Centre for Blood Research, University of British Columbia, Vancouver, British Columbia, Canada
| | - Si-Min He
- From the ‡Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Institute of Computing Technology, CAS, Beijing 100190, China
| | - Hao Chi
- From the ‡Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; Institute of Computing Technology, CAS, Beijing 100190, China
| | - Ping Xu
- §State Key Laboratory of Proteomics; Beijing Proteome Research Center; National Center for Protein Sciences Beijing; Beijing Institute of Lifeomics, Beijing 102206, China; ¶Key Laboratory of Combinatorial Biosynthesis and Drug Discovery of Ministry of Education Wuhan University, Wuhan University School of Pharmaceutical Sciences, Wuhan 430071, China; College of Life Sciences, Hebei University, Baoding 071002, China
|
22
|
Fomin E. A Simple Approach to the Reconstruction of a Set of Points from the Multiset of Pairwise Distances in n^2 Steps for the Sequencing Problem: III. Noise Inputs for the Beltway Case. J Comput Biol 2019; 26:68-75. [DOI: 10.1089/cmb.2018.0078] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/18/2023] Open
Affiliation(s)
- Eduard Fomin
- Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia
|
23
|
Muth T, Hartkopf F, Vaudel M, Renard BY. A Potential Golden Age to Come-Current Tools, Recent Use Cases, and Future Avenues for De Novo Sequencing in Proteomics. Proteomics 2018; 18:e1700150. [PMID: 29968278 DOI: 10.1002/pmic.201700150] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 03/23/2018] [Revised: 05/23/2018] [Indexed: 01/15/2023]
Abstract
In shotgun proteomics, peptide and protein identification is most commonly conducted using database search engines, the method of choice when reference protein sequences are available. Despite its widespread use the database-driven approach is limited, mainly because of its static search space. In contrast, de novo sequencing derives peptide sequence information in an unbiased manner, using only the fragment ion information from the tandem mass spectra. In recent years, with the improvements in MS instrumentation, various new methods have been proposed for de novo sequencing. This review article provides an overview of existing de novo sequencing algorithms and software tools ranging from peptide sequencing to sequence-to-protein mapping. Various use cases are described for which de novo sequencing was successfully applied. Finally, limitations of current methods are highlighted and new directions are discussed for a wider acceptance of de novo sequencing in the community.
Affiliation(s)
- Thilo Muth
- Bioinformatics Unit (MF 1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, 13353, Berlin, Germany
| | - Felix Hartkopf
- Bioinformatics Unit (MF 1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, 13353, Berlin, Germany
| | - Marc Vaudel
- K.G. Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, 5020, Bergen, Norway.,Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, 5020, Bergen, Norway
| | - Bernhard Y Renard
- Bioinformatics Unit (MF 1), Department for Methods Development and Research Infrastructure, Robert Koch Institute, 13353, Berlin, Germany
|
24
|
Affiliation(s)
- Ngoc Hieu Tran
- David R. Cheriton School of Computer Science; University of Waterloo; Waterloo, ON Canada
| | - Xianglilan Zhang
- David R. Cheriton School of Computer Science; University of Waterloo; Waterloo, ON Canada
- State Key Laboratory of Pathogen and Biosecurity; Beijing Institute of Microbiology and Epidemiology; Beijing P.R. China
| | - Ming Li
- David R. Cheriton School of Computer Science; University of Waterloo; Waterloo, ON Canada
|
25
|
Vinogradov AA, Gates ZP, Zhang C, Quartararo AJ, Halloran KH, Pentelute BL. Library Design-Facilitated High-Throughput Sequencing of Synthetic Peptide Libraries. ACS COMBINATORIAL SCIENCE 2017; 19:694-701. [PMID: 28892357 PMCID: PMC5818986 DOI: 10.1021/acscombsci.7b00109] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Indexed: 11/29/2022]
Abstract
A methodology to achieve high-throughput de novo sequencing of synthetic peptide mixtures is reported. The approach leverages shotgun nanoliquid chromatography coupled with tandem mass spectrometry-based de novo sequencing of library mixtures (up to 2000 peptides) as well as automated data analysis protocols to filter away incorrect assignments, noise, and synthetic side-products. For increasing the confidence in the sequencing results, mass spectrometry-friendly library designs were developed that enabled unambiguous decoding of up to 600 peptide sequences per hour while maintaining greater than 85% sequence identification rates in most cases. The reliability of the reported decoding strategy was additionally confirmed by matching fragmentation spectra for select authentic peptides identified from library sequencing samples. The methods reported here are directly applicable to screening techniques that yield mixtures of active compounds, including particle sorting of one-bead one-compound libraries and affinity enrichment of synthetic library mixtures performed in solution.
Affiliation(s)
| | - Zachary P. Gates
- Department of Chemistry, Massachusetts Institute of Technology, 18-563, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Chi Zhang
- Department of Chemistry, Massachusetts Institute of Technology, 18-563, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Anthony J. Quartararo
- Department of Chemistry, Massachusetts Institute of Technology, 18-563, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Kathryn H. Halloran
- Department of Chemistry, Massachusetts Institute of Technology, 18-563, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Bradley L Pentelute
- Department of Chemistry, Massachusetts Institute of Technology, 18-563, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
|
26
|
Zhang Z, Boonen K, Li M, Geuten K. mRNA Interactome Capture from Plant Protoplasts. J Vis Exp 2017. [PMID: 28784956 DOI: 10.3791/56011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/09/2023] Open
Abstract
RNA-binding proteins (RBPs) determine the fates of RNAs. They participate in all RNA biogenesis pathways and especially contribute to post-transcriptional gene regulation (PTGR) of messenger RNAs (mRNAs). In the past few years, a number of mRNA-bound proteomes from yeast and mammalian cell lines have been successfully isolated through the use of a novel method called "mRNA interactome capture," which allows for the identification of mRNA-binding proteins (mRBPs) directly from a physiological environment. The method is composed of in vivo ultraviolet (UV) crosslinking, pull-down and purification of messenger ribonucleoprotein complexes (mRNPs) by oligo(dT) beads, and the subsequent identification of the crosslinked proteins by mass spectrometry (MS). Very recently, by applying the same method, several plant mRNA-bound proteomes have been reported simultaneously from different Arabidopsis tissue sources: etiolated seedlings, leaf tissue, leaf mesophyll protoplasts, and cultured root cells. Here, we present the optimized mRNA interactome capture method for Arabidopsis thaliana leaf mesophyll protoplasts, a cell type that serves as a versatile tool for experiments that include various cellular assays. The conditions for optimal protein yield include the amount of starting tissue and the duration of UV irradiation. In the mRNA-bound proteome obtained from a medium-scale experiment (10^7 cells), RBPs noted to have RNA-binding capacity were found to be overrepresented, and many novel RBPs were identified. The experiment can be scaled up (10^9 cells), and the optimized method can be applied to other plant cell types and species to broadly isolate, catalog, and compare mRNA-bound proteomes in plants.
|
27
|
Abstract
De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state of the art methods, achieving 7.7-22.9% higher accuracy at the amino acid level and 38.1-64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5-100% coverage and 97.2-99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any sources of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach in solving optimization problems by using deep learning and dynamic programming.
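The two accuracy levels quoted in the abstract above, amino acid level and peptide level, can be sketched as follows. This is a simplified position-wise definition chosen for illustration and may differ from the metric used in the paper; the function names and the example sequences are hypothetical.

```python
def amino_acid_accuracy(predicted, truth):
    """Fraction of ground-truth residues matched at the same position."""
    matched = total = 0
    for pred, true in zip(predicted, truth):
        total += len(true)
        matched += sum(p == t for p, t in zip(pred, true))
    return matched / total


def peptide_accuracy(predicted, truth):
    """Fraction of peptides predicted exactly, residue for residue."""
    exact = sum(p == t for p, t in zip(predicted, truth))
    return exact / len(truth)


truth = ["PEPTIDE", "GASVLEK"]
predicted = ["PEPTIDE", "GASLVEK"]   # second prediction swaps two residues

aa_acc = amino_acid_accuracy(predicted, truth)
pep_acc = peptide_accuracy(predicted, truth)
print(round(aa_acc, 3), pep_acc)
```

The example shows why the two levels diverge: a single swapped residue pair costs only two of fourteen residues at the amino acid level but invalidates the whole peptide at the peptide level.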
|
28
|
Burke MC, Mirokhin YA, Tchekhovskoi DV, Markey SP, Heidbrink Thompson J, Larkin C, Stein SE. The Hybrid Search: A Mass Spectral Library Search Method for Discovery of Modifications in Proteomics. J Proteome Res 2017; 16:1924-1935. [DOI: 10.1021/acs.jproteome.6b00988] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Indexed: 12/31/2022]
Affiliation(s)
- Meghan C. Burke
- Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
| | - Yuri A. Mirokhin
- Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
| | - Dmitrii V. Tchekhovskoi
- Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
| | - Sanford P. Markey
- Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
| | - Jenny Heidbrink Thompson
- Analytical Sciences, MedImmune LLC, One MedImmune Way, Gaithersburg, Maryland 20878, United States
| | - Christopher Larkin
- Analytical Sciences, MedImmune LLC, One MedImmune Way, Gaithersburg, Maryland 20878, United States
| | - Stephen E. Stein
- Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
|
29
|
Guan X, Brownstein NC, Young NL, Marshall AG. Ultrahigh-resolution Fourier transform ion cyclotron resonance mass spectrometry and tandem mass spectrometry for peptide de novo amino acid sequencing for a seven-protein mixture by paired single-residue transposed Lys-N and Lys-C digestion. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2017; 31:207-217. [PMID: 27813191 DOI: 10.1002/rcm.7783] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/29/2016] [Revised: 10/29/2016] [Accepted: 10/30/2016] [Indexed: 06/06/2023]
Abstract
RATIONALE Bottom-up tandem mass spectrometry (MS/MS) is regularly used in proteomics to identify proteins from a sequence database. De novo sequencing is also available for sequencing peptides with relatively short sequence lengths. We recently showed that paired Lys-C and Lys-N proteases produce peptides of identical mass and similar retention time, but different tandem mass spectra. Such parallel experiments provide complementary information, and allow for up to 100% MS/MS sequence coverage. METHODS Here, we report digestion by paired Lys-C and Lys-N proteases of a seven-protein mixture: human hemoglobin alpha, bovine carbonic anhydrase 2, horse skeletal muscle myoglobin, hen egg white lysozyme, bovine pancreatic ribonuclease, bovine rhodanese, and bovine serum albumin, followed by reversed-phase nanoflow liquid chromatography, collision-induced dissociation, and 14.5 T Fourier transform ion cyclotron resonance mass spectrometry. RESULTS Matched pairs of product peptide ions of equal precursor mass and similar retention times from each digestion are compared, leveraging single-residue transposed information with independent interferences to confidently identify fragment ion types, residues, and peptides. Selected pairs of product ion mass spectra for de novo sequenced protein segments from each member of the mixture are presented. CONCLUSIONS Pairs of the transposed product ions as well as complementary information from the parallel experiments allow for both high MS/MS coverage for long peptide sequences and high confidence in the amino acid identification. Moreover, the parallel experiments in the de novo sequencing reduce false-positive matches of product ions from the single-residue transposed peptides from the same segment, and thereby further improve the confidence in protein identification. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Xiaoyan Guan
- Ion Cyclotron Resonance Program, National High Magnetic Field Laboratory, Florida State University, 1800 East Paul Dirac Drive, Tallahassee, FL, 32310, USA
| | - Naomi C Brownstein
- Department of Behavioral Sciences and Social Medicine, College of Medicine, Florida State University, 1115 W. Call St., Tallahassee, FL, 32306, USA
- Department of Statistics, Florida State University, 117 N. Woodward Ave., Tallahassee, FL, 32306, USA
| | - Nicolas L Young
- Verna & Marrs McLean Department of Biochemistry & Molecular Biology, Baylor College of Medicine, One Baylor Plaza, MS-125, Houston, TX, 77030-3411, USA
| | - Alan G Marshall
- Ion Cyclotron Resonance Program, National High Magnetic Field Laboratory, Florida State University, 1800 East Paul Dirac Drive, Tallahassee, FL, 32310, USA
- Department of Chemistry and Biochemistry, Florida State University, 95 Chieftain Way, Tallahassee, FL, 32303, USA
|
30
|
Yang H, Chi H, Zhou WJ, Zeng WF, He K, Liu C, Sun RX, He SM. Open-pNovo: De Novo Peptide Sequencing with Thousands of Protein Modifications. J Proteome Res 2017; 16:645-654. [PMID: 28019094 DOI: 10.1021/acs.jproteome.6b00716] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 01/18/2023]
Abstract
De novo peptide sequencing has improved remarkably, but sequencing full-length peptides with unexpected modifications is still a challenging problem. Here we present an open de novo sequencing tool, Open-pNovo, for de novo sequencing of peptides with arbitrary types of modifications. Although the search space increases by ∼300 times, Open-pNovo is close to or even ∼10-times faster than the other three proposed algorithms. Furthermore, considering top-1 candidates on three MS/MS data sets, Open-pNovo can recall over 90% of the results obtained by any one traditional algorithm and report 5-87% more peptides, including 14-250% more modified peptides. On a high-quality simulated data set, ∼85% peptides with arbitrary modifications can be recalled by Open-pNovo, while hardly any results can be recalled by others. In summary, Open-pNovo is an excellent tool for open de novo sequencing and has great potential for discovering unexpected modifications in the real biological applications.
Affiliation(s)
- Hao Yang
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hao Chi
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
| | - Wen-Jing Zhou
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Wen-Feng Zeng
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Kun He
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Chao Liu
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
| | - Rui-Xiang Sun
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
| | - Si-Min He
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
|
31
|
Islam MT, Mohamedali A, Fernandes CS, Baker MS, Ranganathan S. De Novo Peptide Sequencing: Deep Mining of High-Resolution Mass Spectrometry Data. Methods Mol Biol 2017; 1549:119-134. [PMID: 27975288 DOI: 10.1007/978-1-4939-6740-7_10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 06/06/2023]
Abstract
High resolution mass spectrometry has revolutionized proteomics over the past decade, resulting in tremendous amounts of data in the form of mass spectra, being generated in a relatively short span of time. The mining of this spectral data for analysis and interpretation though has lagged behind such that potentially valuable data is being overlooked because it does not fit into the mold of traditional database searching methodologies. Although the analysis of spectra by de novo sequences removes such biases and has been available for a long period of time, its uptake has been slow or almost nonexistent within the scientific community. In this chapter, we propose a methodology to integrate de novo peptide sequencing using three commonly available software solutions in tandem, complemented by homology searching, and manual validation of spectra. This simplified method would allow greater use of de novo sequencing approaches and potentially greatly increase proteome coverage leading to the unearthing of valuable insights into protein biology, especially of organisms whose genomes have been recently sequenced or are poorly annotated.
Affiliation(s)
- Mohammad Tawhidul Islam
- Department of Chemistry and Biomolecular Sciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, 2109, Australia
| | - Abidali Mohamedali
- Department of Chemistry and Biomolecular Sciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, 2109, Australia
- Department of Biomedical Sciences, Faculty of Medicine and Health Sciences, Macquarie University, Sydney, NSW, 2109, Australia
| | - Criselda Santan Fernandes
- Department of Chemistry and Biomolecular Sciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, 2109, Australia
| | - Mark S Baker
- Department of Biomedical Sciences, Faculty of Medicine and Health Sciences, Macquarie University, Sydney, NSW, 2109, Australia
| | - Shoba Ranganathan
- Department of Chemistry and Biomolecular Sciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, 2109, Australia.
|
32
|
Abstract
Recent advances in high-resolution tandem mass spectrometry (MS) have resulted in the accumulation of high-quality data. In parallel with these advances in instrumentation, bioinformatics software has been developed to analyze such datasets. In spite of these advances, data analysis in mass spectrometry still remains critical for protein identification. In addition, the complexity of the generated MS/MS spectra, the unpredictable nature of peptide fragmentation, sequence annotation errors, and posttranslational modifications have impeded the protein identification process. In a typical MS data analysis, about 60% of the MS/MS spectra remain unassigned. While some of these can be attributed to the low quality of the MS/MS spectra, a proportion can be classified as high quality. Further analysis may reveal how much of the unassigned MS spectra can be attributed to search space, sequence annotation errors, mutations, and/or posttranslational modifications. In this chapter, the tools used to identify proteins and ways to assign unassigned tandem MS spectra are discussed.
Affiliation(s)
- Mohashin Pathan
- Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC, 3086, Australia
| | - Monisha Samuel
- Department of Physiology, Anatomy and Microbiology, La Trobe University, Bundoora, Melbourne, VIC, 3086, Australia
| | - Shivakumar Keerthikumar
- Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC, 3086, Australia
| | - Suresh Mathivanan
- Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC, 3086, Australia.
|
33
|
Tandem Mass Spectrum Sequencing: An Alternative to Database Search Engines in Shotgun Proteomics. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2016. [PMID: 27975219 DOI: 10.1007/978-3-319-41448-5_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1]
Abstract
Protein identification via database searches has become the gold standard in mass spectrometry based shotgun proteomics. However, as the quality of tandem mass spectra improves, direct mass spectrum sequencing gains interest as a database-independent alternative. In this chapter, the general principle of this so-called de novo sequencing is introduced along with pitfalls and challenges of the technique. The main tools available are presented with a focus on user friendly open source software which can be directly applied in everyday proteomic workflows.
|
34
|
Fomin E. A Simple Approach to the Reconstruction of a Set of Points from the Multiset of n² Pairwise Distances in n² Steps for the Sequencing Problem: II. Algorithm. J Comput Biol 2016; 23:934-942. [DOI: 10.1089/cmb.2016.0046] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022] Open
Affiliation(s)
- Eduard Fomin
- Institute of Cytology and Genetics, SB RAS, Novosibirsk, Russia
|
35
|
Ma B. De novo Peptide Sequencing. PROTEOME INFORMATICS 2016:15-38. [DOI: 10.1039/9781782626732-00015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
De novo peptide sequencing refers to the process of determining a peptide’s amino acid sequence from its MS/MS spectrum alone. The principle of this process is fairly straightforward: a high-quality spectrum may present a ladder of fragment ion peaks. The mass difference between every two adjacent peaks in the ladder is used to determine a residue of the peptide. However, most practical spectra do not have sufficient quality to support this straightforward process. Therefore, research in de novo sequencing has largely been a battle against the errors in the data. This chapter reviews some of the major developments in this field. The chapter starts with a quick review of the history in Section 1. Then manual de novo sequencing is examined in Section 2. Section 3 introduces a few commonly used de novo sequencing algorithms. An important aspect of automated de novo sequencing software is a good scoring function that serves as the optimization goal of the algorithm. Thus, Section 4 is devoted to methods for defining good scoring functions. Section 5 reviews a list of relevant software. The chapter concludes with a discussion of the applications and limitations of de novo sequencing in Section 6.
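As a rough illustration of the ladder principle described in this abstract, the Python sketch below turns the gaps between adjacent peaks of a single ion series into residues. It is not taken from the chapter: the function name read_ladder, the 0.01 Da tolerance, and the example peak list (singly charged b ions of a hypothetical peptide PEAK) are assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values). Leucine and isoleucine
# are isobaric, so "L" stands for either residue here.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.01):
    """Translate gaps between adjacent ladder peaks into residue letters.

    peaks: m/z values assumed to belong to one singly charged ion series.
    Returns one letter per gap, or '?' when no residue mass fits within tol.
    """
    ordered = sorted(peaks)
    residues = []
    for low, high in zip(ordered, ordered[1:]):
        gap = high - low
        # Pick the residue whose mass is closest to the observed gap.
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - gap))
        residues.append(best if abs(RESIDUE_MASS[best] - gap) <= tol else "?")
    return "".join(residues)


# Hypothetical b1..b4 ions of PEAK; the first residue cannot be read from a
# gap because b1 has no preceding peak in the ladder.
b_ions = [98.06004, 227.10263, 298.13974, 426.23470]
print(read_ladder(b_ions))  # EAK
```

A missing peak shows up as a gap that matches no single residue, which is one concrete form of the data errors the abstract refers to; practical tools must then consider residue pairs, modifications, or alternative ion series.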
Affiliation(s)
- Bin Ma
- School of Computer Science, University of Waterloo Canada
|
36
|
Gorshkov V, Hotta SYK, Verano-Braga T, Kjeldsen F. Peptide de novo sequencing of mixture tandem mass spectra. Proteomics 2016; 16:2470-9. [PMID: 27329701 PMCID: PMC5297990 DOI: 10.1002/pmic.201500549] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 12/26/2015] [Revised: 04/27/2016] [Accepted: 06/17/2016] [Indexed: 02/02/2023]
Abstract
The impact of mixture spectra deconvolution on the performance of four popular de novo sequencing programs was tested using artificially constructed mixture spectra as well as experimental proteomics data. Mixture fragmentation spectra are recognized as a limitation in proteomics because they decrease the identification performance using database search engines. De novo sequencing approaches are expected to be even more sensitive to the reduction in mass spectrum quality resulting from peptide precursor co‐isolation and thus prone to false identifications. The deconvolution approach matched complementary b‐, y‐ions to each precursor peptide mass, which allowed the creation of virtual spectra containing sequence specific fragment ions of each co‐isolated peptide. Deconvolution processing resulted in equally efficient identification rates but increased the absolute number of correctly sequenced peptides. The improvement was in the range of 20–35% additional peptide identifications for a HeLa lysate sample. Some correct sequences were identified only using unprocessed spectra; however, the number of these was lower than those where improvement was obtained by mass spectral deconvolution. Tight candidate peptide score distribution and high sensitivity to small changes in the mass spectrum introduced by the employed deconvolution method could explain some of the missing peptide identifications.
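The idea of assigning complementary b- and y-ions to each co-isolated precursor can be sketched as follows. This is a simplified illustration, not the deconvolution procedure used in the paper: it assumes singly charged fragments, for which a complementary pair sums to the neutral peptide mass plus two proton masses, and the function names, tolerance, and peak values are hypothetical.

```python
PROTON = 1.007276  # proton mass in Da


def complementary_pairs(peaks, neutral_mass, tol=0.02):
    """Find peak pairs whose m/z values sum to neutral_mass + 2 protons.

    Uses a two-pointer scan over the sorted peak list, which is adequate for
    a sketch but may miss pairs when several peaks fall within tol.
    """
    target = neutral_mass + 2 * PROTON
    ordered = sorted(peaks)
    pairs = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        total = ordered[lo] + ordered[hi]
        if abs(total - target) <= tol:
            pairs.append((ordered[lo], ordered[hi]))
            lo += 1
            hi -= 1
        elif total < target:
            lo += 1
        else:
            hi -= 1
    return pairs


def split_mixture(peaks, precursor_masses, tol=0.02):
    """Build one 'virtual spectrum' (list of pairs) per precursor mass."""
    return {m: complementary_pairs(peaks, m, tol) for m in precursor_masses}


# Hypothetical mixture: fragments of a peptide with neutral mass 443.237985 Da,
# one pair from a second precursor of 500.0 Da, and one noise peak (300.5).
mixture = [98.06004, 347.19250, 227.10263, 218.14991, 150.0, 352.01455, 300.5]
virtual = split_mixture(mixture, [443.237985, 500.0])
for mass, pairs in virtual.items():
    print(mass, pairs)
```

Peaks that pair up for neither precursor, such as the noise peak here, are left out of both virtual spectra, which illustrates why such processing changes the input seen by downstream de novo tools.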
Affiliation(s)
- Vladimir Gorshkov
- Department of Biochemistry and Molecular Biology, University of Southern Denmark Odense M, Odense, Denmark.
| | | | - Thiago Verano-Braga
- Department of Biochemistry and Molecular Biology, University of Southern Denmark Odense M, Odense, Denmark; Department of Physiology and Biophysics, Federal University of Minas Gerais Belo Horizonte - MG, Belo Horizonte, Brazil
| | - Frank Kjeldsen
- Department of Biochemistry and Molecular Biology, University of Southern Denmark Odense M, Odense, Denmark
|
37
|
Engler MS, Scheubert K, Schubert US, Böcker S. New Statistical Models for Copolymerization. Polymers (Basel) 2016; 8:E240. [PMID: 30979335 PMCID: PMC6432000 DOI: 10.3390/polym8060240] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/09/2016] [Revised: 06/07/2016] [Accepted: 06/15/2016] [Indexed: 11/16/2022] Open
Abstract
For many years, copolymerization has been studied using mathematical and statistical models. Here, we present new Markov chain models for copolymerization kinetics: the Bernoulli and Geometric models. They model copolymer synthesis as a random process and are based on a basic reaction scheme. In contrast to previous Markov chain approaches to copolymerization, both models take variable chain lengths and time-dependent monomer probabilities into account and allow for computing sequence likelihoods and copolymer fingerprints. Fingerprints can be computed from copolymer mass spectra, potentially allowing us to estimate the model parameters from measured fingerprints. We compare both models against Monte Carlo simulations. We find that computing the models is fast and memory efficient.
Affiliation(s)
- Martin S Engler
- Chair of Bioinformatics, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany.
| | - Kerstin Scheubert
- Chair of Bioinformatics, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany.
| | - Ulrich S Schubert
- Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstr. 10, 07743 Jena, Germany.
- Jena Center for Soft Matter (JCMS), Friedrich Schiller University Jena, Philosophenweg 7, 07743 Jena, Germany.
| | - Sebastian Böcker
- Chair of Bioinformatics, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743 Jena, Germany.
- Jena Center for Soft Matter (JCMS), Friedrich Schiller University Jena, Philosophenweg 7, 07743 Jena, Germany.
|
38
|
Vinciguerra R, De Chiaro A, Pucci P, Marino G, Birolo L. Proteomic strategies for cultural heritage: From bones to paintings. Microchem J 2016. [DOI: 10.1016/j.microc.2015.12.024] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Indexed: 01/18/2023]
|
39
|
Robotham SA, Horton AP, Cannon JR, Cotham VC, Marcotte EM, Brodbelt JS. UVnovo: A de Novo Sequencing Algorithm Using Single Series of Fragment Ions via Chromophore Tagging and 351 nm Ultraviolet Photodissociation Mass Spectrometry. Anal Chem 2016; 88:3990-7. [PMID: 26938041 PMCID: PMC4850734 DOI: 10.1021/acs.analchem.6b00261] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Indexed: 01/04/2023]
Abstract
De novo peptide sequencing by mass spectrometry represents an important strategy for characterizing novel peptides and proteins, in which a peptide's amino acid sequence is inferred directly from the precursor peptide mass and tandem mass spectrum (MS/MS or MS(3)) fragment ions, without comparison to a reference proteome. This method is ideal for organisms or samples lacking a complete or well-annotated reference sequence set. One of the major barriers to de novo spectral interpretation arises from confusion of N- and C-terminal ion series due to the symmetry between b and y ion pairs created by collisional activation methods (or c, z ions for electron-based activation methods). This is known as the "antisymmetric path problem" and leads to inverted amino acid subsequences within a de novo reconstruction. Here, we combine several key strategies for de novo peptide sequencing into a single high-throughput pipeline: high-efficiency carbamylation blocks lysine side chains, and subsequent tryptic digestion and N-terminal peptide derivatization with the ultraviolet chromophore AMCA yield peptides susceptible to 351 nm ultraviolet photodissociation (UVPD). UVPD-MS/MS of the AMCA-modified peptides then predominantly produces y ions in the MS/MS spectra, specifically addressing the antisymmetric path problem. Finally, the program UVnovo applies a random forest algorithm to automatically learn from and then interpret UVPD mass spectra, passing results to a hidden Markov model for de novo sequence prediction and scoring. We show this combined strategy provides high-performance de novo peptide sequencing, enabling the de novo sequencing of thousands of peptides from an Escherichia coli lysate at high confidence.
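The b/y symmetry behind the antisymmetric path problem can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not part of UVnovo: it computes singly charged b and y ions for a hypothetical peptide PEAK using standard monoisotopic masses, and the function name b_and_y_ions is invented for the example.

```python
PROTON = 1.007276   # Da
WATER = 18.010565   # Da
# Only the residues needed for the example peptide (monoisotopic masses, Da).
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "A": 71.03711, "K": 128.09496}


def b_and_y_ions(peptide):
    """Return singly charged b1..b(n-1) and y1..y(n-1) m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, running = [], PROTON
    for m in masses[:-1]:            # prefixes, N-terminal side
        running += m
        b_ions.append(running)
    y_ions, running = [], WATER + PROTON
    for m in reversed(masses[1:]):   # suffixes, C-terminal side
        running += m
        y_ions.append(running)
    return b_ions, y_ions


b_ions, y_ions = b_and_y_ions("PEAK")
# b_i and y_(n-i) always sum to the neutral peptide mass plus two protons,
# so every peak has a mirror position that could explain it equally well.
for b, y in zip(b_ions, reversed(y_ions)):
    print(round(b, 4), round(y, 4), round(b + y, 4))
```

Because a peak at m/z m and its mirror at (total minus m) are indistinguishable from mass alone, a path through the spectrum may mix prefix and suffix masses; a spectrum dominated by a single ion series, as the abstract describes for UVPD of AMCA-tagged peptides, removes that ambiguity.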
Affiliation(s)
- Scott A Robotham
- Department of Chemistry, University of Texas, Austin, Texas 78712, United States
| | - Andrew P Horton
- Center for Systems and Synthetic Biology, Department of Molecular Biosciences, University of Texas, Austin, Texas 78712, United States
| | - Joe R Cannon
- Department of Chemistry, University of Texas, Austin, Texas 78712, United States
| | - Victoria C Cotham
- Department of Chemistry, University of Texas, Austin, Texas 78712, United States
| | - Edward M Marcotte
- Center for Systems and Synthetic Biology, Department of Molecular Biosciences, University of Texas, Austin, Texas 78712, United States
| | - Jennifer S Brodbelt
- Department of Chemistry, University of Texas, Austin, Texas 78712, United States
|
40
|
Liu Y, Sun W, John J, Lajoie G, Ma B, Zhang K. De Novo Sequencing Assisted Approach for Characterizing Mixture MS/MS Spectra. IEEE Trans Nanobioscience 2016; 15:166-76. [PMID: 26800542 DOI: 10.1109/tnb.2016.2519841] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Extensive research has been conducted on the computational analysis of mass spectrometry-based proteomics data. However, challenges remain, one of which is the low identification rate of the collected spectral data. A specific contributing factor is the existence of mixture spectra among the collected MS/MS spectra, which are generated by the concurrent fragmentation of multiple precursors in one sequencing attempt. The frequently observed mixture spectra necessitate the development of effective computational approaches to characterize such non-conventional spectral data. In this research, we proposed an approach for matching the query mixture spectra with a pair of peptide sequences acquired from the protein database by incorporating a special de novo assisted filtration strategy. The experimental results on two different datasets of MS/MS spectra containing mixed ion fragments from multiple peptides demonstrated the efficiency of the integrated filtration strategy in reducing the examination space and verified the effectiveness of the proposed matching scheme.
|
41
|
Affiliation(s)
- Jennifer S Brodbelt
- Department of Chemistry, University of Texas at Austin, Austin, Texas 78712, United States
|
42
|
Ma B. Peptide De Novo Sequencing with MS/MS. ENCYCLOPEDIA OF ALGORITHMS 2016:1545-1547. [DOI: 10.1007/978-1-4939-2864-4_286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
|
43
|
Ma B. Novor: real-time peptide de novo sequencing software. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2015; 26:1885-94. [PMID: 26122521 PMCID: PMC4604512 DOI: 10.1007/s13361-015-1204-0] [Citation(s) in RCA: 123] [Impact Index Per Article: 13.7] [Received: 02/12/2015] [Revised: 05/12/2015] [Accepted: 05/17/2015] [Indexed: 05/09/2023]
Abstract
De novo sequencing software has been widely used in proteomics to sequence new peptides from tandem mass spectrometry data. This study presents a new software tool, Novor, to greatly improve both the speed and accuracy of today's peptide de novo sequencing analyses. To improve the accuracy, Novor's scoring functions are based on two large decision trees built from a peptide spectral library with more than 300,000 spectra with machine learning. Important knowledge about peptide fragmentation is extracted automatically from the library and incorporated into the scoring functions. The decision tree model also enables efficient score calculation and contributes to the speed improvement. To further improve the speed, a two-stage algorithmic approach, namely dynamic programming and refinement, is used. The software program was also carefully optimized. On the testing datasets, Novor sequenced 7%-37% more correct residues than the state-of-the-art de novo sequencing tool, PEAKS, while being an order of magnitude faster. Novor can de novo sequence more than 300 MS/MS spectra per second on a laptop computer. The speed surpasses the acquisition speed of today's mass spectrometer and, therefore, opens a new possibility to de novo sequence in real time while the spectrometer is acquiring the spectral data.
Affiliation(s)
- Bin Ma
- School of Computer Science, University of Waterloo, 200 University Ave. W., Waterloo, ON, N2L3G1, Canada.
|
44
|
Song Y, Chi AY. Peptide sequencing via graph path decomposition. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2015.01.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
45
|
Ma B. Peptide De Novo Sequencing with MS/MS. ENCYCLOPEDIA OF ALGORITHMS 2015:1-4. [DOI: 10.1007/978-3-642-27848-8_286-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/14/2015] [Accepted: 01/14/2015] [Indexed: 09/01/2023]
|
46
|
Wang X, Li Y, Wu Z, Wang H, Tan H, Peng J. JUMP: a tag-based database search tool for peptide identification with high sensitivity and accuracy. Mol Cell Proteomics 2014; 13:3663-73. [PMID: 25202125 DOI: 10.1074/mcp.o114.039586] [Citation(s) in RCA: 102] [Impact Index Per Article: 10.2] [Indexed: 11/06/2022] Open
Abstract
Database search programs are essential tools for identifying peptides via mass spectrometry (MS) in shotgun proteomics. Simultaneously achieving high sensitivity and high specificity during a database search is crucial for improving proteome coverage. Here we present JUMP, a new hybrid database search program that generates amino acid tags and ranks peptide spectrum matches (PSMs) by an integrated score from the tags and pattern matching. In a typical run of liquid chromatography coupled with high-resolution tandem MS, more than 95% of MS/MS spectra can generate at least one tag, whereas the remaining spectra are usually too poor to derive genuine PSMs. To enhance search sensitivity, the JUMP program enables the use of tags as short as one amino acid. Using a target-decoy strategy, we compared JUMP with other programs (e.g. SEQUEST, Mascot, PEAKS DB, and InsPecT) in the analysis of multiple datasets and found that JUMP outperformed these preexisting programs. JUMP also permitted the analysis of multiple co-fragmented peptides from "mixture spectra" to further increase PSMs. In addition, JUMP-derived tags allowed partial de novo sequencing and facilitated the unambiguous assignment of modified residues. In summary, JUMP is an effective database search algorithm complementary to current search programs.
Affiliation(s)
- Xusheng Wang
- St. Jude Proteomics Facility, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
| | - Yuxin Li
- Departments of Structural Biology and Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
| | - Zhiping Wu
- Departments of Structural Biology and Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
| | - Hong Wang
- Departments of Structural Biology and Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105; Integrated Biomedical Sciences Program, The University of Tennessee Health Science Center, Memphis, Tennessee 38163
| | - Haiyan Tan
- St. Jude Proteomics Facility, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
| | - Junmin Peng
- St. Jude Proteomics Facility, St. Jude Children's Research Hospital, Memphis, Tennessee 38105; Departments of Structural Biology and Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105
|
47
|
Leprevost FV, Valente RH, Lima DB, Perales J, Melani R, Yates JR, Barbosa VC, Junqueira M, Carvalho PC. PepExplorer: a similarity-driven tool for analyzing de novo sequencing results. Mol Cell Proteomics 2014; 13:2480-9. [PMID: 24878498 DOI: 10.1074/mcp.m113.037002] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Indexed: 11/06/2022] Open
Abstract
Peptide spectrum matching is the current gold standard for protein identification via mass-spectrometry-based proteomics. Peptide spectrum matching compares experimental mass spectra against theoretical spectra generated from a protein sequence database to perform identification, but protein sequences not present in a database cannot be identified unless their sequences are in part conserved. The alternative approach, de novo sequencing, can make it possible to infer a peptide sequence directly from a mass spectrum, but interpreting long lists of peptide sequences resulting from large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologue proteins using de novo sequencing data coupled to sequence alignment to allow biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates, from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith-Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching for Pyrococcus furiosus mass spectra on the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. Finally, we employed PepExplorer to disclose a comprehensive proteomic assessment of the Bothrops jararaca plasma, a known biological source of natural inhibitors of snake toxins. PepExplorer is integrated into the PatternLab for Proteomics environment, which makes available various tools for downstream data analysis, including resources for quantitative and differential proteomics.
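For readers unfamiliar with the alignment step mentioned in this abstract, the sketch below shows the textbook Smith-Waterman local alignment in Python. It is explicitly not PepExplorer's modified variant, whose details are not given here: the scoring values (match 2, mismatch -1, linear gap -2) and the example sequences are assumptions, and a realistic tool would use a substitution matrix and handle isobaric residues.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Textbook Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b) for one optimal local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]   # scoring matrix, first row/col = 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# A hypothetical de novo peptide aligned against a longer database sequence.
print(smith_waterman("PEPTIDE", "XXPEPTIDEXX"))  # (14, 'PEPTIDE', 'PEPTIDE')
```

Local alignment suits this use case because a de novo peptide is short relative to a protein and may match only part of a homologous sequence, so flanking mismatches should not be penalized.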
Affiliation(s)
- Felipe V Leprevost
- Laboratory for Proteomics and Protein Engineering, Carlos Chagas Institute, Fiocruz, Paraná, Brazil
| | - Richard H Valente
- Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro, Brazil; Instituto Nacional de Ciência e Tecnologia em Toxinas (INCTTox/CNPq), Brazil
| | - Diogo B Lima
- Laboratory for Proteomics and Protein Engineering, Carlos Chagas Institute, Fiocruz, Paraná, Brazil
| | - Jonas Perales
- Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro, Brazil; Instituto Nacional de Ciência e Tecnologia em Toxinas (INCTTox/CNPq), Brazil
| | - Rafael Melani
- Proteomics Unit, Rio de Janeiro Proteomics Network, Department of Biochemistry, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - John R Yates
- Department of Chemical Physiology, The Scripps Research Institute, La Jolla, California
| | - Valmir C Barbosa
- Systems Engineering and Computer Science Program, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - Magno Junqueira
- Proteomics Unit, Rio de Janeiro Proteomics Network, Department of Biochemistry, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
| | - Paulo C Carvalho
- Laboratory for Proteomics and Protein Engineering, Carlos Chagas Institute, Fiocruz, Paraná, Brazil
|
48
|
Abstract
De novo sequencing is an important computational approach to determining the amino acid sequence of a peptide with tandem mass spectrometry (MS/MS). Most of the existing approaches use a graph model to describe a spectrum and the sequencing is performed by computing the longest antisymmetric path in the graph. The task is often computationally intensive since a given MS/MS spectrum often contains noisy data, missing mass peaks, or post translational modifications/mutations. This paper develops a new parameterized algorithm that can efficiently compute the longest antisymmetric partial path in an extended spectrum graph that is of bounded path width. Our testing results show that this algorithm can efficiently process experimental spectra and provide sequencing results of high accuracy.
Affiliation(s)
- Yinglei Song
- School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, China
| |
|
49
|
Romero-Rodríguez MC, Pascual J, Valledor L, Jorrín-Novo J. Improving the quality of protein identification in non-model species. Characterization of Quercus ilex seed and Pinus radiata needle proteomes by using SEQUEST and custom databases. J Proteomics 2014; 105:85-91. [PMID: 24508333 DOI: 10.1016/j.jprot.2014.01.027] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 12/15/2013] [Accepted: 01/27/2014] [Indexed: 01/10/2023]
Abstract
The most widely used pipeline for protein identification currently consists of comparing MS/MS spectra to reference databases. Search algorithms compare the acquired spectra to an in silico digestion of a sequence database to find exact matches. In this context, the database is of paramount importance and largely determines the number and quality of identifications, which is especially relevant for non-model plant species. Using a single Viridiplantae database (NCBI, UniProt) and TAIR is not the best choice for non-model species, since they are underrepresented in databases, resulting in poor identification rates. We demonstrate how it is possible to improve the rate and quality of identifications in two orphan species, Quercus ilex and Pinus radiata, by using SEQUEST and a combination of public databases (Viridiplantae NCBI, UniProt) and a custom-built specific database, which contained 593,294 and 455,096 peptide sequences (Quercus and Pinus, respectively). These databases were built by gathering and processing (trimming, contiging, 6-frame translation) publicly available RNA sequences, mostly ESTs and NGS reads. A total of 149 and 1533 proteins were identified from Quercus seeds and Pinus needles, respectively, representing a 3.1- or 1.5-fold increase in the number of protein identifications and scores compared to the use of a single database. Since this approach greatly improves the identification rate and is not significantly more complicated or time-consuming than other approaches, we recommend its routine use when working with non-model species.
BIOLOGICAL SIGNIFICANCE In this work we demonstrate how the construction of a custom database (DB) gathering all available RNA sequences, and its use in combination with Viridiplantae public DBs (NCBI, UniProt), significantly improves protein identification when working with non-model species. Protein identification rate and quality are higher than those obtained in routine procedures based on using only one database (commonly Viridiplantae from NCBI), as we demonstrated by analyzing Quercus seeds and Pinus needles. The proposed approach, based on building a custom database, is not difficult or time-consuming, so we recommend its routine use when working with non-model species. This article is part of a Special Issue entitled: Proteomics of non-model organisms.
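The 6-frame translation step named in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `six_frame_translate` is hypothetical, the standard genetic code is assumed, input is assumed to be uppercase DNA containing only A, C, G, and T, and trimming and contig assembly are not shown.

```python
from itertools import product
from typing import Dict, List

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE: Dict[str, str] = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translate(dna: str) -> List[str]:
    """Three forward frames followed by three reverse-complement frames."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


example_frames = six_frame_translate("ATGGCC")
```

In a database-building setting, each translated frame would typically be split at stop codons and the resulting open reading frames filtered by length before being written to a FASTA file; those thresholds are choices left to the user.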
Affiliation(s)
- M Cristina Romero-Rodríguez
- Agricultural and Plant Biochemistry and Proteomics Research Group, Dept. of Biochemistry and Molecular Biology, University of Córdoba, Spain
| | - Jesús Pascual
- Plant Physiology, Faculty of Biology, Dept. of Organisms and Systems Biology, University of Oviedo, Spain
| | - Luis Valledor
- Dept. of Biology & Centre for Environmental and Marine Studies, University of Aveiro, Aveiro, Portugal; GCRC, Adaption Biotechnologies, Academy of Sciences of the Czech Republic, Brno, Czech Republic
| | - Jesús Jorrín-Novo
- Agricultural and Plant Biochemistry and Proteomics Research Group, Dept. of Biochemistry and Molecular Biology, University of Córdoba, Spain
| |
|
50
|
Terterov I, Vyatkina K, Kononikhin AS, Boitsov V, Vyazmin S, Popov IA, Nikolaev EN, Pevzner P, Dubina M. Application of de novo sequencing tools to study abiogenic peptide formations by tandem mass spectrometry. The case of homo-peptides from glutamic acid complicated by substitutions of hydrogen by sodium or potassium atoms. Rapid Communications in Mass Spectrometry 2014; 28:33-41. [PMID: 24285388 DOI: 10.1002/rcm.6757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2013] [Revised: 09/24/2013] [Accepted: 10/04/2013] [Indexed: 06/02/2023]
Abstract
RATIONALE Peptides and proteins are among the most important components of living systems. Various attempts have been made to experimentally model the formation of peptides from amino acid monomers in investigations of the origin of life. Detailed characterization of the peptides formed under various conditions in such reactions is very important for understanding the processes of abiogenic peptide formation. METHODS We used liquid chromatography coupled with tandem mass spectrometry (MS/MS) for an accurate study of homo-peptides formed in a model reaction: glutamic acid oligomerization catalyzed by 1,1'-carbonyldiimidazole in aqueous solution with 1 M sodium or potassium chloride and without any salts. We used de novo sequencing software for peptide identification. In addition, we propose an approach that uses more spectral information for de novo sequencing than standard methods. RESULTS Peptides up to 9 amino acids long were found in the experiments with KCl, while in experiments with NaCl and without salts only peptides of up to 7 amino acids were detected. Owing to the high salt concentrations in the samples, a large number of singly charged peptide ions with up to 4 substitutions of hydrogen atoms by sodium or potassium atoms were observed. De novo sequencing software provided correct identifications even for peptide ions with substitutions. CONCLUSIONS Multiple substitutions of hydrogen by alkali metal atoms in peptide ions strongly change their fragmentation patterns. The proposed approach for de novo sequencing was found to be very effective, even for ions with substitutions. It may therefore be useful in more complicated cases, such as sequencing abiogenic peptides consisting of different amino acids.
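To make the hydrogen-by-metal substitutions concrete, the following Python sketch computes the expected m/z of a singly protonated glutamic acid homo-peptide carrying a given number of H-to-Na or H-to-K substitutions. It is only a worked calculation from standard monoisotopic masses (rounded to five decimals), not the de novo approach proposed in the paper; the function name `homoglu_mz` and its parameters are assumptions for this example.

```python
# Monoisotopic masses in Da, rounded to five decimals.
GLU_RESIDUE = 129.04259   # glutamic acid residue (C5H7NO3)
WATER = 18.01056          # terminal H and OH of the peptide
PROTON = 1.00728          # charge carrier for a singly charged ion
HYDROGEN = 1.00783        # hydrogen atom replaced by the metal
METAL_MASS = {"Na": 22.98977, "K": 38.96371}


def homoglu_mz(n_residues: int, n_subs: int = 0, metal: str = "Na") -> float:
    """m/z of a singly charged (Glu)n ion with n_subs H-to-metal substitutions."""
    if n_residues < 1 or n_subs < 0:
        raise ValueError("n_residues must be >= 1 and n_subs must be >= 0")
    neutral = n_residues * GLU_RESIDUE + WATER
    shift_per_substitution = METAL_MASS[metal] - HYDROGEN
    return neutral + PROTON + n_subs * shift_per_substitution


# m/z ladder for a 7-residue homo-peptide with 0 to 4 sodium substitutions.
sodium_ladder = [round(homoglu_mz(7, k, "Na"), 4) for k in range(5)]
```

Each substitution shifts the ion by about 21.98 Da for sodium and about 37.96 Da for potassium, which is why spectra from high-salt samples show series of peaks for the same peptide.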
Affiliation(s)
- Ivan Terterov
- St. Petersburg Academic University Nanotechnology Research and Education Center RAS, 8/3 Khlopina st., St. Petersburg, 194021, Russia
| |
|