51. Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods 2014; 11:1114-25. [PMID: 25357241 DOI: 10.1038/nmeth.3144] [Citation(s) in RCA: 533] [Impact Index Per Article: 53.3] [Received: 05/21/2014] [Accepted: 09/22/2014] [Indexed: 12/19/2022]
Abstract
Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry-based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry-based proteomics, the pace of proteogenomic research has greatly accelerated. Here I review the current state of proteogenomic methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positive identifications in proteogenomics and provide guidelines for analyzing the data and reporting the results of proteogenomic studies.
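As a rough illustration of what "novel peptides (not present in reference protein sequence databases)" means in practice, the Python sketch below flags identified peptides that do not occur in any reference protein. The function name, the toy sequences, and the decision to treat isoleucine and leucine as interchangeable (they have the same mass) are assumptions made for this example; none of them come from the review.

```python
def find_novel_peptides(peptides, reference_proteins):
    """Return the peptides that are not a substring of any reference protein."""
    def normalize(seq):
        # Assumption: I and L are treated as equivalent because they are isobaric.
        return seq.upper().replace("I", "L")

    reference = [normalize(protein) for protein in reference_proteins]
    return [
        peptide
        for peptide in peptides
        if not any(normalize(peptide) in protein for protein in reference)
    ]


# Toy data (hypothetical sequences, for illustration only).
reference = ["MKTAYIAKQRQISFVK"]
identified = ["TAYLAK", "QISFVK", "TAYVAK"]
print(find_novel_peptides(identified, reference))
```

A real workflow would also have to control false positives among such candidates, which is the concern the abstract raises; this sketch only performs the sequence lookup.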
Affiliation(s)
- Alexey I Nesvizhskii
- [1] Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA. [2] Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
52. Crappé J, Ndah E, Koch A, Steyaert S, Gawron D, De Keulenaer S, De Meester E, De Meyer T, Van Criekinge W, Van Damme P, Menschaert G. PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res 2014; 43:e29. [PMID: 25510491 PMCID: PMC4357689 DOI: 10.1093/nar/gku1283] [Citation(s) in RCA: 110] [Impact Index Per Article: 10.0] [Indexed: 12/30/2022] [Open Access]
Abstract
An increasing amount of studies integrate mRNA sequencing data into MS-based proteomics to complement the translation product search space. However, several factors, including extensive regulation of mRNA translation and the need for three- or six-frame-translation, impede the use of mRNA-seq data for the construction of a protein sequence search database. With that in mind, we developed the PROTEOFORMER tool that automatically processes data of the recently developed ribosome profiling method (sequencing of ribosome-protected mRNA fragments), resulting in genome-wide visualization of ribosome occupancy. Our tool also includes a translation initiation site calling algorithm allowing the delineation of the open reading frames (ORFs) of all translation products. A complete protein synthesis-based sequence database can thus be compiled for mass spectrometry-based identification. This approach increases the overall protein identification rates with 3% and 11% (improved and new identifications) for human and mouse, respectively, and enables proteome-wide detection of 5′-extended proteoforms, upstream ORF translation and near-cognate translation start sites. The PROTEOFORMER tool is available as a stand-alone pipeline and has been implemented in the galaxy framework for ease of use.
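The six-frame translation that the abstract names as an obstacle to building search databases from mRNA-seq data can be sketched as follows. This is a minimal Python illustration of the generic brute-force technique, not of PROTEOFORMER itself; it assumes the standard genetic code, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; codons with non-ACGT bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to the translated sequence."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Every input sequence yields six candidate translations, which is why this approach inflates the search space compared with a database restricted to experimentally supported ORFs.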
Affiliation(s)
- Jeroen Crappé
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Elvis Ndah
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium; Department of Medical Protein Research, Flemish Institute of Biotechnology, Ghent, Belgium; Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
- Alexander Koch
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Sandra Steyaert
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Daria Gawron
- Department of Medical Protein Research, Flemish Institute of Biotechnology, Ghent, Belgium; Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
- Sarah De Keulenaer
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Ellen De Meester
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Tim De Meyer
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Wim Van Criekinge
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- Petra Van Damme
- Department of Medical Protein Research, Flemish Institute of Biotechnology, Ghent, Belgium; Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
- Gerben Menschaert
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
53. Kucharova V, Wiker HG. Proteogenomics in microbiology: taking the right turn at the junction of genomics and proteomics. Proteomics 2014; 14:2660-75. [PMID: 25263021 DOI: 10.1002/pmic.201400168] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 04/28/2014] [Revised: 08/18/2014] [Accepted: 09/23/2014] [Indexed: 12/14/2022]
Abstract
High-accuracy and high-throughput proteomic methods have completely changed the way we can identify and characterize proteins. MS-based proteomics can now provide a unique supplement to genomic data and add a new level of information to the interpretation of genomic sequences. Proteomics-driven genome annotation has become especially relevant in microbiology where genomes are sequenced on a daily basis and limitations of an in silico driven annotation process are well recognized. In this review paper, we outline different strategies on how one can design a proteogenomic experiment, for example on genome-sequenced (synonymous proteogenomics) versus unsequenced organisms (ortho-proteogenomics) or with the aid of other "omic" data such as RNA-seq. We touch upon many challenges that are encountered during a typical proteogenomic study, mostly concerning bioinformatics methods and downstream data analysis, but also related to creation and use of sequence databases. A large list of proteogenomic case studies of different microorganisms is provided to illustrate the mapping of MS/MS-derived peptide spectra to genomic DNA sequences. These investigations have led to accurate determination of translational initiation sites, pointed out eventual read-throughs or programmed frameshifts, detected signal peptide processing or other protein maturation events, removed questionable annotation assignments, and provided evidence for predicted hypothetical proteins.
Affiliation(s)
- Veronika Kucharova
- Department of Clinical Science, The Gade Research Group for Infection and Immunity, University of Bergen, Norway
54. Jagtap PD, Johnson JE, Onsongo G, Sadler FW, Murray K, Wang Y, Sheynkman GM, Bandhakavi S, Smith LM, Griffin TJ. Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. J Proteome Res 2014; 13:5898-908. [PMID: 25301683 PMCID: PMC4261978 DOI: 10.1021/pr500812t] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Indexed: 12/30/2022]
Abstract
Proteogenomics combines large-scale genomic and transcriptomic data with mass-spectrometry-based proteomic data to discover novel protein sequence variants and improve genome annotation. In contrast with conventional proteomic applications, proteogenomic analysis requires a number of additional data processing steps. Ideally, these required steps would be integrated and automated via a single software platform offering accessibility for wet-bench researchers as well as flexibility for user-specific customization and integration of new software tools as they emerge. Toward this end, we have extended the Galaxy bioinformatics framework to facilitate proteogenomic analysis. Using analysis of whole human saliva as an example, we demonstrate Galaxy’s flexibility through the creation of a modular workflow incorporating both established and customized software tools that improve depth and quality of proteogenomic results. Our customized Galaxy-based software includes automated, batch-mode BLASTP searching and a Peptide Sequence Match Evaluator tool, both useful for evaluating the veracity of putative novel peptide identifications. Our complex workflow (approximately 140 steps) can be easily shared using built-in Galaxy functions, enabling their use and customization by others. Our results provide a blueprint for the establishment of the Galaxy framework as an ideal solution for the emerging field of proteogenomics.
Affiliation(s)
- Pratik D Jagtap
- Center for Mass Spectrometry and Proteomics, University of Minnesota, 43 Gortner Laboratory, 1479 Gortner Avenue, St. Paul, Minnesota 55108, United States
55. Sheynkman GM, Johnson JE, Jagtap PD, Shortreed MR, Onsongo G, Frey BL, Griffin TJ, Smith LM. Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC Genomics 2014; 15:703. [PMID: 25149441 PMCID: PMC4158061 DOI: 10.1186/1471-2164-15-703] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Received: 04/04/2014] [Accepted: 08/12/2014] [Indexed: 12/30/2022] [Open Access]
Abstract
BACKGROUND: Current practice in mass spectrometry (MS)-based proteomics is to identify peptides by comparison of experimental mass spectra with theoretical mass spectra derived from a reference protein database; however, this strategy necessarily fails to detect peptide and protein sequences that are absent from the database. We and others have recently shown that customized proteomic databases derived from RNA-Seq data can be employed for MS-searching to both improve MS analysis and identify novel peptides. While this general strategy constitutes a significant advance for the discovery of novel protein variations, it has not been readily transferable to other laboratories due to the need for many specialized software tools. To address this problem, we have implemented readily accessible, modifiable, and extensible workflows within Galaxy-P, short for Galaxy for Proteomics, a web-based bioinformatic extension of the Galaxy framework for the analysis of multi-omics (e.g. genomics, transcriptomics, proteomics) data.
RESULTS: We present three bioinformatic workflows that allow the user to upload raw RNA sequencing reads and convert the data into high-quality customized proteomic databases suitable for MS searching. We show the utility of these workflows on human and mouse samples, identifying 544 peptides containing single amino acid polymorphisms (SAPs) and 187 peptides corresponding to unannotated splice junction peptides, correlating protein and transcript expression levels, and providing the option to incorporate transcript abundance measures within the MS database search process (reduced databases, incorporation of transcript abundance for protein identification score calculations, etc.).
CONCLUSIONS: Using RNA-Seq data to enhance MS analysis is a promising strategy to discover novel peptides specific to a sample and, more generally, to improve proteomics results. The main bottleneck for widespread adoption of this strategy has been the lack of easily used and modifiable computational tools. We provide a solution to this problem by introducing a set of workflows within the Galaxy-P framework that converts raw RNA-Seq data into customized proteomic databases.
Affiliation(s)
- Gloria M Sheynkman
- Chemistry Department, University of Wisconsin-Madison, 1101 University Ave., Madison, WI 53706 USA
- James E Johnson
- Minnesota Supercomputing Institute, University of Minnesota, 117 Pleasant St SE, Minneapolis, MN 55455 USA
- Pratik D Jagtap
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, 6-155 Jackson Hall, 321 Church Street SE, Minneapolis, MN 55455 USA
- Center for Mass Spectrometry and Proteomics, University of Minnesota, 43 Gortner Laboratory, 1479 Gortner Avenue, St. Paul, MN 55108 USA
- Michael R Shortreed
- Chemistry Department, University of Wisconsin-Madison, 1101 University Ave., Madison, WI 53706 USA
- Getiria Onsongo
- Minnesota Supercomputing Institute, University of Minnesota, 117 Pleasant St SE, Minneapolis, MN 55455 USA
- Brian L Frey
- Chemistry Department, University of Wisconsin-Madison, 1101 University Ave., Madison, WI 53706 USA
- Timothy J Griffin
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, 6-155 Jackson Hall, 321 Church Street SE, Minneapolis, MN 55455 USA
- Center for Mass Spectrometry and Proteomics, University of Minnesota, 43 Gortner Laboratory, 1479 Gortner Avenue, St. Paul, MN 55108 USA
- Lloyd M Smith
- Chemistry Department, University of Wisconsin-Madison, 1101 University Ave., Madison, WI 53706 USA
- Genome Center, University of Wisconsin-Madison, 111 University Ave, Madison, WI 53705 USA
56. Kelkar DS, Provost E, Chaerkady R, Muthusamy B, Manda SS, Subbannayya T, Selvan LDN, Wang CH, Datta KK, Woo S, Dwivedi SB, Renuse S, Getnet D, Huang TC, Kim MS, Pinto SM, Mitchell CJ, Madugundu AK, Kumar P, Sharma J, Advani J, Dey G, Balakrishnan L, Syed N, Nanjappa V, Subbannayya Y, Goel R, Prasad TSK, Bafna V, Sirdeshmukh R, Gowda H, Wang C, Leach SD, Pandey A. Annotation of the zebrafish genome through an integrated transcriptomic and proteomic analysis. Mol Cell Proteomics 2014; 13:3184-98. [PMID: 25060758 DOI: 10.1074/mcp.m114.038299] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/06/2022] [Open Access]
Abstract
Accurate annotation of protein-coding genes is one of the primary tasks upon the completion of whole genome sequencing of any organism. In this study, we used an integrated transcriptomic and proteomic strategy to validate and improve the existing zebrafish genome annotation. We undertook high-resolution mass-spectrometry-based proteomic profiling of 10 adult organs, whole adult fish body, and two developmental stages of zebrafish (SAT line), in addition to transcriptomic profiling of six organs. More than 7,000 proteins were identified from proteomic analyses, and ∼ 69,000 high-confidence transcripts were assembled from the RNA sequencing data. Approximately 15% of the transcripts mapped to intergenic regions, the majority of which are likely long non-coding RNAs. These high-quality transcriptomic and proteomic data were used to manually reannotate the zebrafish genome. We report the identification of 157 novel protein-coding genes. In addition, our data led to modification of existing gene structures including novel exons, changes in exon coordinates, changes in frame of translation, translation in annotated UTRs, and joining of genes. Finally, we discovered four instances of genome assembly errors that were supported by both proteomic and transcriptomic data. Our study shows how an integrative analysis of the transcriptome and the proteome can extend our understanding of even well-annotated genomes.
Affiliation(s)
- Dhanashree S Kelkar
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Elayne Provost
- Department of Surgery, Johns Hopkins University, Baltimore, Maryland 21205
- Raghothama Chaerkady
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205
- Babylakshmi Muthusamy
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Centre of Excellence in Bioinformatics, School of Life Sciences, Pondicherry University, Puducherry 605014, India
- Srikanth S Manda
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Centre of Excellence in Bioinformatics, School of Life Sciences, Pondicherry University, Puducherry 605014, India; Departments of Biological Chemistry, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205
- Tejaswini Subbannayya
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Lakshmi Dhevi N Selvan
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Chieh-Huei Wang
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205
- Keshava K Datta
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; School of Biotechnology, KIIT University, Bhubaneswar, Odisha 751024, India
- Sunghee Woo
- Department of Computer Science, University of California, San Diego, California 92093
- Sutopa B Dwivedi
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Santosh Renuse
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Derese Getnet
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205
- Tai-Chung Huang
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205
- Min-Sik Kim
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205; Departments of Biological Chemistry, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205
- Sneha M Pinto
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205; Manipal University, Madhav Nagar, Manipal, Karnataka 576104, India
- Christopher J Mitchell
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205
- Anil K Madugundu
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Praveen Kumar
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Jyoti Sharma
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Manipal University, Madhav Nagar, Manipal, Karnataka 576104, India
- Jayshree Advani
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Gourav Dey
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Manipal University, Madhav Nagar, Manipal, Karnataka 576104, India
- Lavanya Balakrishnan
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Department of Biotechnology, Kuvempu University, Shimoga 577 451, India
- Nazia Syed
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Department of Biochemistry and Molecular Biology, School of Life Sciences, Pondicherry University, Puducherry 605 014, India
- Vishalakshi Nanjappa
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Amrita School of Biotechnology, Amrita University, Kollam 690 525, India
- Yashwanth Subbannayya
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Renu Goel
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- T S Keshava Prasad
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; Amrita School of Biotechnology, Amrita University, Kollam 690 525, India; Centre of Excellence in Bioinformatics, School of Life Sciences, Pondicherry University, Puducherry 605014, India; Manipal University, Madhav Nagar, Manipal, Karnataka 576104, India
- Vineet Bafna
- Department of Computer Science, University of California, San Diego, California 92093
- Ravi Sirdeshmukh
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Harsha Gowda
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India
- Charles Wang
- The Center for Genomics and Division of Microbiology & Molecular Genetics, School of Medicine, Loma Linda University, Loma Linda, California 92350
- Steven D Leach
- Department of Surgery, Johns Hopkins University, Baltimore, Maryland 21205; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205
- Akhilesh Pandey
- Institute of Bioinformatics, International Technology Park, Bangalore 560 066, India; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205; Departments of Biological Chemistry, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205; Sol Goldman Pancreatic Cancer Research Center, Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205; Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205
57. Armengaud J, Trapp J, Pible O, Geffard O, Chaumot A, Hartmann EM. Non-model organisms, a species endangered by proteogenomics. J Proteomics 2014; 105:5-18. [PMID: 24440519 DOI: 10.1016/j.jprot.2014.01.007] [Citation(s) in RCA: 100] [Impact Index Per Article: 9.1] [Received: 11/25/2013] [Revised: 12/24/2013] [Accepted: 01/07/2014] [Indexed: 10/25/2022]
Abstract
Previously, large-scale proteomics was possible only for organisms whose genomes were sequenced, meaning the most common model organisms. The use of next-generation sequencers is now changing the deal. With "proteogenomics", the use of experimental proteomics data to refine genome annotations, a higher integration of omics data is gaining ground. By extension, combining genomic and proteomic data is becoming routine in many research projects. "Proteogenomic"-flavored approaches are currently expanding, enabling the molecular studies of non-model organisms at an unprecedented depth. Today draft genomes can be obtained using next-generation sequencers in a rather straightforward way and at a reasonable cost for any organism. Unfinished genome sequences can be used to interpret tandem mass spectrometry proteomics data without the need for time-consuming genome annotation, and the use of RNA-seq to establish nucleotide sequences that are directly translated into protein sequences appears promising. There are, however, certain drawbacks that deserve further attention for RNA-seq to become more efficient. Here, we discuss the opportunities of working with non-model organisms, the proteomic methods that have been used until now, and the dramatic improvements proffered by proteogenomics. These put the distinction between model and non-model organisms in great danger, at least in terms of proteomics!
BIOLOGICAL SIGNIFICANCE: Model organisms have been crucial for in-depth analysis of cellular and molecular processes of life. Focusing the efforts of thousands of researchers on the Escherichia coli bacterium, Saccharomyces cerevisiae yeast, Arabidopsis thaliana plant, Danio rerio fish and other models for which genetic manipulation was possible was certainly worthwhile in terms of fundamental and invaluable biological insights. Until recently, proteomics of non-model organisms was limited to tedious, homology-based techniques, but today draft genomes or RNA-seq data can be straightforwardly obtained using next-generation sequencers, allowing the establishment of a draft protein database for any organism. Thus, proteogenomics opens new perspectives for molecular studies of non-model organisms, although they are still difficult experimental organisms. This article is part of a Special Issue entitled: Proteomics of non-model organisms.
Affiliation(s)
- Jean Armengaud
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France
- Judith Trapp
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France; Irstea, UR MALY, F-69626 Villeurbanne, France
- Olivier Pible
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France
- Erica M Hartmann
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze F-30207, France
58. Tanca A, Palomba A, Deligios M, Cubeddu T, Fraumene C, Biosa G, Pagnozzi D, Addis MF, Uzzau S. Evaluating the impact of different sequence databases on metaproteome analysis: insights from a lab-assembled microbial mixture. PLoS One 2013; 8:e82981. [PMID: 24349410 PMCID: PMC3857319 DOI: 10.1371/journal.pone.0082981] [Citation(s) in RCA: 92] [Impact Index Per Article: 7.7] [Received: 08/06/2013] [Accepted: 10/30/2013] [Indexed: 01/10/2023] [Open Access]
Abstract
Metaproteomics enables the investigation of the protein repertoire expressed by complex microbial communities. However, to unleash its full potential, refinements in bioinformatic approaches for data analysis are still needed. In this context, sequence databases selection represents a major challenge. This work assessed the impact of different databases in metaproteomic investigations by using a mock microbial mixture including nine diverse bacterial and eukaryotic species, which was subjected to shotgun metaproteomic analysis. Then, both the microbial mixture and the single microorganisms were subjected to next generation sequencing to obtain experimental metagenomic- and genomic-derived databases, which were used along with public databases (namely, NCBI, UniProtKB/SwissProt and UniProtKB/TrEMBL, parsed at different taxonomic levels) to analyze the metaproteomic dataset. First, a quantitative comparison in terms of number and overlap of peptide identifications was carried out among all databases. As a result, only 35% of peptides were common to all database classes; moreover, genus/species-specific databases provided up to 17% more identifications compared to databases with generic taxonomy, while the metagenomic database enabled a slight increment in respect to public databases. Then, database behavior in terms of false discovery rate and peptide degeneracy was critically evaluated. Public databases with generic taxonomy exhibited a markedly different trend compared to the counterparts. Finally, the reliability of taxonomic attribution according to the lowest common ancestor approach (using MEGAN and Unipept software) was assessed. The level of misassignments varied among the different databases, and specific thresholds based on the number of taxon-specific peptides were established to minimize false positives. This study confirms that database selection has a significant impact in metaproteomics, and provides critical indications for improving depth and reliability of metaproteomic results. Specifically, the use of iterative searches and of suitable filters for taxonomic assignments is proposed with the aim of increasing coverage and trustworthiness of metaproteomic data.
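The lowest common ancestor idea mentioned in the abstract can be sketched in a few lines of Python: a peptide that matches proteins from several taxa is assigned to the deepest taxon shared by all of their lineages. This is only a conceptual illustration and does not reproduce how MEGAN or Unipept implement it; the function name and the simplified lineages are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages, or None."""
    if not lineages:
        return None
    shared = []
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        shared.append(taxa_at_rank[0])
    return shared[-1] if shared else None


# Hypothetical, simplified lineages for one peptide matching two species.
matches = [
    ("Bacteria", "Proteobacteria", "Enterobacteriaceae", "Escherichia", "Escherichia coli"),
    ("Bacteria", "Proteobacteria", "Enterobacteriaceae", "Salmonella", "Salmonella enterica"),
]
print(lowest_common_ancestor(matches))
```

A peptide unique to one species resolves to that species, while a widely conserved peptide collapses toward the root, which is why the abstract's thresholds on taxon-specific peptide counts matter.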
Affiliation(s)
- Alessandro Tanca
- Porto Conte Ricerche Srl, Tramariglio, Alghero, Italy
- Dipartimento di Scienze Biomediche, Università di Sassari, Sassari, Italy
- Antonio Palomba
- Dipartimento di Scienze Biomediche, Università di Sassari, Sassari, Italy
- Massimo Deligios
- Porto Conte Ricerche Srl, Tramariglio, Alghero, Italy
- Dipartimento di Scienze Biomediche, Università di Sassari, Sassari, Italy
- Grazia Biosa
- Porto Conte Ricerche Srl, Tramariglio, Alghero, Italy
- Maria Filippa Addis
- Porto Conte Ricerche Srl, Tramariglio, Alghero, Italy
- Dipartimento di Scienze Biomediche, Università di Sassari, Sassari, Italy
- Sergio Uzzau
- Porto Conte Ricerche Srl, Tramariglio, Alghero, Italy
- Dipartimento di Scienze Biomediche, Università di Sassari, Sassari, Italy
59. Sun H, Xing X, Li J, Zhou F, Chen Y, He Y, Li W, Wei G, Chang X, Jia J, Li Y, Xie L. Identification of gene fusions from human lung cancer mass spectrometry data. BMC Genomics 2013; 14 Suppl 8:S5. [PMID: 24564548 PMCID: PMC4042237 DOI: 10.1186/1471-2164-14-s8-s5] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 01/08/2023] [Open Access]
Abstract
Background: Tandem mass spectrometry (MS/MS) technology has been applied to identify proteins, as an ultimate approach to confirm the original genome annotation. To be able to identify gene fusion proteins, a special database containing peptides that cross over gene fusion breakpoints is needed.
Methods: It is impractical to construct a database that includes all possible fusion peptides originated from potential breakpoints. Focusing on 6259 reported and predicted gene fusion pairs from ChimerDB 2.0 and Cancer Gene Census, we for the first time created a database CanProFu that comprehensively annotates fusion peptides formed by exon-exon linkage between these pairing genes.
Results: Applying this database to mass spectrometry datasets of 40 human non-small cell lung cancer (NSCLC) samples and 39 normal lung samples with stringent searching criteria, we were able to identify 19 unique fusion peptides characterizing gene fusion events. Among them 11 gene fusion events were only found in NSCLC samples. And also, 4 alternative splicing events were characterized in cancerous or normal lung samples.
Conclusions: The database and workflow in this work can be flexibly applied to other MS/MS based human cancer experiments to detect gene fusions as potential disease biomarkers or drug targets.
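To make the notion of "peptides that cross over gene fusion breakpoints" concrete, here is a minimal Python sketch. It is not the CanProFu construction procedure: it assumes the two partner segments are already translated and joined in frame, uses a simple trypsin rule (cleave after K or R unless followed by P) with no missed cleavages, and all names and sequences are hypothetical.

```python
import re


def tryptic_peptides(protein):
    """Return (start, end, sequence) for each fully cleaved tryptic peptide."""
    # Cleave after K or R, except when the next residue is P.
    cut_sites = [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    bounds = [0] + cut_sites
    if bounds[-1] != len(protein):
        bounds.append(len(protein))
    return [(s, e, protein[s:e]) for s, e in zip(bounds, bounds[1:])]


def junction_peptides(upstream, downstream, min_length=6):
    """Return tryptic peptides of the fused protein that span the breakpoint."""
    fused = upstream + downstream
    junction = len(upstream)  # index of the first downstream residue
    return [
        seq
        for start, end, seq in tryptic_peptides(fused)
        if start < junction < end and len(seq) >= min_length
    ]


# Hypothetical partner segments joined at an exon-exon boundary.
print(junction_peptides("MKTAYIAG", "LSDERQWK"))
```

Only peptides with residues on both sides of the junction are diagnostic of the fusion. If the upstream segment ends in K or R, full cleavage leaves no spanning peptide, which is one reason real pipelines also consider missed cleavages.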
|
60
|
Pang CNI, Tay AP, Aya C, Twine NA, Harkness L, Hart-Smith G, Chia SZ, Chen Z, Deshpande NP, Kaakoush NO, Mitchell HM, Kassem M, Wilkins MR. Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing. J Proteome Res 2013; 13:84-98. [PMID: 24152167 DOI: 10.1021/pr400820p] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 12/19/2022]
Abstract
Direct links between proteomic and genomic/transcriptomic data are not frequently made, partly because of lack of appropriate bioinformatics tools. To help address this, we have developed the PG Nexus pipeline. The PG Nexus allows users to covisualize peptides in the context of genomes or genomic contigs, along with RNA-seq reads. This is done in the Integrated Genome Viewer (IGV). A Results Analyzer reports the precise base position where LC-MS/MS-derived peptides cover genes or gene isoforms, on the chromosomes or contigs where this occurs. In prokaryotes, the PG Nexus pipeline facilitates the validation of genes, where annotation or gene prediction is available, or the discovery of genes using a "virtual protein"-based unbiased approach. We illustrate this with a comprehensive proteogenomics analysis of two strains of Campylobacter concisus. For higher eukaryotes, the PG Nexus facilitates gene validation and supports the identification of mRNA splice junction boundaries and splice variants that are protein-coding. This is illustrated with an analysis of splice junctions covered by human phosphopeptides, and other examples of relevance to the Chromosome-Centric Human Proteome Project. The PG Nexus is open-source and available from https://github.com/IntersectAustralia/ap11_Samifier. It has been integrated into Galaxy and made available in the Galaxy tool shed.
Affiliation(s)
- Chi Nam Ignatius Pang
- Systems Biology Initiative, The University of New South Wales, Sydney, New South Wales 2052, Australia
|
61
|
Krug K, Carpy A, Behrends G, Matic K, Soares NC, Macek B. Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments. Mol Cell Proteomics 2013; 12:3420-30. [PMID: 23908556 DOI: 10.1074/mcp.m113.029165] [Citation(s) in RCA: 66] [Impact Index Per Article: 5.5] [Indexed: 11/06/2022]
Abstract
Recent advances in mass spectrometry (MS) have led to increased applications of shotgun proteomics to the refinement of genome annotation. The typical "proteo-genomic" workflows rely on the mapping of peptide MS/MS spectra onto databases derived via six-frame translation of the genome sequence. These databases contain a large proportion of spurious protein sequences which make the statistical confidence of the resulting peptide spectrum matches difficult to assess. Here we performed a comprehensive analysis of the Escherichia coli proteome using LTQ-Orbitrap MS and mapped the corresponding MS/MS spectra onto a six-frame translation of the E. coli genome. We hypothesized that the protein-coding part of the E. coli genome approaches complete annotation and that the majority of six frame-specific (novel) peptide spectrum matches can be considered as false positive identifications. We confirm our hypothesis by showing that the posterior error probability distribution of novel hits is almost identical to that of reversed (decoy) hits; this enables us to estimate the sensitivity, specificity, accuracy, and false discovery rate in a typical bacterial proteo-genomic dataset. We use two complementary computational frameworks for processing and statistical assessment of MS/MS data: MaxQuant and Trans-Proteomic Pipeline. We show that MaxQuant achieves a more sensitive six-frame database search with an acceptable false discovery rate and is therefore well suited for global genome reannotation applications, whereas the Trans-Proteomic Pipeline achieves higher specificity and is well suited for high-confidence validation. The use of a small and well-annotated bacterial genome enables us to address genome coverage achieved in state-of-the-art bacterial proteomics: identified peptide sequences mapped to all expressed E. coli proteins but covered 31.7% of the protein-coding genome sequence. 
Our results show that false discovery rates can be substantially underestimated even in "simple" proteo-genomic experiments obtained by means of high-accuracy MS and point to the necessity of further improvements concerning the coverage of peptide sequences by MS-based methods.
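The six-frame translation that such workflows search against can be sketched as follows. This is a minimal illustration assuming the standard genetic code and a plain A/C/G/T input; it is not the database-construction code used with MaxQuant or the Trans-Proteomic Pipeline, and the function names and frame labels ("+1" to "-3") are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame label to translated sequence for both strands."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Because at most one of the six frames at any locus is typically coding, most of the resulting sequence is spurious, which is the source of the inflated search space and the false discovery rate problem the abstract describes.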
Affiliation(s)
- Karsten Krug
- Proteome Center Tuebingen, University of Tuebingen, 72076 Tuebingen, Germany
|
62
|
Armengaud J, Hartmann EM, Bland C. Proteogenomics for environmental microbiology. Proteomics 2013; 13:2731-42. [PMID: 23636904 DOI: 10.1002/pmic.201200576] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Received: 12/19/2012] [Revised: 03/06/2013] [Accepted: 04/09/2013] [Indexed: 11/09/2022]
Abstract
Proteogenomics sensu stricto refers to the use of proteomic data to refine the annotation of genomes from model organisms. Because of the limitations of automatic annotation pipelines, a relatively high number of errors occur during the structural annotation of genes coding for proteins. Whether putative orphan sequences or short genes encoding low-molecular-weight proteins really exist is still frequently a mystery. Whether start codons are well defined is also an open debate. These problems are exacerbated for genomes of microorganisms belonging to poorly documented genera, as related sequences are not always available for homology-guided annotation. The functional annotation of a significant proportion of genes is also another well-known issue when annotating environmental microorganisms. High-throughput shotgun proteomics has recently greatly evolved, allowing the exploration of the proteome from any microorganism at an unprecedented depth. The structural and functional annotation process may be usefully complemented with experimental data. Indeed, proteogenomic mapping has been successfully performed for a wide variety of organisms. Specific approaches devoted to systematically establishing the N-termini of a large set of proteins are being developed. N-terminomics is giving rise to datasets of experimentally proven translational start codons as well as validated peptide signals for secreted proteins. By extension, combining genomic and proteomic data is becoming routine in many research projects. The proteomic analysis of organisms with unfinished genome sequences, the so-called composite proteomics, and the search for microbial biomarkers by bottom-up and top-down combined approaches are some examples of proteogenomic-flavored studies. They illustrate the advent of a new era of environmental microbiology where proteomics and genomics are intimately integrated to answer key biological questions.
Affiliation(s)
- Jean Armengaud
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze, France
|
63
|
Bertaccini D, Vaca S, Carapito C, Arsène-Ploetze F, Van Dorsselaer A, Schaeffer-Reiss C. An Improved Stable Isotope N-Terminal Labeling Approach with Light/Heavy TMPP To Automate Proteogenomics Data Validation: dN-TOP. J Proteome Res 2013; 12:3063-70. [DOI: 10.1021/pr4002993] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Indexed: 11/29/2022]
Affiliation(s)
- Diego Bertaccini
- Laboratoire de Spectrométrie de Masse BioOrganique, IPHC, Université de Strasbourg, CNRS, UMR7178, Strasbourg, France
- Sebastian Vaca
- Laboratoire de Spectrométrie de Masse BioOrganique, IPHC, Université de Strasbourg, CNRS, UMR7178, Strasbourg, France
- Christine Carapito
- Laboratoire de Spectrométrie de Masse BioOrganique, IPHC, Université de Strasbourg, CNRS, UMR7178, Strasbourg, France
- Florence Arsène-Ploetze
- Laboratoire de Génétique Moléculaire, Génomique et Microbiologie, Université de Strasbourg, CNRS UMR7156, Strasbourg, France
- Alain Van Dorsselaer
- Laboratoire de Spectrométrie de Masse BioOrganique, IPHC, Université de Strasbourg, CNRS, UMR7178, Strasbourg, France
- Christine Schaeffer-Reiss
- Laboratoire de Spectrométrie de Masse BioOrganique, IPHC, Université de Strasbourg, CNRS, UMR7178, Strasbourg, France
|
64
|
Sheynkman GM, Shortreed MR, Frey BL, Smith LM. Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol Cell Proteomics 2013; 12:2341-53. [PMID: 23629695 DOI: 10.1074/mcp.o113.028142] [Citation(s) in RCA: 105] [Impact Index Per Article: 8.8] [Indexed: 01/22/2023]
Abstract
Human proteomic databases required for MS peptide identification are frequently updated and carefully curated, yet are still incomplete because it has been challenging to acquire every protein sequence from the diverse assemblage of proteoforms expressed in every tissue and cell type. In particular, alternative splicing has been shown to be a major source of this cell-specific proteomic variation. Many new alternative splice forms have been detected at the transcript level using next generation sequencing methods, especially RNA-Seq, but it is not known how many of these transcripts are being translated. Leveraging the unprecedented capabilities of next generation sequencing methods, we collected RNA-Seq and proteomics data from the same cell population (Jurkat cells) and created a bioinformatics pipeline that builds customized databases for the discovery of novel splice-junction peptides. Eighty million paired-end Illumina reads and ∼500,000 tandem mass spectra were used to identify 12,873 transcripts (19,320 including isoforms) and 6810 proteins. We developed a bioinformatics workflow to retrieve high-confidence, novel splice junction sequences from the RNA data, translate these sequences into the analogous polypeptide sequence, and create a customized splice junction database for MS searching. Based on the RefSeq gene models, we detected 136,123 annotated and 144,818 unannotated transcript junctions. Of those, 24,834 unannotated junctions passed various quality filters (e.g. minimum read depth) and these entries were translated into 33,589 polypeptide sequences and used for database searching. We discovered 57 splice junction peptides not present in the Uniprot-Trembl proteomic database comprising an array of different splicing events, including skipped exons, alternative donors and acceptors, and noncanonical transcriptional start sites. 
To our knowledge this is the first example of using sample-specific RNA-Seq data to create a splice-junction database and discover new peptides resulting from alternative splicing.
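The junction-selection step described above (keeping unannotated junctions that pass quality filters such as minimum read depth) might look like the sketch below. It is illustrative only and is not the authors' pipeline: the `Junction` record, the coordinate convention, and the `min_reads` threshold of 5 are all assumptions, since the abstract does not state the actual filter values.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    """Hypothetical record for one RNA-Seq splice junction."""
    chrom: str
    donor: int     # last base of the upstream exon
    acceptor: int  # first base of the downstream exon
    reads: int     # number of reads spanning the junction


def novel_junctions(observed, annotated, min_reads=5):
    """Keep junctions absent from the gene models that meet the read-depth cutoff.

    `annotated` is an iterable of (chrom, donor, acceptor) tuples taken from a
    reference annotation; `min_reads` is an assumed, not a published, threshold.
    """
    known = set(annotated)
    return [
        j for j in observed
        if (j.chrom, j.donor, j.acceptor) not in known and j.reads >= min_reads
    ]


if __name__ == "__main__":
    observed = [
        Junction("chr1", 1000, 2000, 40),  # annotated
        Junction("chr1", 1000, 2500, 12),  # unannotated, well supported
        Junction("chr2", 300, 900, 2),     # unannotated, too few reads
    ]
    annotated = [("chr1", 1000, 2000)]
    print(novel_junctions(observed, annotated))
```

The surviving junctions would then be translated into polypeptide sequences and appended to the search database; filtering first keeps that database small, which limits the false positive rate of the subsequent MS search.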
Affiliation(s)
- Gloria M Sheynkman
- Department of Chemistry, University of Wisconsin-Madison, 1101 University Ave., Madison, Wisconsin 53706, USA
|