1
|
Rajczewski AT, Jagtap PD, Griffin TJ. An overview of technologies for MS-based proteomics-centric multi-omics. Expert Rev Proteomics 2022; 19:165-181. [PMID: 35466851 PMCID: PMC9613604 DOI: 10.1080/14789450.2022.2070476] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 01/07/2023]
Abstract
INTRODUCTION Mass spectrometry-based proteomics reveals dynamic molecular signatures underlying phenotypes reflecting normal and perturbed conditions in living systems. Although valuable on its own, the proteome provides only one level of molecular information, with the genome, epigenome, transcriptome, and metabolome all providing complementary information. Multi-omic analysis integrating information from one or more of these other domains with proteomic information provides a more complete picture of molecular contributors to dynamic biological systems. AREAS COVERED Here, we discuss the improvements to mass spectrometry-based technologies, focused on peptide-based, bottom-up approaches, that have enabled deep, quantitative characterization of complex proteomes. These advances are facilitating the integration of proteomics data with other 'omic information, providing a more complete picture of living systems. We also describe the current state of bioinformatics software and approaches for integrating proteomics and other 'omics data, critical for enabling new discoveries driven by multi-omics. EXPERT COMMENTARY Multi-omics, centered on the integration of proteomics information with other 'omic information, has tremendous promise for biological and biomedical studies. Continued advances in approaches for generating deep, reliable proteomic data and bioinformatics tools aimed at integrating data across 'omic domains will ensure the discoveries offered by these multi-omic studies continue to increase.
Affiliation(s)
- Andrew T. Rajczewski
- Department of Biochemistry, Molecular and Cell Biology Building, University of Minnesota, 420 Washington Ave SE 7-129, Minneapolis, MN, 55455, USA
| | - Pratik D. Jagtap
- Department of Biochemistry, Molecular and Cell Biology Building, University of Minnesota, 420 Washington Ave SE 7-129, Minneapolis, MN, 55455, USA
| | - Timothy J. Griffin
- Department of Biochemistry, Molecular and Cell Biology Building, University of Minnesota, 420 Washington Ave SE 7-129, Minneapolis, MN, 55455, USA
| |
|
2
|
Wang Z, Pan N, Yan J, Wan J, Wan C. Systematic Identification of Microproteins during the Development of Drosophila melanogaster. J Proteome Res 2022; 21:1114-1123. [PMID: 35227063 DOI: 10.1021/acs.jproteome.2c00004] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/16/2022]
Abstract
Short open reading frame-encoded peptides (SEPs) are microproteins with fewer than 100 amino acids that play an essential role in the growth and development of organisms. There are many short open reading frames in Drosophila melanogaster that potentially code for polypeptides. We chose 11 time points during the life cycle of Drosophila to investigate microproteins, particularly those related to development. We identified a total of 410 microproteins, of which 27 were noncoding RNA-encoded proteins. Of the 410 microproteins, 74 were expressed in all stages from embryo to adult, whereas 300 microproteins were found at only one or two time points. Approximately one-third of the microproteins were not reported previously, and 44 were obtained from de novo sequencing and validated by synthetic peptides. These microproteins are related to the main bioprocesses of growth and development, such as multicellular organism reproduction, postmating behavior, and oviposition. Over half of the microproteins have predicted functional domains and are conserved across species, suggesting that these microproteins have critical functions in fly development. This work enriches the D. melanogaster proteome and provides a significant data resource for growth and development research.
Affiliation(s)
- Zhiwei Wang
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei 430079, People's Republic of China
| | - Ni Pan
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei 430079, People's Republic of China
| | - Jiahao Yan
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei 430079, People's Republic of China
| | - Jian Wan
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei 430079, People's Republic of China
| | - Cuihong Wan
- School of Life Sciences and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan, Hubei 430079, People's Republic of China
| |
|
3
|
Pan N, Wang Z, Wang B, Wan J, Wan C. Mapping Microproteins and ncRNA-Encoded Polypeptides in Different Mouse Tissues. Front Cell Dev Biol 2021; 9:687748. [PMID: 34381774 PMCID: PMC8350139 DOI: 10.3389/fcell.2021.687748] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/30/2021] [Accepted: 06/30/2021] [Indexed: 12/30/2022] Open
Abstract
Small open reading frame-encoded peptides (SEPs), also called microproteins, play a vital role in biological processes. Many of their open reading frames are located within non-coding RNA (ncRNA) regions. Recent research has demonstrated that ncRNA-encoded polypeptides have essential functions and exist ubiquitously in various tissues. To better understand the role of microproteins, especially ncRNA-encoded proteins, expressed in different tissues, we profiled the proteomes of five mouse tissues by mass spectrometry, using bottom-up, top-down, and de novo sequencing strategies. Bottom-up and top-down with database-dependent searches identified 811 microproteins in the OpenProt database. De novo sequencing identified 290 microproteins, including 12 ncRNA-encoded microproteins that were not found in current databases. In this study, we discovered 1,074 microproteins in total, including 270 ncRNA-encoded microproteins. From the annotation of these microproteins, we found that the brain contains the largest number of neuropeptides, while the spleen contains the most immune-associated microproteins. This suggests that microproteins in different tissues have tissue-specific functions. These unannotated ncRNA-encoded microproteins have predicted domains, such as the macrophage migration inhibitory factor domain and the Prefoldin domain. These results expand the mouse proteome and provide insight into the molecular biology of mouse tissues.
Affiliation(s)
- Ni Pan
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, School of Life Sciences, Central China Normal University, Wuhan, China
| | - Zhiwei Wang
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, School of Life Sciences, Central China Normal University, Wuhan, China
| | - Bing Wang
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, School of Life Sciences, Central China Normal University, Wuhan, China
| | - Jian Wan
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, School of Life Sciences, Central China Normal University, Wuhan, China
| | - Cuihong Wan
- Hubei Key Laboratory of Genetic Regulation and Integrative Biology, School of Life Sciences, Central China Normal University, Wuhan, China
| |
|
4
|
Wang B, Wang Z, Pan N, Huang J, Wan C. Improved Identification of Small Open Reading Frames Encoded Peptides by Top-Down Proteomic Approaches and De Novo Sequencing. Int J Mol Sci 2021; 22:5476. [PMID: 34067398 PMCID: PMC8197016 DOI: 10.3390/ijms22115476] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/13/2021] [Revised: 05/14/2021] [Accepted: 05/18/2021] [Indexed: 12/20/2022] Open
Abstract
Small open reading frames (sORFs) have translational potential to produce peptides that play essential roles in various biological processes. Nevertheless, many sORF-encoded peptides (SEPs) remain at the prediction level. Here, we construct a strategy to analyze SEPs by combining top-down and de novo sequencing to improve SEP identification and sequence coverage. With de novo sequencing, we identified 1682 peptides mapping to 2544 human sORFs, which were all first characterized in this work. Two-thirds of these new sORFs have reading frame shifts and use a non-ATG start codon. The top-down approach identified 241 human SEPs with high sequence coverage. The average length of the peptides from the bottom-up database search was 19 amino acids (AA); from de novo sequencing, it was 9 AA; and from the top-down approach, it was 25 AA. Longer peptides increase sequence coverage and more efficiently distinguish SEPs from known gene coding sequences. Top-down has the advantage of identifying peptides with sequential K/R or high K/R content, which is unfavorable in the bottom-up approach. Our method can explore new coding sORFs and obtain highly accurate sequences of their SEPs, which can also benefit future functional research.
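The remark about K/R content can be illustrated with a small in-silico digestion. The abstract does not name the protease; the sketch below assumes trypsin, the usual bottom-up enzyme, which cleaves C-terminal to lysine (K) or arginine (R) except when the next residue is proline. The function name and the example sequence are hypothetical and are not taken from the paper.

```python
import re
from typing import List


def tryptic_peptides(protein: str) -> List[str]:
    """Split a protein at tryptic sites: after K or R, unless followed by P.

    Assumption: simple trypsin rule with no missed cleavages. Real search
    engines allow missed cleavages and other enzymes.
    """
    # Zero-width split points: preceded by K/R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", protein.upper())
    return [p for p in pieces if p]


if __name__ == "__main__":
    # Hypothetical K/R-rich microprotein fragment.
    peptides = tryptic_peptides("MKRKAGLKRRPSTK")
    for pep in peptides:
        print(pep, len(pep))
```

On a K/R-rich sequence most fragments are only one to five residues long, which is typically too short to identify confidently; analysing the intact chain, as in top-down proteomics, avoids this problem.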
|
5
|
Schlaffner CN, Pirklbauer GJ, Bender A, Choudhary JS. Fast, Quantitative and Variant Enabled Mapping of Peptides to Genomes. Cell Syst 2017; 5:152-156.e4. [PMID: 28837811 PMCID: PMC5571441 DOI: 10.1016/j.cels.2017.07.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 12/05/2016] [Revised: 03/24/2017] [Accepted: 07/26/2017] [Indexed: 12/24/2022]
Abstract
Current tools for visualization and integration of proteomics with other omics datasets are inadequate for large-scale studies and capture only basic sequence identity information. Furthermore, the frequent reformatting of annotations for reference genomes required by these tools is known to be highly error prone. We developed PoGo for mapping peptides identified through mass spectrometry to overcome these limitations. PoGo reduced runtime and memory usage by 85% and 20%, respectively, and exhibited overall superior performance over other tools on benchmarking with large-scale human tissue and cancer phosphoproteome datasets comprising ∼3 million peptides. In addition, extended functionality enables representation of single-nucleotide variants, post-translational modifications, and quantitative features. PoGo has been integrated in established frameworks such as the PRIDE tool suite and OpenMS, and is also available as a standalone tool with a user-friendly graphical interface. With the rapid increase of quantitative high-resolution datasets capturing proteomes and global modifications to complement orthogonal genomics platforms, PoGo provides a central utility enabling large-scale visualization and interpretation of transomics datasets.
Affiliation(s)
- Christoph N Schlaffner
- Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK; Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, Cambridgeshire CB2 1EW, UK
| | - Georg J Pirklbauer
- Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, Cambridgeshire CB2 1EW, UK
| | - Jyoti S Choudhary
- Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK
| |
|
6
|
Guillot L, Delage L, Viari A, Vandenbrouck Y, Com E, Ritter A, Lavigne R, Marie D, Peterlongo P, Potin P, Pineau C. Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes. BMC Genomics 2019; 20:56. [PMID: 30654742 PMCID: PMC6337836 DOI: 10.1186/s12864-019-5431-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/26/2018] [Accepted: 01/03/2019] [Indexed: 01/02/2023] Open
Abstract
Background Accurate structural annotation of genomes is still a challenge, despite the progress made over the past decade. The prediction of gene structure remains difficult, especially for eukaryotic species, and is often erroneous and incomplete. We used a proteogenomics strategy, taking advantage of the combination of proteomics datasets and bioinformatics tools, to identify novel protein-coding genes and splice isoforms, assign correct start sites, and validate predicted exons and genes. Results Our proteogenomics workflow, Peptimapper, was applied to the genome annotation of Ectocarpus sp., a key reference genome for both the brown algal lineage and stramenopiles. We generated proteomics data from various life cycle stages of Ectocarpus sp. strains and sub-cellular fractions using a shotgun approach. First, we directly generated peptide sequence tags (PSTs) from the proteomics data. Second, we mapped PSTs onto the translated genomic sequence. Closely located hits (i.e., PST locations on the genome) were then clustered to detect potential coding regions based on parameters optimized for the organism. Third, we evaluated each cluster and compared it to gene predictions from existing conventional genome annotation approaches. Finally, we integrated cluster locations into GFF files for use in a genome viewer. We identified two potential novel genes, a ribosomal protein L22 and an aryl sulfotransferase, and corrected the gene structure of a dihydrolipoamide acetyltransferase. We experimentally validated the results by RT-PCR and using transcriptomics data. Conclusions Peptimapper is a complementary tool for the expert annotation of genomes. It is suitable for any organism and is distributed through a Docker image available on two public bioinformatics Docker repositories: Docker Hub and BioShaDock. This workflow is also accessible through the Galaxy framework and for use by non-computer scientists at https://galaxy.protim.eu.
Data are available via ProteomeXchange under identifier PXD010618.
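The clustering step described in the Results ("closely located hits ... were then clustered") can be sketched as a single pass over sorted genomic intervals. This is a minimal illustration and not Peptimapper's actual code: the `cluster_hits` name, the `max_gap` and `min_hits` parameters, and their default values are assumptions, and strand is ignored for brevity.

```python
from typing import List, Tuple

# A hit is (sequence id, start, end) of a peptide sequence tag on the genome.
Hit = Tuple[str, int, int]
# A cluster is (sequence id, start, end, number of hits it contains).
Cluster = Tuple[str, int, int, int]


def cluster_hits(hits: List[Hit], max_gap: int = 1000, min_hits: int = 2) -> List[Cluster]:
    """Group hits on the same sequence whose gap to the running cluster end
    is at most max_gap; keep clusters with at least min_hits members.

    Assumption: max_gap and min_hits stand in for the organism-specific
    parameters mentioned in the abstract; the values here are arbitrary.
    """
    clusters: List[Cluster] = []
    current = None  # mutable [seqid, start, end, count]
    for seqid, start, end in sorted(hits):
        if current is not None and seqid == current[0] and start - current[2] <= max_gap:
            # Extend the running cluster.
            current[2] = max(current[2], end)
            current[3] += 1
        else:
            # Close the previous cluster, then start a new one.
            if current is not None and current[3] >= min_hits:
                clusters.append((current[0], current[1], current[2], current[3]))
            current = [seqid, start, end, 1]
    if current is not None and current[3] >= min_hits:
        clusters.append((current[0], current[1], current[2], current[3]))
    return clusters


if __name__ == "__main__":
    example = [("chr1", 100, 130), ("chr1", 400, 430), ("chr1", 5000, 5030),
               ("chr2", 50, 80), ("chr2", 200, 230)]
    for cluster in cluster_hits(example):
        print(cluster)
```

Each resulting cluster is a candidate coding region that could then be compared with existing gene predictions or written out as a GFF feature.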
Affiliation(s)
- Laetitia Guillot
- Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France; Protim, Univ Rennes, F-35042, Rennes cedex, France
| | - Ludovic Delage
- Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France
| | - Alain Viari
- INRIA Grenoble-Rhône-Alpes, F-38330, Montbonnot-Saint-Martin, France
| | - Yves Vandenbrouck
- University Grenoble Alpes, CEA, Inserm, BIG-BGE, 38000, Grenoble, France
| | - Emmanuelle Com
- Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France; Protim, Univ Rennes, F-35042, Rennes cedex, France
| | - Andrés Ritter
- Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France; Present address: Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratory of Computational and Quantitative Biology, F-75005, Paris, France
| | - Régis Lavigne
- Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France; Protim, Univ Rennes, F-35042, Rennes cedex, France
| | - Dominique Marie
- Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France
| | | | - Philippe Potin
- Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France
| | - Charles Pineau
- Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France; Protim, Univ Rennes, F-35042, Rennes cedex, France
| |
|
7
|
Schlaffner CN, Pirklbauer GJ, Bender A, Steen JAJ, Choudhary JS. A Fast and Quantitative Method for Post-translational Modification and Variant Enabled Mapping of Peptides to Genomes. J Vis Exp 2018. [PMID: 29889196 PMCID: PMC6101353 DOI: 10.3791/57633] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/13/2023] Open
Abstract
Cross-talk between genes, transcripts, and proteins is the key to cellular responses; hence, analysis of molecular levels as distinct entities is slowly being extended to integrative studies to enhance the understanding of molecular dynamics within cells. Current tools for the visualization and integration of proteomics with other omics datasets are inadequate for large-scale studies. Furthermore, they only capture basic sequence identity, discarding post-translational modifications and quantitation. To address these issues, we developed PoGo to map peptides with associated post-translational modifications and quantification to reference genome annotation. In addition, the tool was developed to enable the mapping of peptides identified from customized sequence databases incorporating single amino acid variants. While PoGo is a command line tool, the graphical interface PoGoGUI enables non-bioinformatics researchers to easily map peptides to 25 species supported by Ensembl genome annotation. The generated output borrows file formats from the genomics field and, therefore, visualization is supported in most genome browsers. For large-scale studies, PoGo is supported by TrackHubGenerator to create web-accessible repositories of data mapped to genomes that also enable easy sharing of proteogenomics data. With little effort, this tool can map millions of peptides to reference genomes within only a few minutes, outperforming other available sequence-identity based tools. This protocol demonstrates the best approaches for proteogenomics mapping through PoGo with publicly available datasets of quantitative and phosphoproteomics, as well as large-scale studies.
Affiliation(s)
- Christoph N Schlaffner
- Department of Neurobiology, F. M. Kirby Neurobiology Center, Boston Children's Hospital, Harvard Medical School; Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus; Centre for Molecular Informatics, Department of Chemistry, University of Cambridge
| | - Georg J Pirklbauer
- Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge
| | - Judith A J Steen
- Department of Neurobiology, F. M. Kirby Neurobiology Center, Boston Children's Hospital, Harvard Medical School
| | - Jyoti S Choudhary
- Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus; Functional Proteomics Group, Chester Beatty Laboratories, Institute of Cancer Research
| |
|
8
|
|
9
|
Menschaert G, Fenyö D. Proteogenomics from a bioinformatics angle: A growing field. Mass Spectrom Rev 2017; 36:584-599. [PMID: 26670565 PMCID: PMC6101030 DOI: 10.1002/mas.21483] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 04/20/2015] [Accepted: 09/01/2015] [Indexed: 05/16/2023]
Abstract
Proteogenomics is a research area that combines areas such as proteomics and genomics in a multi-omics setup using both mass spectrometry and high-throughput sequencing technologies. Currently, the main goals of the field are to aid genome annotation or to unravel the proteome complexity. Mass spectrometry-based identifications of matching or homologous peptides can further refine gene models. The identification of novel proteoforms is also made possible based on detection of novel translation initiation sites (cognate or near-cognate), novel transcript isoforms, sequence variation or novel (small) open reading frames in intergenic or untranslated genic regions by analyzing high-throughput sequencing data from RNA-seq or ribosome profiling experiments. Other proteogenomics studies using a combination of proteomics and genomics techniques focus on antibody sequencing, the identification of immunogenic peptides or venom peptides. Over the years, a growing number of bioinformatics tools and databases became available to help streamline these cross-omics studies. Some of these solutions only help in specific steps of the proteogenomics studies, e.g. building custom sequence databases (based on next generation sequencing output) for mass spectrometry fragmentation spectrum matching. Over the last few years a handful of integrative tools also became available that can execute complete proteogenomics analyses. Some of these are presented as stand-alone solutions, whereas others are implemented in a web-based framework such as Galaxy. In this review we aimed at sketching a comprehensive overview of all the bioinformatics solutions that are available for this growing research area. © 2015 Wiley Periodicals, Inc. Mass Spec Rev 36:584-599, 2017.
Affiliation(s)
- Gerben Menschaert
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- To whom correspondence should be addressed. Tel: +32 9 264 99 22; Fax: +32 9 264 6220
| | - David Fenyö
- Center for Health Informatics and Bioinformatics and Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York, USA
| |
|
10
|
Hernandez-Valladares M, Vaudel M, Selheim F, Berven F, Bruserud Ø. Proteogenomics approaches for studying cancer biology and their potential in the identification of acute myeloid leukemia biomarkers. Expert Rev Proteomics 2017; 14:649-663. [DOI: 10.1080/14789450.2017.1352474] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/02/2023]
Affiliation(s)
- Maria Hernandez-Valladares
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Marc Vaudel
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
| | - Frode Selheim
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Frode Berven
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| | - Øystein Bruserud
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
| |
|
11
|
Ruggles KV, Krug K, Wang X, Clauser KR, Wang J, Payne SH, Fenyö D, Zhang B, Mani DR. Methods, Tools and Current Perspectives in Proteogenomics. Mol Cell Proteomics 2017; 16:959-981. [PMID: 28456751 DOI: 10.1074/mcp.mr117.000024] [Citation(s) in RCA: 95] [Impact Index Per Article: 13.6] [Received: 04/13/2017] [Indexed: 12/20/2022] Open
Abstract
With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies has yielded novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches, and in this article we systematically classify published methods and tools into four major categories: (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications.
Affiliation(s)
- Kelly V Ruggles
- Department of Medicine, New York University School of Medicine, New York, New York 10016
| | - Karsten Krug
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
| | - Xiaojing Wang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
| | - Karl R Clauser
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
| | - Jing Wang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
| | - Samuel H Payne
- Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354
| | - David Fenyö
- Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York 10016; Institute for Systems Genetics, New York University School of Medicine, New York, New York 10016
| | - Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
| | - D R Mani
- The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
| |
|
12
|
|
13
|
Fu S, Liu X, Luo M, Xie K, Nice EC, Zhang H, Huang C. Proteogenomic studies on cancer drug resistance: towards biomarker discovery and target identification. Expert Rev Proteomics 2017; 14:351-362. [PMID: 28276747 DOI: 10.1080/14789450.2017.1299006] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 02/05/2023]
Abstract
Introduction: Chemoresistance is a major obstacle for current cancer treatment. Proteogenomics is a powerful multi-omics research field that uses customized protein sequence databases generated by genomic and transcriptomic information to identify novel genes (e.g. noncoding, mutation and fusion genes) from mass spectrometry-based proteomic data. By identifying aberrations that are differentially expressed between tumor and normal pairs, this approach can also be applied to validate protein variants in cancer, which may reveal the response to drug treatment. Areas covered: In this review, we will present recent advances in proteogenomic investigations of cancer drug resistance with an emphasis on integrative proteogenomic pipelines and the biomarker discovery which contributes to achieving the goal of using precision/personalized medicine for cancer treatment. Expert commentary: The discovery and comprehensive understanding of potential biomarkers help identify the cohort of patients who may benefit from particular treatments, and will assist real-time clinical decision-making to maximize therapeutic efficacy and minimize adverse effects. With the development of MS-based proteomics and NGS-based sequencing, a growing number of proteogenomic tools are being developed specifically to investigate cancer drug resistance.
Affiliation(s)
- Shuyue Fu
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, and Collaborative Innovation Center for Biotherapy, Chengdu, P.R. China
| | - Xiang Liu
- Department of Pathology, Sichuan Academy of Medical Sciences, Sichuan Provincial People's Hospital, Chengdu, P.R. China
| | - Maochao Luo
- West China School of Public Health, Sichuan University, Chengdu, P.R. China
| | - Ke Xie
- Department of Oncology, Sichuan Academy of Medical Sciences, Sichuan Provincial People's Hospital, Chengdu, P.R. China
| | - Edouard C Nice
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Australia
| | - Haiyuan Zhang
- School of Medicine, Yangtze University, P.R. China
| | - Canhua Huang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, and Collaborative Innovation Center for Biotherapy, Chengdu, P.R. China
| |
|
14
|
Zhang J, Yang MK, Zeng H, Ge F. GAPP: A Proteogenomic Software for Genome Annotation and Global Profiling of Post-translational Modifications in Prokaryotes. Mol Cell Proteomics 2016; 15:3529-3539. [PMID: 27630248 DOI: 10.1074/mcp.m116.060046] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/07/2016] [Indexed: 11/06/2022] Open
Abstract
Although the number of sequenced prokaryotic genomes is growing rapidly, experimentally verified annotation of prokaryotic genomes remains patchy and challenging. To facilitate genome annotation efforts for prokaryotes, we developed open-source software called GAPP for genome annotation and global profiling of post-translational modifications (PTMs) in prokaryotes. With a single command, it provides a standard workflow to validate and refine predicted gene models and discover diverse PTM events. We demonstrated the utility of GAPP using proteomic data from Helicobacter pylori, one of the major human pathogens that is responsible for many gastric diseases. Our results confirmed 84.9% of the existing predicted H. pylori proteins, identified 20 novel protein-coding genes, and corrected four existing gene models with regard to translation initiation sites. In particular, GAPP revealed a large repertoire of PTMs using the same proteomic data and provided a rich resource that can be used to examine the functions of reversible modifications in this human pathogen. This software is a powerful tool for genome annotation and global discovery of PTMs and is applicable to any sequenced prokaryotic organism; we expect that it will become an integral part of ongoing genome annotation efforts for prokaryotes. GAPP is freely available at https://sourceforge.net/projects/gappproteogenomic/.
Affiliation(s)
- Jia Zhang
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
| | - Ming-Kun Yang
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
| | - Honghui Zeng
- Wuhan Branch, Supercomputing Center, Chinese Academy of Sciences, China
| | - Feng Ge
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China; Wuhan Branch, Supercomputing Center, Chinese Academy of Sciences, China
|
15
|
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annu Rev Anal Chem (Palo Alto Calif) 2016; 9:521-45. [PMID: 27049631 PMCID: PMC4991544 DOI: 10.1146/annurev-anchem-071015-041722] [Citation(s) in RCA: 73] [Impact Index Per Article: 9.1] [Indexed: 05/09/2023]
Abstract
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
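The definition given in this abstract, using nucleotide sequences to generate candidate protein sequences for database searching, can be illustrated with a six-frame translation. The following is only a minimal sketch, not the workflow of any tool reviewed by the authors: the function names, the `min_len` cutoff, and the frame-labelling scheme are assumptions made for illustration, and the codon table is the standard genetic code.

```python
from typing import Dict

# Standard genetic code in the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_candidates(dna: str, min_len: int = 7) -> Dict[str, str]:
    """Collect stop-to-stop segments from all six reading frames.

    Keys are hypothetical labels of the form '<strand><frame>_<segment index>'.
    """
    dna = dna.upper()
    candidates: Dict[str, str] = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            for n, segment in enumerate(protein.split("*")):
                if len(segment) >= min_len:
                    candidates[f"{strand}{offset + 1}_{n}"] = segment
    return candidates
```

In practice the resulting segments would be written to a FASTA file and appended to the reference protein database before the search; the length cutoff keeps the search space from growing with very short, uninformative segments.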
Affiliation(s)
- Gloria M Sheynkman
- Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215
- Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Michael R Shortreed
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Anthony J Cesnik
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Lloyd M Smith
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Genome Center of Wisconsin, University of Wisconsin, Madison, Wisconsin 53706
|
16
|
Abd-Alla AMM, Kariithi HM, Cousserans F, Parker NJ, İnce İA, Scully ED, Boeren S, Geib SM, Mekonnen S, Vlak JM, Parker AG, Vreysen MJB, Bergoin M. Comprehensive annotation of Glossina pallidipes salivary gland hypertrophy virus from Ethiopian tsetse flies: a proteogenomics approach. J Gen Virol 2016; 97:1010-1031. [PMID: 26801744 DOI: 10.1099/jgv.0.000409] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Indexed: 12/16/2022] Open
Abstract
Glossina pallidipes salivary gland hypertrophy virus (GpSGHV; family Hytrosaviridae) can establish asymptomatic and symptomatic infection in its tsetse fly host. Here, we present a comprehensive annotation of the genome of an Ethiopian GpSGHV isolate (GpSGHV-Eth) compared with the reference Ugandan GpSGHV isolate (GpSGHV-Uga; GenBank accession number EF568108). GpSGHV-Eth has higher salivary gland hypertrophy syndrome prevalence than GpSGHV-Uga. We show that the GpSGHV-Eth genome has 190 291 nt, a low G+C content (27.9 %) and encodes 174 putative ORFs. Using proteogenomic and transcriptome mapping, 141 and 86 ORFs were mapped by transcripts and peptides, respectively. Furthermore, of the 174 ORFs, 132 had putative transcriptional signals [TATA-like box and poly(A) signals]. Sixty ORFs had both TATA-like box promoter and poly(A) signals, and mapped by both transcripts and peptides, implying that these ORFs encode functional proteins. Of the 60 ORFs, 10 ORFs are homologues to baculovirus and nudivirus core genes, including three per os infectivity factors and four RNA polymerase subunits (LEF4, 5, 8 and 9). Whereas GpSGHV-Eth and GpSGHV-Uga are 98.1 % similar at the nucleotide level, 37 ORFs in the GpSGHV-Eth genome had nucleotide insertions (n = 17) and deletions (n = 20) compared with their homologues in GpSGHV-Uga. Furthermore, compared with the GpSGHV-Uga genome, 11 and 24 GpSGHV ORFs were deleted and novel, respectively. Further, 13 GpSGHV-Eth ORFs were non-canonical; they had either CTG or TTG start codons instead of ATG. Taken together, these data suggest that GpSGHV-Eth and GpSGHV-Uga represent two different lineages of the same virus. Genetic differences combined with host and environmental factors possibly explain the differential GpSGHV pathogenesis observed in different G. pallidipes colonies.
Affiliation(s)
- Adly M M Abd-Alla
- Insect Pest Control Laboratories, Joint FAO/IAEA Division of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Vienna, Austria
| | - Henry M Kariithi
- Insect Pest Control Laboratories, Joint FAO/IAEA Division of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Vienna, Austria; Biotechnology Research Institute, Kenya Agricultural and Livestock Research Organization, PO Box 57811, Loresho, Nairobi, Kenya; Laboratory of Virology, Wageningen University, 6708 PB, Wageningen, The Netherlands
| | - François Cousserans
- Laboratoire de Pathologie Comparée, Faculté des Sciences, Université de Montpellier, 34095 Montpellier, France
| | | | - İkbal Agah İnce
- Department of Medical Microbiology, School of Medicine, Acibadem University, 34752 Ataşehir, Istanbul, Turkey
| | - Erin D Scully
- Grain, Forage and Bioenergy Research Unit, USDA-ARS, University of Nebraska East Campus, Lincoln, NE 68583, USA
| | - Sjef Boeren
- Laboratory of Biochemistry, Wageningen University, 6703 HA Wageningen, The Netherlands
| | - Scott M Geib
- Tropical Crop and Commodity Protection Research Unit, USDA-ARS Daniel K. Inouye US Pacific Basin Agricultural Research Centre, Hilo, HI 96720, USA
| | - Solomon Mekonnen
- National Institute for Control and Eradication of Tsetse and Trypanosomosis (NICETT), Addis Ababa, Ethiopia
| | - Just M Vlak
- Laboratory of Virology, Wageningen University, 6708 PB, Wageningen, The Netherlands
| | - Andrew G Parker
- Insect Pest Control Laboratories, Joint FAO/IAEA Division of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Vienna, Austria
| | - Marc J B Vreysen
- Insect Pest Control Laboratories, Joint FAO/IAEA Division of Nuclear Techniques in Food and Agriculture, International Atomic Energy Agency, Vienna, Austria
| | - Max Bergoin
- Laboratoire de Pathologie Comparée, Faculté des Sciences, Université de Montpellier, 34095 Montpellier, France
|
17
|
Next Generation Sequencing Data and Proteogenomics. Adv Exp Med Biol 2016; 926:11-19. [DOI: 10.1007/978-3-319-42316-6_2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/28/2022]
|
18
|
Proteogenomic Tools and Approaches to Explore Protein Coding Landscapes of Eukaryotic Genomes. Adv Exp Med Biol 2016; 926:1-10. [DOI: 10.1007/978-3-319-42316-6_1] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 02/06/2023]
|
19
|
Abstract
Every molecular player in the cast of biology’s central dogma is being sequenced and quantified with increasing ease and coverage. To bring the resulting genomic, transcriptomic, and proteomic data sets into coherence, tools must be developed that do not constrain data acquisition and analytics in any way but rather provide simple links across previously acquired data sets with minimal preprocessing and hassle. Here we present such a tool: PGx, which supports proteogenomic integration of mass spectrometry proteomics data with next-generation sequencing by mapping identified peptides onto their putative genomic coordinates.
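The core operation named in this abstract, mapping an identified peptide onto genomic coordinates, can be sketched as follows. This is an illustrative approximation and not PGx's actual implementation: the function name, the forward-strand-only assumption, and the 0-based half-open exon intervals are all assumptions introduced here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open genomic interval (assumed convention)


def map_peptide_to_genome(
    peptide: str, protein: str, cds_exons: List[Interval]
) -> List[Interval]:
    """Map a peptide to genomic blocks via its position in a protein.

    Assumes a forward-strand gene whose coding exons are listed in
    transcript order and together encode `protein` from its first codon.
    Returns an empty list when the peptide is not found.
    """
    idx = protein.find(peptide)
    if idx < 0:
        return []

    # Peptide span in coding-sequence (nucleotide) coordinates.
    cds_start = 3 * idx
    cds_end = 3 * (idx + len(peptide))

    blocks: List[Interval] = []
    consumed = 0  # coding nucleotides covered by the exons visited so far
    for exon_start, exon_end in cds_exons:
        exon_len = exon_end - exon_start
        lo = max(cds_start, consumed)
        hi = min(cds_end, consumed + exon_len)
        if lo < hi:  # the peptide overlaps this exon
            blocks.append((exon_start + lo - consumed, exon_start + hi - consumed))
        consumed += exon_len
    return blocks
```

A peptide that spans a splice junction comes back as two blocks, which is the shape a BED-style track would need in order to display it in a genome browser alongside sequencing reads.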
Affiliation(s)
- Manor Askenazi
- Biomedical Hosting LLC, 33 Lewis Avenue, Arlington, Massachusetts 02474, United States
| | - Kelly V Ruggles
- NYU Langone Medical Center, 227 East 30th Street, New York, New York 10016, United States
| | - David Fenyö
- NYU Langone Medical Center, 227 East 30th Street, New York, New York 10016, United States
|
20
|
Kumar D, Mondal AK, Kutum R, Dash D. Proteogenomics of rare taxonomic phyla: A prospective treasure trove of protein coding genes. Proteomics 2015; 16:226-40. [PMID: 26773550 DOI: 10.1002/pmic.201500263] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 06/30/2015] [Revised: 09/18/2015] [Accepted: 09/28/2015] [Indexed: 01/04/2023]
Abstract
Sustainable innovations in sequencing technologies have resulted in a torrent of microbial genome sequencing projects. However, the prokaryotic genomes sequenced so far are unequally distributed along their phylogenetic tree; few phyla contain the majority, the rest only a few representatives. Accurate genome annotation lags far behind genome sequencing. While automated computational prediction, aided by comparative genomics, remains a popular choice for genome annotation, a substantial fraction of these annotations are erroneous. Proteogenomics utilizes protein level experimental observations to annotate protein coding genes on a genome wide scale. Benefits of proteogenomics include discovery and correction of gene annotations regardless of their phylogenetic conservation. This not only allows detection of common, conserved proteins but also the discovery of protein products of rare genes that may be horizontally transferred or taxonomy specific. The chances of encountering such genes are higher in rare phyla that comprise a small number of complete genome sequences. We collated all bacterial and archaeal proteogenomic studies carried out to date and reviewed them in the context of genome sequencing projects. Here, we present a comprehensive list of microbial proteogenomic studies and their taxonomic distribution, and urge targeted proteogenomics of underexplored taxa to build an extensive reference of protein coding genes.
Affiliation(s)
- Dhirendra Kumar
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
| | - Anupam Kumar Mondal
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
| | - Rintu Kutum
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
| | - Debasis Dash
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
|
21
|
Stewart PA, Parapatics K, Welsh EA, Müller AC, Cao H, Fang B, Koomen JM, Eschrich SA, Bennett KL, Haura EB. A Pilot Proteogenomic Study with Data Integration Identifies MCT1 and GLUT1 as Prognostic Markers in Lung Adenocarcinoma. PLoS One 2015; 10:e0142162. [PMID: 26539827 PMCID: PMC4634858 DOI: 10.1371/journal.pone.0142162] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 07/29/2015] [Accepted: 10/19/2015] [Indexed: 11/19/2022] Open
Abstract
We performed a pilot proteogenomic study to compare lung adenocarcinoma to lung squamous cell carcinoma using quantitative proteomics (6-plex TMT) combined with a customized Affymetrix GeneChip. Using MaxQuant software, we identified 51,001 unique peptides that mapped to 7,241 unique proteins and from these identified 6,373 genes with matching protein expression for further analysis. We found a minor correlation between gene expression and protein expression; both datasets were able to independently recapitulate known differences between the adenocarcinoma and squamous cell carcinoma subtypes. We found 565 proteins and 629 genes to be differentially expressed between adenocarcinoma and squamous cell carcinoma, with 113 of these consistently differentially expressed at both the gene and protein levels. We then compared our results to published adenocarcinoma versus squamous cell carcinoma proteomic data that we also processed with MaxQuant. We selected two proteins consistently overexpressed in squamous cell carcinoma in all studies, MCT1 (SLC16A1) and GLUT1 (SLC2A1), for further investigation. We found differential expression of these same proteins at the gene level in our study as well as in other public gene expression datasets. These findings combined with survival analysis of public datasets suggest that MCT1 and GLUT1 may be potential prognostic markers in adenocarcinoma and druggable targets in squamous cell carcinoma. Data are available via ProteomeXchange with identifier PXD002622.
Affiliation(s)
- Paul A. Stewart
- Department of Thoracic Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
| | - Katja Parapatics
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, 1090 Vienna, Austria
| | - Eric A. Welsh
- Cancer Informatics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
| | - André C. Müller
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, 1090 Vienna, Austria
| | - Haoyun Cao
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
| | - Bin Fang
- Proteomics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
| | - John M. Koomen
- Proteomics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- Department of Molecular Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
| | - Steven A. Eschrich
- Cancer Informatics Core Facility, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
| | - Keiryn L. Bennett
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, 1090 Vienna, Austria
| | - Eric B. Haura
- Department of Thoracic Oncology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida, United States of America 33612
|
22
|
Kucharova V, Wiker HG. Proteogenomics in microbiology: taking the right turn at the junction of genomics and proteomics. Proteomics 2014; 14:2360-675. [PMID: 25263021 DOI: 10.1002/pmic.201400168] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 04/28/2014] [Revised: 08/18/2014] [Accepted: 09/23/2014] [Indexed: 12/14/2022]
Abstract
High-accuracy and high-throughput proteomic methods have completely changed the way we can identify and characterize proteins. MS-based proteomics can now provide a unique supplement to genomic data and add a new level of information to the interpretation of genomic sequences. Proteomics-driven genome annotation has become especially relevant in microbiology where genomes are sequenced on a daily basis and limitations of an in silico driven annotation process are well recognized. In this review paper, we outline different strategies on how one can design a proteogenomic experiment, for example on genome-sequenced (synonymous proteogenomics) versus unsequenced organisms (ortho-proteogenomics) or with the aid of other "omic" data such as RNA-seq. We touch upon many challenges that are encountered during a typical proteogenomic study, mostly concerning bioinformatics methods and downstream data analysis, but also related to creation and use of sequence databases. A large list of proteogenomic case studies of different microorganisms is provided to illustrate the mapping of MS/MS-derived peptide spectra to genomic DNA sequences. These investigations have led to accurate determination of translational initiation sites, pointed out eventual read-throughs or programmed frameshifts, detected signal peptide processing or other protein maturation events, removed questionable annotation assignments, and provided evidence for predicted hypothetical proteins.
Affiliation(s)
- Veronika Kucharova
- Department of Clinical Science, The Gade Research Group for Infection and Immunity, University of Bergen, Norway
|
23
|
Tovchigrechko A, Venepally P, Payne SH. PGP: parallel prokaryotic proteogenomics pipeline for MPI clusters, high-throughput batch clusters and multicore workstations. Bioinformatics 2014; 30:1469-70. [PMID: 24470574 PMCID: PMC4016709 DOI: 10.1093/bioinformatics/btu051] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 11/12/2022] Open
Abstract
SUMMARY We present the first public release of our proteogenomic annotation pipeline. We have previously used our original unreleased implementation to improve the annotation of 46 diverse prokaryotic genomes by discovering novel genes, post-translational modifications and correcting the erroneous annotations by analyzing proteomic mass-spectrometry data. This public version has been redesigned to run in a wide range of parallel Linux computing environments and provided with the automated configuration, build and testing facilities for easy deployment and portability. AVAILABILITY AND IMPLEMENTATION Source code is freely available from https://bitbucket.org/andreyto/proteogenomics under GPL license. It is implemented in Python and C++. It bundles the Makeflow engine to execute the workflows. CONTACT atovtchi@jcvi.org.
Affiliation(s)
- Andrey Tovchigrechko
- J. Craig Venter Institute, 9704 Medical Center Drive, Rockville, MD 20850 and Pacific Northwest National Laboratory, 902 Battelle Blvd., Richland, WA 99354, USA
|
24
|
Cappellini E, Gentry A, Palkopoulou E, Ishida Y, Cram D, Roos AM, Watson M, Johansson US, Fernholm B, Agnelli P, Barbagli F, Littlewood DTJ, Kelstrup CD, Olsen JV, Lister AM, Roca AL, Dalén L, Gilbert MTP. Resolution of the type material of the Asian elephant, Elephas maximus Linnaeus, 1758 (Proboscidea, Elephantidae). Zool J Linn Soc 2014. [DOI: 10.1111/zoj12084] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/25/2022]
Affiliation(s)
- Enrico Cappellini
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark
| | - Anthea Gentry
- Natural History Museum, Cromwell Road, London, SW7 5BD, UK
| | - Eleftheria Palkopoulou
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405, Stockholm, Sweden
- Department of Zoology, Stockholm University, SE-10691, Stockholm, Sweden
| | - Yasuko Ishida
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
| | - David Cram
- Jesus College, Turl Street, Oxford, OX1 3DW, UK
| | - Anna-Marie Roos
- Lincoln School of Humanities, University of Lincoln, Brayford Pool, Lincoln, LN6 7TS, UK
| | - Mick Watson
- The Roslin Institute, University of Edinburgh, Midlothian, EH25 9RG, UK
| | - Ulf S. Johansson
- Department of Zoology, Swedish Museum of Natural History, SE-10405, Stockholm, Sweden
| | - Bo Fernholm
- Department of Zoology, Swedish Museum of Natural History, SE-10405, Stockholm, Sweden
| | - Paolo Agnelli
- Natural History Museum of Florence, via Romana 17, 50125, Florence, Italy
| | - Fausto Barbagli
- Natural History Museum of Florence, via Romana 17, 50125, Florence, Italy
| | | | - Christian D. Kelstrup
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3b, 2200, Copenhagen, Denmark
| | - Jesper V. Olsen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3b, 2200, Copenhagen, Denmark
| | | | - Alfred L. Roca
- Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
| | - Love Dalén
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, SE-10405, Stockholm, Sweden
| | - M. Thomas P. Gilbert
- Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350, Copenhagen, Denmark
- Ancient DNA Laboratory, Murdoch University, South St, Perth, Western Australia, 6150, Australia
|
25
|
Schubert OT, Mouritsen J, Ludwig C, Röst HL, Rosenberger G, Arthur PK, Claassen M, Campbell DS, Sun Z, Farrah T, Gengenbacher M, Maiolica A, Kaufmann SHE, Moritz RL, Aebersold R. The Mtb proteome library: a resource of assays to quantify the complete proteome of Mycobacterium tuberculosis. Cell Host Microbe 2013; 13:602-612. [PMID: 23684311 DOI: 10.1016/j.chom.2013.04.008] [Citation(s) in RCA: 135] [Impact Index Per Article: 12.3] [Received: 02/28/2013] [Revised: 03/27/2013] [Accepted: 04/15/2013] [Indexed: 12/18/2022]
Abstract
Research advancing our understanding of Mycobacterium tuberculosis (Mtb) biology and complex host-Mtb interactions requires consistent and precise quantitative measurements of Mtb proteins. We describe the generation and validation of a compendium of assays to quantify 97% of the 4,012 annotated Mtb proteins by the targeted mass spectrometric method selected reaction monitoring (SRM). Furthermore, we estimate the absolute abundance for 55% of all Mtb proteins, revealing a dynamic range within the Mtb proteome of over four orders of magnitude, and identify previously unannotated proteins. As an example of the assay library utility, we monitored the entire Mtb dormancy survival regulon (DosR), which is linked to anaerobic survival and Mtb persistence, and show its dynamic protein-level regulation during hypoxia. In conclusion, we present a publicly available research resource that supports the sensitive, precise, and reproducible quantification of virtually any Mtb protein by a robust and widely accessible mass spectrometric method.
Affiliation(s)
- Olga T Schubert
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland; Systems Biology Graduate School, Zurich, CH-8057, Switzerland
| | - Jeppe Mouritsen
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland; Molecular Life Sciences Graduate School, Zurich, CH-8093, Switzerland
| | - Christina Ludwig
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland
| | - Hannes L Röst
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland; Systems Biology Graduate School, Zurich, CH-8057, Switzerland
| | - George Rosenberger
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland; Systems Biology Graduate School, Zurich, CH-8057, Switzerland
| | - Patrick K Arthur
- Department of Biochemistry, Cell and Molecular Biology, University of Ghana, Accra, Ghana
| | - Manfred Claassen
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland
| | | | - Zhi Sun
- Institute for Systems Biology, Seattle, WA 98109, USA
| | - Terry Farrah
- Institute for Systems Biology, Seattle, WA 98109, USA
| | - Martin Gengenbacher
- Department of Immunology, Max Planck Institute for Infection Biology, Berlin D-10117, Germany
| | - Alessio Maiolica
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland
| | - Stefan H E Kaufmann
- Department of Immunology, Max Planck Institute for Infection Biology, Berlin D-10117, Germany
| | | | - Ruedi Aebersold
- Institute of Molecular Systems Biology, ETH Zurich, Zurich CH-8093, Switzerland; Faculty of Science, University of Zurich, Zurich CH-8057, Switzerland.
|
26
|
Cappellini E, Gentry A, Palkopoulou E, Ishida Y, Cram D, Roos AM, Watson M, Johansson US, Fernholm B, Agnelli P, Barbagli F, Littlewood DTJ, Kelstrup CD, Olsen JV, Lister AM, Roca AL, Dalén L, Gilbert MTP. Resolution of the type material of the Asian elephant, Elephas maximus Linnaeus, 1758 (Proboscidea, Elephantidae). Zool J Linn Soc 2013. [DOI: 10.1111/zoj.12084] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 01/22/2023]
Affiliation(s)
- Enrico Cappellini
- Centre for GeoGenetics; Natural History Museum of Denmark; University of Copenhagen; Øster Voldgade 5-7 1350 Copenhagen Denmark
| | - Anthea Gentry
- Natural History Museum; Cromwell Road London SW7 5BD UK
| | - Eleftheria Palkopoulou
- Department of Bioinformatics and Genetics; Swedish Museum of Natural History; SE-10405 Stockholm Sweden
- Department of Zoology; Stockholm University; SE-10691 Stockholm Sweden
| | - Yasuko Ishida
- Department of Animal Sciences; University of Illinois at Urbana-Champaign; Urbana Illinois 61801 USA
| | - David Cram
- Jesus College; Turl Street Oxford OX1 3DW UK
| | - Anna-Marie Roos
- Lincoln School of Humanities; University of Lincoln; Brayford Pool Lincoln LN6 7TS UK
| | - Mick Watson
- The Roslin Institute; University of Edinburgh; Midlothian EH25 9RG UK
| | - Ulf S. Johansson
- Department of Zoology; Swedish Museum of Natural History; SE-10405 Stockholm Sweden
| | - Bo Fernholm
- Department of Zoology; Swedish Museum of Natural History; SE-10405 Stockholm Sweden
| | - Paolo Agnelli
- Natural History Museum of Florence; via Romana 17 50125 Florence Italy
| | - Fausto Barbagli
- Natural History Museum of Florence; via Romana 17 50125 Florence Italy
| | | | - Christian D. Kelstrup
- Novo Nordisk Foundation Center for Protein Research; Faculty of Health Sciences; University of Copenhagen; Blegdamsvej 3b 2200 Copenhagen Denmark
| | - Jesper V. Olsen
- Novo Nordisk Foundation Center for Protein Research; Faculty of Health Sciences; University of Copenhagen; Blegdamsvej 3b 2200 Copenhagen Denmark
| | | | - Alfred L. Roca
- Department of Animal Sciences; University of Illinois at Urbana-Champaign; Urbana Illinois 61801 USA
| | - Love Dalén
- Department of Bioinformatics and Genetics; Swedish Museum of Natural History; SE-10405 Stockholm Sweden
| | - M. Thomas P. Gilbert
- Centre for GeoGenetics; Natural History Museum of Denmark; University of Copenhagen; Øster Voldgade 5-7 1350 Copenhagen Denmark
- Ancient DNA Laboratory; Murdoch University; South St Perth Western Australia 6150 Australia
|
27
|
Agrawal GK, Sarkar A, Righetti PG, Pedreschi R, Carpentier S, Wang T, Barkla BJ, Kohli A, Ndimba BK, Bykova NV, Rampitsch C, Zolla L, Rafudeen MS, Cramer R, Bindschedler LV, Tsakirpaloglou N, Ndimba RJ, Farrant JM, Renaut J, Job D, Kikuchi S, Rakwal R. A decade of plant proteomics and mass spectrometry: translation of technical advancements to food security and safety issues. Mass Spectrom Rev 2013; 32:335-65. [PMID: 23315723 DOI: 10.1002/mas.21365] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 04/20/2012] [Revised: 09/10/2012] [Accepted: 09/10/2012] [Indexed: 05/21/2023]
Abstract
Tremendous progress in plant proteomics driven by mass spectrometry (MS) techniques has been made since 2000, when few proteomics reports were published and plant proteomics was in its infancy. These achievements include the refinement of existing techniques and the search for new techniques to address food security, safety, and health issues. It is projected that in 2050 the world's population will reach 9-12 billion people, demanding a food production increase of 34-70% (FAO, 2009) from today's food production. Provision of food in a sustainable and environmentally committed manner for such a demand, without threatening natural resources, requires that agricultural production increases significantly and that postharvest handling and food manufacturing systems become more efficient, requiring lower energy expenditure, a decrease in postharvest losses, less waste generation, and food with longer shelf life. There is also a need to look for alternatives to animal-based protein sources (i.e., plant-based sources) to be able to meet the increase in protein demand by 2050. Thus, plant biology has a critical role to play as a science capable of addressing such challenges. In this review, we discuss proteomics, especially MS, as a platform that has been utilized in plant biology research for the past 10 years and has the potential to expedite the process of understanding plant biology for human benefit. The increasing application of proteomics technologies in food security, analysis, and safety is emphasized in this review. However, we are aware that no single approach or technology is capable of addressing the global food issues. Proteomics-generated information and resources must be integrated and correlated with other omics-based approaches, information, and conventional programs to ensure sufficient food and resources for human development now and in the future.
Affiliation(s)
- Ganesh Kumar Agrawal
- Research Laboratory for Biotechnology and Biochemistry, PO Box 13265, Kathmandu, Nepal.
|
28
|
Day RS, McDade KK. A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration. BMC Bioinformatics 2013; 14:223. [PMID: 23855655 PMCID: PMC3734162 DOI: 10.1186/1471-2105-14-223] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 12/30/2012] [Accepted: 07/09/2013] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: “molecular identification” (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices. RESULTS We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events. CONCLUSIONS The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
Affiliation(s)
- Roger S Day
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
|
29
|
Armengaud J, Hartmann EM, Bland C. Proteogenomics for environmental microbiology. Proteomics 2013; 13:2731-42. [PMID: 23636904 DOI: 10.1002/pmic.201200576] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.8] [Received: 12/19/2012] [Revised: 03/06/2013] [Accepted: 04/09/2013] [Indexed: 11/09/2022]
Abstract
Proteogenomics sensu stricto refers to the use of proteomic data to refine the annotation of genomes from model organisms. Because of the limitations of automatic annotation pipelines, a relatively high number of errors occur during the structural annotation of genes coding for proteins. Whether putative orphan sequences or short genes encoding low-molecular-weight proteins really exist is still frequently a mystery. Whether start codons are well defined is also an open debate. These problems are exacerbated for genomes of microorganisms belonging to poorly documented genera, as related sequences are not always available for homology-guided annotation. The functional annotation of a significant proportion of genes is another well-known issue when annotating environmental microorganisms. High-throughput shotgun proteomics has recently evolved greatly, allowing the exploration of the proteome from any microorganism at an unprecedented depth. The structural and functional annotation process may be usefully complemented with experimental data. Indeed, proteogenomic mapping has been successfully performed for a wide variety of organisms. Specific approaches devoted to systematically establishing the N-termini of a large set of proteins are being developed. N-terminomics is giving rise to datasets of experimentally proven translational start codons as well as validated peptide signals for secreted proteins. By extension, combining genomic and proteomic data is becoming routine in many research projects. The proteomic analysis of organisms with unfinished genome sequences, the so-called composite proteomics, and the search for microbial biomarkers by bottom-up and top-down combined approaches are some examples of proteogenomic-flavored studies. They illustrate the advent of a new era of environmental microbiology where proteomics and genomics are intimately integrated to answer key biological questions.
Affiliation(s)
- Jean Armengaud
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze, France
|
30
|
Abstract
Proteogenomic searching is a useful method for identifying novel proteins, annotating genes and detecting peptides unique to an individual genome. The approach, however, can be laborious, as it often requires search segmentation and the use of several unintegrated tools. Furthermore, many proteogenomic efforts have been limited to small genomes, as large genomes can prove impractical due to the required amount of computer memory and computation time. We present Peppy, a software tool designed to perform every necessary task of proteogenomic searches quickly, accurately and automatically. The software generates a peptide database from a genome, tracks peptide loci, matches peptides to MS/MS spectra and assigns confidence values to those matches. Peppy automatically generates a decoy database, searches it and analyzes the results to return identifications at the desired false discovery rate threshold. Written in Java for cross-platform execution, the software is fully multithreaded for enhanced speed. The program can run on regular desktop computers, opening the doors of proteogenomic searching to a wider audience of proteomics and genomics researchers. Peppy is available at http://geneffects.com/peppy.
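The decoy-based filtering mentioned in this abstract can be illustrated with a minimal sketch. It is not Peppy's implementation (Peppy is written in Java, and its scoring is not described here); it only shows the general target-decoy idea of accepting the largest set of top-scoring matches whose estimated false discovery rate stays at or below a chosen threshold. The function name `accept_at_fdr`, the `(score, is_decoy)` tuple layout, and the decoys/targets estimate of the FDR are assumptions made for illustration.

```python
from typing import List, Tuple


def accept_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float) -> List[float]:
    """Return scores of target matches accepted at the given FDR threshold.

    psms: (score, is_decoy) pairs, where a higher score means a better match.
    The FDR at each rank is estimated as decoys / targets seen so far
    (a common convention, assumed here for illustration).
    """
    # Rank all matches, best score first.
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    targets = 0
    decoys = 0
    best_cut = 0  # number of top-ranked matches to accept

    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimate is still acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            best_cut = rank

    # Report only target matches above the cut; decoys are discarded.
    return [score for score, is_decoy in ranked[:best_cut] if not is_decoy]
```

Calling `accept_at_fdr(psms, 0.01)` on a list of scored target and decoy matches would return the target scores retained at an estimated 1% FDR under these assumptions.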
Affiliation(s)
- Brian A Risk
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, North Carolina 27599, United States.
|
31
|
Kuhring M, Renard BY. iPiG: integrating peptide spectrum matches into genome browser visualizations. PLoS One 2012; 7:e50246. [PMID: 23226516 PMCID: PMC3514238 DOI: 10.1371/journal.pone.0050246] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 07/11/2012] [Accepted: 10/22/2012] [Indexed: 11/18/2022] Open
Abstract
Proteogenomic approaches have gained increasing popularity; however, it is still difficult to integrate mass spectrometry identifications with genomic data due to differing data formats. To address this difficulty, we introduce iPiG as a tool for the integration of peptide identifications from mass spectrometry experiments into existing genome browser visualizations. Thereby, the concurrent analysis of proteomic and genomic data is simplified and proteomic results can be compared directly to genomic data. iPiG is freely available from https://sourceforge.net/projects/ipig/. It is implemented in Java and can be run as a stand-alone tool with a graphical user interface or integrated into existing workflows. Supplementary data are available at PLOS ONE online.
Affiliation(s)
- Mathias Kuhring
- Research Group Bioinformatics (NG4), Robert Koch-Institute, Berlin, Germany
- Bernhard Y. Renard
- Research Group Bioinformatics (NG4), Robert Koch-Institute, Berlin, Germany
|
32
|
Helmy M, Sugiyama N, Tomita M, Ishihama Y. Mass spectrum sequential subtraction speeds up searching large peptide MS/MS spectra datasets against large nucleotide databases for proteogenomics. Genes Cells 2012; 17:633-44. [PMID: 22686349 DOI: 10.1111/j.1365-2443.2012.01615.x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 03/19/2012] [Accepted: 04/14/2012] [Indexed: 01/18/2023]
Abstract
We have developed a novel bioinformatics method called mass spectrum sequential subtraction (MSSS) to search large peptide spectra datasets produced by liquid chromatography/mass spectrometry (LC-MS/MS) against protein and large nucleotide sequence databases. The main principle in MSSS is to search the peptide spectra set against the protein database, followed by removal of the spectra corresponding to the identified peptides to create a smaller set of the remaining peptide spectra for searching against the nucleotide sequence database. In this way, we reduce the number of spectra to be searched and so limit the peptide search space. Comparing MSSS with a conventional search approach on a dataset of 27 LC-MS/MS runs of rice culture cells indicated that MSSS reduced the search queries to 50% and the search time to 75% on average. In addition, MSSS had no effect on the identification false-positive rate (FPR) or on the ability to identify novel peptide sequences. We used MSSS to analyze another dataset of 34 LC-MS/MS runs, which identified 74 additional novel peptides. Proteogenomic analysis with these additional peptides yielded 47 new genomic features in 24 rice genes plus 24 intergenic peptides. These results show the utility of MSSS in searching large databases with large MS/MS datasets for proteogenomics.
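The two-stage principle described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function `sequential_subtraction`, the `SearchFn` signature, and the representation of spectra as string IDs are all hypothetical, and the two search callables stand in for whatever search engine would actually be used.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

# Hypothetical search interface: takes a set of spectrum IDs and returns
# {spectrum_id: peptide_sequence} for the spectra it could identify.
SearchFn = Callable[[Set[str]], Dict[str, str]]


def sequential_subtraction(
    spectra: Iterable[str],
    search_protein_db: SearchFn,
    search_nucleotide_db: SearchFn,
) -> Tuple[Dict[str, str], Dict[str, str], Set[str]]:
    """Search the protein database first, then only the leftover spectra
    against the (much larger) nucleotide database."""
    all_spectra = set(spectra)

    # Stage 1: search every spectrum against the protein database.
    protein_hits = search_protein_db(all_spectra)

    # Subtraction: drop spectra that were already identified.
    remaining = all_spectra - set(protein_hits)

    # Stage 2: search only the remaining spectra against the nucleotide database.
    nucleotide_hits = search_nucleotide_db(remaining)

    return protein_hits, nucleotide_hits, remaining
```

The saving comes from the second call receiving only `remaining` rather than the full spectrum set, which is the reduction in search queries the abstract reports.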
Affiliation(s)
- Mohamed Helmy
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata 997-0017, Japan
|
33
|
Translational plant proteomics: a perspective. J Proteomics 2012; 75:4588-601. [PMID: 22516432 DOI: 10.1016/j.jprot.2012.03.055] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Received: 11/13/2011] [Revised: 02/25/2012] [Accepted: 03/25/2012] [Indexed: 11/21/2022]
Abstract
Translational proteomics is an emerging sub-discipline of the proteomics field in the biological sciences. Translational plant proteomics aims to integrate knowledge from basic sciences and translate it into field applications to solve issues related, but not limited, to the recreational and economic values of plants, food security and safety, and energy sustainability. In this review, we highlight the substantial progress reached in plant proteomics during the past decade, which has paved the way for translational plant proteomics. Increasing proteomics knowledge in plants is not limited to model and non-model plants, proteogenomics, crop improvement, and food analysis, safety, and nutrition, but extends to many more potential applications. Given the wealth of information generated and to some extent applied, there is a need for more efficient and broader channels to freely disseminate the information to the scientific community. This article is part of a Special Issue entitled: Translational Proteomics.
|