1
Aggarwal S, Raj A, Kumar D, Dash D, Yadav AK. False discovery rate: the Achilles' heel of proteogenomics. Brief Bioinform 2022; 23:6582880. [PMID: 35534181 DOI: 10.1093/bib/bbac163] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/30/2021] [Revised: 03/14/2022] [Accepted: 04/12/2022] [Indexed: 12/25/2022]
Abstract
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de-facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame nucleotide database translation not only increase the search space and compute-time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, first, we briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
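The contrast between a global target-decoy FDR and a peptide-class-specific FDR, as mentioned in the abstract above, can be sketched as follows. This is a minimal illustration rather than the authors' method: the `PSM` record, the class labels "known" and "novel", the toy scores, and the simple estimator (decoy hits divided by target hits above a score threshold) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """One peptide-spectrum match (hypothetical record for illustration)."""
    score: float          # assumed: higher score means a better match
    is_decoy: bool        # True if the match came from the decoy database
    peptide_class: str    # assumed labels: "known" or "novel"


def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits among PSMs at or above threshold."""
    targets = sum(1 for p in psms if p.score >= threshold and not p.is_decoy)
    decoys = sum(1 for p in psms if p.score >= threshold and p.is_decoy)
    return decoys / targets if targets else 0.0


def class_specific_fdr(psms, threshold):
    """Apply the same estimator separately within each peptide class."""
    classes = sorted({p.peptide_class for p in psms})
    return {
        c: fdr_at_threshold([p for p in psms if p.peptide_class == c], threshold)
        for c in classes
    }


# Toy data: the only decoy above the threshold falls in the "novel" class.
psms = [
    PSM(9.0, False, "known"), PSM(8.5, False, "known"),
    PSM(8.0, False, "known"), PSM(7.5, False, "known"),
    PSM(7.0, True, "novel"), PSM(6.5, False, "novel"),
    PSM(6.0, False, "novel"), PSM(3.0, True, "known"),
]

global_fdr = fdr_at_threshold(psms, 5.0)
per_class = class_specific_fdr(psms, 5.0)
print(global_fdr, per_class)
```

In this toy data the global estimate is 1/6, while the novel class alone has an estimate of 1/2. That gap illustrates why a single global threshold can understate the error rate among novel peptides.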
Affiliation(s)
- Suruchi Aggarwal
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
- Anurag Raj
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Dhirendra Kumar
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India
- Debasis Dash
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Amit Kumar Yadav
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
2
Shaw RK, Shen Y, Wang J, Sheng X, Zhao Z, Yu H, Gu H. Advances in Multi-Omics Approaches for Molecular Breeding of Black Rot Resistance in Brassica oleracea L. Frontiers in Plant Science 2021; 12:742553. [PMID: 34938304 PMCID: PMC8687090 DOI: 10.3389/fpls.2021.742553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2021] [Accepted: 10/20/2021] [Indexed: 06/14/2023]
Abstract
Brassica oleracea is one of the most important species of the Brassicaceae family encompassing several economically important vegetables produced and consumed worldwide. But its sustainability is challenged by a range of pathogens, among which black rot, caused by Xanthomonas campestris pv. campestris (Xcc), is the most serious and destructive seed borne bacterial disease, causing huge yield losses. Host-plant resistance could act as the most effective and efficient solution to curb black rot disease for sustainable production of B. oleracea. Recently, 'omics' technologies have emerged as promising tools to understand the host-pathogen interactions, thereby gaining a deeper insight into the resistance mechanisms. In this review, we have summarized the recent achievements made in the emerging omics technologies to tackle the black rot challenge in B. oleracea. With an integrated approach of the omics technologies such as genomics, proteomics, transcriptomics, and metabolomics, it would allow better understanding of the complex molecular mechanisms underlying black rot resistance. Due to the availability of sequencing data, genomics and transcriptomics have progressed as expected for black rot resistance, however, other omics approaches like proteomics and metabolomics are lagging behind, necessitating a holistic and targeted approach to address the complex questions of Xcc-Brassica interactions. Genomic studies revealed that the black rot resistance is a complex trait and is mostly controlled by quantitative trait locus (QTL) with minor effects. Transcriptomic analysis divulged the genes related to photosynthesis, glucosinolate biosynthesis and catabolism, phenylpropanoid biosynthesis pathway, ROS scavenging, calcium signalling, hormonal synthesis and signalling pathway are being differentially expressed upon Xcc infection. 
Comparative proteomic analysis in relation to susceptible and/or resistance interactions with Xcc identified the involvement of proteins related to photosynthesis, protein biosynthesis, processing and degradation, energy metabolism, innate immunity, redox homeostasis, and defence response and signalling pathways in Xcc-Brassica interaction. Specifically, most of the studies focused on the regulation of the photosynthesis-related proteins as a resistance response in both early and later stages of infection. Metabolomic studies suggested that glucosinolates (GSLs), especially aliphatic and indolic GSLs, its subsequent hydrolysis products, and defensive metabolites synthesized by jasmonic acid (JA)-mediated phenylpropanoid biosynthesis pathway are involved in disease resistance mechanisms against Xcc in Brassica species. Multi-omics analysis showed that JA signalling pathway is regulating resistance against hemibiotrophic pathogen like Xcc. So, the bonhomie between omics technologies and plant breeding is going to trigger major breakthroughs in the field of crop improvement by developing superior cultivars with broad-spectrum resistance. If multi-omics tools are implemented at the right scale, we may be able to achieve the maximum benefits from the minimum. In this review, we have also discussed the challenges, future prospects, and the way forward in the application of omics technologies to accelerate the breeding of B. oleracea for disease resistance. A deeper insight about the current knowledge on omics can offer promising results in the breeding of high-quality disease-resistant crops.
Affiliation(s)
- Honghui Gu
- Institute of Vegetables, Zhejiang Academy of Agricultural Sciences, Hangzhou, China
3
Maes E, Oeyen E, Boonen K, Schildermans K, Mertens I, Pauwels P, Valkenborg D, Baggerman G. The challenges of peptidomics in complementing proteomics in a clinical context. Mass Spectrometry Reviews 2019; 38:253-264. [PMID: 30372792 DOI: 10.1002/mas.21581] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 09/14/2016] [Accepted: 10/01/2018] [Indexed: 06/08/2023]
Abstract
Naturally occurring peptides, including growth factors, hormones, and neurotransmitters, represent an important class of biomolecules and have crucial roles in human physiology. The study of these peptides in clinical samples is therefore as relevant as ever. Compared to more routine proteomics applications in clinical research, peptidomics research questions are more challenging and have special requirements with regard to sample handling, experimental design, and bioinformatics. In this review, we describe the issues that confront peptidomics in a clinical context. After these hurdles are (partially) overcome, peptidomics will be ready for a successful translation into medical practice.
Affiliation(s)
- Evelyne Maes
- Flemish Institute for Technological Research (VITO), Mol, Belgium
- Centre for Proteomics, University of Antwerp, Antwerp, Belgium
- Food and Bio-Based Products, AgResearch Ltd., Lincoln, New Zealand
- Eline Oeyen
- Flemish Institute for Technological Research (VITO), Mol, Belgium
- Centre for Proteomics, University of Antwerp, Antwerp, Belgium
- Kurt Boonen
- Flemish Institute for Technological Research (VITO), Mol, Belgium
- Centre for Proteomics, University of Antwerp, Antwerp, Belgium
- Karin Schildermans
- Flemish Institute for Technological Research (VITO), Mol, Belgium
- Centre for Proteomics, University of Antwerp, Antwerp, Belgium
- Inge Mertens
- Flemish Institute for Technological Research (VITO), Mol, Belgium
- Centre for Proteomics, University of Antwerp, Antwerp, Belgium
- Patrick Pauwels
- Molecular Pathology Unit, Department of Pathology, Antwerp University Hospital, Edegem, Belgium
- Dirk Valkenborg
- Flemish Institute for Technological Research (VITO), Mol, Belgium
- Centre for Proteomics, University of Antwerp, Antwerp, Belgium
- Center for Statistics, Hasselt University, Diepenbeek, Belgium
- Geert Baggerman
- Flemish Institute for Technological Research (VITO), Mol, Belgium
- Centre for Proteomics, University of Antwerp, Antwerp, Belgium
4
Sugiyama N, Miyake S, Lin MH, Wakabayashi M, Marusawa H, Nishiumi S, Yoshida M, Ishihama Y. Comparative proteomics of Helicobacter pylori strains reveals geographical features rather than genomic variations. Genes Cells 2019; 24:139-150. [PMID: 30548729 DOI: 10.1111/gtc.12662] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/22/2018] [Accepted: 12/01/2018] [Indexed: 01/05/2023]
Abstract
Helicobacter pylori, a pathogen of various gastric diseases, has many genome sequence variants. Thus, the pathogenesis and infection mechanisms of the H. pylori-driven gastric diseases have not been elucidated. Here, we carried out a large-scale proteome analysis to profile the heterogeneity of the proteome expression of 7 H. pylori strains by using an LC/MS/MS-based proteomics approach combined with a customized database consisting of nonredundant tryptic peptide sequences derived from full genome sequences of 52 H. pylori strains. The nonredundant peptide database enabled us to identify more peptides in the database search of MS/MS data compared with a simply merged protein database. Using this approach, we carried out proteome analysis of genome-unknown strains of H. pylori at as large a scale as genome-known ones. Clustering of the H. pylori strains using proteome profiling slightly differed from the genome profiling and more clearly divided the strains into two groups based on the isolated area. Furthermore, we identified phosphorylated proteins and sites of the H. pylori strains and obtained the phosphorylation motifs located in the N-terminus that are commonly observed in bacteria.
Affiliation(s)
- Naoyuki Sugiyama
- Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Japan
- Satomi Miyake
- Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Japan
- Miao-Hsia Lin
- Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Japan
- Masaki Wakabayashi
- Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Japan
- Hiroyuki Marusawa
- Department of Gastroenterology and Hepatology, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Shin Nishiumi
- Division of Gastroenterology, Department of Internal Medicine, Kobe University Graduate School of Medicine, Kobe, Japan
- Masaru Yoshida
- Division of Gastroenterology, Department of Internal Medicine, Kobe University Graduate School of Medicine, Kobe, Japan; Division of Metabolomics Research, Department of Internal Related, Kobe University Graduate School of Medicine, Kobe, Japan; AMED-CREST, AMED, Kobe, Japan
- Yasushi Ishihama
- Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Japan
5
Hernandez-Valladares M, Vaudel M, Selheim F, Berven F, Bruserud Ø. Proteogenomics approaches for studying cancer biology and their potential in the identification of acute myeloid leukemia biomarkers. Expert Rev Proteomics 2017; 14:649-663. [DOI: 10.1080/14789450.2017.1352474] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/02/2023]
Affiliation(s)
- Maria Hernandez-Valladares
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Marc Vaudel
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
- Frode Selheim
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Frode Berven
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Øystein Bruserud
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
6
7
Grossmann J, Fernández H, Chaubey PM, Valdés AE, Gagliardini V, Cañal MJ, Russo G, Grossniklaus U. Proteogenomic Analysis Greatly Expands the Identification of Proteins Related to Reproduction in the Apogamous Fern Dryopteris affinis ssp. affinis. Frontiers in Plant Science 2017; 8:336. [PMID: 28382042 PMCID: PMC5360702 DOI: 10.3389/fpls.2017.00336] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 09/20/2016] [Accepted: 02/27/2017] [Indexed: 05/19/2023]
Abstract
Performing proteomic studies on non-model organisms with little or no genomic information is still difficult. However, many specific processes and biochemical pathways occur only in species that are poorly characterized at the genomic level. For example, many plants can reproduce both sexually and asexually, the first one allowing the generation of new genotypes and the latter their fixation. Thus, both modes of reproduction are of great agronomic value. However, the molecular basis of asexual reproduction is not well understood in any plant. In ferns, it combines the production of unreduced spores (diplospory) and the formation of sporophytes from somatic cells (apogamy). To set the basis to study these processes, we performed transcriptomics by next-generation sequencing (NGS) and shotgun proteomics by tandem mass spectrometry in the apogamous fern D. affinis ssp. affinis. For protein identification we used the public viridiplantae database (VPDB) to identify orthologous proteins from other plant species and new transcriptomics data to generate a "species-specific transcriptome database" (SSTDB). In total 1,397 protein clusters with 5,865 unique peptide sequences were identified (13 decoy proteins out of 1,410, protFDR 0.93% on protein cluster level). We show that using the SSTDB for protein identification increases the number of identified peptides almost four times compared to using only the publically available VPDB. We identified homologs of proteins involved in reproduction of higher plants, including proteins with a potential role in apogamy. With the increasing availability of genomic data from non-model species, similar proteogenomics approaches will improve the sensitivity in protein identification for species only distantly related to models.
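The protein-level FDR quoted in the abstract above can be checked with a short calculation. The sketch assumes the estimator is the number of decoy protein clusters divided by the number of target protein clusters. The abstract does not state the formula, but this reading reproduces the reported 0.93%, whereas dividing by all 1,410 clusters would give roughly 0.92%. The function name `protein_fdr` is hypothetical.

```python
def protein_fdr(n_decoy: int, n_total: int) -> float:
    """Decoy clusters divided by target clusters (assumed estimator)."""
    n_target = n_total - n_decoy
    return n_decoy / n_target


# Figures from the abstract: 13 decoy clusters out of 1,410 in total,
# which leaves 1,397 target protein clusters.
fdr = protein_fdr(13, 1410)
print(f"{fdr:.2%}")  # prints 0.93%
```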
Affiliation(s)
- Helena Fernández
- Area of Plant Physiology, Department of Organisms and Systems Biology (BOS), Oviedo University, Oviedo, Spain
- *Correspondence: Helena Fernández
- Pururawa M. Chaubey
- Institute of Plant and Microbial Biology, Zurich-Basel Plant Science Center, University of Zurich, Zürich, Switzerland
- Ana E. Valdés
- Physiological Botany, Uppsala BioCenter, Uppsala University, Uppsala, Sweden
- Linnean Centre for Plant Biology, Uppsala, Sweden
- Valeria Gagliardini
- Institute of Plant and Microbial Biology, Zurich-Basel Plant Science Center, University of Zurich, Zürich, Switzerland
- María J. Cañal
- Area of Plant Physiology, Department of Organisms and Systems Biology (BOS), Oviedo University, Oviedo, Spain
- Ueli Grossniklaus
- Institute of Plant and Microbial Biology, Zurich-Basel Plant Science Center, University of Zurich, Zürich, Switzerland
8
Abstract
Omics approaches have become popular in biology as powerful discovery tools, and currently gain in interest for diagnostic applications. Establishing the accurate genome sequence of any organism is easy, but the outcome of its annotation by means of automatic pipelines remains imprecise. Some protein-encoding genes may be missed as soon as they are specific and poorly conserved in a given taxon, while important to explain the specific traits of the organism. Translational starts are also poorly predicted in a relatively important number of cases, thus impacting the protein sequence database used in proteomics, comparative genomics, and systems biology. The use of high-throughput proteomics data to improve genome annotation is an attractive option to obtain a more comprehensive molecular picture of a given organism. Here, protocols for reannotating prokaryote genomes are described based on shotgun proteomics and derivatization of protein N-termini with a positively charged reagent coupled to high-resolution tandem mass spectrometry.
9
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annual Review of Analytical Chemistry (Palo Alto, Calif.) 2016; 9:521-45. [PMID: 27049631 PMCID: PMC4991544 DOI: 10.1146/annurev-anchem-071015-041722] [Citation(s) in RCA: 73] [Impact Index Per Article: 9.1] [Indexed: 05/09/2023]
Abstract
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
Affiliation(s)
- Gloria M Sheynkman
- Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215
- Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Michael R Shortreed
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Anthony J Cesnik
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Lloyd M Smith
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Genome Center of Wisconsin, University of Wisconsin, Madison, Wisconsin 53706
10
Abstract
The next generation sequencing (NGS) is an important process which assures inexpensive organization of vast size of raw sequence dataset over any traditional sequencing systems or methods. Various aspects of NGS such as template preparation, sequencing imaging and genome alignment and assembly outline the genome sequencing and alignment. Consequently, de Bruijn graph (dBG) is an important mathematical tool that graphically analyzes how the orientations are constructed in groups of nucleotides. Basically, dBG describes the formation of the genome segments in circular iterative fashions. Some pivotal dBG-based de novo algorithms and software packages such as T-IDBA, Oases, IDBA-tran, Euler, Velvet, ABySS, AllPaths, SOAPdenovo and SOAPdenovo2 are illustrated in this paper. Consequently, overlap layout consensus (OLC) graph-based algorithms also play vital role in NGS assembly. Some important OLC-based algorithms such as MIRA3, CABOG, Newbler, Edena, Mosaik and SHORTY are portrayed in this paper. It has been experimented that greedy graph-based algorithms and software packages are also vital for proper genome dataset assembly. A few algorithms named SSAKE, SHARCGS and VCAKE help to perform proper genome sequencing.
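For readers unfamiliar with the de Bruijn graph named in the abstract above, a minimal construction can be sketched as follows. It is a textbook-style illustration, not the implementation of any listed assembler: each k-mer in a read contributes one edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name and the toy read are assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy read "ACGTC" with k = 3 yields the k-mers ACG, CGT and GTC.
graph = de_bruijn_graph(["ACGTC"], 3)
print(graph)
```

Real assemblers add error correction, graph simplification and handling of reverse-complement strands, none of which is attempted here.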
Affiliation(s)
- Sonia Farhana Nimmy
- Department of Computer Science and Engineering, BGC Trust University, BGC Biddha Nagar, Chandanaish, Chittagong, Bangladesh
- M. S. Kamal
- Department of Computer Science and Engineering, BGC Trust University, BGC Biddha Nagar, Chandanaish, Chittagong, Bangladesh
11
Gonnelli G, Stock M, Verwaeren J, Maddelein D, De Baets B, Martens L, Degroeve S. A Decoy-Free Approach to the Identification of Peptides. J Proteome Res 2015; 14:1792-8. [DOI: 10.1021/pr501164r] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Indexed: 11/29/2022]
Affiliation(s)
- Giulia Gonnelli
- Department of Medical Protein Research, VIB, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
- Department of Biochemistry, Ghent University, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
- Michiel Stock
- Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium
- Jan Verwaeren
- Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium
- Davy Maddelein
- Department of Medical Protein Research, VIB, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
- Department of Biochemistry, Ghent University, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
- Bernard De Baets
- Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium
- Lennart Martens
- Department of Medical Protein Research, VIB, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
- Department of Biochemistry, Ghent University, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
- Sven Degroeve
- Department of Medical Protein Research, VIB, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
- Department of Biochemistry, Ghent University, Albert Baertsoenkaai 3, B-9000 Ghent, Belgium
12
Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods 2014; 11:1114-25. [PMID: 25357241 DOI: 10.1038/nmeth.3144] [Citation(s) in RCA: 505] [Impact Index Per Article: 56.1] [Received: 05/21/2014] [Accepted: 09/22/2014] [Indexed: 12/19/2022]
Abstract
Proteogenomics is an area of research at the interface of proteomics and genomics. In this approach, customized protein sequence databases generated using genomic and transcriptomic information are used to help identify novel peptides (not present in reference protein sequence databases) from mass spectrometry-based proteomic data; in turn, the proteomic data can be used to provide protein-level evidence of gene expression and to help refine gene models. In recent years, owing to the emergence of new sequencing technologies such as RNA-seq and dramatic improvements in the depth and throughput of mass spectrometry-based proteomics, the pace of proteogenomic research has greatly accelerated. Here I review the current state of proteogenomic methods and applications, including computational strategies for building and using customized protein sequence databases. I also draw attention to the challenge of false positive identifications in proteogenomics and provide guidelines for analyzing the data and reporting the results of proteogenomic studies.
Affiliation(s)
- Alexey I Nesvizhskii
- Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
13
Karpova MA, Karpov DS, Ivanov MV, Pyatnitskiy MA, Chernobrovkin AL, Lobas AA, Lisitsa AV, Archakov AI, Gorshkov MV, Moshkovskii SA. Exome-driven characterization of the cancer cell lines at the proteome level: the NCI-60 case study. J Proteome Res 2014; 13:5551-60. [PMID: 25333775 DOI: 10.1021/pr500531x] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Indexed: 12/17/2022]
Abstract
Cancer genome deviates significantly from the reference human genome, and thus a search against standard genome databases in cancer cell proteomics fails to identify cancer-specific protein variants. The goal of this Article is to combine high-throughput exome data [Abaan et al. Cancer Res. 2013] and shotgun proteomics analysis [Moghaddas Gholami et al. Cell Rep. 2013] for cancer cell lines from NCI-60 panel to demonstrate further that the cell lines can be effectively recognized using identified variant peptides. To achieve this goal, we generated a database containing mutant protein sequences of NCI-60 panel of cell lines. The proteome data were searched using Mascot and X!Tandem search engines against databases of both reference and mutant protein sequences. The identification quality was further controlled by calculating a fraction of variant peptides encoded by the own exome sequence for each cell line. We found that up to 92.2% peptides identified by both search engines are encoded by the own exome. Further, we used the identified variant peptides for cell line recognition. The results of the study demonstrate that proteome data supported by exome sequence information can be effectively used for distinguishing between different types of cancer cell lines.
Affiliation(s)
- Maria A Karpova
- Orekhovich Institute of Biomedical Chemistry, 119121, Moscow, Russia
14
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Indexed: 01/19/2023]
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. Here we discuss them as a framework of four stages for data analysis and processing and survey variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that face current assemblers in the next-generation environment to determine the current state-of-the-art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
- Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
15
Armengaud J, Hartmann EM, Bland C. Proteogenomics for environmental microbiology. Proteomics 2013; 13:2731-42. [PMID: 23636904 DOI: 10.1002/pmic.201200576] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.8] [Received: 12/19/2012] [Revised: 03/06/2013] [Accepted: 04/09/2013] [Indexed: 11/09/2022]
Abstract
Proteogenomics sensu stricto refers to the use of proteomic data to refine the annotation of genomes from model organisms. Because of the limitations of automatic annotation pipelines, a relatively high number of errors occur during the structural annotation of genes coding for proteins. Whether putative orphan sequences or short genes encoding low-molecular-weight proteins really exist is still frequently a mystery. Whether start codons are well defined is also an open debate. These problems are exacerbated for genomes of microorganisms belonging to poorly documented genera, as related sequences are not always available for homology-guided annotation. The functional annotation of a significant proportion of genes is also another well-known issue when annotating environmental microorganisms. High-throughput shotgun proteomics has recently greatly evolved, allowing the exploration of the proteome from any microorganism at an unprecedented depth. The structural and functional annotation process may be usefully complemented with experimental data. Indeed, proteogenomic mapping has been successfully performed for a wide variety of organisms. Specific approaches devoted to systematically establishing the N-termini of a large set of proteins are being developed. N-terminomics is giving rise to datasets of experimentally proven translational start codons as well as validated peptide signals for secreted proteins. By extension, combining genomic and proteomic data is becoming routine in many research projects. The proteomic analysis of organisms with unfinished genome sequences, the so-called composite proteomics, and the search for microbial biomarkers by bottom-up and top-down combined approaches are some examples of proteogenomic-flavored studies. They illustrate the advent of a new era of environmental microbiology where proteomics and genomics are intimately integrated to answer key biological questions.
Affiliation(s)
- Jean Armengaud
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze, France
16
Abstract
Proteogenomic searching is a useful method for identifying novel proteins, annotating genes and detecting peptides unique to an individual genome. The approach, however, can be laborious, as it often requires search segmentation and the use of several unintegrated tools. Furthermore, many proteogenomic efforts have been limited to small genomes, as large genomes can prove impractical due to the required amount of computer memory and computation time. We present Peppy, a software tool designed to perform every necessary task of proteogenomic searches quickly, accurately and automatically. The software generates a peptide database from a genome, tracks peptide loci, matches peptides to MS/MS spectra and assigns confidence values to those matches. Peppy automatically performs a decoy database generation, search and analysis to return identifications at the desired false discovery rate threshold. Written in Java for cross-platform execution, the software is fully multithreaded for enhanced speed. The program can run on regular desktop computers, opening the doors of proteogenomic searching to a wider audience of proteomics and genomics researchers. Peppy is available at http://geneffects.com/peppy.
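The first step described in the abstract above, deriving candidate protein sequences from a genome, is commonly done by six-frame translation. The sketch below illustrates that general technique only. It is not Peppy's code, and the function names, the use of the standard genetic code, and the toy sequence are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna: str) -> list:
    """Return translations for frames +1, +2, +3, then -1, -2, -3."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    return [translate(strand[offset:])
            for strand in (forward, reverse)
            for offset in range(3)]


frames = six_frame_translate("ATGGCCTAA")
print(frames)
```

A full proteogenomic search tool would go on to split each frame at stop codons, digest the resulting sequences into peptides, and record genomic coordinates. Those steps are omitted here.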
Affiliation(s)
- Brian A Risk
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, North Carolina 27599, United States.
17
Yamana R, Iwasaki M, Wakabayashi M, Nakagawa M, Yamanaka S, Ishihama Y. Rapid and Deep Profiling of Human Induced Pluripotent Stem Cell Proteome by One-shot NanoLC–MS/MS Analysis with Meter-scale Monolithic Silica Columns. J Proteome Res 2012; 12:214-21. [DOI: 10.1021/pr300837u] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Indexed: 02/04/2023]
Affiliation(s)
- Ryota Yamana
- Department of Molecular & Cellular BioAnalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
- Mio Iwasaki
- Department of Molecular & Cellular BioAnalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
- Masaki Wakabayashi
- Department of Molecular & Cellular BioAnalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan
- Masato Nakagawa
- Center for iPS Cell Research and Application, Kyoto University, Sakyo-ku, Kyoto 606-8507, Japan
- Shinya Yamanaka
- Center for iPS Cell Research and Application, Kyoto University, Sakyo-ku, Kyoto 606-8507, Japan
- Yasushi Ishihama
- Department of Molecular & Cellular BioAnalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8501, Japan