1
|
Fuchs S, Engelmann S. Small proteins in bacteria - Big challenges in prediction and identification. Proteomics 2023; 23:e2200421. [PMID: 37609810 DOI: 10.1002/pmic.202200421] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2023] [Revised: 08/03/2023] [Accepted: 08/10/2023] [Indexed: 08/24/2023]
Abstract
Proteins with up to 100 amino acids have been largely overlooked due to the challenges associated with predicting and identifying them using traditional methods. Recent advances in bioinformatics and machine learning, DNA sequencing, RNA and Ribo-seq technologies, and mass spectrometry (MS) have greatly facilitated the detection and characterisation of these elusive proteins in recent years. This has revealed their crucial role in various cellular processes including regulation, signalling and transport, as toxins and as folding helpers for protein complexes. Consequently, the systematic identification and characterisation of these proteins in bacteria have emerged as a prominent field of interest within the microbial research community. This review provides an overview of different strategies for predicting and identifying these proteins on a large scale, leveraging the power of these advanced technologies. Furthermore, the review offers insights into the future developments that may be expected in this field.
Collapse
Affiliation(s)
- Stephan Fuchs
- Genome Competence Center (MF1), Department MFI, Robert-Koch-Institut, Berlin, Germany
| | - Susanne Engelmann
- Institute for Microbiology, Technische Universität Braunschweig, Braunschweig, Germany
- Microbial Proteomics, Helmholtzzentrum für Infektionsforschung GmbH, Braunschweig, Germany
| |
Collapse
|
2
|
Mohsen JJ, Martel AA, Slavoff SA. Microproteins-Discovery, structure, and function. Proteomics 2023; 23:e2100211. [PMID: 37603371 PMCID: PMC10841188 DOI: 10.1002/pmic.202100211] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2023] [Revised: 08/03/2023] [Accepted: 08/10/2023] [Indexed: 08/22/2023]
Abstract
Advances in proteogenomic technologies have revealed hundreds to thousands of translated small open reading frames (sORFs) that encode microproteins in genomes across evolutionary space. While many microproteins have now been shown to play critical roles in biology and human disease, a majority of recently identified microproteins have little or no experimental evidence regarding their functionality. Computational tools have some limitations for analysis of short, poorly conserved microprotein sequences, so additional approaches are needed to determine the role of each member of this recently discovered polypeptide class. A currently underexplored avenue in the study of microproteins is structure prediction and determination, which delivers a depth of functional information. In this review, we provide a brief overview of microprotein discovery methods, then examine examples of microprotein structures (and, conversely, intrinsic disorder) that have been experimentally determined using crystallography, cryo-electron microscopy, and NMR, which provide insight into their molecular functions and mechanisms. Additionally, we discuss examples of predicted microprotein structures that have provided insight or context regarding their function. Analysis of microprotein structure at the angstrom level, and confirmation of predicted structures, therefore, has potential to identify translated microproteins that are of biological importance and to provide molecular mechanism for their in vivo roles.
Collapse
Affiliation(s)
- Jessica J. Mohsen
- Department of Chemistry, Yale University, New Haven, CT, USA
- Institute of Biomolecular Design and Discovery, Yale University, West Haven, CT, USA
| | - Alina A. Martel
- Institute of Biomolecular Design and Discovery, Yale University, West Haven, CT, USA
| | - Sarah A. Slavoff
- Department of Chemistry, Yale University, New Haven, CT, USA
- Institute of Biomolecular Design and Discovery, Yale University, West Haven, CT, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| |
Collapse
|
3
|
Chen Y, Cao X, Loh KH, Slavoff SA. Chemical labeling and proteomics for characterization of unannotated small and alternative open reading frame-encoded polypeptides. Biochem Soc Trans 2023; 51:1071-1082. [PMID: 37171061 PMCID: PMC10317152 DOI: 10.1042/bst20221074] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2023] [Revised: 03/27/2023] [Accepted: 04/13/2023] [Indexed: 05/13/2023]
Abstract
Thousands of unannotated small and alternative open reading frames (smORFs and alt-ORFs, respectively) have recently been revealed in mammalian genomes. While hundreds of mammalian smORF- and alt-ORF-encoded proteins (SEPs and alt-proteins, respectively) affect cell proliferation, the overwhelming majority of smORFs and alt-ORFs remain uncharacterized at the molecular level. Complicating the task of identifying the biological roles of smORFs and alt-ORFs, the SEPs and alt-proteins that they encode exhibit limited sequence homology to protein domains of known function. Experimental techniques for the functionalization of these gene classes are therefore required. Approaches combining chemical labeling and quantitative proteomics have greatly advanced our ability to identify and characterize functional SEPs and alt-proteins in high throughput. In this review, we briefly describe the principles of proteomic discovery of SEPs and alt-proteins, then summarize how these technologies interface with chemical labeling for identification of SEPs and alt-proteins with specific properties, as well as in defining the interactome of SEPs and alt-proteins.
Collapse
Affiliation(s)
- Yanran Chen
- Department of Chemistry, Yale University, New Haven, CT, U.S.A
- Institute for Biomolecular Design and Discovery, Yale University, West Haven, CT, U.S.A
| | - Xiongwen Cao
- Department of Chemistry, Yale University, New Haven, CT, U.S.A
- Institute for Biomolecular Design and Discovery, Yale University, West Haven, CT, U.S.A
- Department of Comparative Medicine, Yale University School of Medicine, New Haven, CT, U.S.A
- Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
| | - Ken H. Loh
- Institute for Biomolecular Design and Discovery, Yale University, West Haven, CT, U.S.A
- Department of Comparative Medicine, Yale University School of Medicine, New Haven, CT, U.S.A
| | - Sarah A. Slavoff
- Department of Chemistry, Yale University, New Haven, CT, U.S.A
- Institute for Biomolecular Design and Discovery, Yale University, West Haven, CT, U.S.A
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, U.S.A
| |
Collapse
|
4
|
Jiang S, Shi J, Li Y, Zhang Z, Chang L, Wang G, Wu W, Yu L, Dai E, Zhang L, Lyu Z, Xu P, Zhang Y. Mirror proteases of Ac-Trypsin and Ac-LysargiNase precisely improve novel event identifications in Mycolicibacterium smegmatis MC2 155 by proteogenomic analysis. Front Microbiol 2022; 13:1015140. [PMID: 36312923 PMCID: PMC9597629 DOI: 10.3389/fmicb.2022.1015140] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2022] [Accepted: 09/12/2022] [Indexed: 11/22/2022] Open
Abstract
Accurate identification of novel peptides remains challenging because of the lack of evaluation criteria in large-scale proteogenomic studies. Mirror proteases of trypsin and lysargiNase can generate complementary b/y ion series, providing the opportunity to efficiently assess authentic novel peptides in experiments other than filter potential targets by different false discovery rates (FDRs) ranking. In this study, a pair of in-house developed acetylated mirror proteases, Ac-Trypsin and Ac-LysargiNase, were used in Mycolicibacterium smegmatis MC2 155 for proteogenomic analysis. The mirror proteases accurately identified 368 novel peptides, exhibiting 75–80% b and y ion coverages against 65–68% y or b ion coverages of Ac-Trypsin (38.9% b and 68.3% y) or Ac-LysargiNase (65.5% b and 39.6% y) as annotated peptides from M. smegmatis MC2 155. The complementary b and y ion series largely increased the reliability of overlapped sequences derived from novel peptides. Among these novel peptides, 311 peptides were annotated in other public M. smegmatis strains, and 57 novel peptides with more continuous b and y pairs were obtained for further analysis after spectral quality assessment. This enabled mirror proteases to successfully correct six annotated proteins' N-termini and detect 17 new coding open reading frames (ORFs). We believe that mirror proteases will be an effective strategy for novel peptide detection in both prokaryotic and eukaryotic proteogenomics.
Collapse
Affiliation(s)
- Songhao Jiang
- Key Laboratory of Microbial Diversity Research and Application of Hebei, School of Life Sciences, Hebei University, Baoding, China
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
| | - Jiahui Shi
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
| | - Yanchang Li
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
| | - Zhenpeng Zhang
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
| | - Lei Chang
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
| | - Guibin Wang
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
| | - Wenhui Wu
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
- Guangzhou University of Chinese Medicine, Second Clinical Medicine College, Guangzhou Higher Education Mega Center, Guangzhou, China
| | - Liyan Yu
- Research Unit of Proteomics and Research and Development of New Drug, Institute of Medicinal Biotechnology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Erhei Dai
- The Fifth Hospital of Shijiazhuang, School of Public Health, Shijiazhuang, China
| | - Lixia Zhang
- Key Research Laboratory for Infectious Disease Prevention for State Administration of Traditional Chinese Medicine, Tianjin Institute of Respiratory Diseases, Haihe Hospital, Tianjin University, Tianjin, China
| | - Zhitang Lyu
- Key Laboratory of Microbial Diversity Research and Application of Hebei, School of Life Sciences, Hebei University, Baoding, China
- Zhitang Lyu
| | - Ping Xu
- Key Laboratory of Microbial Diversity Research and Application of Hebei, School of Life Sciences, Hebei University, Baoding, China
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
- Guangzhou University of Chinese Medicine, Second Clinical Medicine College, Guangzhou Higher Education Mega Center, Guangzhou, China
- Research Unit of Proteomics and Research and Development of New Drug, Institute of Medicinal Biotechnology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- *Correspondence: Ping Xu
| | - Yao Zhang
- Beijing Proteome Research Center, National Center for Protein Sciences Beijing, State Key Laboratory of Proteomics, Research Unit of Proteomics and Research and Development of New Drug of Chinese Academy of Medical Sciences, Institute of Lifeomics, Beijing, China
- Yao Zhang
| |
Collapse
|
5
|
Aggarwal S, Raj A, Kumar D, Dash D, Yadav AK. False discovery rate: the Achilles' heel of proteogenomics. Brief Bioinform 2022; 23:6582880. [PMID: 35534181 DOI: 10.1093/bib/bbac163] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Revised: 03/14/2022] [Accepted: 04/12/2022] [Indexed: 12/25/2022] Open
Abstract
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de-facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame nucleotide database translation not only increase the search space and compute-time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, first, we briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
Collapse
Affiliation(s)
- Suruchi Aggarwal
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
| | - Anurag Raj
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India.,Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
| | - Dhirendra Kumar
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India
| | - Debasis Dash
- GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India.,Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
| | - Amit Kumar Yadav
- Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
| |
Collapse
|
6
|
Zhu H, Jiang S, Zhou W, Chi H, Sun J, Shi J, Zhang Z, Chang L, Yu L, Zhang L, Lyu Z, Xu P, Zhang Y. Ac-LysargiNase efficiently helps genome reannotation of Mycolicibacterium smegmatis MC2 155. J Proteomics 2022; 264:104622. [DOI: 10.1016/j.jprot.2022.104622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2022] [Revised: 05/10/2022] [Accepted: 05/16/2022] [Indexed: 10/18/2022]
|
7
|
Cassidy L, Kaulich PT, Maaß S, Bartel J, Becher D, Tholey A. Bottom-up and top-down proteomic approaches for the identification, characterization, and quantification of the low molecular weight proteome with focus on short open reading frame-encoded peptides. Proteomics 2021; 21:e2100008. [PMID: 34145981 DOI: 10.1002/pmic.202100008] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2021] [Revised: 06/09/2021] [Accepted: 06/09/2021] [Indexed: 01/14/2023]
Abstract
The recent discovery of alternative open reading frames creates a need for suitable analytical approaches to verify their translation and to characterize the corresponding gene products at the molecular level. As the analysis of small proteins within a background proteome by means of classical bottom-up proteomics is challenging, method development for the analysis of small open reading frame encoded peptides (SEPs) have become a focal point for research. Here, we highlight bottom-up and top-down proteomics approaches established for the analysis of SEPs in both pro- and eukaryotes. Major steps of analysis, including sample preparation and (small) proteome isolation, separation and mass spectrometry, data interpretation and quality control, quantification, the analysis of post-translational modifications, and exploration of functional aspects of the SEPs by means of proteomics technologies are described. These methods do not exclusively cover the analytics of SEPs but simultaneously include the low molecular weight proteome, and moreover, can also be used for the proteome-wide analysis of proteolytic processing events.
Collapse
Affiliation(s)
- Liam Cassidy
- Systematic Proteome Research & Bioanalytics, Institute for Experimental Medicine, Christian-Albrechts-Universität zu Kiel, Kiel, Germany
| | - Philipp T Kaulich
- Systematic Proteome Research & Bioanalytics, Institute for Experimental Medicine, Christian-Albrechts-Universität zu Kiel, Kiel, Germany
| | - Sandra Maaß
- Department of Microbial Proteomics, Institute of Microbiology, University of Greifswald, Greifswald, Germany
| | - Jürgen Bartel
- Department of Microbial Proteomics, Institute of Microbiology, University of Greifswald, Greifswald, Germany
| | - Dörte Becher
- Department of Microbial Proteomics, Institute of Microbiology, University of Greifswald, Greifswald, Germany
| | - Andreas Tholey
- Systematic Proteome Research & Bioanalytics, Institute for Experimental Medicine, Christian-Albrechts-Universität zu Kiel, Kiel, Germany
| |
Collapse
|
8
|
A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations. BMC Bioinformatics 2021; 22:277. [PMID: 34039272 PMCID: PMC8157683 DOI: 10.1186/s12859-021-04159-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2021] [Accepted: 04/27/2021] [Indexed: 02/06/2023] Open
Abstract
Background Small Proteins have received increasing attention in recent years. They have in particular been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically this results in large numbers of putative novel small proteins fraught with large fractions of false-positive predictions. Results We observe that number and quality of the peptide-spectrum matches (PSMs) that map to a candidate ORF can be highly informative for the purpose of distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that have previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. Half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins. Conclusions The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. The workflow is in particular capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration of transcriptomics data and other sources of genome-level information. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04159-8.
Collapse
|
9
|
Tian F, Shi J, Li Y, Gao H, Chang L, Zhang Y, Gao L, Xu P, Tang S. Proteogenomics Study of Blastobotrys adeninivorans TMCC 70007-A Dominant Yeast in the Fermentation Process of Pu-erh Tea. J Proteome Res 2021; 20:3290-3304. [PMID: 34008989 DOI: 10.1021/acs.jproteome.1c00205] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Blastobotrys adeninivorans plays an essential role in pile-fermenting of Pu-erh tea. Its ability to assimilate various carbon and nitrogen sources makes it available for application in a wide range of industry sectors. The genome of B. adeninivorans TMCC 70007 isolated from pile-fermented Pu-erh tea was sequenced and assembled. Proteomics analysis indicated that 4900 proteins in TMCC 70007 were expressed under various culture conditions. Proteogenomics mapping revealed 48 previously unknown genes and corrected 118 gene models predicted by GeneMark-ES. Ortho-proteogenomics analysis identified 17 previously unidentified genes in B. adeninivorans LS3, the first strain with a sequenced genome among the genus Blastobotrys as well. More importantly, five species specific genes were identified from TMCC 70007, which could serve as a barcode for strain typing and were applicable for fermentation process protection of this industrial species. The datasets generated from tea aqueous extract culture not only increased the proteome coverage and accuracy but also contributed to the identification of proteins related to polyphenols and caffeine, which were considered to change greatly during the microbial fermentation of Pu-erh tea. This study provides a proteome perspective on TMCC 70007, which was considered to be an important strain in the production of Pu-erh tea. The systematic proteogenomics analysis not only made a better annotation on the genome of B. adeninivorans TMCC 70007 as previous proteogenomics study but also provided solution for fermentation process protection on valuable industrial species with species specific genes uniquely identified from proteogenomics study.
Collapse
Affiliation(s)
- Fei Tian
- Key Laboratory of Microbial Diversity in Southwest China, Ministry of Education, and Laboratory for Conservation and Utilization of Bio-resources, Yunnan Institute of Microbiology, School of Life Sciences, Yunnan University, Kunming 650091, China.,State Key Laboratory of Proteomics, Beijing Proteome Research Center, Research Unit of Proteomics & Research and Development of New Drug, Chinese Academy of Medical Sciences, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
| | - Jiahui Shi
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Research Unit of Proteomics & Research and Development of New Drug, Chinese Academy of Medical Sciences, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China.,Hebei Province Key Lab of Research and Application on Microbial Diversity, College of Life Sciences, Hebei University, Baoding 071002, China
| | - Yanchang Li
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Research Unit of Proteomics & Research and Development of New Drug, Chinese Academy of Medical Sciences, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
| | - Huiying Gao
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Research Unit of Proteomics & Research and Development of New Drug, Chinese Academy of Medical Sciences, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
| | - Lei Chang
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Research Unit of Proteomics & Research and Development of New Drug, Chinese Academy of Medical Sciences, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
| | - Yao Zhang
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Research Unit of Proteomics & Research and Development of New Drug, Chinese Academy of Medical Sciences, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China
| | - Linrui Gao
- Yunnan Pu-erh Tea Fermentation Engineering Research Center, Yunnan TAETEA Microbial Technology Co., Ltd., Kunming 650217, China
| | - Ping Xu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Research Unit of Proteomics & Research and Development of New Drug, Chinese Academy of Medical Sciences, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China.,Hebei Province Key Lab of Research and Application on Microbial Diversity, College of Life Sciences, Hebei University, Baoding 071002, China
| | - Shukun Tang
- Key Laboratory of Microbial Diversity in Southwest China, Ministry of Education, and Laboratory for Conservation and Utilization of Bio-resources, Yunnan Institute of Microbiology, School of Life Sciences, Yunnan University, Kunming 650091, China.,Yunnan Pu-erh Tea Fermentation Engineering Research Center, Yunnan TAETEA Microbial Technology Co., Ltd., Kunming 650217, China
| |
Collapse
|
10
|
Bartel J, Varadarajan AR, Sura T, Ahrens CH, Maaß S, Becher D. Optimized Proteomics Workflow for the Detection of Small Proteins. J Proteome Res 2020; 19:4004-4018. [DOI: 10.1021/acs.jproteome.0c00286] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Affiliation(s)
- Jürgen Bartel
- Department of Microbial Proteomics, Institute of Microbiology, University of Greifswald, D-17489 Greifswald, Germany
| | - Adithi R. Varadarajan
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics and SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
| | - Thomas Sura
- Department of Microbial Proteomics, Institute of Microbiology, University of Greifswald, D-17489 Greifswald, Germany
| | - Christian H. Ahrens
- Agroscope, Research Group Molecular Diagnostics, Genomics & Bioinformatics and SIB Swiss Institute of Bioinformatics, CH-8820 Wädenswil, Switzerland
| | - Sandra Maaß
- Department of Microbial Proteomics, Institute of Microbiology, University of Greifswald, D-17489 Greifswald, Germany
| | - Dörte Becher
- Department of Microbial Proteomics, Institute of Microbiology, University of Greifswald, D-17489 Greifswald, Germany
| |
Collapse
|
11
|
Shao Y, Chen C, Shen H, He BZ, Yu D, Jiang S, Zhao S, Gao Z, Zhu Z, Chen X, Fu Y, Chen H, Gao G, Long M, Zhang YE. GenTree, an integrated resource for analyzing the evolution and function of primate-specific coding genes. Genome Res 2019; 29:682-696. [PMID: 30862647 PMCID: PMC6442393 DOI: 10.1101/gr.238733.118] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2018] [Accepted: 01/29/2019] [Indexed: 12/13/2022]
Abstract
The origination of new genes contributes to phenotypic evolution in humans. Two major challenges in the study of new genes are the inference of gene ages and annotation of their protein-coding potential. To tackle these challenges, we created GenTree, an integrated online database that compiles age inferences from three major methods together with functional genomic data for new genes. Genome-wide comparison of the age inference methods revealed that the synteny-based pipeline (SBP) is most suited for recently duplicated genes, whereas the protein-family–based methods are useful for ancient genes. For SBP-dated primate-specific protein-coding genes (PSGs), we performed manual evaluation based on published PSG lists and showed that SBP generated a conservative data set of PSGs by masking less reliable syntenic regions. After assessing the coding potential based on evolutionary constraint and peptide evidence from proteomic data, we curated a list of 254 PSGs with different levels of protein evidence. This list also includes 41 candidate misannotated pseudogenes that encode primate-specific short proteins. Coexpression analysis showed that PSGs are preferentially recruited into organs with rapidly evolving pathways such as spermatogenesis, immune response, mother–fetus interaction, and brain development. For brain development, primate-specific KRAB zinc-finger proteins (KZNFs) are specifically up-regulated in the mid-fetal stage, which may have contributed to the evolution of this critical stage. Altogether, hundreds of PSGs are either recruited to processes under strong selection pressure or to processes supporting an evolving novel organ.
Collapse
Affiliation(s)
- Yi Shao
- Key Laboratory of Zoological Systematics and Evolution and State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Chunyan Chen
- Key Laboratory of Zoological Systematics and Evolution and State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hao Shen
- College of Computers, Hunan University of Technology, Zhuzhou Hunan 412007, China
| | - Bin Z He
- FAS Center for Systems Biology and Howard Hughes Medical Institute, Harvard University, Cambridge, Massachusetts 02138, USA
| | - Daqi Yu
- Key Laboratory of Zoological Systematics and Evolution and State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Shuai Jiang
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Center for Bioinformatics, Peking University, Beijing 100871, China.,Beijing Advanced Innovation Center for Genomics (ICG), Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing 100871, China
| | - Shilei Zhao
- University of Chinese Academy of Sciences, Beijing 100049, China.,CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Zhiqiang Gao
- University of Chinese Academy of Sciences, Beijing 100049, China.,National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
| | - Zhenglin Zhu
- School of Life Sciences, Chongqing University, Chongqing 400044, China
| | - Xi Chen
- Wuhan Institute of Biotechnology, Wuhan 430072, China.,Medical Research Institute, Wuhan University, Wuhan 430072, China
| | - Yan Fu
- University of Chinese Academy of Sciences, Beijing 100049, China.,National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, China
| | - Hua Chen
- University of Chinese Academy of Sciences, Beijing 100049, China.,CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
| | - Ge Gao
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Center for Bioinformatics, Peking University, Beijing 100871, China.,Beijing Advanced Innovation Center for Genomics (ICG), Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing 100871, China
| | - Manyuan Long
- Department of Ecology and Evolution, The University of Chicago, Chicago, Illinois 60637, USA
| | - Yong E Zhang
- Key Laboratory of Zoological Systematics and Evolution and State Key Laboratory of Integrated Management of Pest Insects and Rodents, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China.,University of Chinese Academy of Sciences, Beijing 100049, China.,CAS Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
| |
Collapse
|
12
|
Peritore-Galve FC, Schneider DJ, Yang Y, Thannhauser TW, Smart CD, Stodghill P. Proteome Profile and Genome Refinement of the Tomato-Pathogenic Bacterium Clavibacter michiganensis subsp. michiganensis. Proteomics 2019; 19:e1800224. [PMID: 30648817 DOI: 10.1002/pmic.201800224] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2018] [Revised: 11/29/2018] [Indexed: 11/07/2022]
Affiliation(s)
- F Christopher Peritore-Galve
- Plant Pathology and Plant-Microbe Biology Section, School of Integrative Plant Science, Cornell University, Geneva, NY, 14456, USA
| | - David J Schneider
- Global Institute for Food Security, University of Saskatchewan, Saskatoon, SK, S7N 4J8, Canada
| | - Yong Yang
- United States Department of Agriculture (USDA), Agricultural Research Service, Robert W. Holley Center, Ithaca, NY, 14853, USA
| | - Theodore W Thannhauser
- United States Department of Agriculture (USDA), Agricultural Research Service, Robert W. Holley Center, Ithaca, NY, 14853, USA
| | - Christine D Smart
- Plant Pathology and Plant-Microbe Biology Section, School of Integrative Plant Science, Cornell University, Geneva, NY, 14456, USA
| | - Paul Stodghill
- United States Department of Agriculture (USDA), Agricultural Research Service, Robert W. Holley Center, Ithaca, NY, 14853, USA
| |
Collapse
|
13
|
Low TY, Mohtar MA, Ang MY, Jamal R. Connecting Proteomics to Next‐Generation Sequencing: Proteogenomics and Its Current Applications in Biology. Proteomics 2018; 19:e1800235. [DOI: 10.1002/pmic.201800235] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2018] [Revised: 10/09/2018] [Indexed: 12/17/2022]
Affiliation(s)
- Teck Yew Low
- UKM Medical Molecular Biology Institute (UMBI)Universiti Kebangsaan Malaysia 56000 Kuala Lumpur Malaysia
| | - M. Aiman Mohtar
- UKM Medical Molecular Biology Institute (UMBI)Universiti Kebangsaan Malaysia 56000 Kuala Lumpur Malaysia
| | - Mia Yang Ang
- UKM Medical Molecular Biology Institute (UMBI)Universiti Kebangsaan Malaysia 56000 Kuala Lumpur Malaysia
| | - Rahman Jamal
- UKM Medical Molecular Biology Institute (UMBI)Universiti Kebangsaan Malaysia 56000 Kuala Lumpur Malaysia
| |
Collapse
|
14
|
Chemometrics-Assisted Shotgun Proteomics for Establishment of Potential Peptide Markers of Non-Halal Pork (Sus scrofa) among Halal Beef and Chicken. FOOD ANAL METHOD 2018. [DOI: 10.1007/s12161-018-1327-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
15
|
Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nat Commun 2018; 9:903. [PMID: 29500430 PMCID: PMC5834625 DOI: 10.1038/s41467-018-03311-y] [Citation(s) in RCA: 96] [Impact Index Per Article: 13.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2017] [Accepted: 02/02/2018] [Indexed: 01/23/2023] Open
Abstract
Proteogenomics enable the discovery of novel peptides (from unannotated genomic protein-coding loci) and single amino acid variant peptides (derived from single-nucleotide polymorphisms and mutations). Increasing the reliability of these identifications is crucial to ensure their usefulness for genome annotation and potential application as neoantigens in cancer immunotherapy. We here present integrated proteogenomics analysis workflow (IPAW), which combines peptide discovery, curation, and validation. IPAW includes the SpectrumAI tool for automated inspection of MS/MS spectra, eliminating false identifications of single-residue substitution peptides. We employ IPAW to analyze two proteomics data sets acquired from A431 cells and five normal human tissues using extended (pH range, 3–10) high-resolution isoelectric focusing (HiRIEF) pre-fractionation and TMT-based peptide quantitation. The IPAW results provide evidence for the translation of pseudogenes, lncRNAs, short ORFs, alternative ORFs, N-terminal extensions, and intronic sequences. Moreover, our quantitative analysis indicates that protein production from certain pseudogenes and lncRNAs is tissue specific. Proteogenomics enables the discovery of protein coding regions and disease-relevant mutations but their verification remains challenging. Here, the authors combine peptide discovery, curation and validation in an integrated proteogenomics workflow, robustly identifying unknown coding regions and mutations.
Collapse
|
16
|
Heunis T, Dippenaar A, Warren RM, van Helden PD, van der Merwe RG, Gey van Pittius NC, Pain A, Sampson SL, Tabb DL. Proteogenomic Investigation of Strain Variation in Clinical Mycobacterium tuberculosis Isolates. J Proteome Res 2017; 16:3841-3851. [PMID: 28820946 DOI: 10.1021/acs.jproteome.7b00483] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
Abstract
Mycobacterium tuberculosis consists of a large number of different strains that display unique virulence characteristics. Whole-genome sequencing has revealed substantial genetic diversity among clinical M. tuberculosis isolates, and elucidating the phenotypic variation encoded by this genetic diversity will be of the utmost importance to fully understand M. tuberculosis biology and pathogenicity. In this study, we integrated whole-genome sequencing and mass spectrometry (GeLC-MS/MS) to reveal strain-specific characteristics in the proteomes of two clinical M. tuberculosis Latin American-Mediterranean isolates. Using this approach, we identified 59 peptides containing single amino acid variants, which covered ∼9% of all coding nonsynonymous single nucleotide variants detected by whole-genome sequencing. Furthermore, we identified 29 distinct peptides that mapped to a hypothetical protein not present in the M. tuberculosis H37Rv reference proteome. Here, we provide evidence for the expression of this protein in the clinical M. tuberculosis SAWC3651 isolate. The strain-specific databases enabled confirmation of genomic differences (i.e., large genomic regions of difference and nonsynonymous single nucleotide variants) in these two clinical M. tuberculosis isolates and allowed strain differentiation at the proteome level. Our results contribute to the growing field of clinical microbial proteogenomics and can improve our understanding of phenotypic variation in clinical M. tuberculosis isolates.
Collapse
Affiliation(s)
- Tiaan Heunis
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| | - Anzaan Dippenaar
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| | - Robin M Warren
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| | - Paul D van Helden
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| | - Ruben G van der Merwe
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| | - Nicolaas C Gey van Pittius
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| | - Arnab Pain
- Pathogen Genomics Laboratory, BESE Division, King Abdullah University of Science and Technology , Thuwal 23955, Saudi Arabia
| | - Samantha L Sampson
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| | - David L Tabb
- DST/NRF Centre of Excellence for Biomedical Tuberculosis Research, SAMRC Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University , Cape Town 7505, South Africa
| |
Collapse
|
17
|
Kroll JE, da Silva VL, de Souza SJ, de Souza GA. A tool for integrating genetic and mass spectrometry-based peptide data: Proteogenomics Viewer. Bioessays 2017; 39. [DOI: 10.1002/bies.201700015] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Affiliation(s)
- José Eduardo Kroll
- Institute of Bioinformatics and Biotechnology; Natal − RN Brazil
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
| | - Vandeclécio Lira da Silva
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
| | - Sandro José de Souza
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
| | - Gustavo Antonio de Souza
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
- Department of Immunology and Centre for Immune Regulation, Oslo University Hospital HF Rikshospitalet; University of Oslo; Oslo Norway
| |
Collapse
|
18
|
Li H, Park J, Kim H, Hwang KB, Paek E. Systematic Comparison of False-Discovery-Rate-Controlling Strategies for Proteogenomic Search Using Spike-in Experiments. J Proteome Res 2017; 16:2231-2239. [PMID: 28452485 DOI: 10.1021/acs.jproteome.7b00033] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
Proteogenomic searches are useful for novel peptide identification from tandem mass spectra. Usually, separate and multistage approaches are adopted to accurately control the false discovery rate (FDR) for proteogenomic search. Their performance on novel peptide identification has not been thoroughly evaluated, however, mainly due to the difficulty in confirming the existence of identified novel peptides. We simulated a proteogenomic search using a controlled, spike-in proteomic data set. After confirming that the results of the simulated proteogenomic search were similar to those of a real proteogenomic search using a human cell line data set, we evaluated the performance of six FDR control methods-global, separate, and multistage FDR estimation, respectively, coupled to a target-decoy search and a mixture model-based method-on novel peptide identification. The multistage approach showed the highest accuracy for FDR estimation. However, global and separate FDR estimation with the mixture model-based method showed higher sensitivities than others at the same true FDR. Furthermore, the mixture model-based method performed equally well when applied without or with a reduced set of decoy sequences. Considering different prior probabilities for novel and known protein identification, we recommend using mixture model-based methods with separate FDR estimation for sensitive and reliable identification of novel peptides from proteogenomic searches.
Collapse
Affiliation(s)
- Honglan Li
- School of Computer Science and Engineering, Soongsil University , Seoul 06978, Republic of Korea
| | - Jonghun Park
- Department of Computer Science, Hanyang University , Seoul 04763, Republic of Korea
| | - Hyunwoo Kim
- Scientific Data Research Center, Korea Institute of Science and Technology Information , Daejeon 34141, Republic of Korea
| | - Kyu-Baek Hwang
- School of Computer Science and Engineering, Soongsil University , Seoul 06978, Republic of Korea
| | - Eunok Paek
- Department of Computer Science, Hanyang University , Seoul 04763, Republic of Korea
| |
Collapse
|
19
|
Willems P, Ndah E, Jonckheere V, Stael S, Sticker A, Martens L, Van Breusegem F, Gevaert K, Van Damme P. N-terminal Proteomics Assisted Profiling of the Unexplored Translation Initiation Landscape in Arabidopsis thaliana. Mol Cell Proteomics 2017; 16:1064-1080. [PMID: 28432195 PMCID: PMC5461538 DOI: 10.1074/mcp.m116.066662] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2016] [Revised: 04/11/2017] [Indexed: 01/05/2023] Open
Abstract
Proteogenomics is an emerging research field yet lacking a uniform method of analysis. Proteogenomic studies in which N-terminal proteomics and ribosome profiling are combined, suggest that a high number of protein start sites are currently missing in genome annotations. We constructed a proteogenomic pipeline specific for the analysis of N-terminal proteomics data, with the aim of discovering novel translational start sites outside annotated protein coding regions. In summary, unidentified MS/MS spectra were matched to a specific N-terminal peptide library encompassing protein N termini encoded in the Arabidopsis thaliana genome. After a stringent false discovery rate filtering, 117 protein N termini compliant with N-terminal methionine excision specificity and indicative of translation initiation were found. These include N-terminal protein extensions and translation from transposable elements and pseudogenes. Gene prediction provided supporting protein-coding models for approximately half of the protein N termini. Besides the prediction of functional domains (partially) contained within the newly predicted ORFs, further supporting evidence of translation was found in the recently released Araport11 genome re-annotation of Arabidopsis and computational translations of sequences stored in public repositories. Most interestingly, complementary evidence by ribosome profiling was found for 23 protein N termini. Finally, by analyzing protein N-terminal peptides, an in silico analysis demonstrates the applicability of our N-terminal proteogenomics strategy in revealing protein-coding potential in species with well- and poorly-annotated genomes.
Collapse
Affiliation(s)
- Patrick Willems
- From the ‡VIB/UGent Center for Plant Systems Biology, 9052 Ghent, Belgium.,§Ghent University, Department of Plant Biotechnology and Bioinformatics, 9052 Ghent.,¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium.,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium
| | - Elvis Ndah
- ¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium.,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium.,**Ghent University, Department of Mathematical Modeling, Statistics and Bioinformatics, 9000 Ghent, Belgium
| | - Veronique Jonckheere
- ¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium.,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium
| | - Simon Stael
- From the ‡VIB/UGent Center for Plant Systems Biology, 9052 Ghent, Belgium.,§Ghent University, Department of Plant Biotechnology and Bioinformatics, 9052 Ghent.,¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium.,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium
| | - Adriaan Sticker
- ¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium.,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium.,**Ghent University, Department of Mathematical Modeling, Statistics and Bioinformatics, 9000 Ghent, Belgium
| | - Lennart Martens
- ¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium.,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium.,**Ghent University, Department of Mathematical Modeling, Statistics and Bioinformatics, 9000 Ghent, Belgium
| | - Frank Van Breusegem
- From the ‡VIB/UGent Center for Plant Systems Biology, 9052 Ghent, Belgium.,§Ghent University, Department of Plant Biotechnology and Bioinformatics, 9052 Ghent
| | - Kris Gevaert
- ¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium.,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium
| | - Petra Van Damme
- ¶VIB/UGent Center for Medical Biotechnology, 9000 Ghent, Belgium; .,‖Ghent University, Department of Biochemistry, 9000 Ghent, Belgium
| |
Collapse
|
20
|
Abstract
Omics approaches have become popular in biology as powerful discovery tools, and currently gain in interest for diagnostic applications. Establishing the accurate genome sequence of any organism is easy, but the outcome of its annotation by means of automatic pipelines remains imprecise. Some protein-encoding genes may be missed as soon as they are specific and poorly conserved in a given taxon, while important to explain the specific traits of the organism. Translational starts are also poorly predicted in a relatively important number of cases, thus impacting the protein sequence database used in proteomics, comparative genomics, and systems biology. The use of high-throughput proteomics data to improve genome annotation is an attractive option to obtain a more comprehensive molecular picture of a given organism. Here, protocols for reannotating prokaryote genomes are described based on shotgun proteomics and derivatization of protein N-termini with a positively charged reagent coupled to high-resolution tandem mass spectrometry.
Collapse
|
21
|
Zhang J, Yang MK, Zeng H, Ge F. GAPP: A Proteogenomic Software for Genome Annotation and Global Profiling of Post-translational Modifications in Prokaryotes. Mol Cell Proteomics 2016; 15:3529-3539. [PMID: 27630248 DOI: 10.1074/mcp.m116.060046] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2016] [Indexed: 11/06/2022] Open
Abstract
Although the number of sequenced prokaryotic genomes is growing rapidly, experimentally verified annotation of prokaryotic genome remains patchy and challenging. To facilitate genome annotation efforts for prokaryotes, we developed an open source software called GAPP for genome annotation and global profiling of post-translational modifications (PTMs) in prokaryotes. With a single command, it provides a standard workflow to validate and refine predicted genetic models and discover diverse PTM events. We demonstrated the utility of GAPP using proteomic data from Helicobacter pylori, one of the major human pathogens that is responsible for many gastric diseases. Our results confirmed 84.9% of the existing predicted H. pylori proteins, identified 20 novel protein coding genes, and corrected four existing gene models with regard to translation initiation sites. In particular, GAPP revealed a large repertoire of PTMs using the same proteomic data and provided a rich resource that can be used to examine the functions of reversible modifications in this human pathogen. This software is a powerful tool for genome annotation and global discovery of PTMs and is applicable to any sequenced prokaryotic organism; we expect that it will become an integral part of ongoing genome annotation efforts for prokaryotes. GAPP is freely available at https://sourceforge.net/projects/gappproteogenomic/.
Collapse
Affiliation(s)
- Jia Zhang
- From the ‡Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
| | - Ming-Kun Yang
- From the ‡Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
| | - Honghui Zeng
- §Wuhan Branch, Supercomputing Center, Chinese Academy of Sciences, China
| | - Feng Ge
- From the ‡Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China; .,§Wuhan Branch, Supercomputing Center, Chinese Academy of Sciences, China
| |
Collapse
|
22
|
Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow. Nat Commun 2016; 7:11778. [PMID: 27250503 PMCID: PMC4895710 DOI: 10.1038/ncomms11778] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2015] [Accepted: 04/28/2016] [Indexed: 12/16/2022] Open
Abstract
Complete annotation of the human genome is indispensable for medical research. The GENCODE consortium strives to provide this, augmenting computational and experimental evidence with manual annotation. The rapidly developing field of proteogenomics provides evidence for the translation of genes into proteins and can be used to discover and refine gene models. However, for both the proteomics and annotation groups, there is a lack of guidelines for integrating this data. Here we report a stringent workflow for the interpretation of proteogenomic data that could be used by the annotation community to interpret novel proteogenomic evidence. Based on reprocessing of three large-scale publicly available human data sets, we show that a conservative approach, using stringent filtering is required to generate valid identifications. Evidence has been found supporting 16 novel protein-coding genes being added to GENCODE. Despite this many peptide identifications in pseudogenes cannot be annotated due to the absence of orthogonal supporting evidence. Identifying and annotating functional elements in the human genome remains a challenging but important task. Here the authors propose a priority annotation score to rank identifications and suggest how proteogenomics evidence can be interpreted and what additional information substantiates protein-coding potential for annotation.
Collapse
|
23
|
Proteogenomic Tools and Approaches to Explore Protein Coding Landscapes of Eukaryotic Genomes. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2016; 926:1-10. [DOI: 10.1007/978-3-319-42316-6_1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
|
24
|
Yagoub D, Tay AP, Chen Z, Hamey JJ, Cai C, Chia SZ, Hart-Smith G, Wilkins MR. Proteogenomic Discovery of a Small, Novel Protein in Yeast Reveals a Strategy for the Detection of Unannotated Short Open Reading Frames. J Proteome Res 2015; 14:5038-47. [DOI: 10.1021/acs.jproteome.5b00734] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Daniel Yagoub
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Aidan P. Tay
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Zhiliang Chen
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Joshua J. Hamey
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Curtis Cai
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Samantha Z. Chia
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Gene Hart-Smith
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| | - Marc R. Wilkins
- Systems Biology Initiative,
School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
| |
Collapse
|
25
|
Ezkurdia I, Calvo E, Del Pozo A, Vázquez J, Valencia A, Tress ML. The potential clinical impact of the release of two drafts of the human proteome. Expert Rev Proteomics 2015; 12:579-93. [PMID: 26496066 PMCID: PMC4732427 DOI: 10.1586/14789450.2015.1103186] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
The authors have carried out an investigation of the two "draft maps of the human proteome" published in 2014 in Nature. The findings include an abundance of poor spectra, low-scoring peptide-spectrum matches and incorrectly identified proteins in both these studies, highlighting clear issues with the application of false discovery rates. This noise means that the claims made by the two papers - the identification of high numbers of protein coding genes, the detection of novel coding regions and the draft tissue maps themselves - should be treated with considerable caution. The authors recommend that clinicians and researchers do not use the unfiltered data from these studies. Despite this these studies will inspire further investigation into tissue-based proteomics. As long as this future work has proper quality controls, it could help produce a consensus map of the human proteome and improve our understanding of the processes that underlie health and disease.
Collapse
Affiliation(s)
- Iakes Ezkurdia
- Unidad de Proteómica, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Madrid, Spain
| | - Enrique Calvo
- Unidad de Proteómica, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Madrid, Spain
| | - Angela Del Pozo
- Instituto de Genetica Medica y Molecular, Hospital Universitario La Paz, Madrid, Spain
| | - Jesús Vázquez
- Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Madrid, Spain
| | - Alfonso Valencia
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Michael L. Tress
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| |
Collapse
|