1
|
Rozowsky J, Gao J, Borsari B, Yang YT, Galeev T, Gürsoy G, Epstein CB, Xiong K, Xu J, Li T, Liu J, Yu K, Berthel A, Chen Z, Navarro F, Sun MS, Wright J, Chang J, Cameron CJF, Shoresh N, Gaskell E, Drenkow J, Adrian J, Aganezov S, Aguet F, Balderrama-Gutierrez G, Banskota S, Corona GB, Chee S, Chhetri SB, Cortez Martins GC, Danyko C, Davis CA, Farid D, Farrell NP, Gabdank I, Gofin Y, Gorkin DU, Gu M, Hecht V, Hitz BC, Issner R, Jiang Y, Kirsche M, Kong X, Lam BR, Li S, Li B, Li X, Lin KZ, Luo R, Mackiewicz M, Meng R, Moore JE, Mudge J, Nelson N, Nusbaum C, Popov I, Pratt HE, Qiu Y, Ramakrishnan S, Raymond J, Salichos L, Scavelli A, Schreiber JM, Sedlazeck FJ, See LH, Sherman RM, Shi X, Shi M, Sloan CA, Strattan JS, Tan Z, Tanaka FY, Vlasova A, Wang J, Werner J, Williams B, Xu M, Yan C, Yu L, Zaleski C, Zhang J, Ardlie K, Cherry JM, Mendenhall EM, Noble WS, Weng Z, Levine ME, Dobin A, Wold B, Mortazavi A, Ren B, Gillis J, Myers RM, Snyder MP, Choudhary J, Milosavljevic A, Schatz MC, Bernstein BE, Guigó R, Gingeras TR, Gerstein M. The EN-TEx resource of multi-tissue personal epigenomes & variant-impact models. Cell 2023; 186:1493-1511.e40. [PMID: 37001506 PMCID: PMC10074325 DOI: 10.1016/j.cell.2023.02.018] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Revised: 10/16/2022] [Accepted: 02/10/2023] [Indexed: 04/03/2023]
Abstract
Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (∼30 tissues × ∼15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.
Collapse
Affiliation(s)
- Joel Rozowsky
- Section on Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jiahao Gao
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Beatrice Borsari
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA; Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
| | - Yucheng T Yang
- Institute of Science and Technology for Brain-Inspired Intelligence; MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence; MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Timur Galeev
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Gamze Gürsoy
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | | | - Kun Xiong
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jinrui Xu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Tianxiao Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jason Liu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Keyang Yu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Ana Berthel
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Zhanlin Chen
- Department of Statistics and Data Science, Yale University, New Haven, CT, USA
| | - Fabio Navarro
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Maxwell S Sun
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | | | - Justin Chang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Christopher J F Cameron
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Noam Shoresh
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Jorg Drenkow
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Jessika Adrian
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Sergey Aganezov
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | | | | | | | | | - Sora Chee
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Surya B Chhetri
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - Gabriel Conte Cortez Martins
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Cassidy Danyko
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Carrie A Davis
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Daniel Farid
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | | | - Idan Gabdank
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Yoel Gofin
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - David U Gorkin
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Mengting Gu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Vivian Hecht
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Benjamin C Hitz
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Robbyn Issner
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Yunzhe Jiang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Melanie Kirsche
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Xiangmeng Kong
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Bonita R Lam
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Shantao Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Bian Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Xiqi Li
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Khine Zin Lin
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, CHN
| | - Mark Mackiewicz
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - Ran Meng
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jill E Moore
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Jonathan Mudge
- European Bioinformatics Institute, Cambridge, Cambridgeshire, GB
| | | | - Chad Nusbaum
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Ioann Popov
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Henry E Pratt
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Yunjiang Qiu
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Srividya Ramakrishnan
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Joe Raymond
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Leonidas Salichos
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA; Department of Biological and Chemical Sciences, New York Institute of Technology, Old Westbury, NY, USA
| | - Alexandra Scavelli
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Jacob M Schreiber
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Fritz J Sedlazeck
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Lei Hoon See
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Rachel M Sherman
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Xu Shi
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Minyi Shi
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Cricket Alicia Sloan
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - J Seth Strattan
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Zhen Tan
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Forrest Y Tanaka
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Anna Vlasova
- Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain; Comparative Genomics Group, Life Science Programme, Barcelona Supercomputing Centre, Barcelona, Spain; Institute of Research in Biomedicine, Barcelona, Spain
| | - Jun Wang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jonathan Werner
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Brian Williams
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Min Xu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Chengfei Yan
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Lu Yu
- Institute of Cancer Research, London, UK
| | - Christopher Zaleski
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Jing Zhang
- Department of Computer Science, University of California, Irvine, Irvine, CA, USA
| | | | - J Michael Cherry
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | | | - William S Noble
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Morgan E Levine
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Pathology, Yale University School of Medicine, New Haven, CT, USA
| | - Alexander Dobin
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Barbara Wold
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Ali Mortazavi
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
| | - Bing Ren
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Jesse Gillis
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; Department of Physiology, University of Toronto, Toronto, ON, Canada
| | - Richard M Myers
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - Michael P Snyder
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | | | | | - Michael C Schatz
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Bradley E Bernstein
- Broad Institute of MIT and Harvard, Cambridge, MA, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA.
| | - Roderic Guigó
- Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain; Universitat Pompeu Fabra, Barcelona, Catalonia, Spain.
| | - Thomas R Gingeras
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Mark Gerstein
- Section on Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA; Department of Statistics and Data Science, Yale University, New Haven, CT, USA; Department of Computer Science, Yale University, New Haven, CT, USA.
| |
Collapse
|
2
|
Pelosi B. Developing a bioinformatics pipeline for comparative protein classification analysis. BMC Genom Data 2022; 23:43. [PMID: 35668373 PMCID: PMC9172112 DOI: 10.1186/s12863-022-01045-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2021] [Accepted: 03/11/2022] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Protein classification is a task of paramount importance in various fields of biology. Despite the great momentum of modern implementation of protein classification, machine learning techniques such as Random Forest and Neural Network could not always be used for several reasons: data collection, unbalanced classification or labelling of the data.As an alternative, I propose the use of a bioinformatics pipeline to search for and classify information from protein databases. Hence, to evaluate the efficiency and accuracy of the pipeline, I focused on the carotenoid biosynthetic genes and developed a filtering approach to retrieve orthologs clusters in two well-studied plants that belong to the Brassicaceae family: Arabidopsis thaliana and Brassica rapa Pekinensis group. The result obtained has been compared with previous studies on carotenoid biosynthetic genes in B. rapa where phylogenetic analysis was conducted. RESULTS The developed bioinformatics pipeline relies on commercial software and multiple databeses including the use of phylogeny, Gene Ontology terms (GOs) and Protein Families (Pfams) at a protein level. Furthermore, the phylogeny is coupled with "population analysis" to evaluate the potential orthologs. All the steps taken together give a final table of potential orthologs. The phylogenetic tree gives a result of 43 putative orthologs conserved in B. rapa Pekinensis group. Different A. thaliana proteins have more than one syntenic ortholog as also shown in a previous finding (Li et al., BMC Genomics 16(1):1-11, 2015). CONCLUSIONS This study demonstrates that, when the biological features of proteins of interest are not specific, I can rely on a computational approach in filtering steps for classification purposes. The comparison of the results obtained here for the carotenoid biosynthetic genes with previous research confirmed the accuracy of the developed pipeline which can therefore be applied for filtering different types of datasets.
Collapse
Affiliation(s)
- Benedetta Pelosi
- Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University, Stockholm, Sweden.
| |
Collapse
|
3
|
Abstract
Mass spectrometry (MS)-based proteomics is currently the most successful approach to measure and compare peptides and proteins in a large variety of biological samples. Modern mass spectrometers, equipped with high-resolution analyzers, provide large amounts of data output. This is the case of shotgun/bottom-up proteomics, which consists in the enzymatic digestion of protein into peptides that are then measured by MS-instruments through a data dependent acquisition (DDA) mode. Dedicated bioinformatic tools and platforms have been developed to face the increasing size and complexity of raw MS data that need to be processed and interpreted for large-scale protein identification and quantification. This chapter illustrates the most popular bioinformatics solution for the analysis of shotgun MS-proteomics data. A general description will be provided on the data preprocessing options and the different search engines available, including practical suggestions on how to optimize the parameters for peptide search, based on hands-on experience.
Collapse
Affiliation(s)
- Avinash Yadav
- Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy
| | - Federica Marini
- Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy
| | - Alessandro Cuomo
- Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy
| | - Tiziana Bonaldi
- Department of Experimental Oncology, European Institute of Oncology (IEO), IRCCS, Milan, Italy.
| |
Collapse
|
4
|
Phosphorylation-Dependent Assembly of a 14-3-3 Mediated Signaling Complex during Red Blood Cell Invasion by Plasmodium falciparum Merozoites. mBio 2020; 11:mBio.01287-20. [PMID: 32817103 PMCID: PMC7439480 DOI: 10.1128/mbio.01287-20] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
Red blood cell (RBC) invasion by Plasmodium merozoites requires multiple steps that are regulated by signaling pathways. Exposure of P. falciparum merozoites to the physiological signal of low K+, as found in blood plasma, leads to a rise in cytosolic Ca2+, which mediates microneme secretion, motility, and invasion. We have used global phosphoproteomic analysis of merozoites to identify signaling pathways that are activated during invasion. Using quantitative phosphoproteomics, we found 394 protein phosphorylation site changes in merozoites subjected to different ionic environments (high K+/low K+), 143 of which were Ca2+ dependent. These included a number of signaling proteins such as catalytic and regulatory subunits of protein kinase A (PfPKAc and PfPKAr) and calcium-dependent protein kinase 1 (PfCDPK1). Proteins of the 14-3-3 family interact with phosphorylated target proteins to assemble signaling complexes. Here, using coimmunoprecipitation and gel filtration chromatography, we demonstrate that Pf14-3-3I binds phosphorylated PfPKAr and PfCDPK1 to mediate the assembly of a multiprotein complex in P. falciparum merozoites. A phospho-peptide, P1, based on the Ca2+-dependent phosphosites of PKAr, binds Pf14-3-3I and disrupts assembly of the Pf14-3-3I-mediated multiprotein complex. Disruption of the multiprotein complex with P1 inhibits microneme secretion and RBC invasion. This study thus identifies a novel signaling complex that plays a key role in merozoite invasion of RBCs. Disruption of this signaling complex could serve as a novel approach to inhibit blood-stage growth of malaria parasites.IMPORTANCE Invasion of red blood cells (RBCs) by Plasmodium falciparum merozoites is a complex process that is regulated by intricate signaling pathways. Here, we used phosphoproteomic profiling to identify the key proteins involved in signaling events during invasion. We found changes in the phosphorylation of various merozoite proteins, including multiple kinases previously implicated in the process of invasion. We also found that a phosphorylation-dependent multiprotein complex including signaling kinases assembles during the process of invasion. Disruption of this multiprotein complex impairs merozoite invasion of RBCs, providing a novel approach for the development of inhibitors to block the growth of blood-stage malaria parasites.
Collapse
|
5
|
Khan YA, Jungreis I, Wright JC, Mudge JM, Choudhary JS, Firth AE, Kellis M. Evidence for a novel overlapping coding sequence in POLG initiated at a CUG start codon. BMC Genet 2020; 21:25. [PMID: 32138667 PMCID: PMC7059407 DOI: 10.1186/s12863-020-0828-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2019] [Accepted: 02/19/2020] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND POLG, located on nuclear chromosome 15, encodes the DNA polymerase γ(Pol γ). Pol γ is responsible for the replication and repair of mitochondrial DNA (mtDNA). Pol γ is the only DNA polymerase found in mitochondria for most animal cells. Mutations in POLG are the most common single-gene cause of diseases of mitochondria and have been mapped over the coding region of the POLG ORF. RESULTS Using PhyloCSF to survey alternative reading frames, we found a conserved coding signature in an alternative frame in exons 2 and 3 of POLG, herein referred to as ORF-Y that arose de novo in placental mammals. Using the synplot2 program, synonymous site conservation was found among mammals in the region of the POLG ORF that is overlapped by ORF-Y. Ribosome profiling data revealed that ORF-Y is translated and that initiation likely occurs at a CUG codon. Inspection of an alignment of mammalian sequences containing ORF-Y revealed that the CUG codon has a strong initiation context and that a well-conserved predicted RNA stem-loop begins 14 nucleotides downstream. Such features are associated with enhanced initiation at near-cognate non-AUG codons. Reanalysis of the Kim et al. (2014) draft human proteome dataset yielded two unique peptides that map unambiguously to ORF-Y. An additional conserved uORF, herein referred to as ORF-Z, was also found in exon 2 of POLG. Lastly, we surveyed Clinvar variants that are synonymous with respect to the POLG ORF and found that most of these variants cause amino acid changes in ORF-Y or ORF-Z. CONCLUSIONS We provide evidence for a novel coding sequence, ORF-Y, that overlaps the POLG ORF. Ribosome profiling and mass spectrometry data show that ORF-Y is expressed. PhyloCSF and synplot2 analysis show that ORF-Y is subject to strong purifying selection. An abundance of disease-correlated mutations that map to exons 2 and 3 of POLG but also affect ORF-Y provides potential clinical significance to this finding.
Collapse
Affiliation(s)
- Yousuf A Khan
- Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA, 94305, USA.
- Division of Virology, Department of Pathology, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QP, UK.
| | - Irwin Jungreis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
| | - James C Wright
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 123 Old Brompton Road, London, SW7 3RP, UK
| | - Jonathan M Mudge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | | | - Andrew E Firth
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| |
Collapse
|
6
|
Mudge JM, Jungreis I, Hunt T, Gonzalez JM, Wright JC, Kay M, Davidson C, Fitzgerald S, Seal R, Tweedie S, He L, Waterhouse RM, Li Y, Bruford E, Choudhary JS, Frankish A, Kellis M. Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci. Genome Res 2019; 29:2073-2087. [PMID: 31537640 PMCID: PMC6886504 DOI: 10.1101/gr.246462.118] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2018] [Accepted: 09/09/2019] [Indexed: 12/15/2022]
Abstract
The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.
Collapse
Affiliation(s)
- Jonathan M Mudge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Irwin Jungreis
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.,Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Toby Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Jose Manuel Gonzalez
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - James C Wright
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, London SW7 3RP, United Kingdom
| | - Mike Kay
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Claire Davidson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Stephen Fitzgerald
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Ruth Seal
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.,Department of Haematology, University of Cambridge, Cambridge CB2 0PT, United Kingdom
| | - Susan Tweedie
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Liang He
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.,Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Robert M Waterhouse
- Department of Ecology and Evolution, University of Lausanne, Lausanne 1015, Switzerland.,Swiss Institute of Bioinformatics, Lausanne 1015, Switzerland
| | - Yue Li
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.,Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Elspeth Bruford
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.,Department of Haematology, University of Cambridge, Cambridge CB2 0PT, United Kingdom
| | - Jyoti S Choudhary
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, London SW7 3RP, United Kingdom
| | - Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Manolis Kellis
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA.,Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| |
Collapse
|
7
|
Frankish A, Diekhans M, Ferreira AM, Johnson R, Jungreis I, Loveland J, Mudge JM, Sisu C, Wright J, Armstrong J, Barnes I, Berry A, Bignell A, Carbonell Sala S, Chrast J, Cunningham F, Di Domenico T, Donaldson S, Fiddes IT, García Girón C, Gonzalez JM, Grego T, Hardy M, Hourlier T, Hunt T, Izuogu OG, Lagarde J, Martin FJ, Martínez L, Mohanan S, Muir P, Navarro FC, Parker A, Pei B, Pozo F, Ruffier M, Schmitt BM, Stapleton E, Suner MM, Sycheva I, Uszczynska-Ratajczak B, Xu J, Yates A, Zerbino D, Zhang Y, Aken B, Choudhary JS, Gerstein M, Guigó R, Hubbard TJ, Kellis M, Paten B, Reymond A, Tress ML, Flicek P. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res 2019; 47:D766-D773. [PMID: 30357393 PMCID: PMC6323946 DOI: 10.1093/nar/gky955] [Citation(s) in RCA: 1859] [Impact Index Per Article: 371.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2018] [Revised: 09/20/2018] [Accepted: 10/08/2018] [Indexed: 02/06/2023] Open
Abstract
The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.
Collapse
Affiliation(s)
- Adam Frankish
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Mark Diekhans
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
| | - Anne-Maud Ferreira
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Rory Johnson
- Department of Medical Oncology, Inselspital, University Hospital, University of Bern, Bern, Switzerland
- Department of Biomedical Research (DBMR), University of Bern, Bern, Switzerland
| | - Irwin Jungreis
- MIT Computer Science and Artificial Intelligence Laboratory, 32 Vasser St, Cambridge, MA 02139, USA
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
| | - Jane Loveland
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jonathan M Mudge
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Cristina Sisu
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Department of Bioscience, Brunel University London, Uxbridge UB8 3PH, UK
| | - James Wright
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 123 Old Brompton Road, London SW7 3RP, UK
| | - Joel Armstrong
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
| | - If Barnes
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Andrew Berry
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Alexandra Bignell
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Silvia Carbonell Sala
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
| | - Jacqueline Chrast
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Fiona Cunningham
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Tomás Di Domenico
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Sarah Donaldson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Ian T Fiddes
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
| | - Carlos García Girón
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jose Manuel Gonzalez
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Tiago Grego
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Matthew Hardy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Toby Hunt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Osagie G Izuogu
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Julien Lagarde
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Laura Martínez
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Shamika Mohanan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Paul Muir
- Department of Molecular, Cellular & Developmental Biology, Yale University, New Haven, CT 06520, USA
- Systems Biology Institute, Yale University, West Haven, CT 06516, USA
| | - Fabio C P Navarro
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Anne Parker
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Baikang Pei
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Fernando Pozo
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Magali Ruffier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Bianca M Schmitt
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Eloise Stapleton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Marie-Marthe Suner
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Irina Sycheva
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | | | - Jinuri Xu
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
| | - Andrew Yates
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Daniel Zerbino
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Yan Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
| | - Bronwen Aken
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jyoti S Choudhary
- Functional Proteomics, Division of Cancer Biology, Institute of Cancer Research, 123 Old Brompton Road, London SW7 3RP, UK
| | - Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Program in Computational Biology & Bioinformatics, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Computer Science, Yale University, Bass 432, 266 Whitney Avenue, New Haven, CT 06520, USA
| | - Roderic Guigó
- Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, Barcelona, E-08003 Catalonia, Spain
- Universitat Pompeu Fabra (UPF), Barcelona, E-08003 Catalonia, Spain
| | - Tim J P Hubbard
- Department of Medical and Molecular Genetics, King's College London, Guys Hospital, Great Maze Pond, London SE1 9RT, UK
| | - Manolis Kellis
- MIT Computer Science and Artificial Intelligence Laboratory, 32 Vasser St, Cambridge, MA 02139, USA
- Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
| | - Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland
| | - Michael L Tress
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| |
Collapse
|
8
|
Schlaffner CN, Pirklbauer GJ, Bender A, Steen JAJ, Choudhary JS. A Fast and Quantitative Method for Post-translational Modification and Variant Enabled Mapping of Peptides to Genomes. J Vis Exp 2018. [PMID: 29889196 PMCID: PMC6101353 DOI: 10.3791/57633] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023] Open
Abstract
Cross-talk between genes, transcripts, and proteins is the key to cellular responses; hence, analysis of molecular levels as distinct entities is slowly being extended to integrative studies to enhance the understanding of molecular dynamics within cells. Current tools for the visualization and integration of proteomics with other omics datasets are inadequate for large-scale studies. Furthermore, they only capture basic sequence identify, discarding post-translational modifications and quantitation. To address these issues, we developed PoGo to map peptides with associated post-translational modifications and quantification to reference genome annotation. In addition, the tool was developed to enable the mapping of peptides identified from customized sequence databases incorporating single amino acid variants. While PoGo is a command line tool, the graphical interface PoGoGUI enables non-bioinformatics researchers to easily map peptides to 25 species supported by Ensembl genome annotation. The generated output borrows file formats from the genomics field and, therefore, visualization is supported in most genome browsers. For large-scale studies, PoGo is supported by TrackHubGenerator to create web-accessible repositories of data mapped to genomes that also enable an easy sharing of proteogenomics data. With little effort, this tool can map millions of peptides to reference genomes within only a few minutes, outperforming other available sequence-identity based tools. This protocol demonstrates the best approaches for proteogenomics mapping through PoGo with publicly available datasets of quantitative and phosphoproteomics, as well as large-scale studies.
Collapse
Affiliation(s)
- Christoph N Schlaffner
- Department of Neurobiology, F. M. Kirby Neurobiology Center, Boston Children's Hospital, Harvard Medical School; Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus; Centre for Molecular Informatics, Department of Chemistry, University of Cambridge;
| | - Georg J Pirklbauer
- Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge
| | - Judith A J Steen
- Department of Neurobiology, F. M. Kirby Neurobiology Center, Boston Children's Hospital, Harvard Medical School
| | - Jyoti S Choudhary
- Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus; Functional Proteomics Group, Chester Beatty Laboratories, Institute of Cancer Research
| |
Collapse
|
9
|
Hwang H, Park GW, Park JY, Lee HK, Lee JY, Jeong JE, Park SKR, Yates JR, Kwon KH, Park YM, Lee HJ, Paik YK, Kim JY, Yoo JS. Next Generation Proteomic Pipeline for Chromosome-Based Proteomic Research Using NeXtProt and GENCODE Databases. J Proteome Res 2017; 16:4425-4434. [PMID: 28965411 DOI: 10.1021/acs.jproteome.7b00223] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Human Proteome Project aims to map all human proteins including missing proteins as well as proteoforms with post translational modifications, alternative splicing variants (ASVs), and single amino acid variants (SAAVs). neXtProt and Ensemble databases are usually used to provide curated information on human coding genes. However, to find these proteoforms, we (Chr #11 team) first introduce a streamlined pipeline using customized and concatenated neXtProt and GENCODE originated from Ensemble, with controlled false discovery rate (FDR). Because of large sized databases used in this pipeline, we found more stringent FDR filtering (0.1% at the peptide level and 1% at the protein level) to claim novel findings, such as GENCODE ASVs and missing proteins, from human hippocampus data set (MSV000081385) and ProteomeXchange (PXD007166). Using our next generation proteomic pipeline (nextPP) with neXtProt and GENCODE databases, two missing proteins such as activity-regulated cytoskeleton-associated protein (ARC, Chr 8) and glutamate receptor ionotropic, kainite 5 (GRIK5, Chr 19) were additionally identified with two or more unique peptides from human brain tissues. Additionally, by applying the pipeline to human brain related data sets such as cortex (PXD000067 and PXD000561), spinal cord, and fetal brain (PXD000561), seven GENCODE ASVs such as ACTN4-012 (Chr.19), DPYSL2-005 (Chr.8), MPRIP-003 (Chr.17), NCAM1-013 (Chr.11), EPB41L1-017 (Chr.20), AGAP1-004 (Chr.2), and CPNE5-005 (Chr.6) were identified from two or more data sets. The identified peptides of GENCODE ASVs were mapped onto novel exon insertions, alternative translations at 5'-untranslated region, or novel protein coding sequence. Applying the pipeline to male reproductive organ related data sets, 52 GENCODE ASVs were identified from two testis (PXD000561 and PXD002179) and a spermatozoa (PXD003947) data sets. Four out of 52 GENCODE ASVs such as RAB11FIP5-008 (Chr. 2), RP13-347D8.7-001 (Chr. X), PRDX4-002 (Chr. X), and RP11-666A8.13-001 (Chr. 17) were identified in all of the three samples.
Collapse
Affiliation(s)
- Heeyoun Hwang
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea
| | - Gun Wook Park
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea
| | - Ji Yeong Park
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea.,Graduate School of Analytical Science and Technology, Chungnam National University , Daejeon, Republic of Korea
| | - Hyun Kyoung Lee
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea.,Graduate School of Analytical Science and Technology, Chungnam National University , Daejeon, Republic of Korea
| | - Ju Yeon Lee
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea
| | - Ji Eun Jeong
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea.,Graduate School of Analytical Science and Technology, Chungnam National University , Daejeon, Republic of Korea
| | - Sung-Kyu Robin Park
- Department of Chemical Physiology, The Scripps Research Institute , La Jolla, California 92037, United States
| | - John R Yates
- Department of Chemical Physiology, The Scripps Research Institute , La Jolla, California 92037, United States
| | - Kyung-Hoon Kwon
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea
| | - Young Mok Park
- Center for Cognition and Sociality, Institute for Basic Science , Daejeon, Republic of Korea
| | - Hyoung-Joo Lee
- Yonsei Proteome Research Center and Department of Integrated OMICS for Biomedical Science, and Department of Biochemistry, College of Life Science and Biotechnology, Yonsei University , Seoul, Republic of Korea
| | - Young-Ki Paik
- Yonsei Proteome Research Center and Department of Integrated OMICS for Biomedical Science, and Department of Biochemistry, College of Life Science and Biotechnology, Yonsei University , Seoul, Republic of Korea
| | - Jin Young Kim
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea
| | - Jong Shin Yoo
- Biomedical Omics Group, Korea Basic Science Institute , Cheongju 28119, Republic of Korea.,Graduate School of Analytical Science and Technology, Chungnam National University , Daejeon, Republic of Korea
| |
Collapse
|
10
|
Kroll JE, da Silva VL, de Souza SJ, de Souza GA. A tool for integrating genetic and mass spectrometry-based peptide data: Proteogenomics Viewer. Bioessays 2017; 39. [DOI: 10.1002/bies.201700015] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Affiliation(s)
- José Eduardo Kroll
- Institute of Bioinformatics and Biotechnology; Natal − RN Brazil
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
| | - Vandeclécio Lira da Silva
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
| | - Sandro José de Souza
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
| | - Gustavo Antonio de Souza
- Brain Institute; Universidade Federal do Rio Grande do Norte; Natal − RN Brazil
- Bioinformatics Multidisciplinary Environment; Instituto Metrópole Digital; UFRN, Natal-RN Brazil
- Department of Immunology and Centre for Immune Regulation, Oslo University Hospital HF Rikshospitalet; University of Oslo; Oslo Norway
| |
Collapse
|