1
|
Khatun J, Yu Y, Wrobel JA, Risk BA, Gunawardena HP, Secrest A, Spitzer WJ, Xie L, Wang L, Chen X, Giddings MC. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics 2013; 14:141. [PMID: 23448259 PMCID: PMC3607840 DOI: 10.1186/1471-2164-14-141] [Received: 08/10/2012] [Accepted: 02/22/2013] [Indexed: 11/28/2022]
Abstract
BACKGROUND: Proteogenomic mapping is an approach that uses mass spectrometry data from proteins to directly map protein-coding genes and could aid in locating translational regions in the human genome. In concert with the ENcyclopedia of DNA Elements (ENCODE) project, we applied proteogenomic mapping to produce proteogenomic tracks for the UCSC Genome Browser, to explore which putative translational regions may be missing from the human genome.
RESULTS: We generated ~1 million high-resolution tandem mass (MS/MS) spectra for Tier 1 ENCODE cell lines K562 and GM12878 and mapped them against the UCSC hg19 human genome, and the GENCODE V7 annotated protein and transcript sets. We then compared the results from the three searches to identify the best-matching peptide for each MS/MS spectrum, thereby increasing the confidence of the putative new protein-coding regions found via the whole genome search. At a 1% false discovery rate, we identified 26,472, 24,406, and 13,128 peptides from the protein, transcript, and whole genome searches, respectively; of these, 481 were found solely via the whole genome search. The proteogenomic mapping data are available on the UCSC Genome Browser at http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeUncBsuProt.
CONCLUSIONS: The whole genome search revealed that ~4% of the uniquely mapping identified peptides were located outside GENCODE V7 annotated exons. The comparison of the results from the disparate searches also identified 15% more spectra than would have been found solely from a protein database search. Therefore, whole genome proteogenomic mapping is a complementary method for genome annotation when performed in conjunction with other searches.
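The comparison step in this abstract (keeping the best-matching peptide per MS/MS spectrum across the protein, transcript, and whole genome searches) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Match` record, the "higher score is better" convention, and the tie-breaking rule (first match seen wins) are all assumptions made for illustration, and false discovery rate filtering is not modelled.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Match(NamedTuple):
    """One peptide-spectrum match (hypothetical record layout)."""
    spectrum_id: str
    peptide: str
    score: float  # assumed: higher means a better match
    search: str   # "protein", "transcript", or "genome"


def best_match_per_spectrum(matches: Iterable[Match]) -> Dict[str, Match]:
    """Keep the single highest-scoring match for each spectrum.

    On a tie, the match encountered first is kept.
    """
    best: Dict[str, Match] = {}
    for m in matches:
        current = best.get(m.spectrum_id)
        if current is None or m.score > current.score:
            best[m.spectrum_id] = m
    return best


def genome_only_peptides(best: Dict[str, Match]) -> Set[str]:
    """Peptides whose winning matches come only from the genome search."""
    from_genome = {m.peptide for m in best.values() if m.search == "genome"}
    elsewhere = {m.peptide for m in best.values() if m.search != "genome"}
    return from_genome - elsewhere
```

Under these assumptions, a spectrum matched by both the protein and genome searches is attributed to whichever search scored higher, so a peptide is reported as genome-only only when no annotated database explains its spectra better.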
Affiliation(s)
- Jainab Khatun
- College of Arts and Sciences, Boise State University, Boise, ID, USA
- Yanbao Yu
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, NC, USA
- John A Wrobel
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, NC, USA
- Brian A Risk
- College of Arts and Sciences, Boise State University, Boise, ID, USA
- Harsha P Gunawardena
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, NC, USA
- Program in Molecular Biology & Biotechnology, UNC School of Medicine, Chapel Hill, NC, USA
- Ashley Secrest
- College of Arts and Sciences, Boise State University, Boise, ID, USA
- Wendy J Spitzer
- College of Arts and Sciences, Boise State University, Boise, ID, USA
- Ling Xie
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, NC, USA
- Li Wang
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, NC, USA
- Xian Chen
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, NC, USA
- Program in Molecular Biology & Biotechnology, UNC School of Medicine, Chapel Hill, NC, USA
- Morgan C Giddings
- College of Arts and Sciences, Boise State University, Boise, ID, USA
- Department of Biochemistry & Biophysics, UNC School of Medicine, Chapel Hill, NC, USA
|
2
|
Sequencing-based expression profiling in zebrafish. Methods Cell Biol 2011. [PMID: 21924174 DOI: 10.1016/b978-0-12-374814-0.00021-5]
Abstract
Gene expression profiling is a powerful technique for studying biological processes, especially tissue/organ-specific ones, at the molecular level. With the rapid development of next-generation sequencing techniques, high-throughput sequencing-based expression profiling techniques have been increasingly adopted in molecular biology studies. In this chapter, we describe a protocol for applying one of the sequencing-based expression profiling techniques, Digital Gene Expression (DGE), to zebrafish research. The protocol provides guidelines for wet-bench experimental procedures as well as for bioinformatics data analyses. We also discuss potential issues and challenges with the use of DGE.
|
3
|
Lee TL, Li Y, Alba D, Vong QP, Wu SM, Baxendale V, Rennert OM, Lau YFC, Chan WY. Developmental staging of male murine embryonic gonad by SAGE analysis. J Genet Genomics 2009; 36:215-27. [PMID: 19376482 DOI: 10.1016/s1673-8527(08)60109-5] [Received: 03/08/2009] [Revised: 03/18/2009] [Accepted: 03/19/2009] [Indexed: 12/31/2022]
Abstract
Despite the identification of key genes such as Sry integral to embryonic gonadal development, the genomic classification and identification of chromosomal activation of this process is still poorly understood. To better understand the genetic regulation of gonadal development, we performed Serial Analysis of Gene Expression (SAGE) to profile the genes and novel transcripts, and an average of 152,000 tags from male embryonic gonads at E10.5 (embryonic day 10.5), E11.5, E12.5, E13.5, E15.5 and E17.5 were analyzed. A total of 275,583 non-singleton tags that do not map to any annotated sequence were identified in the six gonad libraries, and 47,255 tags were mapped to 24,975 annotated sequences, among which 987 sequences were uncharacterized. Utilizing an unsupervised pattern identification technique, we established molecular staging of male gonadal development. Rather than providing a static descriptive analysis, we developed algorithms to cluster the SAGE data and assign SAGE tags to a corresponding chromosomal position; these data are displayed in chromosome graphic format. A prominent increase in global genomic activity from E10.5 to E17.5 was observed. Important chromosomal regions related to the developmental processes were identified and validated based on established mouse models with developmental disorders. These regions may represent markers for early diagnosis for disorders of male gonad development as well as potential treatment targets.
Affiliation(s)
- Tin-Lap Lee
- Section on Developmental Genomics, Laboratory of Clinical Genomics, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892, USA
|
4
|
Abstract
Many serial analysis of gene expression (SAGE) tags can be matched to multiple genes, leading to difficulty in SAGE data interpretation and analysis. As only a subset of genes in the human genome are transcribed in a certain type of tissue/cell, we used microarray expression data from different tissue types to define contexts of gene expression and to annotate SAGE tags collected from the same or similar tissue sources. To predict the original transcript contributing a nonspecific SAGE tag collected from a particular tissue, we ranked the corresponding genes by their expression levels determined by microarray. We developed a tissue-specific SAGE tag annotation database based on microarray data collected from 73 normal human tissues and 18 cancer tissues and cell lines. The database can be queried online at: http://www.basic.northwestern.edu/SAGE/. The accuracy of this database was confirmed by experimental data.
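The ranking idea in this abstract (choosing the most likely source transcript for a nonspecific SAGE tag by the tissue-specific expression level of each candidate gene) can be sketched as follows. This is a minimal illustration, not the database's actual implementation: the function name, the dictionary layouts, and the treatment of genes without a measurement (expression assumed to be 0.0) are assumptions.

```python
from typing import Dict, List, Tuple


def rank_candidate_genes(
    tag: str,
    tissue: str,
    tag_to_genes: Dict[str, List[str]],
    expression: Dict[Tuple[str, str], float],
) -> List[str]:
    """Order the genes matching a SAGE tag by microarray expression in a tissue.

    `tag_to_genes` maps a tag sequence to every gene it matches.
    `expression` maps (gene, tissue) to a microarray expression level.
    Genes without a measurement in the tissue are assumed to have level 0.0.
    """
    candidates = tag_to_genes.get(tag, [])
    return sorted(
        candidates,
        key=lambda gene: expression.get((gene, tissue), 0.0),
        reverse=True,
    )
```

With hypothetical inputs where one tag matches `GENE_A` and `GENE_B`, and only `GENE_B` is strongly expressed in the tissue the tag came from, `GENE_B` is returned first as the more plausible source of the tag.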
Affiliation(s)
- Xijin Ge
- Evanston Northwestern Healthcare Research Institute, Evanston, IL, USA
|
5
|
Abstract
Serial analysis of gene expression (SAGE) is a method used to obtain comprehensive, unbiased and quantitative gene-expression profiles. Its major advantage over arrays is that it does not require a priori knowledge of the genes to be analyzed and reflects absolute mRNA levels. Since the original SAGE protocol was developed in a short-tag (10-bp) format, several modifications have been made to produce longer SAGE tags for more precise gene identification and to decrease the amount of starting material necessary. Several SAGE-like methods have also been developed for the genome-wide analysis of DNA copy-number changes and methylation patterns, chromatin structure and transcription factor targets. In this protocol, we describe the 17-bp longSAGE method for transcriptome profiling optimized for a small amount of starting material. The generation of such libraries can be completed in 7-10 d, whereas sequencing and data analysis require an additional 2-3 wk.
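To make the notion of a 17-bp tag concrete, the following Python sketch extracts one tag per transcript in silico. The abstract does not state how tags are positioned; the sketch assumes the commonly used convention of taking the bases immediately 3' of the last `CATG` site (the NlaIII anchoring-enzyme site) in the transcript. The function name and defaults are hypothetical.

```python
from typing import Optional


def extract_tag(
    transcript: str,
    anchor: str = "CATG",   # assumed anchoring-enzyme site, not from the abstract
    tag_length: int = 17,   # longSAGE tag length given in the abstract
) -> Optional[str]:
    """Return the tag that follows the 3'-most anchor site, or None.

    None is returned when the transcript has no anchor site or when fewer
    than `tag_length` bases remain downstream of it.
    """
    sequence = transcript.upper()
    position = sequence.rfind(anchor)
    if position == -1:
        return None
    start = position + len(anchor)
    tag = sequence[start:start + tag_length]
    return tag if len(tag) == tag_length else None
```

A longer tag narrows the set of transcripts that can produce the same sequence, which is the motivation the abstract gives for moving from the 10-bp format to longer tags.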
Affiliation(s)
- Min Hu
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, 44 Binney Street, D740C, Boston, Massachusetts 02115, USA
|
6
|
Wang SM. Understanding SAGE data. Trends Genet 2006; 23:42-50. [PMID: 17109989 DOI: 10.1016/j.tig.2006.11.001] [Received: 04/12/2006] [Revised: 10/05/2006] [Accepted: 11/01/2006] [Indexed: 02/08/2023]
Abstract
Serial analysis of gene expression (SAGE) is a method for identifying and quantifying transcripts from eukaryotic genomes. Since its invention, SAGE has been widely applied to analyzing gene expression in many biological and medical studies. Vast amounts of SAGE data have been collected and more than a thousand SAGE-related studies have been published since the mid-1990s. The principle of SAGE has been developed to address specific issues such as determination of normal gene structure and identification of abnormal genome structural changes. This review focuses on the general features of SAGE data, including the specificity of SAGE tags with respect to their original transcripts, the quantitative nature of SAGE data for differentially expressed genes, the reproducibility, the comparability of SAGE with microarray and the future potential of SAGE. Understanding these basic features should aid the proper interpretation of SAGE data to address biological and medical questions.
Affiliation(s)
- San Ming Wang
- Center for Functional Genomics, ENH Research Institute, Robert H. Lurie Comprehensive Cancer Center, Northwestern University, 1001 University Place, Evanston, IL 60201, USA.
|
7
|
Kim YC, Jung YC, Xuan Z, Dong H, Zhang MQ, Wang SM. Pan-genome isolation of low abundance transcripts using SAGE tag. FEBS Lett 2006; 580:6721-9. [PMID: 17113583 PMCID: PMC1791009 DOI: 10.1016/j.febslet.2006.11.013] [Received: 09/01/2006] [Revised: 10/31/2006] [Accepted: 11/03/2006] [Indexed: 11/24/2022]
Abstract
The SAGE (serial analysis of gene expression) method is sensitive in detecting lower-abundance transcripts. More than a third of the human SAGE tags identified are novel and represent unknown low-abundance transcripts. Using the GLGI method (generation of longer 3' EST from SAGE tag for gene identification), we converted 1009 low-copy, human X chromosome-specific SAGE tags into 10210 3' ESTs. We identified 3418 unique 3' ESTs, 46% of which are novel and originated from lower-abundance transcripts. However, nearly all 3' ESTs mapped to various regions across the genome rather than to the X chromosome. Detailed analysis indicates that those 3' ESTs were isolated by SAGE tags mis-priming to non-parent transcripts. Replacing SAGE tags with non-transcribed genomic DNA tags resulted in poor amplification, indicating that the sequence similarity between different transcripts contributed to the amplification. Our study shows the prevalence of novel low-abundance transcripts that can be isolated efficiently through SAGE tag mis-priming.
Affiliation(s)
- Yeong Cheol Kim
- Center for Functional Genomics, Division of Medical Genetics, Department of Medicine, ENH Research Institute, Northwestern University, Evanston, IL 60201, USA
|