1
|
Ungar RA, Goddard PC, Jensen TD, Degalez F, Smith KS, Jin CA, Bonner DE, Bernstein JA, Wheeler MT, Montgomery SB. Impact of genome build on RNA-seq interpretation and diagnostics. Am J Hum Genet 2024; 111:1282-1300. [PMID: 38834072 PMCID: PMC11267525 DOI: 10.1016/j.ajhg.2024.05.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Revised: 05/04/2024] [Accepted: 05/06/2024] [Indexed: 06/06/2024] Open
Abstract
Transcriptomics is a powerful tool for unraveling the molecular effects of genetic variants and disease diagnosis. Prior studies have demonstrated that choice of genome build impacts variant interpretation and diagnostic yield for genomic analyses. To identify the extent genome build also impacts transcriptomics analyses, we studied the effect of the hg19, hg38, and CHM13 genome builds on expression quantification and outlier detection in 386 rare disease and familial control samples from both the Undiagnosed Diseases Network and Genomics Research to Elucidate the Genetics of Rare Disease Consortium. Across six routinely collected biospecimens, 61% of quantified genes were not influenced by genome build. However, we identified 1,492 genes with build-dependent quantification, 3,377 genes with build-exclusive expression, and 9,077 genes with annotation-specific expression across six routinely collected biospecimens, including 566 clinically relevant and 512 known OMIM genes. Further, we demonstrate that between builds for a given gene, a larger difference in quantification is well correlated with a larger change in expression outlier calling. Combined, we provide a database of genes impacted by build choice and recommend that transcriptomics-guided analyses and diagnoses are cross referenced with these data for robustness.
Collapse
Affiliation(s)
- Rachel A Ungar
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
| | - Pagé C Goddard
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
| | - Tanner D Jensen
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
| | | | - Kevin S Smith
- Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA
| | - Christopher A Jin
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA
| | - Devon E Bonner
- Department of Pediatrics, School of Medicine, Stanford University, Stanford, CA, USA; Stanford Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
| | - Jonathan A Bernstein
- Stanford Center for Undiagnosed Diseases, Stanford University, Stanford, CA, USA
| | - Matthew T Wheeler
- Department of Cardiovascular Medicine, School of Medicine, Stanford University, Stanford, CA, USA
| | - Stephen B Montgomery
- Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA; Department of Pathology, School of Medicine, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
| |
Collapse
|
2
|
Ungar RA, Goddard PC, Jensen TD, Degalez F, Smith KS, Jin CA, Bonner DE, Bernstein JA, Wheeler MT, Montgomery SB. Impact of genome build on RNA-seq interpretation and diagnostics. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.01.11.24301165. [PMID: 38260490 PMCID: PMC10802764 DOI: 10.1101/2024.01.11.24301165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2024]
Abstract
Transcriptomics is a powerful tool for unraveling the molecular effects of genetic variants and disease diagnosis. Prior studies have demonstrated that choice of genome build impacts variant interpretation and diagnostic yield for genomic analyses. To identify the extent genome build also impacts transcriptomics analyses, we studied the effect of the hg19, hg38, and CHM13 genome builds on expression quantification and outlier detection in 386 rare disease and familial control samples from both the Undiagnosed Diseases Network (UDN) and Genomics Research to Elucidate the Genetics of Rare Disease (GREGoR) Consortium. We identified 2,800 genes with build-dependent quantification across six routinely-collected biospecimens, including 1,391 protein-coding genes and 341 known rare disease genes. We further observed multiple genes that only have detectable expression in a subset of genome builds. Finally, we characterized how genome build impacts the detection of outlier transcriptomic events. Combined, we provide a database of genes impacted by build choice, and recommend that transcriptomics-guided analyses and diagnoses are cross-referenced with these data for robustness.
Collapse
Affiliation(s)
- Rachel A. Ungar
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
| | - Pagé C. Goddard
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
| | - Tanner D. Jensen
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
| | | | - Kevin S. Smith
- Department of Pathology, School of Medicine, Stanford University
| | | | | | - Devon E. Bonner
- Department of Pediatrics, School of Medicine, Stanford University
- Stanford Center for Undiagnosed Diseases, Stanford University
| | | | - Matthew T. Wheeler
- Department of Cardiovascular Medicine, School of Medicine, Stanford University
| | - Stephen B. Montgomery
- Department of Genetics, School of Medicine, Stanford University
- Department of Pathology, School of Medicine, Stanford University
- Department of Biomedical Data Science, Stanford University
| |
Collapse
|
3
|
MacPhillamy C, Chen T, Hiendleder S, Williams JL, Alinejad-Rokny H, Low WY. DNA methylation analysis to differentiate reference, breed, and parent-of-origin effects in the bovine pangenome era. Gigascience 2024; 13:giae061. [PMID: 39435573 PMCID: PMC11484048 DOI: 10.1093/gigascience/giae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Revised: 03/19/2024] [Accepted: 07/25/2024] [Indexed: 10/23/2024] Open
Abstract
BACKGROUND Most DNA methylation studies have used a single reference genome with little attention paid to the bias introduced due to the reference chosen. Reference genome artifacts and genetic variation, including single nucleotide polymorphisms (SNPs) and structural variants (SVs), can lead to differences in methylation sites (CpGs) between individuals of the same species. We analyzed whole-genome bisulfite sequencing data from the fetal liver of Angus (Bos taurus taurus), Brahman (Bos taurus indicus), and reciprocally crossed samples. Using reference genomes for each breed from the Bovine Pangenome Consortium, we investigated the influence of reference genome choice on the breed and parent-of-origin effects in methylome analyses. RESULTS Our findings revealed that ∼75% of CpG sites were shared between Angus and Brahman, ∼5% were breed specific, and ∼20% were unresolved. We demonstrated up to ∼2% quantification bias in global methylation when an incorrect reference genome was used. Furthermore, we found that SNPs impacted CpGs 13 times more than other autosomal sites (P < $5 \times {10}^{ - 324}$) and SVs contained 1.18 times (P < $5 \times {10}^{ - 324}$) more CpGs than non-SVs. We found a poor overlap between differentially methylated regions (DMRs) and differentially expressed genes (DEGs) and suggest that DMRs may be impacting enhancers that target these DEGs. DMRs overlapped with imprinted genes, of which 1, DGAT1, which is important for fat metabolism and weight gain, was found in the breed-specific and sire-of-origin comparisons. CONCLUSIONS This work demonstrates the need to consider reference genome effects to explore genetic and epigenetic differences accurately and identify DMRs involved in controlling certain genes.
Collapse
Affiliation(s)
- Callum MacPhillamy
- The Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA 5371, Australia
| | - Tong Chen
- The Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA 5371, Australia
| | - Stefan Hiendleder
- The Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA 5371, Australia
- Robinson Research Institute,, The University of Adelaide, North Adelaide SA 5006, Australia
| | - John L Williams
- The Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA 5371, Australia
- Department of Animal Science, Food and Nutrition, Università Cattolica del Sacro Cuore, 29122 Piacenza, Italy
| | - Hamid Alinejad-Rokny
- BioMedical Machine Learning Lab, The Graduate School of Biomedical Engineering, Univeristy of New South Wales, Sydney, NSW 2052, Australia
| | - Wai Yee Low
- The Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA 5371, Australia
| |
Collapse
|
4
|
Lin MS, Varunjikar MS, Lie KK, Søfteland L, Dellafiora L, Ørnsrud R, Sanden M, Berntssen MHG, Dorne JLCM, Bafna V, Rasinger JD. Multi-tissue proteogenomic analysis for mechanistic toxicology studies in non-model species. ENVIRONMENT INTERNATIONAL 2023; 182:108309. [PMID: 37980879 DOI: 10.1016/j.envint.2023.108309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/25/2023] [Revised: 08/15/2023] [Accepted: 11/04/2023] [Indexed: 11/21/2023]
Abstract
New approach methodologies (NAM), including omics and in vitro approaches, are contributing to the implementation of 3R (reduction, refinement and replacement) strategies in regulatory science and risk assessment. In this study, we present an integrative transcriptomics and proteomics analysis workflow for the validation and revision of complex fish genomes and demonstrate how proteogenomics expression matrices can be used to support multi-level omics data integration in non-model species in vivo and in vitro. Using Atlantic salmon as an example, we constructed proteogenomic databases from publicly available transcriptomic data and in-house generated RNA-Seq and LC-MS/MS data. Our analysis identified ∼80,000 peptides, providing direct evidence of translation for over 40,000 RefSeq structures. The data also highlighted 183 co-located peptide groups that supported a single transcript each, and in each case, either corrected a previous annotation, supported Ensembl annotations not present in RefSeq, or identified novel previously unannotated genes. Proteogenomics data-derived expression matrices revealed distinct profiles for the different tissue types analyzed. Focusing on proteins involved in defense against xenobiotics, we detected distinct expression patterns across different salmon tissues and observed homology in the expression of chemical defense proteins between in vivo and in vitro liver systems. Our study demonstrates the potential of proteogenomic analyses in extending our understanding of complex fish genomes and provides an advanced bioinformatic toolkit to support the further development of NAMs and their application in regulatory science and (eco)toxicological studies of non-model species.
Collapse
Affiliation(s)
- M S Lin
- Bioinformatics and Systems Biology Program, UC San Diego, San Diego, CA, United States.
| | | | - K K Lie
- Institute of Marine Research, Bergen, Norway.
| | - L Søfteland
- Institute of Marine Research, Bergen, Norway.
| | - L Dellafiora
- Department of Food and Drug, University of Parma, Parco Area delle Scienze 27/A, 43124 Parma, Italy.
| | - R Ørnsrud
- Institute of Marine Research, Bergen, Norway.
| | - M Sanden
- Institute of Marine Research, Bergen, Norway.
| | | | - J L C M Dorne
- European Food Safety Authority, Methodological and Scientific Support Unit, Via Carlo Magno 1A, 43121 Parma, Italy.
| | - V Bafna
- Computer Science & Engineering and HDSI, UC San Diego, San Diego, CA, United States.
| | | |
Collapse
|
5
|
Chen JW, Shrestha L, Green G, Leier A, Marquez-Lago TT. The hitchhikers' guide to RNA sequencing and functional analysis. Brief Bioinform 2023; 24:bbac529. [PMID: 36617463 PMCID: PMC9851315 DOI: 10.1093/bib/bbac529] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2022] [Revised: 10/18/2022] [Accepted: 11/07/2022] [Indexed: 01/10/2023] Open
Abstract
DNA and RNA sequencing technologies have revolutionized biology and biomedical sciences, sequencing full genomes and transcriptomes at very high speeds and reasonably low costs. RNA sequencing (RNA-Seq) enables transcript identification and quantification, but once sequencing has concluded researchers can be easily overwhelmed with questions such as how to go from raw data to differential expression (DE), pathway analysis and interpretation. Several pipelines and procedures have been developed to this effect. Even though there is no unique way to perform RNA-Seq analysis, it usually follows these steps: 1) raw reads quality check, 2) alignment of reads to a reference genome, 3) aligned reads' summarization according to an annotation file, 4) DE analysis and 5) gene set analysis and/or functional enrichment analysis. Each step requires researchers to make decisions, and the wide variety of options and resulting large volumes of data often lead to interpretation challenges. There also seems to be insufficient guidance on how best to obtain relevant information and derive actionable knowledge from transcription experiments. In this paper, we explain RNA-Seq steps in detail and outline differences and similarities of different popular options, as well as advantages and disadvantages. We also discuss non-coding RNA analysis, multi-omics, meta-transcriptomics and the use of artificial intelligence methods complementing the arsenal of tools available to researchers. Lastly, we perform a complete analysis from raw reads to DE and functional enrichment analysis, visually illustrating how results are not absolute truths and how algorithmic decisions can greatly impact results and interpretation.
Collapse
Affiliation(s)
- Jiung-Wen Chen
- Department of Biology, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Lisa Shrestha
- Department of Genetics, University of Alabama at Birmingham, School of Medicine, Birmingham, AL, USA
| | - George Green
- Department of Biology, University of Alabama at Birmingham, Birmingham, AL, USA
| | - André Leier
- Department of Genetics, University of Alabama at Birmingham, School of Medicine, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham, School of Medicine, Birmingham, AL, USA
| | - Tatiana T Marquez-Lago
- Department of Genetics, University of Alabama at Birmingham, School of Medicine, Birmingham, AL, USA
- Department of Cell, Developmental and Integrative Biology, University of Alabama at Birmingham, School of Medicine, Birmingham, AL, USA
- Department of Microbiology, University of Alabama at Birmingham, School of Medicine, Birmingham, AL, USA
| |
Collapse
|
6
|
Schon MA, Lutzmayer S, Hofmann F, Nodine MD. Bookend: precise transcript reconstruction with end-guided assembly. Genome Biol 2022; 23:143. [PMID: 35768836 PMCID: PMC9245221 DOI: 10.1186/s13059-022-02700-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2021] [Accepted: 06/05/2022] [Indexed: 12/29/2022] Open
Abstract
We developed Bookend, a package for transcript assembly that incorporates data from different RNA-seq techniques, with a focus on identifying and utilizing RNA 5' and 3' ends. We demonstrate that correct identification of transcript start and end sites is essential for precise full-length transcript assembly. Utilization of end-labeled reads present in full-length single-cell RNA-seq datasets dramatically improves the precision of transcript assembly in single cells. Finally, we show that hybrid assembly across short-read, long-read, and end-capture RNA-seq datasets from Arabidopsis thaliana, as well as meta-assembly of RNA-seq from single mouse embryonic stem cells, can produce reference-quality end-to-end transcript annotations.
Collapse
Affiliation(s)
- Michael A Schon
- Cluster of Plant Developmental Biology, Laboratory of Molecular Biology, Wageningen University & Research, Wageningen, 6708, PB, The Netherlands.
- Gregor Mendel Institute (GMI), Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr. Bohr-Gasse 3, 1030, Vienna, Austria.
| | - Stefan Lutzmayer
- Gregor Mendel Institute (GMI), Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr. Bohr-Gasse 3, 1030, Vienna, Austria
| | - Falko Hofmann
- Gregor Mendel Institute (GMI), Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr. Bohr-Gasse 3, 1030, Vienna, Austria
| | - Michael D Nodine
- Cluster of Plant Developmental Biology, Laboratory of Molecular Biology, Wageningen University & Research, Wageningen, 6708, PB, The Netherlands.
- Gregor Mendel Institute (GMI), Austrian Academy of Sciences, Vienna Biocenter (VBC), Dr. Bohr-Gasse 3, 1030, Vienna, Austria.
| |
Collapse
|
7
|
Chisanga D, Liao Y, Shi W. Impact of gene annotation choice on the quantification of RNA-seq data. BMC Bioinformatics 2022; 23:107. [PMID: 35354358 PMCID: PMC8969366 DOI: 10.1186/s12859-022-04644-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2021] [Accepted: 02/03/2022] [Indexed: 12/13/2022] Open
Abstract
Background RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEQC consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion In conclusion, our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04644-8.
Collapse
Affiliation(s)
- David Chisanga
- Olivia Newton-John Cancer Research Institute, Melbourne, Australia.,School of Cancer Medicine, La Trobe University, Melbourne, Australia.,Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.,Department of Medical Biology, The University of Melbourne, Melbourne, Australia
| | - Yang Liao
- Olivia Newton-John Cancer Research Institute, Melbourne, Australia.,School of Cancer Medicine, La Trobe University, Melbourne, Australia.,Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.,Department of Medical Biology, The University of Melbourne, Melbourne, Australia
| | - Wei Shi
- Olivia Newton-John Cancer Research Institute, Melbourne, Australia. .,School of Cancer Medicine, La Trobe University, Melbourne, Australia. .,Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia. .,School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.
| |
Collapse
|
8
|
Buaban S, Lengnudum K, Boonkum W, Phakdeedindan P. Genome-wide association study on milk production and somatic cell score for Thai dairy cattle using weighted single-step approach with random regression test-day model. J Dairy Sci 2021; 105:468-494. [PMID: 34756438 DOI: 10.3168/jds.2020-19826] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2020] [Accepted: 08/24/2021] [Indexed: 12/26/2022]
Abstract
Genome-wide association studies are a powerful tool to identify genomic regions and variants associated with phenotypes. However, only limited mutual confirmation from different studies is available. The objectives of this study were to identify genomic regions as well as genes and pathways associated with the first-lactation milk, fat, protein, and total solid yields; fat, protein, and total solid percentage; and somatic cell score (SCS) in a Thai dairy cattle population. Effects of SNPs were estimated by a weighted single-step GWAS, which back-solved the genomic breeding values predicted using single-step genomic BLUP (ssGBLUP) fitting a single-trait random regression test-day model. Genomic regions that explained at least 0.5% of the total genetic variance were selected for further analyses of candidate genes. Despite the small number of genotyped animals, genomic predictions led to an improvement in the accuracy over the traditional BLUP. Genomic predictions using weighted ssGBLUP were slightly better than the ssGBLUP. The genomic regions associated with milk production traits contained 210 candidate genes on 19 chromosomes [Bos taurus autosome (BTA) 1 to 7, 9, 11 to 16, 20 to 21, 26 to 27 and 29], whereas 21 candidate genes on 3 chromosomes (BTA 11, 16, and 21) were associated with SCS. Many genomic regions explained a small fraction of the genetic variance, indicating polygenic inheritance of the studied traits. Several candidate genes coincided with previous reports for milk production traits in Holstein cattle, especially a large region of genes on BTA14. We identified 141 and 5 novel genes related to milk production and SCS, respectively. These novel genes were also found to be functionally related to heat tolerance (e.g., SLC45A2, IRAG1, and LOC101902172), longevity (e.g., SYT10 and LOC101903327), and fertility (e.g., PAG1). These findings may be attributed to indirect selection in our population. Identified biological networks including intracellular cell transportation and protein catabolism implicate milk production, whereas the immunological pathways such as lymphocyte activation are closely related to SCS. Further studies are required to validate our findings before exploiting them in genomic selection.
Collapse
Affiliation(s)
- S Buaban
- Bureau of Animal Husbandry and Genetic Improvement, Department of Livestock Development, Pathum Thani 12000, Thailand
| | - K Lengnudum
- Bureau of Biotechnology in Livestock Production, Department of Livestock Development, Pathum Thani 12000, Thailand
| | - W Boonkum
- Department of Animal Science, Faculty of Agriculture, Khon Kaen University, Khon Kaen 40002, Thailand
| | - P Phakdeedindan
- Department of Animal Husbandry, Faculty of Veterinary Science, Chulalongkorn University, Bangkok 10330, Thailand; Genomics and Precision Dentistry Research Unit, Department of Physiology, Faculty of Dentistry, Chulalongkorn University, Bangkok 10330, Thailand.
| |
Collapse
|
9
|
Brunet MA, Lekehal AM, Roucou X. How to Illuminate the Dark Proteome Using the Multi-omic OpenProt Resource. ACTA ACUST UNITED AC 2021; 71:e103. [PMID: 32780568 DOI: 10.1002/cpbi.103] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Abstract
Ten of thousands of open reading frames (ORFs) are hidden within genomes. These alternative ORFs, or small ORFs, have eluded annotations because they are either small or within unsuspected locations. They are found in untranslated regions or overlap a known coding sequence in messenger RNA and anywhere in a "non-coding" RNA. Serendipitous discoveries have highlighted these ORFs' importance in biological functions and pathways. With their discovery came the need for deeper ORF annotation and large-scale mining of public repositories to gather supporting experimental evidence. OpenProt, accessible at https://openprot.org/, is the first proteogenomic resource enforcing a polycistronic model of annotation across an exhaustive transcriptome for 10 species. Moreover, OpenProt reports experimental evidence cumulated across a re-analysis of 114 mass spectrometry and 87 ribosome profiling datasets. The multi-omics OpenProt resource also includes the identification of predicted functional domains and evaluation of conservation for all predicted ORFs. The OpenProt web server provides two query interfaces and one genome browser. The query interfaces allow for exploration of the coding potential of genes or transcripts of interest as well as custom downloads of all information contained in OpenProt. © 2020 The Authors. Basic Protocol 1: Using the Search interface Basic Protocol 2: Using the Downloads interface.
Collapse
Affiliation(s)
- Marie A Brunet
- Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Québec, Canada
| | - Amina M Lekehal
- Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Québec, Canada
| | - Xavier Roucou
- Department of Biochemistry and Functional Genomics, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Québec, Canada
| |
Collapse
|
10
|
Srinivasan KA, Virdee SK, McArthur AG. Strandedness during cDNA synthesis, the stranded parameter in htseq-count and analysis of RNA-Seq data. Brief Funct Genomics 2020; 19:339-342. [PMID: 32415774 DOI: 10.1093/bfgp/elaa010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Revised: 03/20/2020] [Accepted: 03/24/2020] [Indexed: 11/12/2022] Open
Abstract
RNA sequencing (RNA-Seq) is a complicated protocol, both in the laboratory in generation of data and at the computer in analysis of results. Several decisions during RNA-Seq library construction have important implications for analysis, most notably strandedness during complementary DNA library construction. Here, we clarify bioinformatic decisions related to strandedness in both alignment of DNA sequencing reads to reference genomes and subsequent determination of transcript abundance.
Collapse
|
11
|
Brunet MA, Brunelle M, Lucier JF, Delcourt V, Levesque M, Grenier F, Samandi S, Leblanc S, Aguilar JD, Dufour P, Jacques JF, Fournier I, Ouangraoua A, Scott MS, Boisvert FM, Roucou X. OpenProt: a more comprehensive guide to explore eukaryotic coding potential and proteomes. Nucleic Acids Res 2020; 47:D403-D410. [PMID: 30299502 PMCID: PMC6323990 DOI: 10.1093/nar/gky936] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2018] [Accepted: 10/04/2018] [Indexed: 01/06/2023] Open
Abstract
Advances in proteomics and sequencing have highlighted many non-annotated open reading frames (ORFs) in eukaryotic genomes. Genome annotations, cornerstones of today's research, mostly rely on protein prior knowledge and on ab initio prediction algorithms. Such algorithms notably enforce an arbitrary criterion of one coding sequence (CDS) per transcript, leading to a substantial underestimation of the coding potential of eukaryotes. Here, we present OpenProt, the first database fully endorsing a polycistronic model of eukaryotic genomes to date. OpenProt contains all possible ORFs longer than 30 codons across 10 species, and cumulates supporting evidence such as protein conservation, translation and expression. OpenProt annotates all known proteins (RefProts), novel predicted isoforms (Isoforms) and novel predicted proteins from alternative ORFs (AltProts). It incorporates cutting-edge algorithms to evaluate protein orthology and re-interrogate publicly available ribosome profiling and mass spectrometry datasets, supporting the annotation of thousands of predicted ORFs. The constantly growing database currently cumulates evidence from 87 ribosome profiling and 114 mass spectrometry studies from several species, tissues and cell lines. All data is freely available and downloadable from a web platform (www.openprot.org) supporting a genome browser and advanced queries for each species. Thus, OpenProt enables a more comprehensive landscape of eukaryotic genomes’ coding potential.
Collapse
Affiliation(s)
- Marie A Brunet
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Université de Lille, F-59000 Lille, France
| | - Mylène Brunelle
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Université de Lille, F-59000 Lille, France
| | - Jean-François Lucier
- Center for Computational Science, Université de Sherbrooke, Sherbrooke, Québec, Canada.,Biology Department, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | - Vivian Delcourt
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Université de Lille, F-59000 Lille, France.,INSERM U1192, Laboratoire Protéomique, Réponse Inflammatoire & Spectrométrie de Masse (PRISM), Université de Lille, F-59000 Lille, France
| | - Maxime Levesque
- Center for Computational Science, Université de Sherbrooke, Sherbrooke, Québec, Canada.,Biology Department, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | - Frédéric Grenier
- Center for Computational Science, Université de Sherbrooke, Sherbrooke, Québec, Canada.,Biology Department, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | - Sondos Samandi
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Université de Lille, F-59000 Lille, France
| | - Sébastien Leblanc
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | - Jean-David Aguilar
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | - Pascal Dufour
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | - Jean-Francois Jacques
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Université de Lille, F-59000 Lille, France
| | - Isabelle Fournier
- INSERM U1192, Laboratoire Protéomique, Réponse Inflammatoire & Spectrométrie de Masse (PRISM), Université de Lille, F-59000 Lille, France
| | - Aida Ouangraoua
- Informatics Department, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | - Michelle S Scott
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada
| | | | - Xavier Roucou
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Québec, Canada.,PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Université de Lille, F-59000 Lille, France
| |
Collapse
|
12
|
Zhang Z, Zhang S, Li X, Zhao Z, Chen C, Zhang J, Li M, Wei Z, Jiang W, Pan B, Li Y, Liu Y, Cao Y, Zhao W, Gu Y, Yu Y, Meng Q, Qi L. Reference genome and annotation updates lead to contradictory prognostic predictions in gene expression signatures: a case study of resected stage I lung adenocarcinoma. Brief Bioinform 2020; 22:5834482. [PMID: 32383445 DOI: 10.1093/bib/bbaa081] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2020] [Revised: 04/02/2020] [Accepted: 04/18/2020] [Indexed: 12/28/2022] Open
Abstract
RNA-sequencing enables accurate and low-cost transcriptome-wide detection. However, expression estimates vary as reference genomes and gene annotations are updated, confounding existing expression-based prognostic signatures. Herein, prognostic 9-gene pair signature (GPS) was applied to 197 patients with stage I lung adenocarcinoma derived from previous and latest data from The Cancer Genome Atlas (TCGA) processed with different reference genomes and annotations. For 9-GPS, 6.6% of patients exhibited discordant risk classifications between the two TCGA versions. Similar results were observed for other prognostic signatures, including IRGPI, 15-gene and ORACLE. We found that conflicting annotations for gene length and overlap were the major cause of their discordant risk classification. Therefore, we constructed a prognostic 40-GPS based on stable genes across GENCODE v20-v30 and validated it using public data of 471 stage I samples (log-rank P < 0.0010). Risk classification was still stable in RNA-sequencing data processed with the newest GENCODE v32 versus GENCODE v20-v30. Specifically, 40-GPS could predict survival for 30 stage I samples with formalin-fixed paraffin-embedded tissues (log-rank P = 0.0177). In conclusion, this method overcomes the vulnerability of existing prognostic signatures due to reference genome and annotation updates. 40-GPS may offer individualized clinical applications due to its prognostic accuracy and classification stability.
Collapse
|
13
|
Arora S, Pattwell SS, Holland EC, Bolouri H. Variability in estimated gene expression among commonly used RNA-seq pipelines. Sci Rep 2020; 10:2734. [PMID: 32066774 PMCID: PMC7026138 DOI: 10.1038/s41598-020-59516-z] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2019] [Accepted: 01/27/2020] [Indexed: 11/25/2022] Open
Abstract
RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets using numerical methods such as clustering, classification, regression, and differential expression analysis. Such approaches rely on the assumption that mRNA abundance estimates from RNA-seq are reliable estimates of true expression levels. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that nearly 88% of protein-coding genes have similar gene expression profiles across all pipelines. However, for >12% of protein-coding genes, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold when applied to exactly the same samples and the same set of RNA-seq reads. Expression fold changes are similarly affected. Many of the impacted genes are widely studied disease-associated genes. We show that impacted genes exhibit diverse patterns of discordance among pipelines, suggesting that many inter-pipeline differences contribute to overall uncertainty in mRNA abundance estimates. A concerted, community-wide effort will be needed to develop gold-standards for estimating the mRNA abundance of the discordant genes reported here. In the meantime, our list of discordantly evaluated genes provides an important resource for robust marker discovery and target selection.
Collapse
Affiliation(s)
- Sonali Arora
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
| | - Siobhan S Pattwell
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
| | - Eric C Holland
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA.
| | - Hamid Bolouri
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA.
| |
Collapse
|
14
|
Marco-Puche G, Lois S, Benítez J, Trivino JC. RNA-Seq Perspectives to Improve Clinical Diagnosis. Front Genet 2019; 10:1152. [PMID: 31781178 PMCID: PMC6861419 DOI: 10.3389/fgene.2019.01152] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2019] [Accepted: 10/22/2019] [Indexed: 01/22/2023] Open
Abstract
In recent years, high-throughput next-generation sequencing technology has allowed a rapid increase in diagnostic capacity and precision through different bioinformatics processing algorithms, tools, and pipelines. The identification, annotation, and classification of sequence variants within different target regions are now considered a gold standard in clinical genetic diagnosis. However, this procedure lacks the ability to link regulatory events such as differential splicing to diseases. RNA-seq is necessary in clinical routine in order to interpret and detect among others splicing events and splicing variants, as it would increase the diagnostic rate by up to 10-35%. The transcriptome has a very dynamic nature, varying according to tissue type, cellular conditions, and environmental factors that may affect regulatory events such as splicing and the expression of genes or their isoforms. RNA-seq offers a robust technical analysis of this complexity, but it requires a profound knowledge of computational/statistical tools that may need to be adjusted depending on the disease under study. In this article we will cover RNA-seq analyses best practices applied to clinical routine, bioinformatics procedures, and present challenges of this approach.
Collapse
Affiliation(s)
| | - Sergio Lois
- Bioinformatics Group, Sistemas Genómicos, Paterna, Spain
| | - Javier Benítez
- Human Genetics Group, Spanish National Cancer Research Center, Madrid, Spain
| | | |
Collapse
|
15
|
Hatje K, Mühlhausen S, Simm D, Kollmar M. The Protein-Coding Human Genome: Annotating High-Hanging Fruits. Bioessays 2019; 41:e1900066. [PMID: 31544971 DOI: 10.1002/bies.201900066] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2019] [Revised: 08/07/2019] [Indexed: 12/19/2022]
Abstract
The major transcript variants of human protein-coding genes are annotated to a certain degree of accuracy combining manual curation, transcript data, and proteomics evidence. However, there is considerable disagreement on the annotation of about 2000 genes-they can be protein-coding, noncoding, or pseudogenes-and on the annotation of most of the predicted alternative transcripts. Pure transcriptome mapping approaches seem to be limited in discriminating functional expression from noise. These limitations have partially been overcome by dedicated algorithms to detect alternative spliced micro-exons and wobble splice variants. Recently, knowledge about splice mechanism and protein structure are incorporated into an algorithm to predict neighboring homologous exons, often spliced in a mutually exclusive manner. Predicted exons are evaluated by transcript data, structural compatibility, and evolutionary conservation, revealing hundreds of novel coding exons and splice mechanism re-assignments. The emerging human pan-genome is necessitating distinctive annotations incorporating differences between individuals and between populations.
Collapse
Affiliation(s)
- Klas Hatje
- Roche Pharmaceutical Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstr. 124, 4070, Basel, Switzerland
| | - Stefanie Mühlhausen
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany
| | - Dominic Simm
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany.,Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Goldschmidtstr. 7, 37077, Göttingen, Germany
| | - Martin Kollmar
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany
| |
Collapse
|
16
|
Bhuva DD, Foroutan M, Xie Y, Lyu R, Cursons J, Davis MJ. Using singscore to predict mutation status in acute myeloid leukemia from transcriptomic signatures. F1000Res 2019; 8:776. [PMID: 31723419 PMCID: PMC6844140 DOI: 10.12688/f1000research.19236.3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 10/03/2019] [Indexed: 12/23/2022] Open
Abstract
Advances in RNA sequencing (RNA-seq) technologies that measure the transcriptome of biological samples have revolutionised our ability to understand transcriptional regulatory programs that underpin diseases such as cancer. We recently published singscore - a single sample, rank-based gene set scoring method which quantifies how concordant the transcriptional profile of individual samples are relative to specific gene sets of interest. Here we demonstrate the application of singscore to investigate transcriptional profiles associated with specific mutations or genetic lesions in acute myeloid leukemia. Using matched genomic and transcriptomic data available through the TCGA we show that scoring of appropriate signatures can distinguish samples with corresponding mutations, reflecting the ability of these mutations to drive aberrant transcriptional programs involved in leukemogenesis. We believe the singscore method is particularly useful for studying heterogeneity within a specific subsets of cancers, and as demonstrated, we show the ability of singscore to identify where alternative mutations appear to drive similar transcriptional programs.
Collapse
Affiliation(s)
- Dharmesh D Bhuva
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, 3052, Australia.,School of Mathematics and Statistics, University of Melbourne, Parkville, VIC, 3010, Australia
| | - Momeneh Foroutan
- Department of Clinical Pathology, The University of Melbourne Centre for Cancer Research, Victorian Comprehensive Cancer Centre, Parkville, VIC, 3000, Australia
| | - Yi Xie
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, 3052, Australia
| | - Ruqian Lyu
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, 3052, Australia
| | - Joseph Cursons
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, 3052, Australia.,Department of Medical Biology, University of Melbourne, Parkville, VIC, 3010, Australia
| | - Melissa J Davis
- Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, 3052, Australia.,Department of Medical Biology, University of Melbourne, Parkville, VIC, 3010, Australia.,Department of Biochemistry and Molecular Biology, Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne, Parkville, VIC, 3010, Australia
| |
Collapse
|
17
|
DeRycke MS, Larson MC, Nair AA, McDonnell SK, French AJ, Tillmans LS, Riska SM, Baheti S, Fogarty ZC, Larson NB, O’Brien DR, Cheville JC, Wang L, Schaid DJ, Thibodeau SN. An expanded variant list and assembly annotation identifies multiple novel coding and noncoding genes for prostate cancer risk using a normal prostate tissue eQTL data set. PLoS One 2019; 14:e0214588. [PMID: 30958860 PMCID: PMC6453468 DOI: 10.1371/journal.pone.0214588] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2018] [Accepted: 03/17/2019] [Indexed: 01/01/2023] Open
Abstract
Prostate cancer (PrCa) is highly heritable; 284 variants have been identified to date that are associated with increased prostate cancer risk, yet few genes contributing to its development are known. Expression quantitative trait loci (eQTL) studies link variants with affected genes, helping to determine how these variants might regulate gene expression and may influence prostate cancer risk. In the current study, we performed eQTL analysis on 471 normal prostate epithelium samples and 249 PrCa-risk variants in 196 risk loci, utilizing RNA sequencing transcriptome data based on ENSEMBL gene definition and genome-wide variant data. We identified a total of 213 genes associated with known PrCa-risk variants, including 141 protein-coding genes, 16 lncRNAs, and 56 other non-coding RNA species with differential expression. Compared to our previous analysis, where RefSeq was used for gene annotation, we identified an additional 130 expressed genes associated with known PrCa-risk variants. We detected an eQTL signal for more than half (n = 102, 52%) of the 196 loci tested; 52 (51%) of which were a Group 1 signal, indicating high linkage disequilibrium (LD) between the peak eQTL variant and the PrCa-risk variant (r2>0.5) and may help explain how risk variants influence the development of prostate cancer.
Collapse
Affiliation(s)
- Melissa S. DeRycke
- Department of Laboratory Medicine and Pathology, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Melissa C. Larson
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Asha A. Nair
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Shannon K. McDonnell
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Amy J. French
- Department of Laboratory Medicine and Pathology, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Lori S. Tillmans
- Department of Laboratory Medicine and Pathology, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Shaun M. Riska
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Saurabh Baheti
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Zachary C. Fogarty
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Nicholas B. Larson
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Daniel R. O’Brien
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - John C. Cheville
- Department of Laboratory Medicine and Pathology, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Liang Wang
- Department of Pathology, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America
| | - Daniel J. Schaid
- Department of Health Sciences Research, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| | - Stephen N. Thibodeau
- Department of Laboratory Medicine and Pathology, Mayo Clinic College of Medicine, SW, Rochester, Minnesota, United States of America
| |
Collapse
|
18
|
Froussios K, Mourão K, Simpson G, Barton G, Schurch N. Relative Abundance of Transcripts ( RATs): Identifying differential isoform abundance from RNA-seq. F1000Res 2019; 8:213. [PMID: 30906538 PMCID: PMC6426083 DOI: 10.12688/f1000research.17916.1] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 02/08/2019] [Indexed: 12/16/2022] Open
Abstract
The biological importance of changes in RNA expression is reflected by the wide variety of tools available to characterise these changes from RNA-seq data. Several tools exist for detecting differential transcript isoform usage (DTU) from aligned or assembled RNA-seq data, but few exist for DTU detection from alignment-free RNA-seq quantifications. We present the RATs, an R package that identifies DTU transcriptome-wide directly from transcript abundance estimates. RATs is unique in applying bootstrapping to estimate the reliability of detected DTU events and shows good performance at all replication levels (median false positive fraction < 0.05). We compare RATs to two existing DTU tools, DRIM-Seq & SUPPA2, using two publicly available simulated RNA-seq datasets and a published human RNA-seq dataset, in which 248 genes have been previously identified as displaying significant DTU. RATs with default threshold values on the simulated Human data has a sensitivity of 0.55, a Matthews correlation coefficient of 0.71 and a false discovery rate (FDR) of 0.04, outperforming both other tools. Applying the same thresholds for SUPPA2 results in a higher sensitivity (0.61) but poorer FDR performance (0.33). RATs and DRIM-seq use different methods for measuring DTU effect-sizes complicating the comparison of results between these tools, however, for a likelihood-ratio threshold of 30, DRIM-Seq has similar FDR performance to RATs (0.06), but worse sensitivity (0.47). These differences persist for the simulated drosophila dataset. On the published human RNA-seq dataset the greatest agreement between the tools tested is 53%, observed between RATs and SUPPA2. The bootstrapping quality filter in RATs is responsible for removing the majority of DTU events called by SUPPA2 that are not reported by RATs. All methods, including the previously published qRT-PCR of three of the 248 detected DTU events, were found to be sensitive to annotation differences between Ensembl v60 and v87.
Collapse
Affiliation(s)
- Kimon Froussios
- Division of Computational Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
| | - Kira Mourão
- Division of Computational Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
| | - Gordon Simpson
- Centre for Gene Regulation & Expression, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK.,Division of Plant Sciences, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK.,The James Hutton Institute, Invergowrie, Dundee, DD2 4DA, UK
| | - Geoff Barton
- Division of Computational Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
| | - Nicholas Schurch
- Division of Computational Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
| |
Collapse
|
19
|
Abstract
The first and only published version of the rat reference genome sequence was RGSC3.1, accomplished by the Rat Genome Sequencing Project Consortium. Here we present the history of the community effort in the correction of sequence errors and filling missing gaps in the process of refining and providing researchers with a high-quality rat reference sequence. The genome assembly improvements, addition of different evidence resources over time, such as RNA-Seq data, and software development methodologies had a positive impact on the gene model annotations. Over the years we observed a great increase in the numbers of genes, protein coding sequences, predicted transcripts and transcript features. Before the sequencing of the rat genome was possible, first biochemical and next genomic markers like RAPD, AFLP, RFLP, and SSLP were fundamental in research studies involving cross-breeding between different rat strains, in finding the level of polymorphism, linkage mapping, and phylogeny. Linkage maps provide information on recombination rates, give insight into intra- and interspecies gene rearrangements, and help to identify Mendelian loci and Quantitative Trait Loci (QTL). In the 1990s many reports were published on the construction of rat linkage maps that incorporated increasing numbers of markers and facilitated the localization of disease loci. Current genetic monitoring and linkage mapping relies on single nucleotide polymorphisms (SNPs). The Rat Genome Database collects information on genetic variation from the worldwide community of rat researchers and provides tools for searching and retrieving these data. As of today we show details about almost 605 million variants coming from many studies in our Variant Visualizer tool.
Collapse
|
20
|
Abstract
Whole genome sequencing (WGS) can provide comprehensive insights into the genetic makeup of lymphomas. Here we describe a selection of methods for the analysis of WGS data, including alignment, identification of different classes of genomic variants, the identification of driver mutations, and the identification of mutational signatures. We further outline design considerations for WGS studies and provide a variety of quality control measures to detect common quality problems in the data.
Collapse
|
21
|
Chen C, Le H, Goudar CT. Evaluation of two public genome references for chinese hamster ovary cells in the context of rna-seq based gene expression analysis. Biotechnol Bioeng 2017; 114:1603-1613. [PMID: 28295162 DOI: 10.1002/bit.26290] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2016] [Revised: 02/21/2017] [Accepted: 03/10/2017] [Indexed: 11/08/2022]
Abstract
RNA-Seq is a powerful transcriptomics tool for mammalian cell culture process development. Successful RNA-Seq data analysis requires a high quality reference for read mapping and gene expression quantification. Currently, there are two public genome references for Chinese hamster ovary (CHO) cells, the predominant mammalian cell line in the biopharmaceutical industry. In this study, we compared these two references by analyzing 60 RNA-Seq samples from a variety of CHO cell culture conditions. Among the 20,891 common genes in both references, we observed that 31.5% have more than 7.1% quantification differences, implying gene definition differences in the two references. We propose a framework to quantify this difference using two metrics, Consistency and Stringency, which account for the average quantification difference between the two references over all samples, and the sample-specific effect on the quantification result, respectively. These two metrics can be used to identify potential genes for future gene model improvement and to understand the reliability of differentially expressed genes identified by RNA-Seq data analysis. Before a more comprehensive genome reference for CHO cells emerges, the strategy proposed in this study can enable more robust transcriptome analysis from CHO cell RNA-Seq data. Biotechnol. Bioeng. 2017;114: 1603-1613. © 2017 Wiley Periodicals, Inc.
Collapse
Affiliation(s)
- Chun Chen
- Drug Substance Technologies, Process Development, Amgen Inc., 1 Amgen Center Drive, Thousand Oaks, California, 91320
| | - Huong Le
- Drug Substance Technologies, Process Development, Amgen Inc., 1 Amgen Center Drive, Thousand Oaks, California, 91320
| | - Chetan T Goudar
- Drug Substance Technologies, Process Development, Amgen Inc., 1 Amgen Center Drive, Thousand Oaks, California, 91320
| |
Collapse
|
22
|
In Silico identification and annotation of non-coding RNAs by RNA-seq and De Novo assembly of the transcriptome of Tomato Fruits. PLoS One 2017; 12:e0171504. [PMID: 28187155 PMCID: PMC5302821 DOI: 10.1371/journal.pone.0171504] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2016] [Accepted: 01/21/2017] [Indexed: 12/12/2022] Open
Abstract
The complexity of the tomato (Solanum lycopersicum) transcriptome has not yet been fully elucidated. To gain insights into the diversity and features of coding and non-coding RNA molecules of tomato fruits, we generated strand-specific libraries from berries of two tomato cultivars grown in two open-field conditions with different soil type. Following high-throughput Illumina RNA-sequencing (RNA-seq), more than 90% of the reads (over one billion, derived from twelve dataset) were aligned to the tomato reference genome. We report a comprehensive analysis of the transcriptome, improved with 39,095 transcripts, which reveals previously unannotated novel transcripts, natural antisense transcripts, long non-coding RNAs and alternative splicing variants. In addition, we investigated the sequence variants between the cultivars under investigation to highlight their genetic difference. Our strand-specific analysis allowed us to expand the current tomato transcriptome annotation and it is the first to reveal the complexity of the poly-adenylated RNA world in tomato. Moreover, our work demonstrates the usefulness of strand specific RNA-seq approach for the transcriptome-based genome annotation and provides a resource valuable for further functional studies.
Collapse
|
23
|
Sha Y, Phan JH, Wang MD. Effect of low-expression gene filtering on detection of differentially expressed genes in RNA-seq data. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2016; 2015:6461-4. [PMID: 26737772 DOI: 10.1109/embc.2015.7319872] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Abstract
We compare methods for filtering RNA-seq lowexpression genes and investigate the effect of filtering on detection of differentially expressed genes (DEGs). Although RNA-seq technology has improved the dynamic range of gene expression quantification, low-expression genes may be indistinguishable from sampling noise. The presence of noisy, low-expression genes can decrease the sensitivity of detecting DEGs. Thus, identification and filtering of these low-expression genes may improve DEG detection sensitivity. Using the SEQC benchmark dataset, we investigate the effect of different filtering methods on DEG detection sensitivity. Moreover, we investigate the effect of RNA-seq pipelines on optimal filtering thresholds. Results indicate that the filtering threshold that maximizes the total number of DEGs closely corresponds to the threshold that maximizes DEG detection sensitivity. Transcriptome reference annotation, expression quantification method, and DEG detection method are statistically significant RNA-seq pipeline factors that affect the optimal filtering threshold.
Collapse
|
24
|
Bustin SA. Improving the reliability of peer-reviewed publications: We are all in it together. BIOMOLECULAR DETECTION AND QUANTIFICATION 2016; 7:A1-5. [PMID: 27077047 PMCID: PMC4827640 DOI: 10.1016/j.bdq.2015.11.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/04/2015] [Revised: 11/20/2015] [Accepted: 11/26/2015] [Indexed: 12/20/2022]
Abstract
The current, and welcome, focus on standardization of techniques and transparency of reporting in the biomedical, peer-reviewed literature is commendable. However, that focus has been intermittent as well as lacklustre and so failed to tackle the alarming lack of reliability and reproducibly of biomedical research. Authors have access to numerous recommendations, ranging from simple standards dealing with technical issues to those regulating clinical trials, suggesting that improved reporting guidelines are not the solution. The elemental solution is for editors to require meticulous implementation of their journals' instructions for authors and reviewers and stipulate that no paper is published without a transparent, complete and accurate materials and methods section.
Collapse
Affiliation(s)
- Stephen A. Bustin
- Faculty of Medical Science, Postgraduate Medical Institute, Anglia Ruskin University, Chelmsford CM1 1SQ, UK
- The Gene Team Ltd., UK
| |
Collapse
|
25
|
Zhao S, Xi L, Zhang B. Union Exon Based Approach for RNA-Seq Gene Quantification: To Be or Not to Be? PLoS One 2015; 10:e0141910. [PMID: 26559532 PMCID: PMC4641603 DOI: 10.1371/journal.pone.0141910] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2015] [Accepted: 10/14/2015] [Indexed: 11/24/2022] Open
Abstract
In recent years, RNA-seq is emerging as a powerful technology in estimation of gene and/or transcript expression, and RPKM (Reads Per Kilobase per Million reads) is widely used to represent the relative abundance of mRNAs for a gene. In general, the methods for gene quantification can be largely divided into two categories: transcript-based approach and ‘union exon’-based approach. Transcript-based approach is intrinsically more difficult because different isoforms of the gene typically have a high proportion of genomic overlap. On the other hand, ‘union exon’-based approach method is much simpler and thus widely used in RNA-seq gene quantification. Biologically, a gene is expressed in one or more transcript isoforms. Therefore, transcript-based approach is logistically more meaningful than ‘union exon’-based approach. Despite the fact that gene quantification is a fundamental task in most RNA-seq studies, however, it remains unclear whether ‘union exon’-based approach for RNA-seq gene quantification is a good practice or not. In this paper, we carried out a side-by-side comparison of ‘union exon’-based approach and transcript-based method in RNA-seq gene quantification. It was found that the gene expression levels are significantly underestimated by ‘union exon’-based approach, and the average of RPKM from ‘union exons’-based method is less than 50% of the mean expression obtained from transcript-based approach. The difference between the two approaches is primarily affected by the number of transcripts in a gene. We performed differential analysis at both gene and transcript levels, respectively, and found more insights, such as isoform switches, are gained from isoform differential analysis. The accuracy of isoform quantification would improve if the read coverage pattern and exon-exon spanning reads are taken into account and incorporated into EM (Expectation Maximization) algorithm. Our investigation discourages the use of ‘union exons’-based approach in gene quantification despite its simplicity.
Collapse
Affiliation(s)
- Shanrong Zhao
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, Massachusetts, 02139, United States of America
- * E-mail: ;
| | - Li Xi
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, Massachusetts, 02139, United States of America
| | - Baohong Zhang
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, Massachusetts, 02139, United States of America
| |
Collapse
|
26
|
Zhao S, Zhang Y, Gordon W, Quan J, Xi H, Du S, von Schack D, Zhang B. Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics 2015; 16:675. [PMID: 26334759 PMCID: PMC4559181 DOI: 10.1186/s12864-015-1876-7] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2015] [Accepted: 08/24/2015] [Indexed: 12/17/2022] Open
Abstract
Background While RNA-sequencing (RNA-seq) is becoming a powerful technology in transcriptome profiling, one significant shortcoming of the first-generation RNA-seq protocol is that it does not retain the strand specificity of origin for each transcript. Without strand information it is difficult and sometimes impossible to accurately quantify gene expression levels for genes with overlapping genomic loci that are transcribed from opposite strands. It has recently become possible to retain the strand information by modifying the RNA-seq protocol, known as strand-specific or stranded RNA-seq. Here, we evaluated the advantages of stranded RNA-seq in transcriptome profiling of whole blood RNA samples compared with non-stranded RNA-seq, and investigated the influence of gene overlaps on gene expression profiling results based on practical RNA-seq datasets and also from a theoretical perspective. Results Our results demonstrated a substantial impact of stranded RNA-seq on transcriptome profiling and gene expression measurements. As many as 1751 genes in Gencode Release 19 were identified to be differentially expressed when comparing stranded and non-stranded RNA-seq whole blood samples. Antisense and pseudogenes were significantly enriched in differential expression analyses. Because stranded RNA-seq retains strand information of a read, we can resolve read ambiguity in overlapping genes transcribed from opposite strands, which provides a more accurate quantification of gene expression levels compared with traditional non-stranded RNA-seq. In the human genome, it is not uncommon to find genomic loci where both strands encode distinct genes. Among the over 57,800 annotated genes in Gencode release 19, there are an estimated 19 % (about 11,000) of overlapping genes transcribed from the opposite strands. Based on our whole blood mRNA-seq datasets, the fraction of overlapping nucleotide bases on the same and opposite strands were estimated at 2.94 % and 3.1 %, respectively. The corresponding theoretical estimations are 3 % and 3.6 %, well in agreement with our own findings. Conclusions Stranded RNA-seq provides a more accurate estimate of transcript expression compared with non-stranded RNA-seq, and is therefore the recommended RNA-seq approach for future mRNA-seq studies. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1876-7) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Shanrong Zhao
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - Ying Zhang
- Precision Medicine - Bioanalytical, PTx Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - William Gordon
- Precision Medicine - Bioanalytical, PTx Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - Jie Quan
- Computational Sciences Centers of Excellence, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - Hualin Xi
- Computational Sciences Centers of Excellence, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - Sarah Du
- Precision Medicine - Bioanalytical, PTx Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - David von Schack
- Precision Medicine - Bioanalytical, PTx Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - Baohong Zhang
- Clinical Genetics and Bioinformatics, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| |
Collapse
|
27
|
Zheng CL, Ratnakar V, Gil Y, McWeeney SK. Use of semantic workflows to enhance transparency and reproducibility in clinical omics. Genome Med 2015; 7:73. [PMID: 26289940 PMCID: PMC4545705 DOI: 10.1186/s13073-015-0202-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2015] [Accepted: 07/14/2015] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND Recent highly publicized cases of premature patient assignment into clinical trials, resulting from non-reproducible omics analyses, have prompted many to call for a more thorough examination of translational omics and highlighted the critical need for transparency and reproducibility to ensure patient safety. The use of workflow platforms such as Galaxy and Taverna have greatly enhanced the use, transparency and reproducibility of omics analysis pipelines in the research domain and would be an invaluable tool in a clinical setting. However, the use of these workflow platforms requires deep domain expertise that, particularly within the multi-disciplinary fields of translational and clinical omics, may not always be present in a clinical setting. This lack of domain expertise may put patient safety at risk and make these workflow platforms difficult to operationalize in a clinical setting. In contrast, semantic workflows are a different class of workflow platform where resultant workflow runs are transparent, reproducible, and semantically validated. Through semantic enforcement of all datasets, analyses and user-defined rules/constraints, users are guided through each workflow run, enhancing analytical validity and patient safety. METHODS To evaluate the effectiveness of semantic workflows within translational and clinical omics, we have implemented a clinical omics pipeline for annotating DNA sequence variants identified through next generation sequencing using the Workflow Instance Generation and Specialization (WINGS) semantic workflow platform. RESULTS We found that the implementation and execution of our clinical omics pipeline in a semantic workflow helped us to meet the requirements for enhanced transparency, reproducibility and analytical validity recommended for clinical omics. We further found that many features of the WINGS platform were particularly primed to help support the critical needs of clinical omics analyses. CONCLUSIONS This is the first implementation and execution of a clinical omics pipeline using semantic workflows. Evaluation of this implementation provides guidance for their use in both translational and clinical settings.
Collapse
Affiliation(s)
- Christina L Zheng
- Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA.
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA.
| | - Varun Ratnakar
- Information Sciences Institute, University of Southern California, Los Angeles, CA, USA.
| | - Yolanda Gil
- Information Sciences Institute, University of Southern California, Los Angeles, CA, USA.
| | - Shannon K McWeeney
- Division of Bioinformatics and Computational Biology, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA.
- Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA.
- Division of Biostatistics, Department of Public Health and Preventative Medicine, Oregon Health & Science University, Portland, OR, USA.
| |
Collapse
|
28
|
Systematic transcriptome analysis reveals tumor-specific isoforms for ovarian cancer diagnosis and therapy. Proc Natl Acad Sci U S A 2015; 112:E3050-7. [PMID: 26015570 PMCID: PMC4466751 DOI: 10.1073/pnas.1508057112] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
Tumor-specific molecules are needed across diverse areas of oncology for use in early detection, diagnosis, prognosis and therapy. Large and growing public databases of transcriptome sequencing data (RNA-seq) derived from tumors and normal tissues hold the potential of yielding tumor-specific molecules, but because the data are new they have not been fully explored for this purpose. We have developed custom bioinformatic algorithms and used them with 296 high-grade serous ovarian (HGS-OvCa) tumor and 1,839 normal RNA-seq datasets to identify mRNA isoforms with tumor-specific expression. We rank prioritized isoforms by likelihood of being expressed in HGS-OvCa tumors and not in normal tissues and analyzed 671 top-ranked isoforms by high-throughput RT-qPCR. Six of these isoforms were expressed in a majority of the 12 tumors examined but not in 18 normal tissues. An additional 11 were expressed in most tumors and only one normal tissue, which in most cases was fallopian or colon. Of the 671 isoforms, the topmost 5% (n = 33) ranked based on having tumor-specific or highly restricted normal tissue expression by RT-qPCR analysis are enriched for oncogenic, stem cell/cancer stem cell, and early development loci--including ETV4, FOXM1, LSR, CD9, RAB11FIP4, and FGFRL1. Many of the 33 isoforms are predicted to encode proteins with unique amino acid sequences, which would allow them to be specifically targeted for one or more therapeutic strategies--including monoclonal antibodies and T-cell-based vaccines. The systematic process described herein is readily and rapidly applicable to the more than 30 additional tumor types for which sufficient amounts of RNA-seq already exist.
Collapse
|
29
|
Zhao S, Zhang B. A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genomics 2015; 16:97. [PMID: 25765860 PMCID: PMC4339237 DOI: 10.1186/s12864-015-1308-8] [Citation(s) in RCA: 83] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2014] [Accepted: 01/30/2015] [Indexed: 01/09/2023] Open
Abstract
Background RNA-Seq has become increasingly popular in transcriptome profiling. One aspect of transcriptome research is to quantify the expression levels of genomic elements, such as genes, their transcripts and exons. Acquiring a transcriptome expression profile requires genomic elements to be defined in the context of the genome. Multiple human genome annotation databases exist, including RefGene (RefSeq Gene), Ensembl, and the UCSC annotation database. The impact of the choice of an annotation on estimating gene expression remains insufficiently investigated. Results In this paper, we systematically characterized the impact of genome annotation choice on read mapping and transcriptome quantification by analyzing a RNA-Seq dataset generated by the Human Body Map 2.0 Project. The impact of a gene model on mapping of non-junction reads is different from junction reads. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene models was used. By contrast, this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10–15% mapped alternatively. There are 21,958 common genes among RefGene, Ensembl, and UCSC annotations. When we compared the gene quantification results in RefGene and Ensembl annotations, 20% of genes are not expressed, and thus have a zero count in both annotations. Surprisingly, identical gene quantification results were obtained for only 16.3% (about one sixth) of genes. Approximately 28.1% of genes’ expression levels differed by 5% or higher, and of those, the relative expression levels for 9.3% of genes (equivalent to 2038) differed by 50% or greater. The case studies revealed that the gene definition differences in gene models frequently result in inconsistency in gene quantification. Conclusions We demonstrated that the choice of a gene model has a dramatic effect on both gene quantification and differential analysis. Our research will help RNA-Seq data analysts to make an informed choice of gene model in practical RNA-Seq data analysis. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1308-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Shanrong Zhao
- Clinical Genetics and Bioinformatics, BioTherapeutics Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| | - Baohong Zhang
- Clinical Genetics and Bioinformatics, BioTherapeutics Clinical R&D, Pfizer Worldwide Research & Development, Cambridge, MA, 02139, USA.
| |
Collapse
|
30
|
Zheng CL, Kawane S, Bottomly D, Wilmot B. Analysis considerations for utilizing RNA-Seq to characterize the brain transcriptome. INTERNATIONAL REVIEW OF NEUROBIOLOGY 2014; 116:21-54. [PMID: 25172470 DOI: 10.1016/b978-0-12-801105-8.00002-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
RNA-Seq allows one to examine only gene expression as well as expression of noncoding RNAs, alternative splicing, and allele-specific expression. With this increased sensitivity and dynamic range, there are computational and statistical considerations that need to be contemplated, which are highly dependent on the biological question being asked. We highlight these to provide an overview of their importance and the impact they can have on downstream interpretation of the brain transcriptome.
Collapse
Affiliation(s)
- Christina L Zheng
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon, USA; Knight Cancer Institute, Oregon Health, Oregon Health and Science University, Portland, Oregon, USA.
| | - Sunita Kawane
- Clinical & Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA
| | - Daniel Bottomly
- Clinical & Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA
| | - Beth Wilmot
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, Oregon, USA; Clinical & Translational Research Institute, Oregon Health and Science University, Portland, Oregon, USA
| |
Collapse
|
31
|
Harati S, Phan JH, Wang MD. Investigation of factors affecting RNA-seq gene expression calls. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2014; 2014:5232-5. [PMID: 25571173 PMCID: PMC4983432 DOI: 10.1109/embc.2014.6944805] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
RNA-seq enables quantification of the human transcriptome. Estimation of gene expression is a fundamental issue in the analysis of RNA-seq data. However, there is an inherent ambiguity in distinguishing between genes with very low expression and experimental or transcriptional noise. We conducted an exploratory investigation of some factors that may affect gene expression calls. We observed that the distribution of reads that map to exonic, intronic, and intergenic regions are distinct. These distributions may provide useful insights into the behavior of gene expression noise. Moreover, we observed that these distributions are qualitatively similar between two sequence mapping algorithms. Finally, we examined the relationship between gene length and gene expression calls, and observed that they are correlated. This preliminary investigation is important for RNA-seq gene expression analysis because it may lead to more effective algorithms for distinguishing between true gene expression and experimental or transcriptional noise.
Collapse
|