1
Williams EC, Chazarra-Gil R, Shahsavari A, Mohorianu I. The Sum of Two Halves May Be Different from the Whole-Effects of Splitting Sequencing Samples Across Lanes. Genes (Basel) 2022; 13:genes13122265. [PMID: 36553532; PMCID: PMC9777937; DOI: 10.3390/genes13122265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 11/23/2022] [Accepted: 11/25/2022] [Indexed: 12/03/2022]
Abstract
The advances in high-throughput sequencing (HTS) have enabled the characterisation of biological processes at an unprecedented level of detail; most hypotheses in molecular biology rely on analyses of HTS data. However, achieving increased robustness and reproducibility of results remains a main challenge. Although variability in results may be introduced at various stages, e.g., alignment, summarisation or detection of differential expression, one source of variability was systematically omitted: the sequencing design, which propagates through analyses and may introduce an additional layer of technical variation. We illustrate qualitative and quantitative differences arising from splitting samples across lanes on bulk and single-cell sequencing. For bulk mRNAseq data, we focus on differential expression and enrichment analyses; for bulk ChIPseq data, we investigate the effect on peak calling and the peaks' properties. At the single-cell level, we concentrate on identifying cell subpopulations. We rely on markers used for assigning cell identities; both smartSeq and 10× data are presented. The observed reduction in the number of unique sequenced fragments limits the level of detail on which the different prediction approaches depend. Furthermore, the sequencing stochasticity adds in a weighting bias corroborated with variable sequencing depths and (yet unexplained) sequencing bias. Subsequently, we observe an overall reduction in sequencing complexity and a distortion in the biological signal across technologies, experimental contexts, organisms and tissues.
Affiliation(s)
- Eleanor C. Williams
  - Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge CB2 0AW, UK
- Ruben Chazarra-Gil
  - Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge CB2 0AW, UK
  - Life Sciences-Transcriptomics and Functional Genomics Lab, Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain
- Arash Shahsavari
  - Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge CB2 0AW, UK
- Irina Mohorianu
  - Wellcome-MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge CB2 0AW, UK
  - Correspondence:
2
Lozoya OA, McClelland KS, Papas BN, Li JL, Yao HHC. Patterns, Profiles, and Parsimony: Dissecting Transcriptional Signatures From Minimal Single-Cell RNA-Seq Output With SALSA. Front Genet 2020; 11:511286. [PMID: 33193599; PMCID: PMC7586319; DOI: 10.3389/fgene.2020.511286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/11/2019] [Accepted: 09/18/2020] [Indexed: 11/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies have precipitated the development of bioinformatic tools to reconstruct cell lineage specification and differentiation processes with single-cell precision. However, current start-up costs and recommended data volumes for statistical analysis remain prohibitively expensive, preventing scRNA-seq technologies from becoming mainstream. Here, we introduce single-cell amalgamation by latent semantic analysis (SALSA), a versatile workflow that combines measurement reliability metrics with latent variable extraction to infer robust expression profiles from ultra-sparse sc-RNAseq data. SALSA uses a matrix focusing approach that starts by identifying facultative genes with expression levels greater than experimental measurement precision and ends with cell clustering based on a minimal set of Profiler genes, each one a putative biomarker of cluster-specific expression profiles. To benchmark how SALSA performs in experimental settings, we used the publicly available 10X Genomics PBMC 3K dataset, a pre-curated silver standard from human frozen peripheral blood comprising 2,700 single-cell barcodes, and identified 7 major cell groups matching transcriptional profiles of peripheral blood cell types and driven agnostically by < 500 Profiler genes. Finally, we demonstrate successful implementation of SALSA in a replicative scRNA-seq scenario by using previously published DropSeq data from a multi-batch mouse retina experimental design, thereby identifying 10 transcriptionally distinct cell types from > 64,000 single cells across 7 independent biological replicates based on < 630 Profiler genes. With these results, SALSA demonstrates that robust pattern detection from scRNA-seq expression matrices only requires a fraction of the accrued data, suggesting that single-cell sequencing technologies can become affordable and widespread if meant as hypothesis-generation tools to extract large-scale differential expression effects.
Affiliation(s)
- Oswaldo A. Lozoya
  - Genomic Integrity & Structural Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Kathryn S. McClelland
  - Reproductive and Developmental Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Brian N. Papas
  - Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Jian-Liang Li
  - Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Humphrey H.-C. Yao
  - Reproductive and Developmental Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
3
Kuzmin DA, Feranchuk SI, Sharov VV, Cybin AN, Makolov SV, Putintseva YA, Oreshkova NV, Krutovsky KV. Stepwise large genome assembly approach: a case of Siberian larch (Larix sibirica Ledeb). BMC Bioinformatics 2019; 20:37. [PMID: 30717661; PMCID: PMC6362582; DOI: 10.1186/s12859-018-2570-y] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Indexed: 12/20/2022]
Abstract
Background: De novo assembling of large genomes, such as in conifers (~12–30 Gbp), which also consist of ~80% repetitive DNA, is a very complex and computationally intense endeavor. One of the main problems in assembling such genomes lies in the computing limitations of nucleotide sequence assembly programs (DNA assemblers). As a rule, modern assemblers are designed to assemble genomes with a length not exceeding the length of the human genome (3.24 Gbp). Most assemblers cannot handle the amount of input sequence data required to provide sufficient coverage for a high-quality assembly.
Results: An original stepwise method of de novo assembly by parts (sets), which makes it possible to bypass the limitations of modern assemblers associated with a huge amount of data being processed, is presented in this paper. The results of numerical assembling experiments conducted using the model plant Arabidopsis thaliana, Prunus persica (peach) and four most popular assemblers, ABySS, SOAPdenovo, SPAdes, and CLC Assembly Cell, showed the validity and effectiveness of the proposed stepwise assembling method.
Conclusion: Using the new stepwise de novo assembling method presented in the paper, the genome of Siberian larch, Larix sibirica Ledeb. (12.34 Gbp) was completely assembled de novo by the CLC Assembly Cell assembler. It is the first genome assembly for larch species in addition to only five other conifer genomes sequenced and assembled for Picea abies, Picea glauca, Pinus taeda, Pinus lambertiana, and Pseudotsuga menziesii var. menziesii.
Electronic supplementary material: The online version of this article (10.1186/s12859-018-2570-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Dmitry A Kuzmin
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
  - Department of High Performance Computing, Institute of Space and Information Technologies, Siberian Federal University, 660074, Krasnoyarsk, Russia
- Sergey I Feranchuk
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
  - Department of Informatics, National Research Technical University, 664074, Irkutsk, Russia
  - Limnological Institute, Siberian Branch of Russian Academy of Sciences, 664033, Irkutsk, Russia
- Vadim V Sharov
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
  - Department of High Performance Computing, Institute of Space and Information Technologies, Siberian Federal University, 660074, Krasnoyarsk, Russia
- Alexander N Cybin
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
  - Department of High Performance Computing, Institute of Space and Information Technologies, Siberian Federal University, 660074, Krasnoyarsk, Russia
- Stepan V Makolov
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
  - Department of High Performance Computing, Institute of Space and Information Technologies, Siberian Federal University, 660074, Krasnoyarsk, Russia
- Yuliya A Putintseva
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
- Natalya V Oreshkova
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
  - Laboratory of Forest Genetics and Selection, V. N. Sukachev Institute of Forest, Siberian Branch of Russian Academy of Sciences, 660036, Krasnoyarsk, Russia
- Konstantin V Krutovsky
  - Laboratory of Forest Genomics, Genome Research and Education Center, Siberian Federal University, 660036, Krasnoyarsk, Russia
  - Department of Forest Genetics and Forest Tree Breeding, Georg-August University of Göttingen, 37077, Göttingen, Germany
  - Laboratory of Population Genetics, N. I. Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119333, Russia
  - Department of Ecosystem Science and Management, Texas A&M University, College Station, TX, 77843-2138, USA
4
Expression analysis of RNA sequencing data from human neural and glial cell lines depends on technical replication and normalization methods. BMC Bioinformatics 2018; 19:412. [PMID: 30453873; PMCID: PMC6245503; DOI: 10.1186/s12859-018-2382-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
Abstract
Background: The potential for astrocyte participation in central nervous system recovery is highlighted by in vitro experiments demonstrating their capacity to transdifferentiate into neurons. Understanding astrocyte plasticity could be advanced by comparing astrocytes with stem cells. RNA sequencing (RNA-seq) is ideal for comparing differences across cell types. However, this novel multi-stage process has the potential to introduce unwanted technical variation at several points in the experimental workflow. Quantitative understanding of the contribution of experimental parameters to technical variation would facilitate the design of robust RNA-Seq experiments.
Results: RNA-Seq was used to achieve biological and technical objectives. The biological aspect compared gene expression between normal human fetal-derived astrocytes and human neural stem cells cultured in identical conditions. When differential expression threshold criteria of |log2 fold change| > 2 were applied to the data, no significant differences were observed. The technical component quantified variation arising from particular steps in the research pathway, and compared the ability of different normalization methods to reduce unwanted variance. To facilitate this objective, a liberal false discovery rate of 10% and a |log2 fold change| > 0.5 were implemented for the differential expression threshold. Data were normalized with RPKM, TMM, and UQS methods using JMP Genomics. The contributions of key replicable experimental parameters (cell lot; library preparation; flow cell) to variance in the data were evaluated using principal variance component analysis. Our analysis showed that, although the variance for every parameter is strongly influenced by the normalization method, the largest contributor to technical variance was library preparation. The ability to detect differentially expressed genes was also affected by normalization; differences were only detected in non-normalized and TMM-normalized data.
Conclusions: The similarity in gene expression between astrocytes and neural stem cells supports the potential for astrocytic transdifferentiation into neurons, and emphasizes the need to evaluate the therapeutic potential of astrocytes for central nervous system damage. The choice of normalization method influences the contributions to experimental variance as well as the outcomes of differential expression analysis. However, irrespective of normalization method, our findings illustrate that library preparation contributed the largest component of technical variance.
Electronic supplementary material: The online version of this article (10.1186/s12859-018-2382-0) contains supplementary material, which is available to authorized users.
5
Veras PST, Ramos PIP, de Menezes JPB. In Search of Biomarkers for Pathogenesis and Control of Leishmaniasis by Global Analyses of Leishmania-Infected Macrophages. Front Cell Infect Microbiol 2018; 8:326. [PMID: 30283744; PMCID: PMC6157484; DOI: 10.3389/fcimb.2018.00326] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 06/18/2018] [Accepted: 08/27/2018] [Indexed: 12/12/2022]
Abstract
Leishmaniasis is a vector-borne, neglected tropical disease with a worldwide distribution that can present in a variety of clinical forms, depending on the parasite species and host genetic background. The pathogenesis of this disease remains far from being elucidated because the involvement of a complex immune response orchestrated by host cells significantly affects the clinical outcome. Among these cells, macrophages, the main host cells, produce cytokines and chemokines, thereby triggering events that contribute to the mediation of the host immune response and, subsequently, to the establishment of infection or, alternatively, disease control. There has been relatively limited commercial interest in developing new pharmaceutical compounds to treat leishmaniasis. Moreover, advances in the understanding of the underlying biology of Leishmania spp. have not translated into the development of effective new chemotherapeutic compounds. As a result, biomarkers as surrogate disease endpoints present several potential advantages to be used in the identification of targets capable of facilitating therapeutic interventions considered to ameliorate disease outcome. More recently, large-scale genomic and proteomic analyses have allowed the identification and characterization of the pathways involved in the infection process in both parasites and the host, and these analyses have been shown to be more effective than studying individual molecules to elucidate disease pathogenesis. RNA-seq and proteomics are large-scale approaches that characterize genes or proteins in a given cell line, tissue, or organism to provide a global and more integrated view of the myriad biological processes that occur within a cell than focusing on an individual gene or protein. Bioinformatics provides us with the means to computationally analyze and integrate the large volumes of data generated by high-throughput sequencing approaches. The integration of genomic expression and proteomic data offers a rich multi-dimensional analysis, despite the inherent technical and statistical challenges. We propose that these types of global analyses facilitate the identification, among a large number of genes and proteins, of those that hold potential as biomarkers. The present review focuses on large-scale studies that have identified and evaluated relevant biomarkers in macrophages in response to Leishmania infection.
Affiliation(s)
- Patricia Sampaio Tavares Veras
  - Laboratory of Host-Parasite Interaction and Epidemiology, Gonçalo Moniz Institute, Fiocruz-Bahia, Salvador, Brazil
  - National Institute of Tropical Disease, Brasilia, Brazil
- Pablo Ivan Pereira Ramos
  - Center for Data and Knowledge Integration for Health, Gonçalo Moniz Institute, Fiocruz-Bahia, Salvador, Brazil
6
Pannala VR, Wall ML, Estes SK, Trenary I, O'Brien TP, Printz RL, Vinnakota KC, Reifman J, Shiota M, Young JD, Wallqvist A. Metabolic network-based predictions of toxicant-induced metabolite changes in the laboratory rat. Sci Rep 2018; 8:11678. [PMID: 30076366; PMCID: PMC6076258; DOI: 10.1038/s41598-018-30149-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 03/09/2018] [Accepted: 07/23/2018] [Indexed: 12/11/2022]
Abstract
To provide timely treatment for organ damage initiated by therapeutic drugs or exposure to environmental toxicants, we first need to identify markers that provide an early diagnosis of potential adverse effects before permanent damage occurs. Specifically, the liver, as a primary organ prone to toxicant-induced injuries, lacks diagnostic markers that are specific and sensitive to the early onset of injury. Here, to identify plasma metabolites as markers of early toxicant-induced injury, we used a constraint-based modeling approach with a genome-scale network reconstruction of rat liver metabolism to incorporate perturbations of gene expression induced by acetaminophen, a known hepatotoxicant. A comparison of the model results against the global metabolic profiling data revealed that our approach satisfactorily predicted altered plasma metabolite levels as early as 5 h after exposure to 2 g/kg of acetaminophen, and that 10 h after treatment the predictions significantly improved when we integrated measured central carbon fluxes. Our approach is solely driven by gene expression and physiological boundary conditions, and does not rely on any toxicant-specific model component. As such, it provides a mechanistic model that serves as a first step in identifying a list of putative plasma metabolites that could change due to toxicant-induced perturbations.
Affiliation(s)
- Venkat R Pannala
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD, 21702, USA
- Martha L Wall
  - Department of Chemical and Biomolecular Engineering, Vanderbilt University School of Engineering, Nashville, TN, 37232, USA
- Shanea K Estes
  - Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
- Irina Trenary
  - Department of Chemical and Biomolecular Engineering, Vanderbilt University School of Engineering, Nashville, TN, 37232, USA
- Tracy P O'Brien
  - Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
- Richard L Printz
  - Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
- Kalyan C Vinnakota
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD, 21702, USA
- Jaques Reifman
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD, 21702, USA
- Masakazu Shiota
  - Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
- Jamey D Young
  - Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN, 37232, USA
  - Department of Chemical and Biomolecular Engineering, Vanderbilt University School of Engineering, Nashville, TN, 37232, USA
- Anders Wallqvist
  - Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD, 21702, USA
7
Lozoya OA, Santos JH, Woychik RP. A Leveraged Signal-to-Noise Ratio (LSTNR) Method to Extract Differentially Expressed Genes and Multivariate Patterns of Expression From Noisy and Low-Replication RNAseq Data. Front Genet 2018; 9:176. [PMID: 29868123; PMCID: PMC5964166; DOI: 10.3389/fgene.2018.00176] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 11/22/2017] [Accepted: 04/27/2018] [Indexed: 12/11/2022]
Abstract
To life scientists, one important feature offered by RNAseq, a next-generation sequencing tool used to estimate changes in gene expression levels, lies in its unprecedented resolution. It can score countable differences in transcript numbers among thousands of genes and between experimental groups, all at once. However, its high cost limits experimental designs to very small sample sizes, usually N = 3, which often results in statistically underpowered analysis and poor reproducibility. All these issues are compounded by the presence of experimental noise, which is harder to distinguish from instrumental error when sample sizes are limiting (e.g., small-budget pilot tests), experimental populations exhibit biologically heterogeneous or diffuse expression phenotypes (e.g., patient samples), or when discriminating among transcriptional signatures of closely related experimental conditions (e.g., toxicological modes of action, or MOAs). Here, we present a leveraged signal-to-noise ratio (LSTNR) thresholding method, founded on generalized linear modeling (GLM) of aligned read detection limits to extract differentially expressed genes (DEGs) from noisy low-replication RNAseq data. The LSTNR method uses an agnostic independent filtering strategy to define the dynamic range of detected aggregate read counts per gene, and assigns statistical weights that prioritize genes with better sequencing resolution in differential expression analyses. 
To assess its performance, we implemented the LSTNR method to analyze three separate datasets: first, using a systematically noisy in silico dataset, we demonstrated that LSTNR can extract pre-designed patterns of expression and discriminate between "noise" and "true" differentially expressed pseudogenes at a 100% success rate; then, we illustrated how the LSTNR method can assign patient-derived breast cancer specimens correctly to one out of their four reported molecular subtypes (luminal A, luminal B, Her2-enriched and basal-like); and last, we showed the ability to retrieve five different modes of action (MOA) elicited in livers of rats exposed to three toxicants under three nutritional routes by using the LSTNR method. By combining differential measurements with resolving power to detect DEGs, the LSTNR method offers an alternative approach to interrogate noisy and low-replication RNAseq datasets, which handles multiple biological conditions at once, and defines benchmarks to validate RNAseq experiments with standard benchtop assays.
Affiliation(s)
- Oswaldo A Lozoya
  - Genome Integrity and Structural Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Durham, NC, United States
- Janine H Santos
  - Genome Integrity and Structural Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Durham, NC, United States
- Richard P Woychik
  - Genome Integrity and Structural Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Durham, NC, United States
8
Genome-wide gene expression changes associated with exposure of rat liver, heart, and kidney cells to endosulfan. Toxicol In Vitro 2018; 48:244-254. [DOI: 10.1016/j.tiv.2018.01.022] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 06/01/2017] [Revised: 01/25/2018] [Accepted: 01/27/2018] [Indexed: 02/06/2023]
9
Wang M, Uebbing S, Ellegren H. Bayesian Inference of Allele-Specific Gene Expression Indicates Abundant Cis-Regulatory Variation in Natural Flycatcher Populations. Genome Biol Evol 2017; 9:1266-1279. [PMID: 28453623; PMCID: PMC5434935; DOI: 10.1093/gbe/evx080] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Accepted: 04/25/2017] [Indexed: 12/13/2022]
Abstract
Polymorphism in cis-regulatory sequences can lead to different levels of expression for the two alleles of a gene, providing a starting point for the evolution of gene expression. Little is known about the genome-wide abundance of genetic variation in gene regulation in natural populations but analysis of allele-specific expression (ASE) provides a means for investigating such variation. We performed RNA-seq of multiple tissues from population samples of two closely related flycatcher species and developed a Bayesian algorithm that maximizes data usage by borrowing information from the whole data set and combines several SNPs per transcript to detect ASE. Of 2,576 transcripts analyzed in collared flycatcher, ASE was detected in 185 (7.2%) and a similar frequency was seen in the pied flycatcher. Transcripts with statistically significant ASE commonly showed the major allele in >90% of the reads, reflecting that power was highest when expression was heavily biased toward one of the alleles. This would suggest that the observed frequencies of ASE likely are underestimates. The proportion of ASE transcripts varied among tissues, being lowest in testis and highest in muscle. Individuals often showed ASE of particular transcripts in more than one tissue (73.4%), consistent with a genetic basis for regulation of gene expression. The results suggest that genetic variation in regulatory sequences commonly affects gene expression in natural populations and that it provides a seedbed for phenotypic evolution via divergence in gene expression.
Affiliation(s)
- Mi Wang
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Sweden
- Severin Uebbing
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Sweden
- Hans Ellegren
  - Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Sweden
10
Huang WC, Ferris E, Cheng T, Hörndli CS, Gleason K, Tamminga C, Wagner JD, Boucher KM, Christian JL, Gregg C. Diverse Non-genetic, Allele-Specific Expression Effects Shape Genetic Architecture at the Cellular Level in the Mammalian Brain. Neuron 2017; 93:1094-1109.e7. [PMID: 28238550; PMCID: PMC5774018; DOI: 10.1016/j.neuron.2017.01.033] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 05/29/2016] [Revised: 11/27/2016] [Accepted: 01/30/2017] [Indexed: 01/19/2023]
Abstract
Interactions between genetic and epigenetic effects shape brain function, behavior, and the risk for mental illness. Random X inactivation and genomic imprinting are epigenetic allelic effects that are well known to influence genetic architecture and disease risk. Less is known about the nature, prevalence, and conservation of other potential epigenetic allelic effects in vivo in the mouse and primate brain. Here we devise genomics, in situ hybridization, and mouse genetics strategies to uncover diverse allelic effects in the brain that are not caused by imprinting or genetic variation. We found allelic effects that are developmental stage and cell type specific, that are prevalent in the neonatal brain, and that cause mosaics of monoallelic brain cells that differentially express wild-type and mutant alleles for heterozygous mutations. Finally, we show that diverse non-genetic allelic effects that impact mental illness risk genes exist in the macaque and human brain. Our findings have potential implications for mammalian brain genetics.
Affiliation(s)
- Wei-Chao Huang
- Departments of Neurobiology & Anatomy, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
| | - Elliott Ferris
- Departments of Neurobiology & Anatomy, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
| | - Tong Cheng
- Departments of Neurobiology & Anatomy, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
| | - Cornelia Stacher Hörndli
- Departments of Neurobiology & Anatomy, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
| | - Kelly Gleason
- Department of Psychiatry, UT Southwestern, Dallas, TX 75390-9127, USA
| | - Carol Tamminga
- Department of Psychiatry, UT Southwestern, Dallas, TX 75390-9127, USA
| | - Janice D Wagner
- Department of Pathology, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
| | - Kenneth M Boucher
- Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT 84112, USA; Cancer Biostatistics Shared Resource, University of Utah School of Medicine, Salt Lake City, UT 84112, USA; Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
| | - Jan L Christian
- Departments of Neurobiology & Anatomy, University of Utah School of Medicine, Salt Lake City, UT 84112, USA; Department of Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT 84112, USA
| | - Christopher Gregg
- Robertson Neuroscience Investigator, New York Stem Cell Foundation, University of Utah School of Medicine, Salt Lake City, UT 84112, USA; Departments of Neurobiology & Anatomy, University of Utah School of Medicine, Salt Lake City, UT 84112, USA; Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112, USA.
|
11
|
Gene signatures associated with adaptive humoral immunity following seasonal influenza A/H1N1 vaccination. Genes Immun 2016; 17:371-379. [PMID: 27534615 PMCID: PMC5133148 DOI: 10.1038/gene.2016.34] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 12/15/2015] [Revised: 06/07/2016] [Accepted: 06/09/2016] [Indexed: 12/27/2022]
Abstract
This study aimed to identify gene expression markers shared between both influenza hemagglutination-inhibition (HAI) and virus-neutralization antibody (VNA) responses. We enrolled 158 older subjects who received the 2010–2011 trivalent inactivated influenza vaccine (TIV). Influenza-specific HAI and VNA titers, and mRNA-sequencing were performed using blood samples obtained at Days 0, 3 and 28 post-vaccination. For antibody response at Day 28 vs Day 0, several genesets were identified as significant in predictive models for HAI (n=7) and VNA (n=35) responses. Five genesets (comprising the genes MAZ, TTF, GSTM, RABGGTA, SMS, CA, IFNG, and DOPEY) were in common for both HAI and VNA. For response at Day 28 vs Day 3, many genesets were identified in predictive models for HAI (n=13) and VNA (n=41). Ten genesets (comprising biologically related genes, such as MAN1B1, POLL, CEBPG, FOXP3, IL12A, TLR3, TLR7, and others) were shared between HAI and VNA. These identified genesets demonstrated a high degree of network interactions and likelihood for functional relationships. Influenza-specific HAI and VNA responses demonstrated a remarkable degree of similarity. Although unique geneset signatures were identified for each humoral outcome, several genesets were determined to be in common with both HAI and VNA response to influenza vaccine.
|
12
|
Haralambieva IH, Zimmermann MT, Ovsyannikova IG, Grill DE, Oberg AL, Kennedy RB, Poland GA. Whole Transcriptome Profiling Identifies CD93 and Other Plasma Cell Survival Factor Genes Associated with Measles-Specific Antibody Response after Vaccination. PLoS One 2016; 11:e0160970. [PMID: 27529750 PMCID: PMC4987012 DOI: 10.1371/journal.pone.0160970] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 05/19/2016] [Accepted: 07/27/2016] [Indexed: 11/29/2022] Open
Abstract
Background: There are insufficient system-wide transcriptomic (or other) data that help explain the observed inter-individual variability in antibody titers after measles vaccination in otherwise healthy individuals. Methods: We performed a transcriptome (mRNA-Seq) profiling study after in vitro viral stimulation of PBMCs from 30 measles vaccine recipients, selected from a cohort of 764 schoolchildren, based on the highest and lowest antibody titers. We used regression and network biology modeling to define markers associated with neutralizing antibody response. Results: We identified 39 differentially expressed genes that demonstrate significant differences between the high and low antibody responder groups (p-value≤0.0002, q-value≤0.092), including the top gene CD93 (p<1.0E-13, q<1.0E-09), encoding a receptor required for antigen-driven B-cell differentiation, maintenance of immunoglobulin production and preservation of plasma cells in the bone marrow. Network biology modeling highlighted plasma cell survival (CD93, IL6, CXCL12), chemokine/cytokine activity and cell-cell communication/adhesion/migration as biological processes associated with the observed differential response in the two responder groups. Conclusion: We identified genes and pathways that explain in part, and are associated with, neutralizing antibody titers after measles vaccination. This new knowledge could assist in the identification of biomarkers and predictive signatures of protective immunity that may be useful in the design of new vaccine candidates and in clinical studies.
Affiliation(s)
- Iana H Haralambieva
- Mayo Clinic Vaccine Research Group-Department of Medicine, Mayo Clinic and Foundation, Rochester, MN, United States of America
| | - Michael T Zimmermann
- Division of Biomedical Statistics and Informatics- Department of Health Science Research, Mayo Clinic and Foundation, Rochester, MN, United States of America
| | - Inna G Ovsyannikova
- Mayo Clinic Vaccine Research Group-Department of Medicine, Mayo Clinic and Foundation, Rochester, MN, United States of America
| | - Diane E Grill
- Division of Biomedical Statistics and Informatics- Department of Health Science Research, Mayo Clinic and Foundation, Rochester, MN, United States of America
| | - Ann L Oberg
- Division of Biomedical Statistics and Informatics- Department of Health Science Research, Mayo Clinic and Foundation, Rochester, MN, United States of America
| | - Richard B Kennedy
- Mayo Clinic Vaccine Research Group-Department of Medicine, Mayo Clinic and Foundation, Rochester, MN, United States of America
| | - Gregory A Poland
- Mayo Clinic Vaccine Research Group-Department of Medicine, Mayo Clinic and Foundation, Rochester, MN, United States of America
|
13
|
Yang EW, Jiang T. SDEAP: a splice graph based differential transcript expression analysis tool for population data. Bioinformatics 2016; 32:3593-3602. [PMID: 27522083 DOI: 10.1093/bioinformatics/btw513] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/29/2016] [Revised: 07/21/2016] [Accepted: 07/28/2016] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION: Differential transcript expression (DTE) analysis without predefined conditions is critical to biological studies. For example, it can be used to discover biomarkers to classify cancer samples into previously unknown subtypes such that better diagnosis and therapy methods can be developed for the subtypes. Although several DTE tools for population data, i.e. data without known biological conditions, have been published, these tools either assume binary conditions in the input population or require the number of conditions as a part of the input. Fixing the number of conditions to binary is unrealistic and may distort the results of a DTE analysis. Estimating the correct number of conditions in a population could also be challenging for a routine user. Moreover, the existing tools only provide differential usages of exons, which may be insufficient to interpret the patterns of alternative splicing across samples and restricts the application of the tools in many biological studies. RESULTS: We propose a novel DTE analysis algorithm, called SDEAP, that estimates the number of conditions directly from the input samples using a Dirichlet mixture model and discovers alternative splicing events using a new graph modular decomposition algorithm. By taking advantage of the above technical improvement, SDEAP was able to outperform the other DTE analysis methods in our extensive experiments on simulated data and real data with qPCR validation. The prediction of SDEAP also allowed us to classify the samples of cancer subtypes and cell-cycle phases more accurately. AVAILABILITY AND IMPLEMENTATION: SDEAP is publicly available for free at https://github.com/ewyang089/SDEAP/wiki. CONTACT: yyang027@cs.ucr.edu; jiang@cs.ucr.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ei-Wen Yang
- Department of Computer Science and Engineering, University of California, Riverside, CA, USA; Department of Integrative Biology and Physiology, University of California, Los Angeles, CA, USA
| | - Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA, USA; Institute of Integrative Genome Biology, University of California, Riverside, CA, USA; MOE Key Lab of Bioinformatics and Bioinformatics Division, TNLIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China
|
14
|
Buschmann D, Haberberger A, Kirchner B, Spornraft M, Riedmaier I, Schelling G, Pfaffl MW. Toward reliable biomarker signatures in the age of liquid biopsies - how to standardize the small RNA-Seq workflow. Nucleic Acids Res 2016; 44:5995-6018. [PMID: 27317696 PMCID: PMC5291277 DOI: 10.1093/nar/gkw545] [Citation(s) in RCA: 78] [Impact Index Per Article: 9.8] [Received: 04/12/2016] [Accepted: 06/03/2016] [Indexed: 12/21/2022] Open
Abstract
Small RNA-Seq has emerged as a powerful tool in transcriptomics, gene expression profiling and biomarker discovery. Sequencing cell-free nucleic acids, particularly microRNA (miRNA), from liquid biopsies additionally provides exciting possibilities for molecular diagnostics, and might help establish disease-specific biomarker signatures. The complexity of the small RNA-Seq workflow, however, bears challenges and biases that researchers need to be aware of in order to generate high-quality data. Rigorous standardization and extensive validation are required to guarantee reliability, reproducibility and comparability of research findings. Hypotheses based on flawed experimental conditions can be inconsistent and even misleading. Comparable to the well-established MIQE guidelines for qPCR experiments, this work aims at establishing guidelines for experimental design and pre-analytical sample processing, standardization of library preparation and sequencing reactions, as well as facilitating data analysis. We highlight bottlenecks in small RNA-Seq experiments, point out the importance of stringent quality control and validation, and provide a primer for differential expression analysis and biomarker discovery. Following our recommendations will encourage better sequencing practice, increase experimental transparency and lead to more reproducible small RNA-Seq results. This will ultimately enhance the validity of biomarker signatures, and allow reliable and robust clinical predictions.
Affiliation(s)
- Dominik Buschmann
- Department of Animal Physiology and Immunology, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Weihenstephaner Berg 3, 85354 Freising, Germany; Institute of Human Genetics, University Hospital, Ludwig-Maximilians-University Munich, Goethestraße 29, 80336 München, Germany
| | - Anna Haberberger
- Department of Animal Physiology and Immunology, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Weihenstephaner Berg 3, 85354 Freising, Germany
| | - Benedikt Kirchner
- Department of Animal Physiology and Immunology, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Weihenstephaner Berg 3, 85354 Freising, Germany
| | - Melanie Spornraft
- Department of Animal Physiology and Immunology, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Weihenstephaner Berg 3, 85354 Freising, Germany
| | - Irmgard Riedmaier
- Eurofins Medigenomix Forensik GmbH, Anzinger Straße 7a, 85560 Ebersberg, Germany; Department of Anesthesiology, University Hospital, Ludwig-Maximilians-University Munich, Marchioninistraße 15, 81377 München, Germany
| | - Gustav Schelling
- Department of Physiology, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Weihenstephaner Berg 3, 85354 Freising, Germany
| | - Michael W Pfaffl
- Department of Animal Physiology and Immunology, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Weihenstephaner Berg 3, 85354 Freising, Germany
|
15
|
Vincent AT, Derome N, Boyle B, Culley AI, Charette SJ. Next-generation sequencing (NGS) in the microbiological world: How to make the most of your money. J Microbiol Methods 2016; 138:60-71. [PMID: 26995332 DOI: 10.1016/j.mimet.2016.02.016] [Citation(s) in RCA: 71] [Impact Index Per Article: 8.9] [Received: 09/30/2015] [Revised: 01/26/2016] [Accepted: 02/24/2016] [Indexed: 12/16/2022]
Abstract
The Sanger sequencing method produces relatively long DNA sequences of unmatched quality and has long been considered the gold standard for sequencing DNA. Many improvements of the Sanger method that culminated with fluorescent dyes coupled with automated capillary electrophoresis enabled the sequencing of the first genomes. Nevertheless, using this technology to sequence whole genomes was costly, laborious and time consuming even for genomes that are relatively small in size. A major technological advance was the introduction of next-generation sequencing (NGS) pioneered by 454 Life Sciences in the early part of the 21st century. NGS allowed scientists to sequence thousands to millions of DNA molecules in a single machine run. Since then, new NGS technologies have emerged and existing NGS platforms have been improved, enabling the production of genome sequences at an unprecedented rate as well as broadening the spectrum of NGS applications. The current affordability of generating genomic information, especially with microbial samples, has resulted in a false sense of simplicity that belies the fact that many researchers still consider these technologies a black box. In this review, our objective is to identify and discuss four steps that we consider crucial to the success of any NGS-related project. These steps are: (1) the definition of the research objectives beyond sequencing and appropriate experimental planning, (2) library preparation, (3) sequencing and (4) data analysis. The goal of this review is to give an overview of the process, from sample to analysis, and discuss how to optimize your resources to achieve the most from your NGS-based research. Regardless of the evolution and improvement of the sequencing technologies, these four steps will remain relevant.
Affiliation(s)
- Antony T Vincent
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, Quebec City, QC G1V 0A6, Canada; Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie de Québec, Quebec City, QC G1V 4G5, Canada
| | - Nicolas Derome
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biologie, Faculté des sciences et de génie, Université Laval, Quebec City G1V 0A6, Canada
| | - Brian Boyle
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada
| | - Alexander I Culley
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, Quebec City, QC G1V 0A6, Canada; Groupe de Recherche en Écologie Buccale (GREB), Faculté de médecine dentaire, Université Laval, Quebec City, QC G1V 0A6, Canada
| | - Steve J Charette
- Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC G1V 0A6, Canada; Département de biochimie, de microbiologie et de bio-informatique, Faculté des sciences et de génie, Université Laval, Quebec City, QC G1V 0A6, Canada; Centre de recherche de l'Institut universitaire de cardiologie et de pneumologie de Québec, Quebec City, QC G1V 4G5, Canada.
|
16
|
Zhen H, Krumins V, Fennell DE, Mainelis G. Development of a dual-internal-reference technique to improve accuracy when determining bacterial 16S rRNA:16S rRNA gene ratio with application to Escherichia coli liquid and aerosol samples. J Microbiol Methods 2015; 117:113-21. [PMID: 26241659 DOI: 10.1016/j.mimet.2015.07.023] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 06/19/2015] [Revised: 07/27/2015] [Accepted: 07/27/2015] [Indexed: 01/04/2023]
Abstract
Accurate enumeration of rRNA content in microbial cells, e.g. by using the 16S rRNA:16S rRNA gene ratio, is critical to properly understand its relationship to microbial activities. However, few studies have considered possible methodological artifacts that may contribute to the variability of rRNA analysis results. In this study, a technique utilizing genomic DNA and 16S rRNA from an exogenous species (Pseudomonas fluorescens) as dual internal references was developed to improve accuracy when determining the 16S rRNA:16S rRNA gene ratio of a target organism, Escherichia coli. This technique was able to adequately control the variability in sample processing and analysis procedures due to nucleic acid (DNA and RNA) losses, inefficient reverse transcription of RNA, and inefficient PCR amplification. The measured 16S rRNA:16S rRNA gene ratio of E. coli increased by 2-3 fold when E. coli 16S rRNA gene and 16S rRNA quantities were normalized to the sample-specific fractional recoveries of reference (P. fluorescens) 16S rRNA gene and 16S rRNA, respectively. In addition, the intra-sample variation of this ratio, represented by coefficients of variation from replicate samples, decreased significantly after normalization. This technique was applied to investigate the temporal variation of 16S rRNA:16S rRNA gene ratio of E. coli during its non-steady-state growth in a complex liquid medium, and to E. coli aerosols when exposed to particle-free air after their collection on a filter. The 16S rRNA:16S rRNA gene ratio of E. coli increased significantly during its early exponential phase of growth; when E. coli aerosols were exposed to extended filtration stress after sample collection, the ratio also increased. In contrast, no significant temporal trend in E. coli 16S rRNA:16S rRNA gene ratio was observed when the determined ratios were not normalized based on the recoveries of dual references. The developed technique could be widely applied in studies of relationship between cellular rRNA abundance and bacterial activity.
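The normalization described in this abstract can be illustrated with a short sketch. It assumes that the fractional recovery of each reference is the measured reference quantity divided by the spiked-in quantity; the function and parameter names are hypothetical and the numbers in the usage note are invented for illustration, not taken from the study.

```python
def fractional_recovery(measured_ref: float, spiked_ref: float) -> float:
    """Fraction of a spiked-in reference that survived sample processing.

    Assumption: recovery = measured quantity / quantity originally added.
    """
    if spiked_ref <= 0:
        raise ValueError("spiked_ref must be positive")
    return measured_ref / spiked_ref


def normalized_ratio(target_rrna: float, target_gene: float,
                     ref_rrna_measured: float, ref_rrna_spiked: float,
                     ref_gene_measured: float, ref_gene_spiked: float) -> float:
    """16S rRNA:16S rRNA gene ratio corrected by dual reference recoveries."""
    rrna_recovery = fractional_recovery(ref_rrna_measured, ref_rrna_spiked)
    gene_recovery = fractional_recovery(ref_gene_measured, ref_gene_spiked)
    # Divide each target quantity by the recovery of the matching reference,
    # then form the ratio from the corrected quantities.
    corrected_rrna = target_rrna / rrna_recovery
    corrected_gene = target_gene / gene_recovery
    return corrected_rrna / corrected_gene


# Invented example: RNA is lost more heavily (25% recovered) than DNA (50%).
raw = 1000 / 10
corrected = normalized_ratio(1000, 10, 25, 100, 50, 100)
print(raw, corrected)
```

Because RNA losses and inefficient reverse transcription affect only the numerator, a lower rRNA recovery than gene recovery makes the corrected ratio larger than the raw one, which is the direction of change the abstract reports.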
Affiliation(s)
- Huajun Zhen
- Rutgers University, Department of Environmental Sciences, 14 College Farm Rd., New Brunswick, NJ 08901, United States
| | - Valdis Krumins
- Rutgers University, Department of Environmental Sciences, 14 College Farm Rd., New Brunswick, NJ 08901, United States
| | - Donna E Fennell
- Rutgers University, Department of Environmental Sciences, 14 College Farm Rd., New Brunswick, NJ 08901, United States
| | - Gediminas Mainelis
- Rutgers University, Department of Environmental Sciences, 14 College Farm Rd., New Brunswick, NJ 08901, United States.
|
17
|
High Intensity Interval Training Favourably Affects Angiotensinogen mRNA Expression and Markers of Cardiorenal Health in a Rat Model of Early-Stage Chronic Kidney Disease. BIOMED RESEARCH INTERNATIONAL 2015; 2015:156584. [PMID: 26090382 PMCID: PMC4458272 DOI: 10.1155/2015/156584] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/29/2015] [Accepted: 04/03/2015] [Indexed: 12/19/2022]
Abstract
The majority of CKD-related complications stem from cardiovascular pathologies such as hypertension. To help reduce cardiovascular complications, aerobic exercise is often prescribed. Emerging evidence suggests high intensity interval training (HIIT) may be more beneficial than traditional aerobic exercise. However, appraisals of varying forms of aerobic exercise, along with descriptions of mechanisms responsible for health-related improvements, are lacking. This study examined the effects of 8 weeks of HIIT (85% VO2max), versus low intensity aerobic exercise (LIT; 45–50% VO2max) and sedentary behaviour (SED), in an animal model of early-stage CKD. Tissue-specific mRNA expression of RAAS-related genes and CKD-related clinical markers were examined. Compared to SED, HIIT resulted in increased plasma albumin (p = 0.001), reduced remnant kidney weight (p = 0.028), and reduced kidney weight-body weight ratios (p = 0.045). Compared to LIT, HIIT resulted in reduced Agt mRNA expression (p = 0.035), reduced plasma LDL (p = 0.001), triglycerides (p = 0.029), and total cholesterol (p = 0.002), increased plasma albumin (p = 0.047), reduced remnant kidney weight (p = 0.005), and reduced kidney weight-body weight ratios (p = 0.048). These results suggest HIIT is a more potent regulator of several markers that describe and influence health in CKD.
|
18
|
Oberg AL, McKinney BA, Schaid DJ, Pankratz VS, Kennedy RB, Poland GA. Lessons learned in the analysis of high-dimensional data in vaccinomics. Vaccine 2015; 33:5262-70. [PMID: 25957070 DOI: 10.1016/j.vaccine.2015.04.088] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 12/29/2014] [Revised: 04/16/2015] [Accepted: 04/23/2015] [Indexed: 12/17/2022]
Abstract
The field of vaccinology is increasingly moving toward the generation, analysis, and modeling of extremely large and complex high-dimensional datasets. We have used data such as these in the development and advancement of the field of vaccinomics to enable prediction of vaccine responses and to develop new vaccine candidates. However, the application of systems biology to what has been termed "big data," or "high-dimensional data," is not without significant challenges, chief among them a paucity of gold standard analysis and modeling paradigms with which to interpret the data. In this article, we relate some of the lessons we have learned over the last decade of working with high-dimensional, high-throughput data as applied to the field of vaccinomics. The value of such efforts, however, is ultimately to better understand the immune mechanisms by which protective and non-protective responses to vaccines are generated, and to use this information to support a personalized vaccinology approach in creating better, and safer, vaccines for the public health.
Affiliation(s)
- Ann L Oberg
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA; Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, MN, USA
| | - Brett A McKinney
- Tandy School of Computer Science, Department of Mathematics, University of Tulsa, Tulsa, OK, USA
| | - Daniel J Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA; Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, MN, USA
| | - V Shane Pankratz
- UNM Health Sciences Library & Informatics Center, Division of Nephrology, University of New Mexico, Albuquerque, NM, USA
| | | | - Gregory A Poland
- Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, MN, USA.
|
19
|
Tsementzi D, Poretsky R, Rodriguez-R LM, Luo C, Konstantinidis KT. Evaluation of metatranscriptomic protocols and application to the study of freshwater microbial communities. ENVIRONMENTAL MICROBIOLOGY REPORTS 2014; 6:640-655. [PMID: 25756118 DOI: 10.1111/1758-2229.12180] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 06/04/2023]
Abstract
Metatranscriptomics of environmental samples enables the identification of community activities without a priori knowledge of taxonomic or functional composition. However, several technical challenges associated with the RNA preparation protocols can affect the relative representation of transcripts and data interpretation. Here, seven replicate metatranscriptomes from planktonic freshwater samples (Lake Lanier, USA) were sequenced to evaluate technical and biological reproducibility of different RNA extraction protocols. Organic versus bead-beating extraction showed significant enrichment for low versus high G + C% mRNA populations respectively. The sequencing data were best modelled by a negative binomial distribution to account for the large technical and biological variation observed. Despite the variation, the transcriptional activities of populations that persisted in year-round metagenomes from the same site consistently showed distinct expression patterns, reflecting different ecologic strategies and allowing us to test prevailing models on the contribution of both rare biosphere and abundant members to community activity. For instance, abundant members of the Verrucomicrobia phylum systematically showed low transcriptional activity compared with other abundant taxa. Our results provide a practical guide to the analysis of metatranscriptomes and advance understanding of the activity and ecology of abundant and rare members of temperate freshwater microbial communities.
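The remark that the counts were best modelled by a negative binomial distribution can be made concrete with a small sketch. It assumes the common parameterisation var = mu + alpha * mu^2 and a simple method-of-moments estimate of the dispersion alpha from replicate counts; the abstract does not state how the authors fitted the model, so this is an illustration only, and the function name is hypothetical.

```python
from statistics import mean, variance


def nb_dispersion(counts):
    """Method-of-moments dispersion estimate for replicate read counts.

    Assumes var = mu + alpha * mu**2, so alpha = (var - mu) / mu**2.
    Returns 0.0 when the data show no more variance than a Poisson model.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance, n - 1 denominator
    return max((var - mu) / mu ** 2, 0.0)


# Invented replicate counts for a single transcript.
print(nb_dispersion([10, 20, 30, 40]))  # overdispersed: alpha > 0
print(nb_dispersion([5, 5, 5, 5]))      # no excess variance: alpha == 0.0
```

A dispersion above zero means the replicates vary more than Poisson sampling alone would predict, which is the situation the abstract attributes to large technical and biological variation.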
|
20
|
Finotello F, Di Camillo B. Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis. Brief Funct Genomics 2014; 14:130-42. [PMID: 25240000 DOI: 10.1093/bfgp/elu035] [Citation(s) in RCA: 137] [Impact Index Per Article: 13.7] [Indexed: 12/20/2022] Open
Abstract
RNA-seq is a methodology for RNA profiling based on next-generation sequencing that enables the measurement and comparison of gene expression patterns at unprecedented resolution. Although the appealing features of this technique have promoted its application to a wide panel of transcriptomics studies, the fast-evolving nature of experimental protocols and computational tools challenges the definition of a unified RNA-seq analysis pipeline. In this review, focused on the study of differential gene expression with RNA-seq, we go through the main steps of data processing and discuss open challenges and possible solutions.
|
21
|
Gatto A, Torroja-Fungairiño C, Mazzarotto F, Cook SA, Barton PJR, Sánchez-Cabo F, Lara-Pezzi E. FineSplice, enhanced splice junction detection and quantification: a novel pipeline based on the assessment of diverse RNA-Seq alignment solutions. Nucleic Acids Res 2014; 42:e71. [PMID: 24574529 PMCID: PMC4005686 DOI: 10.1093/nar/gku166] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 01/26/2023] Open
Abstract
Alternative splicing is the main mechanism governing protein diversity. The recent developments in RNA-Seq technology have enabled the study of the global impact and regulation of this biological process. However, the lack of standardized protocols constitutes a major bottleneck in the analysis of alternative splicing. This is particularly important for the identification of exon–exon junctions, which is a critical step in any analysis workflow. Here we performed a systematic benchmarking of alignment tools to dissect the impact of design and method on the mapping, detection and quantification of splice junctions from multi-exon reads. Accordingly, we devised a novel pipeline based on TopHat2 combined with a splice junction detection algorithm, which we have named FineSplice. FineSplice allows effective elimination of spurious junction hits arising from artefactual alignments, achieving up to 99% precision in both real and simulated data sets and yielding superior F1 scores under most tested conditions. The proposed strategy conjugates an efficient mapping solution with a semi-supervised anomaly detection scheme to filter out false positives and allows reliable estimation of expressed junctions from the alignment output. Ultimately this provides more accurate information to identify meaningful splicing patterns. FineSplice is freely available at https://sourceforge.net/p/finesplice/.
Affiliation(s)
- Alberto Gatto
- Cardiovascular Development and Repair Department, Centro Nacional de Investigaciones Cardiovasculares, Madrid, 28029, Spain, Bioinformatics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, 28029, Spain, National Heart and Lung Institute, Imperial College London, London SW7 2AZ, UK, Cardiovascular Biomedical Research Unit, NIHR Royal Brompton and Harefield NHS Foundation Trust, London SW3 6NP, UK, Department of Cardiology, National Heart Centre Singapore, Singapore 168752, Singapore and Cardiovascular and Metabolic Disorders Program, Duke-NUS Graduate Medical School, Singapore 169857, Singapore
|
22
|
Wang X, Cairns MJ. SeqGSEA: a Bioconductor package for gene set enrichment analysis of RNA-Seq data integrating differential expression and splicing. Bioinformatics 2014; 30:1777-9. [PMID: 24535097 DOI: 10.1093/bioinformatics/btu090] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 11/12/2022]
Abstract
SUMMARY: SeqGSEA is an open-source Bioconductor package for the functional integration of differential expression and splicing analysis in RNA-Seq data. SeqGSEA implements an analysis pipeline, which first computes differential splicing and differential expression scores, followed by integrating them into a per-gene score that quantifies each gene's association with a phenotype of interest, and finally executes gene set enrichment analysis in a cutoff-free manner to achieve biological insights. SeqGSEA accounts for biological variability and determines the statistical significance of gene pathways and networks using subject permutation, and thus requires at least five samples per group. Real applications show that SeqGSEA detects more biologically meaningful gene sets without biases toward long or highly expressed genes. SeqGSEA can be set up to run in parallel to reduce the analysis time. AVAILABILITY AND IMPLEMENTATION: The SeqGSEA package with a vignette is available at http://bioconductor.org/packages/release/bioc/html/SeqGSEA.html.
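The pipeline summarised above (per-gene integration of two scores, then cutoff-free enrichment on the ranked genes) can be sketched in a few lines. This is not the SeqGSEA API: the abstract does not say how the scores are combined, so the weighted sum below is an assumption, the enrichment statistic is a plain unweighted running sum, the subject-permutation significance step is omitted, and all names and numbers are hypothetical.

```python
def integrate_scores(de, ds, weight=0.5):
    """Combine per-gene differential expression (de) and differential
    splicing (ds) scores into one score per gene.

    Assumption: each score is scaled by its maximum and the two are mixed
    with a weighted sum; both dicts must share the same gene keys.
    """
    de_max = max(de.values()) or 1.0
    ds_max = max(ds.values()) or 1.0
    return {g: weight * de[g] / de_max + (1 - weight) * ds[g] / ds_max
            for g in de}


def enrichment_score(gene_scores, gene_set):
    """Unweighted running-sum enrichment score over genes ranked by score.

    Walks the ranking from the top, stepping up at set members and down
    otherwise, and returns the largest deviation from zero. No score
    cutoff is applied: every gene contributes through its rank.
    """
    ranked = sorted(gene_scores, key=gene_scores.get, reverse=True)
    hits = [g in gene_set for g in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running = best = 0.0
    for is_hit in hits:
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Invented scores for four genes.
de = {"A": 4, "B": 2, "C": 1, "D": 0}
ds = {"A": 2, "B": 2, "C": 0, "D": 1}
scores = integrate_scores(de, ds)
print(enrichment_score(scores, {"A", "B"}))
```

In a full analysis the observed enrichment score would be compared against scores recomputed after permuting sample labels, which is why the abstract notes a minimum number of samples per group.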
Affiliation(s)
- Xi Wang
- School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, University of Newcastle, Newcastle, Hunter Medical Research Institute, New Lambton, and Schizophrenia Research Institute, Sydney, NSW, Australia
| | - Murray J Cairns
- School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, University of Newcastle, Newcastle, Hunter Medical Research Institute, New Lambton, and Schizophrenia Research Institute, Sydney, NSW, Australia
|
23
|
McKinney BA, White BC, Grill DE, Li PW, Kennedy RB, Poland GA, Oberg AL. ReliefSeq: a gene-wise adaptive-K nearest-neighbor feature selection tool for finding gene-gene interactions and main effects in mRNA-Seq gene expression data. PLoS One 2013; 8:e81527. [PMID: 24339943 PMCID: PMC3858248 DOI: 10.1371/journal.pone.0081527] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 05/17/2013] [Accepted: 10/14/2013] [Indexed: 11/29/2022] Open
Abstract
Relief-F is a nonparametric, nearest-neighbor machine learning method that has been successfully used to identify relevant variables that may interact in complex multivariate models to explain phenotypic variation. While several tools have been developed for assessing differential expression in sequence-based transcriptomics, the detection of statistical interactions between transcripts has received less attention in the area of RNA-seq analysis. We describe a new extension and assessment of Relief-F for feature selection in RNA-seq data. The ReliefSeq implementation adapts the number of nearest neighbors (k) for each gene to optimize the Relief-F test statistics (importance scores) for finding both main effects and interactions. We compare this gene-wise adaptive-k (gwak) Relief-F method with standard RNA-seq feature selection tools, such as DESeq and edgeR, and with the popular machine learning method Random Forests. We demonstrate performance on a panel of simulated data that have a range of distributional properties reflected in real mRNA-seq data including multiple transcripts with varying sizes of main effects and interaction effects. For simulated main effects, gwak-Relief-F feature selection performs comparably to standard tools DESeq and edgeR for ranking relevant transcripts. For gene-gene interactions, gwak-Relief-F outperforms all comparison methods at ranking relevant genes in all but the highest fold change/highest signal situations where it performs similarly. The gwak-Relief-F algorithm outperforms Random Forests for detecting relevant genes in all simulation experiments. In addition, Relief-F is comparable to the other methods based on computational time. We also apply ReliefSeq to an RNA-Seq study of smallpox vaccine to identify gene expression changes between vaccinia virus-stimulated and unstimulated samples. 
ReliefSeq is an attractive tool for inclusion in the suite of tools used for analysis of mRNA-Seq data; it has power to detect both main effects and interaction effects. Software Availability: http://insilico.utulsa.edu/ReliefSeq.php.
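The nearest-neighbor idea behind Relief-F can be illustrated with a minimal sketch. This is the basic Relief-style update with a single fixed k for all genes, not the gene-wise adaptive-k (gwak) procedure of ReliefSeq; the function name `relief_scores`, the Manhattan distance and the min-max scaling are assumptions made for illustration only.

```python
import numpy as np

def relief_scores(X, y, k=1):
    """Basic Relief-style importance scores for a two-class phenotype.

    X: (n_samples, n_features) expression matrix; y: (n_samples,) class labels.
    A feature gains weight when it differs between a sample and its nearest
    other-class neighbours (misses) and loses weight when it differs from
    its nearest same-class neighbours (hits).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Z = (X - X.min(axis=0)) / span
    scores = np.zeros(p)
    for i in range(n):
        diff = np.abs(Z - Z[i])        # per-feature difference to sample i
        dist = diff.sum(axis=1)        # Manhattan distance (an assumption)
        dist[i] = np.inf               # never select the sample itself
        same = np.where(y == y[i])[0]
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(dist[same])][:k]
        misses = other[np.argsort(dist[other])][:k]
        scores += diff[misses].mean(axis=0) - diff[hits].mean(axis=0)
    return scores / n
```

Genes would then be ranked by decreasing score. An adaptive scheme in the spirit of the abstract would repeat this over a range of k and keep, per gene, the k that maximises its score; that step is left out here because the abstract does not specify it.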
Affiliation(s)
- Brett A. McKinney
- Tandy School of Computer Science, Department of Mathematics, University of Tulsa, Tulsa, Oklahoma, United States of America
- Laureate Institute for Brain Research, Tulsa, Oklahoma, United States of America
- Bill C. White
- Tandy School of Computer Science, Department of Mathematics, University of Tulsa, Tulsa, Oklahoma, United States of America
- Diane E. Grill
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, Minnesota, United States of America
- Peter W. Li
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, Minnesota, United States of America
- Richard B. Kennedy
- Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, Minnesota, United States of America
- Department of Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
- Program in Translational Immunovirology and Biodefense, Mayo Clinic, Rochester, Minnesota, United States of America
- Gregory A. Poland
- Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, Minnesota, United States of America
- Department of Medicine, Mayo Clinic, Rochester, Minnesota, United States of America
- Program in Translational Immunovirology and Biodefense, Mayo Clinic, Rochester, Minnesota, United States of America
- Ann L. Oberg
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America
- Mayo Clinic Vaccine Research Group, Mayo Clinic, Rochester, Minnesota, United States of America
|
24
|
Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. Calculating sample size estimates for RNA sequencing data. J Comput Biol 2013; 20:970-8. [PMID: 23961961 DOI: 10.1089/cmb.2012.0283] [Citation(s) in RCA: 199] [Impact Index Per Article: 18.1] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and 2) how many biological replicates are necessary to observe a significant change in expression? RESULTS Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments. CONCLUSIONS Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to perform these calculations for their own experiments.
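To make the trade-off between depth, variability and replicate number concrete, the following sketch computes an approximate sample size per group. The abstract does not state the model, so the formula used here, n = 2(z_{1-α/2} + z_{power})² (1/μ + σ²) / (ln Δ)², is an assumption: a commonly used approximation for negative-binomial count data. The function and parameter names (`samples_per_group`, `depth`, `cv`, `effect`) are hypothetical.

```python
import math
from statistics import NormalDist

def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Approximate number of biological replicates per group.

    depth:  average read count for the gene of interest (mu)
    cv:     biological coefficient of variation (sigma)
    effect: fold change to detect (Delta), e.g. 2 for a doubling
    alpha:  two-sided type I error rate
    power:  desired statistical power
    """
    if depth <= 0 or effect <= 0 or effect == 1:
        raise ValueError("depth must be > 0 and effect must be > 0 and != 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Technical (Poisson) noise shrinks with depth; biological noise does not.
    variance_term = 1.0 / depth + cv ** 2
    return 2 * (z_alpha + z_power) ** 2 * variance_term / math.log(effect) ** 2

# Example: 20 reads per gene, CV of 0.4, two-fold change.
n = math.ceil(samples_per_group(depth=20, cv=0.4, effect=2))
```

Under this approximation, increasing depth only reduces the 1/μ term, so beyond a moderate depth the required replicate number is dominated by biological variation.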
Affiliation(s)
- Steven N Hart
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
|
25
|
Kennedy RB, Oberg AL, Ovsyannikova IG, Haralambieva IH, Grill D, Poland GA. Transcriptomic profiles of high and low antibody responders to smallpox vaccine. Genes Immun 2013; 14:277-85. [PMID: 23594957 PMCID: PMC3723701 DOI: 10.1038/gene.2013.14] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 12/13/2012] [Revised: 03/13/2013] [Accepted: 03/15/2013] [Indexed: 12/21/2022]
Abstract
Despite its eradication over 30 years ago, smallpox (as well as other orthopox viruses) remains a pathogen of interest both in terms of biodefense and for its use as a vector for vaccines and immunotherapies. Here we describe the application of mRNA-Seq transcriptome profiling to understanding immune responses in smallpox vaccine recipients. Contrary to other studies examining gene expression in virally infected cell lines, we utilized a mixed population of peripheral blood mononuclear cells in order to capture the essential intercellular interactions that occur in vivo, and would otherwise be lost, using single cell lines or isolated primary cell subsets. In this mixed cell population we were able to detect expression of all annotated vaccinia genes. On the host side, a number of genes encoding cytokines, chemokines, complement factors and intracellular signaling molecules were downregulated upon viral infection, whereas genes encoding histone proteins and the interferon response were upregulated. We also identified a small number of genes that exhibited significantly different expression profiles in subjects with robust humoral immunity compared with those with weaker humoral responses. Our results provide evidence that differential gene regulation patterns may be at work in individuals with robust humoral immunity compared with those with weaker humoral immune responses.
Affiliation(s)
- Richard B. Kennedy
- Mayo Vaccine Research Group, Mayo Clinic, Rochester MN, USA
- Program in Translational Immunovirology and Biodefense, Mayo Clinic, Rochester MN, USA
- Ann L. Oberg
- Mayo Vaccine Research Group, Mayo Clinic, Rochester MN, USA
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Diane Grill
- Mayo Vaccine Research Group, Mayo Clinic, Rochester MN, USA
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Gregory A. Poland
- Mayo Vaccine Research Group, Mayo Clinic, Rochester MN, USA
- Program in Translational Immunovirology and Biodefense, Mayo Clinic, Rochester MN, USA
|
26
|
Ferté C, Trister AD, Huang E, Bot BM, Guinney J, Commo F, Sieberts S, André F, Besse B, Soria JC, Friend SH. Impact of bioinformatic procedures in the development and translation of high-throughput molecular classifiers in oncology. Clin Cancer Res 2013; 19:4315-25. [PMID: 23780890 DOI: 10.1158/1078-0432.ccr-12-3937] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Indexed: 01/05/2023]
Abstract
The progressive introduction of high-throughput molecular techniques in the clinic allows for the extensive and systematic exploration of multiple biologic layers of tumors. Molecular profiles and classifiers generated from these assays represent the foundation of what the National Academy describes as the future of "precision medicine". However, the analysis of such complex data requires the implementation of sophisticated bioinformatic and statistical procedures. It is critical that oncology practitioners be aware of the advantages and limitations of the methods used to generate classifiers to usher them into the clinic. This article uses publicly available expression data from patients with non-small cell lung cancer to first illustrate the challenges of experimental design and preprocessing of data before clinical application and highlights the challenges of high-dimensional statistical analysis. It provides a roadmap for the translation of such classifiers to clinical practice and makes key recommendations for good practice.
|
27
|
Genome-wide characterization of transcriptional patterns in high and low antibody responders to rubella vaccination. PLoS One 2013; 8:e62149. [PMID: 23658707 PMCID: PMC3641062 DOI: 10.1371/journal.pone.0062149] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 11/26/2012] [Accepted: 03/18/2013] [Indexed: 12/16/2022] Open
Abstract
Immune responses to current rubella vaccines demonstrate significant inter-individual variability. We performed mRNA-Seq profiling on PBMCs from high and low antibody responders to rubella vaccination to delineate transcriptional differences upon viral stimulation. Generalized linear models were used to assess the per gene fold change (FC) for stimulated versus unstimulated samples or the interaction between outcome and stimulation. Model results were evaluated by both FC and p-value. Pathway analysis and self-contained gene set tests were performed for assessment of gene group effects. Of 17,566 detected genes, we identified 1,080 highly significant differentially expressed genes upon viral stimulation (p<1.00E−15, FDR<1.00E−14), including various immune function and inflammation-related genes, genes involved in cell signaling, cell regulation and transcription, and genes with unknown function. Analysis by immune outcome and stimulation status identified 27 genes (p≤0.0006 and FDR≤0.30) that responded differently to viral stimulation in high vs. low antibody responders, including major histocompatibility complex (MHC) class I genes (HLA-A, HLA-B and B2M with p = 0.0001, p = 0.0005 and p = 0.0002, respectively), and two genes related to innate immunity and inflammation (EMR3 and MEFV with p = 1.46E−08 and p = 0.0004, respectively). Pathway and gene set analysis also revealed transcriptional differences in antigen presentation and innate/inflammatory gene sets and pathways between high and low responders. Using mRNA-Seq genome-wide transcriptional profiling, we identified antigen presentation and innate/inflammatory genes that may assist in explaining rubella vaccine-induced immune response variations. Such information may provide new scientific insights into vaccine-induced immunity useful in rational vaccine development and immune response monitoring.
|
28
|
Wang X, Cairns MJ. Gene set enrichment analysis of RNA-Seq data: integrating differential expression and splicing. BMC Bioinformatics 2013; 14 Suppl 5:S16. [PMID: 23734663 PMCID: PMC3622641 DOI: 10.1186/1471-2105-14-s5-s16] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND RNA-Seq has become a key technology in transcriptome studies because it can quantify overall expression levels and the degree of alternative splicing for each gene simultaneously. To interpret high-throughput transcriptome profiling data, functional enrichment analysis is critical. However, existing functional analysis methods can only account for differential expression, leaving differential splicing out altogether. RESULTS In this work, we present a novel approach to derive biological insight by integrating differential expression and splicing from RNA-Seq data with functional gene set analysis. This approach, designated SeqGSEA, uses count data modelling with negative binomial distributions to first score differential expression and splicing in each gene, respectively, followed by two strategies to combine the two scores for integrated gene set enrichment analysis. Method comparison results and biological insight analysis on an artificial data set and three real RNA-Seq data sets indicate that our approach outperforms alternative analysis pipelines and can detect biologically meaningful gene sets with high confidence, and that it has the ability to determine if transcription or splicing is their predominant regulatory mechanism. CONCLUSIONS By integrating differential expression and splicing, the proposed method SeqGSEA is particularly useful for efficiently translating RNA-Seq data to biological discoveries.
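The two steps described above, combining per-gene scores and then running a cutoff-free enrichment analysis, can be sketched as follows. This is not the SeqGSEA implementation: the abstract does not name its two combination strategies, so the linear combination in `integrate_scores` and its `weight` parameter are hypothetical placeholders, and `enrichment_score` is the classic weighted running-sum (Kolmogorov-Smirnov-like) statistic used in gene set enrichment analysis generally.

```python
import numpy as np

def integrate_scores(de, ds, weight=0.5):
    """Hypothetical integration: weighted sum of differential expression (de)
    and differential splicing (ds) scores, one value per gene."""
    de = np.asarray(de, dtype=float)
    ds = np.asarray(ds, dtype=float)
    return weight * de + (1.0 - weight) * ds

def enrichment_score(gene_scores, in_set):
    """Running-sum enrichment score over all genes, with no score cutoff.

    gene_scores: per-gene scores; in_set: boolean mask marking gene-set members.
    Assumes the set is non-empty, does not cover all genes, and has at least
    one member with a non-zero score.
    """
    scores = np.asarray(gene_scores, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    order = np.argsort(-scores)             # rank genes from highest to lowest
    hits = in_set[order]
    weights = np.abs(scores[order]) * hits
    step_up = weights / weights.sum()       # increments at gene-set members
    step_down = (~hits) / (~hits).sum()     # decrements at all other genes
    running = np.cumsum(step_up - step_down)
    return running[np.argmax(np.abs(running))]  # signed maximum deviation
```

Significance would then be assessed by recomputing the score after permuting sample labels many times and comparing the observed value with that null distribution; the permutation loop is omitted here because it depends on how the per-gene scores are produced.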
Affiliation(s)
- Xi Wang
- School of Biomedical Sciences and Pharmacy, The University of Newcastle, Callaghan, New South Wales, Australia
|
29
|
Kloster MB, Bilgrau AE, Rodrigo-Domingo M, Bergkvist KS, Schmitz A, Sønderkær M, Bødker JS, Falgreen S, Nyegaard M, Johnsen HE, Nielsen KL, Dybkaer K, Bøgsted M. A model system for assessing and comparing the ability of exon microarray and tag sequencing to detect genes specific for malignant B-cells. BMC Genomics 2012; 13:596. [PMID: 23127183 PMCID: PMC3505742 DOI: 10.1186/1471-2164-13-596] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/25/2012] [Accepted: 10/11/2012] [Indexed: 11/30/2022] Open
Abstract
Background Malignant cells in tumours of B-cell origin account for 0.1% to 98% of the total cell content, depending on disease entity. Recently, gene expression profiles (GEPs) of B-cell lymphomas based on microarray technologies have contributed significantly to improved sub-classification and diagnostics. However, the varying degrees of malignant B-cell frequencies in analysed samples influence the interpretation of the GEPs. Based on emerging next-generation sequencing technologies (NGS) like tag sequencing (tag-seq) for GEP, it is expected that the detection of mRNA transcripts from malignant B-cells can be supplemented. This study provides a quantitative assessment and comparison of the ability of microarrays and tag-seq to detect mRNA transcripts from malignant B-cells. A model system was established by eight serial dilutions of the malignant B-cell lymphoma cell line, OCI-Ly8, into the embryonic kidney cell line, HEK293, prior to parallel analysis by exon microarrays and tag-seq. Results We identified 123 and 117 differentially expressed genes between pure OCI-Ly8 and HEK293 cells by exon microarray and tag-seq, respectively. There were thirty genes in common, and of those, most were B-cell specific. Hierarchical clustering from all dilutions based on the differentially expressed genes showed that neither technology could distinguish between samples with less than 1% malignant B-cells from non-B-cells. A novel statistical concept was developed to assess the ability to detect single genes for both technologies, and used to demonstrate an inverse proportional relationship with the sample purity. Of the 30 common genes, the detection capability of a representative set of three B-cell specific genes - CD74, HLA-DRA, and BCL6 - was analysed. 
At least 5%, 13% and 22% sample purity, respectively, was required for detection of the three genes by exon microarray, whereas at least 2%, 4% and 51% sample purity of malignant B-cells was required for tag-seq detection. Conclusion A sample purity-dependent loss of the ability to detect genes for both technologies was demonstrated. Tag-seq, in comparison to exon microarray, required slightly fewer malignant B-cells in the samples analysed in order to detect the two most abundantly expressed of the selected genes. The results show that malignant cell frequency is an important variable, with fundamental impact when interpreting GEPs from both technologies.
Affiliation(s)
- Maria Bro Kloster
- Department of Haematology, Aalborg Hospital, Aarhus University Hospital, Sdr. Skovvej 15, 9000 Aalborg, Denmark
|