1. Lio CT, Düz T, Hoffmann M, Willruth LL, Baumbach J, List M, Tsoy O. Comprehensive benchmark of differential transcript usage analysis for static and dynamic conditions. bioRxiv 2024:2024.01.14.575548. [PMID: 38313260 PMCID: PMC10836064 DOI: 10.1101/2024.01.14.575548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/06/2024]
Abstract
RNA sequencing offers unique insights into transcriptome diversity, and a plethora of tools have been developed to analyze alternative splicing. One important task is to detect changes in the relative transcript abundance in differential transcript usage (DTU) analysis. The choice of the right analysis tool is non-trivial and depends on experimental factors such as the availability of single- or paired-end and bulk or single-cell data. To help users select the most promising tool for their task, we performed a comprehensive benchmark of DTU detection tools. We cover a wide array of experimental settings, using simulated bulk and single-cell RNA-seq data as well as real transcriptomics datasets, including time-series data. Our results suggest that DEXSeq, edgeR, and LimmaDS are better choices for paired-end data, while DSGseq and DEXSeq can be used for single-end data. In single-cell simulation settings, we showed that satuRn performs better than DTUrtle. In addition, we showed that Spycone is optimal for time-series DTU/IS analysis, based on evidence from GO term enrichment analysis.
Affiliation(s)
- Chit Tong Lio
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
- Tolga Düz
  - Chair of Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
- Markus Hoffmann
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
  - Institute for Advanced Study, Technical University of Munich, Garching D-85748, Germany
  - National Institute of Diabetes, Digestive, and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Lina-Liv Willruth
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
- Jan Baumbach
  - Chair of Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Campusvej 55, 5000 Odense, Denmark
- Markus List
  - Data Science in Systems Biology, Technical University of Munich, 85354 Freising, Germany
- Olga Tsoy
  - Chair of Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
2. Heiling HM, Wilson DR, Rashid NU, Sun W, Ibrahim JG. Estimating cell type composition using isoform expression one gene at a time. Biometrics 2023; 79:854-865. [PMID: 34921386 PMCID: PMC11245124 DOI: 10.1111/biom.13614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2020] [Accepted: 12/08/2021] [Indexed: 11/29/2022]
Abstract
Human tissue samples are often mixtures of heterogeneous cell types, which can confound the analyses of gene expression data derived from such tissues. The cell type composition of a tissue sample may itself be of interest and is needed for proper analysis of differential gene expression. A variety of computational methods have been developed to estimate cell type proportions using gene-level expression data. However, RNA isoforms can also be differentially expressed across cell types, and isoform-level expression could be equally or more informative for determining cell type origin than gene-level expression. We propose a new computational method, IsoDeconvMM, which estimates cell type fractions using isoform-level gene expression data. A novel and useful feature of IsoDeconvMM is that it can estimate cell type proportions using only a single gene, though in practice we recommend aggregating estimates of a few dozen genes to obtain more accurate results. We demonstrate the performance of IsoDeconvMM using a unique data set with cell type-specific RNA-seq data across more than 135 individuals. This data set allows us to evaluate different methods given the biological variation of cell type-specific gene expression data across individuals. We further complement this analysis with additional simulations.
Affiliation(s)
- Hillary M Heiling
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
- Douglas R Wilson
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
- Naim U Rashid
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
- Wei Sun
  - Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Joseph G Ibrahim
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina
3. Kim J, Zhang Y, Day J, Zhou H. MGLM: An R Package for Multivariate Categorical Data Analysis. The R Journal 2018; 10:73-90. [PMID: 32523781 DOI: 10.32614/rj-2018-015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/11/2022]
Abstract
Data with multiple responses is ubiquitous in modern applications. However, few tools are available for regression analysis of multivariate counts. The most popular multinomial-logit model has a very restrictive mean-variance structure, limiting its applicability to many data sets. This article introduces an R package MGLM, short for multivariate response generalized linear models, that expands the current tools for regression analysis of polytomous data. Distribution fitting, random number generation, regression, and sparse regression are treated in a unifying framework. The algorithm, usage, and implementation details are discussed.
Affiliation(s)
- Juhyun Kim
  - Department of Biostatistics, University of California, Los Angeles
- Joshua Day
  - Department of Statistics, North Carolina State University
- Hua Zhou
  - Department of Biostatistics, University of California, Los Angeles
4. Comparative genomic evidence for the involvement of schizophrenia risk genes in antipsychotic effects. Mol Psychiatry 2018; 23:708-712. [PMID: 28555076 PMCID: PMC5709242 DOI: 10.1038/mp.2017.111] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 02/17/2016] [Revised: 03/08/2017] [Accepted: 04/12/2017] [Indexed: 01/14/2023]
Abstract
Genome-wide association studies (GWAS) for schizophrenia have identified over 100 loci encoding >500 genes. It is unclear whether any of these genes, other than dopamine receptor D2, are immediately relevant to antipsychotic effects or represent novel antipsychotic targets. We applied an in vivo molecular approach to this question by performing RNA sequencing of brain tissue from mice chronically treated with the antipsychotic haloperidol or vehicle. We observed significant enrichments of haloperidol-regulated genes in schizophrenia GWAS loci and in schizophrenia-associated biological pathways. Our findings provide empirical support for overlap between genetic variation underlying the pathophysiology of schizophrenia and the molecular effects of a prototypical antipsychotic.
5. Wang W, Sun W, Wang W, Szatkiewicz J. A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection. BMC Bioinformatics 2018; 19:74. [PMID: 29490610 PMCID: PMC5831535 DOI: 10.1186/s12859-018-2077-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/09/2017] [Accepted: 02/20/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has a quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate here the utility of our approach in the application of detecting copy number variants (CNVs) using a real example.
RESULTS: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection.
CONCLUSIONS: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
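The idea of sampling the read-count data and solving the estimation problem on a smaller scale can be illustrated with a minimal sketch. This is not the authors' RGE or GENSENG: it assumes a plain Poisson log-link GLM instead of the negative-binomial model, uniform sampling without replacement, and simulated counts with a single hypothetical bias covariate. The function name `fit_poisson_glm` and all parameter values are illustrative assumptions.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=25):
    """Fit a log-link Poisson GLM by Newton's method.

    Assumes the first column of X is an intercept (all ones).
    Poisson is a simplified stand-in for the NB model in the abstract.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)               # fitted means
        grad = X.T @ (y - mu)               # score vector
        hess = X.T @ (X * mu[:, None])      # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n_windows, n_sampled = 100_000, 10_000      # hypothetical sizes
beta_true = np.array([1.0, 0.5])            # hypothetical coefficients

# Simulated windowed read counts with one bias covariate (hypothetical).
X = np.column_stack([np.ones(n_windows), rng.normal(size=n_windows)])
y = rng.poisson(np.exp(X @ beta_true))

# Estimate on all windows, then on a uniform random subsample.
beta_full = fit_poisson_glm(X, y)
idx = rng.choice(n_windows, size=n_sampled, replace=False)
beta_sub = fit_poisson_glm(X[idx], y[idx])

print("full data:", beta_full)
print("subsample:", beta_sub)
```

In this toy setting the subsample estimate uses a tenth of the windows and should land close to the full-data estimate, with a somewhat larger variance. The consistency and variance results reported in the abstract concern the authors' own estimator, not this sketch.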
Affiliation(s)
- WeiBo Wang
  - Department of Computer Science, University of North Carolina at Chapel Hill, 201 S. Columbia St., Chapel Hill, 27599-3175 USA
- Wei Sun
  - Biostatistics Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, 19024 USA
- Wei Wang
  - Department of Computer Science, University of California, Los Angeles, 580 Portola Plaza, Los Angeles, 90095-1596 USA
- Jin Szatkiewicz
  - Department of Genetics, University of North Carolina at Chapel Hill, 120 Mason Farm Road, Chapel Hill, 27599-7264 USA
6. Zhou YH, Cichocki JA, Soldatow VY, Scholl EH, Gallins PJ, Jima D, Yoo HS, Chiu WA, Wright FA, Rusyn I. Editor's Highlight: Comparative Dose-Response Analysis of Liver and Kidney Transcriptomic Effects of Trichloroethylene and Tetrachloroethylene in B6C3F1 Mouse. Toxicol Sci 2017; 160:95-110. [PMID: 28973375 PMCID: PMC5837274 DOI: 10.1093/toxsci/kfx165] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 01/02/2023]
Abstract
Trichloroethylene (TCE) and tetrachloroethylene (PCE) are ubiquitous environmental contaminants and occupational health hazards. Recent health assessments of these agents identified several critical data gaps, including lack of comparative analysis of their effects. This study examined liver and kidney effects of TCE and PCE in a dose-response study design. Equimolar doses of TCE (24, 80, 240, and 800 mg/kg) or PCE (30, 100, 300, and 1000 mg/kg) were administered by gavage in aqueous vehicle to male B6C3F1/J mice. Tissues were collected 24 h after exposure. Trichloroacetic acid (TCA), a major oxidative metabolite of both compounds, was measured and RNA sequencing was performed. PCE had a stronger effect on liver and kidney transcriptomes, as well as greater concentrations of TCA. Most dose-responsive pathways were common among chemicals/tissues, with the strongest effect on peroxisomal β-oxidation. Effects on liver and kidney mitochondria-related pathways were notably unique to PCE. We performed dose-response modeling of the transcriptomic data and compared the resulting points of departure (PODs) to those for apical endpoints derived from long-term studies with these chemicals in rats, mice, and humans, converting to human equivalent doses using tissue-specific dosimetry models. Tissue-specific acute transcriptional effects of TCE and PCE occurred at human equivalent doses comparable to those for apical effects. These data are relevant for human health assessments of TCE and PCE as they provide data for dose-response analysis of the toxicity mechanisms. Additionally, they provide further evidence that transcriptomic data can be useful surrogates for in vivo PODs, especially when toxicokinetic differences are taken into account.
Affiliation(s)
- Yi-Hui Zhou
  - Department of Biological Sciences
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
- Joseph A. Cichocki
  - Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, Texas
- Valerie Y. Soldatow
  - Department of Environmental Sciences and Engineering, University of North Carolina, Chapel Hill, North Carolina
- Elizabeth H. Scholl
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
- Paul J. Gallins
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
- Dereje Jima
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
- Hong-Sik Yoo
  - Department of Environmental Sciences and Engineering, University of North Carolina, Chapel Hill, North Carolina
- Weihsueh A. Chiu
  - Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, Texas
- Fred A. Wright
  - Department of Biological Sciences
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Ivan Rusyn
  - Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, Texas
7. Zhang Y, Zhou H, Zhou J, Sun W. Regression Models For Multivariate Count Data. J Comput Graph Stat 2017; 26:1-13. [PMID: 28348500 PMCID: PMC5365157 DOI: 10.1080/10618600.2016.1154063] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 05/01/2015] [Revised: 10/01/2015] [Indexed: 10/22/2022]
Abstract
Data with multivariate count responses frequently occur in modern applications. The commonly used multinomial-logit model is limiting due to its restrictive mean-variance structure. For instance, analyzing count data from the recent RNA-seq technology by the multinomial-logit model leads to serious errors in hypothesis testing. The ubiquity of over-dispersion and complicated correlation structures among multivariate counts calls for more flexible regression models. In this article, we study some generalized linear models that incorporate various correlation structures among the counts. Current literature lacks a treatment of these models, partly due to the fact that they do not belong to the natural exponential family. We study the estimation, testing, and variable selection for these models in a unifying framework. The regression models are compared on both synthetic and real RNA-seq data.
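The "restrictive mean-variance structure" of the multinomial model can be made concrete with a standard comparison. The abstract does not name the alternative models it studies; the Dirichlet-multinomial below is chosen only as a familiar overdispersed example, and the symbols $m$, $p_j$, $\alpha_j$, and $\alpha_0$ are notation introduced here.

```latex
% Multinomial counts Y = (Y_1, ..., Y_d) with m trials and
% category probabilities p_1, ..., p_d:
\mathbb{E}[Y_j] = m p_j, \qquad
\operatorname{Var}(Y_j) = m p_j (1 - p_j)

% Dirichlet-multinomial with parameters \alpha_1, ..., \alpha_d,
% where \alpha_0 = \sum_j \alpha_j and p_j = \alpha_j / \alpha_0:
\mathbb{E}[Y_j] = m p_j, \qquad
\operatorname{Var}(Y_j) = m p_j (1 - p_j) \, \frac{m + \alpha_0}{1 + \alpha_0}
```

Under the multinomial, the variance is fully determined by the mean and the total count. The Dirichlet-multinomial inflates it by the factor $(m + \alpha_0)/(1 + \alpha_0) \ge 1$, which is one way over-dispersed counts can be accommodated.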
Affiliation(s)
- Yiwen Zhang
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
- Hua Zhou
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095-1772
- Jin Zhou
  - Division of Epidemiology and Biostatistics, University of Arizona, Tucson, AZ 85721-0066
- Wei Sun
  - Program in Biostatistics and Biomathematics, Fred Hutchinson Cancer Research Center, Seattle, WA 98109
8. Rashid NU, Sun W, Ibrahim JG. A statistical model to assess (allele-specific) associations between gene expression and epigenetic features using sequencing data. Ann Appl Stat 2016; 10:2254-2273. [PMID: 29034055 DOI: 10.1214/16-aoas973] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Abstract
Sequencing techniques have been widely used to assess gene expression (i.e., RNA-seq) or the presence of epigenetic features (e.g., DNase-seq to identify open chromatin regions). In contrast to traditional microarray platforms, sequencing data are typically summarized in the form of discrete counts, and they are able to delineate allele-specific signals, which are not available from microarrays. The presence of epigenetic features is often associated with gene expression, both of which have been shown to be affected by DNA polymorphisms. However, joint models with the flexibility to assess interactions between gene expression, epigenetic features and DNA polymorphisms are currently lacking. In this paper, we develop a statistical model to assess the associations between gene expression and epigenetic features using sequencing data, while explicitly modeling the effects of DNA polymorphisms in either an allele-specific or nonallele-specific manner. We show that in doing so we provide the flexibility to detect associations between gene expression and epigenetic features, as well as conditional associations given DNA polymorphisms. We evaluate the performance of our method using simulations and apply our method to study the association between gene expression and the presence of DNase I Hypersensitive sites (DHSs) in HapMap individuals. Our model can be generalized to exploring the relationships between DNA polymorphisms and any two types of sequencing experiments, a useful feature as the variety of sequencing experiments continues to expand.
Affiliation(s)
- Wei Sun
  - Fred Hutchinson Cancer Research Center