1
Krutto A, Haugdahl Nøst T, Thoresen M. A heavy-tailed model for analyzing miRNA-seq raw read counts. Stat Appl Genet Mol Biol 2024; 23:sagmb-2023-0016. [PMID: 38810893 DOI: 10.1515/sagmb-2023-0016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Accepted: 05/02/2024] [Indexed: 05/31/2024]
Abstract
This article addresses the limitations of existing statistical models in analyzing and interpreting highly skewed miRNA-seq raw read count data that can range from zero to millions. A heavy-tailed model using discrete stable distributions is proposed as a novel approach to better capture the heterogeneity and extreme values commonly observed in miRNA-seq data. Additionally, the parameters of the discrete stable distribution are proposed as an alternative target for differential expression analysis. An R package for computing and estimating the discrete stable distribution is provided. The proposed model is applied to miRNA-seq raw counts from the Norwegian Women and Cancer Study (NOWAC) and the Cancer Genome Atlas (TCGA) databases. The goodness-of-fit is compared with the popular Poisson and negative binomial distributions, and the discrete stable distributions are found to give a better fit for both datasets. In conclusion, the use of discrete stable distributions is shown to potentially lead to more accurate modeling of the underlying biological processes.
Affiliation(s)
- Annika Krutto
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
- Therese Haugdahl Nøst
  - Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, Norway
  - Department of Public Health and Nursing, K.G. Jebsen Center for Genetic Epidemiology, UiT The Arctic University of Norway, Trondheim, Norway
- Magne Thoresen
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Oslo, Norway
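The abstract of Krutto et al. compares goodness-of-fit against the Poisson and negative binomial distributions. The following is a minimal sketch of that kind of baseline comparison only, assuming a simple method-of-moments fit; it does not implement the discrete stable distribution or the authors' R package, and the counts and function names are hypothetical.

```python
import math

def poisson_loglik(counts, lam):
    """Log-likelihood of the counts under Poisson(lam)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

def negbin_loglik(counts, r, p):
    """Log-likelihood under a negative binomial with size r and success probability p."""
    return sum(
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log(1 - p)
        for k in counts
    )

def fit_and_compare(counts):
    """Fit both models by the method of moments (an assumption, not MLE)."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((k - mean) ** 2 for k in counts) / (n - 1)
    if var <= mean:
        raise ValueError("method-of-moments NB fit needs variance > mean")
    r = mean ** 2 / (var - mean)   # NB size parameter
    p = r / (r + mean)             # NB success probability
    return {
        "poisson": poisson_loglik(counts, mean),
        "negbin": negbin_loglik(counts, r, p),
    }

# Hypothetical, made-up read counts with a heavy right tail.
counts = [0, 0, 1, 2, 3, 5, 8, 40, 120, 1000]
result = fit_and_compare(counts)
print(result)
```

On skewed counts like these, the negative binomial log-likelihood is far higher than the Poisson one, because the Poisson forces the variance to equal the mean.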
2
Nakano D, Akiba J, Tsutsumi T, Kawaguchi M, Yoshida T, Koga H, Kawaguchi T. Hepatic expression of sodium-glucose cotransporter 2 (SGLT2) in patients with chronic liver disease. Med Mol Morphol 2022; 55:304-315. [PMID: 36131166 PMCID: PMC9606064 DOI: 10.1007/s00795-022-00334-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 08/24/2022] [Indexed: 11/04/2022]
Abstract
Sodium–glucose cotransporter 2 (SGLT2) is expressed in the proximal renal tubule cells. We investigated the hepatic expression of SGLT2 and its related factors in patients with chronic liver disease in a retrospective human study. Liver tissues were biopsied from patients with chronic liver disease (n = 30), and the expression levels of SGLT2 were evaluated by immunostaining. An undirected graphical model was used to identify factors associated with hepatic expression levels of SGLT2. SGLT2 expression was observed not only in the kidney but also in the liver, both by immunostaining (SGLT2 intensity: kidney 165.8 ± 15.6, liver 114.4 ± 49.0 arbitrary units, P < 0.01) and by immunoblotting. There was no significant difference in hepatic expression of SGLT2 in the stratified analysis according to age, sex, BMI, and the severity of the liver disease. In the undirected graphical model, SGLT2 directly interacted with various factors such as sex, fatty change, neutrophil-to-lymphocyte ratio, triglyceride, hemoglobin A1c, creatinine, and albumin (partial correlation coefficient 0.4–0.6 for sex and 0.2–0.4 for others). The expression of SGLT2 was observed in the hepatocytes of patients with chronic liver disease. The undirected graphical model demonstrated the complex interaction of hepatic expression levels of SGLT2 with sex, inflammation, renal function, and lipid/glucose/protein metabolism.
Affiliation(s)
- Dan Nakano
  - Division of Gastroenterology, Department of Medicine, Kurume University School of Medicine, 67 Asahi-machi Kurume, Kurume, 830-0011, Japan
- Jun Akiba
  - Department of Pathology, Kurume University Hospital, Kurume, Japan
- Tsubasa Tsutsumi
  - Division of Gastroenterology, Department of Medicine, Kurume University School of Medicine, 67 Asahi-machi Kurume, Kurume, 830-0011, Japan
- Machiko Kawaguchi
  - Division of Gastroenterology, Department of Medicine, Kurume University School of Medicine, 67 Asahi-machi Kurume, Kurume, 830-0011, Japan
- Takafumi Yoshida
  - Division of Gastroenterology, Department of Medicine, Kurume University School of Medicine, 67 Asahi-machi Kurume, Kurume, 830-0011, Japan
- Hironori Koga
  - Division of Gastroenterology, Department of Medicine, Kurume University School of Medicine, 67 Asahi-machi Kurume, Kurume, 830-0011, Japan
  - Liver Cancer Division, Research Center for Innovative Cancer Therapy, Kurume University, Kurume, Japan
- Takumi Kawaguchi
  - Division of Gastroenterology, Department of Medicine, Kurume University School of Medicine, 67 Asahi-machi Kurume, Kurume, 830-0011, Japan
  - Liver Cancer Division, Research Center for Innovative Cancer Therapy, Kurume University, Kurume, Japan
3
Grimes T, Datta S. A novel probabilistic generator for large-scale gene association networks. PLoS One 2021; 16:e0259193. [PMID: 34767561 PMCID: PMC8589155 DOI: 10.1371/journal.pone.0259193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2021] [Accepted: 10/14/2021] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION: Gene expression data provide an opportunity for reverse-engineering gene-gene associations using network inference methods. However, it is difficult to assess the performance of these methods because the true underlying network is unknown in real data. Current benchmarks address this problem by subsampling a known regulatory network to conduct simulations. But the topology of regulatory networks can vary greatly across organisms or tissues, and reference-based generators, such as GeneNetWeaver, are not designed to capture this heterogeneity. This means, for example, that benchmark results from the E. coli regulatory network will not carry over to other organisms or tissues. In contrast, probabilistic generators do not require a reference network, and they have the potential to capture a rich distribution of topologies. This makes probabilistic generators an ideal approach for obtaining a robust benchmarking of network inference methods.
RESULTS: We propose a novel probabilistic network generator that (1) provides an alternative to address the inherent limitation of reference-based generators, (2) is able to create realistic gene association networks, and (3) captures the heterogeneity found across gold-standard networks better than existing generators used in practice. Eight organism-specific and 12 human tissue-specific gold-standard association networks are considered. Several measures of global topology are used to determine the similarity of generated networks to the gold standards. Along with demonstrating the variability of network structure across organisms and tissues, we show that the commonly used "scale-free" model is insufficient for replicating these structures.
AVAILABILITY: This generator is implemented in the R package "SeqNet" and is available on CRAN (https://cran.r-project.org/web/packages/SeqNet/index.html).
Affiliation(s)
- Tyler Grimes
  - Department of Biostatistics, University of Florida, Gainesville, Florida, United States of America
- Somnath Datta
  - Department of Biostatistics, University of Florida, Gainesville, Florida, United States of America
4
Park B, Choi H, Park C. Negative binomial graphical model with excess zeros. Stat Anal Data Min 2021. [DOI: 10.1002/sam.11536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Beomjin Park
  - Department of Statistics, University of Seoul, Seoul, South Korea
- Hosik Choi
  - Graduate School, Department of Urban Big Data Convergence, University of Seoul, Seoul, South Korea
- Changyi Park
  - Department of Statistics, University of Seoul, Seoul, South Korea
5
Li CZ, Kawaguchi ES, Li G. A New ℓ0-Regularized Log-Linear Poisson Graphical Model with Applications to RNA Sequencing Data. J Comput Biol 2021; 28:880-891. [PMID: 34375132 PMCID: PMC8558075 DOI: 10.1089/cmb.2020.0558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023] Open
Abstract
In this article, we develop a new ℓ0-based sparse Poisson graphical model with applications to gene network inference from RNA-seq gene expression count data. Assuming a pairwise Markov property, we propose to fit a separate broken adaptive ridge-regularized log-linear Poisson regression on each node to evaluate the conditional, instead of marginal, association between two genes in the presence of all other genes. The resulting sparse gene networks are generally more accurate than those generated by the ℓ1-regularized Poisson graphical model, as demonstrated by our empirical studies. A real data illustration is given on a kidney renal clear cell carcinoma micro-RNA-seq dataset from the Cancer Genome Atlas.
Affiliation(s)
- Caesar Z. Li
  - Department of Biostatistics, School of Public Health, University of California at Los Angeles, Los Angeles, California, USA
- Eric S. Kawaguchi
  - Graduate Programs in Biostatistics and Epidemiology, Keck School of Medicine, University of Southern California, Los Angeles, California, USA
- Gang Li
  - Department of Biostatistics, School of Public Health, University of California at Los Angeles, Los Angeles, California, USA
6
Prost V, Gazut S, Brüls T. A zero inflated log-normal model for inference of sparse microbial association networks. PLoS Comput Biol 2021; 17:e1009089. [PMID: 34143768 PMCID: PMC8244920 DOI: 10.1371/journal.pcbi.1009089] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/21/2020] [Revised: 06/30/2021] [Accepted: 05/17/2021] [Indexed: 01/03/2023] Open
Abstract
The advent of high-throughput metagenomic sequencing has prompted the development of efficient taxonomic profiling methods that measure the presence, abundance and phylogeny of organisms in a wide range of environmental samples. Multivariate sequence-derived abundance data further have the potential to enable inference of ecological associations between microbial populations, but several technical issues need to be accounted for, such as the compositional nature of the data, its extreme sparsity and overdispersion, and the frequent need to operate in under-determined regimes. The ecological network reconstruction problem is frequently cast into the paradigm of Gaussian Graphical Models (GGMs), for which efficient structure inference algorithms are available, such as the graphical lasso and neighborhood selection. Unfortunately, GGMs or variants thereof cannot properly account for the extremely sparse patterns occurring in real-world metagenomic taxonomic profiles. In particular, structural zeros (as opposed to sampling zeros) corresponding to true absences of biological signals fail to be properly handled by most statistical methods. We present here a zero-inflated log-normal graphical model (available at https://github.com/vincentprost/Zi-LN) specifically aimed at handling such "biological" zeros, and demonstrate significant performance gains over state-of-the-art statistical methods for the inference of microbial association networks, with the most notable gains obtained when analyzing taxonomic profiles displaying sparsity levels on par with real-world metagenomic datasets.
Affiliation(s)
- Vincent Prost
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, Evry, France
  - Université Paris-Saclay, CEA, List, Palaiseau, France
- Thomas Brüls
  - Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, Evry, France
7
Silverman JD, Roche K, Mukherjee S, David LA. Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J 2020; 18:2789-2798. [PMID: 33101615 PMCID: PMC7568192 DOI: 10.1016/j.csbj.2020.09.014] [Citation(s) in RCA: 54] [Impact Index Per Article: 13.5] [Received: 06/26/2020] [Revised: 09/09/2020] [Accepted: 09/10/2020] [Indexed: 12/21/2022] Open
Abstract
Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts for statistical analyses and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and show models can disagree substantially in terms of identifying the most differentially expressed sequences. Next, to rationally examine how different zero handling models behave, we developed a conceptual framework outlining four types of processes that may give rise to zero values in sequence count data. Last, we performed simulations to test how zero handling models behave in the presence of these different zero generating processes. Our simulations showed that simple count models are sufficient across multiple processes, even when the true underlying process is unknown. On the other hand, a common zero handling technique known as "zero-inflation" was only suitable under a zero generating process associated with an unlikely set of biological and experimental conditions. In concert, our work here suggests several specific guidelines for developing and choosing state-of-the-art models for analyzing sparse sequence count data.
Affiliation(s)
- Justin D Silverman
  - College of Information Science and Technology, Pennsylvania State University, State College, PA 16802, United States
  - Institute for Computational and Data Science, Pennsylvania State University, State College, PA 16802, United States
  - Department of Medicine, Pennsylvania State University, Hershey, PA 17033, United States
- Kimberly Roche
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
- Sayan Mukherjee
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
  - Departments of Statistical Science, Mathematics, Computer Science, Biostatistics & Bioinformatics, Duke University, Durham, NC 27708, United States
  - Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States
- Lawrence A David
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, United States
  - Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, United States
  - Department of Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, United States
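The abstract of Silverman et al. contrasts simple count models with "zero-inflation". As a minimal sketch of what zero-inflation means, assuming a Poisson base distribution, the code below mixes a point mass at zero with an ordinary Poisson; the parameter values and function names are hypothetical, and this is not the simulation framework used in the paper.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed on the log scale for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi emit a structural zero,
    otherwise draw the count from Poisson(lam)."""
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base

# Hypothetical parameters: mean count 3, 20% structural zeros.
lam, pi = 3.0, 0.2
p0_plain = poisson_pmf(0, lam)
p0_zip = zip_pmf(0, lam, pi)
total = sum(zip_pmf(k, lam, pi) for k in range(100))

print("P(0) under Poisson:", p0_plain)
print("P(0) under zero-inflated Poisson:", p0_zip)
print("ZIP probabilities sum to approximately:", total)
```

The extra mass at zero is exactly `pi * (1 - exp(-lam))`, which is the share of zeros the model attributes to a separate process rather than to ordinary sampling variation.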