26
Haris A, Simon N, Shojaie A. Generalized Sparse Additive Models. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2022; 23:70. [PMID: 37873545 PMCID: PMC10593424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework.
27
Prater KE, Green KJ, Chiou KL, Smith CL, Sun W, Shojaie A, Heath LM, Rose S, Keene CD, Logsdon BA, Snyder-Mackler N, Blue EE, Young JE, Garden GA, Jayadev S. Microglia subtype transcriptomes differ between Alzheimer Disease and control human postmortem brain samples. Alzheimers Dement 2022. [PMID: 34971137 DOI: 10.1002/alz.058474] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
BACKGROUND Microglia-mediated neuroinflammation is hypothesized to contribute to disease progression in neurodegenerative diseases such as Alzheimer's Disease (AD). Microglia subtypes are complex, with beneficial and harmful phenotypes. Understanding the gene expression networks which define the spectrum of microglia phenotypes is critical to identifying specific targets for neuroinflammation modulating therapies. METHOD Our study utilized post-mortem brain tissue from 22 total (7 male) participants; 12 (3 male) had significant AD neuropathic change. Nuclei isolated from prefrontal cortex were sorted for the myeloid marker PU.1 using fluorescence activated nucleus sorting (FANS). The FANS approach yields larger numbers of nuclei annotated as microglia with high quality sequence from each individual. We performed single-nucleus RNA-seq using the 10X Genomics Chromium platform. RESULTS We isolated more than 120,000 microglia nuclei, facilitating group comparisons based on disease state. Unbiased clustering revealed 10 microglia clusters and improved resolution of microglia heterogeneity compared to standard single-cell approaches. We identify clusters of microglia enriched for biological pathways implicating defined myeloid roles including interferon-stimulated, endo/lysosomal, neurodegenerative with a "disease-associated microglia" (DAM) signature, as well as a metabolically active and autophagic cluster. Interestingly, the cluster proportionately enriched for AD individuals' nuclei is not the DAM cluster but instead one of the clusters in which endo/lysosomal genes are highly upregulated. Furthermore, many of the genes in known AD risk loci are strongly differentially regulated in this AD associated cluster. We also identify a cluster of microglia that is proportionately enriched for control samples with upregulated cell cycle and proliferation genes. 
Trajectory analysis suggests that the paths AD and control nuclei take from unactivated "homeostatic" to various phenotypic states are also distinct. CONCLUSION Using human AD tissue collected with uniform protocols we characterize the transcriptomic profiles of microglia subtypes in human brain. By enriching for myeloid cells prior to analysis we can resolve microglia subtypes, revealing the diversity of microglia which are "inflammatory" as well as other microglia subtypes responding with induction of metabolic and lysosomal pathways. Our data identify subtypes of microglia that are unique to AD and control individuals. These results support the possibility of pharmacological targeting of specific subtypes of microglia to alter AD progression.
28
Wang X, Shojaie A. Causal Discovery in High-Dimensional Point Process Networks with Hidden Nodes. ENTROPY (BASEL, SWITZERLAND) 2021; 23:1622. [PMID: 34945928 PMCID: PMC8700240 DOI: 10.3390/e23121622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2021] [Revised: 11/20/2021] [Accepted: 11/27/2021] [Indexed: 12/01/2022]
Abstract
Thanks to technological advances leading to near-continuous time observations, emerging multivariate point process data offer new opportunities for causal discovery. However, a key obstacle in achieving this goal is that many relevant processes may not be observed in practice. Naïve estimation approaches that ignore these hidden variables can generate misleading results because of the unadjusted confounding. To plug this gap, we propose a deconfounding procedure to estimate high-dimensional point process networks with only a subset of the nodes being observed. Our method allows flexible connections between the observed and unobserved processes. It also allows the number of unobserved processes to be unknown and potentially larger than the number of observed nodes. Theoretical analyses and numerical studies highlight the advantages of the proposed method in identifying causal interactions among the observed processes.
29
Zhao S, Witten D, Shojaie A. In Defense of the Indefensible: A Very Naïve Approach to High-Dimensional Inference. Stat Sci 2021; 36:562-577. [PMID: 37860618 PMCID: PMC10586523 DOI: 10.1214/20-sts815] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 10/21/2023]
Abstract
A great deal of interest has recently focused on conducting inference on the parameters in a high-dimensional linear model. In this paper, we consider a simple and very naïve two-step procedure for this task, in which we (i) fit a lasso model in order to obtain a subset of the variables, and (ii) fit a least squares model on the lasso-selected set. Conventional statistical wisdom tells us that we cannot make use of the standard statistical inference tools for the resulting least squares model (such as confidence intervals and p-values), since we peeked at the data twice: once in running the lasso, and again in fitting the least squares model. However, in this paper, we show that under a certain set of assumptions, with high probability, the set of variables selected by the lasso is identical to the one selected by the noiseless lasso and is hence deterministic. Consequently, the naïve two-step approach can yield asymptotically valid inference. We utilize this finding to develop the naïve confidence interval, which can be used to draw inference on the regression coefficients of the model selected by the lasso, as well as the naïve score test, which can be used to test the hypotheses regarding the full-model regression coefficients.
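The two fitting steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the coordinate-descent lasso solver, the Gaussian-elimination least squares routine, the simulated data, and the penalty value `lam=0.5` are all assumptions chosen for illustration, and the naïve confidence interval and score test themselves are not reproduced here.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - Xb, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (j excluded).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


def ols(X, y):
    """Least squares via the normal equations, solved by Gauss-Jordan elimination."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [A[r][m] - f * A[c][m] for m in range(k + 1)]
    return [A[c][k] / A[c][c] for c in range(k)]


def naive_two_step(X, y, lam):
    """Step (i): lasso selects variables. Step (ii): least squares on that set."""
    beta_lasso = lasso_cd(X, y, lam)
    selected = [j for j, b in enumerate(beta_lasso) if b != 0.0]
    if not selected:
        return selected, []
    X_sel = [[row[j] for j in selected] for row in X]
    return selected, ols(X_sel, y)


def simulate(n=100, p=10, seed=0):
    """Hypothetical data: only the first two of p features carry signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]
    return X, y


X, y = simulate()
selected, coefs = naive_two_step(X, y, lam=0.5)
print("selected:", selected, "refit coefficients:", coefs)
```

The refit in step (ii) removes the shrinkage that the lasso applies to the selected coefficients. Whether standard intervals on the refit are valid depends on the assumptions stated in the paper, which this sketch does not check.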
30
Yue K, Ma J, Thornton T, Shojaie A. REHE: Fast variance components estimation for linear mixed models. Genet Epidemiol 2021; 45:891-905. [PMID: 34658056 DOI: 10.1002/gepi.22432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2021] [Revised: 06/11/2021] [Accepted: 10/04/2021] [Indexed: 11/07/2022]
Abstract
Linear mixed models are widely used in ecological and biological applications, especially in genetic studies. Reliable estimation of variance components is crucial for using linear mixed models. However, standard methods, such as the restricted maximum likelihood (REML), are computationally inefficient in large samples and may be unstable with small samples. Other commonly used methods, such as the Haseman-Elston (HE) regression, may yield negative estimates of variances. Utilizing regularized estimation strategies, we propose the restricted Haseman-Elston (REHE) regression and REHE with resampling (reREHE) estimators, along with an inference framework for REHE, as fast and robust alternatives that provide nonnegative estimates with comparable accuracy to REML. The merits of REHE are illustrated using real data and benchmark simulation studies.
31
Yu S, Drton M, Promislow DEL, Shojaie A. CorDiffViz: an R package for visualizing multi-omics differential correlation networks. BMC Bioinformatics 2021; 22:486. [PMID: 34627139 PMCID: PMC8501646 DOI: 10.1186/s12859-021-04383-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2020] [Accepted: 09/20/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND Differential correlation networks are increasingly used to delineate changes in interactions among biomolecules. They characterize differences between omics networks under two different conditions, and can be used to delineate mechanisms of disease initiation and progression. RESULTS We present a new R package, CorDiffViz, that facilitates the estimation and visualization of differential correlation networks using multiple correlation measures and inference methods. The software is implemented in R, HTML and Javascript, and is available at https://github.com/sqyu/CorDiffViz . Visualization has been tested for the Chrome and Firefox web browsers. A demo is available at https://diffcornet.github.io/CorDiffViz/demo.html . CONCLUSIONS Our software offers considerable flexibility by allowing the user to interact with the visualization and choose from different estimation methods and visualizations. It also allows the user to easily toggle between correlation networks for samples under one condition and differential correlations between samples under two conditions. Moreover, the software facilitates integrative analysis of cross-correlation networks between two omics data sets.
32
Hellstern M, Ma J, Yue K, Shojaie A. netgsa: Fast computation and interactive visualization for topology-based pathway enrichment analysis. PLoS Comput Biol 2021; 17:e1008979. [PMID: 34115744 PMCID: PMC8221786 DOI: 10.1371/journal.pcbi.1008979] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/15/2020] [Revised: 06/23/2021] [Accepted: 04/18/2021] [Indexed: 01/26/2023]
Abstract
Existing software tools for topology-based pathway enrichment analysis are either computationally inefficient, have undesirable statistical power, or require expert knowledge to leverage the methods' capabilities. To address these limitations, we have overhauled NetGSA, an existing topology-based method, to provide a computationally-efficient user-friendly tool that offers interactive visualization. Pathway enrichment analysis for thousands of genes can be performed in minutes on a personal computer without sacrificing statistical power. The new software also removes the need for expert knowledge by directly curating gene-gene interaction information from multiple external databases. Lastly, by utilizing the capabilities of Cytoscape, the new software also offers interactive and intuitive network visualization.
33
Shojaie A. Differential Network Analysis: A Statistical Perspective. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL STATISTICS 2021; 13:e1508. [PMID: 37050915 PMCID: PMC10088462 DOI: 10.1002/wics.1508] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/09/2019] [Accepted: 03/03/2020] [Indexed: 11/06/2022]
Abstract
Networks effectively capture interactions among components of complex systems, and have thus become a mainstay in many scientific disciplines. Growing evidence, especially from biology, suggests that networks undergo changes over time and in response to external stimuli. In biology and medicine, these changes have been found to be predictive of complex diseases. They have also been used to gain insight into mechanisms of disease initiation and progression. Primarily motivated by biological applications, this article provides a review of recent statistical machine learning methods for inferring networks and identifying changes in their structures.
34
Manzour H, Küçükyavuz S, Wu HH, Shojaie A. Integer Programming for Learning Directed Acyclic Graphs from Continuous Data. INFORMS JOURNAL ON OPTIMIZATION 2021; 3:46-73. [PMID: 37051459 PMCID: PMC10088505 DOI: 10.1287/ijoo.2019.0040] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022]
Abstract
Learning directed acyclic graphs (DAGs) from data is a challenging task both in theory and in practice, because the number of possible DAGs scales superexponentially with the number of nodes. In this paper, we study the problem of learning an optimal DAG from continuous observational data. We cast this problem in the form of a mathematical programming model that can naturally incorporate a superstructure to reduce the set of possible candidate DAGs. We use a negative log-likelihood score function with both [Formula: see text] and [Formula: see text] penalties and propose a new mixed-integer quadratic program, referred to as a layered network (LN) formulation. The LN formulation is a compact model that enjoys as tight an optimal continuous relaxation value as the stronger but larger formulations under a mild condition. Computational results indicate that the proposed formulation outperforms existing mathematical formulations and scales better than available algorithms that can solve the same problem with only [Formula: see text] regularization. In particular, the LN formulation clearly outperforms existing methods in terms of computational time needed to find an optimal DAG in the presence of a sparse superstructure.
35
Simon N, Shojaie A. Convergence Rates of Nonparametric Penalized Regression under Misspecified Smoothness. Stat Sin 2021. [DOI: 10.5705/ss.202018.0144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
36
Tank A, Li X, Fox EB, Shojaie A. The Convex Mixture Distribution: Granger Causality for Categorical Time Series. SIAM JOURNAL ON MATHEMATICS OF DATA SCIENCE 2021; 3:83-112. [PMID: 37859797 PMCID: PMC10586348 DOI: 10.1137/20m133097x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 10/21/2023]
Abstract
We present a framework for learning Granger causality networks for multivariate categorical time series based on the mixture transition distribution (MTD) model. Traditionally, MTD is plagued by a nonconvex objective, non-identifiability, and presence of local optima. To circumvent these problems, we recast inference in the MTD as a convex problem. The new formulation facilitates the application of MTD to high-dimensional multivariate time series. As a baseline, we also formulate a multi-output logistic autoregressive model (mLTD), which, while a straightforward extension of autoregressive Bernoulli generalized linear models, has not been previously applied to the analysis of multivariate categorical time series. We establish identifiability conditions of the MTD model and compare them to those for mLTD. We further devise novel and efficient optimization algorithms for MTD based on our proposed convex formulation, and compare the MTD and mLTD in both simulated and real data experiments. Finally, we establish consistency of the convex MTD in high dimensions. Our approach simultaneously provides a comparison of methods for network inference in categorical time series and opens the door to modern, regularized inference with the MTD model.
37
Li X, Shojaie A. Discussion of “A Tuning-Free Robust and Efficient Approach to High-Dimensional Regression”. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1837139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
38
Dibay Moghadam S, Navarro SL, Shojaie A, Randolph TW, Bettcher LF, Le CB, Hullar MA, Kratz M, Neuhouser ML, Lampe PD, Raftery D, Lampe JW. Plasma lipidomic profiles after a low and high glycemic load dietary pattern in a randomized controlled crossover feeding study. Metabolomics 2020; 16:121. [PMID: 33219392 PMCID: PMC8116047 DOI: 10.1007/s11306-020-01746-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/28/2020] [Accepted: 11/09/2020] [Indexed: 12/11/2022]
Abstract
BACKGROUND Dietary patterns low in glycemic load are associated with reduced risk of cardiometabolic diseases. Improvements in serum lipid concentrations may play a role in these observed associations. OBJECTIVE We investigated how dietary patterns differing in glycemic load affect clinical lipid panel measures and plasma lipidomics profiles. METHODS In a crossover, controlled feeding study, 80 healthy participants (n = 40 men, n = 40 women), 18-45 y were randomized to receive low-glycemic load (LGL) or high glycemic load (HGL) diets for 28 days each with at least a 28-day washout period between controlled diets. Fasting plasma samples were collected at baseline and end of each diet period. Lipids on a clinical panel including total-, VLDL-, LDL-, and HDL-cholesterol and triglycerides were measured using an auto-analyzer. Lipidomics analysis using mass-spectrometry provided the concentrations of 863 species. Linear mixed models and lipid ontology enrichment analysis were implemented. RESULTS Lipids from the clinical panel were not significantly different between diets. Univariate analysis showed that 67 species on the lipidomics panel, predominantly in the triacylglycerol class, were higher after the LGL diet compared to the HGL (FDR < 0.05). Three species with FA 17:0 were lower after LGL diet with enrichment analysis (FDR < 0.05). CONCLUSION In the context of controlled eucaloric diets with similar macronutrient distribution, these results suggest that there are relative shifts in lipid species, but the overall pool does not change. Further studies are needed to better understand in which compartment the different lipid species are transported in blood, and how these shifts are related to health outcomes. This trial was registered at clinicaltrials.gov as NCT00622661.
39
Lin L, Drton M, Shojaie A. Statistical significance in high-dimensional linear mixed models. FODS '20 : PROCEEDINGS OF THE 2020 ACM-IMS FOUNDATIONS OF DATA SCIENCE CONFERENCE : OCTOBER 19-20, 2020, VIRTUAL EVENT, USA. ACM-IMS FOUNDATIONS OF DATA SCIENCE CONFERENCE (2020 : ONLINE) 2020; 2020:171-181. [PMID: 35497571 PMCID: PMC9053448 DOI: 10.1145/3412815.3416883] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/14/2023]
Abstract
This paper concerns the development of an inferential framework for high-dimensional linear mixed effect models. These are suitable models, for instance, when we have n repeated measurements for M subjects. We consider a scenario where the number of fixed effects p is large (and may be larger than M), but the number of random effects q is small. Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators to perform inference for high-dimensional linear models with fixed effects only. In particular, we demonstrate how to correct a 'naive' ridge estimator in extension of work by Bühlmann (2013) to build asymptotically valid confidence intervals for mixed effect models. We validate our theoretical results with numerical experiments, in which we show our method outperforms those that fail to account for correlation induced by the random effects. For a practical demonstration we consider a riboflavin production dataset that exhibits group structure, and show that conclusions drawn using our method are consistent with those obtained on a similar dataset without group structure.
40
Jin K, Wilson KA, Beck JN, Nelson CS, Brownridge GW, Harrison BR, Djukovic D, Raftery D, Brem RB, Yu S, Drton M, Shojaie A, Kapahi P, Promislow D. Genetic and metabolomic architecture of variation in diet restriction-mediated lifespan extension in Drosophila. PLoS Genet 2020; 16:e1008835. [PMID: 32644988 PMCID: PMC7347105 DOI: 10.1371/journal.pgen.1008835] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 01/08/2020] [Accepted: 05/06/2020] [Indexed: 01/08/2023]
Abstract
In most organisms, dietary restriction (DR) increases lifespan. However, several studies have found that genotypes within the same species vary widely in how they respond to DR. To explore the mechanisms underlying this variation, we exposed 178 inbred Drosophila melanogaster lines to a DR or ad libitum (AL) diet, and measured a panel of 105 metabolites under both diets. Twenty-four out of 105 metabolites were associated with the magnitude of the lifespan response. These included proteinogenic amino acids and metabolites involved in α-ketoglutarate (α-KG)/glutamine metabolism. We confirm the role of α-KG/glutamine synthesis pathways in the DR response through genetic manipulations. We used covariance network analysis to investigate diet-dependent interactions between metabolites, identifying the essential amino acids threonine and arginine as “hub” metabolites in the DR response. Finally, we employ a novel metabolic and genetic bipartite network analysis to reveal multiple genes that influence DR lifespan response, some of which have not previously been implicated in DR regulation. One of these is CCHa2R, a gene that encodes a neuropeptide receptor that influences satiety response and insulin signaling. Across the lines, variation in an intronic single nucleotide variant of CCHa2R correlated with variation in levels of five metabolites, all of which in turn were correlated with DR lifespan response. Inhibition of adult CCHa2R expression extended DR lifespan of flies, confirming the role of CCHa2R in lifespan response. These results provide support for the power of combined genomic and metabolomic analysis to identify key pathways underlying variation in this complex quantitative trait.
Dietary restriction extends lifespan across most organisms in which it has been tested. However, several studies have now demonstrated that this effect can vary dramatically across different genotypes within a population. Within a population, dietary restriction might be beneficial for some, yet detrimental for others. Here, we measure the metabolome of 178 genetically characterized fly strains on fully fed and restricted diets. The fly strains vary widely in their lifespan response to dietary restriction. We then use information about each strain’s genome and metabolome (a measure of small molecules circulating in flies) to pinpoint cellular pathways that govern this variation in response. We identify a novel pathway involving the gene CCHa2R, which encodes a neuropeptide receptor that has not previously been implicated in dietary restriction or age-related signaling pathways. This study demonstrates the power of leveraging systems biology and network biology methods to understand how and why different individuals vary in their response to health and lifespan-extending interventions.
41
Safikhani A, Shojaie A. Joint Structural Break Detection and Parameter Estimation in High-Dimensional Non-Stationary VAR Models. J Am Stat Assoc 2020; 117:251-264. [PMID: 38375186 PMCID: PMC10874880 DOI: 10.1080/01621459.2020.1770097] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/18/2018] [Revised: 11/01/2019] [Accepted: 05/11/2020] [Indexed: 10/24/2022]
Abstract
Assuming stationarity is unrealistic in many time series applications. A more realistic alternative is to assume piecewise stationarity, where the model can change at potentially many change points. We propose a three-stage procedure for simultaneous estimation of change points and parameters of high-dimensional piecewise vector autoregressive (VAR) models. In the first step, we reformulate the change point detection problem as a high-dimensional variable selection problem, and solve it using a penalized least squares estimator with a total variation penalty. We show that the penalized estimation method over-estimates the number of change points, and propose a selection criterion to identify the change points. In the last step of our procedure, we estimate the VAR parameters in each of the segments. We prove that the proposed procedure consistently detects the number and location of change points, and provides consistent estimates of VAR parameters. The performance of the method is illustrated through several simulated and real data examples.
42
Whitney D, Shojaie A, Carone M. Comment: Models as (deliberate) approximations. Stat Sci 2020; 34:591-598. [PMID: 32581422 DOI: 10.1214/19-sts747] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
43
Wang Y, Randolph TW, Shojaie A, Ma J. The Generalized Matrix Decomposition Biplot and Its Application to Microbiome Data. mSystems 2019; 4:e00504-19. [PMID: 31848304 PMCID: PMC6918030 DOI: 10.1128/msystems.00504-19] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 08/19/2019] [Accepted: 11/13/2019] [Indexed: 11/20/2022]
Abstract
Exploratory analysis of human microbiome data is often based on dimension-reduced graphical displays derived from similarities based on non-Euclidean distances, such as UniFrac or Bray-Curtis. However, a display of this type, often referred to as the principal-coordinate analysis (PCoA) plot, does not reveal which taxa are related to the observed clustering because the configuration of samples is not based on a coordinate system in which both the samples and variables can be represented. The reason is that the PCoA plot is based on the eigen-decomposition of a similarity matrix and not the singular value decomposition (SVD) of the sample-by-abundance matrix. We propose a novel biplot that is based on an extension of the SVD, called the generalized matrix decomposition biplot (GMD-biplot), which involves an arbitrary matrix of similarities and the original matrix of variable measures, such as taxon abundances. As in a traditional biplot, points represent the samples, and arrows represent the variables. The proposed GMD-biplot is illustrated by analyzing multiple real and simulated data sets which demonstrate that the GMD-biplot provides improved clustering capability and a more meaningful relationship between the arrows and points. IMPORTANCE Biplots that simultaneously display the sample clustering and the important taxa have gained popularity in the exploratory analysis of human microbiome data. Traditional biplots, assuming Euclidean distances between samples, are not appropriate for microbiome data, when non-Euclidean distances are used to characterize dissimilarities among microbial communities. Thus, incorporating information from non-Euclidean distances into a biplot becomes useful for graphical displays of microbiome data. The proposed GMD-biplot accounts for any arbitrary non-Euclidean distances and provides a robust and computationally efficient approach for graphical visualization of microbiome data. In addition, the proposed GMD-biplot displays both the samples and taxa with respect to the same coordinate system, which further allows the configuration of future samples.
44
Ma J, Shojaie A, Michailidis G. A comparative study of topology-based pathway enrichment analysis methods. BMC Bioinformatics 2019; 20:546. [PMID: 31684881 PMCID: PMC6829999 DOI: 10.1186/s12859-019-3146-1] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Received: 09/17/2018] [Accepted: 10/02/2019] [Indexed: 02/01/2023]
Abstract
BACKGROUND Pathway enrichment is extensively used in the analysis of Omics data for gaining biological insights into the functional roles of pre-defined subsets of genes, proteins and metabolites. A large number of methods have been proposed in the literature for this task. The vast majority of these methods use as input expression levels of the biomolecules under study together with their membership in pathways of interest. The latest generation of pathway enrichment methods also leverages information on the topology of the underlying pathways, which, as evidence from their evaluation reveals, leads to improved sensitivity and specificity. Nevertheless, a systematic empirical comparison of such methods is still lacking, making selection of the most suitable method for a specific experimental setting challenging. This comparative study of nine network-based methods for pathway enrichment analysis aims to provide a systematic evaluation of their performance based on three real data sets with different numbers of features (genes/metabolites) and samples. RESULTS The findings highlight both methodological and empirical differences across the nine methods. In particular, certain methods assess pathway enrichment due to differences both across expression levels and in the strength of the interconnectedness of the members of the pathway, while others only leverage differential expression levels. In the more challenging setting involving a metabolomics data set, the results show that methods that utilize both pieces of information (with NetGSA being a prototypical one) exhibit superior statistical power in detecting pathway enrichment. CONCLUSION The analysis reveals that a number of methods perform equally well when testing large pathways, which is the case with genomic data. On the other hand, NetGSA, which takes into consideration both differential expression of the biomolecules in the pathway and changes in the topology, exhibits superior performance when testing small pathways, which is usually the case for metabolomics data.
|
45
|
Wang X, Shojaie A, Zou J. Bayesian Hidden Markov Models for Dependent Large-Scale Multiple Testing. Comput Stat Data Anal 2019; 136:123-136. [PMID: 31662591 DOI: 10.1016/j.csda.2019.01.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/27/2022]
Abstract
An optimal and flexible multiple hypothesis testing procedure is constructed for dependent data based on Bayesian techniques, aiming to handle two challenges: the dependence structure and the specification of the non-null distribution. Ignoring dependence among hypothesis tests may lead to loss of efficiency and biased decisions. Misspecification of the non-null distribution, on the other hand, can result in both false positive and false negative errors. Hidden Markov models are used to accommodate the dependence structure among the tests. A Dirichlet mixture process prior is applied to the non-null distribution to overcome the potential pitfalls of distribution misspecification. The testing algorithm based on Bayesian techniques optimizes the false negative rate (FNR) while controlling the false discovery rate (FDR). The procedure is applied to pointwise and clusterwise analysis. Its performance is compared with existing approaches using both simulated and real data examples.
|
46
|
Navarro SL, Tarkhan A, Shojaie A, Randolph TW, Gu H, Djukovic D, Osterbauer KJ, Hullar MA, Kratz M, Neuhouser ML, Lampe PD, Raftery D, Lampe JW. Plasma metabolomics profiles suggest beneficial effects of a low-glycemic load dietary pattern on inflammation and energy metabolism. Am J Clin Nutr 2019; 110:984-992. [PMID: 31432072 PMCID: PMC6766441 DOI: 10.1093/ajcn/nqz169] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 04/16/2019] [Accepted: 07/02/2019] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Low-glycemic load dietary patterns, characterized by consumption of whole grains, legumes, fruits, and vegetables, are associated with reduced risk of several chronic diseases. METHODS Using samples from a randomized, controlled, crossover feeding trial, we evaluated the effects on metabolic profiles of a low-glycemic whole-grain dietary pattern (WG) compared with a dietary pattern high in refined grains and added sugars (RG) for 28 d. LC-MS-based targeted metabolomics analysis was performed on fasting plasma samples from 80 healthy participants (n = 40 men, n = 40 women) aged 18-45 y. Linear mixed models were used to evaluate differences in response between diets for individual metabolites. Kyoto Encyclopedia of Genes and Genomes (KEGG)-defined pathways and 2 novel data-driven analyses were conducted to consider differences at the pathway level. RESULTS There were 121 metabolites with detectable signal in >98% of all plasma samples. Eighteen metabolites were significantly different between diets at day 28 [false discovery rate (FDR) < 0.05]. Inositol, hydroxyphenylpyruvate, citrulline, ornithine, 13-hydroxyoctadecadienoic acid, glutamine, and oxaloacetate were higher after the WG diet than after the RG diet, whereas melatonin, betaine, creatine, acetylcholine, aspartate, hydroxyproline, methylhistidine, tryptophan, cystamine, carnitine, and trimethylamine were lower. Analyses using KEGG-defined pathways revealed statistically significant differences in tryptophan metabolism between diets, with kynurenine and melatonin positively associated with serum C-reactive protein concentrations. Novel data-driven methods at the metabolite and network levels found correlations among metabolites involved in branched-chain amino acid (BCAA) degradation, trimethylamine-N-oxide production, and β oxidation of fatty acids (FDR < 0.1) that differed between diets, with more favorable metabolic profiles detected after the WG diet. 
Higher BCAAs and trimethylamine were positively associated with homeostasis model assessment-insulin resistance. CONCLUSIONS These exploratory metabolomics results support beneficial effects of a low-glycemic load dietary pattern characterized by whole grains, legumes, fruits, and vegetables, compared with a diet high in refined grains and added sugars on inflammation and energy metabolism pathways. This trial was registered at clinicaltrials.gov as NCT00622661.
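The abstract reports results at FDR thresholds (0.05 and 0.1) without naming the procedure used. As a minimal sketch, assuming a Benjamini-Hochberg step-up rule (a common choice, not confirmed by the source), FDR-based selection across many metabolites could look as follows. The function name and the p-values are hypothetical and only illustrate the mechanics.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected.

    Assumption: Benjamini-Hochberg step-up rule; the study's actual
    FDR procedure is not stated in the abstract.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= k * alpha / m (step-up rule).
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below the cutoff.
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical per-metabolite p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(example_p, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold.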
|
47
|
Haris A, Shojaie A, Simon N. Nonparametric regression with adaptive truncation via a convex hierarchical penalty. Biometrika 2019; 106:87-107. [PMID: 31427821 DOI: 10.1093/biomet/asy056] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/17/2017] [Indexed: 11/13/2022] Open
Abstract
We consider the problem of nonparametric regression with a potentially large number of covariates. We propose a convex, penalized estimation framework that is particularly well suited to high-dimensional sparse additive models and combines the appealing features of finite basis representation and smoothing penalties. In the case of additive models, a finite basis representation provides a parsimonious representation for fitted functions but is not adaptive when component functions possess different levels of complexity. In contrast, a smoothing spline-type penalty on the component functions is adaptive but does not provide a parsimonious representation. Our proposal simultaneously achieves parsimony and adaptivity in a computationally efficient way. We demonstrate these properties through empirical studies and show that our estimator converges at the minimax rate for functions within a hierarchical class. We further establish minimax rates for a large class of sparse additive models. We also develop an efficient algorithm that scales similarly to the lasso with the number of covariates and sample size.
|
48
|
Moghadam SD, Navarro S, Shojaie A, Randolph T, Bettcher L, Le C, Hullar M, Kratz M, Neuhouser M, Lampe P, Raftery D, Lampe J. Plasma Lipidomics Profiles After a Diet Characterized by Whole Grains Compared to a Diet High in Refined Grains and Added Sugars (FS03-07-19). Curr Dev Nutr 2019. [DOI: 10.1093/cdn/nzz046.fs03-07-19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
Abstract
Objectives
Dietary patterns high in fiber from sources including whole grains, legumes, fruits, vegetables, nuts and seeds, are associated with lower risk of chronic disease, such as cardiovascular disease and cancer. We investigated how plasma lipidomics profiles differed between a diet high in whole grains (WG) versus a diet high in refined grains and added sugars (RG).
Methods
Using a randomized, crossover, controlled feeding study, 80 healthy participants (n = 40 men, n = 40 women, 40 normal weight, 40 overweight/obese), 18–45 y, were randomized to receive either a WG or RG diet for 28 days. After a 28-day washout period where participants resumed their habitual diet, they crossed over to the other diet. Targeted, differential mobility mass spectrometry was performed on fasting plasma samples collected at the baseline and end of each diet period and quantified the concentrations of 863 lipids from 13 classes. Paired t-tests and pairwise partial least squares-discriminant analysis (PLS-DA) were used to evaluate differences in lipid profiles between the two diets.
Results
At a class level, only ceramides were significantly different when comparing the two diets. After removing lipid species with > 20% missing values or CVs < 25%, 606 were retained for species analysis. Sixty-seven lipid species were significantly different between diets at day 28 (FDR < 0.05): 38 of 414 detected triglycerides, 9 of 59 phosphatidylethanolamines, 9 of 63 phosphatidylcholines, 4 of 22 cholesterol esters, 3 of 11 sphingomyelins, 2 of 13 lysophosphatidylcholines, and 1 of 5 ceramides. The majority of significant lipids were higher in plasma after the WG diet. PLS-DA showed the first and second components explaining 49% and 8.4%, respectively. Based on the selected components, lipidomic profiles showed fair separation between the two diets. R2 values were 0.07 and 0.43, and Q2 values were -0.03 and 0.04 for components 1 and 2, respectively.
Conclusions
Higher concentrations of some lipid species such as cholesterol ester 12:0, a carrier of high-density lipoprotein, could indicate a favorable shift in lipid profiles. Further investigations using more complex models are being conducted.
Funding Sources
National Cancer Institute - National Institutes of Health.
|
49
|
Yu S, Drton M, Shojaie A. Generalized Score Matching for Non-Negative Data. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2019; 20:76. [PMID: 34290571 PMCID: PMC8291733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation may be implemented using numerical integration, the approach becomes computationally intensive. The score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions over $\mathbb{R}^m$. Hyvärinen (2007) extended the approach to distributions supported on the non-negative orthant, $\mathbb{R}_+^m$. In this paper, we give a generalized form of score matching for non-negative data that improves estimation efficiency. As an example, we consider a general class of pairwise interaction models. Addressing an overlooked inexistence problem, we generalize the regularized score matching method of Lin et al. (2016) and improve its theoretical guarantees for non-negative Gaussian graphical models.
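To illustrate the closed-form property mentioned above, here is a minimal sketch of the original score matching idea on the real line (m = 1) for a univariate Gaussian. It is not the generalized non-negative estimator proposed in the paper. The parametrization log p(x) = eta1 * x - 0.5 * eta2 * x^2 + const and all function names are assumptions made for this example.

```python
import random


def empirical_objective(xs, eta1, eta2):
    """Empirical score matching objective for the assumed Gaussian model.

    With psi(x) = d/dx log p(x) = eta1 - eta2 * x and psi'(x) = -eta2,
    the objective is the sample mean of psi'(x) + 0.5 * psi(x)**2.
    The normalizing constant never appears.
    """
    return sum(-eta2 + 0.5 * (eta1 - eta2 * x) ** 2 for x in xs) / len(xs)


def score_matching_gaussian(xs):
    """Closed-form minimizer of the objective above.

    Setting both partial derivatives to zero gives
    eta2 = 1 / (mean(x^2) - mean(x)^2) and eta1 = eta2 * mean(x).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_x2 = sum(x * x for x in xs) / n
    eta2 = 1.0 / (mean_x2 - mean_x ** 2)
    eta1 = eta2 * mean_x
    return eta1, eta2


# Simulated data, for illustration only: true mean 2.0, true sd 0.5.
rng = random.Random(0)
sample = [rng.gauss(2.0, 0.5) for _ in range(5000)]
eta1_hat, eta2_hat = score_matching_gaussian(sample)
mu_hat = eta1_hat / eta2_hat       # implied mean
sigma2_hat = 1.0 / eta2_hat        # implied variance
```

In this simple case the estimates coincide with the sample mean and the (divisor n) sample variance, because the objective is quadratic in the natural parameters.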
|
50
|
Sondhi A, Shojaie A. The Reduced PC-Algorithm: Improved Causal Structure Learning in Large Random Networks. JOURNAL OF MACHINE LEARNING RESEARCH : JMLR 2019; 20:https://jmlr.org/papers/v20/17-601.html. [PMID: 37799538 PMCID: PMC10552884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/07/2023]
Abstract
We consider the task of estimating a high-dimensional directed acyclic graph, given observations from a linear structural equation model with arbitrary noise distribution. By exploiting properties of common random graphs, we develop a new algorithm that requires conditioning only on small sets of variables. The proposed algorithm, which is essentially a modified version of the PC-Algorithm, offers significant gains in both computational complexity and estimation accuracy. In particular, it results in more efficient and accurate estimation in large networks containing hub nodes, which are common in biological systems. We prove the consistency of the proposed algorithm, and show that it also requires a less stringent faithfulness assumption than the PC-Algorithm. Simulations in low and high-dimensional settings are used to illustrate these findings. An application to gene expression data suggests that the proposed algorithm can identify a greater number of clinically relevant genes than current methods.
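PC-type algorithms decide whether to keep an edge by testing conditional independence given small sets of variables. The sketch below is not the reduced PC-algorithm itself. It only shows one such test with a single conditioning variable, assuming a partial-correlation test with Fisher's z-transform (a standard choice for Gaussian data, whereas the abstract allows arbitrary noise). All function names and the simulated chain X -> Z -> Y are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def looks_independent(r, n, cond_size, alpha=0.01):
    """Fisher z-test: True if a zero (partial) correlation is not rejected."""
    stat = math.sqrt(n - cond_size - 3) * abs(math.atanh(r))
    return stat <= NormalDist().inv_cdf(1 - alpha / 2)


def simulate_chain(n, seed=0):
    """Simulate the chain X -> Z -> Y with standard normal noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    y = [0.8 * zi + rng.gauss(0, 1) for zi in z]
    return x, y, z


x, y, z = simulate_chain(2000, seed=1)
marginal_r = pearson(x, y)            # X and Y are marginally dependent
conditional_r = partial_corr(x, y, z)  # close to zero once Z is conditioned on
```

In the simulated chain, the X-Y edge would be dropped once Z is conditioned on, which is why keeping conditioning sets small matters for both computation and accuracy.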
|