1
Tárraga JM, Sevillano-Marco E, Muñoz-Marí J, Piles M, Sitokonstantinou V, Ronco M, Miranda MT, Cerdà J, Camps-Valls G. Causal discovery reveals complex patterns of drought-induced displacement. iScience 2024; 27:110628. [PMID: 39262799 PMCID: PMC11387590 DOI: 10.1016/j.isci.2024.110628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2024] [Revised: 06/15/2024] [Accepted: 07/29/2024] [Indexed: 09/13/2024] Open
Abstract
The increasing frequency and severity of droughts present a significant risk to vulnerable regions of the globe, potentially leading to substantial human displacement in extreme situations. Drought-induced displacement is a complex and multifaceted issue that can perpetuate cycles of poverty, exacerbate food and water scarcity, and reinforce socio-economic inequalities. However, our understanding of human mobility in drought scenarios is currently limited, inhibiting accurate predictions and effective policy responses. Drought-induced displacement is driven by numerous factors and identifying its key drivers, causal-effect lags, and consequential effects is often challenging, typically relying on mechanistic models and qualitative assumptions. This paper presents a novel, data-driven methodology, grounded in causal discovery, to retrieve the drivers of drought-induced displacement within Somalia from 2016 to 2023. Our model exposes the intertwined vulnerabilities and the leading times that connect drought impacts, water and food security systems along with episodes of violent conflict, emphasizing that causal mechanisms change across districts. These findings pave the way for the development of algorithms with the ability to learn from human mobility data, enhancing anticipatory action, policy formulation, and humanitarian aid.
Affiliation(s)
- Jose María Tárraga
  - Image Processing Laboratory, Universitat de València, 46980 Paterna, Spain
- Jordi Muñoz-Marí
  - Image Processing Laboratory, Universitat de València, 46980 Paterna, Spain
- María Piles
  - Image Processing Laboratory, Universitat de València, 46980 Paterna, Spain
- Michele Ronco
  - Image Processing Laboratory, Universitat de València, 46980 Paterna, Spain
- Jordi Cerdà
  - Image Processing Laboratory, Universitat de València, 46980 Paterna, Spain
- Gustau Camps-Valls
  - Image Processing Laboratory, Universitat de València, 46980 Paterna, Spain
2
Kernfeld E, Keener R, Cahan P, Battle A. Transcriptome data are insufficient to control false discoveries in regulatory network inference. Cell Syst 2024; 15:709-724.e13. [PMID: 39173585 DOI: 10.1016/j.cels.2024.07.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Revised: 05/31/2024] [Accepted: 07/22/2024] [Indexed: 08/24/2024]
Abstract
Inference of causal transcriptional regulatory networks (TRNs) from transcriptomic data suffers notoriously from false positives. Approaches to control the false discovery rate (FDR), for example, via permutation, bootstrapping, or multivariate Gaussian distributions, suffer from several complications: difficulty in distinguishing direct from indirect regulation, nonlinear effects, and causal structure inference requiring "causal sufficiency," meaning experiments that are free of any unmeasured, confounding variables. Here, we use a recently developed statistical framework, model-X knockoffs, to control the FDR while accounting for indirect effects, nonlinear dose-response, and user-provided covariates. We adjust the procedure to estimate the FDR correctly even when measured against incomplete gold standards. However, benchmarking against chromatin immunoprecipitation (ChIP) and other gold standards reveals higher observed than reported FDR. This indicates that unmeasured confounding is a major driver of FDR in TRN inference. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Eric Kernfeld
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA
- Rebecca Keener
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA
- Patrick Cahan
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA; Institute for Cell Engineering, Johns Hopkins Medicine, Baltimore, MD, USA; Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA
- Alexis Battle
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Genetic Medicine, Johns Hopkins Medicine, Baltimore, MD, USA; Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, MD, USA; Data Science and AI Institute, Johns Hopkins University, Baltimore, MD, USA
3
Shutta KH, Balzer LB, Scholtens DM, Balasubramanian R. SpiderLearner: An ensemble approach to Gaussian graphical model estimation. Stat Med 2023; 42:2116-2133. [PMID: 37004994 DOI: 10.1002/sim.9714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Revised: 12/10/2022] [Accepted: 03/07/2023] [Indexed: 04/04/2023]
Abstract
Gaussian graphical models (GGMs) are a popular form of network model in which nodes represent features in multivariate normal data and edges reflect conditional dependencies between these features. GGM estimation is an active area of research. Currently available tools for GGM estimation require investigators to make several choices regarding algorithms, scoring criteria, and tuning parameters. An estimated GGM may be highly sensitive to these choices, and the accuracy of each method can vary based on structural characteristics of the network such as topology, degree distribution, and density. Because these characteristics are a priori unknown, it is not straightforward to establish universal guidelines for choosing a GGM estimation method. We address this problem by introducing SpiderLearner, an ensemble method that constructs a consensus network from multiple estimated GGMs. Given a set of candidate methods, SpiderLearner estimates the optimal convex combination of results from each method using a likelihood-based loss function. $K$-fold cross-validation is applied in this process, reducing the risk of overfitting. In simulations, SpiderLearner performs better than or comparably to the best candidate methods according to a variety of metrics, including relative Frobenius norm and out-of-sample likelihood. We apply SpiderLearner to publicly available ovarian cancer gene expression data including 2013 participants from 13 diverse studies, demonstrating our tool's potential to identify biomarkers of complex disease. SpiderLearner is implemented as flexible, extensible, open-source code in the R package ensembleGGM at https://github.com/katehoffshutta/ensembleGGM.
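The idea of combining candidate GGM estimates with likelihood-chosen convex weights can be illustrated with a minimal Python sketch. It is not the ensembleGGM implementation: it handles only two candidates, uses a plain grid search, and scores against a single held-out covariance matrix instead of $K$-fold cross-validation. The function names `gaussian_loglik` and `best_convex_weight` are hypothetical.

```python
import numpy as np

def gaussian_loglik(theta, s):
    """Gaussian log-likelihood (up to constants and scaling) of a precision
    matrix theta given a sample covariance s: log det(theta) - trace(s @ theta)."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        # Determinant is not positive, so theta is not a valid precision matrix.
        return -np.inf
    return logdet - np.trace(s @ theta)

def best_convex_weight(theta_a, theta_b, s_holdout, n_grid=101):
    """Grid-search the weight w in [0, 1] that maximizes the held-out
    likelihood of the combination w * theta_a + (1 - w) * theta_b."""
    weights = np.linspace(0.0, 1.0, n_grid)
    scores = [gaussian_loglik(w * theta_a + (1.0 - w) * theta_b, s_holdout)
              for w in weights]
    return float(weights[int(np.argmax(scores))])

# Toy example (assumed data): one candidate matches the held-out covariance
# exactly, the other ignores all dependence between the two features.
s = np.array([[1.0, 0.5],
              [0.5, 1.0]])
theta_good = np.linalg.inv(s)   # candidate that fits the held-out data
theta_poor = np.eye(2)          # candidate with no edges
w = best_convex_weight(theta_good, theta_poor, s)
print(w)  # weight placed on theta_good
```

Because the Gaussian log-likelihood is concave in the precision matrix and is maximized at the inverse of the covariance, the search puts all weight on the well-fitting candidate in this toy case. With real data the weight would typically fall strictly between 0 and 1.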
Affiliation(s)
- Katherine H Shutta
  - Department of Biostatistics and Epidemiology, University of Massachusetts-Amherst, Amherst, Massachusetts, USA
  - Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Laura B Balzer
  - Division of Biostatistics, University of California-Berkeley, Berkeley, California, USA
- Denise M Scholtens
  - Division of Biostatistics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Raji Balasubramanian
  - Department of Biostatistics and Epidemiology, University of Massachusetts-Amherst, Amherst, Massachusetts, USA
4
Yu H, Wu S, Dauwels J. Efficient Variational Bayes Learning of Graphical Models With Smooth Structural Changes. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:475-488. [PMID: 34990351 DOI: 10.1109/tpami.2022.3140886] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 06/14/2023]
Abstract
Estimating a sequence of dynamic undirected graphical models, in which adjacent graphs share similar structures, is of paramount importance in various social, financial, biological, and engineering systems, since the evolution of such networks can be utilized, for example, to spot trends, detect anomalies, predict vulnerability, and evaluate the impact of interventions. Existing methods for learning dynamic graphical models require the tuning parameters that control the graph sparsity and the temporal smoothness to be selected via brute-force grid search. Furthermore, these methods are computationally burdensome, with time complexity $O(NP^3)$ for $P$ variables and $N$ time points. As a remedy, we propose a low-complexity tuning-free Bayesian approach, named BASS. Specifically, we impose temporally dependent spike and slab priors on the graphs such that they are sparse and vary smoothly across time. An efficient variational inference algorithm based on natural gradients is then derived to learn the graph structures from the data in an automatic manner. Owing to the pseudo-likelihood and the mean-field approximation, the time complexity of BASS is only $O(NP^2)$. To cope with the local maxima problem of variational inference, we resort to simulated annealing and propose a method based on bootstrapping of the observations to generate the annealing noise. We provide numerical evidence that BASS outperforms existing methods on synthetic data in terms of structure estimation, while being more efficient, especially when the dimension $P$ becomes high. We further apply the approach to the stock return data of 78 banks from 2005 to 2013 and find that the number of edges in the financial network as a function of time contains three peaks, coinciding with the 2008 global financial crisis and the two subsequent European debt crises.
On the other hand, by identifying the frequency-domain resemblance to the time-varying graphical models, we show that BASS can be extended to learning frequency-varying inverse spectral density matrices, and further yields graphical models for multivariate stationary time series. As an illustration, we analyze scalp EEG signals of patients at the early stages of Alzheimer's disease (AD) and show that the brain networks extracted by BASS can better distinguish between the patients and the healthy controls.
5
Dai X, Lyu X, Li L. Kernel Knockoffs Selection for Nonparametric Additive Models. J Am Stat Assoc 2022; 118:2158-2170. [PMID: 38143786 PMCID: PMC10746135 DOI: 10.1080/01621459.2022.2039671] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/02/2021] [Accepted: 01/07/2022] [Indexed: 12/17/2022]
Abstract
Thanks to its fine balance between model flexibility and interpretability, the nonparametric additive model has been widely used, and variable selection for this type of model has been frequently studied. However, none of the existing solutions can control the false discovery rate (FDR) unless the sample size tends to infinity. The knockoff framework is a recent proposal that can address this issue, but few knockoff solutions are directly applicable to nonparametric models. In this article, we propose a novel kernel knockoffs selection procedure for the nonparametric additive model. We integrate three key components: the knockoffs, the subsampling for stability, and the random feature mapping for nonparametric function approximation. We show that the proposed method is guaranteed to control the FDR for any sample size, and achieves a power that approaches one as the sample size tends to infinity. We demonstrate the efficacy of our method through intensive simulations and comparisons with the alternative solutions. Our proposal thus makes useful contributions to the methodology of nonparametric variable selection, FDR-based inference, as well as knockoffs.
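For readers unfamiliar with how knockoffs control the FDR, the generic selection step of the knockoff framework (the "knockoff+" threshold) can be sketched in Python as below. This is only the standard thresholding rule applied to already-computed feature statistics. It does not cover the kernel construction, subsampling, or random feature mapping described in the abstract, and the statistics `w` are made-up numbers for illustration.

```python
import numpy as np

def knockoff_plus_threshold(w, q):
    """Return the smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q,
    or infinity if no such t exists."""
    w = np.asarray(w, dtype=float)
    candidates = np.unique(np.abs(w[w != 0]))  # np.unique returns sorted values
    for t in candidates:
        fdp_estimate = (1 + np.sum(w <= -t)) / max(1, np.sum(w >= t))
        if fdp_estimate <= q:
            return float(t)
    return float("inf")

def knockoff_select(w, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, wj in enumerate(w) if wj >= t]

# Hypothetical statistics: large positive values favor the original feature
# over its knockoff copy; negative values favor the knockoff.
w = [5.0, 4.0, 3.5, 3.0, -0.5, 2.0, 1.5, -1.0, 0.2, 2.5]
threshold = knockoff_plus_threshold(w, q=0.25)
selected = knockoff_select(w, q=0.25)
print(threshold, selected)
```

The count of negative statistics below `-t` acts as an estimate of the number of false positives above `t`, which is why the ratio in the loop serves as a conservative estimate of the false discovery proportion.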
Affiliation(s)
- Lexin Li
  - University of California, Berkeley
6
Haldimann D, Guerriero M, Maret Y, Bonavita N, Ciarlo G, Sabbadin M. A Scalable Algorithm for Identifying Multiple-Sensor Faults Using Disentangled RNNs. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:1093-1106. [PMID: 33290232 DOI: 10.1109/tnnls.2020.3040224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
The problem of detecting and identifying sensor faults is critical for efficient, safe, regulatory-compliant, and sustainable operations of modern industrial processing systems. The increasing complexity of such systems brings, however, new challenges for sensor fault detection and sensor fault isolation (SFD-SFI). One of the key enablers for any SFD-SFI method is analytical redundancy, which is provided by an analytical model of sensor observations derived from first principles or identified from historical data. As defective sensors generate measurements that are inconsistent with their expected behavior as defined by the model, SFD amounts to the generation and monitoring of residuals between sensor observations and model predictions. In this article, we introduce a disentangled recurrent neural network (RNN) with the objective of coping with the smearing-out effect, i.e., where the propagation of a sensor fault to nonfaulty sensors results in large and misleading residuals. The introduction of a probabilistic model for the residual generation allows us to develop a novel procedure for the identification of the faulty sensors. The computational complexity of the proposed algorithm is linear in the number of sensors, as opposed to the combinatorial nature of the SFI problem. Finally, we empirically verify the performance of the proposed SFD-SFI architecture using a real data set collected at a petrochemical plant.
7
Abstract
Cancer is a genetic disease in which multiple genes are perturbed. Thus, information about the regulatory relationships between genes is necessary for the identification of biomarkers and therapeutic targets. In this review, methods for inference of gene regulatory networks (GRNs) from transcriptomics data that are used in cancer research are introduced. The methods are classified into three categories according to the analysis model. The first category includes methods that use pair-wise measures between genes, including correlation coefficient and mutual information. The second category includes methods that determine the genetic regulatory relationship using multivariate measures, which consider the expression profiles of all genes concurrently. The third category includes methods using supervised and integrative approaches. The supervised approach estimates the regulatory relationship using a supervised learning method that constructs a regression or classification model for predicting whether there is a regulatory relationship between genes with input data of gene expression profiles and class labels of prior biological knowledge. The integrative method is an expansion of the supervised method and uses more data and biological knowledge for predicting the regulatory relationship. Furthermore, simulation and experimental validation of the estimated GRNs are also discussed in this review. This review identified that most GRN inference methods are not specific for cancer transcriptome data, and such methods are required for better understanding of cancer pathophysiology. In addition, more systematic methods for validation of the estimated GRNs need to be developed in the context of cancer biology.
8
Tu JJ, Ou-Yang L, Zhu Y, Yan H, Qin H, Zhang XF. Differential network analysis by simultaneously considering changes in gene interactions and gene expression. Bioinformatics 2021; 37:4414-4423. [PMID: 34245246 DOI: 10.1093/bioinformatics/btab502] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/14/2021] [Revised: 06/13/2021] [Accepted: 07/05/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Differential network analysis is an important tool to investigate the rewiring of gene interactions under different conditions. Several computational methods have been developed to estimate differential networks from gene expression data, but most of them do not consider that gene network rewiring may be driven by the differential expression of individual genes. New differential network analysis methods that simultaneously take account of the changes in gene interactions and changes in expression levels are needed. RESULTS: In this paper, we propose a differential network analysis method that considers the differential expression of individual genes when identifying differential edges. First, two hypothesis test statistics are used to quantify changes in partial correlations between gene pairs and changes in expression levels for individual genes. Then, an optimization framework is proposed to combine the two test statistics so that the resulting differential network has a hierarchical property, where a differential edge can be considered only if at least one of the two involved genes is differentially expressed. Simulation results indicate that our method outperforms current state-of-the-art methods. We apply our method to identify the differential networks between the luminal A and basal-like subtypes of breast cancer and those between acute myeloid leukemia and normal samples. Hub nodes in the differential networks estimated by our method, including both differentially and non-differentially expressed genes, have important biological functions. AVAILABILITY: The source code is available at https://github.com/Zhangxf-ccnu/chNet. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
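The hierarchical property described in this abstract can be illustrated with a small Python sketch. It applies the constraint as a hard post-filter on a list of candidate edges, which is a simplification: the paper enforces the property through an optimization framework that combines two test statistics, and the sketch below is not the chNet code. The function name `apply_hierarchy` and the gene labels are hypothetical.

```python
def apply_hierarchy(candidate_edges, de_genes):
    """Keep a candidate differential edge only if at least one of its two
    genes is differentially expressed (the hierarchical property)."""
    de = set(de_genes)
    return [(a, b) for (a, b) in candidate_edges if a in de or b in de]

# Hypothetical gene labels, for illustration only.
candidate_edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4")]
de_genes = ["G1", "G4"]
kept = apply_hierarchy(candidate_edges, de_genes)
print(kept)  # the G2-G3 edge is dropped: neither gene is differentially expressed
```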
Affiliation(s)
- Jia-Juan Tu
  - School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
- Le Ou-Yang
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Yuan Zhu
  - School of Automation, China University of Geosciences, Wuhan, 430074, China; Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, China University of Geosciences, Wuhan, 430074, China
- Hong Yan
  - Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
- Hong Qin
  - Department of Statistics, Zhongnan University of Economics and Law, Wuhan, 430073, China
- Xiao-Fei Zhang
  - School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
9
Joint estimation of heterogeneous exponential Markov Random Fields through an approximate likelihood inference. J Stat Plan Inference 2020. [DOI: 10.1016/j.jspi.2020.04.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/20/2022]
10
Petralia F, Wang L, Peng J, Yan A, Zhu J, Wang P. A new method for constructing tumor specific gene co-expression networks based on samples with tumor purity heterogeneity. Bioinformatics 2019; 34:i528-i536. [PMID: 29949994 PMCID: PMC6022554 DOI: 10.1093/bioinformatics/bty280] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 11/14/2022] Open
Abstract
Motivation: Tumor tissue samples often contain an unknown fraction of stromal cells. This problem, widely known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with mean shift between non-tumor and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose Tumor Specific Net (TSNet), a new method which constructs tumor-cell specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models tumor purity percentage in each tumor sample. Results: Using extensive synthetic data experiments, we demonstrate that TSNet outperforms a standard graphical model which does not account for TPH. We then apply TSNet to estimate tumor specific gene co-expression networks based on TCGA ovarian cancer RNAseq data. We identify novel co-expression modules and hub structure specific to tumor cells. Availability and implementation: R code can be found at https://github.com/petraf01/TSNet. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Francesca Petralia
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Li Wang
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Sema4, a Mount Sinai Venture, Stamford, CT, USA
- Jie Peng
  - Department of Statistics, University of California, Davis, Davis, CA, USA
- Arthur Yan
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Jun Zhu
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Sema4, a Mount Sinai Venture, Stamford, CT, USA
- Pei Wang
  - Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
11
Fan X, Fang K, Ma S, Wang S, Zhang Q. Assisted graphical model for gene expression data analysis. Stat Med 2019; 38:2364-2380. [PMID: 30854706 DOI: 10.1002/sim.8112] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/21/2018] [Revised: 12/16/2018] [Accepted: 01/09/2019] [Indexed: 11/12/2022]
Abstract
The analysis of gene expression data has been playing a pivotal role in recent biomedical research. For gene expression data, network analysis has been shown to be more informative and powerful than individual-gene and geneset-based analysis. Despite promising successes, network construction with gene expression data remains challenging because of the high dimensionality of the data and the often low sample sizes. In recent studies, a prominent trend is to conduct multidimensional profiling, under which data are collected on gene expressions as well as their regulators (copy number variations, methylation, microRNAs, SNPs, etc.). With the regulation relationship, regulators contain information on gene expressions and can potentially assist in estimating their characteristics. In this study, we develop an assisted graphical model (AGM) approach, which can effectively use information in regulators to improve the estimation of gene expression graphical structure. The proposed approach has an intuitive formulation and can adaptively accommodate different regulator scenarios. Its consistency properties are rigorously established. Extensive simulations and the analysis of a breast cancer gene expression data set demonstrate the practical effectiveness of the AGM.
Affiliation(s)
- Xinyan Fan
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Kuangnan Fang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
- Shuaichao Wang
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Qingzhao Zhang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China; Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China; The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
12
Morgan D, Tjärnberg A, Nordling TEM, Sonnhammer ELL. A generalized framework for controlling FDR in gene regulatory network inference. Bioinformatics 2018; 35:1026-1032. [DOI: 10.1093/bioinformatics/bty764] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 05/21/2018] [Revised: 08/23/2018] [Accepted: 08/28/2018] [Indexed: 12/23/2022] Open
Affiliation(s)
- Daniel Morgan
  - Department of Biochemistry and Biophysics, Stockholm Bioinformatics Center, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
- Andreas Tjärnberg
  - Department of Physics, Chemistry and Biology/Bioinformatics, Linköping University, Linköping, Sweden
- Torbjörn E M Nordling
  - Department of Mechanical Engineering, National Cheng Kung University, Tainan, Taiwan
- Erik L L Sonnhammer
  - Department of Biochemistry and Biophysics, Stockholm Bioinformatics Center, Science for Life Laboratory, Stockholm University, Stockholm, Sweden
13
Characterizing functional consequences of DNA copy number alterations in breast and ovarian tumors by spaceMap. J Genet Genomics 2018; 45:361-371. [PMID: 30057342 DOI: 10.1016/j.jgg.2018.07.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/15/2018] [Revised: 07/09/2018] [Accepted: 07/09/2018] [Indexed: 01/18/2023]
Abstract
We propose a novel conditional graphical model - spaceMap - to construct gene regulatory networks from multiple types of high dimensional omic profiles. A motivating application is to characterize the perturbation of DNA copy number alterations (CNAs) on downstream protein levels in tumors. Through a penalized multivariate regression framework, spaceMap jointly models high dimensional protein levels as responses and high dimensional CNAs as predictors. In this setup, spaceMap infers an undirected network among proteins together with a directed network encoding how CNAs perturb the protein network. spaceMap can be applied to learn other types of regulatory relationships from high dimensional molecular profiles, especially those exhibiting hub structures. Simulation studies show spaceMap has greater power in detecting regulatory relationships than competing methods. Additionally, spaceMap includes a network analysis toolkit for biological interpretation of inferred networks. We apply spaceMap to the CNAs, gene expression and proteomics data sets from CPTAC-TCGA breast (n=77) and ovarian (n=174) cancer studies. Each cancer exhibits disruption of 'ion transmembrane transport' and 'regulation from RNA polymerase II promoter' by CNA events unique to each cancer. Moreover, using protein levels as a response yields a more functionally-enriched network than using RNA expressions in both cancer types. The network results also help to pinpoint crucial cancer genes and provide insights on the functional consequences of important CNAs in breast and ovarian cancers. The R package spaceMap - including vignettes and documentation - is hosted on https://topherconley.github.io/spacemap.
14
Choi Y, Coram M, Peng J, Tang H. A Poisson Log-Normal Model for Constructing Gene Covariation Network Using RNA-seq Data. J Comput Biol 2017; 24:721-731. [PMID: 28557607 PMCID: PMC5510689 DOI: 10.1089/cmb.2017.0053] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/19/2022] Open
Abstract
Constructing expression networks using transcriptomic data is an effective approach for studying gene regulation. A popular approach for constructing such a network is based on the Gaussian graphical model (GGM), in which an edge between a pair of genes indicates that the expression levels of these two genes are conditionally dependent, given the expression levels of all other genes. However, GGMs are not appropriate for non-Gaussian data, such as those generated in RNA-seq experiments. We propose a novel statistical framework that maximizes a penalized likelihood, in which the observed count data follow a Poisson log-normal distribution. To overcome the computational challenges, we use Laplace's method to approximate the likelihood and its gradients, and apply the alternating directions method of multipliers to find the penalized maximum likelihood estimates. The proposed method is evaluated and compared with GGMs using both simulated and real RNA-seq data. The proposed method shows improved performance in detecting edges that represent covarying pairs of genes, particularly for edges connecting low-abundant genes and edges around regulatory hubs.
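The GGM edge definition used in this abstract (two genes are connected when they are conditionally dependent given all others) corresponds to nonzero off-diagonal entries of the precision matrix. The Python sketch below illustrates that standard relationship on a made-up three-variable precision matrix. It does not implement the Poisson log-normal model proposed in the paper, and the function names are hypothetical.

```python
import numpy as np

def partial_correlations(precision):
    """Partial correlation between variables i and j given all others:
    -theta_ij / sqrt(theta_ii * theta_jj), with ones on the diagonal."""
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def ggm_edges(precision, tol=1e-8):
    """Pairs (i, j), i < j, whose partial correlation is nonzero (above tol)."""
    pcor = partial_correlations(precision)
    p = pcor.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(pcor[i, j]) > tol]

# Assumed toy precision matrix for a chain 0 - 1 - 2: variables 0 and 2
# interact only through variable 1.
theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
pcor = partial_correlations(theta)
edges = ggm_edges(theta)
cov = np.linalg.inv(theta)  # marginal covariance implied by theta
print(edges)
print(cov[0, 2])
```

In this toy case variables 0 and 2 have a nonzero marginal covariance yet no edge, which is the distinction between a GGM and a plain correlation network.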
Affiliation(s)
- Yoonha Choi
  - Department of Genetics, Stanford University, Stanford, California
- Marc Coram
  - Department of Health Research and Policy, Stanford University, Stanford, California
- Jie Peng
  - Department of Statistics, University of California, Davis, Davis, California
- Hua Tang
  - Department of Genetics, Stanford University, Stanford, California
15
Zhao Y, Chung M, Johnson BA, Moreno CS, Long Q. Hierarchical Feature Selection Incorporating Known and Novel Biological Information: Identifying Genomic Features Related to Prostate Cancer Recurrence. J Am Stat Assoc 2017; 111:1427-1439. [PMID: 28435175 DOI: 10.1080/01621459.2016.1164051] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 02/08/2023]
Abstract
Our work is motivated by a prostate cancer study aimed at identifying mRNA and miRNA biomarkers that are predictive of cancer recurrence after prostatectomy. It has been shown in the literature that incorporating known biological information on pathway memberships and interactions among biomarkers improves feature selection of high-dimensional biomarkers in relation to disease risk. Biological information is often represented by graphs or networks, in which biomarkers are represented by nodes and interactions among them are represented by edges; however, biological information is often not fully known. For example, the role of microRNAs (miRNAs) in regulating gene expression is not fully understood and the miRNA regulatory network is not fully established, in which case new strategies are needed for feature selection. To this end, we treat unknown biological information as missing data (i.e., missing edges in graphs), different from commonly encountered missing data problems where variable values are missing. We propose a new concept of imputing unknown biological information based on observed data and define the imputed information as the novel biological information. In addition, we propose a hierarchical group penalty to encourage sparsity and feature selection at both the pathway level and the within-pathway level, which, combined with the imputation step, allows for incorporation of known and novel biological information. While it is applicable to general regression settings, we develop and investigate the proposed approach in the context of semiparametric accelerated failure time models motivated by our data example. Data application and simulation studies show that incorporation of novel biological information improves performance in risk prediction and feature selection and the proposed penalty outperforms the extensions of several existing penalties.
Affiliation(s)
- Yize Zhao
- Postdoctoral Fellow, Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709
- Matthias Chung
- Assistant Professor, Department of Mathematics, Virginia Tech, Blacksburg, VA 24061
- Brent A Johnson
- Associate Professor, Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
- Carlos S Moreno
- Associate Professor, Department of Pathology and Laboratory Medicine
- Qi Long
- Associate Professor, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322
|
16
|
An CI, Ichihashi Y, Peng J, Sinha NR, Hagiwara N. Transcriptome Dynamics and Potential Roles of Sox6 in the Postnatal Heart. PLoS One 2016; 11:e0166574. [PMID: 27832192 PMCID: PMC5104335 DOI: 10.1371/journal.pone.0166574] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 07/16/2016] [Accepted: 10/31/2016] [Indexed: 01/20/2023] Open
Abstract
The postnatal heart undergoes highly coordinated developmental processes culminating in the complex physiologic properties of the adult heart. The molecular mechanisms of postnatal heart development remain largely unexplored despite their important clinical implications. To gain an integrated view of the dynamic changes in gene expression during postnatal heart development at the organ level, time-series transcriptome analyses of the postnatal hearts of neonatal through adult mice (P1, P7, P14, P30, and P60) were performed using a newly developed bioinformatics pipeline. We identified functional gene clusters by principal component analysis with self-organizing map clustering which revealed organized, discrete gene expression patterns corresponding to biological functions associated with the neonatal, juvenile and adult stages of postnatal heart development. Using weighted gene co-expression network analysis with bootstrap inference for each of these functional gene clusters, highly robust hub genes were identified which likely play key roles in regulating expression of co-expressed, functionally linked genes. Additionally, motivated by the role of the transcription factor Sox6 in the functional maturation of skeletal muscle, the role of Sox6 in the postnatal maturation of cardiac muscle was investigated. Differentially expressed transcriptome analyses between Sox6 knockout (KO) and control hearts uncovered significant upregulation of genes involved in cell proliferation at postnatal day 7 (P7) in the Sox6 KO heart. This result was validated by detecting mitotically active cells in the P7 Sox6 KO heart. The current report provides a framework for the complex molecular processes of postnatal heart development, thus enabling systematic dissection of the developmental regression observed in the stressed and failing adult heart.
Affiliation(s)
- Chung-Il An
- Division of Cardiovascular Medicine, Department of Internal Medicine, University of California Davis, Davis, California, United States of America
- Yasunori Ichihashi
- Department of Plant Biology, University of California Davis, Davis, California, United States of America
- Jie Peng
- Department of Statistics, University of California Davis, Davis, California, United States of America
- Neelima R. Sinha
- Department of Plant Biology, University of California Davis, Davis, California, United States of America
- Nobuko Hagiwara
- Division of Cardiovascular Medicine, Department of Internal Medicine, University of California Davis, Davis, California, United States of America
|
17
|
Abram SV, Helwig NE, Moodie CA, DeYoung CG, MacDonald AW, Waller NG. Bootstrap Enhanced Penalized Regression for Variable Selection with Neuroimaging Data. Front Neurosci 2016; 10:344. [PMID: 27516732 PMCID: PMC4964314 DOI: 10.3389/fnins.2016.00344] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 04/20/2016] [Accepted: 07/08/2016] [Indexed: 11/13/2022] Open
Abstract
Recent advances in fMRI research highlight the use of multivariate methods for examining whole-brain connectivity. Complementary data-driven methods are needed for determining the subset of predictors related to individual differences. Although commonly used for this purpose, ordinary least squares (OLS) regression may not be ideal due to multi-collinearity and over-fitting issues. Penalized regression is a promising and underutilized alternative to OLS regression. In this paper, we propose a nonparametric bootstrap quantile (QNT) approach for variable selection with neuroimaging data. We use real and simulated data, as well as annotated R code, to demonstrate the benefits of our proposed method. Our results illustrate the practical potential of our proposed bootstrap QNT approach. Our real data example demonstrates how our method can be used to relate individual differences in neural network connectivity with an externalizing personality measure. Also, our simulation results reveal that the QNT method is effective under a variety of data conditions. Penalized regression yields more stable estimates and sparser models than OLS regression in situations with large numbers of highly correlated neural predictors. Our results demonstrate that penalized regression is a promising method for examining associations between neural predictors and clinically relevant traits or behaviors. These findings have important implications for the growing field of functional connectivity research, where multivariate methods produce numerous, highly correlated brain networks.
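The bootstrap-quantile idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' annotated R code: it assumes that a predictor is selected when its bootstrap percentile interval excludes zero, and it uses closed-form ridge regression as a stand-in for whichever penalized estimator the paper employs. The function names (`ridge_fit`, `bootstrap_quantile_select`) and all parameter values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients; X and y are assumed to be centered."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def bootstrap_quantile_select(X, y, lam=1.0, n_boot=500, alpha=0.05, seed=0):
    """Select predictors whose bootstrap percentile interval excludes zero.

    Assumption: ridge is a stand-in penalized estimator, chosen only
    because it has a closed form.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample rows with replacement
        Xb = X[idx] - X[idx].mean(axis=0)      # center the resampled predictors
        yb = y[idx] - y[idx].mean()            # center the resampled response
        coefs[b] = ridge_fit(Xb, yb, lam)
    lower = np.quantile(coefs, alpha / 2, axis=0)
    upper = np.quantile(coefs, 1 - alpha / 2, axis=0)
    selected = (lower > 0) | (upper < 0)       # interval does not contain zero
    return selected, lower, upper

# Simulated example: only the first two predictors carry signal.
rng = np.random.default_rng(42)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:2] = [2.0, -1.5]
y = X @ beta + 0.5 * rng.standard_normal(n)

selected, lower, upper = bootstrap_quantile_select(X, y)
print(np.flatnonzero(selected))
```

Because each interval is built at a nominal level, an occasional noise predictor can still be selected; the sketch makes no attempt to correct for multiple comparisons.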
Affiliation(s)
- Samantha V Abram
- Department of Psychology, University of Minnesota, Minneapolis, MN, USA
- Nathaniel E Helwig
- Department of Psychology, University of Minnesota, Minneapolis, MN, USA; School of Statistics, University of Minnesota, Minneapolis, MN, USA
- Craig A Moodie
- Department of Psychology, Stanford University, Stanford, CA, USA
- Colin G DeYoung
- Department of Psychology, University of Minnesota, Minneapolis, MN, USA
- Angus W MacDonald
- Department of Psychology, University of Minnesota, Minneapolis, MN, USA; Department of Psychiatry, University of Minnesota, Minneapolis, MN, USA
- Niels G Waller
- Department of Psychology, University of Minnesota, Minneapolis, MN, USA
|
18
|
Abstract
Motivated by the analysis of gene expression data measured in different tissues or disease states, we consider joint estimation of multiple precision matrices to effectively utilize the partially shared graphical structures of the corresponding graphs. The procedure is based on a weighted constrained ℓ∞/ℓ1 minimization, which can be effectively implemented by second-order cone programming. Compared to separate estimation methods, the proposed joint estimation method leads to estimators that converge to the true precision matrices faster. Under certain regularity conditions, the proposed procedure leads to exact graph structure recovery with probability tending to 1. Simulation studies show that the proposed joint estimation methods outperform other methods in graph structure recovery. The method is illustrated through an analysis of ovarian cancer gene expression data. The results indicate that patients with the poor prognostic subtype lack some important links among the genes in the apoptosis pathway.
Affiliation(s)
- T Tony Cai
- Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104
- Hongzhe Li
- Professor of Biostatistics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104
- Weidong Liu
- Professor, Department of Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, Shanghai, China
- Jichun Xie
- Assistant Professor, Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27707
|
19
|
Deng W, Geng Z, Li H. Learning Local Directed Acyclic Graphs Based on Multivariate Time Series Data. Ann Appl Stat 2014; 7:1249-1835. [PMID: 24465291 DOI: 10.1214/13-aoas635] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
Abstract
Multivariate time series (MTS) data, such as time course gene expression data in genomics, are often collected to study the dynamic nature of a system. These data provide important information about the causal dependency among a set of random variables. In this paper, we introduce a computationally efficient algorithm to learn directed acyclic graphs (DAGs) based on MTS data, focusing on learning the local structure of a given target variable. Our algorithm is based on learning all parents (P), all children (C) and some descendants (D) (PCD) iteratively, utilizing the time order of the variables to orient the edges. This time series PCD-PCD algorithm (tsPCD-PCD) extends the previous PCD-PCD algorithm to dependent observations and utilizes composite likelihood ratio tests (CLRTs) for testing conditional independence. We present the asymptotic distribution of the CLRT statistic and show that tsPCD-PCD is guaranteed to recover the true DAG structure when the faithfulness condition holds and the tests correctly reject the null hypotheses. Simulation studies show that the CLRTs are valid and perform well even when the sample sizes are small. In addition, the tsPCD-PCD algorithm outperforms the PCD-PCD algorithm in recovering the local graph structures. We illustrate the algorithm by analyzing time course gene expression data related to mouse T-cell activation.
Affiliation(s)
- Wanlu Deng
- Department of Statistics and Probability, Peking University, Beijing 100871, PR China. Department of Biostatistics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
- Zhi Geng
- Department of Statistics and Probability, Peking University, Beijing 100871, PR China. Department of Biostatistics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
- Hongzhe Li
- Department of Statistics and Probability, Peking University, Beijing 100871, PR China. Department of Biostatistics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104, USA
|
20
|
Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. J R Stat Soc Series B Stat Methodol 2013; 76:373-397. [PMID: 24817823 DOI: 10.1111/rssb.12033] [Citation(s) in RCA: 340] [Impact Index Per Article: 30.9] [Indexed: 01/28/2023]
Abstract
We consider the problem of estimating multiple related Gaussian graphical models from a high-dimensional data set with observations belonging to distinct classes. We propose the joint graphical lasso, which borrows strength across the classes in order to estimate multiple graphical models that share certain characteristics, such as the locations or weights of nonzero edges. Our approach is based upon maximizing a penalized log likelihood. We employ generalized fused lasso or group lasso penalties, and implement a fast ADMM algorithm to solve the corresponding convex optimization problems. The performance of the proposed method is illustrated through simulated and real data examples.
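For orientation, the penalized log-likelihood mentioned in this abstract is commonly written as below. The notation is an assumption introduced here and does not appear in the abstract: K classes, with n_k observations, sample covariance S^{(k)}, and precision matrix Θ^{(k)} = (θ^{(k)}_{ij}) for class k, and tuning parameters λ_1, λ_2 ≥ 0. The two penalties shown correspond to the fused lasso and group lasso variants named in the abstract.

```latex
\begin{aligned}
\{\hat{\Theta}^{(k)}\} &= \arg\max_{\Theta^{(1)},\dots,\Theta^{(K)} \succ 0}
  \sum_{k=1}^{K} n_k \left[ \log\det \Theta^{(k)}
  - \operatorname{tr}\!\left(S^{(k)} \Theta^{(k)}\right) \right]
  - P(\{\Theta\}), \\
P_{\text{fused}}(\{\Theta\}) &= \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j}
  \left|\theta^{(k)}_{ij}\right|
  + \lambda_2 \sum_{k < k'} \sum_{i,j}
  \left|\theta^{(k)}_{ij} - \theta^{(k')}_{ij}\right|, \\
P_{\text{group}}(\{\Theta\}) &= \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j}
  \left|\theta^{(k)}_{ij}\right|
  + \lambda_2 \sum_{i \neq j}
  \Big( \sum_{k=1}^{K} \big(\theta^{(k)}_{ij}\big)^2 \Big)^{1/2}.
\end{aligned}
```

In both penalties the λ_1 term encourages sparsity within each class. The λ_2 term couples the classes: the fused form pulls corresponding entries toward equal values, while the group form encourages a shared pattern of zeros across classes.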
Affiliation(s)
- Pei Wang
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, USA
|
21
|
Li S, Hsu L, Peng J, Wang P. Bootstrap Inference for Network Construction with an Application to a Breast Cancer Microarray Study. Ann Appl Stat 2013; 7:391-417. [PMID: 24563684 DOI: 10.1214/12-aoas589] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 02/04/2023]
Abstract
Gaussian Graphical Models (GGMs) have been used to construct genetic regulatory networks where regularization techniques are widely used since the network inference usually falls into a high-dimension-low-sample-size scenario. Yet, finding the right amount of regularization can be challenging, especially in an unsupervised setting where traditional methods such as BIC or cross-validation often do not work well. In this paper, we propose a new method - Bootstrap Inference for Network COnstruction (BINCO) - to infer networks by directly controlling the false discovery rates (FDRs) of the selected edges. This method fits a mixture model for the distribution of edge selection frequencies to estimate the FDRs, where the selection frequencies are calculated via model aggregation. This method is applicable to a wide range of applications beyond network construction. When we applied our proposed method to building a gene regulatory network with microarray expression breast cancer data, we were able to identify high-confidence edges and well-connected hub genes that could potentially play important roles in understanding the underlying biological processes of breast cancer.
Affiliation(s)
- Shuang Li
- Fred Hutchinson Cancer Research Center, M2-B500, 1100 Fairview Ave N., Seattle, WA 98109, USA
- Li Hsu
- Fred Hutchinson Cancer Research Center, M2-B500, 1100 Fairview Ave N., Seattle, WA 98109, USA
- Jie Peng
- Department of Statistics, University of California, Davis, Mathematical Sciences Building, One Shields Avenue, Davis, CA 95616
- Pei Wang
- Fred Hutchinson Cancer Research Center, M2-B500, 1100 Fairview Ave N., Seattle, WA 98109, USA
|