1
Huang YJ, Mukherjee R, Hsiao CK. Probabilistic edge inference of gene networks with Markov random field-based Bayesian learning. Front Genet 2022; 13:1034946. [DOI: 10.3389/fgene.2022.1034946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2022] [Accepted: 10/24/2022] [Indexed: 11/11/2022]
Abstract
Current algorithms for gene regulatory network construction based on Gaussian graphical models focus on the deterministic decision of whether an edge exists. Both the probabilistic inference of edge existence and the relative strength of edges are often overlooked, either because the computational algorithms cannot account for this uncertainty or because it is not straightforward to implement. In this study, we combine the Bayesian Markov random field and the conditional autoregressive (CAR) model to tackle these two tasks simultaneously. The uncertainty of edge existence and the relative strength of edges can be measured and quantified based on a Bayesian model such as the CAR model and the spike-and-slab lasso prior. In addition, the strength of the edges can be utilized to prioritize the importance of the edges in a network graph. Simulations and a glioblastoma cancer study were carried out to assess the proposed model’s performance and to compare it with existing methods when a binary decision is of interest. The proposed approach shows stable performance and may provide novel structures with biological insights.
2
Molstad AJ, Sun W, Hsu L. A covariance-enhanced approach to multi-tissue joint eQTL mapping with application to transcriptome-wide association studies. Ann Appl Stat 2021; 15:998-1016. [PMID: 34413922 DOI: 10.1214/20-aoas1432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
Transcriptome-wide association studies based on genetically predicted gene expression have the potential to identify novel regions associated with various complex traits. It has been shown that incorporating expression quantitative trait loci (eQTLs) corresponding to multiple tissue types can improve power for association studies involving complex etiology. In this article, we propose a new multivariate response linear regression model and method for predicting gene expression in multiple tissues simultaneously. Unlike existing methods for multi-tissue joint eQTL mapping, our approach incorporates tissue-tissue expression correlation, which allows us to more efficiently handle missing expression measurements and more accurately predict gene expression using a weighted summation of eQTL genotypes. We show through simulation studies that our approach performs better than the existing methods in many scenarios. We use our method to estimate eQTL weights for 29 tissues collected by GTEx, and show that our approach significantly improves expression prediction accuracy compared to competitors. Using our eQTL weights, we perform a multi-tissue-based S-MultiXcan [2] transcriptome-wide association study and show that our method leads to more discoveries in novel regions and more discoveries overall than the existing methods. Estimated eQTL weights and code for implementing the method are available for download online at github.com/ajmolstad/MTeQTLResults.
3
Zhang S, Hu X, Luo Z, Jiang Y, Sun Y, Ma S. Biomarker-guided heterogeneity analysis of genetic regulations via multivariate sparse fusion. Stat Med 2021; 40:3915-3936. [PMID: 33906263 PMCID: PMC8277716 DOI: 10.1002/sim.9006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2020] [Revised: 04/07/2021] [Accepted: 04/07/2021] [Indexed: 11/06/2022]
Abstract
Heterogeneity is a hallmark of many complex diseases. There are multiple ways of defining heterogeneity, among which the heterogeneity in genetic regulations, for example, gene expressions (GEs) by copy number variations (CNVs), and methylation, has been suggested but little investigated. Heterogeneity in genetic regulations can be linked with disease severity, progression, and other traits and is biologically important. However, the analysis can be very challenging with the high dimensionality of both sides of regulation as well as sparse and weak signals. In this article, we consider the scenario where subjects form unknown subgroups, and each subgroup has unique genetic regulation relationships. Further, such heterogeneity is "guided" by a known biomarker. We develop a multivariate sparse fusion (MSF) approach, which innovatively applies the penalized fusion technique to simultaneously determine the number and structure of subgroups and regulation relationships within each subgroup. An effective computational algorithm is developed, and extensive simulations are conducted. The analysis of heterogeneity in the GE-CNV regulations in melanoma and GE-methylation regulations in stomach cancer using the TCGA data leads to interesting findings.
Affiliation(s)
- Sanguo Zhang
  - School of Mathematical Sciences, and Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Science, Beijing, China
- Xiaonan Hu
  - School of Mathematical Sciences, and Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Science, Beijing, China
- Ziye Luo
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, University of Memphis, Tennessee, USA
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Shuangge Ma
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
  - Department of Biostatistics, Yale University, Connecticut, USA
4
Diaz-Ramirez LG, Lee SJ, Smith AK, Gan S, Boscardin WJ. A Novel Method for Identifying a Parsimonious and Accurate Predictive Model for Multiple Clinical Outcomes. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 204:106073. [PMID: 33831724 PMCID: PMC8098121 DOI: 10.1016/j.cmpb.2021.106073] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/01/2020] [Accepted: 03/22/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND AND OBJECTIVE Most methods for developing clinical prognostic models focus on identifying parsimonious and accurate models to predict a single outcome; however, patients and providers often want to predict multiple outcomes simultaneously. As an example, for older adults one is often interested in predicting nursing home admission as well as mortality. We propose and evaluate a novel predictor-selection computing method for multiple outcomes and provide the code for its implementation.
METHODS Our proposed algorithm selected the best subset of common predictors based on the minimum average normalized Bayesian Information Criterion (BIC) across outcomes: the Best Average BIC (baBIC) method. We compared the predictive accuracy (Harrell's C-statistic) and parsimony (number of predictors) of the model obtained using the baBIC method with: 1) a subset of common predictors obtained from the union of optimal models for each outcome (Union method), 2) a subset obtained from the intersection of optimal models for each outcome (Intersection method), and 3) a model with no variable selection (Full method). We used case-study data from the Health and Retirement Study (HRS) to demonstrate our method and conducted a simulation study to investigate performance.
RESULTS In the case-study data and simulations, the average Harrell's C-statistics across outcomes of the models obtained with the baBIC and Union methods were comparable. Despite the similar discrimination, the baBIC method produced more parsimonious models than the Union method. In contrast, the models selected with the Intersection method were the most parsimonious but had the worst predictive accuracy, and the opposite was true for the Full method. In the simulations, the baBIC method performed well: most of the time it identified many of the predictors selected in the baBIC model of the case-study data, and in the majority of the simulations it excluded those not selected.
CONCLUSIONS Our method identified a common subset of variables to predict multiple clinical outcomes with superior balance between parsimony and predictive accuracy to current methods.
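The selection rule described in this abstract (choose the common predictor subset with the minimum average normalized BIC across outcomes) can be illustrated with a minimal Python sketch. Several details are assumptions made only for illustration and are not taken from the paper: outcomes are modeled with ordinary least squares rather than survival models, BIC is min-max normalized within each outcome, and all subsets are enumerated exhaustively. The function names `gaussian_bic` and `best_average_bic` are hypothetical.

```python
from itertools import combinations

import numpy as np


def gaussian_bic(X, y):
    """BIC of a least-squares fit with intercept (assumed Gaussian errors)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    n_params = design.shape[1]
    return n * np.log(rss / n) + n_params * np.log(n)


def best_average_bic(X, outcomes):
    """Return the predictor subset minimizing the average normalized BIC."""
    p = X.shape[1]
    subsets = [s for r in range(p + 1) for s in combinations(range(p), r)]
    # bic[m, j] is the BIC of subset m fitted to outcome j.
    bic = np.array(
        [[gaussian_bic(X[:, list(s)], y) for y in outcomes] for s in subsets]
    )
    # Assumed normalization: rescale each outcome's BIC values to [0, 1]
    # so that no single outcome dominates the average.
    lo, hi = bic.min(axis=0), bic.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    normalized = (bic - lo) / span
    return subsets[int(np.argmin(normalized.mean(axis=1)))]


# Toy data: two outcomes that both depend on predictors 0 and 1 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_a = 2 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
y_b = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
print(best_average_bic(X_demo, [y_a, y_b]))
```

Exhaustive enumeration is only feasible for a handful of candidate predictors; a stepwise or other heuristic search would be needed for larger sets.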
Affiliation(s)
- L Grisell Diaz-Ramirez
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- Sei J Lee
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- Alexander K Smith
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- Siqi Gan
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- W John Boscardin
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
5
Zhou Y, Song PXK, Wen X. Structural factor equation models for causal network construction via directed acyclic mixed graphs. Biometrics 2021; 77:573-586. [PMID: 32627167 PMCID: PMC8240035 DOI: 10.1111/biom.13322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/08/2018] [Accepted: 05/29/2020] [Indexed: 11/30/2022]
Abstract
Directed acyclic mixed graphs (DAMGs) provide a useful representation of network topology with both directed and undirected edges subject to the restriction of no directed cycles in the graph. This graphical framework may arise in many biomedical studies, for example, when a directed acyclic graph (DAG) of interest is contaminated with undirected edges induced by some unobserved confounding factors (e.g., unmeasured environmental factors). Directed edges in a DAG are widely used to evaluate causal relationships among variables in a network, but detecting them is challenging when the underlying causality is obscured by some shared latent factors. The objective of this paper is to develop an effective structural equation model (SEM) method to extract reliable causal relationships from a DAMG. The proposed approach, termed structural factor equation model (SFEM), uses the SEM to capture the network topology of the DAG while accounting for the undirected edges in the graph with a factor analysis model. The latent factors in the SFEM enable the identification and removal of undirected edges, leading to a simpler and more interpretable causal network. The proposed method is evaluated and compared to existing methods through extensive simulation studies, and illustrated through the construction of gene regulatory networks related to breast cancer.
Affiliation(s)
- Yan Zhou
  - Gilead Sciences, Foster City, California
- Peter X.-K. Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI
- Xiaoquan Wen
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI
6
Abstract
In recent biomedical studies, multidimensional profiling, which collects proteomics as well as other types of omics data on the same subjects, is becoming increasingly popular. Proteomics, transcriptomics, genomics, epigenomics, and other types of data contain overlapping as well as independent information, which suggests the possibility of integrating multiple types of data to generate more reliable findings/models with better classification/prediction performance. In this chapter, a selective review is conducted on recent data integration techniques for both unsupervised and supervised analysis. The main objective is to provide the "big picture" of data integration that involves proteomics data and discuss the "intuition" beneath the recently developed approaches without invoking too many mathematical details. Potential pitfalls and possible directions for future developments are also discussed.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yu Jiang
  - School of Public Health, University of Memphis, Memphis, TN, USA
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, USA.
7
Cho SB. Set-Wise Differential Interaction Between Copy Number Alterations and Gene Expressions of Lower-Grade Glioma Reveals Prognosis-Associated Pathways. ENTROPY 2020; 22:e22121434. [PMID: 33353229 PMCID: PMC7765960 DOI: 10.3390/e22121434] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/22/2020] [Revised: 11/30/2020] [Accepted: 12/16/2020] [Indexed: 12/22/2022]
Abstract
The integrative analysis of copy number alteration (CNA) and gene expression (GE) is an essential part of cancer research considering the impact of CNAs on cancer progression and prognosis. In this research, an integrative analysis was performed with generalized differentially coexpressed gene sets (gdCoxS), which is a modification of dCoxS. In gdCoxS, set-wise interaction is measured using the correlation of sample-wise distances with Rényi’s relative entropy, which requires an estimation of sample density based on omics profiles. To capture correlations between the variables, multivariate density estimation with covariance was applied. In the simulation study, gdCoxS achieved higher power than dCoxS, which does not explicitly use the correlations in the density estimation. In the analysis of the lower-grade glioma data of The Cancer Genome Atlas program (TCGA-LGG), gdCoxS identified 577 pairs of pathway CNAs and GEs that showed significant changes of interaction between the survival and non-survival groups, while other benchmark methods detected fewer such pathways. The biological implications of the significant pathways were consistent with previous reports on TCGA-LGG. Taken together, gdCoxS is a useful method for an integrative analysis of CNAs and GEs.
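For readers unfamiliar with the distance used here, the sketch below computes Rényi's relative entropy (the Rényi divergence of order α) between two discrete distributions, using the standard definition D_α(P‖Q) = log(Σ p_i^α q_i^(1−α)) / (α − 1). It is only an illustration of that formula: the density estimation from omics profiles described in the abstract is not reproduced, and the function name `renyi_divergence` is hypothetical.

```python
import numpy as np


def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha between discrete distributions p and q.

    Assumes p and q are strictly positive and each sums to 1.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.log(np.sum(p**alpha * q ** (1 - alpha))) / (alpha - 1))


p_demo = [0.5, 0.3, 0.2]
q_demo = [0.2, 0.3, 0.5]
print(renyi_divergence(p_demo, q_demo, alpha=2.0))
```

As α approaches 1 the value approaches the Kullback-Leibler divergence, which is why α = 1 is excluded from the closed form above.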
Affiliation(s)
- Seong Beom Cho
  - Department of Biomedical Informatics, College of Medicine, Gachon University, Seongnam-Daero 1342, Korea
8
Wang H, Wu Y, Fang R, Sa J, Li Z, Cao H, Cui Y. Time-Varying Gene Network Analysis of Human Prefrontal Cortex Development. Front Genet 2020; 11:574543. [PMID: 33304381 PMCID: PMC7701309 DOI: 10.3389/fgene.2020.574543] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/20/2020] [Accepted: 10/19/2020] [Indexed: 11/13/2022]
Abstract
The prefrontal cortex (PFC) constitutes a large part of the human central nervous system and is essential for the normal social affection and executive function of humans and other primates. Despite ongoing research in this region, the development of interactions between PFC genes over the lifespan is still unknown. To investigate the conversion of PFC gene interaction networks and further identify hub genes, we obtained time-series gene expression data of human PFC tissues from the Gene Expression Omnibus (GEO) database. A statistical model, loggle, was used to construct time-varying networks and several common network attributes were used to explore the development of PFC gene networks with age. Network similarity analysis showed that the development of human PFC is divided into three stages, namely, fast development period, deceleration to stationary period, and recession period. We identified some genes related to PFC development at these different stages, including genes involved in neuronal differentiation or synapse formation, genes involved in nerve impulse transmission, and genes involved in the development of myelin around neurons. Some of these genes are consistent with findings in previous reports. At the same time, we explored the development of several known KEGG pathways in PFC and corresponding hub genes. This study clarified the development trajectory of the interaction between PFC genes, and proposed a set of candidate genes related to PFC development, which helps further study of human brain development at the genomic level supplemental to regular anatomical analyses. The analytical process used in this study, involving the loggle model, similarity analysis, and central analysis, provides a comprehensive strategy to gain novel insights into the evolution and development of brain networks in other organisms.
Affiliation(s)
- Huihui Wang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yongqing Wu
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Ruiling Fang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Jian Sa
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Zhi Li
  - Department of Hematology, Taiyuan Central Hospital of Shanxi Medical University, Taiyuan, China
- Hongyan Cao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States
9
Feng Y, Xiao L, Chi EC. Sparse Single Index Models for Multivariate Responses. J Comput Graph Stat 2020; 30:115-124. [PMID: 34025100 DOI: 10.1080/10618600.2020.1779080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
Joint models are popular for analyzing data with multivariate responses. We propose a sparse multivariate single index model, where responses and predictors are linked by unspecified smooth functions and multiple matrix-level penalties are employed to select predictors and induce low-rank structures across responses. An alternating direction method of multipliers (ADMM) based algorithm is proposed for model estimation. We demonstrate the effectiveness of the proposed model in simulation studies and an application to a genetic association study.
Affiliation(s)
- Yuan Feng
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
- Luo Xiao
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
- Eric C Chi
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
10
Hilafu H, Safo SE, Haine L. Sparse reduced-rank regression for integrating omics data. BMC Bioinformatics 2020; 21:283. [PMID: 32620072 PMCID: PMC7333421 DOI: 10.1186/s12859-020-03606-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/11/2020] [Accepted: 06/16/2020] [Indexed: 12/04/2022]
Abstract
BACKGROUND The problem of assessing associations between multiple omics data, including genomics and metabolomics data, to identify biomarkers potentially predictive of complex diseases has garnered considerable research interest. A popular epidemiological approach is to consider the association of each predictor with each response using a univariate linear regression model, and to select predictors that meet an a priori specified significance level. Although this approach is simple and intuitive, it tends to require a larger sample size, which is costly. It also assumes that variables for each data type are independent, and thus ignores correlations that exist between variables both within each data type and across the data types.
RESULTS We consider a multivariate linear regression model that relates multiple predictors to multiple responses, and identify multiple relevant predictors that are simultaneously associated with the responses. We assume the coefficient matrix of the responses on the predictors is both row-sparse and of low rank, and propose a group Dantzig type formulation to estimate the coefficient matrix.
CONCLUSION Extensive simulations demonstrate the competitive performance of our proposed method when compared to existing methods in terms of estimation, prediction, and variable selection. We use the proposed method to integrate genomics and metabolomics data to identify genetic variants that are potentially predictive of atherosclerotic cardiovascular disease (ASCVD) beyond well-established risk factors. Our analysis shows some genetic variants that improve prediction of ASCVD beyond some well-established risk factors, and also suggests a potential utility of the identified genetic variants in explaining possible associations between certain metabolites and ASCVD.
Affiliation(s)
- Haileab Hilafu
  - Department of Business Analytics and Statistics, University of Tennessee, Knoxville, 37996 TN USA
- Sandra E. Safo
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455 MN USA
- Lillian Haine
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455 MN USA
11
Alpay BA, Demetci P, Istrail S, Aguiar D. Combinatorial and statistical prediction of gene expression from haplotype sequence. Bioinformatics 2020; 36:i194-i202. [PMID: 32657373 PMCID: PMC7355230 DOI: 10.1093/bioinformatics/btaa318] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022]
Abstract
MOTIVATION Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experimental investigation.
RESULTS In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a regression perspective and develop the HAPLEXR algorithm, which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm, which models haplotype sharing with a modified suffix tree data structure and computes expression groups by spectral clustering. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD with three state-of-the-art expression prediction methods and two novel logistic regression approaches across five GTEx v8 tissues. HAPLEXD exhibits significantly higher classification accuracy overall; HAPLEXR shows higher prediction accuracy on approximately half of the genes tested and the largest number of best predicted genes (r² > 0.1) among all methods. We show that variant and haplotype features selected by HAPLEXR are smaller in size than those selected by competing methods (and thus more interpretable) and are significantly enriched in functional annotations related to gene regulation. These results demonstrate the importance of explicitly modeling non-dosage dependent and intragenic epistatic effects when predicting expression.
AVAILABILITY AND IMPLEMENTATION Source code and binaries are freely available at https://github.com/rapturous/HAPLEX.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Berk A Alpay
  - Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
- Pinar Demetci
  - Department of Computer Science and Center for Computational Biology, Brown University, Providence, RI 02912, USA
- Sorin Istrail
  - Department of Computer Science and Center for Computational Biology, Brown University, Providence, RI 02912, USA
- Derek Aguiar
  - Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
12
Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data. STATISTICS IN BIOSCIENCES 2020; 12:376-398. [PMID: 33796162 DOI: 10.1007/s12561-020-09284-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
The threshold regression model is an effective alternative to the Cox proportional hazards regression model when the proportional hazards assumption is not met. This paper considers variable selection for threshold regression. This model has separate regression functions for the initial health status and the speed of degradation in health. This flexibility is an important advantage when considering relevant risk factors for a complex time-to-event model where one needs to decide which variables should be included in the regression function for the initial health status, in the function for the speed of degradation in health, or in both functions. In this paper, we extend the broken adaptive ridge (BAR) method, originally designed for variable selection for one regression function, to simultaneous variable selection for both regression functions needed in the threshold regression model. We establish variable selection consistency of the proposed method and asymptotic normality of the estimator of non-zero regression coefficients. Simulation results show that our method outperformed threshold regression without variable selection and variable selection based on the Akaike information criterion. We apply the proposed method to data from an HIV drug adherence study in which electronic monitoring of drug intake is used to identify risk factors for non-adherence.
13
Kong D, An B, Zhang J, Zhu H. L2RM: Low-rank Linear Regression Models for High-dimensional Matrix Responses. J Am Stat Assoc 2020; 115:403-424. [PMID: 33408427 PMCID: PMC7781207 DOI: 10.1080/01621459.2018.1555092] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 07/25/2017] [Revised: 11/11/2018] [Accepted: 11/26/2018] [Indexed: 10/27/2022]
Abstract
The aim of this paper is to develop a low-rank linear regression model (L2RM) to correlate a high-dimensional response matrix with a high-dimensional vector of covariates when coefficient matrices have low-rank structures. We propose a fast and efficient screening procedure based on the spectral norm of each coefficient matrix in order to deal with the case when the number of covariates is extremely large. We develop an efficient estimation procedure based on trace norm regularization, which explicitly imposes the low-rank structure of coefficient matrices. When both the dimension of the response matrix and that of the covariate vector diverge at the exponential order of the sample size, we investigate the sure independence screening property under some mild conditions. We also systematically investigate some theoretical properties of our estimation procedure, including estimation consistency, rank consistency, and a non-asymptotic error bound under some mild conditions. We further establish a theoretical guarantee for the overall solution of our two-step screening and estimation procedure. We examine the finite-sample performance of our screening and estimation methods using simulations and a large-scale imaging genetic dataset collected by the Philadelphia Neurodevelopmental Cohort (PNC) study.
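To illustrate why trace norm (nuclear norm) regularization yields low-rank coefficient matrices, the sketch below implements singular value soft-thresholding, the standard proximal operator of the trace norm. It is a generic building block shown under the assumption that a proximal-type solver is used; it is not the estimation algorithm of the paper, and the name `singular_value_threshold` is hypothetical.

```python
import numpy as np


def singular_value_threshold(B, tau):
    """Proximal operator of tau * (trace norm) evaluated at matrix B.

    Each singular value is shrunk toward zero by tau; values below tau
    become exactly zero, which lowers the rank of the result.
    """
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt


B_demo = np.diag([3.0, 1.0, 0.2])
B_low_rank = singular_value_threshold(B_demo, tau=0.5)
print(np.linalg.matrix_rank(B_low_rank))
```

In a proximal gradient scheme, this step would be applied to the coefficient matrix after each gradient update of the least-squares loss, with `tau` set to the step size times the regularization parameter.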
Affiliation(s)
- Dehan Kong
  - Department of Statistical Sciences, University of Toronto
- Baiguo An
  - School of Statistics, Capital University of Economics and Business
- Jingwen Zhang
  - Department of Biostatistics, University of North Carolina at Chapel Hill
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill
14
Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, Jiang Y. Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities. Front Genet 2019; 10:995. [PMID: 31781153 PMCID: PMC6857202 DOI: 10.3389/fgene.2019.00995] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 02/13/2019] [Accepted: 09/18/2019] [Indexed: 12/21/2022]
Abstract
The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.
Affiliation(s)
- Duo Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Courtney R Armour
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
| | - Chenxiao Hu
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Meng Mei
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Chuan Tian
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Thomas J Sharpton
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
| | - Yuan Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
|
15
|
Fang K, Zhang X, Ma S, Zhang Q. Smooth and Locally Sparse Estimation for Multiple-Output Functional Linear Regression. J STAT COMPUT SIM 2019; 90:341-354. [PMID: 33012883 DOI: 10.1080/00949655.2019.1680676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Abstract
Functional data analysis has attracted substantial research interest, and the goal of functional sparsity is to produce a sparse estimate that assigns zero values over regions where the true underlying function is zero, i.e., where there is no relationship between the response variable and the predictor variable. In this paper, we consider a functional linear regression model that explicitly incorporates the interconnections among the responses. We propose a locally sparse (i.e., zero on some subregions) estimator, the multiple-smooth and locally sparse (m-SLoS) estimator, for coefficient functions based on the interconnections among the responses. This method combines the smooth and locally sparse (SLoS) estimator with a Laplacian quadratic penalty function: SLoS encourages local sparsity, and the Laplacian quadratic penalty promotes similar local sparsity among coefficient functions associated with interconnected responses. Simulations show excellent numerical performance of the proposed method in terms of the estimation of coefficient functions, especially when the coefficient functions are the same for all multivariate responses. The practical merit of this modeling is demonstrated by one real application, and the prediction shows significant improvements.
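The Laplacian quadratic penalty mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes an unnormalized graph Laplacian, for which the quadratic form equals a weighted sum of squared differences between the coefficient vectors of connected responses. The function name, the response names, the edge weights, and the discretized coefficient values are all hypothetical; the paper's exact penalty (for example its normalization, or its use of basis-expanded coefficient functions) may differ.

```python
def laplacian_penalty(coefs, edges):
    """Weighted sum of squared coefficient differences over connected responses.

    coefs: dict mapping a response name to its list of coefficient values
    edges: list of (response_a, response_b, weight) tuples (hypothetical network)
    """
    total = 0.0
    for a, b, w in edges:
        # Squared Euclidean distance between the two coefficient vectors
        total += w * sum((x - y) ** 2 for x, y in zip(coefs[a], coefs[b]))
    return total


# Hypothetical discretized coefficient functions for three responses
coefs = {
    "y1": [0.0, 0.5, 1.0],
    "y2": [0.0, 0.5, 1.0],
    "y3": [0.0, 0.0, 2.0],
}
# Hypothetical response network: y1 -- y2 -- y3
edges = [("y1", "y2", 1.0), ("y2", "y3", 1.0)]

penalty = laplacian_penalty(coefs, edges)
print(penalty)
```

Identical coefficient vectors on connected responses (y1 and y2 here) contribute nothing to the penalty, which is why such a term pulls connected responses toward similar coefficient functions.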
Affiliation(s)
- Kuangnan Fang
- Department of Statistics, School of Economics, Xiamen University, China
- Key Laboratory of Econometrics, Ministry of Education, Xiamen University, China
| | - Xiaochen Zhang
- Department of Statistics, School of Economics, Xiamen University, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, USA
| | - Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, China
- Key Laboratory of Econometrics, Ministry of Education, Xiamen University, China
- The Wang Yanan Institute for Studies in Economics, Xiamen University, China
|
16
|
Lu M. An embedded method for gene identification problems involving unwanted data heterogeneity. Hum Genomics 2019; 13:45. [PMID: 31639059 PMCID: PMC6805328 DOI: 10.1186/s40246-019-0228-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Modern applications such as bioinformatics collect data in various ways, which can easily result in heterogeneous data. Traditional variable selection methods assume that samples are independent and identically distributed, an assumption that is not suitable for these applications. Some existing statistical models capable of taking care of unwanted variation were developed for gene identification involving heterogeneous data, but they lack model predictability and suffer from variable redundancy. RESULTS: By accounting for the unwanted heterogeneity effectively, our method has shown its superiority over several state-of-the-art methods, which is validated by the experimental results in both unsupervised and supervised gene identification problems. Moreover, we applied our method to a pan-cancer study, where it can identify the most discriminative genes that best distinguish different cancer types. CONCLUSIONS: This article provides an alternative gene identification method that can account for unwanted data heterogeneity. It is a promising method to provide new insights into complex cancer biology and clues for understanding tumorigenesis and tumor progression.
Affiliation(s)
- Meng Lu
- Department of Information Management, Tianjin University, Tianjin, China.
|
17
|
Newcombe PJ, Nelson CP, Samani NJ, Dudbridge F. A flexible and parallelizable approach to genome-wide polygenic risk scores. Genet Epidemiol 2019; 43:730-741. [PMID: 31328830 PMCID: PMC6764842 DOI: 10.1002/gepi.22245] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 10/25/2018] [Revised: 05/03/2019] [Accepted: 05/30/2019] [Indexed: 01/06/2023]
Abstract
The heritability of most complex traits is driven by variants throughout the genome. Consequently, polygenic risk scores, which combine information on multiple variants genome-wide, have demonstrated improved accuracy in genetic risk prediction. We present a new two-step approach to constructing genome-wide polygenic risk scores from meta-GWAS summary statistics. Local linkage disequilibrium (LD) is adjusted for in Step 1, followed by, uniquely, long-range LD in Step 2. Our algorithm is highly parallelizable, since block-wise analyses in Step 1 can be distributed across a high-performance computing cluster, and flexible, since sparsity and heritability are estimated within each block. Inference is obtained through a formal Bayesian variable selection framework, meaning final risk predictions are averaged over competing models. We compared our method to two alternative approaches, LDPred and lassosum, using all seven traits in the Wellcome Trust Case Control Consortium as well as meta-GWAS summaries for type 1 diabetes (T1D), coronary artery disease, and schizophrenia. Performance was generally similar across methods, although our framework provided more accurate predictions for T1D, for which there are multiple heterogeneous signals in regions of both short- and long-range LD. With sufficient compute resources, our method also allows the fastest runtimes.
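As background for the abstract above, a polygenic risk score in its most basic form is a weighted sum of risk-allele counts across variants. The short Python sketch below shows only that generic weighted sum; it does not implement the two-step LD adjustment or the Bayesian model averaging that the paper describes. The function name, the per-variant weights, and the genotype values are hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-variant effect weights (e.g., estimated log odds ratios)
weights = [0.10, -0.05, 0.30]
# Hypothetical genotype of one individual at the same three variants
person = [2, 1, 0]

score = polygenic_risk_score(person, weights)
print(round(score, 4))
```

In practice the weights are where methods differ: approaches such as the one in this paper estimate them while accounting for LD between variants rather than taking raw marginal effect sizes.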
Affiliation(s)
- Paul J. Newcombe
- MRC Biostatistics Unit, School of Clinical Medicine, Cambridge Institute of Public Health, Cambridge Biomedical Campus, Cambridge, UK
| | - Christopher P. Nelson
- Department of Cardiovascular Sciences, Cardiovascular Research Centre, Glenfield Hospital, University of Leicester, Leicester, UK
- NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, UK
| | - Nilesh J. Samani
- Department of Cardiovascular Sciences, Cardiovascular Research Centre, Glenfield Hospital, University of Leicester, Leicester, UK
- NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, UK
| | - Frank Dudbridge
- Department of Health Sciences, Centre for Medicine, University of Leicester, Leicester, UK
|
18
|
Liang X, Young WC, Hung LH, Raftery AE, Yeung KY. Integration of Multiple Data Sources for Gene Network Inference Using Genetic Perturbation Data. J Comput Biol 2019; 26:1113-1129. [PMID: 31009236 PMCID: PMC6786343 DOI: 10.1089/cmb.2019.0036] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/26/2022]
Abstract
The inference of gene networks from large-scale human genomic data is challenging due to the difficulty in identifying correct regulators for each gene in a high-dimensional search space. We present a Bayesian approach integrating external data sources with knockdown data from human cell lines to infer gene regulatory networks. In particular, we assemble multiple data sources, including gene expression data, genome-wide binding data, gene ontology, and known pathways, and use a supervised learning framework to compute prior probabilities of regulatory relationships. We show that our integrated method improves the accuracy of inferred gene networks as well as extends some previous Bayesian frameworks both in theory and applications. We apply our method to two different human cell lines, namely skin melanoma cell line A375 and lung cancer cell line A549, to illustrate the capabilities of our method. Our results show that the improvement in performance could vary from cell line to cell line and that we might need to choose different external data sources serving as prior knowledge if we hope to obtain better accuracy for different cell lines.
Affiliation(s)
- Xiao Liang
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia
| | - William Chad Young
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
| | - Ling-Hong Hung
- School of Engineering and Technology, University of Washington, Tacoma, Washington
| | - Adrian E. Raftery
- Department of Statistics, University of Washington, Seattle, Washington
| | - Ka Yee Yeung
- School of Engineering and Technology, University of Washington, Tacoma, Washington
|
19
|
Wang S, Shi X, Wu M, Ma S. Horizontal and vertical integrative analysis methods for mental disorders omics data. Sci Rep 2019; 9:13430. [PMID: 31530853 PMCID: PMC6748966 DOI: 10.1038/s41598-019-49718-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/20/2019] [Accepted: 08/30/2019] [Indexed: 12/18/2022]
Abstract
In recent biomedical studies, omics profiling has been extensively conducted on various types of mental disorders. In most of the existing analyses, a single type of mental disorder and a single type of omics measurement are analyzed. In the study of other complex diseases, integrative analysis, both vertical and horizontal integration, has been conducted and shown to bring significantly new insights into disease etiology, progression, biomarkers, and treatment. In this article, we showcase the applicability of integrative analysis to mental disorders. In particular, the horizontal integration of bipolar disorder and schizophrenia and the vertical integration of gene expression and copy number variation data are conducted. The analysis is based on the sparse principal component analysis, penalization, and other advanced statistical techniques. In data analysis, integration leads to biologically sensible findings, including the disease-related gene expressions, copy number variations, and their associations, which differ from the "benchmark" analysis. Overall, this study suggests the potential of integrative analysis in mental disorder research.
Affiliation(s)
- Shuaichao Wang
- SJTU-Yale Joint Center for Biostatistics, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Xingjie Shi
- School of Economics, Nanjing University of Finance and Economics, Nanjing, 210046, China
| | - Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, 200433, China.
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT, 06520, USA.
|
20
|
Yang J, Peng J. Estimating Time-Varying Graphical Models. J Comput Graph Stat 2019; 29:191-202. [PMID: 33828398 PMCID: PMC8023339 DOI: 10.1080/10618600.2019.1647848] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/23/2018] [Revised: 06/28/2019] [Accepted: 07/17/2019] [Indexed: 10/26/2022]
Abstract
In this paper, we study time-varying graphical models based on data measured over a temporal grid. Such models are motivated by the need to describe and understand evolving interactions among a set of random variables in many real applications, for instance the study of how stock prices interact with each other and how such interactions change over time. We propose a new model, LOcal Group Graphical Lasso Estimation (loggle), under the assumption that the graph topology changes gradually over time. Specifically, loggle uses a novel local group-lasso type penalty to efficiently incorporate information from neighboring time points and to impose structural smoothness of the graphs. We implement an ADMM-based algorithm to fit the loggle model. This algorithm utilizes blockwise fast computation and pseudo-likelihood approximation to improve computational efficiency. An R package loggle has also been developed and is available on https://cran.r-project.org/. We evaluate the performance of loggle by simulation experiments. We also apply loggle to S&P 500 stock price data and demonstrate that loggle is able to reveal the interacting relationships among stock prices and among industrial sectors in a time period that covers the recent global financial crisis. The supplemental materials for this paper are also available online.
Affiliation(s)
- Jilei Yang
- Department of Statistics, University of California, Davis
| | - Jie Peng
- Department of Statistics, University of California, Davis
|
21
|
Petralia F, Wang L, Peng J, Yan A, Zhu J, Wang P. A new method for constructing tumor specific gene co-expression networks based on samples with tumor purity heterogeneity. Bioinformatics 2019; 34:i528-i536. [PMID: 29949994 PMCID: PMC6022554 DOI: 10.1093/bioinformatics/bty280] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022]
Abstract
Motivation: Tumor tissue samples often contain an unknown fraction of stromal cells. This problem, widely known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with a mean shift between non-tumor and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose Tumor Specific Net (TSNet), a new method which constructs tumor-cell specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models tumor purity percentage in each tumor sample. Results: Using extensive synthetic data experiments, we demonstrate that TSNet outperforms a standard graphical model which does not account for TPH. We then apply TSNet to estimate tumor specific gene co-expression networks based on TCGA ovarian cancer RNAseq data. We identify novel co-expression modules and hub structure specific to tumor cells. Availability and implementation: R codes can be found at https://github.com/petraf01/TSNet. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Francesca Petralia
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Li Wang
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Sema4, a Mount Sinai Venture, Stamford, CT, USA
| | - Jie Peng
- Department of Statistics, University of California, Davis, Davis, CA, USA
| | - Arthur Yan
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Jun Zhu
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Sema4, a Mount Sinai Venture, Stamford, CT, USA
| | - Pei Wang
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
|
22
|
Ma W, Chen LS, Özbek U, Han SW, Lin C, Paulovich AG, Zhong H, Wang P. Integrative Proteo-genomic Analysis to Construct CNA-protein Regulatory Map in Breast and Ovarian Tumors. Mol Cell Proteomics 2019; 18:S66-S81. [PMID: 31281117 PMCID: PMC6692778 DOI: 10.1074/mcp.ra118.001229] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/21/2018] [Revised: 07/01/2019] [Indexed: 12/16/2022]
Abstract
Recent developments in high-throughput proteomics and genomics profiling enable one to study the regulation of protein activities by genome alterations in a systematic manner. In this article, we propose a new statistical method, ProMAP, to systematically characterize the regulatory relationships between proteins and DNA copy number alterations (CNA) in breast and ovarian tumors based on proteogenomic data from the CPTAC-TCGA studies. Because of the dynamic nature of mass spectrometry instruments, proteomics data from labeled mass spectrometry experiments usually have non-ignorable batch effects. Moreover, mass spectrometry-based proteomic data often possess high percentages of missing values and non-ignorable missing-data patterns. Thus, we use a linear mixed effects model to account for the batch structure and explicitly incorporate the abundance-dependent missing-data mechanism of proteomic data in ProMAP. In addition, we employ a multivariate regression framework to characterize the multiple-to-multiple regulatory relationships between CNA and proteins. Further, we use proper statistical regularization to facilitate the detection of master genetic regulators, which affect the activities of many proteins and often play important roles in genetic regulatory networks. Improved performance of ProMAP over existing methods was illustrated through extensive simulation studies and real data examples. Applying ProMAP to the CPTAC-TCGA breast and ovarian cancer data sets, we identified many genome regions, including a few novel ones, whose CNA were associated with protein and/or phosphoprotein abundances. For example, in breast tumors, a small region in 8p11.21 was recognized as the second biggest hub in the CNA-phosphoprotein regulatory map, and further investigation of the regulatory targets suggests the potential role of 8p11.21 CNA in perturbing oxygen binding and transport activities in tumor cells. This and other findings from our analyses help to characterize the impacts of CNAs on protein activity landscapes and cast light on the genetic regulation mechanisms underlying these tumors.
Affiliation(s)
- Weiping Ma
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029
| | - Lin S. Chen
- Department of Public Health Sciences, University of Chicago, Chicago, IL 60637
| | - Umut Özbek
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York 10029
| | - Sung Won Han
- School of Industrial Management Engineering, Korea University, 145, Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
| | - Chenwei Lin
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109–1024
| | - Amanda G. Paulovich
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109–1024
| | - Hua Zhong
- Division of Biostatistics, Department of Population Health, New York University, New York, New York 10016
| | - Pei Wang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029
|
23
|
Uematsu Y, Fan Y, Chen K, Lv J, Lin W. SOFAR: Large-Scale Association Network Learning. IEEE TRANSACTIONS ON INFORMATION THEORY 2019; 65:4924-4939. [PMID: 33746241 PMCID: PMC7970712 DOI: 10.1109/tit.2019.2909889] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/12/2023]
Abstract
Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been two largely incompatible goals. To accommodate both features, in this paper we suggest the method of sparse orthogonal factor regression (SOFAR) via the sparse singular value decomposition with orthogonality constrained optimization to learn the underlying association networks, with broad applications to both unsupervised and supervised learning tasks such as biclustering with sparse singular value decomposition, sparse principal component analysis, sparse factor analysis, and sparse vector autoregression analysis. Exploiting the framework of convexity-assisted nonconvex optimization, we derive nonasymptotic error bounds for the suggested procedure characterizing the theoretical advantages. The statistical guarantees are powered by an efficient SOFAR algorithm with convergence property. Both computational and theoretical advantages of our procedure are demonstrated with several simulations and real data examples.
Affiliation(s)
- Yoshimasa Uematsu
- Assistant Professor, Department of Economics and Management, Tohoku University, Sendai 980-8576, Japan
| | - Yingying Fan
- Dean's Associate Professor in Business Administration, Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
| | - Kun Chen
- Associate Professor, Department of Statistics, University of Connecticut, Storrs, CT 06269
| | - Jinchi Lv
- Kenneth King Stonier Chair in Business Administration and Professor, Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
| | - Wei Lin
- Assistant Professor, School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China 100871
|
24
|
Li G, Liu X, Chen K. Integrative multi-view regression: Bridging group-sparse and low-rank models. Biometrics 2019; 75:593-602. [PMID: 30456759 PMCID: PMC6849205 DOI: 10.1111/biom.13006] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/13/2018] [Accepted: 10/24/2018] [Indexed: 11/30/2022]
Abstract
Multi-view data have been routinely collected in various fields of science and engineering. A general problem is to study the predictive association between multivariate responses and multi-view predictor sets, all of which can be of high dimensionality. It is likely that only a few views are relevant to prediction, and the predictors within each relevant view contribute to the prediction collectively rather than sparsely. We cast this new problem under the familiar multivariate regression framework and propose an integrative reduced-rank regression (iRRR), where each view has its own low-rank coefficient matrix. As such, latent features are extracted from each view in a supervised fashion. For model estimation, we develop a convex composite nuclear norm penalization approach, which admits an efficient algorithm via alternating direction method of multipliers. Extensions to non-Gaussian and incomplete data are discussed. Theoretically, we derive non-asymptotic oracle bounds of iRRR under a restricted eigenvalue condition. Our results recover oracle bounds of several special cases of iRRR including Lasso, group Lasso, and nuclear norm penalized regression. Therefore, iRRR seamlessly bridges group-sparse and low-rank methods and can achieve substantially faster convergence rate under realistic settings of multi-view learning. Simulation studies and an application in the Longitudinal Studies of Aging further showcase the efficacy of the proposed methods.
Affiliation(s)
- Gen Li
- Department of Biostatistics, Columbia University, New York
| | - Xiaokang Liu
- Department of Statistics, University of Connecticut, Storrs, Connecticut
| | - Kun Chen
- Department of Statistics, University of Connecticut, Storrs, Connecticut
|
25
|
Ren J, Du Y, Li S, Ma S, Jiang Y, Wu C. Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genet Epidemiol 2019; 43:276-291. [PMID: 30746793 PMCID: PMC6446588 DOI: 10.1002/gepi.22194] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 08/05/2018] [Revised: 11/19/2018] [Accepted: 11/29/2018] [Indexed: 12/21/2022]
Abstract
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selections for high-dimensional cancer genomic data, because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions, and are contaminated by outliers, network-constrained regularization that does not take the robustness into account leads to false identifications of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
Affiliation(s)
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS
| | - Yinhao Du
- Department of Statistics, Kansas State University, Manhattan, KS
| | - Shaoyu Li
- Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC
| | - Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT
| | - Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN
| | - Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS
|
26
|
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 122] [Impact Index Per Article: 20.3] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023]
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview on variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strength and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computation aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA.
- Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA.
|
27
|
Shi R, Liang F, Song Q, Luo Y, Ghosh M. A Blockwise Consistency Method for Parameter Estimation of Complex Models. SANKHYA. SERIES B. [METHODOLOGICAL.] 2018; 80:179-223. [PMID: 33833478 PMCID: PMC8026010 DOI: 10.1007/s13571-018-0183-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/19/2018] [Indexed: 10/27/2022]
Abstract
The drastic improvement in data collection and acquisition technologies has enabled scientists to collect a great amount of data. Growing dataset size typically brings growing complexity in the data structures and in the models needed to account for them. Estimating the parameters of such complex models poses a great challenge to current statistical methods. This paper proposes a blockwise consistency approach as a potential solution to the problem, which works by iteratively finding consistent estimates for each block of parameters conditional on the current estimates of the parameters in other blocks. The blockwise consistency approach decomposes the high-dimensional parameter estimation problem into a series of lower-dimensional parameter estimation problems, which often have much simpler structures than the original problem and thus can be easily solved. Moreover, under the framework provided by the blockwise consistency approach, a variety of methods, such as Bayesian and frequentist methods, can be jointly used to achieve a consistent estimator for the original high-dimensional complex model. The blockwise consistency approach is illustrated using two high-dimensional problems, variable selection and multivariate regression. The results of both problems show that the blockwise consistency approach can provide drastic improvements over the existing methods. Extension of the blockwise consistency approach to many other complex models is straightforward.
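The idea of re-estimating one block of parameters while the others are held at their current values can be illustrated with a toy sketch. The example below is not the authors' algorithm: it is a plain alternating least-squares update for a two-coefficient linear model, where each "block" is a single coefficient. The function name `fit_two_blocks` and the data are hypothetical and chosen only for illustration.

```python
def fit_two_blocks(x1, x2, y, n_iter=200, tol=1e-12):
    """Alternately re-estimate each coefficient with the other held fixed.

    Toy model (an assumption for illustration): y ~ a * x1 + b * x2.
    Each update is the least-squares solution for one block given the other.
    """
    a, b = 0.0, 0.0
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    for _ in range(n_iter):
        # Block 1: best a given the current b.
        a_new = sum(u * (t - b * v) for u, v, t in zip(x1, x2, y)) / s11
        # Block 2: best b given the freshly updated a.
        b_new = sum(v * (t - a_new * u) for u, v, t in zip(x1, x2, y)) / s22
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


# Noise-free toy data generated from a = 2, b = -3.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 1.0, 0.0]
y = [2.0 * u - 3.0 * v for u, v in zip(x1, x2)]
a_hat, b_hat = fit_two_blocks(x1, x2, y)
```

Each low-dimensional update here has a closed form, which mirrors the abstract's point that the per-block problems tend to be much simpler than the joint problem.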
Affiliation(s)
- Runmin Shi
- Department of Statistics, University of Florida, Gainesville, FL 32611
- Faming Liang
- Department of Statistics, Purdue University, West Lafayette, IN 47906
- Qifan Song
- Department of Statistics, Purdue University, West Lafayette, IN 47907
- Ye Luo
- Department of Economics, University of Florida, Gainesville, FL 32611
- Malay Ghosh
- University of Florida, Gainesville, FL 32611
|
28
|
Wu C, Zhang Q, Jiang Y, Ma S. Robust network-based analysis of the associations between (epi)genetic measurements. J MULTIVARIATE ANAL 2018; 168:119-130. [PMID: 30983643 PMCID: PMC6456078 DOI: 10.1016/j.jmva.2018.06.009] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 10/28/2022]
Abstract
Because of its important biological implications, modeling of the associations between gene expression (GE) and copy number variation (CNV) has been extensively conducted. Such analysis is challenging because of the high data dimensionality, lack of knowledge of which CNVs regulate a specific GE, different behaviors of the cis-acting and trans-acting CNVs, possible long-tailed distributions and contamination of GE measurements, and correlations between CNVs. The existing methods fail to address one or more of these challenges. In this study, a new method is developed to model the GE-CNV associations more effectively. Specifically, for each GE, a partially linear model, with a nonlinear cis-acting CNV effect, is assumed. A robust loss function is adopted to accommodate long-tailed distributions and data contamination. We adopt penalization to accommodate the high dimensionality and identify relevant CNVs. A network structure is introduced to accommodate the correlations among CNVs. The proposed method comprehensively accommodates multiple challenging characteristics of GE-CNV modeling and effectively overcomes the limitations of existing methods. We develop an effective computational algorithm and rigorously establish the consistency properties. Simulation shows the superiority of the proposed method over alternatives. The TCGA (The Cancer Genome Atlas) data on the PCD (programmed cell death) pathway are analyzed, and the proposed method yields improved prediction and stability as well as biologically plausible findings.
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS, 66506, USA
- Qingzhao Zhang
- School of Economics and the Wang Yanan Institute for Studies in Economics, Xiamen University
- Yu Jiang
- Division of Epidemiology, Biostatistics, and Environmental Health, School of Public Health, University of Memphis, Memphis, TN, 38111, USA
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, CT, 06510, USA
|
29
|
Characterizing functional consequences of DNA copy number alterations in breast and ovarian tumors by spaceMap. J Genet Genomics 2018; 45:361-371. [PMID: 30057342 DOI: 10.1016/j.jgg.2018.07.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/15/2018] [Revised: 07/09/2018] [Accepted: 07/09/2018] [Indexed: 01/18/2023]
Abstract
We propose a novel conditional graphical model - spaceMap - to construct gene regulatory networks from multiple types of high dimensional omic profiles. A motivating application is to characterize the perturbation of DNA copy number alterations (CNAs) on downstream protein levels in tumors. Through a penalized multivariate regression framework, spaceMap jointly models high dimensional protein levels as responses and high dimensional CNAs as predictors. In this setup, spaceMap infers an undirected network among proteins together with a directed network encoding how CNAs perturb the protein network. spaceMap can be applied to learn other types of regulatory relationships from high dimensional molecular profiles, especially those exhibiting hub structures. Simulation studies show spaceMap has greater power in detecting regulatory relationships than competing methods. Additionally, spaceMap includes a network analysis toolkit for biological interpretation of inferred networks. We apply spaceMap to the CNAs, gene expression and proteomics data sets from CPTAC-TCGA breast (n=77) and ovarian (n=174) cancer studies. Each cancer exhibits disruption of 'ion transmembrane transport' and 'regulation from RNA polymerase II promoter' by CNA events unique to each cancer. Moreover, using protein levels as a response yields a more functionally-enriched network than using RNA expressions in both cancer types. The network results also help to pinpoint crucial cancer genes and provide insights on the functional consequences of important CNA in breast and ovarian cancers. The R package spaceMap - including vignettes and documentation - is hosted on https://topherconley.github.io/spacemap.
|
30
|
Capobianco E, Valdes C, Sarti S, Jiang Z, Poliseno L, Tsinoremas NF. Ensemble Modeling Approach Targeting Heterogeneous RNA-Seq data: Application to Melanoma Pseudogenes. Sci Rep 2017; 7:17344. [PMID: 29229974 PMCID: PMC5725464 DOI: 10.1038/s41598-017-17337-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/25/2017] [Accepted: 11/23/2017] [Indexed: 01/28/2023] Open
Abstract
We studied the transcriptome landscape of skin cutaneous melanoma (SKCM) using 103 primary tumor samples from TCGA, and measured the expression levels of both protein coding genes and non-coding RNAs (ncRNAs). In particular, we emphasized pseudogenes potentially relevant to this cancer. While cataloguing the profiles based on the known biotypes, all the employed RNA-Seq methods generated just a small consensus of significant biotypes. We thus designed an approach to reconcile the profiles from all methods following a simple strategy: we selected genes that were confirmed as differentially expressed by the ensemble predictions obtained in a regression model. The main advantages of this approach are: 1) Selection of a high-confidence gene set identifying relevant pathways; 2) Use of a regression model whose covariates embed all method-driven outcomes to predict an averaged profile; 3) Method-specific assessment of prediction power and significance. Furthermore, the approach can be generalized to any biological system for which noisy RNA-Seq profiles are computed. As our analyses concerned bio-annotations of both high-quality protein coding genes and ncRNAs, we considered the associations between pseudogenes and parental genes (targets). Among the candidate targets that were validated, we identified PINK1, which is studied in patients with Parkinson and cancer (especially melanoma).
Affiliation(s)
- Enrico Capobianco
- Center for Computational Science, University of Miami, Miami, FL, USA.
- Camilo Valdes
- Center for Computational Science, University of Miami, Miami, FL, USA
- Zhijie Jiang
- Center for Computational Science, University of Miami, Miami, FL, USA
- Laura Poliseno
- Istituto Toscano Tumori Oncogenomics Unit, Institute of Clinical Physiology-National Research Council, Pisa, Italy
- Nicolas F Tsinoremas
- Center for Computational Science, University of Miami, Miami, FL, USA
- Department of Medicine, Miller School of Medicine, University of Miami, Miami, FL, USA
|
31
|
Ning Z, Lee Y, Joshi PK, Wilson JF, Pawitan Y, Shen X. A Selection Operator for Summary Association Statistics Reveals Allelic Heterogeneity of Complex Traits. Am J Hum Genet 2017; 101:903-912. [PMID: 29198721 PMCID: PMC5812891 DOI: 10.1016/j.ajhg.2017.09.027] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 06/19/2017] [Accepted: 09/28/2017] [Indexed: 02/04/2023] Open
Abstract
In recent years, as a secondary analysis in genome-wide association studies (GWASs), conditional and joint multiple-SNP analysis (GCTA-COJO) has been successful in allowing the discovery of additional association signals within detected loci. This suggests that many loci mapped in GWASs harbor more than a single causal variant. In order to interpret the underlying mechanism regulating a complex trait of interest in each discovered locus, researchers must assess the magnitude of allelic heterogeneity within the locus. We developed a penalized selection operator for jointly analyzing multiple variants (SOJO) within each mapped locus on the basis of LASSO (least absolute shrinkage and selection operator) regression derived from summary association statistics. We found that, compared to stepwise conditional multiple-SNP analysis, SOJO provided better sensitivity and specificity in predicting the number of alleles associated with complex traits in each locus. SOJO suggested causal variants potentially missed by GCTA-COJO. Compared to using top variants from genome-wide significant loci in GWAS, using SOJO increased the proportion of variance prediction for height by 65% without additional discovery samples or additional loci in the genome. Our empirical results indicate that human height is not only a highly polygenic trait, but also has high allelic heterogeneity within its established hundreds of loci.
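A LASSO fit can be computed from summary-level quantities because the penalized least-squares objective depends on the data only through the Gram matrix and the predictor-outcome cross-products. The sketch below is a generic coordinate-descent LASSO under that observation, not the SOJO implementation. The names `gram` (standing for X'X / n), `xty` (standing for X'y / n), `lasso_from_summary` and the toy inputs are assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) used by LASSO coordinate descent."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_from_summary(gram, xty, lam, n_iter=500, tol=1e-10):
    """Minimise 0.5 * b'Gb - c'b + lam * ||b||_1 by cyclic coordinate descent.

    gram: p x p matrix G (assumed to be X'X / n), as a list of lists.
    xty:  length-p vector c (assumed to be X'y / n).
    """
    p = len(xty)
    beta = [0.0] * p
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Cross-product with the partial residual that excludes variable j.
            partial = xty[j] - sum(gram[j][k] * beta[k] for k in range(p) if k != j)
            new_value = soft_threshold(partial, lam) / gram[j][j]
            max_change = max(max_change, abs(new_value - beta[j]))
            beta[j] = new_value
        if max_change < tol:
            break
    return beta


# Toy example with uncorrelated, standardised variants (identity Gram matrix).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
beta_hat = lasso_from_summary(identity, [0.9, 0.05, -0.4], lam=0.1)
```

With an identity Gram matrix the solution reduces to soft-thresholding each marginal effect, so the weak middle signal is shrunk exactly to zero. With correlated variants, the off-diagonal entries of `gram` play the role of linkage disequilibrium between variants.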
|
32
|
Chai H, Shi X, Zhang Q, Zhao Q, Huang Y, Ma S. Analysis of cancer gene expression data with an assisted robust marker identification approach. Genet Epidemiol 2017; 41:779-789. [PMID: 28913902 DOI: 10.1002/gepi.22066] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 11/11/2016] [Revised: 04/10/2017] [Accepted: 07/10/2017] [Indexed: 12/22/2022]
Abstract
Gene expression (GE) studies have been playing a critical role in cancer research. Despite tremendous effort, the analysis results are still often unsatisfactory, because of the weak signals and high data dimensionality. Analysis is often further challenged by the long-tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on GEs as well as their regulators (e.g., copy number alterations (CNAs), methylation, and microRNAs), which can provide additional information on the associations between GEs and cancer outcomes. In this study, we develop an ARMI (assisted robust marker identification) approach for analyzing cancer studies with measurements on GEs as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing GE data alone. A robust objective function is adopted to accommodate long-tailed distributions. Marker identification is effectively realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable. Simulation shows its satisfactory performance under a variety of settings. TCGA (The Cancer Genome Atlas) data on melanoma and lung cancer are analyzed, which leads to biologically plausible marker identification and superior prediction.
Affiliation(s)
- Hao Chai
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- Xingjie Shi
- Department of Statistics, Nanjing University of Finance and Economics, Nanjing Shi, Jiangsu Sheng, China
- Qingzhao Zhang
- School of Economics, Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen Shi, Fujian Sheng, China
- Qing Zhao
- Merck Research Laboratories, Rahway, New Jersey, United States of America
- Yuan Huang
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
|
33
|
Newcombe PJ, Raza Ali H, Blows FM, Provenzano E, Pharoah PD, Caldas C, Richardson S. Weibull regression with Bayesian variable selection to identify prognostic tumour markers of breast cancer survival. Stat Methods Med Res 2017; 26:414-436. [PMID: 25193065 PMCID: PMC6055985 DOI: 10.1177/0962280214548748] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 01/08/2023]
Abstract
As data-rich medical datasets are becoming routinely collected, there is a growing demand for regression methodology that facilitates variable selection over a large number of predictors. Bayesian variable selection algorithms offer an attractive solution, whereby a sparsity inducing prior allows inclusion of sets of predictors simultaneously, leading to adjusted effect estimates and inference of which covariates are most important. We present a new implementation of Bayesian variable selection, based on a Reversible Jump MCMC algorithm, for survival analysis under the Weibull regression model. A realistic simulation study is presented comparing against an alternative LASSO-based variable selection strategy in datasets of up to 20,000 covariates. Across half the scenarios, our new method achieved identical sensitivity and specificity to the LASSO strategy, and a marginal improvement otherwise. Runtimes were comparable for both approaches, taking approximately a day for 20,000 covariates. Subsequently, we present a real data application in which 119 protein-based markers are explored for association with breast cancer survival in a case cohort of 2287 patients with oestrogen receptor-positive disease. Evidence was found for three independent prognostic tumour markers of survival, one of which is novel. Our new approach demonstrated the best specificity.
Affiliation(s)
- H Raza Ali
- Cancer Research UK Cambridge Institute, Cambridge, UK
- Department of Pathology, University of Cambridge, Cambridge, UK
- Cambridge Experimental Cancer Medicine Centre and NIHR Cambridge Biomedical Research Centre, Cambridge, UK
- FM Blows
- Department of Oncology, University of Cambridge, Cambridge, UK
- E Provenzano
- NIH Cambridge Biomedical Research Centre, Cambridge, UK
- PD Pharoah
- Cambridge Experimental Cancer Medicine Centre and NIHR Cambridge Biomedical Research Centre, Cambridge, UK
- Department of Oncology, University of Cambridge, Cambridge, UK
- Strangeways Research Laboratory, Cambridge, UK
- C Caldas
- Cancer Research UK Cambridge Institute, Cambridge, UK
- Cambridge Experimental Cancer Medicine Centre and NIHR Cambridge Biomedical Research Centre, Cambridge, UK
- Department of Oncology, University of Cambridge, Cambridge, UK
|
34
|
Zhao Y, Chung M, Johnson BA, Moreno CS, Long Q. Hierarchical Feature Selection Incorporating Known and Novel Biological Information: Identifying Genomic Features Related to Prostate Cancer Recurrence. J Am Stat Assoc 2017; 111:1427-1439. [PMID: 28435175 DOI: 10.1080/01621459.2016.1164051] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 02/08/2023]
Abstract
Our work is motivated by a prostate cancer study aimed at identifying mRNA and miRNA biomarkers that are predictive of cancer recurrence after prostatectomy. It has been shown in the literature that incorporating known biological information on pathway memberships and interactions among biomarkers improves feature selection of high-dimensional biomarkers in relation to disease risk. Biological information is often represented by graphs or networks, in which biomarkers are represented by nodes and interactions among them are represented by edges; however, biological information is often not fully known. For example, the role of microRNAs (miRNAs) in regulating gene expression is not fully understood and the miRNA regulatory network is not fully established, in which case new strategies are needed for feature selection. To this end, we treat unknown biological information as missing data (i.e., missing edges in graphs), different from commonly encountered missing data problems where variable values are missing. We propose a new concept of imputing unknown biological information based on observed data and define the imputed information as the novel biological information. In addition, we propose a hierarchical group penalty to encourage sparsity and feature selection at both the pathway level and the within-pathway level, which, combined with the imputation step, allows for incorporation of known and novel biological information. While it is applicable to general regression settings, we develop and investigate the proposed approach in the context of semiparametric accelerated failure time models motivated by our data example. Data application and simulation studies show that incorporation of novel biological information improves performance in risk prediction and feature selection and the proposed penalty outperforms the extensions of several existing penalties.
Affiliation(s)
- Yize Zhao
- Postdoctoral Fellow, Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709
- Matthias Chung
- Assistant Professor, Department of Mathematics, Virginia Tech, Blacksburg, VA 24061
- Brent A Johnson
- Associate Professor, Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
- Carlos S Moreno
- Associate Professor, Department of Pathology and Laboratory Medicine
- Qi Long
- Associate Professor, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322
|
35
|
Zhou Y, Wang P, Wang X, Zhu J, Song PXK. Sparse multivariate factor analysis regression models and its applications to integrative genomics analysis. Genet Epidemiol 2016; 41:70-80. [PMID: 27862229 DOI: 10.1002/gepi.22018] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/07/2015] [Revised: 09/16/2016] [Accepted: 09/19/2016] [Indexed: 01/25/2023]
Abstract
The multivariate regression model is a useful tool to explore complex associations between two kinds of molecular markers, which enables the understanding of the biological pathways underlying disease etiology. For a set of correlated response variables, accounting for such dependency can increase statistical power. Motivated by integrative genomic data analyses, we propose a new methodology, the sparse multivariate factor analysis regression model (smFARM), in which correlations of response variables are assumed to follow a factor analysis model with latent factors. This proposed method not only allows us to address the challenge that the number of association parameters is larger than the sample size, but also to adjust for unobserved genetic and/or nongenetic factors that potentially conceal the underlying response-predictor associations. The proposed smFARM is implemented by the EM algorithm and the blockwise coordinate descent algorithm. The proposed methodology is evaluated and compared to the existing methods through extensive simulation studies. Our results show that accounting for latent factors through the proposed smFARM can improve sensitivity of signal detection and accuracy of sparse association map estimation. We illustrate smFARM by two integrative genomics analysis examples, a breast cancer dataset, and an ovarian cancer dataset, to assess the relationship between DNA copy numbers and gene expression arrays to understand genetic regulatory patterns relevant to the disease. We identify two trans-hub regions: one in cytoband 17q12 whose amplification influences the RNA expression levels of important breast cancer genes, and the other in cytoband 9q21.32-33, which is associated with chemoresistance in ovarian cancer.
Affiliation(s)
- Yan Zhou
- Merck & Co, North Wales, PA, USA
- Pei Wang
- Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Xianlong Wang
- Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Ji Zhu
- University of Michigan, Ann Arbor, MI, USA
|
36
|
Lipid metabolism is associated with developmental epigenetic programming. Sci Rep 2016; 6:34857. [PMID: 27713555 PMCID: PMC5054359 DOI: 10.1038/srep34857] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 02/23/2016] [Accepted: 09/19/2016] [Indexed: 12/24/2022] Open
Abstract
Maternal diet and metabolism impact fetal development. Epigenetic reprogramming facilitates fetal adaptation to these in utero cues. To determine if maternal metabolite levels impact infant DNA methylation globally and at growth and development genes, we followed a clinical birth cohort of 40 mother-infant dyads. Targeted metabolomics and quantitative DNA methylation were analyzed in 1st trimester maternal plasma (M1) and delivery maternal plasma (M2) as well as infant umbilical cord blood plasma (CB). We found very long chain fatty acids, medium chain acylcarnitines, and histidine were: (1) stable in maternal plasma from pregnancy to delivery, (2) significantly correlated between M1, M2, and CB, and (3) in the top 10% of maternal metabolites correlating with infant DNA methylation, suggesting maternal metabolites associated with infant DNA methylation are tightly controlled. Global DNA methylation was highly correlated across M1, M2, and CB. Thus, circulating maternal lipids are associated with developmental epigenetic programming, which in turn may impact lifelong health and disease risk. Further studies are required to determine the causal link between maternal plasma lipids and infant DNA methylation patterns.
|
37
|
Kurum E, Benayoun BA, Malhotra A, George J, Ucar D. Computational inference of a genomic pluripotency signature in human and mouse stem cells. Biol Direct 2016; 11:47. [PMID: 27639379 PMCID: PMC5027095 DOI: 10.1186/s13062-016-0148-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/13/2016] [Accepted: 09/03/2016] [Indexed: 12/18/2022] Open
Abstract
Recent analyses of next-generation sequencing datasets have shown that cell-specific regulatory elements in stem cells are marked with distinguishable patterns of transcription factor (TF) binding and epigenetic marks. For example, we recently demonstrated that promoters of cell-specific genes are covered with expanded trimethylation of histone H3 at lysine 4 (H3K4me3) marks (i.e., broad H3K4me3 domains). Moreover, binding of specific TFs, such as OCT4, NANOG, and SOX2, has been shown to play a critical role in maintaining the pluripotency of stem cells. Despite these observations, a systematic exploration of genomic and epigenomic features of stem-cell-specific gene promoters has not been conducted. Advanced machine-learning models can capture distinguishable genomic and epigenomic characteristics of stem-cell-specific promoters by taking advantage of the wealth of publicly available datasets. Here, we propose a three-step framework to discover novel data characteristics of high-throughput next generation sequencing datasets that distinguish pluripotency genes in human and mouse embryonic stem cells (ESCs). Our framework involves: i) feature extraction to identify novel features of genomic datasets; ii) feature selection using a logistic regression model combined with the Least Absolute Shrinkage and Selection Operator (LASSO) method to find the most critical datasets and features; and iii) cross validation with features selected using the LASSO method to assess the predictive power of selected data features in distinguishing pluripotency genes. We show that specific epigenetic marks, and specific features of these marks, are enriched at pluripotency gene promoters. Moreover, we also assess both the individual and combined effect of TF binding, epigenetic mark deposition, and gene expression datasets for marking pluripotency genes.
Our findings are consistent with the existence of a conserved, complex and integrative genomic signature in ESCs that can be exploited to flag important candidate pluripotency genes. They also validate our computational framework for fostering a deeper understanding of genomic datasets in stem cells, which, in the future, could be extended to study cell-type-specific genomic landscapes in other cell types. Reviewers: This article was reviewed by Zoltan Gaspari and Piotr Zielenkiewicz.
Affiliation(s)
- Esra Kurum
- Department of Statistics, University of California, Riverside, Riverside, CA, USA
- Ankit Malhotra
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06032, USA
- Joshy George
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06032, USA
- Duygu Ucar
- The Jackson Laboratory for Genomic Medicine, 10 Discovery Drive, Farmington, CT, 06032, USA.
|
38
|
Li Z, Suk HI, Shen D, Li L. Sparse Multi-Response Tensor Regression for Alzheimer's Disease Study With Multivariate Clinical Assessments. IEEE TRANSACTIONS ON MEDICAL IMAGING 2016; 35:1927-1936. [PMID: 26960221 PMCID: PMC5154176 DOI: 10.1109/tmi.2016.2538289] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 05/30/2023]
Abstract
Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disorder that has recently seen a serious increase in the number of affected subjects. In the last decade, neuroimaging has been shown to be a useful tool to understand AD and its prodromal stage, amnestic mild cognitive impairment (MCI). The majority of AD/MCI studies have focused on disease diagnosis, by formulating the problem as classification with a binary outcome of AD/MCI or healthy controls. Studies have recently emerged that associate image scans with continuous clinical scores that are expected to contain richer information than a binary outcome. However, very few studies aim at modeling multiple clinical scores simultaneously, even though it is commonly conceived that multivariate outcomes provide correlated and complementary information about the disease pathology. In this article, we propose a sparse multi-response tensor regression method to model multiple outcomes jointly as well as to model multiple voxels of an image jointly. The proposed method is particularly useful both for inferring clinical scores, and thus disease diagnosis, and for identifying brain subregions that are highly relevant to the disease outcomes. We conducted experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and showed that the proposed method enhances the performance and clearly outperforms the competing solutions.
Affiliation(s)
- Zhou Li
- Department of Statistics, North Carolina State University, Raleigh, NC 27695 USA
- Heung-Il Suk
- Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, South Korea
- Dinggang Shen
- Biomedical Research Imaging Center (BRIC) and Department of Radiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA, and also with the Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, South Korea
- Lexin Li
- Division of Biostatistics, University of California at Berkeley, Berkeley, CA 94720 USA
|
39
|
Richardson S, Tseng GC, Sun W. Statistical Methods in Integrative Genomics. ANNUAL REVIEW OF STATISTICS AND ITS APPLICATION 2016; 3:181-209. [PMID: 27482531 PMCID: PMC4963036 DOI: 10.1146/annurev-statistics-041715-033506] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Indexed: 05/28/2023]
Abstract
Statistical methods in integrative genomics aim to answer important biology questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions.
Affiliation(s)
- Sylvia Richardson
- MRC Biostatistics Unit, Cambridge Institute of Public Health, University of Cambridge, CB2 0SR, United Kingdom
- George C. Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261
- Wei Sun
- Department of Biostatistics, Department of Genetics, University of North Carolina, Chapel Hill, NC 27599
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 27516
|
40
|
Chen M, Ren Z, Zhao H, Zhou H. Asymptotically Normal and Efficient Estimation of Covariate-Adjusted Gaussian Graphical Model. J Am Stat Assoc 2016; 111:394-406. [PMID: 27499564 DOI: 10.1080/01621459.2015.1010039] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 10/24/2022]
Abstract
A tuning-free procedure is proposed to estimate the covariate-adjusted Gaussian graphical model. For each finite subgraph, this estimator is asymptotically normal and efficient. As a consequence, a confidence interval can be obtained for each edge. The procedure enjoys easy implementation and efficient computation through parallel estimation on subgraphs or edges. We further apply the asymptotic normality result to perform support recovery through edge-wise adaptive thresholding. This support recovery procedure is called ANTAC, standing for Asymptotically Normal estimation with Thresholding after Adjusting Covariates. ANTAC outperforms other methodologies in the literature in a range of simulation studies. We apply ANTAC to identify gene-gene interactions using an eQTL dataset. Our result achieves better interpretability and accuracy in comparison with CAMPE.
Affiliation(s)
- Mengjie Chen
  - Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Zhao Ren
  - Department of Statistics, Yale University, New Haven, CT 06520, USA
- Hongyu Zhao
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06520, USA
- Harrison Zhou
  - Department of Statistics, Yale University, New Haven, CT 06520, USA
|
41
|
Newcombe PJ, Conti DV, Richardson S. JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects. Genet Epidemiol 2016; 40:188-201. [PMID: 27027514 PMCID: PMC4817278 DOI: 10.1002/gepi.21953] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Received: 07/31/2015] [Revised: 12/03/2015] [Accepted: 12/15/2015] [Indexed: 01/06/2023]
Abstract
Recently, large scale genome-wide association study (GWAS) meta-analyses have boosted the number of known signals for some traits into the tens and hundreds. Typically, however, variants are only analysed one-at-a-time. This complicates the ability of fine-mapping to identify a small set of SNPs for further functional follow-up. We describe a new and scalable algorithm, joint analysis of marginal summary statistics (JAM), for the re-analysis of published marginal summary statistics under joint multi-SNP models. The correlation is accounted for according to estimates from a reference dataset, and models and SNPs that best explain the complete joint pattern of marginal effects are highlighted via an integrated Bayesian penalized regression framework. We provide both enumerated and Reversible Jump MCMC implementations of JAM and present some comparisons of performance. In a series of realistic simulation studies, JAM demonstrated identical performance to various alternatives designed for single region settings. In multi-region settings, where the only multivariate alternative involves stepwise selection, JAM offered greater power and specificity. We also present an application to real published results from MAGIC (meta-analysis of glucose and insulin related traits consortium) - a GWAS meta-analysis of more than 15,000 people. We re-analysed several genomic regions that produced multiple significant signals with glucose levels 2 hr after oral stimulation. Through joint multivariate modelling, JAM was able to formally rule out many SNPs, and for one gene, ADCY5, suggests that an additional SNP, which transpired to be more biologically plausible, should be followed up with equal priority to the reported index.
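The core idea of re-analysing marginal summary statistics under a joint model can be illustrated with a minimal sketch. It is not the JAM implementation (which is a Bayesian model-selection framework); it only shows how one-at-a-time slopes plus a reference genotype matrix determine joint least-squares effects. The function name `joint_from_marginal`, the simulated continuous "genotypes", and the assumption that the reference panel is the study sample itself are all illustrative.

```python
import numpy as np

def joint_from_marginal(marginal_beta, X_ref):
    """Recover joint multi-SNP effects from single-SNP slopes.

    marginal_beta : (p,) slopes from one-at-a-time regressions
    X_ref         : (n, p) reference genotype matrix (hypothetical input)
    """
    Xc = X_ref - X_ref.mean(axis=0)       # centre each SNP
    xtx = Xc.T @ Xc                       # captures SNP-SNP correlation
    # Each marginal slope is x_j'y / x_j'x_j, so x_j'y can be rebuilt from it.
    z = np.diag(xtx) * marginal_beta
    return np.linalg.solve(xtx, z)        # joint least-squares slopes

# Simulated example: SNP 1 is correlated with the causal SNP 0.
rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.normal(size=(n, p))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * X[:, 1]
y = 0.5 * X[:, 0] + rng.normal(size=n)

Xc = X - X.mean(axis=0)
marginal = (Xc.T @ y) / (Xc ** 2).sum(axis=0)    # one-at-a-time slopes
joint = joint_from_marginal(marginal, X)

# Direct joint regression (with intercept) for comparison.
design = np.column_stack([np.ones(n), X])
ols = np.linalg.lstsq(design, y, rcond=None)[0][1:]
```

In this simulation the non-causal SNP 1 shows a sizeable marginal slope purely through correlation with SNP 0, while its joint slope shrinks toward zero. With a separate reference panel the reconstruction is only approximate.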
Affiliation(s)
- David V. Conti
  - Division of Biostatistics, Department of Preventive Medicine, Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, United States of America
|
42
|
Chen K, Chan KS. A note on rank reduction in sparse multivariate regression. Journal of Statistical Theory and Practice 2016; 10:100-120. [PMID: 26997938 PMCID: PMC4797956 DOI: 10.1080/15598608.2015.1081573] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/23/2022]
Abstract
A reduced-rank regression with sparse singular value decomposition (RSSVD) approach was proposed by Chen et al. for conducting variable selection in a reduced-rank model. To jointly model the multivariate response, the method efficiently constructs a prespecified number of latent variables as some sparse linear combinations of the predictors. Here, we generalize the method to also perform rank reduction, and enable its usage in reduced-rank vector autoregressive (VAR) modeling to perform automatic rank determination and order selection. We show that in the context of stationary time-series data, the generalized approach correctly identifies both the model rank and the sparse dependence structure between the multivariate response and the predictors, with probability one asymptotically. We demonstrate the efficacy of the proposed method by simulations and analyzing a macro-economical multivariate time series using a reduced-rank VAR model.
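For orientation, the rank-reduction step alone can be sketched with classical reduced-rank regression: fit ordinary least squares, then project the coefficients onto the leading right singular vectors of the fitted values. This is the textbook (identity-weighted) estimator, not the sparse RSSVD procedure described in the abstract, and the function name and simulated data are assumptions made for illustration.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical reduced-rank regression with identity weighting."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)         # full-rank OLS fit
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T                                     # top right singular vectors
    return B_ols @ V_r @ V_r.T                            # rank-constrained coefficients

# Simulated data with a rank-2 coefficient matrix.
rng = np.random.default_rng(1)
n, p, q, r = 200, 6, 5, 2
B_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))
X = rng.normal(size=(n, p))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

B_hat = reduced_rank_regression(X, Y, rank=r)
print(np.linalg.matrix_rank(B_hat))
```

Here the rank is supplied by the user; the abstract's contribution is precisely to determine it automatically, together with the sparsity pattern.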
Affiliation(s)
- Kun Chen
  - Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
- Kung-Sik Chan
  - Department of Statistics and Actuarial Science, University of Iowa, Iowa City, Iowa, USA
|
43
|
Agniel D, Liao KP, Cai T. Estimation and testing for multiple regulation of multivariate mixed outcomes. Biometrics 2016; 72:1194-1205. [PMID: 26910481 DOI: 10.1111/biom.12495] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/01/2015] [Revised: 11/01/2015] [Accepted: 12/01/2015] [Indexed: 11/27/2022]
Abstract
Considerable interest has recently been focused on studying multiple phenotypes simultaneously in both epidemiological and genomic studies, either to capture the multidimensionality of complex disorders or to understand shared etiology of related disorders. We seek to identify multiple regulators or predictors that are associated with multiple outcomes when these outcomes may be measured on very different scales or composed of a mixture of continuous, binary, and not-fully observed elements. We first propose an estimation technique to put all effects on similar scales, and we induce sparsity on the estimated effects. We provide standard asymptotic results for this estimator and show that resampling can be used to quantify uncertainty in finite samples. We finally provide a multiple testing procedure which can be geared specifically to the types of multiple regulators of interest, and we establish that, under standard regularity conditions, the familywise error rate will approach 0 as sample size diverges. Simulation results indicate that our approach can improve over unregularized methods both in reducing bias in estimation and improving power for testing.
Affiliation(s)
- Denis Agniel
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, U.S.A. 02115
- Katherine P Liao
  - Brigham and Women's Hospital, Boston, Massachusetts, U.S.A. 02115
- Tianxi Cai
  - Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, U.S.A. 02115
|
44
|
Menezes RX, Mohammadi L, Goeman JJ, Boer JM. Analysing multiple types of molecular profiles simultaneously: connecting the needles in the haystack. BMC Bioinformatics 2016; 17:77. [PMID: 26860128 PMCID: PMC4746904 DOI: 10.1186/s12859-016-0926-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/17/2015] [Accepted: 01/29/2016] [Indexed: 11/23/2022] Open
Abstract
Background: It has been shown that a random-effects framework can be used to test the association between a gene’s expression level and the number of DNA copies of a set of genes. This gene-set modelling framework was later applied to find associations between mRNA expression and microRNA expression, by defining the gene sets using target prediction information. Methods and results: Here, we extend the model introduced by Menezes et al. 2009 to consider the effect of not just copy number, but also of other molecular profiles such as methylation changes and loss-of-heterozygosity (LOH), on gene expression levels. We will consider again sets of measurements, to improve robustness of results and increase the power to find associations. Our approach can be used genome-wide to find associations and yields a test to help separate true associations from noise. We apply our method to colon and to breast cancer samples, for which genome-wide copy number, methylation and gene expression profiles are available. Our findings include interesting gene expression-regulating mechanisms, which may involve only one of copy number or methylation, or both for the same samples. We are even able to find effects due to different molecular mechanisms in different samples. Conclusions: Our method can equally well be applied to cases where other types of molecular (high-dimensional) data are collected, such as LOH, SNP genotype and microRNA expression data. Computationally efficient, it represents a flexible and powerful tool to study associations between high-dimensional datasets. The method is freely available via the SIM BioConductor package. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0926-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Renée X Menezes
  - Department of Epidemiology and Biostatistics, VU University Medical Center, De Boelelaan 1089a, HV Amsterdam, 1081, The Netherlands
- Jelle J Goeman
  - Biostatistics, Department for Health Evidence, Radboud University Medical Center, Nijmegen, The Netherlands
  - Medical Statistics and Bioinformatics, Leiden University Medical Center, Nijmegen, The Netherlands
- Judith M Boer
  - Department of Pediatric Oncology and Hematology, Erasmus MC-Sophia Children's Hospital, Rotterdam, The Netherlands
  - Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
|
45
|
Abstract
Reduced-rank methods are very popular in high-dimensional multivariate analysis for conducting simultaneous dimension reduction and model estimation. However, the commonly-used reduced-rank methods are not robust, as the underlying reduced-rank structure can be easily distorted by only a few data outliers. Anomalies are bound to exist in big data problems, and in some applications they themselves could be of primary interest. While naive residual analysis is often inadequate for outlier detection due to potential masking and swamping, robust reduced-rank estimation approaches could be computationally demanding. Under Stein's unbiased risk estimation framework, we propose a set of tools, including leverage score and generalized information score, to perform model diagnostics and outlier detection in large-scale reduced-rank estimation. The leverage scores give an exact decomposition of the so-called model degrees of freedom to the observation level, which leads to an exact decomposition of many commonly-used information criteria; the resulting quantities are thus named information scores of the observations. The proposed information score approach provides a principled way of combining the residuals and leverage scores for anomaly detection. Simulation studies confirm that the proposed diagnostic tools work well. A pattern recognition example with hand-writing digital images and a time series analysis example with monthly U.S. macroeconomic data further demonstrate the efficacy of the proposed approaches.
Affiliation(s)
- Kun Chen
  - Department of Statistics, University of Connecticut, 215 Glenbrook Rd. U-4120, Storrs, CT 06269-4120
|
46
|
Xiong L, Kuan PF, Tian J, Keles S, Wang S. Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data. Cancer Inform 2015; 13:123-31. [PMID: 26609213 PMCID: PMC4648611 DOI: 10.4137/cin.s16353] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/18/2014] [Revised: 03/16/2015] [Accepted: 03/20/2015] [Indexed: 12/29/2022] Open
Abstract
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. Particularly, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models and utilize a boosting-type method to handle the high dimensionality as well as model the possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies.
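As a rough illustration of component-wise boosting with a multivariate response, the sketch below uses plain L2 boosting with linear base learners: at each step it updates the single (predictor, response) coefficient that most reduces the residual sum of squares. It is a generic textbook variant, not the authors' algorithm (which also targets nonlinear associations); the function name, step count, shrinkage factor `nu`, and simulated data are assumptions.

```python
import numpy as np

def componentwise_boost(X, Y, steps=200, nu=0.1):
    """Component-wise L2 boosting for a multivariate linear model (no intercept)."""
    p, q = X.shape[1], Y.shape[1]
    B = np.zeros((p, q))
    R = Y.copy()                                  # current residuals
    col_ss = (X ** 2).sum(axis=0)                 # x_j'x_j for every predictor
    for _ in range(steps):
        coef = (X.T @ R) / col_ss[:, None]        # best slope for each (j, k) pair
        gain = coef ** 2 * col_ss[:, None]        # RSS reduction for each pair
        j, k = np.unravel_index(np.argmax(gain), gain.shape)
        B[j, k] += nu * coef[j, k]                # shrunken update of one coefficient
        R[:, k] -= nu * coef[j, k] * X[:, j]
    return B

# High-dimension, low-sample-size example with one true effect.
rng = np.random.default_rng(2)
n, p, q = 50, 200, 3
X = rng.normal(size=(n, p))
B_true = np.zeros((p, q))
B_true[3, 1] = 2.0
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

B_hat = componentwise_boost(X, Y)
top = np.unravel_index(np.argmax(np.abs(B_hat)), B_hat.shape)
```

Because only one coefficient changes per step, the number of non-zero entries is bounded by the number of steps, which is what makes the approach usable when predictors outnumber samples. In practice the stopping iteration would be tuned, for example by cross-validation.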
Affiliation(s)
- Lie Xiong
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- Pei-Fen Kuan
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
- Jianan Tian
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- Sunduz Keles
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
- Sijian Wang
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
|
47
|
Shi X, Zhao Q, Huang J, Xie Y, Ma S. Deciphering the associations between gene expression and copy number alteration using a sparse double Laplacian shrinkage approach. Bioinformatics 2015; 31:3977-83. [PMID: 26342102 DOI: 10.1093/bioinformatics/btv518] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 04/28/2015] [Accepted: 07/20/2015] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Both gene expression levels (GEs) and copy number alterations (CNAs) have important biological implications. GEs are partly regulated by CNAs, and much effort has been devoted to understanding their relations. The regulation analysis is challenging with one gene expression possibly regulated by multiple CNAs and one CNA potentially regulating the expressions of multiple genes. The correlations among GEs and among CNAs make the analysis even more complicated. The existing methods have limitations and cannot comprehensively describe the regulation. RESULTS A sparse double Laplacian shrinkage method is developed. It jointly models the effects of multiple CNAs on multiple GEs. Penalization is adopted to achieve sparsity and identify the regulation relationships. Network adjacency is computed to describe the interconnections among GEs and among CNAs. Two Laplacian shrinkage penalties are imposed to accommodate the network adjacency measures. Simulation shows that the proposed method outperforms the competing alternatives with more accurate marker identification. The Cancer Genome Atlas data are analysed to further demonstrate advantages of the proposed method. AVAILABILITY AND IMPLEMENTATION R code is available at http://works.bepress.com/shuangge/49/.
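To make the notion of a Laplacian shrinkage penalty concrete, the sketch below evaluates the generic quadratic form β'Lβ with the unnormalized graph Laplacian L = D - A, which equals the adjacency-weighted sum of squared differences between connected coefficients. The exact adjacency measure and Laplacian used in the paper may differ (for example, signed or normalized versions); the function name and the toy network are assumptions.

```python
import numpy as np

def laplacian_penalty(beta, A):
    """Quadratic network penalty beta' L beta with L = D - A.

    For a symmetric adjacency A this equals
    sum over i < j of A[i, j] * (beta[i] - beta[j]) ** 2.
    """
    L = np.diag(A.sum(axis=1)) - A
    return float(beta @ L @ beta)

# Toy network: node 0 -- node 1 (weight 1), node 1 -- node 2 (weight 0.5).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])

smooth = np.array([1.0, 1.0, 1.0])    # connected coefficients agree
rough = np.array([1.0, -1.0, 1.0])    # connected coefficients disagree

print(laplacian_penalty(smooth, A))   # 0.0
print(laplacian_penalty(rough, A))    # 1*(2**2) + 0.5*(2**2) = 6.0
```

The penalty is zero when connected coefficients agree and grows as they diverge, which is how network adjacency among GEs or among CNAs can be encouraged to show up as similar regression effects.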
Affiliation(s)
- Xingjie Shi
  - Department of Statistics, Nanjing University of Finance and Economics, Nanjing, China
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Qing Zhao
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Jian Huang
  - Department of Statistics and Actuarial Science, University of Iowa, Iowa, IA, USA
- Yang Xie
  - Department of Clinical Science, The University of Texas Southwestern Medical Center, Dallas, TX, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, USA
  - VA Cooperative Studies Program Coordinating Center, West Haven, CT, USA
|
48
|
Wang X, Qin L, Zhang H, Zhang Y, Hsu L, Wang P. A regularized multivariate regression approach for eQTL analysis. Statistics in Biosciences 2015; 7:129-146. [PMID: 26085849 DOI: 10.1007/s12561-013-9106-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/22/2023]
Abstract
Expression quantitative trait loci (eQTLs) are genomic loci that regulate expression levels of mRNAs or proteins. Understanding this regulation provides important clues to biological pathways that underlie diseases. In this paper, we propose a new statistical method, GroupRemMap, for identifying eQTLs. We model the relationship between gene expression and single nucleotide variants (SNVs) through multivariate linear regression models, in which gene expression levels are responses and SNV genotypes are predictors. To handle the high-dimensionality as well as to incorporate the intrinsic group structure of SNVs, we introduce a new regularization scheme to (1) control the overall sparsity of the model; (2) encourage the group selection of SNVs from the same gene; and (3) facilitate the detection of trans-hub-eQTLs. We apply the proposed method to the colorectal and breast cancer data sets from The Cancer Genome Atlas (TCGA), and identify several biologically interesting eQTLs. These findings may provide insight into biological processes associated with cancers and generate hypotheses for future studies.
Affiliation(s)
- Xianlong Wang
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, Seattle, WA, USA
- Li Qin
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, Seattle, WA, USA
- Hexin Zhang
  - Institute of Mathematics Sciences, Peking University, Beijing, China
- Yuzheng Zhang
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, Seattle, WA, USA
- Li Hsu
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, Seattle, WA, USA
- Pei Wang
  - Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, Seattle, WA, USA
|
49
|
Sun Q, Zhu H, Liu Y, Ibrahim JG. SPReM: Sparse Projection Regression Model For High-dimensional Linear Regression. J Am Stat Assoc 2015; 110:289-302. [PMID: 26527844 DOI: 10.1080/01621459.2014.892008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 10/23/2022]
Abstract
The aim of this paper is to develop a sparse projection regression modeling (SPReM) framework to perform multivariate regression modeling with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among multivariate responses. Our SPReM is devised to specifically address the low statistical power issue of many standard statistical approaches, such as the Hotelling's T² test statistic or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPReM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for SURP. Furthermore, we extend SURP to the sparse multi-rank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we have systematically investigated the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis have shown that SPReM outperforms other state-of-the-art methods.
Affiliation(s)
- Qiang Sun
  - Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7420
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7420
- Yufeng Liu
  - Department of Statistics and Operation Research, University of North Carolina at Chapel Hill, CB 3260, Chapel Hill, NC 27599
- Joseph G Ibrahim
  - Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-7420
|
50
|
Li Y, Nan B, Zhu J. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics 2015; 71:354-63. [PMID: 25732839 DOI: 10.1111/biom.12292] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.1] [Received: 08/01/2013] [Revised: 12/01/2014] [Accepted: 01/01/2015] [Indexed: 11/27/2022]
Abstract
We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
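The two-level selection described above (dropping whole groups and also individual coefficients within retained groups) can be seen in the proximal operator of a sparse group lasso penalty for a single group. The sketch below assumes non-overlapping groups and a penalty of the form lam_l1 * ||v||_1 + lam_group * ||v||_2; it does not cover the overlapping, nested, or hierarchical structures the paper handles, and the function names and tuning values are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise lasso shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sparse_group(v, lam_l1, lam_group):
    """Proximal operator of lam_l1*||v||_1 + lam_group*||v||_2 for one group."""
    u = soft_threshold(v, lam_l1)          # within-group sparsity
    norm = np.linalg.norm(u)
    if norm <= lam_group:                  # whole group is removed
        return np.zeros_like(u)
    return (1.0 - lam_group / norm) * u    # group-level shrinkage

strong_group = np.array([3.0, 0.2, -4.0])
weak_group = np.array([0.6, -0.4, 0.3])

kept = prox_sparse_group(strong_group, lam_l1=0.5, lam_group=1.0)
dropped = prox_sparse_group(weak_group, lam_l1=0.5, lam_group=1.0)
```

In this example the strong group survives but loses its small middle coefficient, while the weak group is set to zero as a whole. An operator of this kind would typically sit inside a proximal gradient or block coordinate descent loop over the coefficient matrix.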
Affiliation(s)
- Yanming Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
- Bin Nan
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
- Ji Zhu
  - Department of Statistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
|