1
Qin X, Hu J, Ma S, Wu M. Estimation of multiple networks with common structures in heterogeneous subgroups. J Multivariate Anal 2024; 202:105298. [PMID: 38433779 PMCID: PMC10907012 DOI: 10.1016/j.jmva.2024.105298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2024]
Abstract
Network estimation has been a critical component of high-dimensional data analysis and can provide an understanding of the underlying complex dependence structures. Among the existing studies, Gaussian graphical models (GGMs) have been highly popular. However, they still have limitations due to the homogeneous distribution assumption and the fact that they are only applicable to small-scale data. For example, cancers have various levels of unknown heterogeneity, and biological networks, which include thousands of molecular components, often differ across subgroups while also sharing some commonalities. In this article, we propose a new joint estimation approach for multiple networks with unknown sample heterogeneity, by decomposing the GGM into a collection of sparse regression problems. A reparameterization technique and a composite minimax concave penalty are introduced to effectively accommodate the specific and common information across the networks of multiple subgroups. The proposed estimator thereby advances significantly beyond existing heterogeneity network analysis, which is based directly on the regularized likelihood of the GGM, and enjoys scale-invariance, tuning-insensitivity, and optimization convexity properties. The proposed analysis can be effectively realized using parallel computing. The estimation and selection consistency properties are rigorously established. The proposed approach allows the theoretical studies to focus on independent network estimation only and has the significant advantage of being both theoretically and computationally applicable to large-scale data. Extensive numerical experiments with simulated data and the TCGA breast cancer data demonstrate the prominent performance of the proposed approach in both subgroup and network identification.
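As a point of reference for the penalty named in the abstract, the following is a minimal sketch of the standard (non-composite) minimax concave penalty for a single coefficient. The function name `mcp_penalty` and the parameter names `lam` and `gamma` are illustrative assumptions; the sketch does not reproduce the authors' composite penalty or their reparameterization.

```python
def mcp_penalty(t, lam, gamma):
    """Standard minimax concave penalty (MCP) at a scalar coefficient t.

    lam   : regularization level (assumed non-negative)
    gamma : concavity parameter (assumed positive; values above 1 are typical)
    """
    if lam < 0 or gamma <= 0:
        raise ValueError("lam must be >= 0 and gamma must be > 0")
    a = abs(t)
    if a <= gamma * lam:
        # Near zero the penalty behaves like the lasso, with a quadratic correction.
        return lam * a - a * a / (2.0 * gamma)
    # Beyond gamma * lam the penalty is flat, so large coefficients are not shrunk further.
    return 0.5 * gamma * lam * lam
```

With `lam = 1` and `gamma = 3`, for instance, the penalty becomes constant at `gamma * lam**2 / 2` once `|t|` exceeds `gamma * lam`, which is the property that limits shrinkage bias on large coefficients.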
Affiliation(s)
- Xing Qin
  - School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai, China
- Jianhua Hu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
2
Liang W, Zhang Q, Ma S. Locally sparse quantile estimation for a partially functional interaction model. Comput Stat Data Anal 2023; 186:107782. [PMID: 39555004 PMCID: PMC11566403 DOI: 10.1016/j.csda.2023.107782] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/19/2024]
Abstract
Functional data analysis has been extensively conducted. In this study, we consider a partially functional model, under which some covariates are scalars and have linear effects, while some other variables are functional and have unspecified nonlinear effects. Significantly advancing from the existing literature, we consider a model with interactions between the functional and scalar covariates. To accommodate long-tailed error distributions which are not uncommon in data analysis, we adopt the quantile technique for estimation. To achieve more interpretable estimation, and to accommodate many practical settings, we assume that the functional covariate effects are locally sparse (that is, there exist subregions on which the effects are exactly zero), which naturally leads to a variable/model selection problem. We propose respecting the "main effect, interaction" hierarchy, which postulates that if a subregion has a nonzero effect in an interaction term, then its effect has to be nonzero in the corresponding main functional effect. For estimation, identification of local sparsity, and respect of the hierarchy, we propose a penalization approach. An effective computational algorithm is developed, and the consistency properties are rigorously established under mild regularity conditions. Simulation shows the practical effectiveness of the proposed approach. The analysis of the Tecator data further demonstrates its practical applicability. Overall, this study can deliver a novel and practically useful model and a statistically and numerically satisfactory estimation approach.
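The quantile technique mentioned in the abstract rests on the check (pinball) loss, which can be sketched as follows. The function names `check_loss` and `mean_check_loss` are illustrative assumptions, and the sketch covers only the loss itself, not the authors' functional model, penalty, or algorithm.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0}) for a residual u."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    indicator = 1.0 if u < 0 else 0.0
    return u * (tau - indicator)


def mean_check_loss(y, y_hat, tau):
    """Average check loss of predictions y_hat against observations y."""
    if len(y) != len(y_hat) or len(y) == 0:
        raise ValueError("y and y_hat must be non-empty and of equal length")
    return sum(check_loss(a - b, tau) for a, b in zip(y, y_hat)) / len(y)
```

Because the loss grows linearly rather than quadratically in the residual, a few extreme observations influence the fit less than they would under squared error, which is why it suits long-tailed errors.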
Affiliation(s)
- Weijuan Liang
  - School of Statistics, Renmin University of China, Beijing, China
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, The Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
3
Akdemir D, Somo M, Isidro-Sanchéz J. An Expectation-Maximization Algorithm for Combining a Sample of Partially Overlapping Covariance Matrices. Axioms 2023; 12:161. [PMID: 37284612 PMCID: PMC10243021 DOI: 10.3390/axioms12020161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The generation of unprecedented amounts of data brings new challenges in data management, but also an opportunity to accelerate the identification of processes in multiple scientific disciplines. One of these challenges is the harmonization of high-dimensional, unbalanced, and heterogeneous data. In this manuscript, we propose a statistical approach to combine incomplete and partially overlapping pieces of covariance matrices that come from independent experiments. We assume that the data are a random sample of partial covariance matrices sampled from Wishart distributions, and we derive an expectation-maximization algorithm for parameter estimation. We demonstrate the properties of our method using (i) simulation studies and (ii) empirical datasets. In general, being able to make inferences about the covariance of variables not observed in the same experiment is a valuable tool for data analysis, since covariance estimation is an important step in many statistical applications, such as multivariate analysis, principal component analysis, factor analysis, and structural equation modeling.
Affiliation(s)
- Deniz Akdemir
  - Center of International Bone Marrow Transplantation Research, Minneapolis, MN 55401-1206, USA
- Julio Isidro-Sanchéz
  - Centro de Biotecnologia y Genómica de Plantas, Instituto Nacional de Investigación y Tecnologia Agraria y Alimentaria, Universidad Politécnica de Madrid, 28223, Madrid, Spain
4
Affiliation(s)
- Yunfei Wei
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, People's Republic of China
  - NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, People's Republic of China
- Shifeng Xiong
  - NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, People's Republic of China
5
Huang Y, Liu J, Yi H, Shia BC, Ma S. Promoting similarity of model sparsity structures in integrative analysis of cancer genetic data. Stat Med 2016; 36:509-559. [PMID: 27667129 DOI: 10.1002/sim.7138] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 12/07/2014] [Revised: 07/24/2016] [Accepted: 09/02/2016] [Indexed: 01/05/2023]
Abstract
In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis utilizes information from multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets (or, equivalently, similarity of model sparsity structures) across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right-censored survival data under the accelerated failure time model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Yuan Huang
  - VA Cooperative Studies Program Coordinating Center, West Haven, CT; Department of Biostatistics, Yale University, New Haven, CT, U.S.A.
- Jin Liu
  - Center of Quantitative Medicine, Duke-NUS Medical School, Singapore
- Huangdi Yi
  - Department of Biostatistics, Yale University, New Haven, CT, U.S.A.
- Ben-Chang Shia
  - School of Health Care Administration, Big Data Research Center & School of Management, Taipei Medical University, Taipei, Taiwan
- Shuangge Ma
  - VA Cooperative Studies Program Coordinating Center, West Haven, CT; Department of Biostatistics, Yale University, New Haven, CT, U.S.A.
6
Matsui H. Sparse regularization for bi-level variable selection. Journal of the Japanese Society of Computational Statistics 2015. [DOI: 10.5183/jjscs.1502001_216] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/11/2022]
7
Lin D, Zhang J, Li J, He H, Deng HW, Wang YP. Integrative analysis of multiple diverse omics datasets by sparse group multitask regression. Front Cell Dev Biol 2014; 2:62. [PMID: 25364766 PMCID: PMC4209817 DOI: 10.3389/fcell.2014.00062] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 06/21/2014] [Accepted: 10/01/2014] [Indexed: 01/10/2023] [Open Access]
Abstract
A variety of high-throughput genome-wide assays enable the exploration of genetic risk factors underlying complex traits. Although these studies have had a remarkable impact on identifying susceptible biomarkers, they suffer from issues such as limited sample size and low reproducibility. Combining individual studies of different genetic levels/platforms has the promise to improve the power and consistency of biomarker identification. In this paper, we propose a novel integrative method, namely sparse group multitask regression, for integrating diverse omics datasets, platforms, and populations to identify risk genes/factors of complex diseases. This method combines multitask learning with sparse group regularization, which will: (1) treat the biomarker identification in each single study as a task and then combine them by multitask learning; (2) group variables from all studies for identifying significant genes; (3) enforce a sparse constraint on groups of variables to overcome the "small sample, but large variables" problem. We introduce two sparse group penalties, sparse group lasso and sparse group ridge, in our multitask model, and provide an effective algorithm for each model. In addition, we propose a significance test for the identification of potential risk genes. Two simulation studies are performed to evaluate the performance of our integrative method by comparing it with a conventional meta-analysis method. The results show that our sparse group multitask method significantly outperforms the meta-analysis method. In an application to our osteoporosis studies, 7 genes are identified as significant genes by our method and are found to have significant effects in three other independent studies used for validation. The most significant gene, SOD2, has been identified in our previous osteoporosis study involving the same expression dataset. Several other genes, such as TREML2, HTR1E, and GLO1, are shown to be novel susceptible genes for osteoporosis, as confirmed from other studies.
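To make the sparse group lasso penalty named in the abstract concrete, here is a minimal sketch of one common form: a group-wise L2 term weighted by the square root of the group size, plus an element-wise L1 term. The function name, the `groups` layout (one list of coefficients per group, for example one gene across studies), and the square-root weighting are assumptions for illustration; the authors' exact weighting and multitask formulation may differ.

```python
import math


def sparse_group_lasso_penalty(groups, lam_group, lam_l1):
    """One common sparse group lasso penalty.

    groups    : list of lists, each inner list holding the coefficients of one group
    lam_group : weight on the group-wise L2 term (switches whole groups off)
    lam_l1    : weight on the element-wise L1 term (sparsity within groups)
    """
    if lam_group < 0 or lam_l1 < 0:
        raise ValueError("penalty weights must be non-negative")
    # Group term: sqrt(group size) * Euclidean norm of the group's coefficients.
    group_term = sum(
        math.sqrt(len(g)) * math.sqrt(sum(b * b for b in g)) for g in groups
    )
    # L1 term: sum of absolute values over all coefficients.
    l1_term = sum(abs(b) for g in groups for b in g)
    return lam_group * group_term + lam_l1 * l1_term
```

A group whose coefficients are all zero contributes nothing to either term, so minimizing a loss plus this penalty can drop entire groups while still zeroing individual coefficients inside the groups that remain.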
Affiliation(s)
- Dongdong Lin
  - Biomedical Engineering Department, Tulane University, New Orleans, LA, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA
- Jigang Zhang
  - Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, USA
- Jingyao Li
  - Biomedical Engineering Department, Tulane University, New Orleans, LA, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA
- Hao He
  - Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, USA
- Hong-Wen Deng
  - Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, USA
- Yu-Ping Wang
  - Biomedical Engineering Department, Tulane University, New Orleans, LA, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, USA
8
Liu J, Huang J, Ma S. Penalized multivariate linear mixed model for longitudinal genome-wide association studies. BMC Proc 2014; 8:S73. [PMID: 25519343 PMCID: PMC4143695 DOI: 10.1186/1753-6561-8-s1-s73] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/02/2022] [Open Access]
Abstract
We consider the analysis of Genetic Analysis Workshop 18 data, which involve multiple longitudinal traits and dense genome-wide single-nucleotide polymorphism (SNP) markers. We use a multivariate linear mixed model to account for the covariance of random effects and multivariate residuals. We divide the SNPs into groups according to the genes they belong to and score them using weighted sum statistics. We propose a penalized approach for genetic variant selection at the gene level. The overall modeling and penalized selection method is referred to as the penalized multivariate linear mixed model. Cross-validation is used for tuning parameter selection. A resampling approach is adopted to evaluate the relative stability of the identified genes. Application to the Genetic Analysis Workshop 18 data shows that the proposed approach can effectively select markers associated with phenotypes at the gene level.
Affiliation(s)
- Jin Liu
  - School of Public Health, University of Illinois at Chicago, 1601 W. Taylor Street, Chicago, IL 60612, USA
- Jian Huang
  - Department of Statistics & Actuarial Science, Department of Biostatistics, University of Iowa, 241 Schaeffer Hall, Iowa City, IA 52242, USA
- Shuangge Ma
  - School of Public Health, Yale University, 60 College Street, New Haven, CT 06520, USA; VA Cooperative Studies Program Coordinating Center, West Haven, CT 06516, USA