1
Mei B, Jiang Y, Sun Y. Unveiling Commonalities and Differences in Genetic Regulations via Two-Way Fusion. J Comput Biol 2024; 31:834-870. [PMID: 39133672 DOI: 10.1089/cmb.2023.0437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/10/2024]
Abstract
Understanding genetic regulation, for example, the regulation of gene expressions (GEs) by copy number variations and methylations, is crucial to uncovering the development and progression of complex diseases. Advancing from early studies that mostly focused on homogeneous groups of patients, some recent studies have shifted their focus toward different patient groups, explored their commonalities and differences, and led to insightful findings. However, the analysis can be very challenging, with one GE possibly regulated by multiple regulators and one regulator potentially regulating the expressions of multiple genes, leading to two distinct types of commonalities/differences in the patterns of genetic regulation. In addition, the high dimensionality of both sides of regulation poses challenges to computation. In this study, we develop a two-way fusion integrative analysis approach, which innovatively applies two fusion penalties to simultaneously identify commonalities/differences in the regulated pattern of GEs and the regulating pattern of regulators, and adopts a Huber loss function to accommodate possible data contamination. Moreover, a simple yet efficient iterative optimization algorithm is developed, which does not need to introduce any auxiliary variables or extra tuning parameters and is guaranteed to converge to a globally optimal solution. The advantages of the proposed approach are demonstrated in extensive simulations. The analysis of The Cancer Genome Atlas data on melanoma and lung cancer leads to interesting findings and satisfactory prediction performance.
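As a rough illustration of the two ingredients named in the abstract above, the sketch below evaluates a Huber loss and a generic L1 fusion penalty between two groups' coefficient vectors. It is not the authors' formulation: the function names, the threshold `delta`, the weight `lam`, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def huber_loss(residuals, delta=1.345):
    """Elementwise Huber loss: quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(np.asarray(residuals, dtype=float))
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))


def fusion_penalty(coef_a, coef_b, lam=1.0):
    """Generic L1 fusion penalty: shrinks two groups' coefficients toward each other."""
    diff = np.asarray(coef_a, dtype=float) - np.asarray(coef_b, dtype=float)
    return lam * np.sum(np.abs(diff))


# Hypothetical residuals: the large one is penalized linearly, not quadratically.
losses = huber_loss([0.5, -1.0, 4.0], delta=1.0)

# Hypothetical coefficients for two patient groups; only one entry differs.
penalty = fusion_penalty([1.0, 0.0, 2.0], [1.0, 0.5, 2.0], lam=2.0)
```

Coefficients that are equal across groups contribute nothing to the penalty, which is how a fusion penalty can separate commonalities from differences.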
Affiliation(s)
- Biao Mei
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, University of Memphis, Memphis, Tennessee, USA
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
  - Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beijing, China
2
Yan H, Zhang S, Ma S. Hierarchy-assisted gene expression regulatory network analysis. Stat Anal Data Min 2023. [DOI: 10.1002/sam.11609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/06/2023]
Affiliation(s)
- Han Yan
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Sanguo Zhang
  - School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing, China
  - Pazhou Lab, Guangzhou, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
3
Yi H, Ma S. Assisted differential network analysis for gene expression data. Genet Epidemiol 2021; 45:604-620. [PMID: 34174112 PMCID: PMC8376770 DOI: 10.1002/gepi.22419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2021] [Revised: 05/10/2021] [Accepted: 05/17/2021] [Indexed: 11/12/2022]
Abstract
In the analysis of gene expression data, when there are two or more disease conditions/groups (e.g., diseased and normal, responder and nonresponder, and multiple stages/subtypes), differential analysis has been extensively conducted to identify key differences and has important implications. Network analysis takes a system perspective and can be more informative than analysis limited to simple statistics such as mean and variance. In differential network analysis, a common practice is to first estimate a gene expression network for each condition/group and then apply spectral clustering to the network difference(s) to identify key genes and biological mechanisms that lead to the differences. Compared to "simple" analysis such as regression, differential network analysis can be more challenging because of the significantly larger number of parameters. In this study, taking advantage of the increasing popularity of multidimensional profiling data, we develop an assisted analysis strategy and propose incorporating regulator information to improve the identification of key genes (that lead to the differences in gene expression networks). An effective computational algorithm is developed. Comprehensive simulation is conducted, showing that the proposed approach can outperform the benchmark alternatives in identification accuracy. With The Cancer Genome Atlas lung adenocarcinoma data, we analyze the expressions of genes in the KEGG cell cycle pathway, assisted by copy number variation data. The proposed assisted analysis leads to identification results similar to the alternatives but different estimations. Overall, this study can deliver an efficient and cost-effective way of improving differential network analysis.
Affiliation(s)
- Huangdi Yi
  - Department of Biostatistics, Yale University
- Shuangge Ma
  - Department of Biostatistics, Yale University
4
Zhang S, Hu X, Luo Z, Jiang Y, Sun Y, Ma S. Biomarker-guided heterogeneity analysis of genetic regulations via multivariate sparse fusion. Stat Med 2021; 40:3915-3936. [PMID: 33906263 PMCID: PMC8277716 DOI: 10.1002/sim.9006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2020] [Revised: 04/07/2021] [Accepted: 04/07/2021] [Indexed: 11/06/2022]
Abstract
Heterogeneity is a hallmark of many complex diseases. There are multiple ways of defining heterogeneity, among which the heterogeneity in genetic regulations, for example, gene expressions (GEs) by copy number variations (CNVs), and methylation, has been suggested but little investigated. Heterogeneity in genetic regulations can be linked with disease severity, progression, and other traits and is biologically important. However, the analysis can be very challenging with the high dimensionality of both sides of regulation as well as sparse and weak signals. In this article, we consider the scenario where subjects form unknown subgroups, and each subgroup has unique genetic regulation relationships. Further, such heterogeneity is "guided" by a known biomarker. We develop a multivariate sparse fusion (MSF) approach, which innovatively applies the penalized fusion technique to simultaneously determine the number and structure of subgroups and regulation relationships within each subgroup. An effective computational algorithm is developed, and extensive simulations are conducted. The analysis of heterogeneity in the GE-CNV regulations in melanoma and GE-methylation regulations in stomach cancer using the TCGA data leads to interesting findings.
Affiliation(s)
- Sanguo Zhang
  - School of Mathematical Sciences, and Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Science, Beijing, China
- Xiaonan Hu
  - School of Mathematical Sciences, and Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Science, Beijing, China
- Ziye Luo
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, University of Memphis, Tennessee, USA
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Shuangge Ma
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
  - Department of Biostatistics, Yale University, Connecticut, USA
5
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022]
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi
  - Department of Biostatistics at Yale University
- Shuangge Ma
  - Department of Biostatistics at Yale University
6
Abstract
In recent biomedical studies, multidimensional profiling, which collects proteomics as well as other types of omics data on the same subjects, is getting increasingly popular. Proteomics, transcriptomics, genomics, epigenomics, and other types of data contain overlapping as well as independent information, which suggests the possibility of integrating multiple types of data to generate more reliable findings/models with better classification/prediction performance. In this chapter, a selective review is conducted on recent data integration techniques for both unsupervised and supervised analysis. The main objective is to provide the "big picture" of data integration that involves proteomics data and discuss the "intuition" beneath the recently developed approaches without invoking too many mathematical details. Potential pitfalls and possible directions for future developments are also discussed.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yu Jiang
  - School of Public Health, University of Memphis, Memphis, TN, USA
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, USA
7
Nguyen QH, Nguyen H, Nguyen T, Le DH. Multi-Omics Analysis Detects Novel Prognostic Subgroups of Breast Cancer. Front Genet 2020; 11:574661. [PMID: 33193681 PMCID: PMC7594512 DOI: 10.3389/fgene.2020.574661] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 06/29/2020] [Accepted: 09/23/2020] [Indexed: 12/02/2022]
Abstract
The unprecedented proliferation of large-scale, multi-omics cancer databases has given us many new insights into genomic and epigenomic deregulation in cancer. However, it remains unclear whether there is a systematic connection between copy number aberrations (CNA) and methylation (MET) and, if so, what role this connection plays in breast cancer (BRCA) tumorigenesis and progression. At the same time, the PAM50 intrinsic subtypes of BRCA have gained the most attention from BRCA experts. However, this classification system has weaknesses, including low accuracy and a possible lack of association with biological phenotypes, and its clinical utility still needs further investigation. In this study, we performed an integrative analysis of three omics profiles, CNA, MET, and mRNA expression, in two BRCA patient cohorts (one for discovery and another for validation) to elucidate those complicated relationships. To this end, we first established a set of CNAcor and METcor genes, which had CNA and MET levels significantly correlated (and anti-correlated) with their corresponding expression levels, respectively. Next, to revisit the current classification of BRCA, we performed single and integrated clustering analyses using our clustering method PINSPlus. We then discovered two biologically distinct subgroups that could be an improved and refined classification system for breast cancer patients and that can be validated with third-party data. Further analyses identified subgroup-specific genes and different interactions between each of the two identified subgroups and age. These findings show promise for diagnosis and prognosis in BRCA and offer a potential alternative to the PAM50 intrinsic subtypes in the future.
Affiliation(s)
- Quang-Huy Nguyen
  - Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
  - Faculty of Pharmacy, Dainam University, Hanoi, Vietnam
- Hung Nguyen
  - Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, United States
- Tin Nguyen
  - Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, United States
- Duc-Hau Le
  - Department of Computational Biomedicine, Vingroup Big Data Institute, Hanoi, Vietnam
  - School of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
8
Fan X, Fang K, Ma S, Zhang Q. Integrating approximate single factor graphical models. Stat Med 2019; 39:146-155. [PMID: 31749227 DOI: 10.1002/sim.8408] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/23/2019] [Revised: 09/23/2019] [Accepted: 10/02/2019] [Indexed: 02/03/2023]
Abstract
In the analysis of complex and high-dimensional data, graphical models have been commonly adopted to describe associations among variables. When common factors exist which make the associations dense, the single factor graphical model has been proposed, which first extracts the common factor and then conducts graphical modeling. Under other simpler contexts, it has been recognized that results generated from analyzing a single dataset are often unsatisfactory, and integrating multiple datasets can effectively improve variable selection and estimation. In graphical modeling, the increased number of parameters makes the "lack of information" problem more severe. In this article, we integrate multiple datasets and conduct the approximate single factor graphical model analysis. A novel penalization approach is developed for the identification and estimation of important loadings and edges. An effective computational algorithm is developed. A wide spectrum of simulations and the analysis of breast cancer gene expression datasets demonstrate the competitive performance of the proposed approach. Overall, this study provides an effective new venue for taking advantage of multiple datasets and improving graphical model analysis.
Affiliation(s)
- Xinyan Fan
  - School of Statistics, Renmin University of China, Beijing, China
- Kuangnan Fang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
  - Key Laboratory of Econometrics, Ministry of Education, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
- Qingzhao Zhang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
  - Key Laboratory of Econometrics, Ministry of Education, Xiamen University, Xiamen, China
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
9
Fang K, Zhang X, Ma S, Zhang Q. Smooth and Locally Sparse Estimation for Multiple-Output Functional Linear Regression. J STAT COMPUT SIM 2019; 90:341-354. [PMID: 33012883 DOI: 10.1080/00949655.2019.1680676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Abstract
Functional data analysis has attracted substantial research interest, and the goal of functional sparsity is to produce a sparse estimate that assigns zero values over regions where the true underlying function is zero, i.e., where there is no relationship between the response variable and the predictor variable. In this paper, we consider a functional linear regression model that explicitly incorporates the interconnections among the responses. We propose a locally sparse (i.e., zero on some subregions) estimator, the multiple-smooth and locally sparse (m-SLoS) estimator, for coefficient functions based on the interconnections among the responses. This method combines the smooth and locally sparse (SLoS) estimator with a Laplacian quadratic penalty function: SLoS encourages local sparsity, and the Laplacian quadratic penalty promotes similar local sparsity among coefficient functions associated with interconnected responses. Simulations show excellent numerical performance of the proposed method in terms of the estimation of coefficient functions, especially when the coefficient functions are the same for all multivariate responses. The practical merit of this modeling is demonstrated by one real application, in which the prediction shows significant improvements.
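To make the Laplacian quadratic penalty mentioned above concrete, here is a minimal sketch under simplifying assumptions: each response is summarized by a single scalar coefficient rather than a coefficient function, and the adjacency matrix encoding the interconnections among responses is hypothetical. The function name and data are illustrative, not taken from the paper.

```python
import numpy as np


def laplacian_penalty(coefs, adjacency):
    """Quadratic form b^T L b, where L = D - A is the graph Laplacian.

    For a symmetric adjacency matrix this equals the sum, over connected
    pairs (i, j), of a_ij * (b_i - b_j)^2, so connected responses are
    encouraged to have similar coefficients.
    """
    coefs = np.asarray(coefs, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    return float(coefs @ laplacian @ coefs)


# Hypothetical chain of three interconnected responses: 0 - 1 - 2.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])

similar = laplacian_penalty([1.0, 1.0, 1.0], adjacency)    # identical coefficients
different = laplacian_penalty([1.0, 0.0, 1.0], adjacency)  # middle response differs
```

Identical coefficients incur no penalty, while each connected pair that disagrees adds its squared difference.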
Affiliation(s)
- Kuangnan Fang
  - Department of Statistics, School of Economics, Xiamen University, China
  - Key Laboratory of Econometrics, Ministry of Education, Xiamen University, China
- Xiaochen Zhang
  - Department of Statistics, School of Economics, Xiamen University, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, USA
- Qingzhao Zhang
  - Department of Statistics, School of Economics, Xiamen University, China
  - Key Laboratory of Econometrics, Ministry of Education, Xiamen University, China
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, China
10
Wu M, Zhang Q, Ma S. Structured gene-environment interaction analysis. Biometrics 2019; 76:23-35. [PMID: 31424088 DOI: 10.1111/biom.13139] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 10/18/2018] [Accepted: 08/06/2019] [Indexed: 01/03/2023]
Abstract
For the etiology, progression, and treatment of complex diseases, gene-environment (G-E) interactions have important implications beyond the main G and E effects. G-E interaction analysis can be more challenging with higher dimensionality and need for accommodating the "main effects, interactions" hierarchy. In recent literature, an array of novel methods, many of which are based on the penalization technique, have been developed. In most of these studies, however, the structures of G measurements, for example, the adjacency structure of single nucleotide polymorphisms (SNPs; attributable to their physical adjacency on the chromosomes) and the network structure of gene expressions (attributable to their coordinated biological functions and correlated measurements) have not been well accommodated. In this study, we develop structured G-E interaction analysis, where such structures are accommodated using penalization for both the main G effects and interactions. Penalization is also applied for regularized estimation and selection. The proposed structured interaction analysis can be effectively realized. It is shown to have consistency properties under high-dimensional settings. Simulations and analysis of GENEVA diabetes data with SNP measurements and TCGA melanoma data with gene expression measurements demonstrate its competitive practical performance.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
11
Horizontal and vertical integrative analysis methods for mental disorders omics data. Sci Rep 2019; 9:13430. [PMID: 31530853 PMCID: PMC6748966 DOI: 10.1038/s41598-019-49718-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/20/2019] [Accepted: 08/30/2019] [Indexed: 12/18/2022]
Abstract
In recent biomedical studies, omics profiling has been extensively conducted on various types of mental disorders. In most of the existing analyses, a single type of mental disorder and a single type of omics measurement are analyzed. In the study of other complex diseases, integrative analysis, both vertical and horizontal integration, has been conducted and shown to bring significantly new insights into disease etiology, progression, biomarkers, and treatment. In this article, we showcase the applicability of integrative analysis to mental disorders. In particular, the horizontal integration of bipolar disorder and schizophrenia and the vertical integration of gene expression and copy number variation data are conducted. The analysis is based on the sparse principal component analysis, penalization, and other advanced statistical techniques. In data analysis, integration leads to biologically sensible findings, including the disease-related gene expressions, copy number variations, and their associations, which differ from the “benchmark” analysis. Overall, this study suggests the potential of integrative analysis in mental disorder research.
12
Ren J, Du Y, Li S, Ma S, Jiang Y, Wu C. Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genet Epidemiol 2019; 43:276-291. [PMID: 30746793 PMCID: PMC6446588 DOI: 10.1002/gepi.22194] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 08/05/2018] [Revised: 11/19/2018] [Accepted: 11/29/2018] [Indexed: 12/21/2022]
Abstract
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selections for high-dimensional cancer genomic data, because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions, and are contaminated by outliers, network-constrained regularization that does not take the robustness into account leads to false identifications of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
Affiliation(s)
- Jie Ren
  - Department of Statistics, Kansas State University, Manhattan, KS
- Yinhao Du
  - Department of Statistics, Kansas State University, Manhattan, KS
- Shaoyu Li
  - Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT
- Yu Jiang
  - Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS
13
Fan X, Fang K, Ma S, Wang S, Zhang Q. Assisted graphical model for gene expression data analysis. Stat Med 2019; 38:2364-2380. [PMID: 30854706 DOI: 10.1002/sim.8112] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/21/2018] [Revised: 12/16/2018] [Accepted: 01/09/2019] [Indexed: 11/12/2022]
Abstract
The analysis of gene expression data has been playing a pivotal role in recent biomedical research. For gene expression data, network analysis has been shown to be more informative and powerful than individual-gene and geneset-based analysis. Despite promising successes, with the high dimensionality of gene expression data and often low sample sizes, network construction with gene expression data is still often challenged. In recent studies, a prominent trend is to conduct multidimensional profiling, under which data are collected on gene expressions as well as their regulators (copy number variations, methylation, microRNAs, SNPs, etc). With the regulation relationship, regulators contain information on gene expressions and can potentially assist in estimating their characteristics. In this study, we develop an assisted graphical model (AGM) approach, which can effectively use information in regulators to improve the estimation of gene expression graphical structure. The proposed approach has an intuitive formulation and can adaptively accommodate different regulator scenarios. Its consistency properties are rigorously established. Extensive simulations and the analysis of a breast cancer gene expression data set demonstrate the practical effectiveness of the AGM.
Affiliation(s)
- Xinyan Fan
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
- Kuangnan Fang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
  - Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut
- Shuaichao Wang
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Qingzhao Zhang
  - Department of Statistics, School of Economics, Xiamen University, Xiamen, China
  - Fujian Key Laboratory of Statistical Sciences, Xiamen University, Xiamen, China
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
14
Wu C, Zhang Q, Jiang Y, Ma S. Robust network-based analysis of the associations between (epi)genetic measurements. J MULTIVARIATE ANAL 2018; 168:119-130. [PMID: 30983643 PMCID: PMC6456078 DOI: 10.1016/j.jmva.2018.06.009] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 10/28/2022]
Abstract
With its important biological implications, modeling the associations of gene expression (GE) and copy number variation (CNV) has been extensively conducted. Such analysis is challenging because of the high data dimensionality, the lack of knowledge of which CNVs regulate a specific GE, different behaviors of the cis-acting and trans-acting CNVs, possible long-tailed distributions and contamination of GE measurements, and correlations between CNVs. The existing methods fail to address one or more of these challenges. In this study, a new method is developed to model the GE-CNV associations more effectively. Specifically, for each GE, a partially linear model, with a nonlinear cis-acting CNV effect, is assumed. A robust loss function is adopted to accommodate long-tailed distributions and data contamination. We adopt penalization to accommodate the high dimensionality and identify relevant CNVs. A network structure is introduced to accommodate the correlations among CNVs. The proposed method comprehensively accommodates multiple challenging characteristics of GE-CNV modeling and effectively overcomes the limitations of existing methods. We develop an effective computational algorithm and rigorously establish the consistency properties. Simulation shows the superiority of the proposed method over alternatives. The TCGA (The Cancer Genome Atlas) data on the PCD (programmed cell death) pathway are analyzed, and the proposed method yields improved prediction and stability as well as biologically plausible findings.
Affiliation(s)
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, 66506, USA
- Qingzhao Zhang
  - School of Economics and the Wang Yanan Institute for Studies in Economics, Xiamen University
- Yu Jiang
  - Division of Epidemiology, Biostatistics, and Environmental Health, School of Public Health, University of Memphis, Memphis, TN, 38111, USA
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, 06510, USA
15
Abstract
BACKGROUND Omics profiling is now a routine component of biomedical studies. In the analysis of omics data, clustering is an essential step and serves multiple purposes, including, for example, revealing the unknown functionalities of omics units and assisting dimension reduction in outcome model building. In the most recent omics studies, a prominent trend is to conduct multilayer profiling, which collects multiple types of genetic, genomic, epigenetic and other measurements on the same subjects. In the literature, clustering methods tailored to multilayer omics data are still limited. Directly applying the existing clustering methods to multilayer omics data, and clustering each layer first and then combining across layers, are both "suboptimal" in that they do not accommodate the interconnections within layers and across layers in an informative way. METHODS In this study, we develop the MuNCut (Multilayer NCut) clustering approach. It is tailored to multilayer omics data and sufficiently accounts for both across- and within-layer connections. It is based on the novel NCut technique and also takes advantage of regularized sparse estimation. It has an intuitive formulation and is computationally very feasible. To facilitate implementation, we develop the function muncut in the R package NcutYX. RESULTS Under a wide spectrum of simulation settings, it outperforms competitors. The analysis of TCGA (The Cancer Genome Atlas) data on breast cancer and cervical cancer shows that MuNCut generates biologically meaningful results which differ from those using the alternatives. CONCLUSIONS We propose a more effective clustering analysis of multiple omics data. It provides a new venue for jointly analyzing genetic, genomic, epigenetic and other measurements.
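For readers unfamiliar with the cut-based criterion underlying this family of methods, the sketch below evaluates the classical normalized cut (NCut) objective for a given partition of a single similarity graph. This is an assumption about the building block the abstract refers to; it is not the multilayer MuNCut objective and does not reproduce the `muncut` function of the NcutYX package. The function name and toy matrix are hypothetical.

```python
import numpy as np


def ncut_value(weights, labels):
    """Classical normalized cut: sum over clusters of cut(A, rest) / vol(A).

    `weights` is a symmetric non-negative similarity matrix; `labels`
    assigns each node to a cluster. Lower values mean better-separated
    clusters. Assumes every cluster has positive volume.
    """
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        inside = labels == k
        cut = weights[inside][:, ~inside].sum()  # weight leaving the cluster
        volume = weights[inside].sum()           # total weight attached to the cluster
        total += cut / volume
    return total


# Hypothetical graph: nodes {0, 1} and {2, 3} are tightly linked within pairs.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])

good = ncut_value(W, [0, 0, 1, 1])  # respects the strong links
bad = ncut_value(W, [0, 1, 0, 1])   # cuts through the strong links
```

The partition that keeps strongly connected nodes together gives the smaller objective value, which is what a cut-based clustering method searches for.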
16
|
Chai H, Shi X, Zhang Q, Zhao Q, Huang Y, Ma S. Analysis of cancer gene expression data with an assisted robust marker identification approach. Genet Epidemiol 2017; 41:779-789. [PMID: 28913902 DOI: 10.1002/gepi.22066] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 11/11/2016] [Revised: 04/10/2017] [Accepted: 07/10/2017] [Indexed: 12/22/2022]
Abstract
Gene expression (GE) studies have been playing a critical role in cancer research. Despite tremendous effort, the analysis results are still often unsatisfactory because of weak signals and high data dimensionality. Analysis is often further challenged by the long-tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on GEs as well as their regulators (e.g., copy number alterations (CNAs), methylation, and microRNAs), which can provide additional information on the associations between GEs and cancer outcomes. In this study, we develop an ARMI (assisted robust marker identification) approach for analyzing cancer studies with measurements on GEs as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing GE data alone. A robust objective function is adopted to accommodate long-tailed distributions. Marker identification is effectively realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable. Simulation shows its satisfactory performance under a variety of settings. TCGA (The Cancer Genome Atlas) data on melanoma and lung cancer are analyzed, leading to biologically plausible marker identification and superior prediction.
Affiliation(s)
- Hao Chai
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- Xingjie Shi
- Department of Statistics, Nanjing University of Finance and Economics, Nanjing Shi, Jiangsu Sheng, China
- Qingzhao Zhang
- School of Economics, Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen Shi, Fujian Sheng, China
- Qing Zhao
- Merck Research Laboratories, Rahway, New Jersey, United States of America
- Yuan Huang
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
|
17
|
Teran Hidalgo SJ, Wu M, Ma S. Assisted clustering of gene expression data using ANCut. BMC Genomics 2017; 18:623. [PMID: 28814280 PMCID: PMC5559859 DOI: 10.1186/s12864-017-3990-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 03/12/2017] [Accepted: 08/01/2017] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND In biomedical research, gene expression profiling studies have been extensively conducted. The analysis of gene expression data has led to a deeper understanding of human genetics as well as practically useful models. Clustering analysis has been a critical component of gene expression data analysis and can reveal the (previously unknown) interconnections among genes. With the high dimensionality of gene expression data, many of the existing clustering methods and results are not fully satisfactory. Intuitively, this is caused by "a lack of information". In recent profiling studies, a prominent trend is to collect data on gene expressions as well as their regulators (copy number alteration, microRNA, methylation, etc.) on the same subjects, making it possible to borrow information from other types of omics measurements in gene expression analysis. METHODS In this study, an ANCut approach is developed, which is built on the regularized estimation and NCut techniques. R code that implements this approach is developed. RESULTS Simulation shows that the proposed approach outperforms direct competitors. The analysis of TCGA (The Cancer Genome Atlas) data further demonstrates its satisfactory performance. CONCLUSIONS We propose a more effective clustering analysis of gene expression data, with the assistance of information from regulators. It provides a new avenue for analyzing gene expression data based on the assisted analysis strategy.
Affiliation(s)
- Mengyun Wu
- Department of Biostatistics, Yale University, 60 College Street, New Haven, 06520 USA
- School of Statistics and Management, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai, 200433 China
- Shuangge Ma
- Department of Biostatistics, Yale University, 60 College Street, New Haven, 06520 USA
- Department of Statistics, Taiyuan University of Technology, 79 Yingze W St, Wanbailin Qu, Shanxi Sheng, 030024 Taiyuan Shi, People’s Republic of China
|
18
|
Zang Y, Zhao Q, Zhang Q, Li Y, Zhang S, Ma S. Inferring gene regulatory relationships with a high-dimensional robust approach. Genet Epidemiol 2017; 41:437-454. [PMID: 28464328 DOI: 10.1002/gepi.22047] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/24/2016] [Revised: 02/12/2017] [Accepted: 02/17/2017] [Indexed: 11/11/2022]
Abstract
Gene expression (GE) levels have important biological and clinical implications. They are regulated by copy number alterations (CNAs). Modeling the regulatory relationships between GEs and CNAs facilitates understanding disease biology and can also have value in translational medicine. The expression level of a gene can be regulated by its cis-acting as well as trans-acting CNAs, and the set of trans-acting CNAs is usually not known, which poses a high-dimensional selection and estimation problem. Most of the existing studies share a common limitation in that they cannot accommodate long-tailed distributions or contamination of GE data. In this study, we develop a high-dimensional robust regression approach to infer the regulatory relationships between GEs and CNAs. A high-dimensional regression model is used to accommodate the effects of both cis-acting and trans-acting CNAs. A density power divergence loss function is used to accommodate long-tailed GE distributions and contamination. Penalization is adopted for regularized estimation and selection of relevant CNAs. The proposed approach is effectively realized using a coordinate descent algorithm. Simulation shows that it has competitive performance compared to the nonrobust benchmark and the robust LAD (least absolute deviation) approach. We analyze TCGA (The Cancer Genome Atlas) data on cutaneous melanoma and study GE-CNA regulations in the RAP (regulation of apoptosis) pathway, which further demonstrates the satisfactory performance of the proposed approach.
Affiliation(s)
- Yangguang Zang
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- Qing Zhao
- Merck Research Lab, Rahway, New Jersey, United States of America
- Qingzhao Zhang
- School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Fujian Sheng, China
- Yang Li
- School of Statistics, Renmin University of China, Beijing, China
- Sanguo Zhang
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
- Shuangge Ma
- Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Fujian Sheng, China
|
19
|
Yan W, Xue W, Chen J, Hu G. Biological Networks for Cancer Candidate Biomarkers Discovery. Cancer Inform 2016; 15:1-7. [PMID: 27625573 PMCID: PMC5012434 DOI: 10.4137/cin.s39458] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 04/18/2016] [Revised: 06/06/2016] [Accepted: 06/16/2016] [Indexed: 12/16/2022] Open
Abstract
Due to its extraordinary heterogeneity and complexity, cancer is often proposed as a model case of a systems biology disease or network disease. There is a critical need for effective biomarkers for cancer diagnosis and/or outcome prediction from system-level analyses. Methods based on integrating omics data into networks have the potential to revolutionize the identification of cancer biomarkers. Deciphering the biological networks underlying cancer is undoubtedly important for understanding the molecular mechanisms of the disease and identifying effective biomarkers. In this review, the networks constructed for cancer biomarker discovery based on data from different omics levels are described and illustrated with recent advances in the field.
Affiliation(s)
- Wenying Yan
- Center for Systems Biology, Soochow University, Suzhou, Jiangsu, China
- Wenjin Xue
- Department of Electrical Engineering, Technician College of Taizhou, Taizhou, Jiangsu, China
- Jiajia Chen
- School of Chemistry, Biology and Material Engineering, Suzhou University of Science and Technology, Suzhou, China
- Guang Hu
- Center for Systems Biology, Soochow University, Suzhou, Jiangsu, China
|