1
Min W, Wan X, Chang TH, Zhang S. A Novel Sparse Graph-Regularized Singular Value Decomposition Model and Its Application to Genomic Data Analysis. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:3842-3856. [PMID: 33556027 DOI: 10.1109/tnnls.2021.3054635] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 06/12/2023]
Abstract
Learning the gene coexpression pattern is a central challenge for high-dimensional gene expression analysis. Recently, sparse singular value decomposition (SVD) has been used to achieve this goal. However, this model ignores the structural information between variables (e.g., a gene network). The typical graph-regularized penalty can be used to incorporate such prior graph information to achieve more accurate discovery and better interpretability. However, the existing approach fails to consider the opposite effect of variables with negative correlations. In this article, we propose a novel sparse graph-regularized SVD model with absolute operator (AGSVD) for high-dimensional gene expression pattern discovery. The key of AGSVD is to impose a novel graph-regularized penalty (|u|^T L|u|). However, such a penalty is a nonconvex and nonsmooth function, so it brings new challenges to model solving. We show that the nonconvex problem can be efficiently handled in a convex fashion by adopting an alternating optimization strategy. The simulation results on synthetic data show that our method is more effective than the existing SVD-based ones. In addition, the results on several real gene expression data sets show that the proposed methods can discover more biologically interpretable expression patterns by incorporating the prior gene network.
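The difference between the typical graph-regularized penalty and the absolute-operator penalty named in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes L is the unnormalized graph Laplacian (degree matrix minus adjacency matrix), and the function names and the two-gene toy network are hypothetical.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A (an assumption of this sketch)."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def standard_penalty(u, laplacian):
    """Typical graph-regularized penalty u^T L u."""
    u = np.asarray(u, dtype=float)
    return float(u @ laplacian @ u)


def absolute_penalty(u, laplacian):
    """Absolute-operator penalty |u|^T L |u|."""
    a = np.abs(np.asarray(u, dtype=float))
    return float(a @ laplacian @ a)


# Hypothetical toy network: two connected genes with opposite-sign loadings.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
L = graph_laplacian(A)
u = np.array([1.0, -1.0])

std = standard_penalty(u, L)   # penalizes the sign difference
abs_pen = absolute_penalty(u, L)  # ignores the sign, compares magnitudes only
print(std, abs_pen)
```

Under these assumptions the standard penalty charges the two negatively correlated genes for having different signs, while the absolute version treats them as equally loaded neighbors, which is the behavior the abstract motivates.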
2
Wen C, Ba H, Pan W, Huang M. Co-sparse reduced-rank regression for association analysis between imaging phenotypes and genetic variants. Bioinformatics 2021; 36:5214-5222. [PMID: 32683450 DOI: 10.1093/bioinformatics/btaa650] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/01/2019] [Revised: 05/22/2020] [Accepted: 07/14/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The association analysis between genetic variants and imaging phenotypes must be carried out to understand inherited neuropsychiatric disorders via imaging genetic studies. Given the high dimensionality of imaging and genetic data, traditional methods based on massive univariate regression entail large computational cost and disregard many-to-many correlations between phenotypes and genetic variants. Several multivariate imaging genetic methods have been proposed to alleviate the above problems. However, most of these methods are based on the l1 penalty, which might cause the over-selection of variables and thus mislead scientists in analyzing data from the field of neuroimaging genetics. RESULTS: To address these challenges in both statistics and computation, we propose a novel co-sparse reduced-rank regression model that identifies complex correlations in a dimensional reduction manner. We developed an iterative algorithm based on a group primal dual-active set formulation to simultaneously detect important genetic variants and imaging phenotypes efficiently and precisely via a non-convex penalty. The simulation studies showed that our method achieved accurate and stable performance in parameter estimation and variable selection. In a real application, the proposed approach successfully detected several novel Alzheimer's disease-related genetic variants and regions of interest, which indicates that our method may be a valuable statistical toolbox for imaging genetic studies. AVAILABILITY AND IMPLEMENTATION: The R package csrrr and the code for the experiments in this article are available on GitHub: https://github.com/hailongba/csrrr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Canhong Wen
- International Institute of Finance, School of Management, University of Science and Technology of China, Hefei 230026, China
- Hailong Ba
- International Institute of Finance, School of Management, University of Science and Technology of China, Hefei 230026, China
- Wenliang Pan
- Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China
- Meiyan Huang
- School of Biomedical Engineering, Guangzhou 510515, China; Guangdong Provincial Key Laboratory of Medical Image Processing, Southern Medical University, Guangzhou 510515, China
3
Mokhtaridoost M, Gönen M. An efficient framework to identify key miRNA-mRNA regulatory modules in cancer. Bioinformatics 2020; 36:i592-i600. [PMID: 33381822 DOI: 10.1093/bioinformatics/btaa798] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Micro-RNAs (miRNAs) are important components of RNA silencing and post-transcriptional gene regulation, and they interact with messenger RNAs (mRNAs) either by degradation or by translational repression. miRNA alterations have a significant impact on the formation and progression of human cancers. Accordingly, it is important to establish computational methods with high predictive performance to identify cancer-specific miRNA-mRNA regulatory modules. RESULTS: We presented a two-step framework to model miRNA-mRNA relationships and identify cancer-specific modules between miRNAs and mRNAs from their matched expression profiles of more than 9000 primary tumors. We first estimated the regulatory matrix between miRNA and mRNA expression profiles by solving multiple linear programming problems. We then formulated a unified regularized factor regression (RFR) model that simultaneously estimates the effective number of modules (i.e. latent factors) and extracts modules by decomposing the regulatory matrix into two low-rank matrices. Our RFR model groups correlated miRNAs together and correlated mRNAs together, and also controls the sparsity levels of both matrices. These attributes lead to interpretable results with high predictive performance. We applied our method to a comprehensive data collection covering 32 TCGA cancer types. To assess the biological relevance of our approach, we performed functional gene set enrichment and survival analyses. A large portion of the identified modules are significantly enriched in Hallmark, PID and KEGG pathways/gene sets. To validate the identified modules, we also performed literature validation as well as validation using the experimentally supported miRTarBase database. AVAILABILITY AND IMPLEMENTATION: Our implementation of the proposed two-step RFR algorithm in R is available at https://github.com/MiladMokhtaridoost/2sRFR together with the scripts that replicate the reported experiments.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mehmet Gönen
- Department of Industrial Engineering, College of Engineering, İstanbul 34450, Turkey; School of Medicine, Koç University, İstanbul 34450, Turkey; Department of Biomedical Engineering, School of Medicine, Oregon Health & Science University, Portland, OR 97239, USA
4
Feng Y, Xiao L, Chi EC. Sparse Single Index Models for Multivariate Responses. J Comput Graph Stat 2020; 30:115-124. [PMID: 34025100 DOI: 10.1080/10618600.2020.1779080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
Joint models are popular for analyzing data with multivariate responses. We propose a sparse multivariate single index model, where responses and predictors are linked by unspecified smooth functions and multiple matrix-level penalties are employed to select predictors and induce low-rank structures across responses. An alternating direction method of multipliers (ADMM) based algorithm is proposed for model estimation. We demonstrate the effectiveness of the proposed model in simulation studies and an application to a genetic association study.
Affiliation(s)
- Yuan Feng
- Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
- Luo Xiao
- Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
- Eric C Chi
- Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
5
Subramanian I, Verma S, Kumar S, Jere A, Anamika K. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights 2020; 14:1177932219899051. [PMID: 32076369 PMCID: PMC7003173 DOI: 10.1177/1177932219899051] [Citation(s) in RCA: 574] [Impact Index Per Article: 143.5] [Received: 11/07/2019] [Accepted: 11/09/2019] [Indexed: 12/22/2022] Open
Abstract
To study complex biological processes holistically, it is imperative to take an integrative approach that combines multi-omics data to highlight the interrelationships of the involved biomolecules and their functions. With the advent of high-throughput techniques and the availability of multi-omics data generated from a large set of samples, several promising tools and methods have been developed for data integration and interpretation. In this review, we collected the tools and methods that adopt an integrative approach to analyze multiple omics data and summarized their ability to address applications such as disease subtyping, biomarker prediction, and deriving insights into the data. We provide the methodology, use-cases, and limitations of these tools; a brief account of multi-omics data repositories and visualization portals; and the challenges associated with multi-omics data integration.
Affiliation(s)
- Abhay Jere
- Innovation Cell, Ministry of Human Resource Development, New Delhi, India
6
Yu M, Gupta V, Kolar M. Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach. Electron J Stat 2020. [DOI: 10.1214/19-ejs1658] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
7
Min W, Liu J, Zhang S. Edge-group sparse PCA for network-guided high dimensional data analysis. Bioinformatics 2019; 34:3479-3487. [PMID: 29726900 DOI: 10.1093/bioinformatics/bty362] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 06/03/2017] [Accepted: 05/02/2018] [Indexed: 12/14/2022] Open
Abstract
Motivation: Principal component analysis (PCA) has been widely used to deal with high-dimensional gene expression data. In this study, we proposed an Edge-group Sparse PCA (ESPCA) model by incorporating the group structure from a prior gene network into the PCA framework for dimension reduction and feature interpretation. ESPCA enforces sparsity of principal component (PC) loadings by considering the connectivity of gene variables in the prior network. We developed an alternating iterative algorithm to solve ESPCA. The key of this algorithm is to solve a new k-edge sparse projection problem, and a greedy strategy has been adapted to address it. Here we adopted ESPCA for analyzing multiple gene expression matrices simultaneously. By incorporating prior knowledge, our method can overcome the drawbacks of sparse PCA and capture some gene modules with better biological interpretations. Results: We evaluated the performance of ESPCA using a set of artificial datasets and two real biological datasets (including TCGA pan-cancer expression data and ENCODE expression data), and compared its performance with PCA and sparse PCA. The results showed that ESPCA could identify more biologically relevant genes, improve their biological interpretations and reveal distinct sample characteristics. Availability and implementation: An R package of ESPCA is available at http://page.amss.ac.cn/shihua.zhang/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenwen Min
- School of Computer Science, Wuhan University, Wuhan, China
- Juan Liu
- School of Computer Science, Wuhan University, Wuhan, China
- Shihua Zhang
- NCMIS, CEMS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China; School of Mathematics Sciences, University of Chinese Academy of Sciences, Beijing, China; Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, China
8
Uematsu Y, Fan Y, Chen K, Lv J, Lin W. SOFAR: Large-Scale Association Network Learning. IEEE TRANSACTIONS ON INFORMATION THEORY 2019; 65:4924-4939. [PMID: 33746241 PMCID: PMC7970712 DOI: 10.1109/tit.2019.2909889] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 06/12/2023]
Abstract
Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been two largely incompatible goals. To accommodate both features, in this paper we suggest the method of sparse orthogonal factor regression (SOFAR) via the sparse singular value decomposition with orthogonality constrained optimization to learn the underlying association networks, with broad applications to both unsupervised and supervised learning tasks such as biclustering with sparse singular value decomposition, sparse principal component analysis, sparse factor analysis, and sparse vector autoregression analysis. Exploiting the framework of convexity-assisted nonconvex optimization, we derive nonasymptotic error bounds for the suggested procedure characterizing the theoretical advantages. The statistical guarantees are powered by an efficient SOFAR algorithm with convergence property. Both computational and theoretical advantages of our procedure are demonstrated with several simulations and real data examples.
Affiliation(s)
- Yoshimasa Uematsu
- Assistant Professor, Department of Economics and Management, Tohoku University, Sendai 980-8576, Japan
- Yingying Fan
- Dean's Associate Professor in Business Administration, Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Kun Chen
- Associate Professor, Department of Statistics, University of Connecticut, Storrs, CT 06269
- Jinchi Lv
- Kenneth King Stonier Chair in Business Administration and Professor, Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
- Wei Lin
- Assistant Professor, School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China 100871
9
He K, Lian H, Ma S, Huang JZ. Dimensionality Reduction and Variable Selection in Multivariate Varying-Coefficient Models With a Large Number of Covariates. J Am Stat Assoc 2018. [DOI: 10.1080/01621459.2017.1285774] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 10/20/2022]
Affiliation(s)
- Kejun He
- Institute of Statistics and Big Data, Renmin University of China, Beijing, China
- Heng Lian
- Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong
- Shujie Ma
- Department of Statistics, University of California-Riverside, Riverside, CA
- Jianhua Z. Huang
- Department of Statistics, Texas A & M University, College Station, TX
10
11
Wang Y, Jiang R, Wong WH. Modeling the causal regulatory network by integrating chromatin accessibility and transcriptome data. Natl Sci Rev 2016; 3:240-251. [PMID: 28690910 DOI: 10.1093/nsr/nww025] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 11/13/2022] Open
Abstract
Cells pack a great deal of genetic and regulatory information into a structure known as chromatin, in which DNA is wrapped around histone proteins and tightly packed. To express a gene in a specific coding region, the chromatin opens up and a DNA loop may be formed by interacting enhancers and promoters. Furthermore, the mediator and cohesin complexes, sequence-specific transcription factors, and RNA polymerase II are recruited and work together to regulate the expression level. There is a pressing need to understand how the information about when, where, and to what degree genes should be expressed is embedded in chromatin structure and gene regulatory elements. Thanks to large consortia such as the Encyclopedia of DNA Elements (ENCODE) and Roadmap Epigenomic projects, extensive data on chromatin accessibility and transcript abundance are available across many tissues and cell types. These rich data offer an exciting opportunity to model the causal regulatory relationship. Here, we review the current experimental approaches, foundational data, computational problems, interpretive frameworks, and integrative models that will enable the accurate interpretation of the regulatory landscape. In particular, we discuss the efforts to organize, analyze, model, and integrate DNA accessibility data, transcriptional data, and functional genomic regions. We believe that these efforts will eventually help us understand the information flow within the cell and will influence research directions across many fields.
Affiliation(s)
- Yong Wang
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, CA 94305, USA; Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100080, China
- Rui Jiang
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, CA 94305, USA; MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Wing Hung Wong
- Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford, CA 94305, USA
12
Chen K, Chan KS. A note on rank reduction in sparse multivariate regression. JOURNAL OF STATISTICAL THEORY AND PRACTICE 2016; 10:100-120. [PMID: 26997938 PMCID: PMC4797956 DOI: 10.1080/15598608.2015.1081573] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Abstract
A reduced-rank regression with sparse singular value decomposition (RSSVD) approach was proposed by Chen et al. for conducting variable selection in a reduced-rank model. To jointly model the multivariate response, the method efficiently constructs a prespecified number of latent variables as sparse linear combinations of the predictors. Here, we generalize the method to also perform rank reduction, and enable its usage in reduced-rank vector autoregressive (VAR) modeling to perform automatic rank determination and order selection. We show that in the context of stationary time-series data, the generalized approach correctly identifies both the model rank and the sparse dependence structure between the multivariate response and the predictors, with probability one asymptotically. We demonstrate the efficacy of the proposed method by simulations and by analyzing a macroeconomic multivariate time series using a reduced-rank VAR model.
Affiliation(s)
- Kun Chen
- Department of Statistics, University of Connecticut, Storrs, Connecticut, USA
- Kung-Sik Chan
- Department of Statistics and Actuarial Science, University of Iowa, Iowa City, Iowa, USA
13
Chen J, Zhang S. Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data. Bioinformatics 2016; 32:1724-32. [DOI: 10.1093/bioinformatics/btw059] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 09/25/2015] [Accepted: 01/27/2016] [Indexed: 12/13/2022] Open
14
Abstract
Reduced-rank methods are very popular in high-dimensional multivariate analysis for conducting simultaneous dimension reduction and model estimation. However, the commonly used reduced-rank methods are not robust, as the underlying reduced-rank structure can be easily distorted by only a few data outliers. Anomalies are bound to exist in big data problems, and in some applications they themselves could be of primary interest. While naive residual analysis is often inadequate for outlier detection due to potential masking and swamping, robust reduced-rank estimation approaches could be computationally demanding. Under Stein's unbiased risk estimation framework, we propose a set of tools, including leverage score and generalized information score, to perform model diagnostics and outlier detection in large-scale reduced-rank estimation. The leverage scores give an exact decomposition of the so-called model degrees of freedom to the observation level, which leads to an exact decomposition of many commonly used information criteria; the resulting quantities are thus named information scores of the observations. The proposed information score approach provides a principled way of combining the residuals and leverage scores for anomaly detection. Simulation studies confirm that the proposed diagnostic tools work well. A pattern recognition example with handwritten digit images and a time series analysis example with monthly U.S. macroeconomic data further demonstrate the efficacy of the proposed approaches.
Affiliation(s)
- Kun Chen
- Department of Statistics, University of Connecticut, 215 Glenbrook Rd. U-4120, Storrs, CT 06269-4120
15
Abstract
Despite the rapid accumulation of tumor-profiling data and transcription factor (TF) ChIP-seq profiles, efforts integrating TF binding with the tumor-profiling data to understand how TFs regulate tumor gene expression are still limited. To systematically search for cancer-associated TFs, we comprehensively integrated 686 ENCODE ChIP-seq profiles representing 150 TFs with 7484 TCGA tumor data in 18 cancer types. For efficient and accurate inference on gene regulatory rules across a large number and variety of datasets, we developed an algorithm, RABIT (regression analysis with background integration). In each tumor sample, RABIT tests whether the TF target genes from ChIP-seq show strong differential regulation after controlling for background effect from copy number alteration and DNA methylation. When multiple ChIP-seq profiles are available for a TF, RABIT prioritizes the most relevant ChIP-seq profile in each tumor. In each cancer type, RABIT further tests whether the TF expression and somatic mutation variations are correlated with differential expression patterns of its target genes across tumors. Our predicted TF impact on tumor gene expression is highly consistent with the knowledge from cancer-related gene databases and reveals many previously unidentified aspects of transcriptional regulation in tumor progression. We also applied RABIT on RNA-binding protein motifs and found that some alternative splicing factors could affect tumor-specific gene expression by binding to target gene 3'UTR regions. Thus, RABIT (rabit.dfci.harvard.edu) is a general platform for predicting the oncogenic role of gene expression regulators.
16
Wei Y. Integrative analyses of cancer data: a review from a statistical perspective. Cancer Inform 2015; 14:173-81. [PMID: 26041968 PMCID: PMC4435444 DOI: 10.4137/cin.s17303] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 11/18/2014] [Revised: 02/01/2015] [Accepted: 02/09/2015] [Indexed: 12/17/2022] Open
Abstract
It has become increasingly common for large-scale public data repositories and clinical settings to have multiple types of data, including high-dimensional genomics, epigenomics, and proteomics data as well as survival data, measured simultaneously for the same group of biological samples, which provides unprecedented opportunities to understand cancer mechanisms from a more comprehensive scope and to develop new cancer therapies. Nevertheless, how to turn this wealth of data into biologically and clinically meaningful information remains very challenging. In this paper, I review recent developments in statistics for integrative analyses of cancer data. Topics cover meta-analysis of homogeneous types of data across multiple studies, integrating multiple heterogeneous genomic data types, survival analysis with high- or ultrahigh-dimensional genomic profiles, and cross-data-type prediction where both predictors and responses are high- or ultrahigh-dimensional vectors. I compare existing statistical methods and comment on potential future research problems.
Affiliation(s)
- Yingying Wei
- Department of Statistics, The Chinese University of Hong Kong, Shatin, Hong Kong