1
Chang C, Dai Z, Oh J, Long Q. Integrative Learning of Structured High-Dimensional Data from Multiple Datasets. Stat Anal Data Min 2023; 16:120-134. [PMID: 37213790 PMCID: PMC10195070 DOI: 10.1002/sam.11601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2021] [Accepted: 10/14/2022] [Indexed: 11/11/2022]
Abstract
Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.
Affiliation(s)
- Changgee Chang
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
- Zongyu Dai
  - School of Arts and Science, University of Pennsylvania, Pennsylvania, U.S.A
- Jihwan Oh
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
- Qi Long
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A

2
Ma X, Kundu S. Multi-task Learning with High-Dimensional Noisy Images. J Am Stat Assoc 2022; 119:650-663. [PMID: 38660581 PMCID: PMC11035991 DOI: 10.1080/01621459.2022.2140052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2021] [Accepted: 10/17/2022] [Indexed: 10/31/2022]
Abstract
Recent medical imaging studies have given rise to distinct but inter-related datasets corresponding to multiple experimental tasks or longitudinal visits. Standard scalar-on-image regression models that fit each dataset separately are not equipped to leverage information across inter-related images, and existing multi-task learning approaches are compromised by the inability to account for the noise that is often observed in images. We propose a novel joint scalar-on-image regression framework involving wavelet-based image representations with grouped penalties that are designed to pool information across inter-related images for joint learning, and which explicitly accounts for noise in high-dimensional images via a projection-based approach. In the presence of non-convexity arising due to noisy images, we derive non-asymptotic error bounds under non-convex as well as convex grouped penalties, even when the number of voxels increases exponentially with sample size. A projected gradient descent algorithm is used for computation, which is shown to approximate the optimal solution via well-defined non-asymptotic optimization error bounds under noisy images. Extensive simulations and application to a motivating longitudinal Alzheimer's disease study illustrate significantly improved predictive ability and greater power to detect true signals that are simply missed by existing methods without noise correction due to the attenuation-to-null phenomenon.
Affiliation(s)
- Xin Ma
  - Department of Biostatistics and Bioinformatics, Emory University
- Suprateek Kundu
  - Department of Biostatistics, The University of Texas at MD Anderson Cancer Center

3
Huang HH, Rao H, Miao R, Liang Y. A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression. BMC Bioinformatics 2022; 23:353. [PMID: 35999505 PMCID: PMC9396780 DOI: 10.1186/s12859-022-04887-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/08/2022] [Accepted: 08/10/2022] [Indexed: 12/22/2022]
Abstract
Background: Gene expression analysis can provide useful information for analyzing complex biological mechanisms. However, many reported findings are unrepeatable due to small sample sizes relative to a large number of genes and the low signal-to-noise ratios of most gene expression datasets. Results: Meta-analysis of multi-data sets is an efficient method for tackling the above problem. To improve the performance of meta-analysis, we propose a novel meta-analysis framework. It consists of two parts: (1) a novel data augmentation strategy. Various cross-platform normalization methods exist, which can preserve original biological information of gene expression datasets from different angles and add different “perturbations” to the dataset. Using such perturbation, we provide a feasible means for gene expression data augmentation; (2) elastic data shared lasso (DSL-$L_2$). The DSL-$L_2$ method spans the continuum between individual models for each dataset and one model for all datasets. It also overcomes the shortcomings of the data shared lasso method when dealing with highly correlated features. Comprehensive simulation experiment results show that the proposed method has high prediction and gene selection performance. We then apply the proposed method to non-small cell lung cancer (NSCLC) blood gene expression data in order to identify key tumor-related genes. The outcomes of our experiment indicate that the method could be used for identifying a set of robust disease-related gene signatures that may be used for NSCLC early diagnosis or prognosis or even targeting. Conclusion: We propose a novel and effective meta-analysis method for biological research, extrapolating and integrating information from multiple gene expression datasets.
Affiliation(s)
- Hai-Hui Huang
  - Provincial Demonstration Software Institute, Shaoguan University, Shaoguan, China
- Hao Rao
  - Provincial Demonstration Software Institute, Shaoguan University, Shaoguan, China
- Rui Miao
  - Faculty of Information Technology, Macau University of Science and Technology, Macau, China
- Yong Liang
  - The Peng Cheng Laboratory, Shenzhen, China

4
Hu Z, Zhou Y, Tong T. Meta-Analyzing Multiple Omics Data With Robust Variable Selection. Front Genet 2021; 12:656826. [PMID: 34290735 PMCID: PMC8288516 DOI: 10.3389/fgene.2021.656826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2021] [Accepted: 05/24/2021] [Indexed: 12/03/2022]
Abstract
High-throughput omics data are becoming more and more popular in various areas of science. Given that many publicly available datasets address the same questions, researchers have applied meta-analysis to synthesize multiple datasets to achieve more reliable results for model estimation and prediction. Due to the high dimensionality of omics data, it is also desirable to incorporate variable selection into meta-analysis. Existing variable selection methods for meta-analysis are often sensitive to the presence of outliers, and may lead to missed detections of relevant covariates, especially for lasso-type penalties. In this paper, we develop a robust variable selection algorithm for meta-analyzing high-dimensional datasets based on logistic regression. We first search for an outlier-free subset of each dataset by borrowing information across the datasets, through repeated use of the least trimmed squared estimates for the logistic model together with a hierarchical bi-level variable selection technique. We then refine a reweighting step to further improve the efficiency after obtaining a reliable non-outlier subset. Simulation studies and real data analysis show that our new method can provide more reliable results than the existing meta-analysis methods in the presence of outliers.
Affiliation(s)
- Zongliang Hu
  - College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Yan Zhou
  - College of Mathematics and Statistics, Shenzhen University, Shenzhen, China
- Tiejun Tong
  - Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong

5
Chen H, He Y, Ji J, Shi Y. The sparse group lasso for high-dimensional integrative linear discriminant analysis with application to Alzheimer's disease prediction. J Stat Comput Simul 2020. [DOI: 10.1080/00949655.2020.1800011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
Affiliation(s)
- Hao Chen
  - School of Statistics, Shandong University of Finance and Economics, Jinan, People's Republic of China
- Yong He
  - Institute for Financial Studies, Shandong University, Jinan, People's Republic of China
- Jiadong Ji
  - School of Statistics, Shandong University of Finance and Economics, Jinan, People's Republic of China
- Yufeng Shi
  - School of Statistics, Shandong University of Finance and Economics, Jinan, People's Republic of China
  - Institute for Financial Studies, Shandong University, Jinan, People's Republic of China

6
Cui J, Shu J. Circulating microRNA trafficking and regulation: computational principles and practice. Brief Bioinform 2020; 21:1313-1326. [PMID: 31504144 PMCID: PMC7412956 DOI: 10.1093/bib/bbz079] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/18/2019] [Revised: 06/07/2019] [Accepted: 06/07/2019] [Indexed: 01/18/2023]
Abstract
Rapid advances in genomics discovery tools and a growing realization of microRNA's implication in intercellular communication have led to a proliferation of studies of circulating microRNA sorting and regulation across cells and different species. Although they have sometimes reached controversial scientific discoveries and conclusions, these studies have yielded new insights into the functional roles of circulating microRNA and a plethora of analytical methods and tools. Here, we consider this body of work in light of key computational principles underpinning discovery of circulating microRNAs in terms of their sorting and targeting, with the goal of providing practical guidance for applications focused on the design and analysis of circulating microRNAs and their context-dependent regulation. We survey a broad range of informatics methods and tools that are available to the researcher, discuss their key features, applications and various unsolved problems, and close this review with prospects and broader implications of this field.
Affiliation(s)
- Juan Cui
  - Systems Biology and Biomedical Informatics Laboratory, Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
- Jiang Shu
  - Systems Biology and Biomedical Informatics Laboratory, Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA

7
Zhang H, Li SJ, Zhang H, Yang ZY, Ren YQ, Xia LY, Liang Y. Meta-Analysis Based on Nonconvex Regularization. Sci Rep 2020; 10:5755. [PMID: 32238826 PMCID: PMC7113298 DOI: 10.1038/s41598-020-62473-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/22/2019] [Accepted: 03/06/2020] [Indexed: 01/10/2023]
Abstract
The widespread application of high-throughput sequencing technology has produced a large number of publicly available gene expression datasets. However, because gene expression datasets are characterized by small sample sizes, high dimensionality and high noise, applying biostatistics and machine learning methods to them is a challenging task; one symptom is the low reproducibility of important biomarkers across studies. Meta-analysis is an effective approach to deal with these problems, but the current methods have some limitations. In this paper, we propose meta-analysis based on three nonconvex regularization methods: L1/2 regularization (meta-Half), Minimax Concave Penalty regularization (meta-MCP) and Smoothly Clipped Absolute Deviation regularization (meta-SCAD). The three nonconvex regularization methods are effective approaches for variable selection developed in recent years. Through the hierarchical decomposition of coefficients, our methods not only maintain the flexibility of variable selection and improve the efficiency of selecting important biomarkers, but also summarize and synthesize scientific evidence from multiple studies to consider the relationship between different datasets. We give efficient algorithms and theoretical properties for our methods. Furthermore, we apply our methods to simulated data and three publicly available lung cancer gene expression datasets, and compare the performance with state-of-the-art methods. Our methods have good performance in simulation studies, and the analysis results on the three publicly available lung cancer gene expression datasets are clinically meaningful. Our methods can also be extended to other areas where datasets are heterogeneous.
Affiliation(s)
- Hui Zhang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau
- Shou-Jiang Li
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau
- Hai Zhang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau
  - School of Mathematics, Northwest University, 710127, Xi'an, China
- Zi-Yi Yang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau
- Yan-Qiong Ren
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau
- Liang-Yong Xia
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau
- Yong Liang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau

8
Xia Y, Li L, Lockhart SN, Jagust WJ. Simultaneous Covariance Inference for Multimodal Integrative Analysis. J Am Stat Assoc 2020; 115:1279-1291. [PMID: 33867602 DOI: 10.1080/01621459.2019.1623040] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
Abstract
Multimodal integrative analysis fuses different types of data collected on the same set of experimental subjects. It is becoming a norm in many branches of scientific research, such as multi-omics and multimodal neuroimaging analysis. In this article, we address the problem of simultaneous covariance inference of associations between multiple modalities, which is of vital interest in multimodal integrative analysis. Recognizing that there are few readily available solutions in the literature for this type of problem, we develop a new simultaneous testing procedure. It provides an explicit quantification of statistical significance, a much improved detection power, as well as a rigid false discovery control. Our proposal makes novel and useful contributions from both the scientific perspective and the statistical methodological perspective. We demonstrate the efficacy of the new method through both simulations and a multimodal positron emission tomography study of associations between two hallmark pathological proteins of Alzheimer's disease.
Affiliation(s)
- Yin Xia
  - Department of Statistics, School of Management, Fudan University, Shanghai, China
- Lexin Li
  - Department of Biostatistics and Epidemiology, Helen Wills Neuroscience Institute, University of California at Berkeley, Berkeley, CA
- Samuel N Lockhart
  - Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC
- William J Jagust
  - Lawrence Berkeley National Laboratory and School of Public Health, Helen Wills Neuroscience Institute, University of California at Berkeley, Berkeley, CA

9
Rashid NU, Li Q, Yeh JJ, Ibrahim JG. Modeling Between-Study Heterogeneity for Improved Replicability in Gene Signature Selection and Clinical Prediction. J Am Stat Assoc 2019; 115:1125-1138. [PMID: 33012902 DOI: 10.1080/01621459.2019.1671197] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 02/08/2023]
Abstract
In the genomic era, the identification of gene signatures associated with disease is of significant interest. Such signatures are often used to predict clinical outcomes in new patients and aid clinical decision-making. However, recent studies have shown that gene signatures are often not replicable. This occurrence has practical implications regarding the generalizability and clinical applicability of such signatures. To improve replicability, we introduce a novel approach to select gene signatures from multiple datasets whose effects are consistently non-zero and account for between-study heterogeneity. We build our model upon some rank-based quantities, facilitating integration over different genomic datasets. A high dimensional penalized Generalized Linear Mixed Model (pGLMM) is used to select gene signatures and address data heterogeneity. We compare our method to some commonly used strategies that select gene signatures ignoring between-study heterogeneity. We provide asymptotic results justifying the performance of our method and demonstrate its advantage in the presence of heterogeneity through thorough simulation studies. Lastly, we motivate our method through a case study subtyping pancreatic cancer patients from four gene expression studies.
Affiliation(s)
- Naim U Rashid
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
- Quefeng Li
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
- Jen Jen Yeh
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
  - Department of Surgery, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A
- Joseph G Ibrahim
  - Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, U.S.A

10
Yang ZY, Liu XY, Shu J, Zhang H, Ren YQ, Xu ZB, Liang Y. Multi-view based integrative analysis of gene expression data for identifying biomarkers. Sci Rep 2019; 9:13504. [PMID: 31534156 PMCID: PMC6751173 DOI: 10.1038/s41598-019-49967-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/08/2019] [Accepted: 08/30/2019] [Indexed: 01/05/2023]
Abstract
The widespread application of microarray technology has produced a vast quantity of publicly available gene expression datasets. However, analysis of gene expression data using biostatistics and machine learning approaches is a challenging task due to (1) high noise; (2) small sample size with high dimensionality; (3) batch effects and (4) low reproducibility of significant biomarkers. These issues reveal the complexity of gene expression data, thus significantly obstructing microarray technology in clinical applications. Integrative analysis offers an opportunity to address these issues and provides a more comprehensive understanding of the biological systems, but current methods have several limitations. This work leverages state-of-the-art machine learning developments for the integration of multiple gene expression datasets, classification, and identification of significant biomarkers. We design a novel integrative framework, MVIAm - Multi-View based Integrative Analysis of microarray data for identifying biomarkers. It applies multiple cross-platform normalization methods to aggregate multiple datasets into a multi-view dataset and utilizes a robust learning mechanism, Multi-View Self-Paced Learning (MVSPL), for gene selection in cancer classification problems. We demonstrate the capabilities of MVIAm using simulated data and studies of breast cancer and lung cancer; it can be applied flexibly and is an effective tool for facing the four challenges of gene expression data analysis. Our proposed model makes microarray integrative analysis more systematic and expands its range of applications.
Affiliation(s)
- Zi-Yi Yang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau, China
- Xiao-Ying Liu
  - Computer Engineering Technical College, Guangdong Polytechnic of Science and Technology, Zhuhai, 519090, China
- Jun Shu
  - School of Mathematics and Statistics & Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, 710049, China
- Hui Zhang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau, China
- Yan-Qiong Ren
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau, China
- Zong-Ben Xu
  - School of Mathematics and Statistics & Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, 710049, China
- Yong Liang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Taipa, 999078, Macau, China

11
Huo Z, Song C, Tseng G. Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals. Ann Appl Stat 2019; 13:340-366. [PMID: 31007807 PMCID: PMC6472949 DOI: 10.1214/18-aoas1188] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/12/2023]
Abstract
Due to the rapid development of high-throughput experimental techniques and fast-dropping prices, many transcriptomic datasets have been generated and accumulated in the public domain. Meta-analysis combining multiple transcriptomic studies can increase the statistical power to detect disease-related biomarkers. In this paper, we introduce a Bayesian latent hierarchical model to perform transcriptomic meta-analysis. This method is capable of detecting genes that are differentially expressed (DE) in only a subset of the combined studies, and the latent variables help quantify homogeneous and heterogeneous differential expression signals across studies. A tight clustering algorithm is applied to detected biomarkers to capture differential meta-patterns that are informative to guide further biological investigation. Simulations and three examples, including a microarray dataset from metabolism-related knockout mice, an RNA-seq dataset from HIV transgenic rats, and cross-platform datasets from human breast cancer, are used to demonstrate the performance of the proposed method.
Affiliation(s)
- Zhiguang Huo
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611
- Chi Song
  - Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH 43210
- George Tseng
  - Department of Biostatistics, Human Genetics and Computational Biology, University of Pittsburgh, Pittsburgh, PA 15261

12
Long NP, Park S, Anh NH, Nghi TD, Yoon SJ, Park JH, Lim J, Kwon SW. High-Throughput Omics and Statistical Learning Integration for the Discovery and Validation of Novel Diagnostic Signatures in Colorectal Cancer. Int J Mol Sci 2019; 20:E296. [PMID: 30642095 PMCID: PMC6358915 DOI: 10.3390/ijms20020296] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/30/2018] [Revised: 12/31/2018] [Accepted: 01/04/2019] [Indexed: 02/07/2023]
Abstract
The advancement of bioinformatics and machine learning has facilitated the discovery and validation of omics-based biomarkers. This study employed a novel approach combining multi-platform transcriptomics and cutting-edge algorithms to introduce novel signatures for accurate diagnosis of colorectal cancer (CRC). Different random forests (RF)-based feature selection methods including the area under the curve (AUC)-RF, Boruta, and Vita were used and the diagnostic performance of the proposed biosignatures was benchmarked using RF, logistic regression, naïve Bayes, and k-nearest neighbors models. All models showed satisfactory performance in which RF appeared to be the best. For instance, regarding the RF model, the following were observed: mean accuracy 0.998 (standard deviation (SD) < 0.003), mean specificity 0.999 (SD < 0.003), and mean sensitivity 0.998 (SD < 0.004). Moreover, proposed biomarker signatures were highly associated with multifaceted hallmarks in cancer. Some biomarkers were found to be enriched in epithelial cell signaling in Helicobacter pylori infection and inflammatory processes. The overexpression of TGFBI and S100A2 was associated with poor disease-free survival while the down-regulation of NR5A2, SLC4A4, and CD177 was linked to worse overall survival of the patients. In conclusion, novel transcriptome signatures to improve the diagnostic accuracy in CRC are introduced for further validations in various clinical settings.
Affiliation(s)
- Nguyen Phuoc Long
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea
- Seongoh Park
  - Department of Statistics, Seoul National University, Seoul 08826, Korea
- Nguyen Hoang Anh
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea
- Tran Diem Nghi
  - School of Medicine, Vietnam National University, Ho Chi Minh 70000, Vietnam
- Sang Jun Yoon
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea
- Jeong Hill Park
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea
- Johan Lim
  - Department of Statistics, Seoul National University, Seoul 08826, Korea
- Sung Won Kwon
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul 08826, Korea

13
Zhang K, Geng W, Zhang S. Network-based logistic regression integration method for biomarker identification. BMC Syst Biol 2018; 12:135. [PMID: 30598085 PMCID: PMC6311907 DOI: 10.1186/s12918-018-0657-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 01/21/2023]
Abstract
Background: Many mathematical and statistical models and algorithms have been proposed for biomarker identification in recent years. However, the biomarkers inferred from different datasets suffer from a lack of reproducibility due to the heterogeneity of the data generated from different platforms or laboratories. This motivates us to develop robust biomarker identification methods by integrating multiple datasets. Methods: In this paper, we developed an integrative method for classification based on logistic regression. Different constant terms are set in the logistic regression model to measure the heterogeneity of the samples. By minimizing the differences of the constant terms within the same dataset, both the homogeneity within the same dataset and the heterogeneity in multiple datasets can be kept. The model is formulated as an optimization problem with a network penalty measuring the differences of the constant terms. The L1 penalty, elastic penalty and network related penalties are added to the objective function for the biomarker discovery purpose. Algorithms based on the proximal Newton method are proposed to solve the optimization problem. Results: We first applied the proposed method to the simulated datasets. Both the AUC of the prediction and the biomarker identification accuracy are improved. We then applied the method to two breast cancer gene expression datasets. By integrating both datasets, the prediction AUC is improved over directly merging the datasets and over MetaLasso, and it is comparable to the best AUC obtained when doing biomarker identification in an individual dataset. The biomarkers identified using the network related penalty for variables were further analyzed. Meaningful subnetworks enriched by breast cancer were identified. Conclusion: A network-based integrative logistic regression model is proposed in the paper. It improves both the prediction and biomarker identification accuracy.
Electronic supplementary material: The online version of this article (10.1186/s12918-018-0657-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ke Zhang
- School of Mathematical Sciences, Fudan University, No.220 Handan Road, Shanghai, 200433, China
- Wei Geng
- School of Mathematical Sciences, Fudan University, No.220 Handan Road, Shanghai, 200433, China
- Shuqin Zhang
- Center for Computational Systems Biology, Shanghai Key Laboratory for Contemporary Applied Mathematics, School of Mathematical Sciences, Fudan University, No.220 Handan Road, Shanghai, 200433, China.
|
14
|
Li Q, Li L. Integrative linear discriminant analysis with guaranteed error rate improvement. Biometrika 2018; 105:917-930. [PMID: 31762476 DOI: 10.1093/biomet/asy047] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 01/18/2023]
Abstract
Multiple types of data measured on a common set of subjects arise in many areas. Numerous empirical studies have found that integrative analysis of such data can result in better statistical performance in terms of prediction and feature selection. However, the advantages of integrative analysis have mostly been demonstrated empirically. In the context of two-class classification, we propose an integrative linear discriminant analysis method and establish a theoretical guarantee that it achieves a smaller classification error than running linear discriminant analysis on each data type individually. We address the issues of outliers and missing values, frequently encountered in integrative analysis, and illustrate our method through simulations and a neuroimaging study of Alzheimer's disease.
Affiliation(s)
- Quefeng Li
- Department of Biostatistics, University of North Carolina at Chapel Hill, 3105D McGavran-Greenberg Hall, Chapel Hill, North Carolina 27599, U.S.A
- Lexin Li
- Division of Biostatistics, University of California at Berkeley, 50 University Hall 7360, Berkeley, California 94720, U.S.A
|
15
|
Shu J, Silva BVRE, Gao T, Xu Z, Cui J. Dynamic and Modularized MicroRNA Regulation and Its Implication in Human Cancers. Sci Rep 2017; 7:13356. [PMID: 29042600 PMCID: PMC5645395 DOI: 10.1038/s41598-017-13470-5] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 02/08/2017] [Accepted: 09/26/2017] [Indexed: 12/19/2022]
Abstract
MicroRNA is responsible for the fine-tuning of fundamental cellular activities and human disease development. The altered availability of microRNAs, target mRNAs, and other types of endogenous RNAs competing for microRNA interactions reflects the dynamic and conditional property of microRNA-mediated gene regulation that remains under-investigated. Here we propose a new integrative method to study this dynamic process by considering both competing and cooperative mechanisms and identifying functional modules where different microRNAs co-regulate the same functional process. Specifically, a new pipeline was built based on a meta-Lasso regression model and the proof-of-concept study was performed using a large-scale genomic dataset from ~4,200 patients with 9 cancer types. In the analysis, 10,726 microRNA-mRNA interactions were identified to be associated with a specific stage and/or type of cancer, which demonstrated the dynamic and conditional miRNA regulation during cancer progression. On the other hand, we detected 4,134 regulatory modules that exhibit high fidelity of microRNA function through selective microRNA-mRNA binding and modulation. For example, miR-18a-3p, -320a, -193b-3p, and -92b-3p co-regulate the glycolysis/gluconeogenesis and focal adhesion in cancers of kidney, liver, lung, and uterus. Furthermore, several new insights into dynamic microRNA regulation in cancers have been discovered in this study.
Affiliation(s)
- Jiang Shu
- Systems Biology and Biomedical Informatics (SBBI) Laboratory, Department of Computer Science and Engineering, Lincoln, NE, 68588, USA
- Bruno Vieira Resende E Silva
- Systems Biology and Biomedical Informatics (SBBI) Laboratory, Department of Computer Science and Engineering, Lincoln, NE, 68588, USA
- Tian Gao
- Systems Biology and Biomedical Informatics (SBBI) Laboratory, Department of Computer Science and Engineering, Lincoln, NE, 68588, USA
- Zheng Xu
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
- Quantitative Life Sciences Initiative, University of Nebraska-Lincoln, Lincoln, NE, 68588, USA
- Juan Cui
- Systems Biology and Biomedical Informatics (SBBI) Laboratory, Department of Computer Science and Engineering, Lincoln, NE, 68588, USA.
|
16
|
Li Q, Yu M, Wang S. A Statistical Framework for Pathway and Gene Identification from Integrative Analysis. J Multivariate Anal 2017; 156:1-17. [PMID: 28943673 PMCID: PMC5606168 DOI: 10.1016/j.jmva.2016.12.005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
Abstract
In the era of big data, integrative analyses that pool data from different sources are now extensively conducted in order to improve performance. Among many interesting applications, genomics research is an area where integrative methods have become popular tools to identify prognostic biomarkers for various diseases. In this paper, we propose such a framework for pathway and gene identification. Our method employs a hierarchical decomposition on genes' effects followed by a proper regularization to identify important pathways and genes across multiple studies. Asymptotic theories are provided to show that our method is both pathway and gene selection consistent. More importantly, we explicitly show that pathway selection consistency needs milder statistical conditions than gene selection consistency, as it would allow false positives and negatives at the gene selection level. Finite-sample performance of our method is shown to be superior to that of other ad hoc methods in various simulation studies. We further apply our method to analyze five cardiovascular disease studies. Our method is intrinsically a general method on group-wise and element-wise selections from integrative analysis, which can have other applications beyond genomic research.
Affiliation(s)
- Quefeng Li
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27517, USA. Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709, USA
- Menggang Yu
- Department of Biostatistics and Medical Informatics, University of Wisconsin at Madison, Madison, WI 53792, USA
- Sijian Wang
- Department of Biostatistics and Medical Informatics, University of Wisconsin at Madison, Madison, WI 53792, USA
|
17
|
Kim S, Jhong JH, Lee J, Koo JY. Meta-analytic support vector machine for integrating multiple omics data. BioData Min 2017; 10:2. [PMID: 28149325 PMCID: PMC5270233 DOI: 10.1186/s13040-017-0126-8] [Citation(s) in RCA: 76] [Impact Index Per Article: 10.9] [Received: 08/01/2016] [Accepted: 01/11/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Of late, high-throughput microarray and sequencing data have been extensively used to monitor biomarkers and biological processes related to many diseases. In this setting, the support vector machine (SVM) has been widely and successfully used for gene selection in many applications. Despite the benefits of SVMs, analysis of a single small- or mid-sized dataset inevitably runs into the problem of low reproducibility and statistical power. To address this problem, we propose a meta-analytic support vector machine (Meta-SVM) that can accommodate multiple omics data, making it possible to detect consensus genes associated with diseases across studies. RESULTS: Experimental studies show that the Meta-SVM is superior to the existing meta-analysis method in detecting true signal genes. In real data applications, the method was applied to diverse omics data of breast cancer (TCGA) and mRNA expression data of lung disease (idiopathic pulmonary fibrosis; IPF). As a result, we identified gene sets consistently associated with the diseases across studies. In particular, the ascertained gene set of TCGA omics data was found to be significantly enriched in the ABC transporters pathways, well known as critical for the breast cancer mechanism. CONCLUSION: The Meta-SVM effectively achieves the purpose of meta-analysis by jointly leveraging multiple omics data, and facilitates identifying potential biomarkers and elucidating the disease process.
Affiliation(s)
- SungHwan Kim
- Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea; Department of Statistics, Keimyung University, Dalseoku, Daegu, 42601 South Korea
- Jae-Hwan Jhong
- Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
- JungJun Lee
- Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
- Ja-Yong Koo
- Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
|
18
|
Richardson S, Tseng GC, Sun W. Statistical Methods in Integrative Genomics. Annual Review of Statistics and Its Application 2016; 3:181-209. [PMID: 27482531 PMCID: PMC4963036 DOI: 10.1146/annurev-statistics-041715-033506] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Indexed: 05/28/2023]
Abstract
Statistical methods in integrative genomics aim to answer important biology questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions.
Affiliation(s)
- Sylvia Richardson
- MRC Biostatistics Unit, Cambridge Institute of Public Health, University of Cambridge, CB2 0SR, United Kingdom
- George C. Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261
- Wei Sun
- Department of Biostatistics, Department of Genetics, University of North Carolina, Chapel Hill, NC 27599
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 27516
|