1. Muller E, Shiryan I, Borenstein E. Multi-omic integration of microbiome data for identifying disease-associated modules. Nat Commun 2024; 15:2621. [PMID: 38521774 PMCID: PMC10960825 DOI: 10.1038/s41467-024-46888-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Accepted: 03/08/2024] [Indexed: 03/25/2024] Open
Abstract
Multi-omic studies of the human gut microbiome are crucial for understanding its role in disease across multiple functional layers. Nevertheless, integrating and analyzing such complex datasets poses significant challenges. Most notably, current analysis methods often yield extensive lists of disease-associated features (e.g., species, pathways, or metabolites), without capturing the multi-layered structure of the data. Here, we address this challenge by introducing "MintTea", an intermediate integration-based approach combining canonical correlation analysis extensions, consensus analysis, and an evaluation protocol. MintTea identifies "disease-associated multi-omic modules", comprising features from multiple omics that shift in concord and that collectively associate with the disease. Applied to diverse cohorts, MintTea captures modules with high predictive power, significant cross-omic correlations, and alignment with known microbiome-disease associations. For example, analyzing samples from a metabolic syndrome study, MintTea identifies a module with serum glutamate- and TCA cycle-related metabolites, along with bacterial species linked to insulin resistance. In another dataset, MintTea identifies a module associated with late-stage colorectal cancer, including Peptostreptococcus and Gemella species and fecal amino acids, in line with these species' metabolic activity and their coordinated gradual increase with cancer development. This work demonstrates the potential of advanced integration methods in generating systems-level, multifaceted hypotheses underlying microbiome-disease interactions.
Affiliation(s)
- Efrat Muller
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Itamar Shiryan
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Elhanan Borenstein
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
  - Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
  - Santa Fe Institute, Santa Fe, NM, USA
2. Das S, West FD, Park C. Sparse multiway canonical correlation analysis for multimodal stroke recovery data. Biom J 2024; 66:e2300037. [PMID: 38368275 DOI: 10.1002/bimj.202300037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 10/09/2023] [Accepted: 10/26/2023] [Indexed: 02/19/2024]
Abstract
Conventional canonical correlation analysis (CCA) measures the association between two datasets and identifies relevant contributors. However, it encounters issues with execution and interpretation when the sample size is smaller than the number of variables or there are more than two datasets. Our motivating example is a stroke-related clinical study on pigs. The data are multimodal and consist of measurements taken at multiple time points and have many more variables than observations. This study aims to uncover important biomarkers and stroke recovery patterns based on physiological changes. To address the issues in the data, we develop two sparse CCA methods for multiple datasets. Various simulated examples are used to illustrate and contrast the performance of the proposed methods with that of the existing methods. In analyzing the pig stroke data, we apply the proposed sparse CCA methods along with dimension reduction techniques, interpret the recovery patterns, and identify influential variables in recovery.
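For orientation, the conventional two-dataset CCA that this abstract starts from can be sketched in a few lines of NumPy. This is a minimal illustration of classical CCA only, not the sparse multiway methods the authors propose; the function name `cca_first_pair` and the small ridge term `reg` are assumptions introduced here for numerical stability.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """Return the first canonical weight vectors (a, b) and their canonical correlation.

    Classical CCA via whitening + SVD. `reg` is a tiny ridge term (an assumption of
    this sketch) that keeps the covariance matrices invertible.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # Cross-covariance of the whitened blocks: M = Lx^{-1} Sxy Ly^{-T}
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Map the leading singular vectors back to the original coordinates
    a = np.linalg.solve(Lx.T, U[:, 0])
    b = np.linalg.solve(Ly.T, Vt[0])
    return a, b, s[0]
```

When the number of variables exceeds the number of observations, `Sxx` and `Syy` are singular without regularization, which is the execution problem the abstract refers to and one motivation for sparse variants.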
Affiliation(s)
- Subham Das
  - Department of Statistics, University of Georgia, Athens, Georgia, USA
- Franklin D West
  - Department of Animal & Dairy Science, University of Georgia, Athens, Georgia, USA
- Cheolwoo Park
  - Department of Mathematical Sciences, KAIST, Daejeon, South Korea
3. Fang K, Li J, Zhang Q, Xu Y, Ma S. Pathological imaging-assisted cancer gene-environment interaction analysis. Biometrics 2023; 79:3883-3894. [PMID: 37132273 PMCID: PMC10622332 DOI: 10.1111/biom.13873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2022] [Accepted: 04/26/2023] [Indexed: 05/04/2023]
Abstract
Gene-environment (G-E) interactions have important implications for cancer outcomes and phenotypes beyond the main G and E effects. Compared to main-effect-only analysis, G-E interaction analysis suffers more seriously from a lack of information caused by higher dimensionality, weaker signals, and other factors. It is also uniquely challenged by the "main effects, interactions" variable selection hierarchy. Efforts have been made to bring in additional information to assist cancer G-E interaction analysis. In this study, we take a strategy different from the existing literature and borrow information from pathological imaging data. Such data are a "byproduct" of biopsy, enjoy broad availability and low cost, and have been shown to be informative for modeling prognosis and other cancer outcomes/phenotypes in recent studies. Building on penalization, we develop an assisted estimation and variable selection approach for G-E interaction analysis. The approach is intuitive, can be effectively realized, and has competitive performance in simulation. We further analyze The Cancer Genome Atlas (TCGA) data on lung adenocarcinoma (LUAD). The outcome of interest is overall survival, and for G variables, we analyze gene expressions. Assisted by pathological imaging data, our G-E interaction analysis leads to different findings with competitive prediction performance and stability.
Affiliation(s)
- Kuangnan Fang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, China
- Jingmao Li
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, China
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, China
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Yaqing Xu
  - School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, USA
4. Orlichenko A, Qu G, Zhang G, Patel B, Wilson TW, Stephen JM, Calhoun VD, Wang YP. Latent Similarity Identifies Important Functional Connections for Phenotype Prediction. IEEE Trans Biomed Eng 2023; 70:1979-1989. [PMID: 37015625 PMCID: PMC10284019 DOI: 10.1109/tbme.2022.3232964] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/31/2022]
Abstract
OBJECTIVE: Endophenotypes such as brain age and fluid intelligence are important biomarkers of disease status. However, brain imaging studies to identify these biomarkers often encounter limited numbers of subjects but high dimensional imaging features, hindering reproducibility. Therefore, we develop an interpretable, multivariate classification/regression algorithm, called Latent Similarity (LatSim), suitable for small sample size but high feature dimension datasets.
METHODS: LatSim combines metric learning with a kernel similarity function and softmax aggregation to identify task-related similarities between subjects. Inter-subject similarity is utilized to improve performance on three prediction tasks using multi-paradigm fMRI data. A greedy selection algorithm, made possible by LatSim's computational efficiency, is developed as an interpretability method.
RESULTS: LatSim achieved significantly higher predictive accuracy at small sample sizes on the Philadelphia Neurodevelopmental Cohort (PNC) dataset. Connections identified by LatSim gave superior discriminative power compared to those identified by other methods. We identified 4 functional brain networks enriched in connections for predicting brain age, sex, and intelligence.
CONCLUSION: We find that most information for a predictive task comes from only a few (1-5) connections. Additionally, we find that the default mode network is over-represented in the top connections of all predictive tasks.
SIGNIFICANCE: We propose a novel prediction algorithm for small sample, high feature dimension datasets and use it to identify connections in task fMRI data. Our work can lead to new insights in both algorithm design and neuroscience research.
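The combination of a kernel similarity with softmax aggregation described in the methods can be pictured with a rough sketch. This is not the LatSim implementation: the projection matrix `A` would be learned in a metric-learning step that is omitted here, and the inner-product kernel, the `temperature` parameter, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(s, axis=-1):
    """Numerically stable softmax along the given axis."""
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_predict(X_train, y_train, X_query, A, temperature=1.0):
    """Predict each query subject's outcome as a softmax-weighted mean of training outcomes.

    A is a (features x latent_dim) projection, assumed to be given; in a real
    metric-learning method it would be fitted to the prediction task.
    """
    Z_train = X_train @ A                      # latent representation of training subjects
    Z_query = X_query @ A                      # latent representation of query subjects
    sim = Z_query @ Z_train.T / temperature    # inner-product kernel between subjects
    weights = softmax(sim, axis=1)             # each row sums to 1 over training subjects
    return weights @ y_train
```

Because the weights in each row are non-negative and sum to one, every prediction is a convex combination of the training outcomes, which is one reason such a model stays well behaved when subjects are few and features are many.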
5. Zhong T, Zhang Q, Huang J, Wu M, Ma S. Heterogeneity analysis via integrating multi-sources high-dimensional data with applications to cancer studies. Stat Sin 2023; 33:729-758. [PMID: 38037567 PMCID: PMC10686523 DOI: 10.5705/ss.202021.0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2023]
Abstract
This study has been motivated by cancer research, in which heterogeneity analysis plays an important role and can be roughly classified as unsupervised or supervised. In supervised heterogeneity analysis, the finite mixture of regression (FMR) technique is used extensively, under which the covariates affect the response differently in subgroups. High-dimensional molecular and, very recently, histopathological imaging features have been analyzed separately and shown to be effective for heterogeneity analysis. For simpler analysis, they have been shown to contain overlapping, but also independent information. In this article, our goal is to conduct the first and more effective FMR-based cancer heterogeneity analysis by integrating high-dimensional molecular and histopathological imaging features. A penalization approach is developed to regularize estimation, select relevant variables, and, equally importantly, promote the identification of independent information. Consistency properties are rigorously established. An effective computational algorithm is developed. A simulation and an analysis of The Cancer Genome Atlas (TCGA) lung cancer data demonstrate the practical effectiveness of the proposed approach. Overall, this study provides a practical and useful new way of conducting supervised cancer heterogeneity analysis.
Affiliation(s)
- Tingyan Zhong
  - SJTU-Yale Joint Center for Biostatistics, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Fujian, China
- Jian Huang
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Kowloon, Hong Kong
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT 06520-0834, USA
6. Song X, Li R, Wang K, Bai Y, Xiao Y, Wang YP. Joint Sparse Collaborative Regression on Imaging Genetics Study of Schizophrenia. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1137-1146. [PMID: 35503837 PMCID: PMC10321021 DOI: 10.1109/tcbb.2022.3172289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The imaging genetics approach generates a large amount of high-dimensional and multi-modal data, providing complementary information for a comprehensive study of schizophrenia, a complex mental disease. However, at the same time, the variety of these data in structures, resolutions, and formats makes their integrative study a forbidding task. In this paper, we propose a novel model called Joint Sparse Collaborative Regression (JSCoReg), which can extract class-specific features from different health conditions/disease classes. We first evaluate the performance of feature selection in terms of the receiver operating characteristic (ROC) curve and the area under the ROC curve in the simulation experiment. We demonstrate that the JSCoReg model can achieve higher accuracy compared with similar models including Joint Sparse Canonical Correlation Analysis and Sparse Collaborative Regression. We then applied the JSCoReg model to the analysis of a schizophrenia dataset collected from the Mind Clinical Imaging Consortium. JSCoReg enables us to better identify biomarkers associated with schizophrenia, which are verified to be both biologically and statistically significant.
Affiliation(s)
- Xueli Song
  - School of Sciences, Chang’an University, Xi’an, 710064, China
- Rongpeng Li
  - School of Sciences, Chang’an University, Xi’an, 710064, China
- Kaiming Wang
  - School of Sciences, Chang’an University, Xi’an, 710064, China
- Yuntong Bai
  - Biomedical Engineering Department, Tulane University, New Orleans, LA 70118, USA
- Yuzhu Xiao
  - School of Sciences, Chang’an University, Xi’an, 710064, China
- Yu-ping Wang
  - Biomedical Engineering Department, Tulane University, New Orleans, LA 70118, USA
7. Zhang Y, Gaynanova I. Joint association and classification analysis of multi-view data. Biometrics 2022; 78:1614-1625. [PMID: 34343342 DOI: 10.1111/biom.13536] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/25/2020] [Revised: 07/20/2021] [Accepted: 07/28/2021] [Indexed: 12/30/2022]
Abstract
Multi-view data, which are matched sets of measurements on the same subjects, have become increasingly common with advances in multi-omics technology. Often, it is of interest to find associations between the views that are related to the intrinsic class memberships. Existing association methods cannot directly incorporate class information, while existing classification methods do not take into account between-views associations. In this work, we propose a framework for Joint Association and Classification Analysis of multi-view data (JACA). Our goal is not merely to improve the misclassification rates, but to provide a latent representation of high-dimensional data that is both relevant for the subtype discrimination and coherent across the views. We motivate the methodology by establishing a connection between canonical correlation analysis and discriminant analysis. We also establish the estimation consistency of JACA in high-dimensional settings. A distinct advantage of JACA is that it can be applied to multi-view data with block-missing structure, that is, to cases where a subset of views or class labels is missing for some subjects. The application of JACA to quantify the associations between RNAseq and miRNA views with respect to consensus molecular subtypes in colorectal cancer data from The Cancer Genome Atlas project leads to improved misclassification rates and stronger associations compared to existing methods.
Affiliation(s)
- Yunfeng Zhang
  - Department of Statistics, Texas A&M University, College Station, Texas, USA
- Irina Gaynanova
  - Department of Statistics, Texas A&M University, College Station, Texas, USA
8. Xu Y, Wu M, Ma S. Multidimensional molecular measurements-environment interaction analysis for disease outcomes. Biometrics 2022; 78:1542-1554. [PMID: 34213006 PMCID: PMC9366385 DOI: 10.1111/biom.13526] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/06/2020] [Revised: 02/27/2021] [Accepted: 06/28/2021] [Indexed: 12/30/2022]
Abstract
Multiple types of molecular (genetic, genomic, epigenetic, etc.) measurements, environmental risk factors, and their interactions have been found to contribute to the outcomes and phenotypes of complex diseases. In each of the previous studies, only the interactions between one type of molecular measurement and environmental risk factors have been analyzed. In recent biomedical studies, multidimensional profiling, in which data from multiple types of molecular measurements are collected from the same subjects, is becoming popular. A myriad of recent studies have shown that collectively analyzing multiple types of molecular measurements is not only biologically sensible but also leads to improved estimation and prediction. In this study, we conduct an M-E interaction analysis, with M standing for multidimensional molecular measurements and E standing for environmental risk factors. This can accommodate multiple types of molecular measurements and sufficiently account for their overlapping as well as independent information. Extensive simulation shows that it outperforms several closely related alternatives. In the analysis of TCGA (The Cancer Genome Atlas) data on lung adenocarcinoma and cutaneous melanoma, we make some stable biological findings and achieve stable prediction.
Affiliation(s)
- Yaqing Xu
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
9. Palzer EF, Wendt CH, Bowler RP, Hersh CP, Safo SE, Lock EF. sJIVE: Supervised Joint and Individual Variation Explained. Comput Stat Data Anal 2022; 175:107547. [PMID: 36119152 PMCID: PMC9481062 DOI: 10.1016/j.csda.2022.107547] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/03/2022]
Abstract
Analyzing multi-source data, which are multiple views of data on the same subjects, has become increasingly common in molecular biomedical research. Recent methods have sought to uncover underlying structure and relationships within and/or between the data sources, and other methods have sought to build a predictive model for an outcome using all sources. However, existing methods that do both are presently limited because they either (1) only consider data structure shared by all datasets while ignoring structures unique to each source, or (2) they extract underlying structures first without consideration to the outcome. The proposed method, supervised joint and individual variation explained (sJIVE), can simultaneously (1) identify shared (joint) and source-specific (individual) underlying structure and (2) build a linear prediction model for an outcome using these structures. These two components are weighted to compromise between explaining variation in the multi-source data and in the outcome. Simulations show sJIVE to outperform existing methods when large amounts of noise are present in the multi-source data. An application to data from the COPDGene study explores gene expression and proteomic patterns associated with lung function.
Affiliation(s)
- Elise F. Palzer
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455, USA
- Christine H. Wendt
  - Division of Pulmonary, Allergy and Critical Care, University of Minnesota, Minneapolis, 55455, USA
- Russell P. Bowler
  - Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, National Jewish Health, Denver, CO, USA
- Craig P. Hersh
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- Sandra E. Safo
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455, USA
- Eric F. Lock
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455, USA
10.
Abstract
We propose a method for supervised learning with multiple sets of features ("views"). The multiview problem is especially important in biology and medicine, where "-omics" data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. "Cooperative learning" combines the usual squared-error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
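The squared-error loss plus "agreement" penalty can be made concrete for two views and plain linear predictors. The sketch below assumes the objective 0.5*||y - X bx - Z bz||^2 + 0.5*rho*||X bx - Z bz||^2 and leaves out the lasso penalty, intercepts, and cross-validation of rho mentioned in the abstract; the function name and the augmented least-squares formulation are choices made for this illustration.

```python
import numpy as np

def cooperative_least_squares(X, Z, y, rho):
    """Two-view linear fit with an agreement penalty (no lasso term in this sketch).

    Minimises 0.5*||y - X bx - Z bz||^2 + 0.5*rho*||X bx - Z bz||^2
    by stacking the penalty as extra rows of an ordinary least-squares problem.
    """
    n = X.shape[0]
    r = np.sqrt(rho)
    # Top rows: the usual fit. Bottom rows: sqrt(rho) * (Z bz - X bx) should be near 0.
    A = np.vstack([np.hstack([X, Z]),
                   np.hstack([-r * X, r * Z])])
    target = np.concatenate([y, np.zeros(n)])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef[:X.shape[1]], coef[X.shape[1]:]
```

With `rho = 0` the extra rows vanish and the fit reduces to least squares on the concatenated features, i.e. early fusion; larger values of `rho` pull the per-view predictions `X @ bx` and `Z @ bz` toward each other.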
11. Kawaguchi ES, Li S, Weaver GM, Lewinger JP. Hierarchical Ridge Regression for Incorporating Prior Information in Genomic Studies. Journal of Data Science 2022; 20:34-50. [PMID: 36274755 PMCID: PMC9581069 DOI: 10.6339/21-jds1030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/03/2023]
Abstract
There is a great deal of prior knowledge about gene function and regulation in the form of annotations or prior results that, if directly integrated into individual prognostic or diagnostic studies, could improve predictive performance. For example, in a study to develop a predictive model for cancer survival based on gene expression, effect sizes from previous studies or the grouping of genes based on pathways constitute such prior knowledge. However, this external information is typically only used post-analysis to aid in the interpretation of any findings. We propose a new hierarchical two-level ridge regression model that can integrate external information in the form of "meta features" to predict an outcome. We show that the model can be fit efficiently using cyclic coordinate descent by recasting the problem as a single-level regression model. In a simulation-based evaluation we show that the proposed method outperforms standard ridge regression and competing methods that integrate prior information, in terms of prediction performance when the meta features are informative on the mean of the features, and that there is no loss in performance when the meta features are uninformative. We demonstrate our approach with applications to the prediction of chronological age based on methylation features and breast cancer mortality based on gene expression features.
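The recasting of a two-level ridge model as a single-level regression can be illustrated as follows. The sketch assumes the objective ||y - X b||^2 + lam1*||b - Z a||^2 + lam2*||a||^2, where `Z` holds the meta features of each predictor; it solves the recast problem in closed form instead of the cyclic coordinate descent used by the authors, and it omits intercepts and standardization. All names are hypothetical.

```python
import numpy as np

def hierarchical_ridge(X, y, Z, lam1, lam2):
    """Two-level ridge: feature effects b are shrunk toward Z @ a, and a is shrunk toward 0.

    Substituting g = b - Z a gives X b = X g + (X Z) a, so the problem becomes an
    ordinary ridge regression on the augmented design [X, X Z] with penalty lam1
    on g and lam2 on a.
    """
    p, q = Z.shape
    A = np.hstack([X, X @ Z])                                   # augmented single-level design
    D = np.diag(np.concatenate([np.full(p, lam1), np.full(q, lam2)]))
    theta = np.linalg.solve(A.T @ A + D, A.T @ y)               # closed-form ridge solution
    g, a = theta[:p], theta[p:]
    return g + Z @ a, a                                         # (feature effects b, meta-feature effects a)
```

If the meta features carry no signal, `a` is shrunk toward zero and the estimate approaches standard ridge regression, consistent with the abstract's remark that uninformative meta features cost no performance.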
Affiliation(s)
- Eric S. Kawaguchi (corresponding author)
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA
- Sisi Li
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA
- Garrett M. Weaver
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA
- Juan Pablo Lewinger
  - Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, USA
12. Cho MH, Kurtek S, Bharath K. Tangent functional canonical correlation analysis for densities and shapes, with applications to multimodal imaging data. J Multivariate Anal 2021; 189. [PMID: 35601473 PMCID: PMC9122284 DOI: 10.1016/j.jmva.2021.104870] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
It is quite common for functional data arising from imaging data to assume values in infinite-dimensional manifolds. Uncovering associations between two or more such nonlinear functional data extracted from the same object across medical imaging modalities can assist development of personalized treatment strategies. We propose a method for canonical correlation analysis between paired probability densities or shapes of closed planar curves, routinely used in biomedical studies, which combines a convenient linearization and dimension reduction of the data using tangent space coordinates. Leveraging the fact that the corresponding manifolds are submanifolds of unit Hilbert spheres, we describe how finite-dimensional representations of the functional data objects can be easily computed, which then facilitates use of standard multivariate canonical correlation analysis methods. We further construct and visualize canonical variate directions directly on the space of densities or shapes. Utility of the method is demonstrated through numerical simulations and performance on a magnetic resonance imaging dataset of glioblastoma multiforme brain tumors.
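The linearization step can be illustrated for probability densities sampled on a uniform grid. Under the square-root transform a density becomes a point on the unit Hilbert sphere, and the inverse exponential (log) map sends it to a tangent vector at a reference point, after which ordinary multivariate CCA applies. The sketch below assumes a simple Riemann-sum inner product and takes the reference point `mu` as given; the function names are hypothetical and the authors' dimension-reduction step is not shown.

```python
import numpy as np

def sqrt_density(f, dx):
    """Square-root transform of a sampled density, normalised to unit L2 norm on the grid."""
    f = np.clip(f, 0.0, None)
    f = f / (f.sum() * dx)          # make the density integrate to 1 (Riemann sum)
    return np.sqrt(f)

def log_map(mu, psi, dx):
    """Inverse exponential map on the unit sphere: tangent vector at mu pointing toward psi."""
    inner = np.clip(np.sum(mu * psi) * dx, -1.0, 1.0)
    theta = np.arccos(inner)        # geodesic distance between mu and psi
    if theta < 1e-12:
        return np.zeros_like(psi)
    return theta / np.sin(theta) * (psi - inner * mu)
```

The resulting tangent vector is orthogonal to `mu` and its norm equals the geodesic distance `theta`, so distances from the reference point are preserved by the linearization.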
Affiliation(s)
- Min Ho Cho
  - Department of Applied and Computational Mathematics and Statistics, The University of Notre Dame
- Sebastian Kurtek (corresponding author)
  - Department of Statistics, The Ohio State University
- Karthik Bharath
  - School of Mathematical Sciences, The University of Nottingham
13. Yi H, Ma S. Assisted differential network analysis for gene expression data. Genet Epidemiol 2021; 45:604-620. [PMID: 34174112 PMCID: PMC8376770 DOI: 10.1002/gepi.22419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2021] [Revised: 05/10/2021] [Accepted: 05/17/2021] [Indexed: 11/12/2022]
Abstract
In the analysis of gene expression data, when there are two or more disease conditions/groups (e.g., diseased and normal, responder and nonresponder, and multiple stages/subtypes), differential analysis has been extensively conducted to identify key differences and has important implications. Network analysis takes a system perspective and can be more informative than analysis limited to simple statistics such as mean and variance. In differential network analysis, a common practice is to first estimate a gene expression network for each condition/group, and then spectral clustering can be applied to the network difference(s) to identify key genes and biological mechanisms that lead to the differences. Compared to "simple" analysis such as regression, differential network analysis can be more challenging because of the significantly larger number of parameters. In this study, taking advantage of the increasing popularity of multidimensional profiling data, we develop an assisted analysis strategy and propose incorporating regulator information to improve the identification of key genes (that lead to the differences in gene expression networks). An effective computational algorithm is developed. Comprehensive simulation is conducted, showing that the proposed approach can outperform the benchmark alternatives in identification accuracy. With The Cancer Genome Atlas lung adenocarcinoma data, we analyze the expressions of genes in the KEGG cell cycle pathway, assisted by copy number variation data. The proposed assisted analysis leads to identification results similar to the alternatives but different estimations. Overall, this study can deliver an efficient and cost-effective way of improving differential network analysis.
Affiliation(s)
- Huangdi Yi
  - Department of Biostatistics, Yale University
- Shuangge Ma
  - Department of Biostatistics, Yale University
14. Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022] Open
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi
  - Department of Biostatistics at Yale University
- Shuangge Ma
  - Department of Biostatistics at Yale University
15. Bai Y, Gong Y, Bai J, Liu J, Deng HW, Calhoun V, Wang YP. A Joint Analysis of Multi-Paradigm fMRI Data With Its Application to Cognitive Study. IEEE Transactions on Medical Imaging 2021; 40:951-962. [PMID: 33284749 PMCID: PMC7925383 DOI: 10.1109/tmi.2020.3042786] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 05/06/2023]
Abstract
With the development of neuroimaging techniques, a growing amount of multi-modal brain imaging data is being collected, facilitating comprehensive study of the brain. In this paper, we jointly analyzed functional magnetic resonance imaging (fMRI) data collected under different paradigms in order to understand cognitive behaviors of an individual. To this end, we proposed a novel multi-view learning algorithm called structure-enforced collaborative regression (SCoRe) to extract co-expressed discriminative brain regions under the guidance of the anatomical structure of the brain. An advantage of SCoRe over its predecessor collaborative regression (CoRe) lies in its incorporation of group structures in the brain imaging data, which makes the model biologically more meaningful. Results from real data analysis have confirmed that by incorporating prior knowledge of brain structure, SCoRe can deliver better prediction performance and is less sensitive to hyper-parameters than CoRe. After validation with simulation experiments, we applied SCoRe to fMRI data collected from the Philadelphia Neurodevelopmental Cohort and adopted the scores from the wide range achievement test (WRAT) to evaluate an individual's cognitive skills. We located 14 relevant brain regions that can efficiently predict WRAT scores, and these brain regions were further confirmed by other independent studies.

16
Du Y, Fan K, Lu X, Wu C. Integrating Multi–Omics Data for Gene-Environment Interactions. BioTech 2021; 10:biotech10010003. [PMID: 35822775 PMCID: PMC9245467 DOI: 10.3390/biotech10010003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/24/2020] [Revised: 01/22/2021] [Accepted: 01/22/2021] [Indexed: 01/05/2023] Open
Abstract
Gene-environment (G×E) interaction is critical for understanding the genetic basis of complex disease beyond genetic and environmental main effects. In addition to existing tools for interaction studies, penalized variable selection has emerged as a promising alternative for dissecting G×E interactions. Despite this success, variable selection is limited in its ability to account for multidimensional measurements. Published variable selection methods cannot accommodate structured sparsity in the framework of integrating multi-omics data for disease outcomes. In this paper, we have developed a novel variable selection method to integrate multi-omics measurements in G×E interaction studies. Extensive studies have already revealed that analyzing omics data across multiple platforms is not only biologically sensible, but also results in improved identification and prediction performance. Our integrative model can efficiently pinpoint important regulators of gene expression through sparse dimensionality reduction, and link the disease outcomes to multiple effects in integrative G×E studies by accommodating a sparse bi-level structure. The simulation studies show that the integrative model leads to better identification of G×E interactions and regulators than alternative methods. In two G×E lung cancer studies with high-dimensional multi-omics data, the integrative model leads to improved prediction and to findings with important biological implications.

17
Abstract
In recent biomedical studies, multidimensional profiling, which collects proteomics as well as other types of omics data on the same subjects, is getting increasingly popular. Proteomics, transcriptomics, genomics, epigenomics, and other types of data contain overlapping as well as independent information, which suggests the possibility of integrating multiple types of data to generate more reliable findings/models with better classification/prediction performance. In this chapter, a selective review is conducted on recent data integration techniques for both unsupervised and supervised analysis. The main objective is to provide the "big picture" of data integration that involves proteomics data and discuss the "intuition" beneath the recently developed approaches without invoking too many mathematical details. Potential pitfalls and possible directions for future developments are also discussed.
Affiliation(s)
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yu Jiang
- School of Public Health, University of Memphis, Memphis, TN, USA
- Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, USA.

18
Identification of early liver toxicity gene biomarkers using comparative supervised machine learning. Sci Rep 2020; 10:19128. [PMID: 33154507 PMCID: PMC7645727 DOI: 10.1038/s41598-020-76129-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/12/2020] [Accepted: 10/12/2020] [Indexed: 02/08/2023] Open
Abstract
Screening agrochemicals and pharmaceuticals for potential liver toxicity is required for regulatory approval and is an expensive and time-consuming process. The identification and utilization of early exposure gene signatures and robust predictive models in regulatory toxicity testing has the potential to reduce time and costs substantially. In this study, comparative supervised machine learning approaches were applied to the rat liver TG-GATEs dataset to develop feature selection and predictive testing. We identified ten gene biomarkers using three different feature selection methods that predicted liver necrosis with high specificity and selectivity in an independent validation dataset from the Microarray Quality Control (MAQC)-II study. Nine of the ten genes that were selected with the supervised methods are involved in metabolism and detoxification (Car3, Crat, Cyp39a1, Dcd, Lbp, Scly, Slc23a1, and Tkfc) and transcriptional regulation (Ablim3). Several of these genes are also implicated in liver carcinogenesis, including Crat, Car3 and Slc23a1. Our biomarker gene signature provides high statistical accuracy and a manageable number of genes to study as indicators to potentially accelerate toxicity testing based on their ability to induce liver necrosis and, eventually, liver cancer.

19
Li J, Lu Q, Wen Y. Multi-kernel linear mixed model with adaptive lasso for prediction analysis on high-dimensional multi-omics data. Bioinformatics 2020; 36:1785-1794. [PMID: 31693075 DOI: 10.1093/bioinformatics/btz822] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/10/2019] [Revised: 10/08/2019] [Accepted: 11/01/2019] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION: The use of human genome discoveries and other established factors to build an accurate risk prediction model is an essential step toward precision medicine. While multi-layer high-dimensional omics data provide unprecedented data resources for prediction studies, their corresponding analytical methods are much less developed. RESULTS: We present a multi-kernel penalized linear mixed model with adaptive lasso (MKpLMM), a predictive modeling framework that extends the standard linear mixed models widely used in genomic risk prediction to multi-omics data analysis. MKpLMM can capture not only the predictive effects from each layer of omics data but also their interactions by using multiple kernel functions. It adopts a data-driven approach to select predictive regions as well as predictive layers of omics data, and achieves robust selection performance. Through extensive simulation studies, the analyses of PET-imaging outcomes from the Alzheimer's Disease Neuroimaging Initiative study, and the analyses of 64 drug responses, we demonstrate that MKpLMM consistently outperforms competing methods in phenotype prediction. AVAILABILITY AND IMPLEMENTATION: The R-package is available at https://github.com/YaluWen/OmicPred. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Li
- Department of Thoracic Surgery, Dalian Municipal Central Hospital Affiliated of Dalian Medical University, Dalian 116000, China
- Qing Lu
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI 48824, USA
- Yalu Wen
- Department of Statistics, University of Auckland, Auckland 1010, New Zealand

20
Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, Machiraju R, Mathé EA. Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources. Metabolites 2020; 10:E202. [PMID: 32429287 PMCID: PMC7281435 DOI: 10.3390/metabo10050202] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 04/15/2020] [Revised: 05/07/2020] [Accepted: 05/13/2020] [Indexed: 02/06/2023] Open
Abstract
As researchers are increasingly able to collect data on a large scale from multiple clinical and omics modalities, multi-omics integration is becoming a critical component of metabolomics research. This introduces a need for increased understanding by the metabolomics researcher of computational and statistical analysis methods relevant to multi-omics studies. In this review, we discuss common types of analyses performed in multi-omics studies and the computational and statistical methods that can be used for each type of analysis. We pinpoint the caveats and considerations for analysis methods, including required parameters, sample size and data distribution requirements, sources of a priori knowledge, and techniques for the evaluation of model accuracy. Finally, for the types of analyses discussed, we provide examples of the applications of corresponding methods to clinical and basic research. We intend that our review may be used as a guide for metabolomics researchers to choose effective techniques for multi-omics analyses relevant to their field of study.
Affiliation(s)
- Tara Eicher
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
- Garrett Kinnebrew
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA
- Bioinformatics Shared Resource Group, The Ohio State University, Columbus, OH 43210, USA
- Andrew Patt
- Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA
- Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
- Kyle Spencer
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Biomedical Sciences Graduate Program, The Ohio State University, Columbus, OH 43210, USA
- Nationwide Children’s Research Hospital, Columbus, OH 43210, USA
- Kevin Ying
- Comprehensive Cancer Center, The Ohio State University and James Cancer Hospital, Columbus, OH 43210, USA
- Molecular, Cellular and Developmental Biology Program, The Ohio State University, Columbus, OH 43210, USA
- Qin Ma
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Raghu Machiraju
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Computer Science and Engineering Department, The Ohio State University College of Engineering, Columbus, OH 43210, USA
- Department of Pathology, Wexner Medical Center, The Ohio State University, Columbus, OH 43210, USA
- Translational Data Analytics Institute, The Ohio State University, Columbus, OH 43210, USA
- Ewy A. Mathé
- Biomedical Informatics Department, The Ohio State University College of Medicine, Columbus, OH 43210, USA
- Division of Preclinical Innovation, National Center for Advancing Translational Sciences, NIH, 9800 Medical Center Dr., Rockville, MD, 20892, USA

21
Wang HT, Smallwood J, Mourao-Miranda J, Xia CH, Satterthwaite TD, Bassett DS, Bzdok D. Finding the needle in a high-dimensional haystack: Canonical correlation analysis for neuroscientists. Neuroimage 2020; 216:116745. [PMID: 32278095 DOI: 10.1016/j.neuroimage.2020.116745] [Citation(s) in RCA: 117] [Impact Index Per Article: 29.3] [Received: 12/02/2019] [Revised: 02/12/2020] [Accepted: 03/12/2020] [Indexed: 12/12/2022] Open
Abstract
The 21st century marks the emergence of "big data" with a rapid increase in the availability of datasets with multiple measurements. In neuroscience, brain-imaging datasets are more commonly accompanied by dozens or hundreds of phenotypic subject descriptors on the behavioral, neural, and genomic level. The complexity of such "big data" repositories offers new opportunities and poses new challenges for systems neuroscience. Canonical correlation analysis (CCA) is a prototypical family of methods that is useful in identifying the links between variable sets from different modalities. Importantly, CCA is well suited to describing relationships across multiple sets of data, such as in recently available big biomedical datasets. Our primer discusses the rationale, promises, and pitfalls of CCA.
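As a rough illustration of what CCA computes (this is not code from the primer), the sketch below finds the leading pair of canonical weight vectors for two variable sets by whitening each set and taking a singular value decomposition of the whitened cross-covariance. The function name `first_canonical_pair`, the small ridge term `reg`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def first_canonical_pair(X, Y, reg=1e-8):
    """Return (rho, a, b): leading canonical correlation and weight vectors.

    X and Y are (samples x variables) arrays measured on the same samples.
    `reg` is a tiny ridge added for numerical stability (an assumption of
    this sketch, not part of classical CCA).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Cholesky factors: Sxx = Lx Lx^T and Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # Whitened cross-covariance M = Lx^{-1} Sxy Ly^{-T}.
    A = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, A.T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Map the leading singular vectors back to the original variables.
    a = np.linalg.solve(Lx.T, U[:, 0])
    b = np.linalg.solve(Ly.T, Vt[0])
    return s[0], a, b


# Synthetic example: two "modalities" that share one latent signal z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
X = np.column_stack([z + 0.5 * rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n),
                     z + 0.5 * rng.normal(size=n)])
rho, a, b = first_canonical_pair(X, Y)
print(f"leading canonical correlation: {rho:.3f}")
```

In this toy setting the largest weights should fall on the first column of `X` and the second column of `Y`, the two variables that carry the shared signal. With many more variables than samples, this plain estimate overfits, which is one reason regularized and sparse variants are used in practice.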
Affiliation(s)
- Hao-Ting Wang
- Department of Psychology, University of York, Heslington, York, United Kingdom; Sackler Center for Consciousness Science, University of Sussex, Brighton, United Kingdom.
- Jonathan Smallwood
- Department of Psychology, University of York, Heslington, York, United Kingdom
- Janaina Mourao-Miranda
- Centre for Medical Image Computing, Department of Computer Science, University College London, London, United Kingdom; Max Planck University College London Centre for Computational Psychiatry and Ageing Research, University College London, London, United Kingdom
- Cedric Huchuan Xia
- Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Theodore D Satterthwaite
- Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Danielle S Bassett
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, 19104, USA; Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA, 19104, USA; Department of Neurology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA; Department of Physics & Astronomy, School of Arts & Sciences, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Danilo Bzdok
- Department of Psychiatry, Psychotherapy and Psychosomatics, RWTH Aachen University, Germany; JARA-BRAIN, Jülich-Aachen Research Alliance, Germany; Parietal Team, INRIA, Neurospin, Bat 145, CEA Saclay, 91191, Gif-sur-Yvette, France; Department of Biomedical Engineering, Montreal Neurological Institute, Faculty of Medicine, McGill University, Montreal, Canada; Mila - Quebec Artificial Intelligence Institute, Canada.

22
Bai Y, Pascal Z, Hu W, Calhoun VD, Wang YP. Biomarker Identification Through Integrating fMRI and Epigenetics. IEEE Trans Biomed Eng 2019; 67:1186-1196. [PMID: 31395533 DOI: 10.1109/tbme.2019.2932895] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/09/2022]
Abstract
OBJECTIVE: Integration of multiple datasets is a hot topic in many fields. When studying complex mental disorders, great effort has been dedicated to fusing genetic and brain imaging data. However, an increasing number of studies have pointed out the importance of epigenetic factors in the cause of psychiatric diseases. In this study, we endeavor to fill the gap by combining epigenetics (e.g., DNA methylation) with imaging data (e.g., fMRI) to identify biomarkers for schizophrenia (SZ). METHODS: We propose to combine linear regression with canonical correlation analysis (CCA) in a relaxed yet coupled manner to extract discriminative features for SZ that are co-expressed in the fMRI and DNA methylation data. RESULT: After validation through simulations, we applied our method to real imaging epigenetics data of 184 subjects from the Mental Illness and Neuroscience Discovery Clinical Imaging Consortium. After significance testing, we identified 14 brain regions and 44 cytosine-phosphate-guanine (CpG) sites. Average classification accuracy is [Formula: see text]. By linking the CpG sites to genes, we identified the pathways Guanosine ribonucleotides de novo biosynthesis and Guanosine nucleotides de novo biosynthesis, and the GO term Perikaryon. CONCLUSION: This imaging epigenetics study has identified both brain regions and genes that are associated with neuron development and memory processing. These biomarkers contribute to a good understanding of the mechanism underlying SZ but were overlooked by previous imaging genetics studies. SIGNIFICANCE: Our study sheds light on the understanding and diagnosis of SZ with an imaging epigenetics approach, which is demonstrated to be effective in extracting novel biomarkers associated with SZ.

23
Hu W, Cai B, Zhang A, Calhoun VD, Wang YP. Deep Collaborative Learning With Application to the Study of Multimodal Brain Development. IEEE Trans Biomed Eng 2019; 66:3346-3359. [PMID: 30872216 DOI: 10.1109/tbme.2019.2904301] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 01/26/2023]
Abstract
OBJECTIVE: Multi-modal functional magnetic resonance imaging has been widely used for brain research. Conventional data-fusion methods cannot capture complex relationships (e.g., nonlinear predictive relationships) between multiple data sets. This paper aims to develop a neural network framework to extract phenotype-related cross-data relationships and use them to study brain development. METHODS: We propose a novel method, deep collaborative learning (DCL), to address the limitations of existing methods. DCL first uses a deep network to represent the original data and then seeks their correlations, while also linking the data representation with phenotypical information. RESULTS: We studied the differences in functional connectivity (FC) between different age groups and also used FC as a fingerprint to predict cognitive abilities. Our experiments demonstrated higher accuracy with DCL than with other conventional models when classifying populations of different ages and cognitive scores. Moreover, DCL revealed that brain connections became stronger during adolescence. Furthermore, DCL detected strong correlations between the default mode network and other networks that were overlooked by linear canonical correlation analysis, demonstrating DCL's ability to detect nonlinear correlations. CONCLUSION: The results verified the superiority of DCL over conventional data-fusion methods. In addition, the stronger brain connections demonstrated the importance of adolescence for brain development. SIGNIFICANCE: DCL can better combine complex correlations between multiple data sets in addition to their fitting to phenotypes, with the potential to overcome the limitations of several current data-fusion models.

24
Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A Selective Review of Multi-Level Omics Data Integration Using Variable Selection. High Throughput 2019; 8:E4. [PMID: 30669303 PMCID: PMC6473252 DOI: 10.3390/ht8010004] [Citation(s) in RCA: 114] [Impact Index Per Article: 22.8] [Received: 11/22/2018] [Revised: 12/24/2018] [Accepted: 01/10/2019] [Indexed: 01/02/2023] Open
Abstract
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview on variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strength and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. The computation aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Affiliation(s)
- Cen Wu
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Fei Zhou
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Jie Ren
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Xiaoxi Li
- Department of Statistics, Kansas State University, Manhattan, KS 66506, USA.
- Yu Jiang
- Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN 38152, USA.
- Shuangge Ma
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT 06510, USA.

25
Zille P, Calhoun VD, Wang YP. Enforcing Co-Expression Within a Brain-Imaging Genomics Regression Framework. IEEE Trans Med Imaging 2018; 37:2561-2571. [PMID: 28678703 PMCID: PMC6415768 DOI: 10.1109/tmi.2017.2721301] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 05/20/2023]
Abstract
Among the challenges arising in brain imaging genetic studies, estimating the potential links between neurological and genetic variability within a population is key. In this paper, we propose a multivariate, multimodal formulation for variable selection that leverages co-expression patterns across various data modalities. Our approach is based on an intuitive combination of two widely used statistical models: sparse regression and canonical correlation analysis (CCA). While the former seeks multivariate linear relationships between a given phenotype and associated observations, the latter searches to extract co-expression patterns between sets of variables belonging to different modalities. In the following, we propose to rely on a "CCA-type" formulation in order to regularize the classical multimodal sparse regression problem (essentially incorporating both CCA and regression models within a unified formulation). The underlying motivation is to extract discriminative variables that are also co-expressed across modalities. We first show that the simplest formulation of such model can be expressed as a special case of collaborative learning methods. After discussing its limitation, we propose an extended, more flexible formulation, and introduce a simple and efficient alternating minimization algorithm to solve the associated optimization problem. We explore the parameter space and provide some guidelines regarding parameter selection. Both the original and extended versions are then compared on a simple toy data set and a more advanced simulated imaging genomics data set in order to illustrate the benefits of the latter. Finally, we validate the proposed formulation using single nucleotide polymorphisms data and functional magnetic resonance imaging data from a population of adolescents ( subjects, age 16.9 ± 1.9 years from the Philadelphia Neurodevelopmental Cohort) for the study of learning ability. 
Furthermore, we carry out a significance analysis of the resulting features that allow us to carefully extract brain regions and genes linked to learning and cognitive ability.

26
de Cheveigné A, Di Liberto GM, Arzounian D, Wong DDE, Hjortkjær J, Fuglsang S, Parra LC. Multiway canonical correlation analysis of brain data. Neuroimage 2018; 186:728-740. [PMID: 30496819 DOI: 10.1016/j.neuroimage.2018.11.026] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 06/12/2018] [Revised: 10/11/2018] [Accepted: 11/16/2018] [Indexed: 01/12/2023] Open
Abstract
Brain data recorded with electroencephalography (EEG), magnetoencephalography (MEG) and related techniques often have poor signal-to-noise ratios due to the presence of multiple competing sources and artifacts. A common remedy is to average responses over repeats of the same stimulus, but this is not applicable for temporally extended stimuli that are presented only once (speech, music, movies, natural sound). An alternative is to average responses over multiple subjects that were presented with identical stimuli, but differences in geometry of brain sources and sensors reduce the effectiveness of this solution. Multiway canonical correlation analysis (MCCA) brings a solution to this problem by allowing data from multiple subjects to be fused in such a way as to extract components common to all. This paper reviews the method, offers application examples that illustrate its effectiveness, and outlines the caveats and risks entailed by the method.
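The abstract does not spell out the algorithm, so the following is only a sketch of one common way to realize a multiway CCA: whiten each data set with PCA, concatenate the whitened components, and apply a second PCA. Components shared by all data sets then show up with variance close to the number of data sets. The helper names `whiten` and `mcca`, the tolerance, and the synthetic "subjects" are assumptions for illustration, not the authors' code.

```python
import numpy as np


def whiten(X, tol=1e-10):
    """Return an orthonormal basis (samples x rank) for the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, s > tol * s[0]]


def mcca(datasets):
    """Two-stage PCA sketch of multiway CCA.

    Returns (components, variances). A variance close to len(datasets)
    indicates a component present in every data set; a variance close to 1
    indicates a component specific to a single data set.
    """
    W = np.hstack([whiten(X) for X in datasets])
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return U, s ** 2


# Synthetic example: three "subjects", each with 4 channels that mix the
# same latent source z with subject-specific noise.
rng = np.random.default_rng(1)
T = 1000
z = rng.normal(size=T)
mixing = np.array([1.0, 0.5, -0.5, 0.25])
subjects = [np.outer(z, mixing) + 0.2 * rng.normal(size=(T, 4))
            for _ in range(3)]
components, variances = mcca(subjects)
print("top variances:", np.round(variances[:3], 2))
```

The first column of `components` can be read as a denoised estimate of the shared source, up to sign and scale.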
Affiliation(s)
- Alain de Cheveigné
- Laboratoire des Systèmes Perceptifs, UMR 8248, CNRS, France; Département d'Etudes Cognitives, Ecole Normale Supérieure, PSL University, Paris, France; UCL Ear Institute, London, United Kingdom.
- Giovanni M Di Liberto
- Laboratoire des Systèmes Perceptifs, UMR 8248, CNRS, France; Département d'Etudes Cognitives, Ecole Normale Supérieure, PSL University, Paris, France
- Dorothée Arzounian
- Laboratoire des Systèmes Perceptifs, UMR 8248, CNRS, France; Département d'Etudes Cognitives, Ecole Normale Supérieure, PSL University, Paris, France
- Daniel D E Wong
- Laboratoire des Systèmes Perceptifs, UMR 8248, CNRS, France; Département d'Etudes Cognitives, Ecole Normale Supérieure, PSL University, Paris, France
- Jens Hjortkjær
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Denmark; Danish Research Centre for Magnetic Resonance, Centre for Functional and Diagnostic Imaging and Research, Copenhagen University Hospital Hvidovre, Denmark
- Søren Fuglsang
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Denmark

27
Li Y, Bie R, Teran Hidalgo SJ, Qin Y, Wu M, Ma S. Assisted gene expression-based clustering with AWNCut. Stat Med 2018; 37:4386-4403. [PMID: 30094873 DOI: 10.1002/sim.7928] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/13/2018] [Revised: 05/15/2018] [Accepted: 07/05/2018] [Indexed: 01/06/2023]
Abstract
In research on complex diseases, gene expression (GE) data have been extensively used for clustering samples. The clusters so generated can serve as the basis for disease subtype identification, risk stratification, and many other purposes. With the small sample sizes of genetic profiling studies and the noisy nature of GE data, clustering analysis results are often unsatisfactory. In the most recent studies, a prominent trend is to conduct multidimensional profiling, which collects data on GEs and their regulators (copy number alterations, microRNAs, methylation, etc.) on the same subjects. Because of the regulation relationships, regulators contain important information on the properties of GEs. We develop a novel assisted clustering method, which effectively uses regulator information to improve clustering analysis using GE data. To account for the fact that not all GEs are informative, we propose a weighted strategy, where the weights are determined data-dependently and can discriminate informative GEs from noise. The proposed method is built on the NCut technique and effectively realized using a simulated annealing algorithm. Simulations demonstrate that it can outperform multiple direct competitors. In the analysis of TCGA cutaneous melanoma and lung adenocarcinoma data, biologically sensible findings different from those of the alternatives are made.
Affiliation(s)
- Yang Li
- Center for Applied Statistics, Renmin University of China, Beijing, China; School of Statistics, Renmin University of China, Beijing, China
- Ruofan Bie
- School of Statistics, Renmin University of China, Beijing, China
- Yichen Qin
- Department of Operations, Business Analytics, and Information Systems, University of Cincinnati, Cincinnati, Ohio
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China; Department of Biostatistics, Yale University, New Haven, Connecticut
- Shuangge Ma
- School of Statistics, Renmin University of China, Beijing, China; Department of Biostatistics, Yale University, New Haven, Connecticut

28
Kawaguchi A, Yamashita F. Supervised multiblock sparse multivariable analysis with application to multimodal brain imaging genetics. Biostatistics 2018; 18:651-665. [PMID: 28369170 DOI: 10.1093/biostatistics/kxx011] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 04/18/2016] [Accepted: 02/06/2017] [Indexed: 12/25/2022] Open
Abstract
This article proposes a procedure for describing the relationship between high-dimensional data sets, such as multimodal brain images and genetic data. We propose a supervised technique that incorporates the clinical outcome to determine a score, which is a linear combination of variables with hierarchical structures across multiple modalities. This approach is expected to yield interpretable and predictive scores. The proposed method was applied to a study of Alzheimer's disease (AD). We propose a diagnostic method for AD that uses whole-brain magnetic resonance imaging (MRI) and positron emission tomography (PET); we select effective brain regions for the diagnostic probability and investigate the genome-wide association with these regions using single nucleotide polymorphisms (SNPs). The two-step dimension reduction method, which we previously introduced, was considered applicable to such a study and allows us to partially incorporate the proposed method. We show that the proposed method offers feasible classification functions with reasonable prediction accuracy, based on receiver operating characteristic (ROC) analysis, and identifies reasonable regions of the brain and genome. Our simulation study based on a synthetic structured data set showed that the proposed method outperformed the original method and provided the characteristic for the supervised feature.
Affiliation(s)
- Atsushi Kawaguchi
- Center for Comprehensive Community Medicine, Faculty of Medicine, Saga University, 5-1-1 Nabeshima, Saga 849-8501, Japan
- Fumio Yamashita
- Division of Ultrahigh Field MRI, Institute for Biomedical Sciences, Iwate Medical University, Yahaba, Iwate 028-3694, Japan

29
Hu W, Lin D, Cao S, Liu J, Chen J, Calhoun VD, Wang YP. Adaptive Sparse Multiple Canonical Correlation Analysis With Application to Imaging (Epi)Genomics Study of Schizophrenia. IEEE Trans Biomed Eng 2018; 65:390-399. [PMID: 29364120 PMCID: PMC5826588 DOI: 10.1109/tbme.2017.2771483] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 01/05/2023]
Abstract
Finding correlations across multiple data sets in imaging and (epi)genomics is a common challenge. Sparse multiple canonical correlation analysis (SMCCA) is a multivariate model widely used to extract contributing features from each data set while maximizing the cross-modality correlation. The model is built by combining the pairwise covariances between any two data sets. However, the scales of different pairwise covariances can be quite different, and the direct combination of pairwise covariances in SMCCA is unfair. This problem of "unfair combination of pairwise covariances" restricts the power of SMCCA for feature selection. In this paper, we propose a novel formulation of SMCCA, called adaptive SMCCA, to overcome the problem by introducing adaptive weights when combining pairwise covariances. Both simulation and real-data analysis show that adaptive SMCCA outperforms conventional SMCCA and SMCCA with fixed weights in terms of feature selection. Large-scale numerical experiments show that adaptive SMCCA converges as fast as conventional SMCCA. When applying it to an imaging (epi)genetics study of schizophrenia subjects, we can detect significant (epi)genetic variants and brain regions, which are consistent with other existing reports. In addition, several significant brain-development related pathways, e.g., neural tube development, are detected by our model, demonstrating that imaging epigenetic associations may be overlooked by conventional SMCCA. All these results demonstrate that adaptive SMCCA is well suited for detecting three-way or multiway correlations and thus can find widespread applications in multiple omics and imaging data integration.
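For orientation, the "combination of pairwise covariances" mentioned in the abstract can be written in the commonly used sparse multiple CCA form shown below. The notation is ours and is an assumption, not taken from the paper: $X_i$ is the column-centered data matrix of the $i$-th data set, $u_i$ its weight vector, $c_i$ a sparsity bound, and $w_{ij}$ a pairwise weight. The abstract does not state how the adaptive weights are chosen, so $w_{ij}$ is left as a placeholder.

```latex
\max_{u_1,\dots,u_K}\ \sum_{1 \le i < j \le K} w_{ij}\, u_i^{\top} X_i^{\top} X_j\, u_j
\quad \text{subject to} \quad \|u_i\|_2 \le 1,\ \ \|u_i\|_1 \le c_i,\ \ i = 1,\dots,K .
```

Setting every $w_{ij} = 1$ gives the conventional unweighted combination, in which a pair of data sets with a large covariance scale can dominate the sum. This is the imbalance that the weights are meant to correct.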
Affiliation(s)
- Wenxing Hu: Biomedical Engineering Department, Tulane University, New Orleans, LA 70118, USA
- Dongdong Lin: Mind Research Network and Dept. of ECE, University of New Mexico, Albuquerque, NM, 87106
- Shaolong Cao: Department of Bioinformatics & Computational Biology, UT MD Anderson Cancer Center, Houston, TX
- Jingyu Liu: Mind Research Network and Dept. of ECE, University of New Mexico, Albuquerque, NM, 87106
- Jiayu Chen: Mind Research Network and Dept. of ECE, University of New Mexico, Albuquerque, NM, 87106
- Vince D. Calhoun: Mind Research Network and Dept. of ECE, University of New Mexico, Albuquerque, NM, 87106
- Yu-Ping Wang: Biomedical Engineering Department, Tulane University, New Orleans, LA 70118, USA
|
30
|
Chai H, Shi X, Zhang Q, Zhao Q, Huang Y, Ma S. Analysis of cancer gene expression data with an assisted robust marker identification approach. Genet Epidemiol 2017; 41:779-789. [PMID: 28913902 DOI: 10.1002/gepi.22066] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 11/11/2016] [Revised: 04/10/2017] [Accepted: 07/10/2017] [Indexed: 12/22/2022]
Abstract
Gene expression (GE) studies have been playing a critical role in cancer research. Despite tremendous effort, the analysis results are still often unsatisfactory because of weak signals and high data dimensionality. Analysis is often further challenged by the long-tailed distributions of the outcome variables. In recent multidimensional studies, data have been collected on GEs as well as their regulators (e.g., copy number alterations (CNAs), methylation, and microRNAs), which can provide additional information on the associations between GEs and cancer outcomes. In this study, we develop an ARMI (assisted robust marker identification) approach for analyzing cancer studies with measurements on GEs as well as regulators. The proposed approach borrows information from regulators and can be more effective than analyzing GE data alone. A robust objective function is adopted to accommodate long-tailed distributions. Marker identification is effectively realized using penalization. The proposed approach has an intuitive formulation and is computationally affordable. Simulation shows its satisfactory performance under a variety of settings. TCGA (The Cancer Genome Atlas) data on melanoma and lung cancer are analyzed, which leads to biologically plausible marker identification and superior prediction.
Affiliation(s)
- Hao Chai: Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- Xingjie Shi: Department of Statistics, Nanjing University of Finance and Economics, Nanjing Shi, Jiangsu Sheng, China
- Qingzhao Zhang: School of Economics, Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen Shi, Fujian Sheng, China
- Qing Zhao: Merck Research Laboratories, Rahway, New Jersey, United States of America
- Yuan Huang: Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
- Shuangge Ma: Department of Biostatistics, Yale University, New Haven, Connecticut, United States of America
|
31
|
IPF-LASSO: Integrative L1-Penalized Regression with Penalty Factors for Prediction Based on Multi-Omics Data. Computational and Mathematical Methods in Medicine 2017; 2017:7691937. [PMID: 28546826 PMCID: PMC5435977 DOI: 10.1155/2017/7691937] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 01/20/2017] [Accepted: 03/14/2017] [Indexed: 11/29/2022]
Abstract
As modern biotechnologies advance, it has become increasingly frequent that different modalities of high-dimensional molecular data (termed “omics” data in this paper), such as gene expression, methylation, and copy number, are collected from the same patient cohort to predict the clinical outcome. While prediction based on omics data has been widely studied in the last fifteen years, little has been done in the statistical literature on the integration of multiple omics modalities to select a subset of variables for prediction, which is a critical task in personalized medicine. In this paper, we propose a simple penalized regression method to address this problem by assigning different penalty factors to different data modalities for feature selection and prediction. The penalty factors can be chosen in a fully data-driven fashion by cross-validation or by taking practical considerations into account. In simulation studies, we compare the prediction performance of our approach, called IPF-LASSO (Integrative LASSO with Penalty Factors) and implemented in the R package ipflasso, with the standard LASSO and sparse group LASSO. The use of IPF-LASSO is also illustrated through applications to two real-life cancer datasets. All data and codes are available on the companion website to ensure reproducibility.
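The core idea in this abstract, an L1 penalty whose strength differs by data modality, can be sketched in a few lines. The following is a minimal illustration in plain Python, not the ipflasso implementation: the function names (`soft_threshold`, `weighted_lasso`, `ipf_lasso`), the `modality` index list, and the objective scaling (1/(2n) times the squared error, no intercept) are all assumptions made here for the example. Each feature j receives the penalty `lam * penalty_factors[modality[j]]`, and the model is fitted by cyclic coordinate descent.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, penalties, n_iter=200):
    """Cyclic coordinate descent for
    (1/(2n)) * ||y - X b||^2 + sum_j penalties[j] * |b_j|  (no intercept).
    X is a list of rows, y a list of responses."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of feature j with the partial residual (feature j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, penalties[j]) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def ipf_lasso(X, y, modality, lam, penalty_factors, n_iter=200):
    """Lasso with modality-specific penalties (illustrative sketch).
    modality[j] is the index of the modality that feature j belongs to."""
    penalties = [lam * penalty_factors[m] for m in modality]
    return weighted_lasso(X, y, penalties, n_iter)


if __name__ == "__main__":
    # Toy data: feature 0 belongs to modality 0, feature 1 to modality 1.
    X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    y = [2.0, 1.0, -2.0, -1.0]
    print(ipf_lasso(X, y, [0, 1], 0.25, [1.0, 1.0]))  # equal penalties
    print(ipf_lasso(X, y, [0, 1], 0.25, [1.0, 2.0]))  # modality 1 penalized twice as hard
```

In this toy example, doubling the penalty factor of the second modality removes its feature from the model while leaving the first modality's coefficient unchanged, which is the kind of modality-level control over feature selection that the abstract describes. In practice the factors would be tuned, for example by cross-validation as the abstract suggests.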
|
32
|
Zille P, Calhoun VD, Wang YP. Enforcing co-expression in multimodal regression framework. Pacific Symposium on Biocomputing 2017; 22:105-116. [PMID: 27896966 PMCID: PMC5415360 DOI: 10.1142/9789813207813_0011] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
We consider the problem of multimodal data integration for the study of complex neurological diseases (e.g. schizophrenia). Among the challenges arising in such situations, estimating the link between genetic and neurological variability within a population sample has been a promising direction. A wide variety of statistical models arose from such applications. For example, Lasso regression and its multitask extension are often used to fit a multivariate linear relationship between given phenotype(s) and associated observations. Other approaches, such as canonical correlation analysis (CCA), are widely used to extract relationships between sets of variables from different modalities. In this paper, we propose an exploratory multivariate method combining these two methods. More specifically, we rely on a 'CCA-type' formulation in order to regularize the classical multimodal Lasso regression problem. The underlying motivation is to extract discriminative variables that are also co-expressed across modalities. We first evaluate the method on a simulated dataset, and further validate it using Single Nucleotide Polymorphisms (SNP) and functional Magnetic Resonance Imaging (fMRI) data for the study of schizophrenia.
Affiliation(s)
- Pascal Zille: Biomedical Engineering Department, Tulane University, USA
|
33
|
An X, Hu J, Do KA. SIFORM: shared informative factor models for integration of multi-platform bioinformatic data. Bioinformatics 2016; 32:3279-3290. [PMID: 27381342 DOI: 10.1093/bioinformatics/btw295] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/09/2015] [Accepted: 04/28/2016] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION: High-dimensional omic data derived from different technological platforms have been extensively used to facilitate comprehensive understanding of disease mechanisms and to determine personalized health treatments. Numerous studies have integrated multi-platform omic data; however, few have efficiently and simultaneously addressed the problems that arise from high dimensionality and complex correlations.
RESULTS: We propose a statistical framework of shared informative factor models that can jointly analyze multi-platform omic data and explore their associations with a disease phenotype. The common disease-associated sample characteristics across different data types can be captured through the shared structure space, while the corresponding weights of genetic variables directly index the strengths of their association with the phenotype. Extensive simulation studies demonstrate the performance of the proposed method in terms of biomarker detection accuracy via comparisons with three popular regularized regression methods. We also apply the proposed method to The Cancer Genome Atlas lung adenocarcinoma dataset to jointly explore associations of mRNA expression and protein expression with smoking status. Many of the identified biomarkers belong to key pathways for lung tumorigenesis, some of which are known to show differential expression across smoking levels. We discover potential biomarkers that reveal different mechanisms of lung tumorigenesis between light smokers and heavy smokers.
AVAILABILITY AND IMPLEMENTATION: R code to implement the new method can be downloaded from http://odin.mdacc.tmc.edu/jhhu/
CONTACT: jhu@mdanderson.org
Affiliation(s)
- Xuebei An: Department of Biostatistics, the University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Jianhua Hu: Department of Biostatistics, the University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Kim-Anh Do: Department of Biostatistics, the University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
|
34
|
Zhu R, Zhao Q, Zhao H, Ma S. Integrating multidimensional omics data for cancer outcome. Biostatistics 2016; 17:605-18. [PMID: 26980320 PMCID: PMC5031941 DOI: 10.1093/biostatistics/kxw010] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 04/27/2015] [Accepted: 01/27/2016] [Indexed: 01/06/2023] Open
Abstract
In multidimensional cancer omics studies, one subject is profiled on multiple layers of omics activities. In this article, the goal is to integrate multiple types of omics measurements, identify markers, and build a model for cancer outcome. The proposed analysis is achieved in two steps. In the first step, we analyze the regulation among different types of omics measurements through the construction of linear regulatory modules (LRMs). The LRMs have a sound biological basis, and their construction differs from existing analyses by modeling the regulation of sets of gene expressions (GEs) by sets of regulators. The construction is realized with the assistance of regularized singular value decomposition. In the second step, the proposed cancer outcome model includes the regulated GEs, "residuals" of GEs, and "residuals" of regulators, and we use regularized estimation to select relevant markers. Simulation shows that the proposed method outperforms the alternatives with more accurate marker identification. We analyze The Cancer Genome Atlas data on cutaneous melanoma and lung adenocarcinoma and obtain meaningful results.
Affiliation(s)
- Ruoqing Zhu: Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, USA
- Qing Zhao: Department of Biostatistics, Yale University, New Haven, CT, USA
- Hongyu Zhao: Department of Biostatistics, Yale University, New Haven, CT, USA
- Shuangge Ma: Department of Biostatistics, Yale University, New Haven, CT, USA
|
35
|
Luo C, Liu J, Dey DK, Chen K. Canonical variate regression. Biostatistics 2016; 17:468-83. [PMID: 26861909 DOI: 10.1093/biostatistics/kxw001] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 04/08/2015] [Accepted: 01/01/2016] [Indexed: 11/13/2022] Open
Abstract
In many fields, multi-view datasets, measuring multiple distinct but interrelated sets of characteristics on the same set of subjects, together with data on certain outcomes or phenotypes, are routinely collected. The objective in such a problem is often two-fold: both to explore the association structures of multiple sets of measurements and to develop a parsimonious model for predicting the future outcomes. We study a unified canonical variate regression framework to tackle the two problems simultaneously. The proposed criterion integrates multiple canonical correlation analysis with predictive modeling, balancing between the association strength of the canonical variates and their joint predictive power on the outcomes. Moreover, the proposed criterion seeks multiple sets of canonical variates simultaneously to enable the examination of their joint effects on the outcomes, and is able to handle multivariate and non-Gaussian outcomes. An efficient algorithm based on variable splitting and Lagrangian multipliers is proposed. Simulation studies show the superior performance of the proposed approach. We demonstrate the effectiveness of the proposed approach in an [Formula: see text] intercross mice study and an alcohol dependence study.
Affiliation(s)
- Chongliang Luo: Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
- Jin Liu: Centre for Quantitative Medicine, Duke-NUS Graduate Medical School, Singapore 169856, Singapore
- Dipak K Dey: Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
- Kun Chen: Department of Statistics, University of Connecticut, Storrs, CT 06269, USA
|